Unbiased learning to rank (ULTR) with biased user behavior data has received
considerable attention in the IR community. However, how to properly evaluate and compare
different ULTR approaches has not been systematically investigated, and there is no shared task
or benchmark specifically developed for ULTR. Therefore, we propose the Unbiased
Learning to Rank Evaluation (ULTRE) task as a pilot task in NTCIR-16. In ULTRE, we design a
user-simulation based evaluation protocol and implement an online benchmarking
service for the training and evaluation of both offline and online ULTR models. We will also
investigate questions of ULTR evaluation, particularly whether and how different user simulation
models affect the evaluation results.
For any interested participants, please register through the NTCIR-16 registration page.
We will follow the widely adopted simulation-based evaluation method. Based on a test
collection with manual relevance labels, we generate synthetic user clicks according to a user
simulation model (i.e. a click model) to train the ULTR model and use manual relevance labels
to evaluate the ranking performance. However, compared with previous studies that only use a
single click model, we plan to use the following user simulation models in this task:
● Position-Based Model (PBM): a click model that assumes the click probability of a search result depends only on its relevance and its ranking position.
● Dependent Click Model (DCM): a click model based on the cascade assumption that the user sequentially examines the result list and clicks attractive results until she feels satisfied with a clicked result.
● User Browsing Model (UBM): a click model that assumes the examination probability of a search result depends on its ranking position and its distance to the last clicked result.
● Mobile Click Model (MCM): a click model that considers the click necessity bias (i.e., some vertical results can satisfy users’ information needs without a click) in user clicks. We use this model because it achieved the best click prediction performance on Sogou’s click log among a variety of click models.
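As an illustration of how such a simulator works, the PBM assumption can be sketched in a few lines of Python. The examination probabilities and the label-to-attractiveness mapping below are illustrative assumptions, not the parameters of the official ULTRE simulator.

```python
import random

def simulate_pbm_clicks(relevance_labels, exam_probs, max_label=4, epsilon=0.1, seed=None):
    """Simulate clicks for one ranked list under the Position-Based Model:
    P(click at rank i) = P(examined | rank i) * P(attractive | relevance).
    """
    rng = random.Random(seed)
    clicks = []
    for rank, rel in enumerate(relevance_labels):
        examined = rng.random() < exam_probs[rank]
        # Map a graded label to an attractiveness probability in [epsilon, 1]
        # via an exponential gain -- an illustrative choice only.
        attractive_prob = epsilon + (1 - epsilon) * (2 ** rel - 1) / (2 ** max_label - 1)
        attractive = rng.random() < attractive_prob
        clicks.append(1 if (examined and attractive) else 0)
    return clicks

# Examination probability decays with rank; 1/(rank+1) is a common stand-in.
exam_probs = [1 / (i + 1) for i in range(10)]
clicks = simulate_pbm_clicks([4, 0, 2, 1, 0, 3, 0, 0, 1, 0], exam_probs, seed=7)
```

The other click models differ only in how the examination probability is computed (e.g. conditioned on previous clicks in DCM and UBM), so they can reuse the same skeleton.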
Evaluation protocol for offline ULTR models
To evaluate the offline ULTR algorithms, we will follow the protocol used by previous research. We will first generate simulated click logs by running the user simulation model on a “production” ranker for all the training queries. The participants can then use the simulated click logs to train an ULTR model in an offline manner and use it to rank the results for the test queries. The ranking performance in the offline training setting will be measured by the evaluation metrics computed on the rankings of the test queries.
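The three steps of this protocol can be sketched as follows. All function names and the toy click/train/rank logic are purely illustrative stand-ins, not part of the ULTRE pipeline.

```python
def offline_protocol(train_lists, click_simulator, train_fn, rank_fn, test_lists):
    """Sketch of the offline protocol: clicks are logged once from fixed
    "production" rankings, and the learner only ever sees those logged clicks."""
    # Step 1: generate simulated click logs for every training list (organizers).
    click_log = {qid: click_simulator(docs) for qid, docs in train_lists.items()}
    # Step 2: participants train an ULTR model offline from the logged clicks.
    model = train_fn(train_lists, click_log)
    # Step 3: the trained model re-ranks the test queries for evaluation.
    return {qid: rank_fn(model, docs) for qid, docs in test_lists.items()}

def toy_click_simulator(docs):
    # Toy stand-in: the simulated user clicks only the top-ranked document.
    return [1] + [0] * (len(docs) - 1)

def toy_train(train_lists, click_log):
    # Toy "model": score each document by its total click count.
    scores = {}
    for qid, docs in train_lists.items():
        for doc, click in zip(docs, click_log[qid]):
            scores[doc] = scores.get(doc, 0) + click
    return scores

def toy_rank(model, docs):
    return sorted(docs, key=lambda d: -model.get(d, 0))

runs = offline_protocol(
    {"q1": ["d1", "d2"], "q2": ["d2", "d3"]},  # training lists
    toy_click_simulator, toy_train, toy_rank,
    {"q3": ["d3", "d2", "d1"]},                # test lists
)
# runs == {"q3": ["d2", "d1", "d3"]}
```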
Evaluation protocol for online ULTR models
To evaluate the online ULTR algorithms, we will implement an online service to simulate the online training process. The participant can upload the ranking lists for all the training and test queries and specify how many user sessions she wants to receive for this set of ranking lists. Then the online service will simulate the required number of incoming queries according to the empirical query frequency distribution and synthesize the clicks for the sampled queries, which can be used to train the ranker. This process can be repeated until a quota (e.g. 10k simulated sessions) is reached. This whole online training process mimics a more realistic scenario where the practitioners can deploy a ranking model, collect a certain amount of user feedback online during a period, and then use the feedback data to update the ranking model. The ranking performance in the online setting will also be measured on the test queries. As the participants may update the rankings for test queries during the training process, we can plot the changes in ranking performance.
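One feedback round of the online service can be sketched as below. All names here are illustrative; the actual service exposes an upload/download interface rather than a Python API.

```python
import random

def simulate_online_sessions(ranking_lists, query_freq, click_simulator,
                             n_sessions, seed=None):
    """Sketch of one feedback round of the online service: queries are sampled
    from the empirical frequency distribution, and clicks are synthesized for
    the currently uploaded ranking lists."""
    rng = random.Random(seed)
    qids = list(query_freq)
    weights = [query_freq[q] for q in qids]
    sessions = []
    for _ in range(n_sessions):
        qid = rng.choices(qids, weights=weights, k=1)[0]
        sessions.append((qid, click_simulator(ranking_lists[qid])))
    return sessions

# A participant alternates: upload rankings -> receive a batch of sessions ->
# update the ranker -> upload again, until the quota (e.g. 10k sessions) is spent.
sessions = simulate_online_sessions(
    {"q1": ["d1", "d2"], "q2": ["d3", "d4"]},
    {"q1": 3, "q2": 1},                         # q1 is three times as frequent
    lambda docs: [1] + [0] * (len(docs) - 1),   # trivial click simulator
    n_sessions=100, seed=0)
```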
We plan to use nDCG@5, computed from the 5-level human relevance labels (0-4), as the evaluation metric.
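For reference, nDCG@5 with the exponential gain commonly used for graded labels can be computed as below. The page does not specify whether the official evaluation uses exponential or linear gain, so treat this as one plausible instantiation.

```python
import math

def dcg_at_k(labels, k=5):
    # Exponential gain with log2 discount: (2^rel - 1) / log2(rank + 1), rank from 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, k=5):
    # Normalize by the DCG of the ideal (descending-label) ordering.
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([4, 3, 2, 1, 0])    # ideal ordering -> 1.0
buried = ndcg_at_k([3, 2, 0, 1, 0, 4])  # the label-4 doc sits below rank 5
```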
For more information, please refer to the paper “ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior”, which details the evaluation framework and evaluation protocols of the NTCIR16-ULTRE task.
The dataset for the ULTRE task is constructed based on Sogou-SRR, a public dataset for relevance
estimation and ranking in Web search (http://www.thuir.cn/data-srr/).
Download ULTRE dataset
The ULTRE dataset contains 1,200 unique queries, each with 10 successfully crawled results: 840 queries for training, 60 for validation, and 300 for testing. In addition, a stratified sampling approach is used to ensure that the frequency distribution of queries in the training set is consistent with that in the real logs. The table below shows the details of the dataset.
|Label||clicked (1) or not (0)||5-level relevance annotations (0-4) (will not be released)|
The files and directories contained in ULTRE dataset are shown below.
|File or Directory||Data Description|
|train/||The directory of the training data.|
|train/train.feature||The file of the ranking features for all training query-document pairs.|
|train/train.init_list||The file of ranked lists used to generate simulated clicks.|
|train/pbm/train.labels||The file of simulated clicks generated by Position-Based Model for each list in train.init_list.|
|train/dcm/train.labels||The file of simulated clicks generated by Dependent Click Model for each list in train.init_list.|
|train/ubm/train.labels||The file of simulated clicks generated by User Browsing Model for each list in train.init_list.|
|train/mcm/train.labels||The file of simulated clicks generated by Mobile Click Model for each list in train.init_list.|
|train/fusion/train.labels||The file of simulated clicks generated by the fusion of four click models (PBM, DCM, UBM, MCM) for each list in train.init_list. To be more specific, each click model deals with approximately 25% of the sessions in the initial list file.|
|train/LandingPage||The directory containing all HTML files for the search results of each training query.|
|train/query_dict||The file that maps the ids of training queries to their original Chinese text. You can use this mapping file to find the corresponding HTML files given a specific query_id in the init_list file.|
|valid/||The directory of the validation data.|
|valid/valid.feature||The file of the ranking features for all validation query-document pairs.|
|valid/valid.init_list||The file of candidate document lists for each validation query (ordered by doc_id).|
|valid/valid.labels||The file of human-annotated relevance labels for each document in valid.init_list.|
|valid/LandingPage||The directory containing all HTML files for the search results of each validation query.|
|valid/query_dict||The file that maps the ids of validation queries to their original Chinese text.|
|test/||The directory of the test data.|
|test/test.feature||The file of the ranking features for all test query-document pairs.|
|test/test.init_list||The file of candidate document lists for each test query (ordered by doc_id).|
|test/LandingPage||The directory containing all HTML files for the search results of each test query.|
|test/query_dict||The file that maps the ids of test queries to their original Chinese text.|
For details about the data formats of each file, please see the description file in the ULTRE dataset.
Offline unbiased learning-to-rank task
1) Each team should submit five runs for each model at one time, each trained on a different kind of synthetic click data (corresponding to the five label files in the train set).
2) The submission file should be named using the following components:
[TEAMNAME] helps the organizers to identify the specific participants.
[MODELNAME] means the name of the submitted model. (One team can develop several models.)
[PBM/DCM/UBM/MCM/FUSION] indicates which kind of synthetic click data in the train set was used
when training the submitted model.
e.g., when team RUCIR submits runs for model DLA, the submitted files should contain:
3) Run file format: The first line of the run file should be a brief description of the submitted
model. The other lines in the file should be of the form (the same format as the file test.init_list):
[query_id]:[doc_id_for_the_1st_doc] [doc_id_for_the_2nd_doc] ... [doc_id_for_the_last_doc]
Note that the run files should contain the ranked list for all test queries.
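A minimal helper for emitting a run file in this format might look like the following; the filename, description, and ids are placeholders for illustration only.

```python
def write_run_file(path, description, rankings):
    """Write a run file: the first line is the model description, then one
    line per query of the form "query_id:doc1 doc2 ..." (best doc first)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(description + "\n")
        for qid, docs in rankings.items():
            f.write(f"{qid}:{' '.join(docs)}\n")

# Placeholder filename and ids; real runs must cover all test queries.
write_run_file("example_run.txt", "DLA trained on PBM clicks",
               {"1001": ["d3", "d1", "d2"], "1002": ["d5", "d4"]})
```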
Online unbiased learning-to-rank task
Jiaxin Mao, firstname.lastname@example.org, Renmin University of China, China
Qingyao Ai, email@example.com, University of Utah, USA
Junqi Zhang, firstname.lastname@example.org, Tsinghua University, China
Tao Yang, email@example.com, University of Utah, USA
Yurou Zhao, firstname.lastname@example.org, Renmin University of China, China
Yiqun Liu, email@example.com, Tsinghua University, China
|July 15, 2021:||Dataset and simulated click logs release|
|August 15, 2021:||Registration due|
|Sep 1, 2021 - Dec 31, 2021:||Formal Run/Online evaluation|
|Feb 1, 2022:||Final evaluation result release|
|Feb 1, 2022:||Draft of task overview paper release|
|Mar 1, 2022:||Participant paper submission due|
|May 1, 2022:||Camera-ready paper submissions due|
|Jun 2022:||NTCIR-16 Conference & EVIA 2022 in NII, Tokyo, Japan|