Unbiased Learning to Rank Evaluation (ULTRE)

In the NTCIR-16 pilot task ULTRE, we provide a shared benchmark and evaluation for ULTR (Unbiased Learning to Rank) models.

Introduction

Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to properly evaluate and compare different ULTR approaches has not been systematically investigated, and there is no shared task or benchmark specifically developed for ULTR. Therefore, we propose the Unbiased Learning to Rank Evaluation (ULTRE) task as a pilot task in NTCIR-16. In ULTRE, we design a user-simulation-based evaluation protocol and implement an online benchmarking service for the training and evaluation of both offline and online ULTR models. We will also investigate open questions in ULTR evaluation, particularly whether and how different user simulation models affect the evaluation results.

Interested participants are invited to register through the NTCIR-16 registration page.

Methodology

User Simulation

We will follow the widely adopted simulation-based evaluation method: based on a test collection with manual relevance labels, we generate synthetic user clicks with a user simulation model (i.e. a click model) to train ULTR models, and use the manual relevance labels to evaluate ranking performance. In contrast to previous studies that use only a single click model, we plan to use the following user simulation models in this task (a minimal simulation sketch for the first model is given after the list):
●Position-Based Model (PBM): a click model that assumes the click probability of a search result depends only on its relevance and its ranking position.
●Dependent Click Model (DCM): a click model based on the cascade assumption that the user sequentially examines the result list and clicks on attractive results until she is satisfied with a clicked result.
●User Browsing Model (UBM): a click model that assumes the examination probability of a search result depends on its ranking position and its distance to the last clicked result.
●Mobile Click Model (MCM): a click model that accounts for the click necessity bias (i.e. some vertical results can satisfy users' information needs without a click). We use this model because it achieved the best click prediction performance among a variety of click models on Sogou's click logs.
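
As a concrete illustration, the following snippet sketches how PBM-style clicks can be simulated from graded relevance labels. It is only a minimal sketch: the examination probabilities, the relevance-to-attractiveness mapping, and the noise parameter are illustrative assumptions, not the parameters used by the official ULTRE simulators.

    import numpy as np

    def simulate_pbm_session(relevance_labels, eta=1.0, epsilon=0.1, rng=None):
        """Simulate one session of clicks under a Position-Based Model (illustrative).

        relevance_labels: graded relevance labels (0-4) of a ranked list, top to bottom.
        eta: severity of the position bias (assumed value, not the official setting).
        epsilon: click noise on irrelevant results (assumed value).
        """
        rng = rng or np.random.default_rng()
        clicks = []
        for rank, rel in enumerate(relevance_labels, start=1):
            # PBM: the examination probability depends only on the rank of the result...
            p_exam = (1.0 / rank) ** eta
            # ...and a click requires the result to be both examined and attractive.
            p_attr = epsilon + (1.0 - epsilon) * (2 ** rel - 1) / (2 ** 4 - 1)
            clicks.append(int(rng.random() < p_exam * p_attr))
        return clicks

    # One simulated session for a 10-result list with labels in 0-4.
    print(simulate_pbm_session([4, 2, 0, 3, 1, 0, 0, 2, 0, 1]))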

Evaluation Protocol

Evaluation protocol for offline ULTR models

To evaluate offline ULTR algorithms, we will follow the protocol used in previous research. We will first generate simulated click logs by running the user simulation models on a “production” ranker for all training queries. Participants can then use the simulated click logs to train their ULTR models offline and use the trained models to rank the results for the test queries. Ranking performance in the offline training setting will be measured by the evaluation metrics computed on the rankings of the test queries.
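
As a point of reference only, one common family of offline ULTR methods reweights clicks by inverse propensity scores (IPS) to correct for position bias. The snippet below is a minimal sketch of such an IPS-weighted pointwise objective under an assumed, known examination propensity per rank; it is not a prescribed baseline of the task.

    import numpy as np

    def ips_weighted_loss(scores, clicks, propensities):
        """IPS-weighted pointwise (log) loss for one logged ranked list (illustrative).

        scores:       current model scores for the documents in the logged list.
        clicks:       0/1 (simulated) clicks for the same documents.
        propensities: assumed examination probabilities per rank.
        """
        scores = np.asarray(scores, dtype=float)
        clicks = np.asarray(clicks, dtype=float)
        # Clicked documents are up-weighted by 1 / propensity; non-clicked ones get weight 0.
        weights = clicks / np.asarray(propensities, dtype=float)
        probs = 1.0 / (1.0 + np.exp(-scores))
        return float(-np.sum(weights * np.log(probs + 1e-12)))

    # Example: a 5-document list with a single click at rank 3.
    print(ips_weighted_loss(scores=[2.1, 1.3, 0.4, -0.2, -1.0],
                            clicks=[0, 0, 1, 0, 0],
                            propensities=[1.0, 0.7, 0.5, 0.4, 0.3]))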

Evaluation protocol for online ULTR models

To evaluate online ULTR algorithms, we will implement an online service that simulates the online training process. A participant can upload ranking lists for all training and test queries and specify how many user sessions she wants to receive for this set of ranking lists. The online service then samples the required number of incoming queries according to the empirical query frequency distribution and synthesizes clicks for the sampled queries, which can be used to update the ranker. This process is repeated until the participant has received approximately the same number of sessions as in the ULTRE dataset used in the offline subtask (i.e. 111,911 simulated sessions). This online training process mimics a more realistic scenario in which practitioners deploy a ranking model, collect a certain amount of online user feedback over a period of time, and then use the feedback to update the model. Ranking performance in the online setting will also be measured on the test queries. Since participants may update the rankings of the test queries during the training process, we can plot how ranking performance changes over time.
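
To make this interaction concrete, the pseudo-loop below sketches how a participant's client code might iterate with the online service. The names submit_rankings, ranker.rank, and ranker.update are hypothetical placeholders; the real endpoints and payload formats are defined in the ULTRE Online Service API documentation linked below.

    # Hypothetical client-side loop for the online subtask (all names and signatures
    # are placeholders; see the official API documentation for the real interface).

    TOTAL_SESSIONS = 111911       # roughly the session budget of the offline subtask

    def online_training_loop(ranker, submit_rankings, sessions_per_round=1000):
        """Iterate with the (hypothetical) online service until the session budget is spent."""
        collected = 0
        while collected < TOTAL_SESSIONS:
            # 1. Rank all training and test queries with the current model.
            rankings = {qid: ranker.rank(qid) for qid in ranker.query_ids}
            # 2. Upload the rankings and request a batch of simulated sessions.
            click_log = submit_rankings(rankings, n_sessions=sessions_per_round)
            # 3. Update the ranker with the newly simulated clicks.
            ranker.update(click_log)
            collected += sessions_per_round
        return ranker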

Evaluation Metric

We plan to use nDCG@5 computed from 5-level human relevance labels.
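
For clarity, the snippet below computes nDCG@5 from 5-level relevance labels using the common exponential-gain, log-discount formulation. The exact gain function used in the official evaluation is specified by the organizers, so treat the gain definition here as an assumption.

    import numpy as np

    def dcg_at_k(rels, k=5):
        """DCG@k with exponential gains (2^rel - 1) and log2 position discounts."""
        rels = np.asarray(rels, dtype=float)[:k]
        discounts = np.log2(np.arange(2, rels.size + 2))
        return float(np.sum((2 ** rels - 1) / discounts))

    def ndcg_at_k(ranked_rels, k=5):
        """nDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking."""
        ideal_dcg = dcg_at_k(sorted(ranked_rels, reverse=True), k)
        return dcg_at_k(ranked_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Relevance labels (0-4) of the results in the order the model ranked them.
    print(ndcg_at_k([3, 4, 0, 2, 1, 0, 0, 4, 1, 0], k=5))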

For more information, please refer to the paper “ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior”, which details the evaluation framework and evaluation protocols of the NTCIR-16 ULTRE task.

Data

The ULTRE dataset is constructed based on Sogou-SRR, a public dataset for relevance estimation and ranking in Web search (http://www.thuir.cn/data-srr/).
Download the ULTRE dataset (updated 2021.10.23; please download the latest version of the dataset).

Data statistics

The ULTRE dataset contains 1,200 unique queries, each with 10 successfully crawled results: 840 queries for training, 60 for validation, and 300 for testing. In addition, a stratified sampling approach is used to ensure that the query frequency distribution in the training set is consistent with that in the real logs. The table below shows the details of the dataset.

Data           | Train                  | Valid                               | Test
Unique queries | 840                    | 60                                  | 300
Sessions       | 111,911                | 60                                  | 300
Label          | clicked (1) or not (0) | 5-level relevance annotations (0-4) | 5-level relevance annotations (0-4; will not be released)

Data Instructions

The files and directories contained in the ULTRE dataset are described below.

File or Directory Data Description
train/ The directory of the training data.
train/train.feature The file of the ranking features for all training query-document pairs.
train/train.init_list The file of ranked lists used to generate simulated clicks.
train/pbm/train.labels The file of simulated clicks generated by the Position-Based Model for each list in train.init_list.
train/dcm/train.labels The file of simulated clicks generated by the Dependent Click Model for each list in train.init_list.
train/ubm/train.labels The file of simulated clicks generated by the User Browsing Model for each list in train.init_list.
train/mcm/train.labels The file of simulated clicks generated by the Mobile Click Model for each list in train.init_list.
train/fusion/train.labels The file of simulated clicks generated by a fusion of the four click models (PBM, DCM, UBM, MCM) for each list in train.init_list. Specifically, each click model handles approximately 25% of the sessions in the initial list file.
train/LandingPage The directory containing all HTML files for the search results of each training query.
train/query_dict The file mapping the ids of training queries to their original Chinese text. You can use this mapping to find the corresponding HTML files given a specific query_id in the init_list file.
valid/ The directory of the validation data.
valid/valid.feature The file of the ranking features for all validation query-document pairs.
valid/valid.init_list The file of candidate document lists for each validation query (ordered by doc_id).
valid/valid.labels The file of human-annotated relevance labels for each document in valid.init_list.
valid/LandingPage The directory containing all HTML files for the search results of each validation query.
valid/query_dict The file mapping the ids of validation queries to their original Chinese text.
test/ The directory of the test data.
test/test.feature The file of the ranking features for all test query-document pairs.
test/test.init_list The file of candidate document lists for each test query (ordered by doc_id).
test/LandingPage The directory containing all HTML files for the search results of each test query.
test/query_dict The file mapping the ids of test queries to their original Chinese text.

For details about the data format of each file, please see the description file in the ULTRE dataset.

Online Service

The ULTRE Online Service generates simulated click logs based on the ranking lists submitted by participants in the ULTRE online subtask. Participants are expected to use the Online Service multiple times while optimizing their ranking models, until they have collected approximately the same number of training sessions as provided in the ULTRE dataset used in the offline subtask. Each time, participants obtain new training clicks and update their ranking models accordingly.

For more details about the ULTRE Online Service, please download the following folder and read the API documentation.

Download the ULTRE Online Service API documentation

Submission Form

Registered teams can upload their runs via the NTCIR-16 ULTRE Submission Form.

Please strictly follow the instructions in the Submission Format. Although you may submit runs for each model multiple times, only the latest submission will be accepted as the official submission. Each team may have up to five official runs per subtask (since each team can develop up to five models in each subtask). Note that submissions from non-registered teams, or submissions after the deadline, will be ignored.

Organizers

Jiaxin Mao, maojiaxin@gmail.com, Renmin University of China, China
Qingyao Ai, aiqy@cs.utah.edu, University of Utah, USA
Junqi Zhang, zhangjq17@mails.tsinghua.edu.cn, Tsinghua University, China
Tao Yang, taoyang@cs.utah.edu, University of Utah, USA
Yurou Zhao, yurouzhao2021@gmail.com, Renmin University of China, China
Yiqun Liu, yiqunliu@tsinghua.edu.cn, Tsinghua University, China

Schedule

July 15, 2021: Dataset and simulated click logs release
August 15, 2021: Registration due
September 1 - December 31, 2021: Formal Run / Online evaluation
Feb 1, 2022: Final evaluation result release
Feb 1, 2022: Draft of task overview paper release
Mar 1, 2022: Participant paper submission due
May 1, 2022: Camera-ready paper submissions due
June 2022: NTCIR-16 Conference & EVIA 2022 at NII, Tokyo, Japan

Supported by