Unbiased Learning to Rank Evaluation (ULTRE)

In the NTCIR-16 pilot task ULTRE, we provide a shared benchmark and evaluation service for unbiased learning to rank (ULTR) models.

Introduction

Unbiased learning to rank (ULTR) with biased user behavior data has received considerable attention in the IR community. However, how to properly evaluate and compare different ULTR approaches has not been systematically investigated, and there is no shared task or benchmark specifically developed for ULTR. Therefore, we propose the Unbiased Learning to Rank Evaluation (ULTRE) task as a pilot task in NTCIR-16. In ULTRE, we design a user-simulation-based evaluation protocol and implement an online benchmarking service for the training and evaluation of both offline and online ULTR models. We will also investigate open questions in ULTR evaluation, particularly whether and how different user simulation models affect the evaluation results.

Interested participants, please register through the NTCIR-16 registration page.

Methodology

User Simulation

We will follow the widely adopted simulation-based evaluation method. Based on a test collection with manual relevance labels, we generate synthetic user clicks according to a user simulation model (i.e., a click model) to train the ULTR models, and use the manual relevance labels to evaluate their ranking performance. In contrast to previous studies that use only a single click model, we plan to use the following user simulation models in this task:
●Position-Based Model (PBM): a click model that assumes the click probability of a search result depends only on its relevance and its ranking position (a minimal click-simulation sketch for this model is given after this list).
●Dependent Click Model (DCM): a click model based on the cascade assumption that the user sequentially examines the result list and clicks attractive results until she is satisfied with a clicked result.
●User Browsing Model (UBM): a click model that assumes the examination probability of a search result depends on its ranking position and its distance to the last clicked result.
●Mobile Click Model (MCM): a click model that considers the click necessity bias (i.e., some vertical results can satisfy users’ information needs without a click). We use this model because it achieved the best click prediction performance on Sogou’s click log among a variety of click models.
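The concrete simulation parameters (examination probabilities, the mapping from relevance labels to click probabilities, etc.) are set by the organizers and are not reproduced here. As a hedged illustration only, the following Python sketch shows how a PBM-style simulator could turn a ranked list of graded labels into one session of synthetic clicks; the examination decay and the label-to-attractiveness mapping below are assumptions, not the task's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed PBM parameters (illustrative only; the task's actual settings may differ):
# examination probability decays with rank, attractiveness grows with the graded label.
EXAM_PROB = np.array([1.0 / (r + 1) for r in range(10)])   # P(examine | rank r)

def attractiveness(label, epsilon=0.1, max_label=4):
    """Map a graded relevance label (0-4) to a click probability given examination."""
    return epsilon + (1 - epsilon) * (2 ** label - 1) / (2 ** max_label - 1)

def simulate_pbm_clicks(labels):
    """Simulate one session of PBM clicks for a ranked list of graded labels."""
    labels = np.asarray(labels)
    examined = rng.random(len(labels)) < EXAM_PROB[: len(labels)]
    attracted = rng.random(len(labels)) < attractiveness(labels)
    return (examined & attracted).astype(int)   # click = examined AND attracted

# Example: a 10-result list with labels in the 0-4 range.
print(simulate_pbm_clicks([4, 2, 0, 3, 1, 0, 0, 2, 1, 0]))
```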

Evaluation Protocol

Evaluation protocol for offline ULTR models

To evaluate the offline ULTR algorithms, we will follow the protocol used in previous research. We will first generate simulated click logs by running the user simulation models on a “production” ranker for all the training queries. Participants can then use the simulated click logs to train their ULTR models in an offline manner and use them to rank the results for the test queries. The ranking performance in the offline training setting will be measured by the evaluation metrics computed on the rankings of the test queries.
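The task does not prescribe any particular offline ULTR algorithm. As one hedged illustration of how a participant might consume the simulated logs, the sketch below trains a linear scorer on toy data with inverse propensity scoring (IPS); the feature dimensionality, the propensity estimates, and the pointwise objective are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real data: 840 training lists of 10 results each.
# (The feature dimensionality and propensity values are illustrative assumptions.)
n_lists, list_len, n_feat = 840, 10, 136
X = rng.normal(size=(n_lists, list_len, n_feat))          # ranking features
clicks = rng.integers(0, 2, size=(n_lists, list_len))     # simulated clicks (train.labels)
propensity = 1.0 / (np.arange(list_len) + 1.0)            # assumed position-bias estimate

# IPS-weighted pointwise training of a linear scoring function w.
w = np.zeros(n_feat)
lr = 0.01
for epoch in range(5):
    for i in range(n_lists):
        scores = X[i] @ w
        pred = 1.0 / (1.0 + np.exp(-scores))               # sigmoid click prediction
        weights = clicks[i] / propensity                    # IPS weights (0 for non-clicks)
        grad = X[i].T @ ((pred - 1.0) * weights)            # gradient of IPS-weighted log-loss
        w -= lr * grad / list_len

# Ranking the candidates of a test query: sort its feature rows by descending score.
test_features = rng.normal(size=(list_len, n_feat))
ranking = np.argsort(-(test_features @ w))
print(ranking)
```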

Evaluation protocol for online ULTR models

To evaluate the online ULTR algorithms, we will implement an online service to simulate the online training process. A participant can upload the ranking lists for all the training and test queries and specify how many user sessions she wants to receive for this set of ranking lists. The online service will then simulate the required number of incoming queries according to the empirical query frequency distribution and synthesize clicks for the sampled queries, which can be used to train the ranker. This process can be repeated until a quota (e.g., 10k simulated sessions) is reached. The whole online training process mimics a realistic scenario in which practitioners deploy a ranking model, collect a certain amount of user feedback online over a period of time, and then use the feedback data to update the ranking model. The ranking performance in the online setting will also be measured on the test queries. Since participants may update the rankings of the test queries during the training process, we can plot how the ranking performance changes over the course of online training.
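The interface of the online service is still to be announced (see "Online Service" below), so the sketch below only illustrates the interaction loop described above, with a hypothetical local stub standing in for the real service: upload rankings, request a batch of simulated sessions, receive clicks, update the ranker, and repeat until the session quota is exhausted.

```python
import numpy as np

rng = np.random.default_rng(0)

QUOTA = 10_000          # total simulated sessions (example quota from the text)
BATCH = 500             # sessions requested per round (an arbitrary choice)

def request_sessions(rankings, n_sessions):
    """Hypothetical stand-in for the online service: returns (query_id, clicks) pairs.

    The real service will sample queries by their empirical frequency and
    synthesize clicks with its own click models; here we return random
    feedback just so the loop is runnable.
    """
    query_ids = rng.integers(0, len(rankings), size=n_sessions)
    clicks = rng.integers(0, 2, size=(n_sessions, 10))
    return list(zip(query_ids, clicks))

def update_ranker(rankings, feedback):
    """Placeholder update: a real participant would retrain or adjust the model here."""
    return rankings

# Start from some initial rankings for all training and test queries.
rankings = {qid: list(range(10)) for qid in range(1200)}

used = 0
while used < QUOTA:
    batch = min(BATCH, QUOTA - used)
    feedback = request_sessions(rankings, batch)   # upload rankings, receive simulated sessions
    rankings = update_ranker(rankings, feedback)   # use the feedback to update the ranking model
    used += batch

print(f"consumed {used} simulated sessions")
```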

Evaluation Metric

We plan to use nDCG@5 based on 4-level human relevance labels.
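The official metric implementation will be released by the organizers; as a reference only, the sketch below computes nDCG@5 with the common exponential-gain, log2-discount formulation, assuming graded labels in the 0-4 range.

```python
import numpy as np

def dcg_at_k(labels, k=5):
    """DCG@k with exponential gain: sum over ranks of (2^rel - 1) / log2(rank + 1)."""
    labels = np.asarray(labels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, labels.size + 2))
    return np.sum((2 ** labels - 1) / discounts)

def ndcg_at_k(ranked_labels, k=5):
    """nDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking."""
    idcg = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / idcg if idcg > 0 else 0.0

# Relevance labels of the results in the order produced by a ranker.
print(ndcg_at_k([3, 4, 0, 2, 1, 0, 0, 1, 2, 0], k=5))
```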

For more information, please refer to the paper “ULTRE framework: a framework for Unbiased Learning to Rank Evaluation based on simulation of user behavior”, which details the evaluation framework and evaluation protocols of the NTCIR-16 ULTRE task.

Data

The dataset for the ULTRE task is constructed from Sogou-SRR (http://www.thuir.cn/data-srr/), a public dataset for relevance estimation and ranking in Web search.
Download ULTRE dataset

Data statistics

The ULTRE dataset contains 1,200 unique queries, each with 10 successfully crawled results: 840 queries for training, 60 for validation, and 300 for testing. In addition, a stratified sampling approach is used to ensure that the query frequency distribution in the training set is consistent with that in the real logs. The table below shows the details of the dataset.

Data             Train                    Valid                                  Test
Unique queries   840                      60                                     300
Sessions         111,911                  60                                     300
Label            clicked (1) or not (0)   5-level relevance annotations (0-4)    5-level relevance annotations (0-4) (will not be released)

Data Instructions

The files and directories contained in the ULTRE dataset are shown below.

File or Directory Data Description
train/ The directory of the training data.
train/train.feature The file of ranking features for all training query-document pairs.
train/train.init_list The file of ranked lists used to generate simulated clicks.
train/pbm/train.labels The file of simulated clicks generated by the Position-Based Model for each list in train.init_list.
train/dcm/train.labels The file of simulated clicks generated by the Dependent Click Model for each list in train.init_list.
train/ubm/train.labels The file of simulated clicks generated by the User Browsing Model for each list in train.init_list.
train/mcm/train.labels The file of simulated clicks generated by the Mobile Click Model for each list in train.init_list.
train/fusion/train.labels The file of simulated clicks generated by a fusion of the four click models (PBM, DCM, UBM, MCM) for each list in train.init_list. Specifically, each click model generates the clicks for approximately 25% of the sessions in the initial list file.
train/LandingPage The directory containing all HTML files of the search results for each training query.
train/query_dict The file mapping training query ids to the original Chinese query text. You can use this mapping to find the corresponding HTML files for a given query_id in the init_list file.
valid/ The directory of the validation data.
valid/valid.feature The file of ranking features for all validation query-document pairs.
valid/valid.init_list The file of candidate document lists for each validation query (ordered by doc_id).
valid/valid.labels The file of human-annotated relevance labels for each document in valid.init_list.
valid/LandingPage The directory containing all HTML files of the search results for each validation query.
valid/query_dict The file mapping validation query ids to the original Chinese query text.
test/ The directory of the test data.
test/test.feature The file of ranking features for all test query-document pairs.
test/test.init_list The file of candidate document lists for each test query (ordered by doc_id).
test/LandingPage The directory containing all HTML files of the search results for each test query.
test/query_dict The file mapping test query ids to the original Chinese query text.

For details about the data format of each file, please see the description file included in the ULTRE dataset.
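The exact line formats are defined only in that description file, so the layout assumed in the following loader sketch (a whitespace-separated query_id followed by per-result values, in both the init_list and labels files) is purely illustrative and may not match the released files; the local path is likewise hypothetical.

```python
from pathlib import Path

# NOTE: the actual line formats are defined in the dataset's description file.
# This sketch ASSUMES a whitespace-separated layout purely for illustration:
#   train/train.init_list      : "<query_id> <doc_id> <doc_id> ..."
#   train/pbm/train.labels     : "<query_id> <click> <click> ..." aligned with init_list
DATA_DIR = Path("ULTRE_dataset")  # hypothetical local path to the unpacked dataset

def load_lists(path):
    """Read one id-plus-values file into a dict: {query_id: [values...]}."""
    lists = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                lists[parts[0]] = parts[1:]
    return lists

init_lists = load_lists(DATA_DIR / "train" / "train.init_list")
pbm_clicks = load_lists(DATA_DIR / "train" / "pbm" / "train.labels")
print(len(init_lists), "training lists loaded")
```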

Online Service

TBA

Organizers

Jiaxin Mao, ​maojiaxin@gmail.com​, Renmin University of China, China
Qingyao Ai, ​aiqy@cs.utah.edu​, University of Utah, USA
Junqi Zhang, ​zhangjq17@mails.tsinghua.edu.cn​, Tsinghua University, China
Tao Yang, ​taoyang@cs.utah.edu​, University of Utah, USA
Yurou Zhao, ​yurouzhao2021@gmail.com​, Renmin University of China, China
Yiqun Liu, ​yiqunliu@tsinghua.edu.cn​, Tsinghua University, China

Schedule

July 15, 2021: Dataset and simulated click logs release
August 15, 2021: Registration due
Sep 1, 2021 - Dec 31, 2021: Formal Run/Online evaluation
Feb 1, 2022: Final evaluation result release
Feb 1, 2022: Draft of task overview paper release
Mar 1, 2022: Participant paper submission due
May 1, 2022: Camera-ready paper submissions due
Jun 2022: NTCIR-16 Conference & EVIA 2022 in NII, Tokyo, Japan

Supported by