Large-scale Validation of Counterfactual Learning Methods: A Test-Bed
Date: December 1, 2016
We provide a public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF).
The data comes from traffic logged by Criteo, a leader in the display advertising space.
This dataset is hosted on Amazon AWS and is available to the public at
https://s3-eu-west-1.amazonaws.com/reco-dataset/CriteoBannerFillingChallenge.tar.gz
A small sample of this data (~400 impressions, ~1MB) is available here.
The dataset has over 100 million display ad impressions and is 35GB gzipped / 250GB raw. We hope this dataset will serve as a large-scale standardized test-bed for the evaluation of counterfactual learning methods. If you use the dataset for your research, please cite [1] and drop a note about your research to us as well as to the team at Criteo.
Consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase.
Each ad has one of many banner types, which differ in the number of products they contain and in their layout.
The task is to choose the products to display in the ad knowing the banner type, user context, and candidate ads,
in order to maximize the number of clicks.
The format of this data is:
example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} ...
${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
...
${wasProductMClicked} exid:${exID} ${productFeatM_1}:${vM_1} ...
Each impression is represented by ${M+1} lines, where ${M} is the number of candidate ads and the first line is a header containing summary information. The ${nbSlots} slots in a banner are labeled in order from left to right and from top to bottom. The first ${nbSlots} candidates correspond to the displayed products, ordered by position.
The logging policy stochastically fills the banner by first computing non-negative scores for all candidates, and then sampling without replacement from the multinomial distribution defined by these scores (i.e. a Plackett-Luce ranking model; see the sketch below). The ${propensity} records the probability with which the displayed banner was sampled under this logging policy.
There are 35 features. Display features include the user context and banner type, which are constant for all candidates in an impression. Each unique quadruplet of feature IDs < 1, 2, 3, 5 > corresponds to a unique banner type. Features 1 and 2 are numerical, while all other features are categorical. Some categorical features are multi-valued (order does not matter).
Example IDs increase with time, allowing temporal slices for evaluation. Importantly, non-clicked examples were sub-sampled aggressively to reduce the dataset size, and only a random 10% sub-sample of non-clicked impressions is logged.
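To make the logging policy concrete, the following sketch computes the probability of a displayed ranking under a Plackett-Luce model given non-negative candidate scores. It is a minimal illustration of the sampling scheme, not code shipped with the dataset; the function name and inputs are hypothetical.
import numpy as np

def plackett_luce_propensity(scores, displayed):
    # Probability of showing the candidates in `displayed` (indices, in slot
    # order) when sampling without replacement proportionally to `scores`.
    scores = np.asarray(scores, dtype=float)
    remaining = np.ones(len(scores), dtype=bool)
    prob = 1.0
    for idx in displayed:
        prob *= scores[idx] / scores[remaining].sum()
        remaining[idx] = False
    return prob

# Example: 4 candidates with scores [1, 2, 3, 4]; a 2-slot banner shows
# candidate 2 in slot 1 and candidate 0 in slot 2.
print(plackett_luce_propensity([1.0, 2.0, 3.0, 4.0], [2, 0]))  # 3/10 * 1/7 ~ 0.043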
Download all helper scripts here: BLBF-DisplayAds.zip
All evaluation scripts are written in Python3 and were developed on a Linux machine.
Learning algorithms use Vowpal Wabbit
and a Python3 implementation of POEM (included as a stand-alone in BLBF-DisplayAds.zip).
Any missing Python packages can be installed with
conda install [package]
vw is assumed to be executable from the current working directory. To achieve this on Linux, after compiling Vowpal Wabbit, simply run
sudo make install
Otherwise, simply replace every invocation of vw in the scripts below with the full path to the Vowpal Wabbit binary.
Uncompress the dataset with
tar -zxvf CriteoBannerFillingChallenge.tar.gz
Optionally, you can re-compress it using gzip, since all the scripts support reading/writing gzipped files through the zlib library:
gzip CriteoBannerFillingChallenge.txt
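As an aside, reading either the plain .txt or the re-compressed .txt.gz file transparently takes only a few lines of Python. This is a sketch of the idea under that assumption (the helper name is made up), not the exact code inside the provided scripts.
import gzip

def open_maybe_gzipped(path):
    # Open a log file whether or not it has been gzip-compressed.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')   # text mode over the decompressed stream
    return open(path, 'r')

with open_maybe_gzipped('CriteoBannerFillingChallenge.txt.gz') as f:
    header = next(f)                   # first line of the first impression
    print(header[:80])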
parser.py
: This reads the Criteo dataset in either .txt or .txt.gz format, and prints statistics (see Tables 1, 2, 3, 4 in the [paper])
and also creates a train-validate-test split of all 1-slot banner impressions in Vowpal Wabbit input format.
python parser.py [criteo_data_file] [vw_output_prefix] <compressed_output?> <click_encoding> <no-click_encoding>
Click|No-click is encoded by default as 0.001|0.999, since Vowpal Wabbit expects costs for candidate actions in the range [0, 1]. The train-validate-test split files for 1-slot banners in Vowpal Wabbit format are, by default, uncompressed.
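In other words, the encoding is just a fixed cost per outcome. A hedged one-liner of what this mapping amounts to (the function and argument names are illustrative, not those used in parser.py):
def vw_cost(was_clicked, click_cost=0.001, noclick_cost=0.999):
    # Map the logged click indicator to a cost in [0, 1] for Vowpal Wabbit,
    # where a lower cost means a better (clicked) outcome.
    return click_cost if was_clicked else noclick_cost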
latexify.py
: This takes the stdout output of parser.py
and pretty-prints it in LaTeX format (this is how Tables 2, 3, 4 in the [paper] were populated).
python latexify.py [parser_log]
vw_baselines.sh
: This shell script requires the train-validate-test files generated by parser.py
. Usage is
./vw_baselines.sh <vw_file_prefix> <no-click_encoding>
Use the same command-line arguments when running vw_baselines.sh as you did when generating the files using parser.py. If no command-line arguments are provided, vw_file_prefix is set to vw and no-click_encoding is set to 0.999 by default.
This script trains Vowpal Wabbit on the training set (in either .txt or .txt.gz format) with different hyper-parameters and different reduction approaches.
Then, it generates predictions on the validation and test sets using these trained models. Training is done using
vw -d [train_file] -c --compressed --save_resume -P 500000 --holdout_off --sort_features --noconstant --hash all -b 24 -f [model_file] --cb_adf --cb_type [dm/dr/ips]
See the Vowpal Wabbit Wiki for more details, or use
vw --cb_adf -h
--cb_type dm corresponds to Regression, --cb_type dr is DRO, and --cb_type ips is IPS in the paper.
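For instance, the loop over --cb_type variants that vw_baselines.sh performs could be reproduced in Python along the following lines; the file names and the model naming scheme below are placeholders, not the ones used by the script.
import subprocess

train_file = 'vw.train.gz'             # placeholder; use your parser.py output
for cb_type in ('dm', 'dr', 'ips'):    # Regression, DRO, and IPS from the paper
    model_file = 'model_' + cb_type + '.vw'
    subprocess.run(
        ['vw', '-d', train_file, '-c', '--compressed', '--save_resume',
         '-P', '500000', '--holdout_off', '--sort_features', '--noconstant',
         '--hash', 'all', '-b', '24', '-f', model_file,
         '--cb_adf', '--cb_type', cb_type],
        check=True)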
scorer.py
: This implements the IPS and SNIPS estimates with sub-sampling correction, and their importance sampling diagnostics, as derived in Section 3 of the [paper]. The scorer takes predicted scores from a Vowpal Wabbit model
vw -d [test_file] -c --compressed -P 500000 --holdout_off -i [model_file] -t --rank_all -p [predictions_file]
and reports the performance of a deterministic policy [that picks the lowest-score candidate] as well as a stochastic policy [that uses a multinomial over candidates using exp(-score)]. Usage is:
python scorer.py [predictions_file] [test_file] [no-click_encoding]
This script accounts for sub-sampling of non-clicked impressions, as well as the custom encoding of click information introduced by parser.py.
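To make the estimators concrete, here is a stripped-down sketch of IPS and SNIPS with the 10% sub-sampling of non-clicked impressions undone by re-weighting. Variable names are made up and the details of the correction are an assumption on our part; this is not the code in scorer.py (see Section 3 of the [paper] for the exact derivation).
import numpy as np

def ips_snips(rewards, new_probs, log_propensities, subsample_rate=0.1):
    # rewards: clicks observed on each logged impression (0 if no click)
    # new_probs: probability the evaluated policy assigns to the logged banner
    # log_propensities: the ${propensity} logged for that banner
    # Non-clicked impressions were kept with probability `subsample_rate`,
    # so each logged one stands in for 1/subsample_rate original impressions.
    rewards = np.asarray(rewards, dtype=float)
    w = np.asarray(new_probs, dtype=float) / np.asarray(log_propensities, dtype=float)
    m = np.where(rewards > 0, 1.0, 1.0 / subsample_rate)
    ips = (m * w * rewards).sum() / m.sum()
    snips = (m * w * rewards).sum() / (m * w).sum()
    return ips, snips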
POEM_learn.py
: This implements the POEM baseline. Usage is
python POEM_learn.py -i [train_file] -o [model_prefix]
Use -h to see additional hyper-parameters that can be set from the command line. POEM_learn.py
generates models [model_prefix]_[epochNum].npz
encoding weights for a linear scorer after every mini-batch in each epoch, and also generates a feature dictionary at [model_prefix].features
. POEM_learn.py
uses two helper scripts: Dataset.py
(to process the train_file in .txt or .txt.gz format)
and Instance.py
(to compute per-instance estimates and gradients). Dataset.py
uses a naive single-pass, load-everything-in-memory approach and hence has a large memory footprint (~36GB). The size of the feature dictionary is also set to a hard-coded upper limit; the next release will support feature hashing as in Vowpal Wabbit.
POEM_predict.py
: This takes the model and feature dictionary dumped by POEM_learn.py
and an input file in Vowpal Wabbit format, and outputs predictions in Vowpal Wabbit format.
python POEM_predict.py -i [test_file] -m [model.npz] -f [model.features] -o [predictions_file]
Typical workflow uses this output predictions_file as input to scorer.py.
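The full POEM pipeline can therefore be chained together, for example as below. The file names (and the epoch number in the .npz model) are placeholders; substitute the files produced by parser.py and POEM_learn.py on your machine.
import subprocess

def run(cmd):
    # Echo and run one step of the pipeline, failing fast on errors.
    print('+', ' '.join(cmd))
    subprocess.run(cmd, check=True)

run(['python', 'POEM_learn.py', '-i', 'vw_train.gz', '-o', 'poem_model'])
run(['python', 'POEM_predict.py', '-i', 'vw_test.gz', '-m', 'poem_model_5.npz',
     '-f', 'poem_model.features', '-o', 'poem_predictions.txt'])
run(['python', 'scorer.py', 'poem_predictions.txt', 'vw_test.gz', '0.999'])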
[1] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. de Rijke. Large-scale Validation of Counterfactual Learning Methods: A Test-Bed. NIPS Workshop on "Inference and Learning of Hypothetical and Counterfactual Interventions in Complex Systems", 2016. [arXiv] [paper] [poster].