Large-scale Validation of Counterfactual Learning Methods: A Test-Bed
Date: December 1, 2016
We provide a public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF).
The data comes from traffic logged by Criteo, a leader in the display advertising space.
This dataset is hosted on Amazon AWS and is available to the public at
https://s3-eu-west-1.amazonaws.com/reco-dataset/CriteoBannerFillingChallenge.tar.gz
A small sample of this data (~400 impressions, ~1MB) is available here.
The dataset has over 100 million display ad impressions and is 35GB gzipped / 250GB raw. We hope this dataset will serve as a large-scale standardized test-bed for the evaluation of counterfactual learning methods. If you use the dataset for your research, please cite [1] and drop a note about your research to us as well as to the team at Criteo.
Consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase.
Each ad has one of many banner types, which differ in the number of products they contain and in their layout.
The task is to choose the products to display in the ad knowing the banner type, user context, and candidate ads,
in order to maximize the number of clicks.
The format of this data is:
example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} ...
${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
...
${wasProductMClicked} exid:${exID} ${productFeatM_1}:${vM_1} ...
Each impression is represented by ${M+1} lines, where ${M} is the number of candidate ads and the first line is a header containing summary information. The ${nbSlots} slots in a banner are labeled in order from left to right and from top to bottom. The first ${nbSlots} candidates correspond to the displayed products, ordered by position.
The logging policy stochastically fills the banner by first computing non-negative scores for all candidates, and then sampling without replacement from the multinomial distribution defined by these scores (i.e. a Plackett-Luce ranking model; see the sketch below). The ${propensity} records the probability with which the displayed banner was sampled under this logging policy.
There are 35 features. Display features include the user context and banner type, which are constant for all candidates in an impression. Each unique quadruplet of feature IDs < 1, 2, 3, 5 > corresponds to a unique banner type. Features 1 and 2 are numerical, while all other features are categorical. Some categorical features are multi-valued (order does not matter).
Example IDs increase with time, allowing temporal slices for evaluation. Importantly, non-clicked examples were sub-sampled aggressively to reduce the dataset size, and only a random 10% sub-sample of non-clicked impressions is logged.
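To make the logging policy concrete, the following sketch computes the probability of a displayed ranking under a Plackett-Luce model given non-negative candidate scores. It is a minimal illustration of the sampling scheme, not code shipped with the dataset; the function name and inputs are hypothetical.
import numpy as np

def plackett_luce_propensity(scores, displayed):
    # Probability of showing the candidates in `displayed` (indices, in slot
    # order) when sampling without replacement proportionally to `scores`.
    scores = np.asarray(scores, dtype=float)
    remaining = np.ones(len(scores), dtype=bool)
    prob = 1.0
    for idx in displayed:
        prob *= scores[idx] / scores[remaining].sum()
        remaining[idx] = False
    return prob

# Example: 4 candidates with scores [1, 2, 3, 4]; a 2-slot banner shows
# candidate 2 in slot 1 and candidate 0 in slot 2.
print(plackett_luce_propensity([1.0, 2.0, 3.0, 4.0], [2, 0]))  # 3/10 * 1/7 ~ 0.043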
Download all helper scripts here: BLBF-DisplayAds.zip
All evaluation scripts are written in Python3 and were developed on a Linux machine.
Learning algorithms use Vowpal Wabbit
and a Python3 implementation of POEM (included as a stand-alone in BLBF-DisplayAds.zip).
Any missing Python packages can be installed with
conda install [package]
vw is assumed to be executable from the current working directory. To achieve this on Linux, after compiling Vowpal Wabbit, simply run
sudo make install
Otherwise, simply replace every invocation of vw in the scripts below with the full path to the Vowpal Wabbit binary.
Uncompress the dataset with
tar -zxvf CriteoBannerFillingChallenge.tar.gz
Optionally, you can re-compress it using gzip, since all the scripts support reading/writing gzipped files through the zlib library:
gzip CriteoBannerFillingChallenge.txt
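As an aside, reading either the plain .txt or the re-compressed .txt.gz file transparently takes only a few lines of Python. This is a sketch of the idea under that assumption (the helper name is made up), not the exact code inside the provided scripts.
import gzip

def open_maybe_gzipped(path):
    # Open a log file whether or not it has been gzip-compressed.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')   # text mode over the decompressed stream
    return open(path, 'r')

with open_maybe_gzipped('CriteoBannerFillingChallenge.txt.gz') as f:
    header = next(f)                   # first line of the first impression
    print(header[:80])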
parser.py
: This reads the Criteo dataset in either .txt or .txt.gz format, and prints statistics (see Tables 1, 2, 3, 4 in the [paper])
and also creates a train-validate-test split of all 1-slot banner impressions in Vowpal Wabbit input format.
python parser.py [criteo_data_file] [vw_output_prefix] <compressed_output?> <click_encoding> <no-click_encoding>
Click|No-click is encoded by default as 0.001|0.999, since Vowpal Wabbit expects costs for candidate actions in the range [0, 1]. The train-validate-test split files for 1-slot banners in Vowpal Wabbit format are, by default, uncompressed.
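In other words, the encoding is just a fixed cost per outcome. A hedged one-liner of what this mapping amounts to (the function and argument names are illustrative, not those used in parser.py):
def vw_cost(was_clicked, click_cost=0.001, noclick_cost=0.999):
    # Map the logged click indicator to a cost in [0, 1] for Vowpal Wabbit,
    # where a lower cost means a better (clicked) outcome.
    return click_cost if was_clicked else noclick_cost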
latexify.py
: This takes the stdout output of parser.py
and pretty-prints it in LaTeX format (this is how Tables 2, 3, 4 in the [paper] were populated).
python latexify.py [parser_log]
vw_baselines.sh
: This shell script requires the train-validate-test files generated by parser.py
. Usage is
./vw_baselines.sh <vw_file_prefix> <no-click_encoding>
Use the same command-line arguments when running vw_baselines.sh as you did when generating the files using parser.py. If no command-line arguments are provided, vw_file_prefix is set to vw and no-click_encoding is set to 0.999 by default.
This script trains Vowpal Wabbit on the training set (in either .txt or .txt.gz format) with different hyper-parameters and different reduction approaches.
Then, it generates predictions on the validation and test sets using these trained models. Training is done using
vw -d [train_file] -c --compressed --save_resume -P 500000 --holdout_off --sort_features --noconstant --hash all -b 24 -f [model_file] --cb_adf --cb_type [dm/dr/ips]
See the Vowpal Wabbit Wiki for more details, or use
vw --cb_adf -h
--cb_type dm corresponds to Regression, --cb_type dr is DRO, and --cb_type ips is IPS in the paper.
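For instance, the loop over --cb_type variants that vw_baselines.sh performs could be reproduced in Python along the following lines; the file names and the model naming scheme below are placeholders, not the ones used by the script.
import subprocess

train_file = 'vw.train.gz'             # placeholder; use your parser.py output
for cb_type in ('dm', 'dr', 'ips'):    # Regression, DRO, and IPS from the paper
    model_file = 'model_' + cb_type + '.vw'
    subprocess.run(
        ['vw', '-d', train_file, '-c', '--compressed', '--save_resume',
         '-P', '500000', '--holdout_off', '--sort_features', '--noconstant',
         '--hash', 'all', '-b', '24', '-f', model_file,
         '--cb_adf', '--cb_type', cb_type],
        check=True)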
scorer.py
: This implements the IPS and SNIPS estimates with sub-sampling correction, and their importance sampling diagnostics, as derived in Section 3 of the [paper]. The scorer takes predicted scores from a Vowpal Wabbit model
vw -d [test_file] -c --compressed -P 500000 --holdout_off -i [model_file] -t --rank_all -p [predictions_file]
and reports the performance of a deterministic policy [that picks the lowest-score candidate] as well as a stochastic policy [that uses a multinomial over candidates using exp(-score)]. Usage is:
python scorer.py [predictions_file] [test_file] [no-click_encoding]
This script accounts for sub-sampling of non-clicked impressions, as well as the custom encoding of click information introduced by parser.py.
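To make the estimators concrete, here is a stripped-down sketch of IPS and SNIPS with the 10% sub-sampling of non-clicked impressions undone by re-weighting. Variable names are made up and the details of the correction are an assumption on our part; this is not the code in scorer.py (see Section 3 of the [paper] for the exact derivation).
import numpy as np

def ips_snips(rewards, new_probs, log_propensities, subsample_rate=0.1):
    # rewards: clicks observed on each logged impression (0 if no click)
    # new_probs: probability the evaluated policy assigns to the logged banner
    # log_propensities: the ${propensity} logged for that banner
    # Non-clicked impressions were kept with probability `subsample_rate`,
    # so each logged one stands in for 1/subsample_rate original impressions.
    rewards = np.asarray(rewards, dtype=float)
    w = np.asarray(new_probs, dtype=float) / np.asarray(log_propensities, dtype=float)
    m = np.where(rewards > 0, 1.0, 1.0 / subsample_rate)
    ips = (m * w * rewards).sum() / m.sum()
    snips = (m * w * rewards).sum() / (m * w).sum()
    return ips, snips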
POEM_learn.py
: This implements the POEM baseline. Usage is
python POEM_learn.py -i [train_file] -o [model_prefix]
Use -h to see additional hyper-parameters that can be set from the command line. POEM_learn.py
generates models [model_prefix]_[epochNum].npz
encoding weights for a linear scorer after every mini-batch in each epoch, and also generates a feature dictionary at [model_prefix].features
. POEM_learn.py
uses two helper scripts: Dataset.py
(to process the train_file in .txt or .txt.gz format)
and Instance.py
(to compute per-instance estimates and gradients). Dataset.py
uses a naive single-pass, load-everything-in-memory approach and hence has a large memory footprint (~36GB). The size of the feature dictionary is also set to a hard-coded upper limit; the next release will support feature hashing as in Vowpal Wabbit.
POEM_predict.py
: This takes the model and feature dictionary dumped by POEM_learn.py
and an input file in Vowpal Wabbit format, and outputs predictions in Vowpal Wabbit format.
python POEM_predict.py -i [test_file] -m [model.npz] -f [model.features] -o [predictions_file]
Typical workflow uses this output predictions_file as input to scorer.py.
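The full POEM pipeline can therefore be chained together, for example as below. The file names (and the epoch number in the .npz model) are placeholders; substitute the files produced by parser.py and POEM_learn.py on your machine.
import subprocess

def run(cmd):
    # Echo and run one step of the pipeline, failing fast on errors.
    print('+', ' '.join(cmd))
    subprocess.run(cmd, check=True)

run(['python', 'POEM_learn.py', '-i', 'vw_train.gz', '-o', 'poem_model'])
run(['python', 'POEM_predict.py', '-i', 'vw_test.gz', '-m', 'poem_model_5.npz',
     '-f', 'poem_model.features', '-o', 'poem_predictions.txt'])
run(['python', 'scorer.py', 'poem_predictions.txt', 'vw_test.gz', '0.999'])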
[1] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. de Rijke. Large-scale Validation of Counterfactual Learning Methods: A Test-Bed. NIPS Workshop on "Inference and Learning of Hypothetical and Counterfactual Interventions in Complex Systems", 2016. [arXiv] [paper] [poster].