Revision: 24
Last updated: 9 Aug 2011
This software package implements a supervised learning approach to training submodular scoring functions for extractive multi-document summarization, based on the structural SVM framework. The resulting large-margin method directly optimizes the ROUGE-1 F score. The method is based on the sentence pairwise model described in [1]. The short name “sfour” stands for four words starting with the letter S: structural, SVM, submodular, summarization.
The summarization method is implemented using the svm-struct framework developed by Thorsten Joachims, available at http://svmlight.joachims.org/svm_struct.html.
The contents of the archive are as follows:
- binaries/ (precompiled binaries)
- code/ (source code directory)
- scripts/ (some shared internal scripts)
- duc03/ (data directory for DUC ’03 dataset)
- duc04/ (data directory for DUC ’04 dataset)
- toyset/ (a small, ready-to-run toy example)
Download is available here: http://www.cs.cornell.edu/~rs/sfour/sfoursrc.tgz.
If you want to compile the source code yourself, you only have to download the archive and run make inside the code/ directory. This will produce two executables, svm_sfour_learn and svm_sfour_classify.
Alternatively, you can use the provided binaries in the binaries/ directory.
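For reference, here is a minimal sketch of the download-and-build steps (whether the tarball unpacks directly into the directories listed above or into a top-level folder is an assumption, so adjust the cd path as needed):

$ wget http://www.cs.cornell.edu/~rs/sfour/sfoursrc.tgz
$ tar xzf sfoursrc.tgz
$ cd code/    # adjust the path if the archive unpacks into a top-level folder
$ make        # produces svm_sfour_learn and svm_sfour_classify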
The archive includes a very small toy example in the toyset/ directory. This dataset is synthetic and included only to provide a quick and easy way of running the code for the first time. It contains three documents (each one is actually a paper closely related to this work), each with a single manual summary (i.e. the paper’s abstract). Some minimal preprocessing was done to eliminate most of the junk (a side product of format conversion) and retain only sensible words. In a few easy steps anyone can train a model and then predict a summary using the provided data.
1. Copy the binaries (precompiled from binaries/ or your own from running make in code/) into the toyset/exec/ directory.
2. Train the model by running
$ ./svm_sfour_learn -c 1 -e 0.01 -w 0 trainidx mdl
inside the toyset/exec/ directory. This will use the training examples listed in trainidx and save the model as mdl, using a C value equal to 1.
3. Summarize the documents listed in testidx using the previously trained model mdl by running
$ ./svm_sfour_classify testidx mdl out
4. Performance is reported at the end of prediction in the line reading "Average loss on test set: 0.xxxx", which states the average loss between the prediction and the best possible greedily selected summary sentences using full knowledge.
More details are given in the toyset/HOWTO.txt file.
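For convenience, the four steps above can be combined into a short shell session run from the top of the unpacked archive. This is only a sketch: it assumes the binaries were built with make in code/ and that trainidx and testidx ship with the toy example as described above.

$ cp code/svm_sfour_learn code/svm_sfour_classify toyset/exec/
$ cd toyset/exec/
$ ./svm_sfour_learn -c 1 -e 0.01 -w 0 trainidx mdl
$ ./svm_sfour_classify testidx mdl out    # prints "Average loss on test set: ..." at the end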
The code was developed to work with the DUC '03 & '04 datasets from http://duc.nist.gov/. We provide scripts for converting them into the input format expected by our code.
To use the code with those datasets and run the demo script, you have to obtain the following files:
- duc2003.breakSent.tar.gz (script for breaking articles into sentences, distributed with DUC '03 dataset)
- duc03.results.data.tar.gz (the DUC '03 dataset)
- duc04.results.data.tar.gz (the DUC '04 dataset)
- ROUGE-1.5.5 evaluation scripts from http://berouge.com/
Then place the archives in the subdirectory corresponding to the target dataset, put the ROUGE software in the same directory as the unpacked archive with our code, and run the appropriate script to preprocess the data.
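A hedged sketch of this layout follows (only the archive names listed above are taken from this document; the ROUGE archive name and the assignment of the breakSent archive to duc03/ are assumptions, and the preprocessing script itself is whatever ships with the dataset directories and scripts/):

$ cp duc2003.breakSent.tar.gz duc03.results.data.tar.gz duc03/   # DUC '03 archives
$ cp duc04.results.data.tar.gz duc04/                            # DUC '04 archive
$ tar xzf ROUGE-1.5.5.tgz   # unpack ROUGE next to the sfour directory; archive name is an assumption
$ # then run the preprocessing script provided for the chosen dataset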
The code expects certain data files in specific locations relative to the executable and in a predefined format. To see an example, simply convert one of the DUC datasets using the provided scripts.
The inputfile passed as an argument to the executable should contain a list of the document sets used as training examples or, in the case of prediction, as test examples. It has to contain a list of data files with paths matching ../data/svm[0-9]+ (one per line). If you are using data with headlines, then the files ../data/hdr_d[0-9]+t (numbers matching the data files) contain one headline per line.
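For illustration, an inputfile could look like the following (the specific file names are made up; any paths matching ../data/svm[0-9]+ work the same way):

$ cat trainidx
../data/svm01
../data/svm02
../data/svm03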
Data files (../data/svm[0-9]+) have to be in the following format:
<class> 1:<length> 2:<articleNo> 3:<lineNo> <wordID>:<count> …
Each line is one input sentence. Class 0 represents a sentence from the dataset, and classes 1-4 represent sentences from the manual training summaries. The choice of wordID values is arbitrary, except that the reserved numbers 0-10 must not be used. The length field should contain the length of the sentence in characters (used for calculating the remaining budget), articleNo is the index (starting from one, in an arbitrary order) of the article in the dataset from which the sentence comes, and lineNo is the line number (each sentence counts as one line) of the sentence in the source article.
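As a concrete illustration, two entries of a data file might look like this (the word IDs, counts, and lengths are invented purely to show the layout: the first line is a candidate sentence from article 1, the second a sentence from the first manual summary):

$ head -2 ../data/svm01
0 1:87 2:1 3:5 101:2 243:1 511:3
1 1:64 2:1 3:1 101:1 387:2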
Furthermore, a few additional files are required (illustrated in the sketch after this list):
- ../data/wmap.str with format "<wordID> <wordString>" providing the mapping between word IDs and the corresponding word strings (one entry per line)
- ../data/stops containing a list of stop-words (one per line)
- ../data/CFs.str with format "<wordID> <wordFreq>" listing total word frequencies for the document set (one entry per line).
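A hedged illustration of these three auxiliary files (the IDs, words, and frequencies are invented; only the layout and the one-entry-per-line convention matter):

$ head -2 ../data/wmap.str
101 market
243 election
$ head -2 ../data/stops
a
the
$ head -2 ../data/CFs.str
101 17
243 9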
The easiest way to immediately try our method is to look at the toy example described previously. Alternatively, you can run the binaries manually on your dataset (which should be located in ../data) using the following syntax:
$ svm_sfour_learn -c <C-value> -e 0.01 -w 0 inputfile modelfile
$ svm_sfour_classify inputfile modelfile outputfile
The outputfile will contain a list of selected sentence numbers (starting from zero) for each entry in inputfile. The argument “-w 0” is required during training because it selects the n-slack algorithm, which works best in this setting. The parameter “-e 0.01” sets the precision of the solution (the suggested value worked well in our experiments). Other options (inherited from the svm-struct package) are explained when the executables are run without any arguments.
[1] Large-Margin Training of Submodular Summarization Methods