An informal explanation of "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment"

This is a draft of an informal explanation of how our algorithm learns the knowledge it needs to generate sentence-level paraphrases. Many technical details have been omitted, and some issues have been oversimplified for the sake of clarity. For full details, please see the paper, "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment", Regina Barzilay and Lillian Lee, Proceedings of HLT-NAACL 2003, pp. 16--23.


This is a late parrot! It's a stiff! Bereft of life, it rests in peace! If you hadn't nailed him to the perch he would be pushing up the daisies! Its metabolical processes are of interest only to historians! It's hopped the twig! It's shuffled off this mortal coil! It's rung down the curtain and joined the choir invisible! This is an EX-PARROT! --- Monty Python, "Pet Shop"

Our learning algorithm is a "statistical method" in the sense current
in the natural language processing (NLP) community, that is, it
acquires linguistic information from patterns in real language data
like newspaper text.

To set the stage, although a feature of our work is that it
functions on comparable corpora (roughly, independently written
collections that cover much of the same material), let's first
consider the strictly easier case of a parallel corpus (the "Rosetta
stone" case).  (Our algorithm works on parallel corpora as well.)
Let the
two corpora be A and B.  In a parallel corpus, essentially every
sentence in A lines up with (is "aligned" with) exactly one sentence
in B, and vice versa.  For example (I've numbered the sentences for
reference):

A1) Mary Smith visited a couple of cities last week during her campaign for state governor.     
A2) In Rochester, she talked about gun control.		     
A3) In Syracuse, she talked about campaign financing.		     
A4) In Ithaca, she spoke about unemployment.
A5) Bob Jones visited the same cities a week later.				     

B1) Last week, Mary Smith toured several cities during her gubernatorial campaign.            
B2) She spoke to a Rochester audience regarding the issue of gun control.
B3) In an address in the Salt City, she took on the issue of campaign financing.
B4) She spoke to an Ithaca audience regarding its joblessness rate.
B5) The previous week, Bob Jones barnstormed in the very same cities.

In a parallel corpus, the corresponding sentences are paraphrases of
each other.  This makes the job of extracting paraphrases relatively
easy.  For instance, looking at the pair of sentences (A2, B2), by
assuming that "gun control" in A2 should be aligned to "gun control"
in B2, one can infer that "In Rochester, she talked about" must have the
same meaning as "She spoke to a Rochester audience regarding the issue
of".
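
(To make this concrete, here is a toy Python sketch, not the actual
procedure from the paper, of how one could pull out that phrase pair
once the shared string "gun control" has been identified as an anchor:

    def extract_prefix_paraphrase(sent_a, sent_b, anchor):
        """Given two aligned sentences that both end with the same
        anchor phrase, return the two prefixes that precede it; by
        assumption, these two prefixes have the same meaning."""
        prefix_a = sent_a[:sent_a.index(anchor)].strip()
        prefix_b = sent_b[:sent_b.index(anchor)].strip()
        return prefix_a, prefix_b

    a2 = "In Rochester, she talked about gun control."
    b2 = "She spoke to a Rochester audience regarding the issue of gun control."
    print(extract_prefix_paraphrase(a2, b2, "gun control"))
    # ('In Rochester, she talked about',
    #  'She spoke to a Rochester audience regarding the issue of')

Of course, finding the anchor in the first place, and handling anchors
that don't sit conveniently at the end of the sentence, is where the
real work lies.)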

But, we don't want the system to learn *just* that 
    "In Rochester, she talked about" 
is a paraphrase of 
    "She spoke to a Rochester audience regarding the issue of" 
because that isn't enough information to let us
paraphrase many new sentences (that is, this information isn't enough
to let the system generalize).  For instance, a hypothetical system
based just on this sort of information could take the new sentence 
    "In Rochester, she talked about racism" 
and paraphrase it to 
    "She spoke to a Rochester audience regarding the issue of racism", 
BUT, given the sentence, 
    "In Cleveland, she talked about taxes" 
it would be *unable* to paraphrase this to
    "She spoke to a Cleveland audience regarding the issue of taxes"
because "In Cleveland ..." doesn't match "In Rochester ..." as far as the
hypothetical system knows. What we really want our system to learn is
that a sentence of the form 
    "In X, Y talked about Z"
can be paraphrased as
    "Y spoke to a X audience regarding the issue of Z"
where X, Y, and Z, the "arguments", can be anything.
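
(If you're wondering what "using" such a learned rule looks like, here
is a minimal, purely illustrative Python sketch; the regular
expression and the slot names are my own stand-ins, not the paper's
actual representation:

    import re

    # Hypothetical learned rule:
    #   "In X, Y talked about Z"  ->
    #   "Y spoke to a X audience regarding the issue of Z"
    source = re.compile(r"In (?P<X>\w+), (?P<Y>\w+) talked about (?P<Z>[\w ]+)\.")
    target = r"\g<Y> spoke to a \g<X> audience regarding the issue of \g<Z>."

    print(source.sub(target, "In Cleveland, she talked about taxes."))
    # she spoke to a Cleveland audience regarding the issue of taxes.

A real system would also need to handle capitalization, multi-word
fillers for X and Y, and so on, but this is the basic fill-in-the-slots
idea.)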

The first step is to learn that "In X, Y talked about Z", where X, Y,
and Z are arguments, is a valid linguistic pattern, whereas "X Rochester
she Y Z gun control" isn't.  We do this on each of corpus A and B
separately.  An informal description of how our algorithm does so
follows.

(a) sentence clustering: try to collect data for inferring templates
by using a clustering algorithm to identify
sentences that look very similar on a word-by-word level (since the
system doesn't really know anything about English, it can only look
for these simplistic kinds of patterns at first).  The output of this
step might be that within corpus A, sentences A2, A3, and A4 form a
cluster; the algorithm might also find that within corpus B, sentences
B2 and B4 form a cluster.
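
(The paper does this with, roughly speaking, hierarchical clustering
over word n-gram overlap; the little Python sketch below substitutes a
much cruder greedy scheme based on bag-of-words overlap, just to
convey the flavor:

    def jaccard(s1, s2):
        """Crude word-overlap similarity between two sentences."""
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        return len(w1 & w2) / len(w1 | w2)

    def greedy_cluster(sentences, threshold=0.3):
        """Put each sentence into the first cluster whose seed it
        resembles closely enough; otherwise start a new cluster."""
        clusters = []
        for sent in sentences:
            for cluster in clusters:
                if jaccard(sent, cluster[0]) >= threshold:
                    cluster.append(sent)
                    break
            else:
                clusters.append([sent])
        return clusters

    corpus_a = [
        "Mary Smith visited a couple of cities last week during her campaign for state governor.",
        "In Rochester, she talked about gun control.",
        "In Syracuse, she talked about campaign financing.",
        "In Ithaca, she spoke about unemployment.",
        "Bob Jones visited the same cities a week later.",
    ]
    for c in greedy_cluster(corpus_a):
        print(len(c), c[0])
    # the A2/A3/A4 sentences end up together in one three-sentence
    # cluster; A1 and A5 each end up alone

)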

(b) inducing patterns.  Take the cluster {A2,A3,A4}.  

A2) In Rochester, she talked about gun control.		     
A3) In Syracuse, she talked about campaign financing.		     
A4) In Ithaca, she spoke about unemployment.

Clearly, they all have the word "In" in common, and so all the "In"s
should be aligned (that is, considered to be equivalent); similarly
for all the "she"s and "about"s. This can be represented by a
"lattice" computed by multiple-sequence alignment, an algorithm that
determines the optimal way to match up portions of related sequences
of items such as words.  The output lattice might be like this (and
I'm trying to represent a graphical concept in text; I hope this is
clear):

    In [Rochester/Syracuse/Ithaca] she [talked/spoke] about
    [gun control/campaign financing/unemployment].
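
For the curious, here is a drastically simplified Python sketch of how
such a lattice could be built up.  Real multiple-sequence alignment
(as used in the paper, and in computational biology) is more
principled than this greedy pairwise merging via Python's difflib, but
the output has the same shape:

    import difflib

    def fold_in(lattice, words):
        """Fold one more sentence into a crude lattice: a list of
        columns, each column a list of the words seen at that
        position."""
        spine = [col[0] for col in lattice]  # one representative per column
        matcher = difflib.SequenceMatcher(a=spine, b=words)
        new = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    new.append(lattice[i] + [words[j]])
            else:
                # mismatched region: pool the alternatives in one column
                # (a real lattice keeps multi-word fillers intact)
                pooled = [w for col in lattice[i1:i2] for w in col]
                pooled += words[j1:j2]
                if pooled:
                    new.append(pooled)
        return new

    cluster = ["In Rochester she talked about gun control",
               "In Syracuse she talked about campaign financing",
               "In Ithaca she spoke about unemployment"]
    lattice = [[w] for w in cluster[0].split()]
    for sent in cluster[1:]:
        lattice = fold_in(lattice, sent.split())
    print(" ".join("[%s]" % "/".join(sorted(set(col))) for col in lattice))
    # [In] [Ithaca/Rochester/Syracuse] [she] [spoke/talked] [about]
    # [campaign/control/financing/gun/unemployment]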

Where are the arguments?  The algorithm decides that the parts that
exhibit a lot of differences between the sentences in the cluster must
be arguments, since arguments, after all, are the parts of a sentence
that can vary.  In our example, we would then get the "template":

   In X she [talked/spoke] about Z.

Note that in the process, the system has learned that "talked" and
"spoke" are paraphrases, and it has done so from the *single* corpus
A.  This part of the algorithm is what can be accomplished without a
"Rosetta Stone".

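In code, the "highly variable columns become argument slots" decision
might look something like the sketch below, operating on the lattice
columns from the previous sketch.  (The paper's actual criterion is
stated in terms of how many of the cluster's sentences share a given
word; the cutoffs here, and the slot labels, are just for
illustration.)

    def induce_template(lattice):
        """Turn lattice columns into a template: unanimous columns
        become literal words, mildly variable columns become small
        alternation lists (like [talked/spoke]), and columns where
        nearly every sentence contributed a different word become
        argument slots.  The thresholds are illustrative only."""
        parts, slot_names = [], iter(["X", "Y", "Z"])
        for col in lattice:
            alts = sorted(set(col))
            if len(alts) == 1:                            # unanimous: a literal
                parts.append(alts[0])
            elif len(alts) / len(col) > 0.5 and len(alts) > 2:
                parts.append(next(slot_names))            # highly variable: a slot
            else:
                parts.append("[%s]" % "/".join(alts))
        return " ".join(parts)

    # the lattice produced by the previous sketch, written out literally
    lattice = [["In"] * 3,
               ["Rochester", "Syracuse", "Ithaca"],
               ["she"] * 3,
               ["talked", "talked", "spoke"],
               ["about"] * 3,
               ["gun", "control", "campaign", "financing", "unemployment"]]
    print(induce_template(lattice))
    # In X she [spoke/talked] about Y

(The slot labels are arbitrary; the text above happens to call the
last one Z.)
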
Running the same process on the cluster {B2,B4}, 

B2) She spoke to a Rochester audience regarding the issue of gun control.
B4) She spoke to an Ithaca audience regarding its joblessness rate.

we might get the template 

   She spoke to [a/an] X audience regarding Z.

Again, this is information gotten from just a single corpus, just one
of the halves of the Rosetta stone.

But, now remember that (for the initial sake of argument) we are
dealing with a parallel corpus, so we know that sentences A2 and B2 are
paraphrases, and that sentences A4 and B4 are paraphrases.  This
implies that templates derived from A2 should represent paraphrases of
templates derived from B2, and similarly for A4 and B4.  So, the
system infers that sentences of the form "In X she [talked/spoke]
about Z" can be paraphrased by sentences of the form "She spoke to [a/an]
X audience regarding Z."  And Bob's your uncle.
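
Schematically, this cross-corpus step is just bookkeeping: for every
aligned sentence pair, note which template each sentence belongs to,
and count.  A toy Python sketch (the sentence IDs and counts are made
up for the running example):

    from collections import Counter

    def pair_templates(aligned_pairs, template_of_a, template_of_b):
        """Count how often each (A-template, B-template) pair is seen
        across aligned sentences; pairs that co-occur often are taken
        to be paraphrase rules."""
        counts = Counter()
        for sa, sb in aligned_pairs:
            if sa in template_of_a and sb in template_of_b:
                counts[(template_of_a[sa], template_of_b[sb])] += 1
        return counts

    aligned = [("A2", "B2"), ("A4", "B4")]
    t_a = {"A2": "In X she [talked/spoke] about Z",
           "A4": "In X she [talked/spoke] about Z"}
    t_b = {"B2": "She spoke to [a/an] X audience regarding Z",
           "B4": "She spoke to [a/an] X audience regarding Z"}
    print(pair_templates(aligned, t_a, t_b).most_common(1))
    # [(('In X she [talked/spoke] about Z',
    #    'She spoke to [a/an] X audience regarding Z'), 2)]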

Now, the added bit of complexity.  Parallel corpora are very nice;
we've just seen how they facilitate the paraphrase-acquisition task,
for instance.  However, we all know that there's no free lunch:  the
problem is that parallel corpora are relatively hard to come by.
After all, why would one expect to be able to find documents that can
be matched up sentence by sentence?  (Many examples of parallel
corpora have been explicitly constructed by "translating" a corpus
sentence by sentence.)  Since there aren't that many domains of
discourse for which parallel corpora exist (the usual examples are
government proceedings for countries where records are mandated to be
bi- or multi-lingual), paraphrase-acquisition methods that rely on
parallel corpora are consequently limited to relatively few domains.  

Comparable corpora, on the other hand, are quite common,
collections of newswire from different agencies being one example of a
plentiful source.  A small example might consist of C and D below:

C1) Bob Jones spent late November barnstorming around New York State.
C2) Mary Smith visited a couple of cities last week during her campaign for state governor.     
C3) In Rochester, she talked about gun control.		     
C4) In Syracuse, she talked about campaign financing.		     
C5) In Ithaca, she spoke about unemployment.

D1) Last week, Mary Smith toured several cities during her gubernatorial campaign.
D2) She spoke to an Ithaca audience regarding its joblessness rate.
D3) Several times, she was interrupted by enthusiastic cheers.
D4) In an address in Syracuse, she took on the issue of campaign financing.
D5) Her listeners there were wildly supportive as well. 
D6) She spoke to a Rochester audience regarding the issue of gun control.
D7) But this time, stony silence greeted her.
D8) Her opponent, Bob Jones, had a much warmer reception when he
addressed the topic in his earlier sweep through the town.

Note that the topics are presented in a different order, and that some
sentences have no counterpart in the other corpus.  However, there are
pairs of sentences from C and D that do match up.  This can be thought
of as a "Rosetta stones broken in half" situation.  So, what our
algorithm does is simply to attempt to identify those sentence pairs
that can be aligned --- gluing together the matching halves --- so
that the techniques for parallel corpora, described above, can then be
applied to these pairs.
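
To give a feel for that matching step, here is one more toy Python
sketch.  The paper actually compares the argument fillers of sentences
drawn from articles about the same events; the crude
content-word-overlap heuristic below is only meant to convey the
flavor:

    def find_aligned_pairs(corpus_c, corpus_d, threshold=2):
        """Guess which cross-corpus sentence pairs describe the same
        thing by counting shared 'content' words (here, crudely,
        words longer than three characters)."""
        def content(sent):
            return {w.strip(".,").lower() for w in sent.split() if len(w) > 3}
        return [(c, d)
                for c in corpus_c
                for d in corpus_d
                if len(content(c) & content(d)) >= threshold]

    c3 = "In Rochester, she talked about gun control."
    d6 = "She spoke to a Rochester audience regarding the issue of gun control."
    print(find_aligned_pairs([c3], [d6]))
    # the pair (C3, D6) is found, via the shared words
    # "rochester" and "control"

Once those pairs are in hand, everything proceeds exactly as in the
parallel-corpus case above.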

