This is a late parrot! It's a stiff! Bereft of life, it rests in
peace! If you hadn't nailed him to the perch he would be pushing up
the daisies! Its metabolical processes are of interest only to
historians! It's hopped the twig! It's shuffled off this mortal coil!
It's rung down the curtain and joined the choir invisible! This is
an EX-PARROT! --- Monty Python, "Pet Shop"
Our learning algorithm is a "statistical method" in the sense current in the natural language processing (NLP) community: that is, it acquires linguistic information from patterns in real language data, such as newspaper text. To set the stage, although a feature of our work is that it functions on comparable corpora, let's first consider the strictly easier case of a parallel corpus (the "Rosetta stone" case). (Our algorithm works on parallel corpora as well.) Let the two corpora be A and B. In a parallel corpus, essentially every sentence in A lines up with (is "aligned" with) exactly one sentence in B, and vice versa. For example (I've numbered the sentences for reference):

A1) Mary Smith visited a couple of cities last week during her campaign for state governor.
A2) In Rochester, she talked about gun control.
A3) In Syracuse, she talked about campaign financing.
A4) In Ithaca, she spoke about unemployment.
A5) Bob Jones visited the same cities a week later.

B1) Last week, Mary Smith toured several cities during her gubernatorial campaign.
B2) She spoke to a Rochester audience regarding the issue of gun control.
B3) In an address in the Salt City, she took on the issue of campaign financing.
B4) She spoke to an Ithaca audience regarding its joblessness rate.
B5) The previous week, Bob Jones barnstormed in the very same cities.

In a parallel corpus, the corresponding sentences are paraphrases of each other, which makes the job of extracting paraphrases relatively easy. For instance, looking at the pair of sentences (A2, B2), by assuming that "gun control" in A2 should be aligned with "gun control" in B2, one can infer that "In Rochester, she talked about" must have the same meaning as "She spoke to a Rochester audience regarding the issue of".
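To make this inference concrete, here is a minimal Python sketch, a hypothetical illustration rather than the paper's actual procedure; it assumes we already know which phrase to use as the anchor (in reality, finding that anchor is part of the problem):

```python
# Hypothetical illustration: given two aligned sentences that share an
# anchor phrase, the material before the anchor in each sentence must
# mean the same thing.

def paraphrase_from_pair(sent_a, sent_b, anchor):
    """Return the pair of prefixes preceding `anchor`, or None if the
    anchor is missing from either sentence."""
    if anchor not in sent_a or anchor not in sent_b:
        return None
    prefix_a = sent_a[: sent_a.index(anchor)].strip()
    prefix_b = sent_b[: sent_b.index(anchor)].strip()
    return prefix_a, prefix_b

a2 = "In Rochester, she talked about gun control."
b2 = "She spoke to a Rochester audience regarding the issue of gun control."
print(paraphrase_from_pair(a2, b2, "gun control"))
# ('In Rochester, she talked about',
#  'She spoke to a Rochester audience regarding the issue of')
```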
But we don't want the system to learn *just* that "In Rochester, she talked about" is a paraphrase of "She spoke to a Rochester audience regarding the issue of", because that isn't enough information to let us paraphrase many new sentences (that is, this information isn't enough to let the system generalize). For instance, a hypothetical system based just on this sort of information could take the new sentence "In Rochester, she talked about racism" and paraphrase it to "She spoke to a Rochester audience regarding the issue of racism", but, given the sentence "In Cleveland, she talked about taxes", it would be *unable* to paraphrase this to "She spoke to a Cleveland audience regarding the issue of taxes", because "In Cleveland ..." doesn't match "In Rochester ..." as far as the hypothetical system knows. What we really want our system to learn is that a sentence of the form "In X, Y talked about Z" can be paraphrased as "Y spoke to a X audience regarding the issue of Z", where X, Y, and Z, the "arguments", can be anything.

The first step is to learn that "In X, Y talked about Z", where X, Y, and Z are arguments, is a valid linguistic pattern, whereas "X Rochester she Y Z gun control" isn't. We do this on each of corpus A and B separately. An informal description of how our algorithm does so is as follows.

(a) Sentence clustering: collect data for inferring templates by using a clustering algorithm to identify sentences that look very alike on a word-by-word level (since the system doesn't really know anything about English, it can only look for these simplistic kinds of patterns at first). The output of this step might be that within corpus A, sentences A2, A3, and A4 form a cluster; it might also find that within corpus B, sentences B2 and B4 form a cluster.

(b) Inducing patterns: take the cluster {A2, A3, A4}:

A2) In Rochester, she talked about gun control.
A3) In Syracuse, she talked about campaign financing.
A4) In Ithaca, she spoke about unemployment.
Clearly, they all have in common the bits "In", and so all the "In"s should be aligned (that is, considered to be equivalent); similarly for all the "she"s and "about"s. This can be represented by a "lattice" computed by multiple-sequence alignment, an algorithm that determines the optimal way to match up portions of related sequences of items such as words. The output lattice might look like this (I'm trying to represent a graphical concept in text; I hope this is clear):

In [Rochester/Syracuse/Ithaca] she [talked/spoke] about [gun control/campaign financing/unemployment].

Where are the arguments? The algorithm decides that the parts that exhibit a lot of differences between the sentences in the cluster must be arguments, since arguments, after all, are the parts of a sentence that can vary. In our example, we would then get the "template":

In X she [talked/spoke] about Z.

Note that in the process, the system has learned that "talked" and "spoke" are paraphrases, and it has done so from the *single* corpus A. This part of the algorithm is what can be accomplished without a "Rosetta stone". Running the same process on the cluster {B2, B4},

B2) She spoke to a Rochester audience regarding the issue of gun control.
B4) She spoke to an Ithaca audience regarding its joblessness rate.

we might get the template

She spoke to [a/an] X audience regarding Z.

Again, this is information obtained from just a single corpus, just one of the halves of the Rosetta stone. But now remember that (for the initial sake of argument) we are dealing with a parallel corpus, so we know that sentences A2 and B2 are paraphrases, and that sentences A4 and B4 are paraphrases. This implies that templates derived from A2 should represent paraphrases of templates derived from B2, and similarly for A4 and B4. So the system infers that sentences of the form "In X she [talked/spoke] about Z" can be paraphrased by sentences of the form "She spoke to [a/an] X audience regarding Z". And Bob's your uncle.
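Steps (a) and (b) can be sketched in toy form in Python. The sketch below uses word-overlap (Jaccard) similarity as a simplistic stand-in for the actual clustering, and difflib's pairwise word alignment as a stand-in for true multiple-sequence alignment, so it handles only two sentences at a time and cannot tell a rarely-varying word choice from a genuine argument slot:

```python
import difflib

def jaccard(s1, s2):
    """Word-overlap similarity between two sentences (toy measure)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def cluster(sentences, threshold=0.3):
    """Step (a), toy version: greedily group sentences that share
    enough words with some existing cluster member."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if any(jaccard(s, member) >= threshold for member in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def pairwise_template(s1, s2):
    """Step (b), toy version: align two clustered sentences word by
    word.  Identical stretches stay literal; differing single words
    become [a/b] choices; longer differing stretches become argument
    slots X, Y, Z."""
    w1, w2 = s1.split(), s2.split()
    matcher = difflib.SequenceMatcher(a=w1, b=w2, autojunk=False)
    parts, slots = [], iter("XYZ")
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(w1[i1:i2])
        elif i2 - i1 <= 1 and j2 - j1 <= 1:
            parts.append("[" + " ".join(w1[i1:i2]) + "/" + " ".join(w2[j1:j2]) + "]")
        else:
            parts.append(next(slots))
    return " ".join(parts)

corpus_a = [
    "Mary Smith visited a couple of cities last week during her campaign for state governor.",
    "In Rochester, she talked about gun control.",
    "In Syracuse, she talked about campaign financing.",
    "In Ithaca, she spoke about unemployment.",
    "Bob Jones visited the same cities a week later.",
]
print(cluster(corpus_a))  # A2, A3, A4 end up in one cluster

corpus_b_cluster = [
    "She spoke to a Rochester audience regarding the issue of gun control.",
    "She spoke to an Ithaca audience regarding its joblessness rate.",
]
print(pairwise_template(*corpus_b_cluster))
# "She spoke to X audience regarding Y" -- with only two sentences,
# the a/an choice and the city name form one differing stretch and
# get folded into the single slot X
```

A real multiple-sequence alignment over all cluster members at once is what lets the actual system separate the [a/an] word choice from the argument slot.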
Now, the added bit of complexity. Parallel corpora are very nice; we've just seen how they facilitate the paraphrase-acquisition task, for instance. However, we all know that there's no free lunch: the problem is that parallel corpora are relatively hard to come by. After all, why would one expect to be able to find documents that can be matched up sentence by sentence? (Many examples of parallel corpora have been explicitly constructed by "translating" a corpus sentence by sentence.) Since there aren't that many domains of discourse for which parallel corpora exist (the usual examples are government proceedings for countries where records are mandated to be bi- or multilingual), paraphrase-acquisition methods that rely on parallel corpora are consequently limited to relatively few domains. Comparable corpora, on the other hand, are quite common; collections of newswire from different agencies are one plentiful example. A small example might consist of C and D below:

C1) Bob Jones spent late November barnstorming around New York State.
C2) Mary Smith visited a couple of cities last week during her campaign for state governor.
C3) In Rochester, she talked about gun control.
C4) In Syracuse, she talked about campaign financing.
C5) In Ithaca, she spoke about unemployment.

D1) Last week, Mary Smith toured several cities during her gubernatorial campaign.
D2) She spoke to an Ithaca audience regarding its joblessness rate.
D3) Several times, she was interrupted by enthusiastic cheers.
D4) In an address in Syracuse, she took on the issue of campaign financing.
D5) Her listeners there were wildly supportive as well.
D6) She spoke to a Rochester audience regarding the issue of gun control.
D7) But this time, stony silence greeted her.
D8) Her opponent, Bob Jones, had a much warmer reception when he addressed the topic in his earlier sweep through the town.
Note that the topics are presented in a different order, and that some sentences have no counterpart in the other corpus. However, there are pairs of sentences from C and D that do match up. This can be thought of as a "Rosetta stone broken in half" situation. So, what our algorithm does is simply to attempt to identify those sentence pairs that can be aligned --- gluing the matching halves back together --- so that the techniques for parallel corpora, described above, can then be applied to these pairs.
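One toy way to do the gluing is to score every cross-corpus sentence pair by content-word overlap and keep the pairs that share enough words; the stopword list and threshold below are illustrative choices, not the paper's:

```python
# Toy cross-corpus sentence matching for the C/D example above.
# Stopwords and the threshold of 2 shared content words are
# illustrative assumptions, not part of the actual algorithm.

STOPWORDS = {"the", "a", "an", "of", "in", "to", "she", "her", "about",
             "regarding", "issue", "its", "on", "this"}

def content_words(sentence):
    return {w.strip(".,").lower() for w in sentence.split()} - STOPWORDS

def match_pairs(corpus_c, corpus_d, threshold=2):
    """Return (c, d) index pairs sharing at least `threshold` content words."""
    pairs = []
    for i, c in enumerate(corpus_c):
        for j, d in enumerate(corpus_d):
            if len(content_words(c) & content_words(d)) >= threshold:
                pairs.append((i, j))
    return pairs

corpus_c = [
    "Bob Jones spent late November barnstorming around New York State.",
    "Mary Smith visited a couple of cities last week during her campaign for state governor.",
    "In Rochester, she talked about gun control.",
    "In Syracuse, she talked about campaign financing.",
    "In Ithaca, she spoke about unemployment.",
]
corpus_d = [
    "Last week, Mary Smith toured several cities during her gubernatorial campaign.",
    "She spoke to an Ithaca audience regarding its joblessness rate.",
    "Several times, she was interrupted by enthusiastic cheers.",
    "In an address in Syracuse, she took on the issue of campaign financing.",
    "Her listeners there were wildly supportive as well.",
    "She spoke to a Rochester audience regarding the issue of gun control.",
    "But this time, stony silence greeted her.",
    "Her opponent, Bob Jones, had a much warmer reception when he addressed the topic in his earlier sweep through the town.",
]
print(match_pairs(corpus_c, corpus_d))
# [(0, 7), (1, 0), (2, 5), (3, 3), (4, 1)] -- C3 pairs with D6,
# C4 with D4, and so on; D3, D5, and D7 are left unmatched
```

Once such pairs are in hand, the parallel-corpus machinery described earlier can be run on them unchanged.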