We address the text-to-text generation problem of sentence-level paraphrasing --- a phenomenon distinct from and more difficult than word- or phrase-level paraphrasing. Our approach applies multiple-sequence alignment to sentences gathered from unannotated comparable corpora: it learns a set of paraphrasing patterns represented by word lattice pairs and automatically determines how to apply these patterns to rewrite new sentences. The results of our evaluation experiments show that the system derives accurate paraphrases, outperforming baseline systems.
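As a rough illustration of how alignment can expose a paraphrasing pattern, the sketch below aligns two made-up sentences and separates the words they share from the positions where they differ. It is only a simplified stand-in: it aligns a single pair rather than many sentences, it relies on Python's `difflib.SequenceMatcher` instead of the multiple-sequence alignment used in the paper, and the function name `align_pair` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def align_pair(a, b):
    """Align two tokenized sentences.

    Returns a list of (kind, tokens_a, tokens_b) tuples, where kind is
    "shared" for word runs present in both sentences and "slot" for
    regions where the sentences differ.
    """
    # autojunk=False disables difflib's frequency heuristic, which is
    # meant for long inputs and is not wanted for short sentences.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pattern = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        kind = "shared" if tag == "equal" else "slot"
        pattern.append((kind, a[i1:i2], b[j1:j2]))
    return pattern


# Hypothetical example sentences, not taken from the paper's corpus.
s1 = "a bomb exploded in Jerusalem killing 12 people".split()
s2 = "a bomb exploded in Tel Aviv killing 3 people".split()

pattern = align_pair(s1, s2)

# Shared word runs form the fixed backbone of the pattern.
backbone = [" ".join(x) for kind, x, _ in pattern if kind == "shared"]
# Differing regions behave like argument slots.
slots = [(x, y) for kind, x, y in pattern if kind == "slot"]

print(backbone)
print(slots)
```

In this toy setting the shared runs play the role of a pattern's fixed wording and the differing regions play the role of its variable slots; a lattice built from many aligned sentences would generalize the same idea.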
Press mentions
Regina Barzilay of MIT and Lillian Lee of Cornell University have developed a computer program that can automatically paraphrase English sentences: the program culls text from online news services on particular topics, determines distinguishing sentence patterns in these clusters, and employs these patterns to generate new sentences that convey the same message with different wording. Potential applications for such a tool include report summarization, document checking for repetition or plagiarism, and a way for authors to automatically rewrite their prose for readers of different backgrounds, which Lee describes as a "style dial." Kevin Knight of the University of Southern California remarks that the program may even help facilitate machine translation. Barzilay and Lee tested the program by having a computer categorize Agence France-Presse and Reuters articles according to subject, and then look for sentence clusters possessing similar words and phrases; the researchers used a genetic analysis technique to ascertain patterns within the sentence groupings, and these patterns were compiled through statistical testing. When the computer is given a new sentence and instructed to paraphrase it, the sentence is compared against the sentence pattern database and revised through substitution of stored phrases. The researchers explain that the program is relatively competent at rewriting news service text, which is typically spartan. "The writing the program used had to provide variation in wording, but not too much variation," Lee notes. Barzilay reports that the system performed well at paraphrasing short articles, but ran into difficulty with longer articles marked by more idiosyncratic prose.
@inproceedings{Barzilay+Lee:03a,
  author = {Regina Barzilay and Lillian Lee},
  title = {Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment},
  year = {2003},
  pages = {16--23},
  booktitle = {Proceedings of HLT-NAACL}
}
This paper is based upon work supported in part by the National Science Foundation under ITR/IM grant IIS-0081334 and a Sloan Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Sloan Foundation.