Tuesdays and Thursdays 1:25-2:40, Stimson G01 (Zoom link available on request)
This course covers selected advanced topics in natural language processing (NLP) and/or
information retrieval, with a conscious attempt to avoid topics covered by other Cornell courses.
Hence:
Students seeking a general introduction to NLP should take CS 4740 ("Introduction
to Natural Language Processing") or CS 4744 ("Computational Linguistics")
instead.
Students interested in language purely as an application domain for machine learning
should consider other courses instead: significant portions of CS6740/IS6300
will be devoted to modeling language phenomena formally in ways that (to date)
are not machine-learning oriented.
If you're looking for something other than lecture content and have JavaScript enabled, click on the appropriate tab above.
The tabs may take a little time to come up.
Prerequisites, enrollment, related classes
Prerequisites All of the following: CS 2110
or equivalent programming experience;
a course in artificial intelligence or any relevant subfield (e.g., machine learning, NLP, information retrieval,
Cornell CS courses numbered 47xx or 67xx);
proficiency in using machine learning tools
(e.g., fluency with training a classifier and assessing its performance using cross-validation).
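As a rough self-check on that last item, something along these lines (a generic scikit-learn sketch of my own, using a stand-in dataset, not anything tied to the course) should feel routine:

    # Generic self-check sketch: training a classifier and assessing it with
    # cross-validation at roughly this level of comfort is what's expected.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)        # stand-in dataset, not course data
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")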
Enrollment Enrollment is open on Student Center to PhD and MS students (although those who do not meet the prerequisites should not take this class).
Other students interested in gaining permission to enroll: please contact
Prof. Lee after lecture
on Tuesday, September 3rd. (Before that date, I won't have enough information
on the number of students to be able to make enrollment allowances.)
Try to attend the first two lectures if you can, but if you are shopping other
courses meeting at the same time, it's OK to miss one or both of the first
two CS6740 lecture times. You will be responsible for making up the material
on your own, but some form of notes or slides will be posted.
Auditing is an option for those permitted to enroll: the only requirement
is to sign up on Student Center for the "Audit" option as Grade Basis, and
there is no coursework or attendance requirement to earn the audit credit.
Students already actively engaged in thesis research should thus choose the
"Audit" grade basis.
Remote attendance is possible; please contact me for a Zoom link
(contact information listed on the "Administrative info" tab).
Formal models of language, parsing complexity: Tree-adjoining grammar, and perhaps also combinatory categorial grammar
Joshi, Aravind K., Leon S. Levy, and Masako Takahashi. Tree adjunct grammar. Journal of Computer and System Sciences 10(1):136–163 (1975).
Joshi, Aravind K., K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In: Wasow, T., Sells, P. and Shieber, S. (eds.), Foundational Issues in Natural Language Processing. [Technical report version]
CMS page: https://cmsx.cs.cornell.edu.
Site for submitting assignments, unless otherwise noted.
You may find this graphically-oriented guide to common operations useful: see how to replace a prior submission (point 1), how to tell if CMS successfully received your files (point 2), how to form a group (point 4).
Office hours and contact info
See Prof. Lee's homepage and scroll to the section on "Contact and availability info".
Coursework
One administrative-info fill-in assignment
Roughly 6 assignments (about 1 per topic) probably involving some implementation
Potentially some in-class presentations/discussion, possibly but not necessarily in conjunction with the
assignments
Possibly an (in-class or take-home) preliminary exam, depending on whether there seems to be a need for such assessment
Take-home final exam (see lecture schedule for due date)
Resources
Cornell's Passkey
for your web browser: "If you find yourself on a web page that has access
restrictions, click on the bookmarklet icon and you will be redirected to
the Cornell Web log-in screen to check for your valid Cornell affiliation.
You will be automatically led to the page you were trying to read, this
time recognized for your right to gain access to the library's licensed
resources."
#2 Sep 3: Motivation for Tree Adjoining Grammars: introduction to sentential structure
Assignments/announcements
Those who wish to enroll but need a PIN: please email Prof. Lee with your name and netID by noon on Thursday if you can (by Tuesday evening is preferable)
#3 Sep 5: CFGs and long-distance dependencies; tree substitution grammars as a way to lexicalize CFGs
Assignments/announcements
Everyone (including auditors and those not yet enrolled): please complete the CS 6740 "administrative matters" quiz on CMS, https://cmsx.cs.cornell.edu, deadline Mon Sept 9, 11:59pm. Enrollment permissions will be decided in
part by the information furnished as quiz answers.
So, being on CMS does not mean you have been enrolled in the class!
If you don't see "CS 6740" when you log in to CMS or can't log in, please
email Prof. Lee with your name and netID.
Section 8 "Linguistic relevance" of Aravind K. Joshi and Yves Schabes, 1996, "Tree adjoining grammars", which is chapter 2 of Handbook of Formal Languages: Vol 3, Beyond Words, ed. G. Rozenberg and A. Salomaa. (link requires logging in with your Cornell NetID)
Tentative sketch of first "real" assignment, due sometime between Sep 19 and 24: spend X hours (where I will specify X) implementing a representation of tree-adjoining grammars, allowing one to specify a TAG (that is, you should not hard-code a specific TAG), and, given a partial derivation tree (which you'll need to represent) and an elementary tree, determine whether the elementary tree can legally be substituted or adjoined into the corresponding derived tree. Write a description of your ideas and any challenges you faced. Be prepared to discuss your efforts in class.
You may not arrive at a really functional implementation; I'm just looking for a good-faith effort.
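To make the flavor concrete, here is one possible (purely illustrative) Python starting point for representing elementary trees and checking where substitution or adjunction is allowed in a derived tree. The names and design are not required, derivation-tree bookkeeping and adjoining constraints are omitted, and you are free to structure things entirely differently:

    # Illustrative sketch only, not a required interface: nodes carry a label
    # plus flags for substitution sites (NP↓) and foot nodes (NP*); an
    # elementary tree is either initial or auxiliary.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        label: str                      # e.g., "S", "NP", "VP", or a lexical anchor
        children: List["Node"] = field(default_factory=list)
        subst_site: bool = False        # frontier node marked for substitution
        foot: bool = False              # foot node of an auxiliary tree

    @dataclass
    class ElementaryTree:
        name: str
        root: Node
        auxiliary: bool = False         # True => has a foot node labeled like the root

    def frontier(node: Node) -> List[Node]:
        """Leaves of a (derived) tree, left to right."""
        if not node.children:
            return [node]
        leaves: List[Node] = []
        for child in node.children:
            leaves.extend(frontier(child))
        return leaves

    def interior(node: Node) -> List[Node]:
        """Non-leaf nodes of a (derived) tree."""
        nodes = [node] if node.children else []
        for child in node.children:
            nodes.extend(interior(child))
        return nodes

    def can_substitute(tree: ElementaryTree, derived: Node) -> bool:
        """An initial tree may substitute at a frontier substitution site
        whose label matches the tree's root label."""
        if tree.auxiliary:
            return False
        return any(n.subst_site and n.label == tree.root.label for n in frontier(derived))

    def can_adjoin(tree: ElementaryTree, derived: Node) -> bool:
        """An auxiliary tree may adjoin at an interior node whose label matches
        its root/foot label (adjoining constraints are ignored here)."""
        if not tree.auxiliary:
            return False
        return any(n.label == tree.root.label for n in interior(derived))

A fuller version would also maintain the derivation tree itself, i.e., a record of which elementary tree was attached, by which operation, at which node of which other tree.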
#4 Sep 10: Tree grammars: tree substitution grammars and tree adjoining grammars
Assignments/announcements
Assignment 1 is due September 19 12:00 P.M. (noon), but you can continue resubmitting on CMS
(Lillian will set up CMS by the night of September 11th)
until noon Monday the 23rd. You should spend a minimum of 10 hours and a maximum of
13 hours coding by the September 19 deadline; you're not obligated to do any more
coding after that. Along with a zip file of your code,
submit an informal writeup (PDF) describing your design decisions.
We'll discuss our experiences together during the Sep 24th lecture.
Please work by yourselves until the September 19th deadline; after that I'll open
up some sort of discussion site to allow for collaboration.
Geoff Pullum, 1986. Footloose and context-free. Natural Language and Linguistic Theory 4, pp. 283--289. Reprinted in The Great Eskimo Vocabulary Hoax, U. of Chicago Press, 1991.
Stuart Shieber, 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333-343.
#6 Sep 17: More linguistic modeling with TAGs: modeling feature constraints
Anne Abeillé and Yves Schabes, 1989. Parsing idioms in lexicalized TAGs. Fourth Conference of the European Chapter of the Association for Computational Linguistics (EACL '89).
Assignment 1 addendum: post to CampusWire (join code given in class) some short description
of and/or motivation for your test cases for assignment 1.
Optional but encouraged: mention any questions you have for me or your fellow students.
Reading for next lecture or two: Mark Steedman (draft of November 1, 1996), A Very Short Introduction to CCG.
Also, skim Steedman's 2018 lifetime achievement award address, The Lost Combinator,
printed in Computational Linguistics 44(4).
You may find section 6, "CCG in the age of deep learning", an interesting reflection.
I am tentatively planning a small CCG-based assignment to be released either next Tuesday or next Thursday (on which, recall, there is no lecture). You would have a week to complete it once it is released.
Eigner, Fabienne Sophie. 2007. Section 2.1 of Combinatory Categorial Grammar contains a description of a CCG for
the copy language. [slides (in German)]
#13 Oct 10: Concluding discussion on syntactic (and a bit of semantic) modeling
Assignments/announcements
A2 due tomorrow at noon.
Reading for next time (light stuff for Fall Break ...):
Alon Halevy, Fernando Pereira, Peter Norvig. 2009.
The unreasonable effectiveness of data.
IEEE Intelligent Systems 24:8--12. (The usage of the word "deep"
reads ironically these days.)
For the "hit the nail on the head" origins: Dmitrij Dobrovol’skij and Elisabeth Piirainen. 2010. Idioms: Motivation and etymology. Yearbook of Phraseology 1(1):73-96.
Oct 15: No class — Fall Break
#14 Oct 17: The dataset landscape:
today and how we got here.
Tentative plan for third assignment: try out "inoculating by fine-tuning",
perhaps in a domain of your own choice. Time span: a week or a week and a half after the assignment is formalized (probably next Thursday)
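To make the plan concrete, here is a rough sketch (mine, not the eventual assignment spec) of the inoculation-by-fine-tuning protocol: train on the original data, continue training on a small sample of challenge-set examples, and compare accuracy on the original vs. challenge test sets. A linear classifier stands in for whatever model you would actually fine-tune, and all array names are placeholders:

    # Hypothetical sketch of the inoculation-by-fine-tuning protocol; names and
    # setup are placeholders.  All X_*/y_* arguments are assumed to be NumPy
    # feature matrices / label vectors; partial_fit stands in for fine-tuning.
    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import accuracy_score

    def inoculate(X_orig_train, y_orig_train, X_chal_train, y_chal_train,
                  X_orig_test, y_orig_test, X_chal_test, y_chal_test,
                  n_inoculation_examples=50, n_passes=5, seed=0):
        rng = np.random.RandomState(seed)
        classes = np.unique(np.concatenate([y_orig_train, y_chal_train]))

        clf = SGDClassifier(random_state=seed)
        clf.partial_fit(X_orig_train, y_orig_train, classes=classes)  # train on original data

        # "Inoculate": a few passes over a small sample of challenge examples.
        idx = rng.choice(len(X_chal_train), size=n_inoculation_examples, replace=False)
        for _ in range(n_passes):
            clf.partial_fit(X_chal_train[idx], y_chal_train[idx])

        # The interesting comparison: does challenge performance improve, and
        # does original-test performance hold up?
        return {"original_test_acc": accuracy_score(y_orig_test, clf.predict(X_orig_test)),
                "challenge_test_acc": accuracy_score(y_chal_test, clf.predict(X_chal_test))}

In the actual protocol you would typically sweep the number of inoculation examples and track both accuracies as a function of that number.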
What is the purpose of data? One reason is evaluation, as the Penn Treebank paper said. I mention the PTB because in the "old days", it was in some sense the canonical dataset. Here's an example results table (EMNLP 2011):
Recall the landscape: new data introduced, then "solved" or "broken".
There are now even papers about algorithms that are meant to
withstand
certain kinds of "breaks" in datasets, e.g.,
Robin Jia, Aditi Raghunathan, Kerem Göksel, Percy Liang, 2019. Certified robustness to adversarial word substitutions. EMNLP.
Demos for two NLP tasks (we can try to break the algorithm in class)
Sentiment analysis demo at AllenNLP. This task is considered in the "Build it Break it" data, and so will probably be an option for A3, since there are "regular" training data and "challenge" test sets.
Textual Entailment
demo at AllenNLP. A simple version of this task is considered in the "breaking" Levy et al. 2015 paper.
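For intuition about the kind of "break" at issue, here is a toy sketch of mine (the naive substitution attack itself, not the certified-defense method from the Jia et al. paper): enumerate near-synonym substitutions of a sentence and report any variant on which a classifier's prediction flips.

    # Toy sketch of a word-substitution "break"; `predict` is assumed to be any
    # function mapping a sentence string to a label.  The synonym table is a
    # hand-made placeholder, not from any of the papers above.
    from itertools import product
    from typing import Callable, Dict, List

    SYNONYMS: Dict[str, List[str]] = {
        "great": ["fine", "decent"],
        "movie": ["film", "picture"],
        "terrible": ["awful", "dreadful"],
    }

    def variants(sentence: str) -> List[str]:
        """All sentences obtainable by optionally swapping words for listed near-synonyms."""
        options = [[w] + SYNONYMS.get(w.lower(), []) for w in sentence.split()]
        return [" ".join(choice) for choice in product(*options)]

    def find_breaks(sentence: str, predict: Callable[[str], str]) -> List[str]:
        """Return substituted variants on which the classifier's prediction flips."""
        original = predict(sentence)
        return [v for v in variants(sentence) if v != sentence and predict(v) != original]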
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira,
Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning 79(1-2):151-175.
If you cannot attend the colloquium, please see the video, which can
be accessed via NetID login here, and which should be posted within a few days after the talk.
Speaker abstract, and bio: It is common to hear that certain natural language processing (NLP) tasks have been "solved". These claims are often misconstrued as being about general human capabilities (e.g., to answer questions, to reason with language), but they are always actually about how systems performed on narrowly defined evaluations. Recently, adversarial testing methods have begun to expose just how narrow many of these successes are. This is extremely productive, but we should insist that these evaluations be *fair*. Has the model been shown data sufficient to support the kind of generalization we are asking of it? Unless we can say "yes" with complete certainty, we can't be sure whether a failed evaluation traces to a model limitation or a data limitation that no model could overcome. In this talk, I will present a formally precise, widely applicable notion of fairness in this sense. I will then apply these ideas to natural language inference by constructing challenging but provably fair artificial datasets and showing that standard neural models fail to generalize in the required ways; only task-specific models are able to achieve high performance, and even these models do not solve the task perfectly. I'll close with discussion of what properties I suspect general-purpose architectures will need to have to truly solve deep semantic tasks.
(joint work with Atticus Geiger, Stanford Linguistics) Bio: Christopher Potts is Professor of Linguistics and, by courtesy, of Computer Science, at Stanford, and Director of the Center for the Study of Language and Information (CSLI) at Stanford. In his research, he develops computational models of linguistic reasoning, emotional expression, and dialogue. He is the author of the 2005 book The Logic of Conventional Implicatures as well as numerous scholarly papers in linguistics and natural language processing.
#23 Nov 19: Evaluation by/of textual inference (aka entailment)
Zaenen, Annie, Lauri Karttunen, and Richard Crouch. 2005. Local textual inference: Can it be defined or circumscribed? In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, 31–36. Ann Arbor, Michigan: Association for Computational Linguistics.
#24 Nov 21: No class — LL traveling to NDS Symposium in NY
#25 Nov 26: Explicit semantic representations; intro to AMR
Assignments/announcements
Final exam: take-home, to be worked on individually, released Tuesday Dec 10th (watch your mail).
No class Tuesday Dec 10th (ACL deadline recovery)
Tentative plan for assignment A4: light; released sometime Tuesday Dec 3, due Dec 10th.
Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 178–186. Sofia, Bulgaria: Association for Computational Linguistics.
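If AMR is new to you, the commonly used "The boy wants to go" example gives the flavor; here it is in PENMAN notation, wrapped in a Python string purely for display (no AMR-specific library assumed):

    # A commonly used introductory AMR, "The boy wants to go", in PENMAN notation
    # (stored as a plain string here).  Note the reentrancy: variable b is the
    # ARG0 of both want-01 and go-01.
    AMR_BOY_WANTS_TO_GO = """\
    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (g / go-01
                :ARG0 b))"""
    print(AMR_BOY_WANTS_TO_GO)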
#27 Dec 5: AMR parsing: Zhang, Ma, Duh and van Durme, EMNLP 2019
Assignments/announcements
Final take-home due date of Thursday Dec 19, 4:30pm.
A4 due time moved to Tuesday, Dec 10, 11:59 PM (extra 12 hours), with the usual lecture time on the 10th converted to optional office hours, in the usual classroom.
Groschwitz, Jonas, Matthias Lindemann, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2018. AMR dependency parsing with a typed semantic algebra. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1831–1841. Melbourne, Australia: Association for Computational Linguistics.