CS 4740: Natural Language Processing (Spring 2025, undergraduate)

Cross-listed as COGST 4740 / LING 4474 / CS 5740

Instructors: Claire Cardie, Tanya Goyal

TAs: Wayne Chen, Son Tran, Oliver Li, Tejal Nair, Emily Wang, Cory Phillips, Alkim Arguz, Pun Chaixanien, Majd Aldaye, Brandon Li, Yuqing Wu, Ellie Dawson, Sean Cavalieri, Carter Larsen

Course Administrative Assistant: Amy Finch Elser (email: ahf42@cornell.edu)

Lecture: MW 1.25 p.m. - 2.40 p.m., Olin Hall 155

Office Hours and communication with staff:

Instructor Office Hours: Claire Cardie (11 a.m. - 12 noon Monday, Gates 417), Tanya Goyal (11 a.m. - 12 noon Friday, Gates 441A), or by appointment (emails: cardie@cs.cornell.edu, tanyagoyal@cornell.edu)

TA Office Hours (starting Jan 27, 2025):
- Monday: Oliver Li (10 a.m. - 11 a.m.; Rhodes 408)
- Tuesday: Alkim Arguz (10 a.m. - 11 a.m.; Rhodes 408), Sean Cavalieri (12 noon - 1 p.m.; Rhodes 574)
- Wednesday: Carter Larsen (5 p.m. - 6 p.m.; Rhodes 400), Tejal Nair (6 p.m. - 7 p.m.; Rhodes 408)
- Thursday: Pun Chaixanien (11.35 a.m. - 12.35 p.m.; Rhodes 406), Majd Aldaye (1.30 p.m. - 2.30 p.m.; Rhodes 402), Ellie Dawson (4.30 p.m. - 5.30 p.m.; Rhodes 406), Cory Phillips (5.30 p.m. - 6.30 p.m.; Rhodes 408)
- Friday: Brandon Li (5 p.m. - 6 p.m.; Rhodes 408)
- Sunday: Wayne Chen (10 a.m. - 11 a.m.; Rhodes 408)

[Link to course syllabus and schedule here]

Description

This course is an introduction to modern natural language processing (NLP). Today, NLP is at the heart of many exciting technologies, including the widely popular large language models (LLMs) like ChatGPT and Claude. The first half of this course will cover core problems in traditional NLP, such as sequence tagging (for example, assigning part-of-speech tags to the words of a sentence or identifying named entities in text), using approaches like hidden Markov models and early neural methods for NLP like recurrent neural networks (RNNs). The second half of this course will lay the foundation for understanding how frontier LLMs like ChatGPT are built. We will cover the transformer model architectures that form the backbone of these models, the training recipes used, and other ingredients that contribute to building these powerful systems. We will also cover topics related to factuality, retrieval-augmented language models, efficiency, etc. At the end of this course, students should be in a position to understand and critique recent research in NLP.

Skip to the relevant section of this page:

Resources (Schedule, Textbooks, Assignments, etc.)

Prerequisites

Course Policies
- Grading
- Late Submission Policy
- Policy on use of Generative AI
- S/U enrollment and Auditing

SDS accommodations

Resources

Schedule: The course schedule (including lecture materials and assignments) is provided here.

Textbooks: We will follow Jurafsky and Martin, Speech and Language Processing, 3rd edition (draft). A free online version is available here.

Assignments: You will submit assignments using Gradescope. For the coding parts of the assignments, you will use Colab.

Prerequisites

Strong programming skills are important. Three semesters of programming classes are strongly recommended (e.g., completion of CS3110). CS2110 may suffice if you could have completed the assignments successfully and easily on your own.

Python experience. PyTorch experience (e.g., through CS4780) is not required, but some students report it being very helpful.

Comfort with elementary probability.

Clear understanding of matrix and vector operations.

Familiarity with differentiation.

You will be asked to complete HW0 (see the schedule page) to test these prerequisites. If you find yourself struggling with HW0, please talk to the course staff to discuss if this course is appropriate for you.

Course Policies

Note: We reserve the right to make necessary changes to any policy on this page if keeping it unchanged would jeopardize the smooth running of the course. We aim to avoid making alterations, and will try to be as transparent as possible about key changes, e.g., by posting to Ed Discussions.

Grading:

Assignments (67%): This course will consist of 1 review assignment and 4 programming assignments (with possible milestones).
1. Review assignment / HW0 (3%): This is designed to test whether you have the necessary prerequisite knowledge for this course. You must do this assignment individually, and it should take you less than 2.5 hours to complete. You are required to submit this assignment, although we will not check its correctness. All students who turn in a reasonable submission will get the full grade on this. Importantly, if you have difficulty completing any part of this assignment, please contact the course staff to discuss whether this course is appropriate for you.
2. Full homework assignments (4 assignments, 16% each): These will be primarily coding assignments, sometimes including non-coding components. We expect these to take tens of hours each. You can do these assignments individually or in groups of 2; all students in a group will receive the same grade. We strongly encourage you to do these assignments in a group.
  (Only CS 5740) Students enrolled in CS5740 must complete an additional component for each 4740 homework individually. These components are graded as "satisfactory", "borderline", or "unsatisfactory". If a student receives two "borderline"s or one "unsatisfactory" among the four homeworks, we reserve the right to lower the student's letter grade as computed for 4740 by the equivalent of a "level", for example, from a B to a B-.

Exams (midterm 16%, final 16%): These will test individual conceptual knowledge. ~~To receive a C- or above in the course, students must receive at least a C- on both exams.~~ [03/17/2024]: Course policy changed with respect to exams. Please see #417 on Ed.

Course evaluation (1%): We will assign 1% of the grade for filling out the course evaluation form.
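
As a quick sanity check on the weights above, here is a minimal Python sketch that combines component scores into a final percentage. The weights come from this section; the function name `weighted_total`, the component keys, and the example scores are hypothetical and only for illustration. This page does not specify how the total maps to a letter grade, so the sketch stops at the numeric total.

```python
# Weights as listed in the Grading section (percent of the final grade).
WEIGHTS = {
    "hw0": 3,
    "hw1": 16,
    "hw2": 16,
    "hw3": 16,
    "hw4": 16,
    "midterm": 16,
    "final": 16,
    "course_evaluation": 1,
}

# The listed components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def weighted_total(scores):
    """Combine per-component scores (each on a 0..100 scale) into a 0..100 total.

    Components missing from `scores` are counted as 0.
    """
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS) / 100.0


# Hypothetical scores, for illustration only.
example_scores = {
    "hw0": 100, "hw1": 90, "hw2": 85, "hw3": 95, "hw4": 80,
    "midterm": 75, "final": 88, "course_evaluation": 100,
}
print(weighted_total(example_scores))  # 86.08
```

The sketch ignores late penalties and the CS5740 add-on components, which are handled separately under the policies on this page.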

Collaboration policies:

Groups of two are allowed on all assignments except the review assignment (HW0) and the CS5740 add-on assignments. You can partner with anyone in the class (irrespective of registration in the undergraduate or graduate level of the course, or letter grade or S/U enrollment), but please discuss time and effort expectations with your prospective partner.

You do not have to have the same partner on each assignment.

Until all students’ submissions have been posted for the assignments, you must never consult any other group’s written submission or code in any form. You are allowed to discuss conceptual doubts (e.g., how does the Viterbi algorithm work?), but all the code and written reports you submit must be your own (or your group’s). Additionally, you must not copy-paste solutions from external sources like Stack Overflow or ChatGPT. Using these resources for debugging is allowed.

Please refer to the policy on the use of Generative AI (e.g. ChatGPT, Copilot, Claude, etc.).

Late Submission Policy:

Each student is given 5 slip days to use throughout the course for the homeworks. You are allowed to use a maximum of 2 slip days per homework assignment. An example use of slip days is: 2 slip days for HW1, 1 slip day for HW2, 2 slip days for HW3, and none for HW4.

After the slip days are exhausted, you will incur a penalty of 10% on your grade for each additional day beyond the deadline for the assignment, up to 2 days; we will not accept the assignment after that. We will count slip days for each student individually. If you are working in a group, members may incur different penalties for late submission depending on their individual slip day balance. Please coordinate appropriately with your partner.

You do not need to use slip days if you are sick or have other extenuating circumstances. Please email the instructors in these cases.
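
The slip-day arithmetic can be sketched in a few lines of Python. This is one possible reading of the policy, not an official calculator. The constants come from the text above, while the function name `late_outcome` and two details are assumptions not stated on this page: that slip days are applied before any penalty, and that the 10% per day is deducted as a fraction of the assignment score. When in doubt, ask the course staff.

```python
MAX_SLIP_DAYS_PER_HW = 2   # from the policy: at most 2 slip days per homework
PENALTY_PER_DAY = 0.10     # from the policy: 10% per additional late day
MAX_PENALTY_DAYS = 2       # from the policy: penalized days are capped at 2


def late_outcome(days_late, slip_balance):
    """Return (slip_days_used, penalty_fraction, accepted) for one homework.

    Assumption (not stated on the page): slip days are applied first, and the
    per-day penalty covers only the late days that slip days do not cover.
    """
    if days_late < 0 or slip_balance < 0:
        raise ValueError("days_late and slip_balance must be non-negative")
    slip_used = min(days_late, MAX_SLIP_DAYS_PER_HW, slip_balance)
    penalty_days = days_late - slip_used
    if penalty_days > MAX_PENALTY_DAYS:
        # Too late to be accepted, so no slip days are spent.
        return 0, 0.0, False
    return slip_used, PENALTY_PER_DAY * penalty_days, True


# Replay the example from the policy: 2, 1, 2, and 0 days late on HW1..HW4.
balance = 5
for days_late in [2, 1, 2, 0]:
    used, penalty, accepted = late_outcome(days_late, balance)
    balance -= used
print(balance)  # 0: all five slip days are spent and no penalty is incurred
```

Because slip days are tracked per student, two partners submitting the same late homework would each call this with their own balance and may end up with different penalties.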

Accommodations for Students with Disabilities:

Your access in this course is important to us. Please give us [the instructors, the administrative assistant] your Student Disability Services (SDS) accommodation letter early in the semester so that we have adequate time to arrange your approved academic accommodations. If you need an immediate accommodation, please send an email message to the instructors (cardie@cs.cornell.edu, tanyagoyal@cornell.edu), administrative assistant (Amy Finch Elser; email: ahf42@cornell.edu) and SDS at sds_cu@cornell.edu.

Other misc items:

Enrollment options other than letter grades:
1. S/U enrollment: We will grade all your assignments and exams as if you were taking this course for a letter grade. We will convert this final letter grade to S/U using this policy: C- or better → S, D+ or below → U.
2. Auditing: Sitting in on lectures is fine as long as there are enough physical seats. Students auditing the course, either officially in Student Center or unofficially ("just sitting in"), should not submit any work, partner with officially registered non-auditors, take any exams, or join office hours if the lines are long (we need to conserve our grading and staff resources).
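
The S/U conversion rule is a simple threshold on the letter grade, as the following small Python sketch shows. The cutoff (C- or better is S, D+ or below is U) comes from the policy above. The full grade scale in `LETTER_GRADES` and the function name `to_su` are assumptions for illustration, since this page does not list the scale.

```python
# Assumed letter-grade scale, ordered from best to worst (not listed on this page).
LETTER_GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]

# From the policy: C- is the lowest grade that converts to S.
CUTOFF_INDEX = LETTER_GRADES.index("C-")


def to_su(letter_grade):
    """Convert a final letter grade to "S" or "U" using the C- cutoff."""
    rank = LETTER_GRADES.index(letter_grade)  # raises ValueError for unknown grades
    return "S" if rank <= CUTOFF_INDEX else "U"


print(to_su("C-"), to_su("D+"))  # S U
```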

Policy on the use of generative AI (e.g. ChatGPT, Copilot, Claude, etc.): We will release assignment-specific generative AI policies with each homework assignment. In general, remember that the goal of this course is to familiarize yourself with the basic building blocks of NLP systems. Therefore, you should never use ChatGPT or other generative AI engines to obtain the first drafts of your code or written responses. If you do use these tools, say, to debug your code, you should explicitly state your use of them in your submitted work.