Time and place TuTh 2:55pm-4:10pm, Uris Hall G01
In-Class Midterm Tuesday March 10, 2:55pm, Uris Hall G01
Instructor: Prof. Cristian Danescu-Niculescu-Mizil --- Office hours: Fr 4:00-5:00pm, by appointment only
PhD TAs: , ,
Graduate TA:
Undergrad TAs listed on Piazza
Office hours schedule Google calendar (Cornell access, check for updates)
Course homepage http://www.cs.cornell.edu/Courses/cs4300/2020sp/
Summary How can we make sense of the vast amounts of information available online, and how can we relate it to the social context in which it appears? This course introduces basic tools for retrieving and analyzing unstructured textual information. Applications include information retrieval (with human feedback), sentiment analysis, and social analysis of text. The coursework will include programming projects that play on the interaction between knowledge and social factors.
Prerequisites: Linear algebra and discrete math: INFO 2950 or (MATH 2940 and CS 2800); Programming proficiency: CS 2110 or equivalent and good Python skills.
Date | Lecture | Agenda | Assignments | ||
---|---|---|---|---|---|
Tu, Jan 21, 2020 |
#1 |
Intro: Dimensions of Information Systems Conversational Behavior and Social information Related material: Linguistic Coordination Toolkit NPR Story: Before The Internet, Librarians Would 'Answer Everything' — And Still Do Google duplex example and writeup in The Verge -- References: Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang and Jon Kleinberg. Echoes of power: Language effects and power differences in social interaction. Cristian Danescu-Niculescu-Mizil, Michael Gamon and Susan Dumais. Mark my words! Linguistic style accommodation in social media. Proceedings of WWW, 2011. Kate G. Niederhoffer and James W. Pennebaker. Linguistic Style Matching in Social Interaction. Journal of Language and Social Psychology 2002 21: 337. Filip Radlinski and Nick Craswell. A Theoretical Framework for Conversational Search. Proceedings of CHIIR 2017. |
Setup Quiz out (on CMS) Assignment 0 out (on CMS) |
||
Th, Jan 23, 2020 |
#2 |
Text similarity measures: Minimum edit distance Edit Distance worksheet (includes sketch of the Wagner-Fischer algorithm we used in class) Related material Readings: J&M Chapter 3.11 |
Assignment 1 out (on CMS) |
||
Tu, Jan 28, 2020 |
#3 |
Basic text processing concepts: Sentence Splitting, Word Tokenization, Types, Tokens Text similarity measures: Type Overlap, Jaccard similarity Classic (ad hoc) information retrieval systems Vector space model cheatsheet (useful to keep track of notation) In-class demo: Proto Information Retrieval System: IPython notebook and html Related material: Readings: J&M Chapters 3.8 and 23.1.1 |
|||
Th, Jan 30, 2020 |
#4 |
Vector Space Model Dot product similarity, Cosine similarity, Geometric intuition Inverse document frequency (IDF) TF-IDF weighting In-class demo: (continued and updated) IPython notebook and html Readings: MRS Chapters 6.2, 6.3, 6.4.1 and 6.4.4 |
Assignment 2 out (on CMS) |
||
Tu, Feb 4, 2020 |
#5 |
Term-document matrix Efficient retrieval Inverted Index Postings merge algorithm Boolean search In-class demo: (continued and updated) IPython notebook and html Postings merge quiz (includes sketch of the algorithm we used in class) Related Material: Readings: MRS Chapter 1 |
|||
Th, Feb 6, 2020 |
#6 |
Efficient cosine similarity scoring using the inverted index (algorithm) Fast cosine retrieval worksheet (includes sketch of the algorithm using the inverted index) Related Material: Inspiration for Assignment 3: QUOTUS project and interactive visualization Readings: MRS Chapter 6.3.3 |
Assignment 3 out (on CMS) |
||
Tu, Feb 11, 2020 |
#7 |
Efficient cosine similarity scoring using the inverted index (implementation) In-class demo: (continued and updated) IPython notebook and html Timing comparison before and after optimizing retrieval with inverted indexes (one query on a collection of 40,000 reality TV utterances): [timing figures not reproduced] |
|||
Th, Feb 13, 2020 |
#8 |
Evaluation of ranked retrieval systems: Intuition, Precision, Recall and F1 Thinking about evaluation metrics worksheet Readings: MRS Chapter 8 |
Assignment 4 out (on CMS) |
||
Tu, Feb 18, 2020 |
#9 |
Evaluation of ranked retrieval systems: Precision@K, Recall@K, Precision-Recall Plot, Mean Average Precision, Discounted Cumulative Gain In-class demo: IPython notebook and html Readings: MRS Chapter 8 |
|||
Th, Feb 20, 2020 |
#10 |
Relevance feedback, Rocchio's method for query rewriting, Pseudo-relevance feedback Annotation: Pooling, Kappa statistic Query update using relevance feedback worksheet (includes the Rocchio query update rule) Related material Readings: MRS Chapters 8 and 9 |
|||
Tu, Feb 25, 2020 |
FEBRUARY BREAK |
||||
Th, Feb 27, 2020 |
#11 |
Query expansion, Co-occurrence matrix, Pointwise Mutual Information Scikit-learn basics In-class demo: IPython notebook and html Readings: MRS Chapter 9 |
Assignment 5 out (on CMS) |
||
Tu, Mar 3, 2020 |
#12 |
Wrapping up Ad-hoc IR, Midterm practice |
|||
Th, Mar 5, 2020 |
#13 |
Project discussion and brainstorming session |
|||
Tu, Mar 10, 2020 |
#14 |
MIDTERM - in class |
|||
Th, Mar 12, 2020 |
#15 |
MIDTERM discussion |
|||
Tu, Apr 7, 2020 |
#16 |
Lecture topics: Text mining, Classifiers, Feature Representation |
|||
Th, Apr 9, 2020 |
#17 |
Lecture topics: Bernoulli Naive Bayes, Smoothing |
|||
Tu, Apr 14, 2020 |
#18 |
Lecture topics: Multinomial Naive Bayes, Generative Models, Linear Classifiers |
|||
Th, Apr 16, 2020 |
#19 |
Lecture topics: Practical unsupervised learning on textual data: Singular Value Decomposition (SVD) In-class demo: IPython notebook Related material: Indexing by latent semantic analysis. Deerwester, Dumais, Furnas, Landauer and Harshman, 1990 |
|||
Tu, Apr 21, 2020 |
#20 |
Project Prototype Madness |
|||
Th, Apr 23, 2020 |
#21 |
Project Prototype Madness |
|||
Tu, Apr 28, 2020 |
#22 |
Lecture topics: Opinions and Trust: Link Analysis, Hubs and Authorities, Spectral Analysis Related material: NetworkX Python package for link analysis |
|||
Th, Apr 30, 2020 |
#23 |
Opinions and Trust: Sentiment analysis, Opinion mining, Helpfulness, Credibility |
|||
Tu, May 5, 2020 |
#24 |
Project presentations |
| |
| |
| |
Th, May 7, 2020 |
#25 |
Project presentations |
| |
| |
| |
Tu, May 12, 2020 |
#26 |
Misinformation and Anti-Social behavior |
| |
|