# Info/CS 4300: Language and Information - in-class demo


## Sentiment analysis 
### Building lexicons tailored to a domain for which we don't have sentiment labels

In [326]:
%matplotlib inline

from __future__ import print_function
import json
from operator import itemgetter
from collections import defaultdict

from matplotlib import pyplot as plt
import numpy as np

from nltk.tokenize import TreebankWordTokenizer
from nltk import FreqDist,pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files
from sklearn.naive_bayes import MultinomialNB

tokenizer = TreebankWordTokenizer()


Using the movie review data, but this time we will not use the sentiment labels (we will pretend we don't have labels).

In [327]:
## loading movie review data: 
## http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
data = load_files('txt_sentoken')
print(data.data[0])

arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . 
it's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? 
once again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . 
in this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . 
with the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! 
parts of this are actually so absurd , that they would fit right in with dogma . 
yes , the film is that wea

In [328]:
## building the term documnet matrix
vec = CountVectorizer(min_df = 50)
X = vec.fit_transform(data.data)
terms = vec.get_feature_names()
len(terms)

2153

In [329]:
# PMI type measure via matrix multiplication
def getcollocations_matrix(X):
    XX=X.T.dot(X)  ## multiply X with it's transpose to get number docs in which both w1 (row) and w2 (column) occur
    term_freqs = np.asarray(X.sum(axis=0)) ## number of docs in which a word occurs
    pmi = XX.toarray() * 1.0  ## Casting to float, making it an array to use simple operations
    pmi /= term_freqs.T ## dividing by the number of documents in which w1 occurs
    pmi /= term_freqs  ## dividing by the number of documents in which w2 occurs
    
    return pmi  # this is not technically PMI beacuse we are ignoring some normalization factor and not taking the log 
                # but it's sufficient for ranking

In [330]:
pmi_matrix = getcollocations_matrix(X)
pmi_matrix.shape 

(2153, 2153)

In [331]:
def getcollocations(w,PMI_MATRIX=pmi_matrix,TERMS=terms):
    if w not in TERMS:
        return []
    idx = TERMS.index(w)
    col = PMI_MATRIX[:,idx].ravel().tolist()
    return sorted([(TERMS[i],val) for i,val in enumerate(col)],key=itemgetter(1),reverse=True)

In [332]:
getcollocations("good")

[(u'good', 0.0012711337380982813),
 (u'trek', 0.0010038914000850665),
 (u'sean', 0.0009922470727116103),
 (u'nudity', 0.0009374840201587473),
 (u'nicely', 0.0009268742752181751),
 (u'trash', 0.0009217014608968155),
 (u'showed', 0.000916850400576306),
 (u'compared', 0.00091151987499156),
 (u'fairly', 0.0008716089901959017),
 (u'comparison', 0.0008698557537213697),
 (u'laughed', 0.0008665639627895953),
 (u'crap', 0.0008473706979212659),
 (u'pulp', 0.0008450365730278281),
 (u'parts', 0.0008435572066033899),
 (u'fifteen', 0.0008424927416009955),
 (u'sorry', 0.0008413817621615216),
 (u'pretty', 0.0008334590198961828),
 (u'nights', 0.0008333717375608706),
 (u'chris', 0.000833301911692621),
 (u'doctor', 0.0008330167404996009),
 (u'rating', 0.0008322781072402701),
 (u'average', 0.0008295313148071339),
 (u'forward', 0.0008295313148071339),
 (u'watched', 0.0008295313148071339),
 (u'cool', 0.0008275372491465399),
 (u'stupid', 0.0008213343650560753),
 (u'sadly', 0.0008174507616788748),
 (u'matt', 

In [333]:

sorted(sentscores.items(),key=itemgetter(1),reverse=False)

[(u'worst', -0.047847985347985345),
 (u'bad', -0.005144142257015721),
 (u'over', -0.0008741258741258741),
 (u'ever', -0.0006927835481992246),
 (u'old', -0.000509611311233071),
 (u'horrible', -0.0005045767240889192),
 (u'appropriate', -0.0004807692307692308),
 (u'single', -0.0003521130740894542),
 (u'worried', -0.0003434065934065934),
 (u'rich', -0.00020903010033444816),
 (u'year', -0.0001923076923076923),
 (u'normal', -0.00016869095816464237),
 (u'ready', -0.00016411333242216783),
 (u'busy', -0.00014386161489820027),
 (u'able', -0.00011459845186260283),
 (u'enough', -0.0001019798896944335),
 (u'high', -8.761468369791503e-05),
 (u'rude', -8.029485482690247e-05),
 (u'seriously', -7.754342431761787e-05),
 (u'sorry', -7.302249637155297e-05),
 (u'still', -5.139207374979732e-05),
 (u'other', -5.07118667496026e-05),
 (u'like', -4.767420925957511e-05),
 (u'too', -4.1431243431412225e-05),
 (u'away', -4.001232677893313e-05),
 (u'fast', -3.4470246734397684e-05),
 (u'little', -1.496198113885625e-0

lots of words that correlate with good which do not even have a polarity, so we need to focus on words that are more likely to have a polarity: adverbs and adjectives.

In [334]:
##example part of speech (POS) tagging (note that you need to tokenize the sentence first)
pos_tag(tokenizer.tokenize("This was a great day but the time is running out fast"))

[('This', 'DT'),
 ('was', 'VBD'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('day', 'NN'),
 ('but', 'CC'),
 ('the', 'DT'),
 ('time', 'NN'),
 ('is', 'VBZ'),
 ('running', 'VBG'),
 ('out', 'RP'),
 ('fast', 'RB')]

In [335]:
## POS tagging  all reviews
## POS tagging is relatively slow, so this will take a while

#reviews_pos_tagged=[pos_tag(tokenizer.tokenize(m)) for m in data.data]

## Reconstructing adjective-and-adverb-only reviews
reviews_adj_adv_only=[" ".join([w for w,tag in m if tag in ["JJ","RB","RBS","RBJ","JJR","JJS"]])
                      for m in reviews_pos_tagged]

In [336]:
print(data.data[1])

good films are hard to find these days . 
great films are beyond rare . 
proof of life , russell crowe's one-two punch of a deft kidnap and rescue thriller , is one of those rare gems . 
a taut drama laced with strong and subtle acting , an intelligent script , and masterful directing , together it delivers something virtually unheard of in the film industry these days , genuine motivation in a story that rings true . 
consider the strange coincidence of russell crowe's character in proof of life making the moves on a distraught wife played by meg ryan's character in the film -- all while the real russell crowe was hitching up with married woman meg ryan in the outside world . 
i haven't seen this much chemistry between actors since mcqueen and mcgraw teamed up in peckinpah's masterpiece , the getaway . 
but enough with the gossip , let's get to the review . 
the film revolves around the kidnapping of peter bowman ( david morse ) , an american engineer working in south america who is k

In [337]:
## It kind of works:
reviews_adj_adv_only[1]

"good hard great rare one-two rare taut strong subtle intelligent masterful together virtually unheard genuine true strange distraught meg real married outside n't much enough david american south anti-government only forward available ryan terry highly skilled wrong always most surprising own notable very simple complex intelligent character-driven well-written long sharply together tony biggest not gutsy right most ryan too many david memorable gunslinger terry most memorable extremely well skillfully amazing old-school trier"

In [338]:
## term doc matrix only for adj/adv
X = vec.fit_transform(reviews_adj_adv_only)
terms = vec.get_feature_names()

In [339]:
len(terms)

562

In [340]:
pmi_matrix=getcollocations_matrix(X)
pmi_matrix.shape  # n_words by n_words

(562, 562)

In [342]:
getcollocations("good",pmi_matrix,terms)

[(u'good', 0.0012845617524013917),
 (u'sean', 0.0009252217997465145),
 (u'nicely', 0.0009139270410318754),
 (u'fairly', 0.0008755655970071575),
 (u'robin', 0.0008653937882442727),
 (u'pretty', 0.0008548338879871134),
 (u'forward', 0.0008305488343511157),
 (u'terrific', 0.0008224793031847478),
 (u'cool', 0.0008204205677528381),
 (u'sadly', 0.0008203411798967191),
 (u'horrible', 0.0008162394739972354),
 (u'stupid', 0.0008141637119023551),
 (u'technical', 0.0008138216263091188),
 (u'lovely', 0.000809148389221857),
 (u'totally', 0.0007957590383758413),
 (u'sad', 0.0007916292386003339),
 (u'anti', 0.000788200947102258),
 (u'therefore', 0.0007862742336760081),
 (u'climactic', 0.0007856565791326648),
 (u'naturally', 0.0007855407689057879),
 (u'thankfully', 0.0007735470703392302),
 (u'bad', 0.0007711712373639965),
 (u'total', 0.0007710181664554288),
 (u'average', 0.0007709092809637673),
 (u'nice', 0.0007687057994165336),
 (u'mainly', 0.0007579711225428067),
 (u'fun', 0.0007575426481942805),
 (

We can make this better by combining multiple seet terms

In [343]:
def seed_score(pos_seed,PMI_MATRIX=pmi_matrix,TERMS=terms):
    score=defaultdict(int)
    for seed in pos_seed:
        c=dict(getcollocations(seed,PMI_MATRIX,TERMS))
        for w in c:
            score[w]+=c[w]
    return score

In [345]:
sorted(seed_score(['good','great','perfect','cool']).items(),key=itemgetter(1),reverse=True)

[(u'cool', 0.012001912748204434),
 (u'perfect', 0.006782938654467102),
 (u'great', 0.004234935151833858),
 (u'anti', 0.004160925070909675),
 (u'fake', 0.003978386428679741),
 (u'looking', 0.003957222634925364),
 (u'frank', 0.003953470579252501),
 (u'lovely', 0.0038977169233890795),
 (u'eccentric', 0.0038458229553531894),
 (u'greatest', 0.0037893056708582906),
 (u'totally', 0.0036293608998168546),
 (u'amazing', 0.003617561923228757),
 (u'stupid', 0.0035962513836334904),
 (u'generally', 0.003553253311814994),
 (u'climactic', 0.003537863066483464),
 (u'fun', 0.0035376706229829896),
 (u'twice', 0.0034429868622216564),
 (u'known', 0.0034156002474412875),
 (u'plain', 0.0033558593778353143),
 (u'good', 0.003300231759403832),
 (u'nicely', 0.0032826646303092937),
 (u'alien', 0.0032506377240264714),
 (u'overall', 0.003246573557239219),
 (u'convincing', 0.0032306160532405035),
 (u'necessary', 0.0032268324022759576),
 (u'earlier', 0.00320436224269597),
 (u'pretty', 0.0032003653412340915),
 (u'sad'

In [347]:
posscores=seed_score(['good','great','perfect','cool'])
negscores=seed_score(['bad','terrible','wrong',"crap","long","boring"])

## sentiment polarity score will be the difference between the words that are close to the positive seed
## and the words that are close to the negative seed
sentscores={}
for w in terms:
    sentscores[w] = posscores[w] - negscores[w]
    

In [348]:
sorted(sentscores.items(),key=itemgetter(1),reverse=False)

[(u'terrible', -0.010972487858524456),
 (u'boring', -0.009152588531402),
 (u'wrong', -0.0037842569272043196),
 (u'unfunny', -0.0028839715464925985),
 (u'bad', -0.002745669347410218),
 (u'frankly', -0.002735683658733542),
 (u'worst', -0.002650800210468679),
 (u'terribly', -0.002497993000217121),
 (u'anywhere', -0.002479642275811881),
 (u'laughable', -0.0024600189948948362),
 (u'horrible', -0.0023085769877623907),
 (u'awful', -0.0022332893067823654),
 (u'exciting', -0.0021194079061992045),
 (u'dull', -0.0019475225393855247),
 (u'running', -0.001919677366722775),
 (u'ugly', -0.0019027857871608356),
 (u'total', -0.0018358263440521236),
 (u'oddly', -0.001825867801362017),
 (u'painfully', -0.0017780445048585325),
 (u'ridiculous', -0.0017569353131335745),
 (u'poorly', -0.0017508500966694365),
 (u'bottom', -0.0016995579532760772),
 (u'current', -0.0016987113085641865),
 (u'successfully', -0.0016642378925818217),
 (u'pathetic', -0.0016356962074799996),
 (u'long', -0.0016266635116819225),
 (u'lo

Now let's apply this methodology to real (and important!) scenario where we don't have any sentiment labels: the Kardashians

In [349]:
## Loading the Kardashian data
with open("kardashian-transcripts.json", "rb") as f:
    transcripts = json.load(f)

In [350]:
msgs = [m['text'].lower() for transcript in transcripts
        for m in transcript ]


In [351]:
#msgs_pos_tagged = [pos_tag(tokenizer.tokenize(m)) for m in msgs]

In [352]:
msgs_adj_adv_only_tokenized=[[w for w,tag in m if tag in ["JJ","RB","RBS","RBJ","JJR","JJS"]]
                      for m in msgs_pos_tagged]

In [353]:
msgs_adj_adv_only=[" ".join([w for w,tag in m if tag in ["JJ","RB","RBS","RBJ","JJR","JJS"]])
                      for m in msgs_pos_tagged]

In [354]:
msgs[23]

u'and then if you could take out the trash, and then if you go to dash, maybe tomorrow or whatever, later today and just...'

In [355]:
msgs_adj_adv_only[23]

u'then then maybe later just'

In [356]:
vec = CountVectorizer(min_df = 10)
X = vec.fit_transform(msgs_adj_adv_only)
terms_kard = vec.get_feature_names()
len(terms_kard)

347

In [370]:
pmi_matrix_kard=getcollocations_matrix(X)

In [371]:
getcollocations("good",pmi_matrix_kard,terms_kard)

[(u'good', 0.0014394723893038387),
 (u'changei', 0.0013550135501355014),
 (u'positive', 0.0006097560975609756),
 (u'horrible', 0.0003695491500369549),
 (u'awful', 0.00031269543464665416),
 (u'nude', 0.00030795762503079576),
 (u'you', 0.0002463661000246366),
 (u'extremely', 0.00022583559168925022),
 (u'proud', 0.00021557033752155703),
 (u'willing', 0.00019357336430507162),
 (u'pretty', 0.00016592002654720425),
 (u'strong', 0.00016260162601626016),
 (u'and', 0.00013550135501355014),
 (u'anywhere', 0.00013550135501355014),
 (u'such', 0.00013428062208550013),
 (u'adrienne', 0.0001231830500123183),
 (u'dramatic', 0.0001231830500123183),
 (u'honest', 0.0001231830500123183),
 (u'online', 0.0001231830500123183),
 (u'though', 0.0001231830500123183),
 (u'two', 0.0001231830500123183),
 (u'kimberly', 0.00011291779584462511),
 (u'fun', 0.00010986596352450011),
 (u'half', 0.00010423181154888472),
 (u'very', 0.0001007104665641251),
 (u'really', 9.148499128303384e-05),
 (u'all', 7.970667941973537e-05)

In [375]:
posscores=seed_score(['good',"rude"],pmi_matrix_kard,terms_kard)
negscores=seed_score(['bad'],pmi_matrix_kard,terms_kard)

## sentiment polarity score will be the difference between the words that are close to the positive seed
## and the words that are close to the negative seed
sentscores={}
for w in terms_kard:
    sentscores[w]=posscores[w]-negscores[w]

neglexicon_kard = sorted(sentscores.items(),key=itemgetter(1),reverse=False)[:10]
poslexicon_kard = sorted(sentscores.items(),key=itemgetter(1),reverse=False)[-10:]

In [380]:
sorted(sentscores.items(),key=itemgetter(1),reverse=False)

[(u'bad', -0.004933346763201359),
 (u'over', -0.0008741258741258741),
 (u'horrible', -0.0005045767240889192),
 (u'appropriate', -0.0004807692307692308),
 (u'san', -0.00040064102564102563),
 (u'worried', -0.0003434065934065934),
 (u'able', -0.0002403846153846154),
 (u'worst', -0.00022893772893772894),
 (u'high', -0.00022361359570661896),
 (u'rich', -0.00020903010033444816),
 (u'year', -0.0001923076923076923),
 (u'normal', -0.00016869095816464237),
 (u'ready', -0.00016411333242216783),
 (u'fast', -0.00016025641025641026),
 (u'especially', -0.00015762925598991173),
 (u'busy', -0.00014386161489820027),
 (u'entire', -0.00012651821862348178),
 (u'sorry', -0.0001201923076923077),
 (u'enough', -0.0001019798896944335),
 (u'rude', -8.029485482690247e-05),
 (u'seriously', -7.754342431761787e-05),
 (u'again', -7.243382008860433e-05),
 (u'other', -6.868131868131868e-05),
 (u'away', -6.585879873551106e-05),
 (u'now', -6.56137913959721e-05),
 (u'probably', -5.9528944095807e-05),
 (u'around', -5.72344

We (roughly) calculate the each sentence's sentiment score by comparing the number of words with positive sentiment score vs negative sentiment score (according to our automatically induced lexicon)

In [389]:
final_message_sentiment = {}

for k, m in enumerate(msgs_adj_adv_only_tokenized):
    m_sent_score = sum([sentscores.get(w,0)>0 for w in m])-sum([sentscores.get(w,0)<0 for w in m])
    final_message_sentiment[msgs[k]]=m_sent_score

sorted(final_message_sentiment.items(), key=itemgetter(1), reverse=False)[:10]


[(u"i couldn't be any more sorry, and i'll never excuse the way i acted the other night in vegas, but, like, i don't know what i ever did so bad to, like, deserve you to, like, hate me so much.",
  -9),
 (u"he just needs to be pushed a little bit so that he takes care of something that's made him feel really bad for a really long time.",
  -7),
 (u"i mean, honestly i really thought you brought me here to spend time with you and like, it's like a bonding thing and you really wanted to take me to lunch and hang out, but obviously, this is not really why i'm here.",
  -6),
 (u'now, i do not know what case you have him on, but whatever it is, it is going bad, and it sounds like it is going bad right now.',
  -6),
 (u"i understand that we're gonna fight 'cause we do so much stuff together, but i got you guys a little gift because i felt a little bad-- for you both.",
  -6),
 (u"khloe getting married has really made me think about my own love life, and you know, i'm still sad, but i don't re

In [390]:
sorted(final_message_sentiment.items(), key=itemgetter(1))[-10:]

[(u"it's gonna be a pretty big game.", 4),
 (u"this is a great time to tell khloe that it's not always all about us and that maybe once in a while it's a great thing to help somebody else out.",
  4),
 (u"so, tonight, khloe, i ask you to honor that very same promise to his grandmother, that you will always support lamar and stand by him because you have realized very quickly what the rest of us already know: it's very easy to love lamar.",
  4),
 (u"i wouldn't be a good manager or a good mom if i didn't find out who's really single out there and who would be a great match for kim.",
  4),
 (u"they're always pretty strong women, actually.", 4),
 (u'i feel very at peace, very comfortable in my own skin.', 4),
 (u"i definitely feel protective over summer because she's so young and new to the industry, but i think the smart thing to do is let her learn her own lessons and kind of feel her way through on her own.",
  4),
 (u"i just want to say all you kids, i'm extremely proud of you becaus

Pretty good considering that we had absolutely no sentiment labels to start with!