We're going to build a very basic proto-IR system from scratch. In the first part we will preprocess the "documents" (in this case sentences from a Wikipedia article); in the second part we will implement a method for searching for the set of documents which are closest to a given query.
from __future__ import print_function
import re
# Copy and pasted background info from the show's Wikipedia page
# (http://en.wikipedia.org/wiki/Keeping_Up_with_the_Kardashians)
background = u"""
Robert Kardashian (1944—2003) and Kristen Mary "Kris" Houghton (born 1955) married in 1978 and had four children together, daughters Kourtney (born 1979), Kim (born 1980), and Khloé, (born 1984), and son Rob (born 1987). The couple divorced in 1991.[3] In 1994, Robert entered the media spotlight when he defended O.J. Simpson for the murders of Nicole Brown Simpson and Ronald Goldman during a lengthy trial. Kris married former Olympic champion Bruce Jenner (born 1949) in 1991. Bruce and Kris had two daughters together, Kendall (born 1995) and Kylie (born 1997). Robert died in 2003 eight weeks after being diagnosed with esophageal cancer.[4] In 2004, Kim became a personal stylist to recording artist Brandy Norwood; she eventually developed into a full-time stylist, and was a personal shopper and stylist to actress Lindsay Lohan.[5] Khloé, Kim and Kourtney further ventured into fashion, opening a high fashion boutique D-A-S-H in Calabasas, California. Throughout Kim's early career, she involved herself in some high-profile relationships including Norwood's brother, rapper Ray J and later, singer Nick Lachey.[5] In 2006, Kourtney starred in her first reality television series, Filthy Rich: Cattle Drive.[6] In February 2007, a home sex video that Kim made with Ray J years earlier was leaked.[7] Vivid Entertainment bought the rights for $1 million and released the film as Kim Kardashian: Superstar on February 21.[7] Kim sued Vivid for ownership of the tape, but dropped the suit in April 2007 and settled with Vivid Entertainment for $5 million.[8]
In August 2007, it was announced that the Kardashians and Jenners would star in a yet-to-be-titled reality show on E!, with Ryan Seacrest serving as executive producer.[6] Seacrest said "At the heart of the series—despite the catfights and endless sarcasm—is a family that truly loves and supports one another [...] The familiar dynamics of this family make them one Hollywood bunch that is sure to entertain." The series announcement came one week after Paris Hilton and her friend Nicole Richie announced that their popular E! series entitled The Simple Life was ending.[6] Keeping Up With the Kardashians premiered on October 14, 2007.[9]
On November 13, 2007, it was announced that E! renewed the show for its second season.[9] The following year, it was renewed for a third season. Lisa Berger, executive vice president of original programming and series development for the network, said "viewers have embraced the Kardashian family and the series has become one of television's most-talked-about shows [...] We are fortunate to work with Seacrest and Bunim-Murray, which have an exceptional ability to capture the Kardashians' hilarious, chaotic and always entertaining personalities and family dynamics."[10] The Hollywood Reporter reported that the family made an estimated $65 million throughout 2010.[3]
In 2011, Kim married NBA player Kris Humphries in a highly publicized wedding ceremony,[11] but filed for divorce 72 days later.[12] This caused widespread backlash from the public and media.[13] Several news outlets surmised that Kardashian's marriage to Humphries was merely a publicity stunt, to promote the Kardashian family's brand and their subsequent television ventures.[14] A widely circulated petition asking to remove all Kardashian-related programming from the air has followed since their split.[15] As of December 2013, eight seasons of the series have aired. On April 24, 2012, E! signed a three-year deal with the Kardashian family that will keep the series airing through seasons seven, eight and nine.[16] The deal was estimated at $40 million.[17][18] On January. 4, 2015, a tenth season was announced to premiere on Feb 8, 2015 after the Kourtney and Khloé Take The Hamptons season finale.
The show revolves around the children of Kris Jenner, and originally focused on her children from her first marriage to deceased attorney Robert Kardashian: Kourtney, Kim, Khloé, and Rob Kourtney's boyfriend Scott Disick is a main character on the show. As the series progressed, Kris' children Kendall and Kylie also became recurring cast members of the show. Kris' second husband[19] 1976 Summer Olympics decathlon champion Bruce Jenner, is also frequently featured on the show, and has been a recurring cast member since the show began.
Since the series' premiere, the Kardashian sisters have established careers in the fashion industry, co-owning the fashion boutique D-A-S-H and launching several fragrances and clothing collections.
Kim gained notoriety as the subject of a sex tape in 2007, and later became involved in a relationship with New Orleans Saints running back Reggie Bush from 2007- March 2010.[20] In 2011, she received widespread criticism after filing for divorce from New Jersey Nets power forward Kris Humphries after a 72-day marriage. In 2012, while still married Kim became pregnant by rapper Kanye West; after suffering from preeclampsia she gave birth prematurely to their daughter North the following June.
Khloé attained notoriety in her own right after being arrested for driving under the influence in 2007, for which she was jailed for approximately three hours in 2008. The following year (during the fourth season), she married Los Angeles Lakers forward Lamar Odom after a one-month relationship. In 2012, she served as a co-host during the second season of the American version of The X Factor.
Rob launched the sock line "Arthur George" in 2012, and was involved in a relationship with singer Adrienne Bailon in the second and third seasons.
Kendall and Kylie have also established careers in the modeling industry.
In the eighth season, Bruce's sons Brandon and Brody Jenner, and Brandon's wife Leah Felder (daughter of Eagles band member Don Felder), were integrated into the supporting cast, while Kourtney, Khloé, and Kim's friends Malika Haqq and Jonathan Cheban joined the series in the second and third seasons.
The family earns an alleged total of $10 million per season of the series.[21]
"""
The simplest way to identify individual words is to split the string on whitespace.
background.split()
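To make the issue concrete, here is whitespace splitting on a short snippet of the text (a toy excerpt, not the full article):

```python
# Whitespace splitting leaves punctuation and footnote markers
# attached to the neighboring words:
"The couple divorced in 1991.[3] In 1994, Robert entered the spotlight.".split()
# → ['The', 'couple', 'divorced', 'in', '1991.[3]', 'In', '1994,',
#    'Robert', 'entered', 'the', 'spotlight.']
```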
The output is not satisfactory, especially around punctuation. Let's try to catch a higher-level structure first: sentences. A simple idea is to break at periods.
background.split(".")
This is not perfect either. One problem is the Wikipedia footnotes such as "[9]", which are irrelevant to us. Let's remove them with a regular expression.
re.findall(r"\[\d+\]", background)
background_nofoot = re.sub(r"\[\d+\]", "", background)
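As a quick sanity check, the same substitution on a short made-up snippet:

```python
import re

# The footnote pattern removes bracketed digit sequences and nothing else:
re.sub(r"\[\d+\]", "", "The couple divorced in 1991.[3] In 1994, Robert[4] left.")
# → 'The couple divorced in 1991. In 1994, Robert left.'
```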
We can now try to split sentences again. We also want to split on other punctuation marks, not just the period. This is possible with regular expressions.
splitter = re.compile(r"""
[.!?] # split on punctuation
""", re.VERBOSE)
While this looks better, it mishandles some instances: O.J. Simpson's name, the "E!" network's name, and ellipses (...).
To deal with this, we need to think about the decision process of whether a punctuation mark ends a sentence or not. In class, we came up with three requirements:

1) the punctuation mark must not be preceded by an uppercase letter (so we don't split "E! renewed" or "J. Simpson");
2) the punctuation mark must be followed by whitespace (so we don't split inside "O.J");
3) the next character must be uppercase.

Are these rules complete?
splitter = re.compile(r"""
(?<![A-Z]) # last character cannot be uppercase
[.!?] # match punctuation
\s+ # followed by whitespace
(?=[A-Z]) # next character must be uppercase
""", re.VERBOSE)
Running our sentence splitter on the text, it seems to do a good job.
for sentence in splitter.split(background_nofoot):
    print(sentence.strip())
    print("--")
Now let's identify individual words within each sentence. Rather than splitting at whitespace, let's match all sequences of word characters.
word_splitter = re.compile(r"""
(\w+)
""", re.VERBOSE)
sent_words = [word_splitter.findall(sent)
              for sent in splitter.split(background_nofoot)]
sent_words_lower = [[w.lower() for w in sent]
                    for sent in sent_words]
How many sentences do we have?
len(sent_words_lower)
How many words ("tokens") are there in total?
allwords = [w for sent in sent_words_lower for w in sent]
sorted(allwords)
len(allwords)
How many distinct types of words ("types") are there?
len(set(allwords))
sorted(set(allwords))
We will try to retrieve the closest matching sentence to a given query. To do this, we must define what "closest" means. In other words, we need a similarity measure.
A simple one is the number of types in common between the query and the sentence.
def types_in_common(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return len(A.intersection(B))
A slightly more complex one is the Jaccard similarity measure, which additionally takes into account the total number of types in the query and the sentence combined.
def jaccard(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return float(len(A.intersection(B))) / len(A.union(B))
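To see how the two measures differ, we can compare them on toy inputs; the definitions above are repeated so the snippet runs on its own, and the example words are made up rather than drawn from the corpus:

```python
def types_in_common(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return len(A.intersection(B))

def jaccard(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return float(len(A.intersection(B))) / len(A.union(B))

query = ["kris", "olympic"]
short_sent = ["kris", "olympic", "champion"]
long_sent = ["kris", "olympic", "champion", "bruce", "jenner", "born", "1949"]

# Raw overlap cannot tell the two sentences apart:
types_in_common(query, short_sent), types_in_common(query, long_sent)
# → (2, 2)

# Jaccard divides by the union size (3 vs. 7 types), preferring the
# shorter, more focused sentence:
jaccard(query, short_sent), jaccard(query, long_sent)
# → (0.666..., 0.285...)
```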
Next we'll define a basic "search engine" which will go through all the sentences and calculate each one's similarity with the query. It returns a list of sentences sorted by their similarity score (if that score is greater than zero).
To calculate the similarity, this function takes as an argument a similarity_measure function.
from operator import itemgetter
def run_search(query, similarity_measure):
    query_words = word_splitter.findall(query)
    query_words = [w.lower() for w in query_words]
    sent_scores = [(sent, similarity_measure(query_words, sent))
                   for sent in sent_words_lower]
    sent_scores = sorted(sent_scores, key=itemgetter(1), reverse=True)
    sent_scores = [(sent, score)
                   for sent, score in sent_scores
                   if score > 0]
    joined_sents = [(" ".join(sent), score) for sent, score in sent_scores]
    return joined_sents
Now we'll run two versions of the search engine (one using the types_in_common measure and one using the jaccard measure) on two different queries.
run_search("kris olympic",types_in_common)
run_search("kourtney",jaccard)
As we will be playing around with the internals of our information retrieval system, let's write a convenience function for displaying the results and highlighting the ones that we know are actually relevant for our information need.
def print_results(orderedlist, relevant_docs=[], maxresults=5):
    """Print search results while highlighting the ones we truly care about"""
    count = 1
    for item, score in orderedlist:
        if item in relevant_docs:
            print("{:d} !!! {:.2f} {}".format(count, score, item))
        elif count <= maxresults:
            print("{:d}     {:.2f} {}".format(count, score, item))
            print()
        count += 1
relevant_docs_champion = ["kris married former olympic champion bruce jenner born 1949 in 1991",
"kris second husband 1976 summer olympics decathlon champion bruce jenner is also frequently featured on the show and has been a recurring cast member since the show began"]
The Jaccard measure ranks a completely irrelevant document at #1, gets one right at #2, but the actual best document is all the way down at position 16.
print_results(run_search("the olympic champion in kardashians", jaccard),
              relevant_docs=relevant_docs_champion)
In order to have a more flexible system that we can modify, let's implement a vector space model. We will use term frequency (TF) weights in order to address one of the issues with Jaccard, namely, that if a word occurs more than once, it should naturally matter more.
terms = sorted(set(allwords))
# TF (term frequency) vectorization.
# We represent each vector as a dictionary mapping every term in the
# vocabulary to its count in the document. (To keep the code simple we
# store the zero entries too; a production system would use a sparse
# format that keeps only the nonzero ones.)
def doc_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v)
    return d

def query_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v)
    return d
import math
def dot(d, q):
    total = 0  # avoid shadowing the built-in sum
    for v in d:  # iterates through keys
        total += d[v] * q[v]
    return total
One simple similarity measure operating on vectors is the dot product. The higher the dot product between two vectors, the more similar they are.
def dot_measure(query_words, sentence):
    A = query_to_vec(query_words)
    B = doc_to_vec(sentence)
    return float(dot(A, B))
print_results(run_search("the olympic champion in kardashians", dot_measure),
              relevant_docs=relevant_docs_champion)
We can see that this does even worse, as it rewards longer documents unfairly. We can address that by length normalizing the documents and the query, that is, dividing the vectors by their norm.
The resulting measure is the cosine similarity measure:
def norm(d):
    sum_sq = 0
    for v in d:
        sum_sq += d[v] * d[v]
    return math.sqrt(sum_sq)

def cos_measure(query_words, sentence):
    A = query_to_vec(query_words)
    B = doc_to_vec(sentence)
    return float(dot(A, B)) / (norm(A) * norm(B))
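Before applying it to the corpus, it may help to see the arithmetic once on toy vectors (hypothetical values, in the same dictionary representation we use above):

```python
import math

# Toy query and document vectors (hypothetical values):
A = {"olympic": 1, "champion": 1, "the": 0}
B = {"olympic": 2, "champion": 1, "the": 3}

dot_AB = sum(A[v] * B[v] for v in A)           # 1*2 + 1*1 + 0*3 = 3
norm_A = math.sqrt(sum(A[v] ** 2 for v in A))  # sqrt(2)
norm_B = math.sqrt(sum(B[v] ** 2 for v in B))  # sqrt(14)

dot_AB / (norm_A * norm_B)  # 3 / sqrt(28), roughly 0.567
```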
print_results(run_search("the olympic champion in kardashians", cos_measure),
              relevant_docs=relevant_docs_champion)
This already does better, ranking one of our important documents as first. The other is still low in the ranking.
Another issue we have at the moment is that all words matter equally, but intuition dictates that some words in the query (e.g. "olympic") are more important than others (e.g. "in"). A way to address this is to weight words according to their specificity; a concrete implementation is to use inverse document frequency (IDF) weights.
IDF = {}
DF = {}
for t in terms:
    DF[t] = len([1 for sent in sent_words_lower if t in sent])
    IDF[t] = 1 / float(DF[t] + 1)

for IDF_t in sorted(IDF.items(), key=itemgetter(1), reverse=False)[:10]:
    print(IDF_t)
print("...")
for IDF_t in sorted(IDF.items(), key=itemgetter(1), reverse=False)[-10:]:
    print(IDF_t)
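To see why this rewards specificity, here is the same computation on a tiny toy corpus (three made-up documents, not our sentences):

```python
# A word appearing in every document gets a low IDF,
# while a rarer word gets a higher one:
toy_docs = [["the", "olympic", "champion"],
            ["the", "show"],
            ["the", "family"]]

def toy_idf(term):
    df = len([1 for doc in toy_docs if term in doc])
    return 1 / float(df + 1)

toy_idf("the")      # → 0.25 (appears in all 3 documents)
toy_idf("olympic")  # → 0.5  (appears in only 1 document)
```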
## TF-IDF weights
def doc_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v) * IDF[v]
    return d

def query_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v) * IDF[v]
    return d
print_results(run_search("the olympic champion in kardashians", cos_measure),
              relevant_docs=relevant_docs_champion)
With TF-IDF and cosine similarity, we now get both relevant documents ranked first and second in the returned list. Neat!