from __future__ import print_function
import numpy as np
import json
This cell loads the Kickstarter data as a list of (name, category, text) tuples, keeping only projects whose text is longer than 50 words. If you have your own dataset, any code that builds the same kind of list will do; a sketch for plain text files follows this cell.
with open("kickstarter.jsonlist") as f:
    documents = [(x['name'], x['category'], x['text'])
                 for x in json.loads(f.readlines()[0])
                 if len(x['text'].split()) > 50]
#To prove I'm not cheating with the magic trick...
np.random.shuffle(documents)
print("Loaded {} documents".format(len(documents)))
print("Here is one of them:")
print(documents[0])
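If your own data lives in plain text files, a minimal sketch of building the same structure (the my_corpus directory and the "unknown" category are placeholders, not part of the original notebook):
import glob, os

my_documents = []
for path in glob.glob("my_corpus/*.txt"):
    with open(path) as f:
        text = f.read()
    if len(text.split()) > 50:  # same length filter as above
        my_documents.append((os.path.basename(path), "unknown", text))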
from sklearn.feature_extraction.text import TfidfVectorizer
help(TfidfVectorizer)
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7, min_df=75)
# transpose so that rows are terms and columns are documents
my_matrix = vectorizer.fit_transform([x[2] for x in documents]).transpose()
print(type(my_matrix))
print(my_matrix.shape)
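To see which terms survived the stop-word, max_df, and min_df filters, we can peek at the learned vocabulary (an addition; on scikit-learn versions before 1.0 the method is spelled get_feature_names()):
terms = vectorizer.get_feature_names_out()
print(len(terms))
print(terms[:10])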
from scipy.sparse.linalg import svds
# truncated SVD: computes only the top k singular values/vectors of the sparse matrix
u, s, v_trans = svds(my_matrix, k=100)
print(u.shape)
print(s.shape)
print(v_trans.shape)
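The squared singular values of a matrix sum to its squared Frobenius norm, so we can check how much of the tf-idf matrix the top 100 factors capture (a quick sanity check, not in the original notebook):
from scipy.sparse.linalg import norm as sparse_norm
total_energy = sparse_norm(my_matrix) ** 2  # ||A||_F^2 = sum of all squared singular values
captured_energy = (s ** 2).sum()            # energy in the top 100 factors
print("fraction captured: {:.3f}".format(captured_energy / total_energy))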
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# svds returns singular values in ascending order, so reverse for the usual scree plot
plt.plot(s[::-1])
plt.xlabel("Singular value number")
plt.ylabel("Singular value")
plt.show()
# keep the left factor (terms x 40) and the right factor; the singular values are discarded.
# Transpose the right factor so that rows are documents.
words_compressed, _, docs_compressed = svds(my_matrix, k=40)
docs_compressed = docs_compressed.transpose()
print(words_compressed.shape)
print(docs_compressed.shape)
word_to_index = vectorizer.vocabulary_
index_to_word = {i: t for t, i in word_to_index.items()}
# row-normalize so that dot products between rows are cosine similarities
from sklearn.preprocessing import normalize
words_compressed = normalize(words_compressed, axis=1)
def closest_words(word_in, k=10):
    if word_in not in word_to_index:
        return "Not in vocab."
    # rows are unit length, so this dot product is cosine similarity
    sims = words_compressed.dot(words_compressed[word_to_index[word_in], :])
    asort = np.argsort(-sims)[:k + 1]
    # skip asort[0]: the closest word to any word is itself
    return [(index_to_word[i], sims[i] / sims[asort[0]]) for i in asort[1:]]
Suggestions: try a few words and see which neighbors the latent space finds.
closest_words("nuclear")
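Because the rows of words_compressed are unit length after normalization, the dot product inside closest_words is exactly cosine similarity. A quick check against scikit-learn's implementation ("camera" is just a guess at an in-vocabulary word):
from sklearn.metrics.pairwise import cosine_similarity
i = word_to_index["camera"]  # any in-vocabulary word works here
print(cosine_similarity(words_compressed[i:i+1], words_compressed[:3]))
print(words_compressed[i:i+1].dot(words_compressed[:3].T))  # same numbers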
print(list(word_to_index.keys())[:200])
from sklearn.manifold import TSNE
tsne = TSNE(verbose=1)
print(docs_compressed.shape)
# we'll just take the first 5K documents, because t-SNE is memory intensive!
subset = docs_compressed[:5000,:]
projected_docs = tsne.fit_transform(subset)
print(projected_docs.shape)
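t-SNE is stochastic, so the layout will differ from run to run. If you want a reproducible picture, one option (an assumption, not in the original) is to pin the seed:
tsne = TSNE(verbose=1, random_state=42)
projected_docs = tsne.fit_transform(subset)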
plt.figure(figsize=(15,15))
plt.scatter(projected_docs[:,0],projected_docs[:,1])
plt.show()
from collections import Counter
cats = Counter([x[1] for x in documents])
print(cats)
from collections import defaultdict
cat_to_color = defaultdict(lambda: 'k')
cat_to_color.update({"Photography": 'g',
                     "Music": 'c',
                     "Food": 'r',
                     "Comics": 'b'})
color_to_project = defaultdict(list)
for i in range(projected_docs.shape[0]):
    color_to_project[cat_to_color[documents[i][1]]].append(i)
plt.figure(figsize=(15,15))
for color, indices in color_to_project.items():
    indices = np.array(indices)
    plt.scatter(projected_docs[indices, 0], projected_docs[indices, 1],
                color=color)
plt.show()
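A small variant of the plot above (an addition): passing label= to scatter and calling plt.legend() names the categories, with the black points lumped under "Other".
color_to_cat = {'g': "Photography", 'c': "Music", 'r': "Food", 'b': "Comics", 'k': "Other"}
plt.figure(figsize=(15, 15))
for color, indices in color_to_project.items():
    indices = np.array(indices)
    plt.scatter(projected_docs[indices, 0], projected_docs[indices, 1],
                color=color, label=color_to_cat[color])
plt.legend()
plt.show()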
Note that we don't really need an inverted index here: our document vectors are only 40-dimensional, rather than |V|-dimensional, so a brute-force dot product against every document is cheap.
docs_compressed = normalize(docs_compressed, axis = 1)
def closest_projects(project_index_in, k=5):
    sims = docs_compressed.dot(docs_compressed[project_index_in, :])
    asort = np.argsort(-sims)[:k + 1]
    # skip asort[0]: the closest project to any project is itself
    return [(documents[i][0], sims[i] / sims[asort[0]]) for i in asort[1:]]
for i in range(10):
    print(documents[i][0])
    for title, score in closest_projects(i):
        print("{}:{:.3f}".format(title[:40], score))
    print()
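A natural extension (a sketch, not from the original notebook): search with free text instead of an existing project. Since the singular values of the k=40 factorization were discarded with _ above, this reruns svds to keep them, then folds a tf-idf query vector q into the latent space via the standard LSA fold-in q_hat = q^T U S^{-1}:
u40, s40, vt40 = svds(my_matrix, k=40)  # rerun, keeping the singular values this time
doc_vecs = normalize(vt40.transpose(), axis=1)

def closest_projects_to_query(query, k=5):
    q = vectorizer.transform([query]).toarray().ravel()  # |V|-dimensional tf-idf vector
    q_hat = q.dot(u40) / s40                             # fold into the 40-dim latent space
    if np.linalg.norm(q_hat) == 0:
        return "No query words in vocab."
    q_hat = q_hat / np.linalg.norm(q_hat)
    sims = doc_vecs.dot(q_hat)
    asort = np.argsort(-sims)[:k]
    return [(documents[i][0], sims[i]) for i in asort]

closest_projects_to_query("indie rock album")  # the query string is just an example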