Quick intro to the Kickstarter Data¶

from __future__ import print_function
import json
import numpy as np

# The code reads only the first line of the file and parses it as one JSON list.
with open("kickstarter.jsonlist") as f:
    dataset = json.loads(f.readline())
np.random.shuffle(dataset)  # shuffle in place, just for fun :-)

Let's get some basic statistics, to know what we are working with here...¶

print("There are {} projects in the dataset".format(len(dataset)))

There are 45815 projects in the dataset

What information does each project contain?¶

print(type(dataset[0]))

<type 'dict'>

print(dataset[0].keys())

[u'raised', u'sub_category', u'text', u'creator_num_backed', u'featured', u'result', u'duration', u'category', u'goal', u'creator_facebook_connect', u'projectId', u'lon', u'has_video', u'comments', u'faqs', u'start_date', u'rewards', u'end_date', u'parent_category', u'updates', u'lat', u'short_text', u'name', u'url', u'backers']

# Show the name, outcome, and URL of the first five (shuffled) projects.
for project in dataset[:5]:
    outcome = "Success" if project['result'] else "Failure"
    print("{}: {}".format(project['name'], outcome))
    print(project['url'] + "\n")

No Regrets for Our Youth: Success
http://www.kickstarter.com/projects/112656196/no-regrets-for-our-youth

Documentary: Music on Foot (Walking Massachusetts): Success
http://www.kickstarter.com/projects/paulgandy/documentary-music-on-foot-walking-massachusetts

Already There: The Story of the Kwoncok Project: Failure
http://www.kickstarter.com/projects/alreadythere/already-there-the-story-of-the-kwoncok-project

Public Arts Project 66: Success
http://www.kickstarter.com/projects/231508740/public-arts-project-66

Life Abstract: Failure
http://www.kickstarter.com/projects/1935496424/life-abstract

Interesting! We have the following fields of interest...¶

The text of the Kickstarter Project
The category of the Kickstarter Project
Some information about rewards, backers, etc.
Whether or not the project was successful

How many projects are successful?¶

# Count the projects whose 'result' field is truthy (i.e. successful).
n_success = sum(1 for d in dataset if d['result'])
print("Success rate of Kickstarter projects: {}/{}".format(n_success, len(dataset)))

Success rate of Kickstarter projects: 23604/45815
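
The two counts above are easier to read as a percentage. The short sketch below simply redoes the arithmetic with the numbers copied from the output; the variable names are illustrative and not part of the original notebook. It works out to a little over half of all projects, about 51.5%.

```python
from __future__ import division, print_function

# Counts copied from the printed output above.
n_success = 23604
n_total = 45815

# Convert the raw counts into a percentage.
success_pct = 100.0 * n_success / n_total
print("Success rate: {:.2f}%".format(success_pct))
```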

How many backers do projects generally get?¶

Let's make a histogram!

import matplotlib.pyplot as plt
%matplotlib inline

# Total backers per project, summed over all of its reward tiers.
plt.hist([sum(x['num_backers'] for x in y['rewards']) for y in dataset], bins=100)
plt.xlabel("Backers per project")
plt.ylabel("Number of projects")
plt.show()

This is a very skewed distribution! Let's see if we can make that plot a bit better by excluding the 1000 biggest projects, say.¶

# Sort the per-project backer totals and drop the 1000 largest values.
n_backers = np.array([sum(x['num_backers'] for x in y['rewards']) for y in dataset])
n_backers = np.sort(n_backers)
n_backers = n_backers[:-1000]
plt.hist(n_backers, bins=100, log=True)  # log-scaled counts on the y-axis
plt.show()
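
To see why trimming the largest projects changes the picture so much, it can help to compare the mean and the median of a skewed sample. The sketch below uses a small made-up list of backer counts (an assumption for illustration, not values from the dataset) and a hypothetical `median` helper written in plain Python so that it runs without extra dependencies.

```python
from __future__ import division, print_function

def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical backer counts, NOT taken from the dataset:
# one very large project among several small ones.
toy_backers = [0, 3, 12, 25, 40, 58, 110, 4500]

mean_all = sum(toy_backers) / len(toy_backers)
median_all = median(toy_backers)

# Drop the single largest value, mirroring the "exclude the biggest" idea.
trimmed = sorted(toy_backers)[:-1]
mean_trimmed = sum(trimmed) / len(trimmed)

print("mean: {:.1f}, median: {:.1f}".format(mean_all, median_all))
print("mean without the largest project: {:.1f}".format(mean_trimmed))
```

In this toy sample a single outlier pulls the mean far above the median, and removing it brings the two much closer together. That is the same effect the trimmed histogram relies on.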

What are the types of categories in our dataset?¶

print(set([x['category'] for x in dataset]))

set([u'', u'Film & Video', u'Fashion', u'Art', u'Publishing', u'Food', u'Photography', u'Comics', u'Design', u'Games', u'Theater', u'Music', u'Technology', u'Dance'])

print(set([x['sub_category'] for x in dataset]))

set([u'', u'Jazz', None, u'Performance Art', u'Conceptual Art', u'Poetry', u'Fiction', u'Classical Music', u'Animation', u'Art Book', u'Digital Art', u'Indie Rock', u'Board & Card Games', u'Painting', u'Crafts', u'Video Games', u'Illustration', u'Public Art', u'Country & Folk', u'Open Hardware', u'Narrative Film', u'Electronic Music', u'Journalism', u'Webseries', u'Graphic Design', u'Short Film', u'Product Design', u"Children's Book", u'World Music', u'Rock', u'Documentary', u'Hip-Hop', u'Open Software', u'Pop', u'Nonfiction', u'Periodical', u'Sculpture', u'Mixed Media'])

How many projects are in each category? What are the success rates for each category?¶

from collections import defaultdict
cat_to_proj = defaultdict(list)
for p in dataset:
    cat_to_proj[p['category']].append(p)

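# .items() works on both Python 2 and 3 (iteritems() exists only on Python 2).
for c, projs in cat_to_proj.items():
    n_success = len([p for p in projs if p['result'] == 1])
    print("{}: {} ({:.3f}%)".format(c, len(projs), 100. * n_success / len(projs)))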

: 5 (0.000%)
Film & Video: 13502 (47.786%)
Fashion: 1134 (31.481%)
Art: 4236 (53.447%)
Publishing: 4761 (36.967%)
Food: 1431 (48.008%)
Photography: 1508 (43.899%)
Comics: 1068 (51.966%)
Design: 1507 (44.658%)
Games: 1728 (41.725%)
Theater: 2484 (67.351%)
Music: 10884 (63.929%)
Technology: 808 (38.243%)
Dance: 759 (70.224%)
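
The first row of the output above comes from five projects whose category is an empty string. One possible way to tidy the table, sketched below, is to skip those records and sort the categories by success rate. The `success_rates` helper and the toy records are hypothetical and only mimic the `category` and `result` fields seen in the dataset.

```python
from __future__ import division, print_function
from collections import defaultdict

def success_rates(projects, skip_empty=True):
    """Map each category to a (project count, success percentage) pair."""
    counts = defaultdict(int)
    wins = defaultdict(int)
    for p in projects:
        cat = p['category']
        if skip_empty and not cat:
            continue  # ignore records with no category label
        counts[cat] += 1
        if p['result']:
            wins[cat] += 1
    return {c: (n, 100.0 * wins[c] / n) for c, n in counts.items()}

# Hypothetical toy records, NOT taken from the dataset.
toy_projects = [
    {'category': 'Dance', 'result': 1},
    {'category': 'Dance', 'result': 0},
    {'category': 'Music', 'result': 1},
    {'category': '', 'result': 0},
]

rates = success_rates(toy_projects)

# Print categories from highest to lowest success rate.
for cat, (n, pct) in sorted(rates.items(), key=lambda kv: kv[1][1], reverse=True):
    print("{}: {} ({:.1f}%)".format(cat, n, pct))
```

On the real data this could be called as `success_rates(dataset)`, assuming `dataset` is the list loaded at the top of the notebook.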

Question of the day: can we predict success given the text of a project?¶
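
A natural first step toward that question is to pair each project's `text` with its `result` so the two can later be handed to whatever model is chosen. The sketch below is only one way this might look: the `texts_and_labels` helper and the toy records are hypothetical, and skipping projects with missing text is an assumption rather than something the notebook prescribes.

```python
from __future__ import print_function

def texts_and_labels(projects):
    """Return parallel lists of project texts and 0/1 success labels."""
    texts, labels = [], []
    for p in projects:
        text = p.get('text')
        if not text:
            continue  # assumption: skip projects with no text at all
        texts.append(text)
        labels.append(1 if p['result'] else 0)
    return texts, labels

# Hypothetical toy records, NOT taken from the dataset.
toy_projects = [
    {'text': 'A documentary about walking musicians.', 'result': 1},
    {'text': '', 'result': 0},
    {'text': 'An abstract short film.', 'result': 0},
]

texts, labels = texts_and_labels(toy_projects)
print(len(texts), labels)
```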