Tuesday, November 8, 2016

Machine Learning Introduction


*Interested in Machine Learning? I recommend starting with Google's free course. It is very detailed and will make this post a little easier to follow. My notes on it are here.

I had the privilege of joining a workshop on Machine Learning hosted by Galvanize and CrowdFlower using the Scikit-learn toolkit. Below are my notes, taken in a Jupyter Notebook.

You can get the course material at https://github.com/lukas/scikit-class. Follow the README there to download the code and get all the libraries set up (including Scikit-learn, the library that contains all the neat machine learning tools we'll be using).

Ok ready? Let's go!

Start by running scikit/feature-extraction-1.py
In [1]:
# First attempt at feature extraction
# Leads to an error, can you tell why?

import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

from sklearn.feature_extraction.text import CountVectorizer

count_vect=CountVectorizer()
count_vect.fit(text)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-ad8b07653e46> in <module>()
     12 
     13 count_vect=CountVectorizer()
---> 14 count_vect.fit(text)

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
    794         self
    795         """
--> 796         self.fit_transform(raw_documents)
    797         return self
    798 

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    822 
    823         vocabulary, X = self._count_vocab(raw_documents,
--> 824                                           self.fixed_vocabulary_)
    825 
    826         if self.binary:

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    750         for doc in raw_documents:
    751             feature_counter = {}
--> 752             for feature in analyze(doc):
    753                 try:
    754                     feature_idx = vocabulary[feature]

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in <lambda>(doc)
    239 
    240             return lambda doc: self._word_ngrams(
--> 241                 tokenize(preprocess(self.decode(doc))), stop_words)
    242 
    243         else:

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in decode(self, doc)
    119 
    120         if doc is np.nan:
--> 121             raise ValueError("np.nan is an invalid document, expected byte or "
    122                              "unicode string.")
    123 

ValueError: np.nan is an invalid document, expected byte or unicode string.
There's an exception already?! This throws an exception because there is missing data on line 8 of tweets.csv. If you look at that line you will see that there is no tweet text at all. You'll find this is extremely common in data science: half the battle is manipulating the data into something you can use. To fix it, the pandas library provides a convenient notnull() function. Here's an example of how it works:
In [40]:
import pandas as pd
# pandas has a special data type called a Series
s = pd.Series(['apple','banana','cat','dog','elephant','fish']) 
print type(s)
print
print s
print

# you can pass a list of booleans to a Series to include (True) or exclude (False) entries
print s[[True,False,True]] 
print

# in our example above, the extracted tweet_text is also a pandas Series
df = pd.read_csv('tweets.csv')
text = df['tweet_text']
print type(text)
print 

# pandas.notnull returns a boolean array with False values where values are null
print pd.notnull(['apple','banana', None, 'dog',None,'fish']) 
print

# Thus, combining the Series data type and pandas.notnull, you can exclude the null values.
print s[pd.notnull(['apple','banana', None, 'dog',None,'fish'])]
print
<class 'pandas.core.series.Series'>

0       apple
1      banana
2         cat
3         dog
4    elephant
5        fish
dtype: object

0    apple
2      cat
dtype: object

<class 'pandas.core.series.Series'>

[ True  True False  True False  True]

0     apple
1    banana
3       dog
5      fish
dtype: object

In [41]:
# scikit/feature-extraction-2.py 
# second attempt at feature extraction

import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

# what did we do here?
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
count_vect.fit(fixed_text)

# print the number of words in the vocabulary
print(count_vect.vocabulary_)
{u'unscientific': 9042, u'hordes': 4175, u'pickmeupanipad2': 6385, u'yellow': 9608, u'four': 3434, u'prices': 6652, u'woods': 9501, u'hanging': 3940, u'16mins': 70, u'looking': 5143, u'html5': 4215, u'gad': 3543, u'eligible': 2846, u'gadgetoverload': 3546, u'insertion': 4461, u'lori': 5154, u'sxswdad': 8340, u'lord': 5152, u'newmusic': 5809, u'dynamic': 2743, u'bergstrom': 1065, u'dell': 2351, u'rancewilemon': 6892, u'leisurely': 4985, u'bringing': 1305, u'basics': 971, u'prize': 6675, u'customizable': 2213, u'wednesday': 9356, u'oooh': 6028, ... output truncated, it's quite long ... }
CountVectorizer converts text into token counts. The fit() function builds the vocabulary from our tweet data. If you look at the vocabulary_ of count_vect you'll see each word lowercased and assigned an index.
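To see what fit() builds on something smaller than 9,000 tweets, here's a tiny standalone sketch (the three sentences are a made-up mini-corpus, just for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat", "The cat ran", "A dog barked"]  # hypothetical mini-corpus
cv = CountVectorizer()
cv.fit(docs)
print cv.vocabulary_
# a mapping like {u'barked': 0, u'cat': 1, u'dog': 2, u'ran': 3, u'sat': 4, u'the': 5}
# (print order may vary) -- everything is lowercased, "A" is dropped because tokens
# need 2+ characters, and indices are assigned in alphabetical order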
Before you take a look at scikit/feature-extraction-3.py, it's worth looking at this next example, as it's a simplified version.
In [44]:
import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True) # this lowercase=True is not necessary because the default is True
count_vect.fit(fixed_text)

transformed = count_vect.transform(["I love my iphone!!!"])
print transformed

vocab = count_vect.vocabulary_
for v in transformed.indices:
    print vocab.keys()[vocab.values().index(v)]
  (0, 4573) 1
  (0, 5170) 1
  (0, 5700) 1
iphone
love
my
By calling transform on a given text such as "I love my iphone!!!", we get back a sparse matrix with the counts of each vocabulary word found. The original vocabulary that we fitted to the CountVectorizer is used: "iphone", "love", and "my" are each found once in our text. In (0, 4573), the 0 refers to the first (and only) sentence; if you added another sentence you would see a 1 representing the second one. 4573 is the index of "iphone", which you can verify by finding it in the print(count_vect.vocabulary_) output of the previous example. Note that "I" is not found because, by default, only tokens of 2 or more characters are included in the vocabulary, while the exclamation points in "iphone!!!" disappear because punctuation is completely ignored and always treated as a token separator.
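You can double-check these tokenization rules with the vectorizer's build_analyzer(), which returns the exact function CountVectorizer uses to turn a document into tokens:

analyze = count_vect.build_analyzer()
print analyze("I love my iphone!!!")
# [u'love', u'my', u'iphone'] -- 'I' is dropped and the '!!!' disappears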
In [48]:
# scikit/feature-extraction-3.py 
import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True) # this lowercase=True is not necessary because the default is True
count_vect.fit(fixed_text)

# turn the text into a sparse matrix of token counts
counts = count_vect.transform(fixed_text)

print(counts)
  (0, 168) 1
  (0, 430) 1
  (0, 774) 2
  (0, 2291) 1
  (0, 3981) 1
  (0, 4210) 1
  (0, 4573) 1
  (0, 4610) 1
  (0, 4678) 1
  (0, 5767) 1
  (0, 6479) 1
  (0, 7233) 1
  (0, 8077) 1
  (0, 8324) 1
  (0, 8703) 1
  (0, 8921) 1
  (0, 9063) 1
  (0, 9304) 1
  (0, 9374) 1
  (1, 313) 1
  (1, 527) 1
  (1, 644) 1
  (1, 677) 1
  (1, 774) 1
  (1, 876) 1
  : :
  (9090, 5802) 1
  (9090, 5968) 1
  (9090, 7904) 1
  (9090, 8324) 1
  (9090, 8563) 1
  (9090, 8579) 1
  (9090, 8603) 1
  (9090, 8617) 1
  (9090, 8667) 1
  (9090, 9159) 1
  (9090, 9358) 1
  (9090, 9372) 1
  (9090, 9403) 1
  (9090, 9624) 1
  (9091, 774) 1
  (9091, 1618) 1
  (9091, 3741) 1
  (9091, 4374) 1
  (9091, 5058) 1
  (9091, 5436) 1
  (9091, 5975) 1
  (9091, 7295) 1
  (9091, 8324) 1
  (9091, 8540) 1
  (9091, 9702) 1
In this example all 9,092 valid tweets are transformed (the row indices only run up to 9091 because they are 0-indexed).
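If the (row, column) dump is hard to read, the shape of the sparse matrix summarizes it; a quick check you can run after the cell above:

print counts.shape  # (9092, <vocabulary size>) -- one row per tweet, one column per vocabulary word
print counts.nnz    # the number of non-zero entries; every other cell in the matrix is zero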
In the next step we get to apply algorithms to our data. How do we decide which algorithm to use? One simple way is to use scikit-learn's cheat sheet. You can find it at http://scikit-learn.org/stable/tutorial/machine_learning_map/


We'll use a classifier in this next step. Classification is like Shazam (the music discovery app): the app was told what songs to identify, and when it hears a song it tries to match it to one of them. In this example we'll be training the program to learn what happy and sad look like, and when it sees a new sentence it will try to figure out whether to smile or not.
In [103]:
# classifier.py

counts = count_vect.transform(fixed_text)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(counts, fixed_target)

print nb.predict(count_vect.transform(["I love my iphone!!!"]))
print nb.predict(count_vect.transform(["I hate my iphone!!!"]))
['Positive emotion']
['Negative emotion']
You can see that we fit the classifier with all of our token count data as well as our target data. Using the Naive Bayes algorithm we are able to make some predictions. But how do we know how well the algorithm is working?
In [97]:
predictions = nb.predict(counts)
print sum(predictions == fixed_target) / float(len(fixed_target))
0.795094588649
Here we see that almost 80% of the predictions we made are correct. That's pretty good, right? But we made a rookie mistake: we tested with the same data we trained with. That doesn't prove much, since a model could simply parrot back the labels it has already seen and get a 100% prediction rate. What we really want is to try our trained model on yet-unseen data. So let's do it again, but this time train on the first 6k tweets and test on the rest (~3k).
In [107]:
nb.fit(counts[0:6000], fixed_target[0:6000])

predictions = nb.predict(counts[6000:])
print sum(predictions == fixed_target[6000:]) / float(len(fixed_target[6000:]))
0.611254851229
That's a more honest measurement. But this number means more if we compare it to some baseline. Let's compare it to a simple dummy 'most frequent' classifier, which will just blindly return the most frequent label (in this case "No emotion toward brand or product").
In [106]:
from sklearn.dummy import DummyClassifier

nb = DummyClassifier(strategy='most_frequent')
nb.fit(counts[0:6000], fixed_target[0:6000])
predictions = nb.predict(counts[6000:])

print sum(predictions == fixed_target[6000:]) / float(len(fixed_target[6000:]))
0.611254851229
So it turns out that, on this split, our classifier using Naive Bayes scores no better than a classifier that just blindly predicts the most frequent label.
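By the way, slicing at row 6000 by hand works, but scikit-learn ships a helper that shuffles and splits for you; a minimal sketch (test_size and random_state here are arbitrary choices, not part of the workshop code):

from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

train_counts, test_counts, train_target, test_target = train_test_split(
    counts, fixed_target, test_size=0.33, random_state=42)

nb = MultinomialNB()
nb.fit(train_counts, train_target)
print nb.score(test_counts, test_target)  # mean accuracy on the held-out third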

Cross Validation

Cross validation gives us a more accurate gauge of accuracy. It partitions the data into a certain number of pieces, then does many rounds, rotating which partitions are used to train and which are used to validate.
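To make the rotation concrete, here's what the folds look like for 6 data points and 3 partitions (illustration only; the workshop code below uses 10):

from sklearn.cross_validation import KFold

for train_idx, test_idx in KFold(6, n_folds=3):
    print "train:", train_idx, "validate:", test_idx
# train: [2 3 4 5] validate: [0 1]
# train: [0 1 4 5] validate: [2 3]
# train: [0 1 2 3] validate: [4 5]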
In [110]:
nb = MultinomialNB()
from sklearn import cross_validation
scores = cross_validation.cross_val_score(nb, counts, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.65824176  0.63076923  0.60659341  0.60879121  0.64395604  0.68901099
  0.70077008  0.66886689  0.65270121  0.62183021]
0.648153102333
In the above example we split the data into 10 pieces (cv=10) and do a K-Fold cross validation where 1 piece is used for validation while the other 9 pieces are used for training. You can see the results of each round and the mean of all the rounds. Once again we'll do the same cross validation with a baseline 'most_frequent' classifier:
In [128]:
nb = DummyClassifier(strategy='most_frequent')
scores = cross_validation.cross_val_score(nb, counts, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.59230769  0.59230769  0.59230769  0.59230769  0.59230769  0.59230769
  0.5929593   0.5929593   0.59316428  0.59316428]
0.592609330138

Pipelines

Pipelines are just some useful plumbing for chaining together multiple transformers and a final estimator. Notice that our code to create a CountVectorizer and apply Naive Bayes becomes much more compact:
In [138]:
from sklearn.pipeline import Pipeline

p = Pipeline(steps=[('counts', CountVectorizer()),
                    ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print p.predict(["I love my iphone!"])
['Positive emotion']

N-Grams

In the previous examples we've only built our vocabulary one word at a time. But there's a difference between someone saying "Great" and "Oh, Great". To get more accurate results we can take both 1-gram and 2-gram combinations:
In [156]:
p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                    ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print p.named_steps['counts'].vocabulary_.get(u'garage sale')
print p.named_steps['counts'].vocabulary_.get(u'like')
print len(p.named_steps['counts'].vocabulary_)
18967
28693
59616
Notice that the vocabulary is much larger than before.
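The same build_analyzer() trick from earlier shows what these extra vocabulary entries look like:

analyze = p.named_steps['counts'].build_analyzer()
print analyze("Oh, Great")
# [u'oh', u'great', u'oh great'] -- the 1-grams plus the new 2-gram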
In [140]:
scores = cross_validation.cross_val_score(p, fixed_text, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.68351648  0.66593407  0.65384615  0.64725275  0.68021978  0.69120879
  0.73267327  0.70517052  0.68026461  0.64829107]
0.678837748442
And our result using both 1- and 2-grams is a bit more accurate.

Feature Selection

You want to select the features or attributes that are the most predictive, either to boost performance or to make the results more accurate.
In [175]:
# feature_selection.py
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                    ('feature_selection', SelectKBest(chi2, k=10000)),
                    ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)

from sklearn import cross_validation

scores = cross_validation.cross_val_score(p, fixed_text, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.67032967  0.66813187  0.62087912  0.64285714  0.64945055  0.67912088
  0.67876788  0.6809681   0.66041896  0.63947078]
0.659039495078
In this case we kept only the 10k most predictive tokens. You can see that this actually lowered the accuracy a bit.
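If you're curious which tokens chi2 kept, you can ask the fitted pipeline; a quick sketch using get_support() on the feature selection step:

import numpy as np

words = np.array(p.named_steps['counts'].get_feature_names())
kept = p.named_steps['feature_selection'].get_support()  # boolean mask over all features
print words[kept][:20]  # a peek at 20 of the 10,000 surviving tokens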
In [177]:
from sklearn.grid_search import GridSearchCV

p = Pipeline(steps=[('counts', CountVectorizer()),
                    ('feature_selection', SelectKBest(chi2)),
                    ('multinomialnb', MultinomialNB())])


parameters = {
    'counts__max_df': (0.5, 0.75, 1.0),
    'counts__min_df': (1, 2, 3),
    'counts__ngram_range': ((1,1), (1,2))
    }

grid_search = GridSearchCV(p, parameters, n_jobs=1, verbose=1, cv=10)

grid_search.fit(fixed_text, fixed_target)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Fitting 10 folds for each of 18 candidates, totalling 180 fits
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  1.9min finished
Best score: 0.605
Best parameters set:
 counts__max_df: 0.5
 counts__min_df: 3
 counts__ngram_range: (1, 1)
This last step shows how to do a grid search. It tries out all possible combinations of the given parameters and returns the combination that gives us the best fit. In the example above there are 3 max_df options, 3 min_df options, and 2 ngram_range options; multiplying them together gives 3x3x2 = 18 candidates. All 18 are tried (each with 10-fold cross validation, hence the 180 fits), and the best score and best parameters are reported.
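One last note: the fitted grid_search object can be used directly as a model, since GridSearchCV refits the best parameter combination on the full dataset by default:

print grid_search.predict(["I love my iphone!!!"])  # predicts using the best estimator found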