Tuesday, November 29, 2016

Don't Settle for the Default

There are a lot of places in life where there is a standard curriculum to follow. For example, college is usually a four-year program, high school is four years, and you pay off a house in 30 or 15 years. Following these plans probably isn't bad, but keep one thing in mind: they are often catered to the lowest common denominator. In other words, they are designed so that even the weakest students can pace themselves and pass. Of course there are exceptions; I'm sure there are people who fail high school or never pay off their house. But the point is that the defaults are designed so that the masses will be successful.

So the next line of reasoning is that if you are average or above average and still follow the same course as the person who is below average, you are probably being sub-optimal about it. You aren't learning as much and aren't pushing yourself hard enough. In the case of the house, someone who can pay off their loan quicker saves a tremendous amount on the overall cost of the purchase. If you can graduate college or high school in two years, you may want to consider it. I say consider it because there are reasons (social ones, perhaps) to follow the normal course and not charge ahead. But at least don't forget to look at the default that is being given to you and question whether those settings are optimal for you. If not, then feel free to poke at the walls.
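To make the mortgage point concrete, here is a rough back-of-the-envelope sketch in Python. The $300,000 loan and 4% rate are made-up numbers for illustration, not figures from this post, and it ignores taxes, fees, and the opportunity cost of the money:

# Rough comparison of total interest paid on a 15- vs 30-year mortgage.
# The loan amount and rate are assumptions for illustration only.
principal = 300000.0
annual_rate = 0.04

def total_interest(principal, annual_rate, years):
    r = annual_rate / 12                           # monthly interest rate
    n = years * 12                                 # number of monthly payments
    payment = principal * r / (1 - (1 + r) ** -n)  # standard amortization formula
    return payment * n - principal

for years in (15, 30):
    print("%d-year loan: about $%.0f in total interest" % (years, total_interest(principal, annual_rate, years)))

In this sketch the 15-year loan pays less than half the total interest of the 30-year loan, which is the kind of saving the default schedule quietly gives up.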

Saturday, November 26, 2016

It Won't Make You Happy

Black Friday and Cyber Monday are here again, and there are millions of products marked down. Even if you didn't want to contend with the traffic and long lines, you could go online or use your mobile app and buy that product with a simple click. But before buying anything, let's remind ourselves again why we do what we do.

Why do we buy any product? At its simplest, it is because it makes us happy. This might be by making our lives easier (a tool) so we can do more of the things we love. Or it might be to make us look more attractive, or to generate a feeling of respect. It's helpful to take a look at Maslow's Hierarchy, since different products address different levels of needs.


Basic food and water satisfy our physiological needs. I assume you aren't doing your Black Friday shopping at Trader Joe's, so we know these products aren't satisfying your physiological needs. So where do they fall on the triangle?


Yesterday was Black Friday and I was asked to go shopping with some friends. I hadn't seen them in a while, so I decided to come hang out. We headed over to the San Francisco Outlets (which are actually in Livermore). Along the drive we came to a standstill of cars, bumper to bumper, 30 minutes from the mall's parking lot. This was amazing to me: not only were people rushing to the mall, they had to wait in line just to get there. After we waited again to find parking and started walking around, I noticed still more lines, lines that stretched outside the stores and ran two or three storefronts long. Many of these were high-end stores such as Kate Spade, Gucci, and Ferragamo. So after all this, people were presented the opportunity to buy expensive items at a cheaper price. Why would people do this?

I think consumerism really targets the Esteem and sometimes the Love/Belonging levels of the pyramid. The promise is that if you buy this product, you will get confidence and respect from yourself and others. On its surface this sounds like a great deal. But at what cost?

Well, if you look two levels down you see Safety, which entails security and resources. The act of buying something means trading one resource for another. When you buy a high-end luxury item you are trading some Safety for Esteem. If you have a very wide base and plenty of resources, this seems like a pretty reasonable trade (I'd argue there are other ways to build Esteem worth considering, but we'll leave it at that for today). The much bigger issue is when you don't have an excess of resources and you still make the trade. You are sacrificing lower-level needs for higher-level ones, possibly inverting the middle of your pyramid:


Maslow's Fish

"Maslow's theory suggests that the most basic level of needs must be met before the individual will strongly desire (or focus motivation upon) the secondary or higher level needs." This might still be true, but Maslow's Fish might represent some people not building up such a strong foundation of Safety before reaching for Esteem.

Ok, maybe that looks a bit ridiculous, but consider the number of people who hate their jobs but aren't able to change because they say they can't make ends meet. In the meantime they drive their BMWs and buy expensive Gucci bags to build their confidence and show the world that they are indeed successful. The allure of the higher rungs of the pyramid is so powerful that in many cases we even let it erode our physiological-level needs. That might be an example of trading Physiological needs for Safety, but it's possible that it is just a reaction to the prior trade of Safety for Esteem. Our resources are depleted, so we trade our time and energy to work harder for more resources to keep our pyramid standing.

I'm not hating on consumerism; everyone is free to choose whatever they want. But I think it's a good idea to take a step back and look at the big picture. See our wants and needs as a whole and how the other levels of needs might be affected. Think about whether that product really does have the effect that is promised and what trade-offs are being made.



Wednesday, November 23, 2016

Giving Thanks

Thanksgiving is a tradition where we get together with loved ones and appreciate each other. It's a good time to reflect on our lives. We all have so much to be thankful for. I think it tends to be one of the most joyful holidays because (as the name implies) it kind of pushes you to practice gratitude. And as Tony Robbins puts it, as humans we are wired so that we cannot experience negative emotions like anger and sadness at the same time as gratitude. So in order to go through life happier and with more appreciation, in addition to reflecting on things we are grateful for during Thanksgiving, we should try to make it a daily habit.

This Thanksgiving I've decided to create a list of 100 things I am thankful for:

  1. Immediate and extended family, including Mom, Dad, and my brother, as well as those cousins that I only see once a year.
  2. My friends that have had my back through thick and thin
  3. All the people that I have crossed paths with whether that might be a random conversation with a stranger or learning something from a coworker
  4. This laptop that I am writing on, so reliable and still going strong after 7 years!
  5. My Nexus 5 phone with a crack in the screen
  6. The podcasts that I listen to on it (particularly Jocko, Tim Ferris, Security Now)
  7. Google (and the internet)
  8. These Bose bluetooth headphones that were given to me from hired.com (as a gift even though I was not hired through them)
  9. Hiking up Mission Peak. I can walk there from my parents' house.
  10. Headspace (app), mindfulness, meditation
  11. Häagen-Dazs ice cream
  12. Baking breakfast muffins with my mom
  13. A nice home cooked meal
  14. Coffee - this list isn't in order, but this would be high up there
  15. My amazing place in the heart of San Francisco
  16. Amazingly good health
  17. My latest job and everyone who contributed to my growth and development
  18. Myself being a minimalist simple person who needs very little to be happy - lots of credit to my parents for raising me this way
  19. Sports. Both playing and watching competition
  20. Assorted nuts (from Costco)
  21. My other bluetooth sports headphones that I can run with or take to the gym and not have to deal with wires
  22. Having the opportunity to take a 3 month South America trip
  23. Get togethers with friends
  24. Chicken wings
  25. Baths
  26. Going camping in the desert and getting away from it all
  27. Writing code (and being able to make a living on it)
  28. Design discussions with a particular coworker
  29. Not being perfect and having a bunch of failures but being able to bounce back
  30. YouTube - I know I already put Google but particular shout out
  31. Blogging
  32. The library
  33. Seth Godin - learned a lot from his ideas
  34. Hacker news
  35. Reading other people's blogs
  36. Coding contests
  37. Slack (the chat service)
  38. Rock climbing
  39. Running
  40. A glass of wine
  41. That feeling of cool wind blowing against my skin
  42. A warm shower
  43. Believing that I am a badass
  44. Doing a pullup
  45. Chicken
  46. Intellij IDEA
  47. Trader Joe's - particularly the one a couple blocks from my house
  48. Sparkling water
  49. Blueberries
  50. Music
  51. Sunny days, sunrises
  52. My camera - storing memories
  53. Wifi
  54. Beaches
  55. Electricity
  56. A made bed
  57. Life being hard, challenges
  58. Good eyesight
  59. My other senses (touch, smell, taste, hear)
  60. My thoughts
  61. Exercise
  62. Sunsets
  63. Anxiety
  64. The feeling of confidence
  65. Fear
  66. Pets - cats and dogs
  67. Women of all kinds
  68. Fruits
  69. Electricity
  70. Fire
  71. A warm blanket
  72. Laughter
  73. Motivation
  74. Habits
  75. Smiling
  76. Laughing
  77. Crying
  78. Both good and bad things in life
  79. Nature
  80. Having a good conversation with someone
  81. All the memories that I have
  82. Helping others
  83. Receiving help from others
  84. The rain
  85. Puzzles
  86. Running shorts
  87. The feeling of sweating during a good workout
  88. Traveling
  89. The feeling of getting stronger
  90. Still having hair left
  91. Being born beautiful
  92. Myself
  93. A good beer
  94. Being very comfortable by myself
  95. Getting a good night's sleep
  96. Snow and snowboarding
  97. Brunch
  98. A good pair of shoes
  99. Being up early and enjoying the morning
  100. Putting in work and getting after it


Monday, November 21, 2016

Derek Sivers on the Tim Ferris Podcast

Love the Tim Ferris podcast. So many gold nuggets. I particularly like it when Derek Sivers is on, as he explains things so simply and elegantly:
  • It's not what you know, it's what you do and what you practice every day.
  • Would we still call Richard Branson successful if his goal was to live a quiet life but as a compulsive gambler he keeps creating companies?
  • Early in your career, say yes to everything. You don't know which ones are the lottery tickets.
  • The standard pace is for chumps. Go as fast as you want. Why would you graduate in 4 years if you can do it in 2? The pace of 4 years was geared for the lowest common denominator.
  • Don't be a donkey. Think long term. That means you can decide something now. (The fable is that a donkey stands between hay and water, unable to choose, and eventually dies; a donkey can't think long term enough to figure out that it can have both.)
  • Hell yes or no. We say yes too much and let too many mediocre things fill our lives.
  • Busy implies out of control. If you keep saying "I'm too busy, I have no time for this", it seems like you lack control over your time.
  • When making a business, sometimes we focus on the big things but sometimes something small (like a special email to make someone smile) will have a big impact.

Sunday, November 20, 2016

Take the Training Wheels Off

When you are struggling with or learning something, it is sometimes useful to introduce some structure or positive constraints. This might mean working in 30-minute pomodoros, tracking every dollar spent, or counting calories. However, as you develop the habit or skill, sometimes this structure turns around and becomes a limit on further growth. Think of training wheels. In the beginning we need them while we train our coordination, balance, and confidence. It doesn't take long for us to outgrow them. Soon the support turns into a crutch. We could still ride just fine with those training wheels, but they would limit our speed and turning ability.

It's important to build discipline and good habits, and most of us probably never do enough of that. But it's also important to step back and look at the reasons for setting up these systems in the first place. Sometimes the original reasons no longer apply, and we need to adapt again to continue to grow.

I've personally felt this with pomodoros. At first I felt undisciplined and always distracted, so I decided to do focused work in 30-minute intervals. It certainly helped me be more productive, but I also felt it limited me, as I would get pulled out of my flow state at each interval. Sometimes I can work for hours straight and not feel distracted or bored. So I decided to take off the training wheels. Things feel more natural and I am as productive as ever. However, I still feel there are good situational uses for pomodoros; if I'm ever feeling lazy or distracted I might go back to the technique for a spark.

Monday, November 14, 2016

Limits To Growth

I love thinking about systems. The Fifth Discipline has some great insight into common systems that occur in different contexts in our lives. One of its concepts is that in order to grow, you can't just push growth. The world has underlying systems, and these systems have limits. At some point, if you push too much, the system resists and pushes back.

For example, let's look at growing the engineering org at a typical company. You do so by making more hires. This in turn makes the complexity of the system more difficult to manage. The senior engineers get pushed into management and thus pulled away from developing product. At first there is a very noticeable increase in productivity. This is good, so you add more people. At some point the system becomes quite complex, and the growing hierarchy and difficulty of communication mean that the rate of getting things done slows to a crawl. You can't just keep growing in the same way. The system has reached a limit.

Another example is dieting. If you go on a diet you will probably see immediate results, but before long your rate of weight loss will decrease and you may hit a limit on how much weight you can lose.

A naive response is to push harder on the system: add more workers, or push the diet even harder. But this solves nothing, because the limits have already been reached. More workers will cause morale issues as people realize it is impossible to get things done and are overburdened by complexity. Dieters will be overcome by hunger pangs and unable to cut calories any further.

In each of these situations the solution is not to push growth but to take a step back and see what the limits on the system are. Focus on removing the limits. This might mean decentralizing the hierarchy or starting an exercise plan in addition to the diet. It might mean keeping the management hierarchy flat and introducing a new software management methodology. It might mean changing the diet completely; maybe your body responds differently to the Paleo diet than to the Ketogenic diet.

Whenever you see a situation where there is growth at first but then the growth slows to a halt, it is likely that the system is hitting a limit. The way past that limit is not more of what was successful before, but changing the system in a way that raises the limit.

Sunday, November 13, 2016

Dangerous vs Scary

On NPR's How I Built This podcast, Jim Koch, the founder of Sam Adams, talks about the difference between scary and dangerous. Many things that are scary to us are not dangerous. Conversely, many things that are dangerous are not scary.

He gives the example that rappelling off a cliff is scary, but you are held by a belay rope that could hold up a car, so it is not really dangerous. Walking near a snowy mountain when the weather heats up probably isn't scary, but it is really dangerous because it could trigger an avalanche. Not wearing sunscreen at the beach is another thing that isn't scary, but is dangerous.

Jim then explains that staying at Boston Consulting Group would not have been scary but would have been very dangerous: one day, at 65, he'd look back at his life and see that he had wasted it by not doing something that made him happy.

Although I can honestly say that I loved my job, made great friendships, and learned so much from the experience, I knew it was time to leave. Towards the end the winds changed. There was a change in management, the community broke down, and the people I had shed blood and sweat with started leaving. My rate of growth slowed down a lot. Although I made good money there, I realized that it would be really dangerous to stay. Quitting my job would be scary, but I'd ultimately be fine and I'd gain new skills and confidence. I'd be more antifragile.

Whenever something is scary, we should also ask if it is dangerous. If not, then don't be afraid to take the leap.

Wednesday, November 9, 2016

Hyperfocus

I've noticed a pattern across a few notably successful people. When they are very interested in something they get into a flow and hyper-focus on it.

For example, when Elon Musk got his first computer it came with a BASIC programming workbook. It was supposed to take six months to finish, but he obsessed over it and spent the next three days coding straight through until he finished it.

Derek Sivers, founder of CD Baby, says that he doesn't have morning rituals because he gets so focused on what he does. For example, he spent five months programming SQL every waking hour, stopping for only an hour or two. After the project was over he hyper-focused on the next thing. When he started CD Baby he did almost nothing else but work on his company from 7am to midnight.

In addition he mentions that when people are asked about their general happiness, those who have been in the flow more often will report higher happiness.

Personally, I have always followed a to-do-list style of approach where I would tick off 5-10 tasks a day. However, I'm going to really focus on doing just one thing, to put myself in an environment to experience flow more often and to stay in tune with my interests without stopping myself from pursuing them.

I'm interested in seeing how this goes as opposed to my current busy (but possibly running-in-place?) schedule. So I'm going to run an experiment: for the next month I'll stay off the Trello board and skip daily planning completely. Curious to see how it goes.

Tuesday, November 8, 2016

Machine Learning Introduction


*Interested in Machine Learning? I recommend starting with Google's free course. It is very detailed and will make this post make a little bit more sense. My notes on it are here.

I had the privilege to join a workshop on Machine Learning hosted by Galvanize and CrowdFlower using the Scikit-learn toolkit. Below are my notes using Jupyter Notebook.

You can get the course material at https://github.com/lukas/scikit-class. Follow the README there to download the code and set up all the libraries (including scikit-learn, the library that contains the machine learning tools we'll be using).

Ok ready? Let's go!

Start by running scikit/feature-extraction-1.py
In [1]:
# First attempt at feature extraction
# Leads to an error, can you tell why?

import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

from sklearn.feature_extraction.text import CountVectorizer

count_vect=CountVectorizer()
count_vect.fit(text)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-ad8b07653e46> in <module>()
     12 
     13 count_vect=CountVectorizer()
---> 14 count_vect.fit(text)

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
    794         self
    795         """
--> 796         self.fit_transform(raw_documents)
    797         return self
    798 

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    822 
    823         vocabulary, X = self._count_vocab(raw_documents,
--> 824                                           self.fixed_vocabulary_)
    825 
    826         if self.binary:

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    750         for doc in raw_documents:
    751             feature_counter = {}
--> 752             for feature in analyze(doc):
    753                 try:
    754                     feature_idx = vocabulary[feature]

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in <lambda>(doc)
    239 
    240             return lambda doc: self._word_ngrams(
--> 241                 tokenize(preprocess(self.decode(doc))), stop_words)
    242 
    243         else:

/home/jerry/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in decode(self, doc)
    119 
    120         if doc is np.nan:
--> 121             raise ValueError("np.nan is an invalid document, expected byte or "
    122                              "unicode string.")
    123 

ValueError: np.nan is an invalid document, expected byte or unicode string.
There's an exception already?! This throws an exception because there is missing information on line 8 of tweets.csv. If you look at that file you will see the row has no tweet text at all. You'll find this is extremely common in data science; half the battle is manipulating the data into something you can use. To fix it, the pandas library provides a convenient notnull() function that works on arrays. Here's an example of how this works:
In [40]:
import pandas as pd
#pandas has a special type of object called a Series object
s = pd.Series(['apple','banana','cat','dog','elephant','fish']) 
print type(s)
print
print s
print

# you can pass a list of booleans to this series object to include or exclude an index.
print s[[True,False,True]] 
print

# in our example above the extracted tweet_text is also in the same Pandas Series object
df = pd.read_csv('tweets.csv')
text = df['tweet_text']
print type(text)
print 

# pandas.notnull returns a boolean array with False values where values are null
print pd.notnull(['apple','banana', None, 'dog',None,'fish']) 
print

#Thus combining the Series datatype and pandas.notnull, you can exclude null values.
print s[pd.notnull(['apple','banana', None, 'dog',None,'fish'])]
print
<class 'pandas.core.series.Series'>

0       apple
1      banana
2         cat
3         dog
4    elephant
5        fish
dtype: object

0    apple
2      cat
dtype: object

<class 'pandas.core.series.Series'>

[ True  True False  True False  True]

0     apple
1    banana
3       dog
5      fish
dtype: object

In [41]:
# scikit/feature-extraction-2.py 
# second attempt at feature extraction

import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

# what did we do here?
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
count_vect.fit(fixed_text)

# print the number of words in the vocabulary
print(count_vect.vocabulary_)
{u'unscientific': 9042, u'hordes': 4175, u'pickmeupanipad2': 6385, u'yellow': 9608, u'four': 3434, u'prices': 6652, u'woods': 9501, u'hanging': 3940, u'16mins': 70, u'looking': 5143, u'html5': 4215, u'gad': 3543, u'eligible': 2846, u'gadgetoverload': 3546, u'insertion': 4461, u'lori': 5154, u'sxswdad': 8340, u'lord': 5152, u'newmusic': 5809, u'dynamic': 2743, u'bergstrom': 1065, u'dell': 2351, u'rancewilemon': 6892, u'leisurely': 4985, u'bringing': 1305, u'basics': 971, u'prize': 6675, u'customizable': 2213, u'wednesday': 9356, u'oooh': 6028, ... output truncated, its quite long ... }
CountVectorizer converts the text to token counts. The fit() function applies our tweet data to the CountVectorizer. If you look at the vocabulary_ of count_vect you'll see each word lowercased and assigned to an index.
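As a quick aside (using a toy corpus, not the tweet data), here is what that looks like on two short sentences; the index for each token comes from its position in the alphabetically sorted vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

# Toy example: fit on two tiny documents and inspect the learned vocabulary.
toy_vect = CountVectorizer()
toy_vect.fit(["I love my iPhone", "I love my Android phone"])
print(toy_vect.vocabulary_)
# e.g. {u'android': 0, u'iphone': 1, u'love': 2, u'my': 3, u'phone': 4}
# "I" is dropped because the default token pattern requires 2+ characters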
Before you take a look at scikit/feature-extraction-3.py, it's worth looking at this next example, as it's a simplified version.
In [44]:
import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True) # this lowercase=True is not necessary because the default is True
count_vect.fit(fixed_text)

transformed = count_vect.transform(["I love my iphone!!!"])
print transformed

vocab = count_vect.vocabulary_
for v in transformed.indices:
    print vocab.keys()[vocab.values().index(v)]
  (0, 4573) 1
  (0, 5170) 1
  (0, 5700) 1
iphone
love
my
By calling transform on a given text such as "I love my iphone!!!", a matrix is returned with the counts of each vocabulary word found. The original vocabulary that we fitted to the CountVectorizer is used. "iphone", "love", and "my" are each found once in our "I love my iphone!!!" text. In (0, 4573), the 0 refers to the first sentence, since we only passed one sentence; if you added another sentence you would see a 1 representing the second sentence. 4573 is the index of "iphone", which you can verify by finding it in the print(count_vect.vocabulary_) output of the previous example. Note that "I" is not found because, by default, only tokens of 2 or more characters are included in the vocabulary, and the exclamation points in "iphone!!!" are ignored since punctuation is always treated as a token separator.
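If you want to see those tokenization rules in action without digging through the matrix, CountVectorizer exposes its analyzer; a quick check on a toy string (not part of the workshop code):

from sklearn.feature_extraction.text import CountVectorizer

# The default analyzer lowercases, splits on punctuation, and drops 1-character tokens.
analyze = CountVectorizer().build_analyzer()
print(analyze("I love my iphone!!!"))
# [u'love', u'my', u'iphone']  -- "I" is dropped and the "!!!" disappears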
In [48]:
# scikit/feature-extraction-3.py 
import pandas as pd
import numpy as np

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True) # this lowercase=True is not necessary because the default is True
count_vect.fit(fixed_text)

# turns the text into a sparse matrix
counts = count_vect.transform(fixed_text)

print(counts)
  (0, 168) 1
  (0, 430) 1
  (0, 774) 2
  (0, 2291) 1
  (0, 3981) 1
  (0, 4210) 1
  (0, 4573) 1
  (0, 4610) 1
  (0, 4678) 1
  (0, 5767) 1
  (0, 6479) 1
  (0, 7233) 1
  (0, 8077) 1
  (0, 8324) 1
  (0, 8703) 1
  (0, 8921) 1
  (0, 9063) 1
  (0, 9304) 1
  (0, 9374) 1
  (1, 313) 1
  (1, 527) 1
  (1, 644) 1
  (1, 677) 1
  (1, 774) 1
  (1, 876) 1
  : :
  (9090, 5802) 1
  (9090, 5968) 1
  (9090, 7904) 1
  (9090, 8324) 1
  (9090, 8563) 1
  (9090, 8579) 1
  (9090, 8603) 1
  (9090, 8617) 1
  (9090, 8667) 1
  (9090, 9159) 1
  (9090, 9358) 1
  (9090, 9372) 1
  (9090, 9403) 1
  (9090, 9624) 1
  (9091, 774) 1
  (9091, 1618) 1
  (9091, 3741) 1
  (9091, 4374) 1
  (9091, 5058) 1
  (9091, 5436) 1
  (9091, 5975) 1
  (9091, 7295) 1
  (9091, 8324) 1
  (9091, 8540) 1
  (9091, 9702) 1
In this example each of the 9,092 valid tweets (rows 0 through 9091) is transformed.
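A quick sanity check (reusing the counts and count_vect variables from the cells above) is to look at the shape of the sparse matrix, which should be (number of valid tweets, vocabulary size):

# Rows are tweets, columns are vocabulary tokens.
print(counts.shape)
print(len(count_vect.vocabulary_))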
In the next step we get to apply algorithms to our data. How do we decide which algorithm to use? One simple way is to use this cheat sheet. You can find it at http://scikit-learn.org/stable/tutorial/machine_learning_map/


We'll use a classifier in this next step. Classification is like Shazam (the music discovery app): the app was told what songs to identify, and when it hears a song it tries to match it to one of them. In this example we'll be training the program to know what happy and sad look like, and when it sees a new sentence it will try to figure out whether to smile or not.
In [103]:
# classifier.py

counts = count_vect.transform(fixed_text)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(counts, fixed_target)

print nb.predict(count_vect.transform(["I love my iphone!!!"]))
print nb.predict(count_vect.transform(["I hate my iphone!!!"]))
['Positive emotion']
['Negative emotion']
You can see that we have added our target data as well as all of our token count data. Using the Naive Bayes algorithm we are able to make some predictions. But how do we know how well the algorithm is working?
In [97]:
predictions = nb.predict(counts)
print sum(predictions == fixed_target) / float(len(fixed_target))
0.795094588649
Here we see that almost 80% of the predictions we have made are correct. That's pretty good, right? But we made a rookie mistake: we tested with the same data we trained with. This doesn't tell us much, since a model could simply parrot back the results it has already seen and get a 100% prediction rate. What we really want is to evaluate the trained model on data it hasn't seen yet. So let's do it again, but this time train on the first 6,000 rows of data and test on the rest (~3,000).
In [107]:
nb.fit(counts[0:6000], fixed_target[0:6000])

predictions = nb.predict(counts[6000:])
print sum(predictions == fixed_target[6000:]) / float(len(fixed_target[6000:]))
0.611254851229
That is a more honest estimate. But this number means more if we compare it to some baseline. Let's compare it to a simple dummy 'most frequent' classifier, which blindly returns the most frequent label (in this case, "No emotion toward brand or product").
In [106]:
from sklearn.dummy import DummyClassifier

nb = DummyClassifier(strategy='most_frequent')
nb.fit(counts[0:6000], fixed_target[0:6000])
predictions = nb.predict(counts[6000:])

print sum(predictions == fixed_target[6000:]) / float(len(fixed_target[6000:]))
0.611254851229
So on this split our Naive Bayes classifier scores barely better than a classifier that blindly predicts the most frequent label (in the cross validation below it comes out roughly 5% ahead).
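As an aside, rather than slicing off the first 6,000 rows by hand, scikit-learn can produce a shuffled train/test split. A quick sketch, reusing counts and fixed_target from above (the 0.33 test size and random_state here are arbitrary choices of mine):

from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Shuffle the rows and hold out a third of them for testing.
train_X, test_X, train_y, test_y = train_test_split(
    counts, fixed_target, test_size=0.33, random_state=42)

clf = MultinomialNB()
clf.fit(train_X, train_y)
predictions = clf.predict(test_X)
print(sum(predictions == test_y) / float(len(test_y)))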

Cross Validation

Cross validation gives us a more reliable gauge of accuracy. It partitions the data into a certain number of pieces, then runs many rounds, rotating which partitions are used to train and which are used to validate.
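Conceptually, here is roughly what that looks like by hand, reusing counts and fixed_target from above. This is a simplified sketch; the real cross_val_score below also stratifies the folds by label and handles the bookkeeping for us:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Split the row indices into 10 chunks; hold out one chunk for validation
# and train on the other nine, rotating through all 10 chunks.
n_rows = counts.shape[0]
folds = np.array_split(np.arange(n_rows), 10)
scores = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    clf = MultinomialNB()
    clf.fit(counts[train_idx], fixed_target.values[train_idx])
    preds = clf.predict(counts[val_idx])
    scores.append(np.mean(preds == fixed_target.values[val_idx]))
print(np.mean(scores))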
In [110]:
nb = MultinomialNB()
from sklearn import cross_validation
scores = cross_validation.cross_val_score(nb, counts, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.65824176  0.63076923  0.60659341  0.60879121  0.64395604  0.68901099
  0.70077008  0.66886689  0.65270121  0.62183021]
0.648153102333
In the above example we split the data into 10 pieces (cv=10) and use a k-fold cross validator where 1 piece of data is used for validation while the other 9 pieces are used for training. You can see the results of each round and the mean of all the rounds. Once again we'll do the same cross validation with a baseline 'most_frequent' classifier.
In [128]:
nb = DummyClassifier(strategy='most_frequent')
scores = cross_validation.cross_val_score(nb, counts, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.59230769  0.59230769  0.59230769  0.59230769  0.59230769  0.59230769
  0.5929593   0.5929593   0.59316428  0.59316428]
0.592609330138

Pipelines

Pipelines are just some useful plumbing to chain together multiple transformers. Notice that our code to create a CountVectorizer and apply Naive Bayes becomes much more compact:
In [138]:
from sklearn.pipeline import Pipeline

p = Pipeline(steps=[('counts', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print p.predict(["I love my iphone!"])
['Positive emotion']

N-Grams

In the previous examples we've built our vocabulary one word at a time. But there's a difference between someone saying "Great" and "Oh, Great". To get more accurate results we can take both 1-gram and 2-gram combinations.
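To see concretely what ngram_range=(1, 2) produces, here is the analyzer run on a toy phrase (not the tweet data): it emits every single word plus every adjacent pair.

from sklearn.feature_extraction.text import CountVectorizer

# With ngram_range=(1, 2) we get the unigrams followed by the bigrams.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyze("oh great just great"))
# [u'oh', u'great', u'just', u'great', u'oh great', u'great just', u'just great']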
In [156]:
p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)
print p.named_steps['counts'].vocabulary_.get(u'garage sale')
print p.named_steps['counts'].vocabulary_.get(u'like')
print len(p.named_steps['counts'].vocabulary_)
18967
28693
59616
Notice that the vocabulary is much larger than before.
In [140]:
scores = cross_validation.cross_val_score(p, fixed_text, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.68351648  0.66593407  0.65384615  0.64725275  0.68021978  0.69120879
  0.73267327  0.70517052  0.68026461  0.64829107]
0.678837748442
And our result using both 1-grams and 2-grams is a bit more accurate.

Feature Selection

You want to select the features or attributes that are the most predictive, either to boost performance or to make the results more accurate.
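Before applying this to the tweets, here is a toy illustration of what SelectKBest with the chi2 score does: it scores each token by how strongly it is associated with the labels and keeps only the top k (the toy sentences and k=2 are my own, not from the workshop):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["love this phone", "love it", "hate this phone", "hate it so much"]
labels = ["pos", "pos", "neg", "neg"]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# Keep the 2 tokens whose counts are most strongly associated with the labels.
selector = SelectKBest(chi2, k=2)
selector.fit(X, labels)

feature_names = np.array(vect.get_feature_names())
print(feature_names[selector.get_support()])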
In [175]:
# feature_selection.py
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

p = Pipeline(steps=[('counts', CountVectorizer(ngram_range=(1, 2))),
                ('feature_selection', SelectKBest(chi2, k=10000)),
                ('multinomialnb', MultinomialNB())])

p.fit(fixed_text, fixed_target)

from sklearn import cross_validation

scores = cross_validation.cross_val_score(p, fixed_text, fixed_target, cv=10)
print scores
print scores.mean()
[ 0.67032967  0.66813187  0.62087912  0.64285714  0.64945055  0.67912088
  0.67876788  0.6809681   0.66041896  0.63947078]
0.659039495078
In this case we kept only the 10,000 most predictive tokens. You can see that this actually lowered the accuracy.
In [177]:
from sklearn.grid_search import GridSearchCV

p = Pipeline(steps=[('counts', CountVectorizer()),
                ('feature_selection', SelectKBest(chi2)),
                ('multinomialnb', MultinomialNB())])


parameters = {
    'counts__max_df': (0.5, 0.75, 1.0),
    'counts__min_df': (1, 2, 3),
    'counts__ngram_range': ((1,1), (1,2))
    }

grid_search = GridSearchCV(p, parameters, n_jobs=1, verbose=1, cv=10)

grid_search.fit(fixed_text, fixed_target)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Fitting 10 folds for each of 18 candidates, totalling 180 fits
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:  1.9min finished
Best score: 0.605
Best parameters set:
 counts__max_df: 0.5
 counts__min_df: 3
 counts__ngram_range: (1, 1)
This last step shows how to do a grid search. It tries out all possible combinations of the given parameters and returns the parameters that give us the best fit. In the example above there are 3 max_df options, 3 min_df options, and 2 ngram_range options; multiplying them together gives us 3x3x2 = 18 candidates. All 18 are tried, and the best score and best parameters are given.
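Since refit=True is the default, the fitted grid_search object has already been retrained on all the data with the best parameter combination, so it can be used as a classifier directly, for example:

# grid_search.predict uses the pipeline refit with the best parameters found above.
print(grid_search.predict(["I love my iphone!!!"]))
print(grid_search.predict(["I hate my iphone!!!"]))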