Monday, April 2, 2018

Google's Machine Learning Crash Course

Machine Learning is taking over the world. Google is offering a free Machine Learning Crash Course. I finished it over 2 days and enjoyed it a lot, particularly the exercises where you can tweak parameters and see the result as your model is training.

Here are my notes.

ML Concepts


Introduction to ML


Machine learning makes us think in different ways; specifically, it shifts programming from a mathematical science to a natural science. Machines can do things that we have no idea how to do ourselves.

Framing


Some terminology.

Supervised ML produces predictions on never-before-seen data (much more common than unsupervised)
Unsupervised ML is like true AI: it can identify patterns without human guidance, for example analyzing a group of data and coming up with its own way of classifying it.

An example is a piece of data. Examples have Features which describe that data (name, subject, body text, etc.). The example is Labeled if someone labels it (e.g. spam or not spam). If no one (or nothing) has labeled that example for us, then it is unlabeled.

A Model is the thing we are trying to create through ML: a map of examples to predicted labels.

There are different types of models

Regression model: predicts continuous values (what's the probability of X? what's the value of y?)
Classification: hot dog, not hot dog

Descending into ML


Linear regression is a method for finding a line that best fits the data points: something you did in high school Algebra 1. If you remember, the equation for a line is y = mx + b. Machine learning uses the same equation but different terminology.

y' = b + w1x1

y' is the predicted label
b is the bias. In some ML docs it will be written as w0
w1 is the weight of feature 1. Weight is the same concept as slope.
x1 is a feature
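
Stripped of everything else, the model is just this equation evaluated with learned values. A tiny sketch with made-up numbers (not course code):

# Hypothetical learned parameters: bias (b) and weight (w1).
b = 0.5
w1 = 2.0

def predict(x1):
    """Predicted label y' for a single feature value x1."""
    return b + w1 * x1

print(predict(3.0))  # 0.5 + 2.0 * 3.0 = 6.5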


The Loss for a data point is the difference between the line's prediction and that data point. You can think of it as a penalty for a bad prediction on that example.

L2 Loss (squared loss) is the square of that difference for a given example. The average of the squared loss over all the examples is called Mean Squared Error (MSE).
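
Here's a quick sketch of squared loss and MSE with made-up predictions and labels:

# Squared loss (L2 loss) for one example: (prediction - label) ** 2.
# MSE is the mean of the squared losses over all examples.
predictions = [1.0, 2.5, 0.3]
labels      = [1.5, 2.0, 0.0]

squared_losses = [(p - y) ** 2 for p, y in zip(predictions, labels)]
mse = sum(squared_losses) / len(squared_losses)
print(mse)  # (0.25 + 0.25 + 0.09) / 3 ≈ 0.197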

Just like you train in the gym with weights, training a model means figuring out good values for the weights and bias. In supervised learning it is all about minimizing loss, also called empirical risk minimization.


Reducing Loss


How do we choose the model that minimizes loss? One way to do this is in small steps (gradient steps) that minimize loss, called Gradient Descent.

So you pick initial starting values for your model. For linear regression that would be the bias (b) and weight (w1). Then you would run the calculation over each point (or a batch of points) of your dataset to calculate the MSE.

From there you would compute parameter updates: this is done using an algorithm to calculate gradients. It's math, but the general intuition isn't really complicated. Remember that a partial derivative just tells us how much the result changes as one input changes. A gradient is just the partial derivative for each independent variable in the function. So if your function has x and y, your gradient has 2 values: the derivative of f with respect to x, and the derivative of f with respect to y. If we know the change with respect to x and the change with respect to y, we can combine that with where we currently are to calculate the general direction of where we want to go. This is called the Directional derivative.

So we can use this Directional derivative to choose better values for the bias and weight, then run the calculation again to get better results. Each one of these cycles is one step. When the loss stops changing or changes extremely slowly, the model has converged.
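
Here's a toy sketch of that loop for the linear model (my own example, not course code): each step computes the partial derivatives of MSE with respect to w1 and b, then moves both a small step against the gradient.

# Toy gradient descent for y' = b + w1 * x1 with MSE loss.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # true relationship: y = 1 + 2x

w1, b = 0.0, 0.0
learning_rate = 0.05

for step in range(1000):
    n = len(xs)
    # Partial derivatives of MSE with respect to w1 and b.
    grad_w1 = sum(2 * (b + w1 * x - y) * x for x, y in zip(xs, ys)) / n
    grad_b  = sum(2 * (b + w1 * x - y)     for x, y in zip(xs, ys)) / n
    # Step against the gradient (it points uphill, we want downhill).
    w1 -= learning_rate * grad_w1
    b  -= learning_rate * grad_b

print(w1, b)  # converges toward w1 ≈ 2, b ≈ 1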

You want to set your Learning Rate, or step size, to something efficient. Too low and you take too many steps and learn too slowly; too high and your steps dance around your goal and you may never hit your loss minimum.

Keep in mind that some problems will have more than one minimum so in these problems your starting values are important.

With large datasets, recalculating the MSE over all data points at every step is too computationally intensive. You can get nearly as good results if you just do it over one example (Stochastic Gradient Descent, or SGD) or a batch of examples (Mini-Batch Gradient Descent). You can think of SGD as just a mini-batch of 1.
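
Continuing the toy sketch above, the only thing SGD or mini-batch changes is which examples feed each step's gradient:

import random

# Hypothetical dataset of (x, y) pairs; in practice this would be huge.
dataset = [(float(i), 1.0 + 2.0 * i) for i in range(10000)]

BATCH_SIZE = 32   # 1 would be pure SGD; len(dataset) would be full-batch

# Each gradient step uses only a random sample of examples.
batch = random.sample(dataset, BATCH_SIZE)
# ...compute grad_w1 and grad_b over `batch` only, then update w1 and b as before.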

Steps and batch size are examples of hyperparameters: parameters you tweak to tune the ML model.

First Steps with TensorFlow


TensorFlow (TF) consists of a hierarchy of toolkits. The analogy is to Java and the JVM: TensorFlow's file format, protobufs, is similar to Java (well, only in analogy), and models run on a graph runtime (similar to the JVM).

This crash course will focus on the high level API: tf.estimator, but you could really do the same thing with TF lower level APIs.

Quick Introduction to pandas


Pandas is a column-oriented data analysis API. The two main data structures are the DataFrame, which is similar to a relational data table with rows and named columns, and the Series, which is a single column.

Important Note: pandas is not named after the cute bear but rather it comes from the term panel data.



import pandas as pd

pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
Out[3]:
0    San Francisco
1         San Jose
2       Sacramento
dtype: object

Series can be combined into a DataFrame:

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

cities = pd.DataFrame({ 'City name': city_names, 'Population': population })



You use pandas to manipulate data. For example, you can add a new column like so:

# 'Area square miles' and 'Population density' were added to the cities DataFrame earlier in the exercise.
cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))
cities
Out[42]:
   City name      Population  Area square miles  Population density  Is named after saint and city area > 50 square miles  Is wide and has saint name
0  San Francisco      852469              46.87        18187.945381  False                                                  False
1  San Jose          1015785             176.53         5754.177760  True                                                   True
2  Sacramento         485199              97.92         4955.055147  False                                                  False

 Each column and row gets an index on creation. Reorder rows by passing a new order of the index to reindex.

cities.reindex([2, 0, 1])
Out[49]:
   City name      Population  Area square miles  Population density  Is named after saint and city area > 50 square miles  Is wide and has saint name
2  Sacramento         485199              97.92         4955.055147  False                                                  False
0  San Francisco      852469              46.87        18187.945381  False                                                  False
1  San Jose          1015785             176.53         5754.177760  True                                                   True
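
A related trick from the exercise (as I remember it): shuffle the rows by reindexing with a random permutation of the index.

import numpy as np

# Shuffle the rows by reindexing with a random permutation of the index.
cities.reindex(np.random.permutation(cities.index))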


First Steps with TensorFlow (exercise)


This exercise is an example of building a model to predict housing value based on number of rooms.

To build the model we first need to define feature columns and targets.

In this model we'll do a linear regression using Gradient Descent (described earlier). Gradient clipping can be added, which is just a cap to make sure steps are not too large and cause the descent to fail. It's like training wheels on a bike.

The last step is to create a function that preprocesses the data to feed into the LinearRegressor.

After that the model can be trained. By examining predictions over periods of steps we can see the gradual improvement in predictions and tweak our hyperparameters (learning rate, steps, batch size) to get to our minimum loss, measured by RMSE.
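
Roughly, the setup looks like this (a sketch from memory of the TF 1.x tf.estimator pieces; the feature name, learning rate, and clipping value are just illustrative):

import tensorflow as tf

# One numeric feature column: total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# Gradient descent, with gradient clipping as the "training wheels".
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer)

# my_input_fn is a hypothetical helper that batches/shuffles the pandas data
# and yields (features, targets) for the estimator.
# linear_regressor.train(input_fn=lambda: my_input_fn(features, targets), steps=100)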



The straight lines are the model's predictions at each period, ending with the final prediction line in red. Each period is better and better, ending with a decent RMSE. This is an ideal gradual descent.

In the loss graph you can see the RMSE descending gradually. You don't want it to be too linear, because that would mean it is descending too slowly. A descent like this is a good sign of a well-tuned model.

Synthetic features and outliers


Continuing from the previous housing example: a synthetic feature is a feature made out of other features. In this case we create a synthetic feature from two existing features, total_rooms and population (rooms per person).

We can use a scatter-plot and histogram to detect outliers

Notice the outliers. These outliers can be clipped by applying min/max functions to our feature.
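
In pandas that looks something like this (a sketch with a made-up stand-in DataFrame; the real exercise uses the California housing data):

import pandas as pd

# Stand-in for the exercise's housing DataFrame (values are made up).
df = pd.DataFrame({"total_rooms": [5000.0, 120000.0, 800.0],
                   "population":  [1500.0,     900.0, 400.0]})

# Synthetic feature: rooms per person, built from two existing features.
df["rooms_per_person"] = df["total_rooms"] / df["population"]

# Clip outliers by capping the feature at a maximum value (5 here, illustrative).
df["rooms_per_person"] = df["rooms_per_person"].apply(lambda x: min(x, 5.0))
print(df["rooms_per_person"])  # the 120000/900 ≈ 133 outlier is clipped to 5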

Generalization


It's a mistake to try to overfit your training data, meaning trying to draw a line that covers every point. While it works for your specific training data, the overfitting can cause new data to be mislabeled where it would not have been with a less aggressive fit.

One way to prevent this and to test if our model is any good is to take some amount of our dataset to use for training and take a different amount of data from the dataset for testing. Do not test on your training data.

To draw data correctly:
  1. data must be randomly drawn
  2. the distribution is stationary, it doesn't change over time
  3. we always pull from the same distribution
Keep it simple (Occam's razor).

Training and Test Sets


Ideally you want a good amount of data for both the test and validation datasets. If your dataset is large, you are fine. If it is small, the two are in conflict: you trade off confidence in testing vs. validation. With a small dataset you can do something like cross-validation.

Validation


Repeated rounds of testing on the same test set cause you to overfit to that test set. One way to solve this is to create a Validation Set. Now you train on your training set, then validate on the validation set, and you can do many iterations of this. Finally you confirm the results on the test set. If the results on the test set differ from the results on the validation set, it's a sign of overfitting.

Validation Exercise


Do checks on the data to make sure there are no errors. Pay attention to any data caps or values outside of sane ranges. Plotting samples from the test and validation sets separately will help you spot errors.

Representation


Getting features from real-world data, called Feature Engineering, is what ML practitioners spend most of their time on. You can use one-hot encoding to map each categorical value to a unique representation.

Good features are those that have realistic and easy to understand values. Don't use magic numbers for features. They have to appear a reasonable number of times in the data set.

You can use the Binning Trick to create ranges and treat each range (or bin) as a feature.
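
Both tricks are easy to sketch with pandas (toy data, not the exercise's):

import pandas as pd

df = pd.DataFrame({"street": ["Main St", "Oak Ave", "Main St"],
                   "latitude": [34.1, 36.8, 38.2]})

# One-hot encoding: each street name becomes its own 0/1 column.
one_hot = pd.get_dummies(df["street"])

# Binning: turn a continuous value into range "buckets", each usable as a feature.
df["lat_bin"] = pd.cut(df["latitude"], bins=[32, 34, 36, 38, 40])

print(one_hot)
print(df["lat_bin"])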

Feature Sets Exercise


You can use feature sets to train your model instead of a single feature. To find features for your feature set, use the corr() method on a dataframe to show a correlation matrix. This shows how much one feature correlates to all the other features. Pick features that are not strongly correlated with each other so that they add independent information.
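
For example (a sketch with a toy DataFrame standing in for the exercise's training examples):

import pandas as pd

# Toy DataFrame standing in for the exercise's training examples.
training_examples = pd.DataFrame({
    "total_rooms":   [5000.0, 1200.0, 800.0, 9000.0],
    "population":    [1500.0,  400.0, 300.0, 2600.0],
    "median_income": [3.2, 8.1, 6.4, 2.9]})

# Correlation matrix: values near +/-1 mean strongly correlated;
# values near 0 mean the features carry mostly independent information.
print(training_examples.corr())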

Feature Crosses


Sometimes the data cannot be fit well by a straight line. Feature Crosses solve this by creating a synthetic feature (by multiplying two or more features) so that the data does fit a linear learner. This is important because linear learners scale well to massive data.

Regularization


At some point your model's predictions on the training data keep getting better, but its results on validation data start to get worse. This is an example of overfitting to the training data. Regularization is a strategy to prevent this. Strategies range from stopping early to penalizing model complexity.

One way to define model complexity is via weights; smaller is less complex, which is better.

L2 regularization defines model complexity as the sum of the squares of the weights.

The goal now is to minimize the combination of the loss AND the model complexity, rather than just the loss.
You can tweak your model by adjusting lambda: this gives more or less weight to the model complexity term. High lambda = simpler model but risk of underfitting. Low lambda = more complex model but risk of overfitting. What you decide depends on your specific data and circumstances.
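
In other words, the training objective becomes something like this (made-up numbers):

# L2-regularized training objective: data loss + lambda * sum of squared weights.
weights = [0.8, -2.5, 0.1]
mse_loss = 0.42          # whatever the data loss currently is (made up)
lam = 0.01               # regularization rate (lambda)

l2_penalty = sum(w ** 2 for w in weights)
total_loss = mse_loss + lam * l2_penalty
print(total_loss)  # minimizing this trades data fit against model complexity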

Logistic Regression


A normal linear regression would give us strange values (outside 0 to 1) for something like a coin flip. Instead we use Logistic Regression to get probability predictions. It is also efficient for large data sets.
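
The trick is squeezing the linear output through a sigmoid so the prediction always lands between 0 and 1. A minimal sketch with made-up parameters:

import math

def sigmoid(z):
    """Maps any real value to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression: a linear model whose output goes through a sigmoid.
b, w1 = -1.0, 0.8          # hypothetical learned parameters
x1 = 2.5
probability = sigmoid(b + w1 * x1)
print(probability)          # ≈ 0.73, interpretable as a probability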

Classification


Classification is deciding whether a puppy is cute or not cute, or whether something is spam or not.

We use a classification threshold to draw the line for when to label something as spam or not. There are tradeoffs between precision and recall when setting the threshold too high or too low.

What metrics can we use to know if our classification is any good?

Accuracy = the number of correct classifications / the total number of predictions. We can't rely on accuracy alone because of class-imbalanced problems: problems where one of the classes is extremely rare compared to the other(s).

Instead of Accuracy, better metrics are:

Precision = true positives / all positive predictions. Only say "wolf" when we are absolutely sure. The tradeoff is that some wolves go unpredicted (false negatives).

Recall = true positives / all actual positives. Say "wolf" whenever there is a rumble in the bushes. The tradeoff is that some rumblings turn out not to be wolves (false positives).
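
From the confusion-matrix counts these metrics are just ratios. A sketch with made-up counts for the wolf example:

# Made-up confusion-matrix counts for the "wolf" classifier.
tp, fp = 8, 2    # predicted wolf: 8 real wolves, 2 false alarms
fn, tn = 4, 86   # missed 4 wolves, correctly ignored 86 quiet nights

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 0.94, but dominated by tn
precision = tp / (tp + fp)                    # 0.80: how often "wolf!" was right
recall    = tp / (tp + fn)                    # ≈0.67: how many wolves we caught
print(accuracy, precision, recall)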



Receiver Operating Characteristic (ROC) curve - plots True Positive Rate vs. False Positive Rate across thresholds. The area under this curve (AUC) can be stated another way: if we pick a random positive and a random negative, it is the probability that the model ranks them in the correct order.

Prediction Bias compares the average of everything we predict to the average of what we observe.
Prediction Bias = average of predictions - average of labels in the data set.
No prediction bias (0) means the average of predictions == the average of observations. So if 1% of emails are spam, then the average of predictions should be 1% for there to be no bias.

There is something wrong with the model if there is bias.

Regularization: Sparsity


Sparse feature crosses significantly increase the feature space, which leads to huge RAM usage and overfitting.

L0 regularization: zero out weights directly. NP-hard, so not efficient in practice.
L1 regularization: penalize the sum of abs(weights). Encourages many of the useless weights to be exactly 0.

Introduction to Neural Nets


Some datasets are really complex. We'd really like Neural Nets to model them for us.

In order to model non-linear data, a non-linear function is needed. A simple one, ReLU, gives great results. To make a neural net, a non-linear layer of nodes is added; that makes the model non-linear, and now more linear layers can be added on top.
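
A minimal sketch (my own toy numbers, not course code) of a two-layer network where each hidden node applies ReLU to its weighted sum:

def relu(z):
    # The non-linearity: negative values become 0, positives pass through.
    return max(0.0, z)

def tiny_net(x1, x2):
    # Hidden layer: two nodes, each a weighted sum passed through ReLU.
    h1 = relu(0.5 * x1 - 1.0 * x2 + 0.1)
    h2 = relu(-0.3 * x1 + 0.8 * x2 - 0.2)
    # Output layer: a linear combination of the hidden nodes.
    return 1.2 * h1 - 0.7 * h2 + 0.05

print(tiny_net(1.0, 2.0))   # without relu() this would collapse to a linear model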

Training Neural Nets


Backpropagation - it's all about gradients. Each layer can reduce signal vs. noise. Gradients can explode if the learning rate is too large. ReLUs can die if their values get stuck in the negative (the gradient becomes 0). If that happens, change the initialization values and start again.

Normalizing feature values helps Neural Net training. Different optimizers, such as Adagrad and Adam, can help on convex problems (but are not ideal for non-convex problems!).

Dropout means that for a single gradient step, a node is randomly taken out of the network. This helps regularization.

Multi-Class Neural Nets


Sometimes we need to classify into multiple classes: not just cat vs. not-cat, but is it a dog, cat, human, or cow?


With multi-class, single-label classification we can use softmax, which makes sure the probabilities across all classes sum to 1. This is similar to logistic regression. Softmax gets expensive as the number of classes grows, so there is an optimization called candidate sampling so that not all negative classes need to be modeled.
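
Softmax itself is simple; a minimal sketch with made-up scores for the four classes:

import math

def softmax(logits):
    # Exponentiate and normalize so the class probabilities sum to 1.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores for dog, cat, human, cow (made up).
print(softmax([2.0, 1.0, 0.1, -1.0]))  # four probabilities summing to 1.0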

With multi-class, multi-label classification, use a one-vs-all strategy: there is an output layer with a binary yes/no output for each of our classes.

Embeddings


Embeddings are a tool to map items to vectors. Movie suggestions can't really be embedded on a single line: someone who wants to watch a children's movie might also want to watch a blockbuster movie. Thus a 2D embedding is more appropriate, with a 2D space of children/adult and blockbuster/arthouse.


How do you get data for movie recommendations? If a user has watched 10 movies you can take 3 to use as labels and use the other 7 for training.

How many embedding dimensions should a layer use? There are tradeoffs between more accuracy and overfitting. A good rule of thumb: dimensions ≈ the 4th root of the number of possible values.
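
Just the arithmetic of that rule of thumb (the catalog size is made up):

# Rule of thumb: embedding dimensions ≈ 4th root of the number of possible values.
possible_values = 500000             # e.g. a large movie catalog (made-up size)
dimensions = round(possible_values ** 0.25)
print(dimensions)                    # ≈ 27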

ML Engineering


Production ML Systems


ML code is only 5% or less of the overall code in an ML system, since such systems do a ton of other things.

Static vs. Dynamic Training


Are you training offline and just once, or online as data comes in? What you use depends on whether your data will change over time (trends/seasonality).

Static vs Dynamic Inference


Do we provide predictions offline, or do we do predictions on the fly? Dynamic inference has latency issues; offline inference needs all the data available ahead of time to make the predictions.

Data Dependencies


Be aware of features you add to your model. Questions to ask: 

Reliability - Is this data reliable and produced the same way?
Versioning - will this data change over time? Consider a version number
Necessity - does the usefulness of the feature justify the cost? Sometimes adding a feature is not worth it if the gains are minuscule.
Correlations - How are different features tied together?
Feedback loops - predictions of a model can affect the input. A stock market predictor can cause a stock to go up which causes the predictions to change which ...