Monday, November 7, 2016

Bandit Algorithms

Imagine you are at a casino with a bunch of slot machines. You are told that the slot machines all have positive expected value but different win probabilities. Your task is to win as much money as possible.

To do this, you will need to wisely balance learning about new arms with earning as much reward as possible. If you continue pulling the same lever, you will make money, but are you sure you are optimizing your gain? Maybe there is a better lever you should be pulling. If you decide to go exploring for a better lever, you might run across some worse ones and end up not making as much as if you had stayed on the current best lever.

This generic class of problems, where there is an explore-vs-exploit decision at each step, is known as a bandit problem. It isn't just about slot machines; it applies to many fields and areas of life, from A/B testing a webpage to deciding when to stop dating.

There are many well-researched algorithms for smoothing out the explore-exploit tradeoff. One simple one involves just flipping a coin at each step: heads, you exploit your best lever; tails, you go exploring. A natural modification is to use a biased coin, exploring with some small probability rather than half the time.
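That biased-coin strategy is commonly called epsilon-greedy. Here is a minimal sketch of it; the arm win probabilities, function name, and parameter values are all illustrative, not from any particular library:

```python
import random

def epsilon_greedy(arms, pulls=10000, epsilon=0.1):
    """Pull levers `pulls` times. With probability `epsilon` explore a
    random arm; otherwise exploit the arm with the best observed average.
    `arms` is a list of win probabilities for Bernoulli (win/lose) payouts."""
    counts = [0] * len(arms)    # times each arm was pulled
    totals = [0.0] * len(arms)  # total reward collected from each arm

    for _ in range(pulls):
        if random.random() < epsilon or 0 in counts:
            # Explore: pick a random arm (also used until every arm
            # has been tried at least once, so averages are defined).
            i = random.randrange(len(arms))
        else:
            # Exploit: pick the arm with the highest observed average reward.
            averages = [t / c for t, c in zip(totals, counts)]
            i = max(range(len(arms)), key=lambda j: averages[j])
        # Bernoulli payout: 1 unit with the arm's win probability, else 0.
        reward = 1.0 if random.random() < arms[i] else 0.0
        counts[i] += 1
        totals[i] += reward

    return counts, totals

random.seed(0)
counts, totals = epsilon_greedy([0.2, 0.5, 0.8])
```

With enough pulls, the bulk of the plays concentrate on the best arm, while the occasional exploratory pull keeps checking whether a better lever exists.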

There are many much better algorithms to use. I think the key is just to be thinking about this tradeoff in the decisions we make in life. We might be sitting there pulling the same lever repeatedly with little gain when instead we should be casting a wider net and exploring as many options as possible. Or the opposite may be true: we might be spending all our time and energy exploring when nearly-as-good options are right in front of us.