Explore vs. Exploit

Kevin B. Ridgway

Photo by Joshua Earle on Unsplash

I recently started listening to the audiobook Algorithms to Live By: The Computer Science of Human Decisions by Brian Christian and Tom Griffiths, and it has been very good. A few points stuck out to me:

  • Tradeoffs: In life, just like in computer science, it's all about tradeoffs. When I make a particular choice among many options, what am I gaining, and what am I passing up?
  • Timing/Stopping: When should I stop pursuing one course of action and try another? How many times should I try the first, second, or third course of action? When should I stop trying options altogether and stick with the one I am on?

As a software engineer, I know algorithms help computers make the best decision they can with the time and information they have (see [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation)). But I did wonder: how could some of these ideas apply to human decisions? Not just sorting-your-sock-drawer decisions, but big life decisions, such as: should I take that job, or this one? Should I send my child to that school, or a different one? How do I know I've made the right decision (read as: the best decision given the time and information at hand) once I've made it? In computer science and mathematics, this is called the optimal stopping problem. Here's what optimal stopping means:

...the theory of optimal stopping is concerned with the problem of choosing a time to take a particular action, in order to maximise an expected reward or minimise an expected cost...
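The book's famous takeaway here is the 37% rule, from the "secretary problem": look at the first ~37% of your options without committing, then leap at the first one better than everything you've seen. Here's a minimal Python simulation of that rule; the function and parameter names are my own, a sketch rather than anything from the book.

```python
import random

def look_then_leap(n=100, look_fraction=0.37, trials=100_000):
    """Estimate how often the look-then-leap rule lands the single
    best of n candidates, over many random arrival orders."""
    cutoff = int(n * look_fraction)
    wins = 0
    for _ in range(trials):
        ranks = random.sample(range(n), n)          # random order; rank 0 is best
        best_seen = min(ranks[:cutoff], default=n)  # look phase: commit to nothing
        # leap phase: take the first candidate better than the look phase's
        # best, or get stuck with the last one if nothing beats it
        chosen = next((r for r in ranks[cutoff:] if r < best_seen), ranks[-1])
        wins += (chosen == 0)
    return wins / trials

print(look_then_leap())  # ~0.37
```

With 100 candidates and the 37% cutoff, the simulated success rate comes out near 1/e ≈ 37%, which is the best any stopping rule can guarantee when options arrive in random order and you can't go back.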

These are fascinating ideas: that one could use computer science algorithms to help with those big human decisions. One idea in particular that really got me thinking was 'explore vs. exploit'. How long should you keep trying different options (explore) before you commit to the best one you've found and reap its benefit (exploit)? This idea comes from a particular problem:

In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.

Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining algorithms to minimize regret in both finite and infinite (asymptotic) time horizons...

I would love to minimize regret either way, in my decisions. So where does this idea of 'explore' vs. 'exploit' come in? When you're between options, do you continue to explore and see what the rewards of the other options are? Or do you sit on the one you've currently chosen, expecting the most reward to come from it? And finally, what do you learn while you're exploring?

In early versions of the multi-armed bandit problem, the gambler has no initial knowledge about the machines. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in reinforcement learning.
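To make that tradeoff concrete, here's a minimal sketch of epsilon-greedy, one of the simplest bandit strategies (my choice of illustration; it isn't named in the quote). With probability epsilon you explore a random machine; otherwise you exploit the machine with the best observed payoff. The machine probabilities and names are invented for the example.

```python
import random

def epsilon_greedy(true_means, pulls=10_000, epsilon=0.1):
    """Play Bernoulli slot machines: explore a random machine with
    probability epsilon, otherwise exploit the best-looking one.
    Tracks cumulative regret versus always playing the best machine."""
    n = len(true_means)
    counts = [0] * n    # pulls per machine
    totals = [0.0] * n  # total payout per machine
    best = max(true_means)
    regret = 0.0
    for _ in range(pulls):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(n)  # explore (and try every machine once)
        else:
            arm = max(range(n), key=lambda i: totals[i] / counts[i])  # exploit
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - true_means[arm]  # expected cost of this pull
    return [totals[i] / counts[i] for i in range(n)], regret

estimates, regret = epsilon_greedy([0.3, 0.5, 0.7])
```

The `regret` total is exactly what the researchers above are minimizing: how much you gave up, in expectation, by not always playing the best machine. And epsilon is the tradeoff itself: set it to 0 and you never explore (and can get stuck on a mediocre machine); set it to 1 and you never cash in on what you've learned.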

This thinking about the multi-armed bandit problem was happening in the 1950s. In the 1970s, John Gittins thought about it, decided to name something after himself, and formulated a way to maximize our decisions' expected reward:

A theorem, the Gittins index, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward.

the "Gittins index" is a real scalar value associated to the state of a stochastic process with a reward function and with a probability of termination. It is a measure of the reward that can be achieved by the process evolving from that state on, under the probability that it will be terminated in future.

Success and failure, in human terms. Zero and one, in computer terms. The probability lies between those two integers, zero and one, and therein lie our decisions. Looking forward to finishing this book.