Exploring the explore exploit trade-off


I’m writing this to discuss a decision-making concept that’s relevant to effective altruists. When faced with a decision between uncertain options, there is often a trade-off between getting more information about how good each option is (exploring) and choosing the best known option (exploiting). For example, when choosing which career path to choose, you are faced with a trade-off between testing out several options or committing to just one. In this post, you can play a simple game that demonstrates this trade-off then learn about the optimal way to play this game and make decisions in life.

Key takeaways

  1. How much you should explore depends on how much you already know about how the values are distributed over the options, how heavy the tails of the distribution are, and how costly it is to explore.
  2. All else equal, the less you know about the distribution, the more you should explore.
  3. All else equal, if you suspect that the chance of finding an option with an extremely high value is not negligible (heavy-tailed), you should explore more than if you think that it is negligible (light-tailed).
  4. All else equal, the higher the cost of exploring, the less you should explore.


What do the following questions have in common?

  • Which cake should you choose at the coffee shop?
  • Which charities should a charity evaluator recommend to donors?
  • Which offer should you accept when selling your house?
  • Which career path should you follow?
They are all decision questions that involve a choice between unknown options. Unless you’ve tried all the cakes in the coffee shop, you don’t know how good each cake will be. Charity evaluators don’t have information about the expected impact of all the charities. In each of these decisions, the decision-maker can decide to explore the space of options or exploit the best known option. But exploration comes at a cost. It takes time, money, and energy to research charities let alone the opportunity cost. So the decision-maker is faced with a trade-off:
  • if they explore too long then they pay more cost and miss out on the utility of exploiting a good option early.
  • If they exploit the best known option too early then they are likely to be committing to a sub-optimal choice.

I want to make the right decisions. I’m sure you do too. That’s why I’ve been fascinated by the explore exploit trade-off. In this post, I’ll introduce a simple decision-making game that illustrates this trade-off. I’ll then talk about the optimal way to play the game, followed by the implications this has for making important decisions in real life.

Modelling the explore exploit trade-off

Model set-up

Here’s a fun game that models the explore-exploit trade-off*. Each player draws a number from a continuous probability distribution of known type but unknown parameters. On every turn, each player can choose to either draw again or stop drawing. Once you stop drawing, you can never draw again. The winner is the player with the maximum score, which is calculated as the player’s maximum draw minus a constant cost for each draw taken. Let’s say your maximum draw was 134, you took 13 draws, and the cost per draw was 4, then your final score would be 134 - (13 * 4) = 82.

This game is quite simple. It’s more simple than almost any decision you’ll make in real life. But it isolates an important decision-making concept: the trade-off between exploring and exploiting. By drawing, you are exploring the unknown probability distribution. But each draw comes at a cost. After a number of draws, you will be happy that you’ve explored the distribution enough to commit to the score you have and stop drawing.

In realistic decisions, we tend to have some idea of at least the type of probability distribution. For example, there are reasons to think that the expected cost-adjusted impact of a randomly selected charity follows a power-law distribution or other heavy-tailed distribution. For this reason, in the game, you will be able to choose the type of distribution you want to draw from.

Play the game

Before we go any further, have a go yourself. I’ve written a rough-and-ready version of this game in Python which you can run below. The game can be played with one or more players and one of three distribution types can be chosen: uniform, normal, and lomax**. To refresh your memory of what these distributions look like, the probability density functions of these distributions can be seen below. These curves give the relative likelihood (vertical axis) that a given value (horizontal axis) is drawn.

Uniform distribution

Normal distribution
Lomax distribution

(Images from Wikipedia)

Play the game here:

Optimal strategies

WARNING: Parts of this section are full of mathematics. Skip this section if you want to real-world applications.

CAVEAT: Whilst I am a mathematician, I am not an expert in statistics, probability, or risk, so there is likely to be mistakes or misconceptions. Please let me know if you spot any.

How should you play this game in order to maximise your score? When should you stop drawing? Intuitively, we would expect that the optimum decision procedure is to explore the option-space until the expected value of exploring another option is less than the cost incurred by exploring another option. Let’s put this intuition on a mathematical footing.

You’re playing the game, and after \(n\) draws, your maximum draw is \(M_n\), and you need to decide whether or not to take the next draw. Let \(U: A \to R\), be your utility function that maps actions from the action set \(A = \{\text{draw}, \text{stop}\}\) to the real number line. In this game your utility function is your score. Stopping would yield a known utility of $$ U(\text{stop}) = M_n - nc. $$ Drawing the next number (and then stopping after then \((n + 1)\)th draw) would yield a utility of $$ U(\text{draw}) = M_{n+1} - (n + 1)c. $$ But we don’t know what the maximum will be after the \((n + 1)\)th draw. Instead, we need to calculate the expected value of the utility of drawing, \(E(U(\text{draw}))\), which I will shorten to just \(E(\text{draw})\). This is $$ E(\text{draw}) = E(M_{n+1}) - (n + 1)c = E(\max\{x_{n + 1}, M_n\}) - (n + 1)c, $$ Where \(x_{n+1}\) is the draw you will get if you decide to draw. Then our intuition that we should draw only when we can expect the score we’d get if we draw to be greater than the score we’d get if we stop can be written mathematically as $$ E(\text{draw}) > E(\text{stop}), $$ Which, using the previous two equations, can be written as $$ E(\max\{x_{n + 1}, M_n\}) - M_n > c. $$

Now it’s useful to first consider the simple case where the distribution is completely known to the player. That is, the player has perfect information.

Strategy 1 - Perfect information

In this case, the probability density function, \(f(x)\) is known. Then the left hand side of Equation () can be written as $$ \begin{aligned} E(\max\{x_{n + 1}, M_n\}) - M_n &= P(x_{n + 1} > M_n) E(x_{n + 1} | x_{n + 1} > M_n) + P(x_{n + 1} \leq M_n) M_n - M_n \\ &= \int_{M_n}^{\infty} x f(x)~dx + \left(\int_{-\infty}^{M_n} f(x)~dx\right) M_n - M_n, \\ &= \int_{M_n}^{\infty} x f(x)~dx + \left( 1 - \int_{M_n}^{\infty} f(x)~dx\right) M_n - M_n, \\ &= \int_{M_n}^{\infty} (x - M_n) f(x)~dx, \end{aligned} $$ where we have used the definition of conditional expectation. So we should draw only if $$ \int_{M_n}^{\infty} (x - M_n) f(x)~dx > c. $$

Now we use a part of mathematics known as dynamic programming. With perfect information, there is no more information about the distribution available. So the decision rule is the same for each draw. That is, the optimal strategy is to stop as soon as you draw a number higher than some critical value. The critical value is the value of M that satisfies $$ \int_{M}^{\infty} (x - M) f(x)~dx = c. $$ Solve this equation for \(M\), then you should stop drawing as soon as you draw a value greater than \(M\).

For example, let’s say you are drawing from the standard normal distribution***, then we’d have to solve the equation $$ \frac{1}{\sqrt{2\pi}} e^{-M^2/2} - \frac{1}{2} M (1 - \text{erf}(M/\sqrt{2})) = c $$ for \(M\). As an example, with a cost of \(c = 0.01\), the correct strategy when yo uhave perfect information is to stop drawing as soon as you draw a number higher than 1.938.

Strategy 2 - Myopic

What if we know the type of distribution but we don’t know the distribution parameters? This is more like the important decisions we make in real-life.

In the previous derivation, we reasoned as if you would stop after the next draw. When perfect information is known about the distribution, this reasoning doesn’t affect the strategy. This is because the value of information of each draw is zero because all the information is known. If, instead, we only know the type of the distribution and not the parameters, the value of information is non-zero. The strategy where we look only to the new draw is known as a myopic strategy. We can calculate the optimal myopic strategy as follows.

After \(n\) draws, we want to decide whether to draw again. Let our \(n\) previous draws make up the set \(D = \{x_1, x_2, …, x_{n - 1}, x_n\}\), which has maximum \(M_n\). We can use this set to make a maximum likelihood estimate of the parameters of the probability distribution from which we are drawing. Let the probability density function of the distribution that has these maximum likelihood parameters be \(f_D(x)\). Then we can expect that the next draw, \(x'_{n + 1}\), will be drawn from this distribution. Here, the \('\) is to make it clear that we are deciding from a position where this value is unknown. Then we can calculate the expected utility from drawing and the expected utility from stopping to see if drawing is worth it. $$ E(\max\{x'_{n + 1}, M_n\}) - M_n = \int_{M_n}^{\infty} (x - M_n) f_D(x) ~dx. $$ This is the same as the corresponding expression from the previous section, but with a maximum likelihood distribution rather than a known distribution. When this expression is greater than the cost per draw, \(c\), then you should draw again, if not, you should stop. After each draw, a new maximum likelihood estimate must be calculated.

As an example, say we are drawing from a normal distribution of unknown mean and variance. After \(n\) draws, let the maximum likelihood estimates of the mean and variance be \(\mu_n\) and \(\sigma_n\), respectively****. Then $$ \begin{aligned} E(\max\{x'_{n + 1}, M_n\}) - M_n &= \int_{M_n}^{\infty} (x - M_n) f_D(x) ~dx \\ &= \int_{M_n}^{\infty} (x - M_n) \frac{1}{\sqrt{2\pi}\sigma_n} e^{-\left(\frac{x - \mu_n}{\sqrt{2}\sigma_n}\right)^2} ~dx \\ &= \frac{1}{\sqrt{2\pi}}\sigma_n e^{-\left(\frac{x - \mu_n}{\sqrt{2}\sigma_n}\right)^2} + \frac{1}{2}(\mu_n - M_n) \left(1 - \text{erf}\left(\frac{M_n - \mu_n}{\sqrt{2}\sigma_n}\right)\right), \end{aligned} $$ where \(\text{erf}\) is the error function. So the optimal (myopic) strategy is to calculate the maximum likelihood parameters \(\mu_n\) and \(\sigma_n\) and if the above expression is less than the cost per draw, \(c\), then stop drawing.

Since this is a myopic strategy, strictly, this gives a lower bound on when you should actually stop drawing. By this I mean that if this strategy tells you to stop, you should definitely stop, but there may be cases where you should keep drawing even if this strategy tells you to stop. This is because there is value of information (which can only be positive) in taking the next draw that is not considered in this optimal myopic strategy. In the next section, we'll derive a strategy that takes into account some of this value of information.

Strategy 3 - Less myopic

It's possible to derive a strategy that takes into account the value of the information you will get about the distribution for the \((n + 2)\)th draw when deciding whether you should draw the \((n + 1)\)th draw. The derivation is much longer, so I will omit it here and just call this the "less myopic strategy".

Comparing optimal strategies

Let’s see how the performance of each of these strategies compare. Firstly, just running one game, where the distribution is standard normal and the cost per draw is \(c = 0.01\), the scores for each strategy are given in the figure below.

Each line shows how the scores change as each strategy draws. The circles on the lines are the draw at which the strategy decided to stop.The downward slope in sections of each line correspond to the decrease in score every time a draw is taken. Sudden sharp rises in score occur when a new maximum is drawn.

As expected, the strategy which has perfect information achieves the highest score in the end, with the myopic strategy performing worst of all. But this is just a single (albeit typical) game. We are interested in how these strategies perform in the long run, when many games are played. How much better does the perfect information strategy do in the long run?

To understand this, I ran a Monte Carlo simulation of 10,000 games between each of the strategies, drawing from a standard normal distribution with cost per draw of 0.01. The histograms below illustrates the distribution of score achieved by each strategy.

The table below gives the average number of draws taken and the average score of each strategy.

Average number of draws: [37.67626 17.69837 36.05847] Average scores: [1.9387332 1.3808339 1.5845812] Min scores: [-2.1 -2.56 -2.77] Max scores: [4.54 5.48 5.12]
Strategy Average score Average number of draws
Perfect information 1.94 37.68
Myopic 1.38 17.70
Less myopic 1.58 36.06

As expected, the strategy with perfect information has by far the highest average score, achieving on average nearly two standard deviations higher than the mean. What is perhaps more surprising is the fact that on average the perfect information strategy takes more draws than the other two strategies. This is surprising because the perfect information strategy does not gain any extra information about the distribution by drawing. So you might think expect that this strategy would draw less. However, this is overridden by this strategy’s knowledge of the critical value to wait for. They know all the information about the distribution so can keep drawing even if a string of low values are drawn and wait for an extremely high value.

The spread of scores is narrower for the perfect information strategy than the myopic strategies. The perfect strategy achieves approximately just as many high scores (say, >2 standard deviations above the mean) as the myopic strategies, but it achieves much fewer low scores. In particular, the perfect strategy almost never achieves a score lower than the mean, whereas such a low score is fairly common for the myopic strategies.

Model flaws

Like any model, this model has some areas where it doesn’t fully approximate real-life decision making. Here are some that come to mind that you should bear in mind when interpreting the ideas in this post:

  1. Real-life decision-making is not a binary choice to explore or exploit. Instead, one can apportion their time between exploring and exploiting.
  2. It’s rare that a decision process requires you to stop exploring completely until the end of time. Usually, you can revert back to a period of exploration if more information becomes available suggesting that challenges your belief about the underlying distribution.
  3. This model assumes that exploration costs are known a priori. In reality, this is rarely the case.
  4. This model assumes that it is possible to gain full information about the utility of an option. In reality, this is rarely the case, in principle, let alone in practice.

Implications for real-world decision making

Real world decisions that involve a trade-off between exploring and exploiting options are common. We rarely have access to perfect information about how value is distributed across the options. Instead, it’s best to have a prior probability model (that is, what you expect the distribution to look like), which you update based on new information that you have gained by exploring the option space. Ideally, you would be able to calculate the value of information that you would get by exploring. In practice, it is very hard to calculate (or intuit) the value of the information for all future decisions. Instead, it’s much easier to calculate (or intuit) the value of the information for the next one or few decisions (that is, a myopic strategy).

So, what is the optimal way to trade-off between exploring and exploiting? There is no clear-cut answer to this because it depends on your knowledge of the distribution, the heaviness of the tails of the distribution, and how much it costs to explore. Here are some heuristics that can guide us to better decision making:

  1. If you don’t know very much about the distribution of value across your options, then exploration will yield larger updates about the distribution. Put another way, the less you know about the distribution, the larger the value of information, therefore, you should err on the side of exploring more. Usually, early on in a decision making process, you know very little about the distribution, therefore, you should explore unknown options more. Later in the decision process, you are likely to know more about the distribution, therefore, you should exploit your best options more.
  2. If your prior is that the distribution is heavy-tailed, then it’s likely that a large proportion of the value in the options is in the few most valuable options. In this case, you should err on the side of exploring more. The heaviness of a distribution’s tails is difficult to estimate. There are psychological reasons why we tend to underestimate the heaviness of tails. For example, an underestimate of the kurtosis risk in financial investments has led to several financial collapses.
  3. If it is cheap to explore, compared to the expected value of information gained from exploration, then you should err on the side of exploring more. There are often very cheap ways to get some useful information about the value of an option. For career choice, this might look something like talking to someone in a given career or working on a small project or internship for a few weeks.

Thanks to Michael Gritzmacher for idea generation and for help creating the original version of the game, which involved turning over small pieces of cardboard with numbers on them and was used for an Effective Altruism Sheffield workshop on Bayesian decision making. Thanks to Fionnlagh Mackenzie-Dover for play testing.


* This game is an example of an optimal stopping problem. This particular problem is a version of the house selling problem. It has similar features to the famous multi-armed bandit and secretary problems.

** The Lomax distribution is a power-law distribution. It’s a Pareto distribution that has been shifted so that all positive numbers can be realised. It is heavy-tailed, which means that numbers very far from the mean will be realised more then in the normal distribution.

*** Normal distribution with mean 0 and variance 1.

****For the normal distribution, the maximum likelihood estimates for the mean and variance are the same as the sample mean and variance.


Comments powered by Disqus