The explore exploit chapter in the book “Algorithms to Live By” looks into the problem called the “Multi-armed Bandit”. The “Multi-armed Bandit” is when you have multiple independent choices with different reward probabilities, and you can only try one at a time (like a slot machine). One algorithm that the book describes to tackle this problem is the Gittins Index.

The Gittins Index is a fairly complicated calculation, but the idea for the slot machines example is that it favors exploration of new machines when there is less data gathered about the odds. The amount that the index favors exploration is calculated from the “discount” factor, the higher the discount factor, the more exploration. For example, if you only have a few minutes to test the machines, then you might choose a lower discount factor, because you don’t have time to test any one thoroughly, so stick with a decent one if you find it. With a high 90% factor, shown in the graph below, even a machine with a 9 to 5 win ratio would have a lower Gittins Index than an untested machine, guiding the player to move on.

Gittins Index with 90% Discount Factor.

Graph showing Gittins Index on the y-axis using a 90% discount rate, number of wins on the x-axis, and number of losses as separately colored lines.

A gambling example is a bit uninteresting though, since the odds are designed to be unfavorable, so I enjoyed how the author went more in depth into how the algorithm could be used in clinical trials. Randomized controlled trials are used to test drugs and other treatments on patients to determine how well they work. If you’ve watched any medical TV show, you’ve probably seen an episode about this.

Trials for serious medical conditions are difficult for everyone involved, since it requires patients receive different treatments, probably with different efficacy. So even in the ideal scenario, where a new treatment has a positive effect, then that leaves the other groups worse off for the duration of the trial. The question is, when can you determine that the treatment is effective, and how effective is it? The Gittins Index or other reinforcement-learning methods could be used to dynamically determine how effective treatments are, but it’s understandably difficult to change a life and death process that has worked so far.