The Explore/Exploit-dilemma

Every time I go to an Indian restaurant, I'm faced with a dilemma: Should I order my favorite Tikka Massala, or dare to try something completely new from the menu? Many decisions in our daily lives boil down to such a choice between the new and familiar. Should we have lunch at the regular place or try the newcomer on the corner? Should we choose a pizza capricciosa or try the special? Should we take a trip to somewhere we’ve never been, or book another charter to Mallorca? On one side: the familiar. On the other: uncharted territory.

 
 

Exploring options allows you to discover new favorites, but also exposes you to the risk of getting something inferior to what you already know. Where is the balance between exploring new options and using what you've already learned about what you like best? This explore/exploit dilemma has preoccupied mathematicians and computer scientists during the second half of the 20th century. The dilemma is not only notoriously difficult to solve; it also has several surprising applications.

AB testing

Marketing has become increasingly digital. Ads in paper magazines and glossy product catalogues have been replaced by digital newsletters and website banners. There is a lot of money to be gained if you can convince website visitors to sign up for a newsletter or make a purchase. But the battle for online customers turns out to be an explore/exploit dilemma.

Small details, such as fonts, colors and wording, have been shown to have an unexpectedly large impact on our online behavior. To determine which design - such as the color of the "buy" button - that makes users shop on a website, many companies routinely conduct statistical surveys. They design two (or more) versions of their website - A and B - and show one version to a certain percentage of their visitors and the other version to the rest. They then measure how well the different designs meet their objectives. The process is called AB testing.

Netflix is one of the companies that routinely uses AB testing to optimize the user experience. Among other things, they test which cover image makes users most curious to watch a particular series.

Barack Obama's campaign team also used AB testing for the 2008 election. Among other things, they tested which of the following buttons got the most website visitors to sign up as members.

 
 

It turned out that "Learn more" was the most effective option. The AB testing of the website increased the number of new members by 40%, and generated an estimated $60 million in increased donations.

So AB testing can be astonishingly effective. But there's a catch. While the experiment is running, information is obtained regarding which version of the website works the best. Continuing to allow a percentage of users to see the worse site, could be bad for business. In a perfect world, one would want to continue the testing (explore), but at the same time capitalize on the lessons learned during the experiment (exploit). In other words, a solution to the explore/exploit dilemma would help businesses in their quest for website visitors and paying customers.

Such a solution could also save lives.

Randomized studies

When you test a new medical treatment, you normally use randomized trials. In a randomized trial, the subjects are divided into two groups - one group receiving the new treatment and one group receiving the standard treatment (or placebo). After the treatment is completed, statistical tests (hypothesis testing) can be used to determine whether the new treatment was effective compared to the standard treatment. But the procedure raises an ethical dilemma.

Let's say the study is looking at a new cancer treatment. If, during the course of the study, the new cancer treatment appears to have a significantly better effect, shouldn't the patients in the other group also be given the opportunity to have that treatment? Maintaining the division of the two groups promotes research and follows the statistical framework, but does not put the welfare of patients first. In the best case scenario, one would like to be able to explore which treatment works best, and at the same time exploit the lessons learned during the course of the study. A solution to the explore/exploit dilemma could thus save lives, but such a solution has proved difficult to find.

Strategies

The explore/exploit dilemma has proven notoriously difficult to solve. Even today, there is no universal, optimal approach. However, there are methods that solve the problem under certain conditions. One such example is the Gittin's index, which you can read more about in Brian Christian and Tom Griffith's excellent book Algorithms to live by. Here, we'll focus on strategies that will go some way towards finding a solution.

Time

The first thing to note is that any solution to the explore/exploit dilemma depends on your time horizon. Is the current choice one that you will face many times in the future? If yes, then it is worth exploring options. For example, if you've just arrived in a new city where you'll be staying for a few months, it's well worth exploring the restaurant options. Going home the next day? Well, then it makes more sense to choose a nice bistro that you've already tried, seeing as you won’t have time to take advantage of the information you gain by exploring another restaurant. As Brian Christian and Tom Griffith put it:

"[E]xplore when you will have time to use the resulting knowledge, exploit when you're ready to cash in." p. 35

Greedy algorithm

A simple way to deal with the explore/exploit dilemma is to use a so-called greedy algorithm. Suppose you want to optimize the way you choose holiday destinations. You know you'll be going on holiday at least once a year, so it's worth spending some time exploring various destinations. Hence, for a number of years (n) you travel to a new destination each holiday. Once the exploring stage is over, it's time to put what you've learned to good use. According to the algorithm, nine times out of ten, you choose to visit your favorite of the n destinations. But you still leave some room for exploration. Every tenth holiday, you choose a destination you've never visited before. The algorithm is greedy in the sense that in the vast majority of cases, it chooses the very best option known so far.

Zelen’s algorithm

Another approach to addressing the explore/exploit dilemma was devised by biostatistician Marvin Zelen. He was troubled by the fact that traditional randomized trials could lead to patients being denied an admittedly unproven, but potentially life-saving, treatment. Instead of dividing patients into two groups, he proposed choosing which treatment to offer a patient based on a probability model. Zelen's algorithm can be likened to drawing colored balls from an urn.

At the first draw, there are two balls in the urn, one for each of the two treatment options. Thus, there is a 50% chance that the patient will be assigned either of the treatments. If the chosen treatment is successful, another such ball is placed in the urn. This means that the probability of the next patient receiving that treatment is 2/3. If the treatment fails, a ball representing the second treatment is placed in the urn instead. In this way, Zelen's algorithm makes it more likely that each patient will be assigned to the treatment with the greatest effect.

 
 

Sixteen years later, Zelen's ideas were put into practice. A team of researchers wanted to compare two treatments for newborn babies who needed help breathing - a new treatment and a conventional one. The new treatment was a modified heart-lung machine, called an ECMO machine, which at the time of the study was a controversial treatment. One newborn baby was treated with the conventional method and died. Eleven babies in a row were then given the ECMO treatment, and survived. Despite the favourable outcome, the study was criticised. Too few babies were treated with the conventional method to provide a meaningful basis for comparison.

In the 1990s, another study was conducted in the UK. This time it was a traditional randomized trial. A total of 200 babies were divided into two groups, each of whom received one of the treatments. This study also showed the benefits of ECMO treatment, but there was a high price to pay for this knowledge. In the group that received the conventional treatment, 24 more babies died, compared to those treated with ECMO.

Time as your guide

The explore/exploit dilemma comes in many different guises. It pops up in corporate AB testing and randomized drug trials. It's part of your everyday life when you pick out a pizza at the pizza parlour, try to find the optimal travel destination or choose which book to read next. Should you choose something tried and tested, or dare to try something new? The dilemma has no single, optimal solution, although there are a number of more or less effective algorithms. Perhaps the most important lesson, however, is that time is your guide. When you are young and have your whole life ahead of you - explore! As you approach old age, it's time to make use of everything you've learned. In other words, being stuck in your habits as an old person is exactly as it should be.

References and further reading

Christian, Brian & Griffiths, Tom (2016) Algorithms to live by. The computer science of human decisions. Picador. 

Cohen, Jonathan, D et al (2007) Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos Trans R Soc Lond B Biol Sci. 2007 May 29; 362(1481): 933–942

Masum, Mohammad (2020) Intro to Reinforcement Learning: The Explore-Exploit Dilemma

Netflix technology blog (2016) Selecting the best artwork for videos through A/B testing

Goodman, Rio & Goel, Ashish (2009) Goodman Lecture 5: Uncertainty

Siroker, Dan, How Obama Raised $60 Million by Running a Simple Experiment

Weng, Lilian (2018) The Multi-Armed Bandit Problem and Its Solutions.

Wikipedia, Multi-armed bandit

Föregående
Föregående

The math of facial recognition

Nästa
Nästa

Life is not linear