The exploration-exploitation trade-off is a fundamental dilemma in online decision making. Reinforcement learning (RL) approaches are often employed to achieve optimal outcomes in this setting. Multi-armed bandit (MAB) algorithms are a popular family of RL methods tailored to this trade-off. However, increasing the number of arms (i.e., decision criteria) leads to an exponential increase in complexity. Multi-armed bandits also need a fast feedback loop to improve their policy decisions and converge to the optimal solution, but delayed feedback is common in many applications. In advertising, for example, information about a conversion may become available only long after the advertisement was displayed.
Shradha Agrawal offers an overview of MABs and explains how to scale them efficiently to multiple decision criteria. Shradha focuses on the Thompson sampling technique, which uses randomization effectively to handle observational delays, and uses an example from advertising to show how the solution can provide relevant, personalized experiences to users in real time to increase conversions.
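To make the idea concrete, here is a minimal sketch of Beta-Bernoulli Thompson sampling with delayed feedback. It is an illustration only, not the method presented in the session: the function name `thompson_delayed`, the fixed `delay`, the uniform Beta(1, 1) priors, and the simulated conversion rates are all assumptions made for the example.

```python
import random
from collections import deque


def thompson_delayed(true_rates, rounds=5000, delay=50, seed=0):
    """Beta-Bernoulli Thompson sampling where each reward arrives `delay` rounds late.

    `true_rates` are hypothetical per-arm conversion probabilities used only
    to simulate feedback. Returns the number of times each arm was played.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior for every arm (assumption)
    failures = [1] * n_arms
    pulls = [0] * n_arms
    pending = deque()  # entries are (round_when_visible, arm, reward)

    for t in range(rounds):
        # Fold in any feedback that has become visible by round t.
        while pending and pending[0][0] <= t:
            _, arm, reward = pending.popleft()
            if reward:
                successes[arm] += 1
            else:
                failures[arm] += 1

        # Draw one plausible conversion rate per arm from its posterior
        # and play the arm with the highest draw.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        pulls[arm] += 1

        # Simulate the conversion now, but reveal it only after the delay.
        reward = rng.random() < true_rates[arm]
        pending.append((t + delay, arm, reward))

    return pulls
```

Calling `thompson_delayed([0.02, 0.05, 0.10])` simulates three ads with made-up conversion rates. Because the arm is chosen from random posterior draws rather than a deterministic score, the policy keeps spreading its plays across arms while the posterior is stale, which is one common explanation for why Thompson sampling tolerates delayed feedback reasonably well.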
Shradha Agrawal is a data scientist at Adobe in San Jose. She holds a master’s degree in computer science with a focus on AI and machine learning from the University of California, San Diego. She is the author of a number of papers and patent applications.
©2019, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.