Multi-armed bandit

In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a fixed, limited set of resources must be allocated between competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to the choice. This is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff. The name comes from imagining a gambler at a row of slot machines (sometimes known as 'one-armed bandits'), who has to decide which machines to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.

The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called 'exploration') and to optimize its decisions based on existing knowledge (called 'exploitation'). The agent attempts to balance these competing tasks in order to maximize its total value over the period of time considered. The bandit model has many practical applications.

The multi-armed bandit (short: bandit or MAB) can be seen as a set of real distributions B = {R_1, …, R_K}, each distribution being associated with the rewards delivered by one of the K ∈ ℕ⁺ levers. Let μ_1, …, μ_K be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon H is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret ρ after T rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards:

ρ = Tμ* − Σ_{t=1..T} r̂_t,

where μ* = max_k μ_k is the maximal reward mean and r̂_t is the reward collected in round t. A common formulation is the binary multi-armed bandit or Bernoulli multi-armed bandit, which issues a reward of one with probability p, and otherwise a reward of zero. A major breakthrough was the construction of optimal population selection strategies, or policies, that possess a uniformly maximum convergence rate to the population with the highest mean.
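To make the formal model concrete, the following is a minimal simulation sketch of a Bernoulli bandit played with a simple epsilon-greedy rule, reporting the empirical regret against the best arm. The arm probabilities, the value of epsilon, and the horizon are illustrative assumptions, not part of the problem definition.

import random

def simulate_bernoulli_bandit(arm_probs, horizon=10000, epsilon=0.1, seed=0):
    # Play a Bernoulli bandit with an epsilon-greedy rule and report the empirical regret.
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k       # number of pulls of each arm
    means = [0.0] * k      # empirical mean reward of each arm
    total_reward = 0.0
    for _ in range(horizon):
        # Explore a random arm with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: means[a])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update
        total_reward += reward
    # Empirical regret: expected reward of always playing the best arm minus the collected reward.
    regret = horizon * max(arm_probs) - total_reward
    return total_reward, regret

if __name__ == "__main__":
    reward, regret = simulate_bernoulli_bandit([0.2, 0.5, 0.55])
    print("collected reward:", reward, "empirical regret:", round(regret, 1))

A fixed epsilon keeps exploring forever, so its regret grows linearly; policies with the uniformly maximum convergence rate mentioned above reduce exploration over time to achieve logarithmic regret.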
A particularly useful version of the multi-armed bandit is the contextual multi-armed bandit problem. In this problem, in each iteration an agent has to choose between arms. Before making the choice, the agent sees a d-dimensional feature vector (context vector) associated with the current iteration. The learner uses these context vectors, along with the rewards of the arms played in the past, to choose the arm to play in the current iteration. Over time, the learner's aim is to collect enough information about how the context vectors and rewards relate to each other that it can predict the next best arm to play by looking at the feature vectors.

Another variant of the multi-armed bandit problem is the adversarial bandit, first introduced by Auer and Cesa-Bianchi (1998). In this variant, at each iteration an agent chooses an arm and an adversary simultaneously chooses the payoff structure for each arm. This is one of the strongest generalizations of the bandit problem, as it removes all distributional assumptions, and a solution to the adversarial bandit problem is a generalized solution to the more specific bandit problems. A sketch of one standard algorithm for this setting, Exp3, is given at the end of this section.

In the original specification and in the above variants, the bandit problem is specified with a discrete and finite number of arms, often indicated by the variable K. In the infinite-armed case, introduced by Agrawal (1995), the 'arms' are a continuous variable in K dimensions.

Garivier and Moulines derive some of the first results for bandit problems in which the underlying model can change during play. A number of algorithms have been presented to deal with this case, including Discounted UCB and Sliding-Window UCB.

Many variants of the problem have been proposed in recent years.
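For the non-stationary setting just described, the sketch below illustrates the sliding-window idea behind Sliding-Window UCB: arm statistics are computed only over the last few plays, so older observations are forgotten when the reward distributions drift. The window length, the exploration constant, and the drifting reward simulation are illustrative assumptions, not the exact algorithm or constants of Garivier and Moulines.

import math
import random
from collections import deque

def sliding_window_ucb(reward_fn, n_arms, horizon, window=200, c=2.0, seed=0):
    # Pick arms by a UCB index computed only over the last `window` plays.
    rng = random.Random(seed)
    history = deque()           # (arm, reward) pairs currently inside the window
    counts = [0] * n_arms       # windowed pull counts
    sums = [0.0] * n_arms       # windowed reward sums
    chosen = []
    for t in range(1, horizon + 1):
        # Replay any arm whose windowed count has dropped to zero before trusting the indices.
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + c * math.sqrt(math.log(min(t, window)) / counts[a]),
            )
        reward = reward_fn(arm, t, rng)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        # Drop the oldest observation once it falls out of the sliding window.
        if len(history) > window:
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        chosen.append(arm)
    return chosen

if __name__ == "__main__":
    def drifting_bernoulli(arm, t, rng):
        # Arm 0 is best in the first half of the run, arm 1 in the second half.
        p = [0.7, 0.3] if t < 2500 else [0.3, 0.7]
        return 1.0 if rng.random() < p[arm] else 0.0

    picks = sliding_window_ucb(drifting_bernoulli, n_arms=2, horizon=5000)
    print("fraction of arm 1 in the last 500 rounds:",
          sum(a == 1 for a in picks[-500:]) / 500)

Because only the most recent observations contribute to the index, the policy re-explores after the change point instead of being anchored to the arm that was best early on.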

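For the adversarial variant described earlier, a standard algorithm is Exp3 (exponential-weight algorithm for exploration and exploitation), due to Auer, Cesa-Bianchi, Freund and Schapire. It is not detailed in the text above, so the following is only an illustrative sketch under the usual assumption of rewards in [0, 1]: each arm keeps a weight, an arm is sampled from a mixture of the weight distribution and the uniform distribution, and the chosen arm's weight is updated with an importance-weighted reward estimate.

import math
import random

def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    # Exp3: exponential weights with uniform exploration for adversarial rewards in [0, 1].
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for t in range(horizon):
        total = sum(weights)
        # Mix the weight distribution with uniform exploration of rate gamma.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, t)          # adversary's payoff for the chosen arm at round t
        estimate = reward / probs[arm]      # importance-weighted estimate of the unseen reward vector
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Renormalize to avoid numerical overflow on long runs; the distribution is unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return weights

The importance-weighted estimate keeps the expected update unbiased even though only the chosen arm's payoff is observed, which is what allows regret guarantees without any distributional assumption on the rewards.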