# Online Learning

## Multi-armed Bandits

## Regularized Contextual Bandits

X. Fontaine, Q. Berthet and V. Perchet

AIStatS 2019

We consider the stochastic contextual bandit problem with additional regularization. The motivation comes from problems where the policy of the agent must be close to some baseline policy known to perform well on the task. To tackle this problem, we use a nonparametric model and propose an algorithm splitting the context space into bins, solving simultaneously — and independently — regularized multi-armed bandit instances on each bin. We derive slow and fast rates of convergence, depending on the unknown complexity of the problem. We also consider a new relevant margin condition to get problem-independent convergence rates, yielding intermediate rates interpolating between the aforementioned slow and fast rates.

## Bridging the gap between regret minimization and best arm identification, with application to A/B tests

R. Degenne, T. Nedelec, C. Calauzenest and V. Perchet

AIStats 2019

## Stochastic Bandit Models for Delayed Conversions

Vernade, O. Cappé and V. Pechet

UAI 2017

## Sparse Stochastic Bandits

Kwon, V. Perchet and C. Vernade

COLT 2017

http://proceedings.mlr.press/v65/kwon17a/kwon17a.pdf

## Gains and losses are fundamentally different in regret minimization. The sparse case.

*(with J. Kwon)*

*Journal of Machine Learning Research*

http://www.jmlr.org/papers/volume17/15-503/15-503.pdf

## Anytime optimal algorithms in stochastic multi-armed bandits

Degenne and V. Perchet

ICML 2016

http://proceedings.mlr.press/v48/degenne16.pdf

## Online learning and game theory. A quick overview with recent results and applications

Faure, P. Gaillard, B. Gaujal and V. Perchet

ESAIM: Proceedings and surveys, 2015

https://www.esaim-proc.org/articles/proc/pdf/2015/04/proc145114.pdf

## Batched bandit problems

V. Perchet, P. Rigollet, S. Chassang and E. Snowberg

Annals of Statistics, 2015

Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a by-product, we derive optimal policies with low switching cost for stochastic bandits.

## Gaussian Process Optimization with Mutual Information

E. Contal, V. Perchet and N. Vayatis

ICML 2014 erratum’

http://proceedings.mlr.press/v32/contal14.pdf

## Bounded regret in stochastic multi-armed bandits

S. Bubeck, V. Perchet and P. Rigollet

COLT 2013

http://proceedings.mlr.press/v30/Bubeck13.pdf

## The multi-armed bandit problem with covariates

Perchet and P. Rigollet

Annals of Statistics, 2013

https://projecteuclid.org/download/pdfview_1/euclid.aos/1366980562

## Exploiting Structure of Uncertainty for Efficient Combinatorial Semi-Bandits

P. Perrault, V. Perchet and M. Valko

ICML 2019

http://proceedings.mlr.press/v97/perrault19a/perrault19a.pdf

## Finding the Bandit in a Graph: Sequential Search-and-Stop

Perrault, V. Perchet and M. Valko

AIStats 2019

http://proceedings.mlr.press/v89/perrault19a/perrault19a.pdf

We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution. The agent can only examine vertices whose in-neighbors have already been examined. In this paper, we address a *learning* setting where we allow the agent to stop before having found the object and restart searching on a new independent instance of the same problem. Our goal is to maximize the total number of hidden objects found given a time budget. The agent can thus skip an instance after realizing that it would spend too much time on it. Our contributions are both to the *search theory* and *multi-armed bandits*. If the distribution is known, we provide a quasi-optimal and efficient stationary strategy. If the distribution is unknown, we additionally show how to sequentially approximate it and, at the same time, act near-optimally in order to collect as many hidden objects as possible.

## Combinatorial semi-bandit with known covariance

R. Degenne and V. Perchet

NIPS, 2016

The combinatorial stochastic semi-bandit problem is an extension of the classical multi-armed bandit problem in which an algorithm pulls more than one arm at each stage and the rewards of all pulled arms are revealed. One difference with the single arm variant is that the dependency structure of the arms is crucial. Previous works on this setting either used a worst-case approach or imposed independence of the arms. We introduce a way to quantify the dependency structure of the problem and design an algorithm that adapts to it. The algorithm is based on linear regression and the analysis develops techniques from the linear bandit literature. By comparing its performance to a new lower bound, we prove that it is optimal, up to a poly-logarithmic factor in the number of pulled arms.

# Online Learning

## Partial information

## Bandits with Side Observations: Bounded vs. Logarithmic Regret

R. Degenne, E. Garcelon and V. Perchet

UAI 2018

http://auai.org/uai2018/proceedings/papers/182.pdf

We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free. We prove that, no matter how small $\epsilon$ is the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/\epsilon)}{\Delta_i}$, up to multiplicative constant and $\log\log$ terms. We also prove a matching lower-bound, stating that no reasonable algorithm can outperform this quantity.

## Internal regret with partial monitoring : Calibration-based optimal algorithms

V. Perchet

Journal of Machine Learning Research 2011

http://www.jmlr.org/papers/volume12/perchet11a/perchet11a.pdf

## Calibration and Internal no-Regret with Random Signals

Perchet

ALT 2009

We provide consistent random algorithms for sequential decision under partial monitoring*, i.e.,* when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Vorono\ »i diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal — as well as the usual external — regret at stage $n$ by $O(n^{-1/3})$, which is known to be optimal.

*@inproceedings{**CalibrationRegret**,**author = {Perchet, Vianney},**title = {Calibration and Internal No-Regret with Random Signals},**year = {2009},**isbn = {3642044131},**publisher = {Springer-Verlag},**address = {Berlin, Heidelberg},**booktitle = {Proceedings of the 20th International Conference on Algorithmic Learning Theory},**pages = {68–82},**numpages = {15},**location = {Porto, Portugal},**series = {ALT’09}**}*

## Online Learning and Blackwell Approachability with Partial Monitoring: Optimal Convergence Rates

J. Kwon and V. Perchet,

AIStats,2017

Blackwell approachability is an online learning setup generalizing the classical problem of regret minimization by allowing for instance multi-criteria optimization, global (online) optimization of a convex loss, or online linear optimization under some cumulative constraint. We consider *partial monitoring* where the decision maker does not necessarily observe the outcomes of his decision (unlike the traditional regret/bandit literature). Instead, he receives a random signal correlated to the decision–outcome pair, or only to the outcome.

We construct, for the first time, approachability algorithms with convergence rate of order $O(T^{-1/2})$ when the signal is independent of the decision and of order $O(T^{-1/3})$ in the case of general signals. Those rates are optimal in the sense that they cannot be improved without further assumption on the structure of the objectives and/or the signals.

## Set-Valued Approachability and Online Learning under Partial Monitoring

S. Mannor, V. Perchet and G. Stoltz

Journal of Machine Learning Research, 2014

http://jmlr.org/papers/volume15/mannor14a/mannor14a.pdf

Approachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup. We develop a variant of approachability for games where there is ambiguity in the obtained reward: it belongs to a set rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop a simple and generally efficient strategy (i.e., with constant per-step complexity) for this setup. As an important example, we instantiate our general strategy to the case when external regret or internal regret is to be minimized under partial monitoring.

## A Primal Condition for Approachability with Partial Monitoring

Mannor, V. Perchet and G. Stoltz

Journal of Dynamics and Games, 2014

https://www.aimsciences.org/article/doi/10.3934/jdg.2014.1.447

In approachability with full monitoring there are two types of conditions that are known to be equivalent for convex sets: a primal and a dual condition. The primal one is of the form: a set $\mathcal{C}$ is approachable if and only all containing half-spaces are approachable in the one-shot game. The dual condition is of the form: a convex set $\mathcal{C}$ is approachable if and only if it intersects all payoff sets of a certain form. We consider approachability in games with partial monitoring. In previous works, we provided a dual characterization of approachable convex sets and we also exhibited efficient strategies in the case where $\mathcal{C}$ $ is a polytope. In this paper we provide primal conditions on a convex set to be approachable with partial monitoring. They depend on a modified reward function and lead to approachability strategies based on modified payoff functions and that proceed by projections similarly to Blackwell’s (1956) strategy. This is in contrast with previously studied strategies in this context that relied mostly on the signalling structure and aimed at estimating well the distributions of the signals received. Our results generalize classical results by Kohlberg (1979) and apply to games with arbitrary signalling structure as well as to arbitrary convex sets.

@articleInfo{PrimalApproach,

title = « A primal condition for approachability with partial monitoring »,

journal = « Journal of Dynamics & Games »,

volume = « 1 »,

number = « 2164-6066_2014_3_447 »,

pages = « 447 »,

year = « 2014 »,

issn = « 2164-6066 »,

doi = « 10.3934/jdg.2014.1.447 »,

url = « http://aimsciences.org//article/id/d67ce683-6ebc-4320-aeb4-f9f3b1e73cb6 »,

author = « Shie Mannor and Vianney Perchet and Gilles Stoltz »

}

## Approachability in unknown games: Online learning meets multi-objective optimization

Mannor, V. Perchet and G. Stoltz

COLT 2014

http://proceedings.mlr.press/v35/mannor14.pdf

In the standard setting of approachability there are two players and a target set. The players play a repeated vector-valued game where one of them wants to have the average vector-valued payoff converge to the target set which the other player tries to exclude. We revisit the classical setting and consider the setting where the player has a preference relation between target sets: she wishes to approach the smallest (« best ») set possible given the observed average payoffs in hindsight. Moreover, as opposed to previous works on approachability, and in the spirit of online learning, we do not assume that there is a known game structure with actions for two players. Rather, the player receives an arbitrary vector-valued reward vector at every round. We show that it is impossible, in general, to approach the best target set in hindsight. We further propose a concrete strategy that approaches a non-trivial relaxation of the best-in-hindsight given the actual rewards. Our approach does not require projection onto a target set and amounts to switching between scalar regret minimization algorithms that are performed in episodes.

@InProceedings{MultiObjLearning, title = {Approachability in unknown games: {O}nline learning meets multi-objective optimization}, author = {Shie Mannor and Vianney Perchet and Gilles Stoltz}, booktitle = {Proceedings of The 27th Conference on Learning Theory}, pages = {339–355}, year = {2014}, editor = {Maria Florina Balcan and Vitaly Feldman and Csaba Szepesvári}, volume = {35}, series = {Proceedings of Machine Learning Research}, address = {Barcelona, Spain}, month = {13–15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v35/mannor14.pdf}, url = {http://proceedings.mlr.press/v35/mannor14.html}}

## Approachability of convex sets in games with random signals

Perchet

Journal of Optimization Theory and Applications, 2011

https://www.aimsciences.org/article/doi/10.3934/jdg.2014.1.447

We provide a necessary and sufficient condition under which a convex set is approachable in a game with partial monitoring, *i.e.,* where players do not observe their opponents’ moves but receive random signals. This condition is an extension of Blackwell’s Criterion in the full monitoring framework, where players observe at least their payoffs. When our condition is fulfilled, we construct explicitly an approachability strategy, derived from a strategy satisfying some *internal consistency* property in an auxiliary game. We also provide an example of a convex set, that is neither (weakly)-approachable nor (weakly)-excludable, a situation that cannot occur in the full monitoring case. We finally apply our result to describe an $\varepsilon$-optimal strategy of the uninformed player in a zero-sum repeated game with incomplete information on one side.

@article{10.1007/s10957-011-9797-3,

author = {Perchet, Vianney},

title = {Approachability of Convex Sets in Games with Partial Monitoring},

year = {2011},

publisher = {Plenum Press},

address = {USA},

volume = {149},

number = {3},

issn = {0022-3239},

url = {https://doi.org/10.1007/s10957-011-9797-3},

doi = {10.1007/s10957-011-9797-3},

journal = {J. Optim. Theory Appl.},

pages = {665–677}

}

# Online Learning

## Approachability

## Approachability of Convex Sets in Generalized Quitting Games

Flesch, R. Laraki and V. Perchet

Games and Economic Behavior, 2017

We consider Blackwell approachability, a very powerful and geometric tool in game theory, used for example to design strategies of the uninformed player in repeated games with incomplete information. We extend this theory to « generalized quitting games », a class of repeated stochastic games in which each player may have quitting actions, such as the Big-Match. We provide three simple geometric and strongly related conditions for the weak approachability of a convex target set. The first is sufficient: it guarantees that, for any fixed horizon, a player has a strategy ensuring that the expected time-average payoff vector converges to the target set as horizon goes to infinity. The third is necessary: if it is not satisfied, the opponent can weakly exclude the target set. In the special case where only the approaching player can quit the game (Big-Match of type I), the three conditions are equivalent and coincide with Blackwell’s condition. Consequently, we obtain a full characterization and prove that the game is weakly determined – every convex set is either weakly approachable or weakly excludable. In games where only the opponent can quit (Big-Match of type II), none of our conditions is both sufficient and necessary for weak approachability. We provide a continuous time sufficient condition using techniques coming from differential games, and show its usefulness in practice, in the spirit of Vieille’s seminal work for weak approachability. Finally, we study uniform approachability where the strategy should not depend on the horizon and demonstrate that, in contrast with classical Blackwell approachability for convex sets, weak approachability does not imply uniform approachability.

@article{QuittingGames,

title = « Approachability of convex sets in generalized quitting games »,

journal = « Games and Economic Behavior »,

volume = « 108 »,

pages = « 411 – 431 »,

year = « 2018 »,

note = « Special Issue in Honor of Lloyd Shapley: Seven Topics in Game Theory »,

issn = « 0899-8256 »,

doi = « https://doi.org/10.1016/j.geb.2017.12.007 »,

url = « http://www.sciencedirect.com/science/article/pii/S0899825617302269 »,

author = « János Flesch and Rida Laraki and Vianney Perchet »

}

## Approachability of convex sets in games with random signals

V. Perchet

Journal of Dynamics and Games, 2014

https://www.aimsciences.org/article/doi/10.3934/jdg.2014.1.181

Blackwell approachability, regret minimization and calibration are three criteria used to evaluate a strategy (or an algorithm) in sequential decision problems, described as repeated games between a player and Nature. Although they have at first sight not much in common, links between them have been discovered: for instance, both consistent and calibrated strategies can be constructed by following, in some auxiliary game, an approachability strategy. We gather seminal and recent results, develop and generalize Blackwell’s elegant theory in several directions. The final objectives is to show how approachability can be used as a basic powerful tool to exhibit a new class of intuitive algorithms, based on simple geometric properties. In order to be complete, we also prove that approachability can be seen as a by-product of the very existence of consistent or calibrated strategies.

@articleInfo{2164-6066_2014_2_181,

title = « Approachability, regret and calibration: Implications and equivalences »,

journal = « Journal of Dynamics & Games »,

volume = « 1 »,

number = « 2164-6066_2014_2_181 »,

pages = « 181 »,

year = « 2014 »,

note = « »,

issn = « 2164-6066 »,

doi = « 10.3934/jdg.2014.1.181 »,

url = « http://aimsciences.org//article/id/873959ac-8911-4a81-abb5-7562a8ffa925 »,

author = « Vianney Perchet »

}

## On a Unified Framework for Approachability with Full or Partial Monitoring

V. Perchet and M. Quincampoix

Mathematics of Operations Research, 2014

We unify standard frameworks for approachability both in full or partial monitoring by defining a new abstract game, called *the purely informative game*, where the outcome at each stage is the maximal information players can obtain, represented as some probability measure. Objectives of players can be rewritten as the convergence (to some given set) of sequences of averages of these probability measures. We obtain new results extending the approachability theory developed by Blackwell moreover this new abstract framework enables us to characterize approachable sets with, as usual, a remarkably simple and clear reformulation for convex sets. Translated into the original games, those results become the first necessary and sufficient condition under which an arbitrary set is approachable, and they cover and extend previous known results for convex sets. We also investigate a specific class of games where, thanks to some unusual definition of averages and convexity, we again obtain a complete characterization of approachable sets along with rates of convergence.

@article{UnifiedFullPartial,

author = {Perchet, Vianney and Quincampoix, Marc},

title = {On a Unified Framework for Approachability with Full or Partial Monitoring},

journal = {Mathematics of Operations Research},

volume = {40},

number = {3},

pages = {596-610},

year = {2015},

doi = {10.1287/moor.2014.0686},

URL = { https://doi.org/10.1287/moor.2014.0686},

eprint = {https://doi.org/10.1287/moor.2014.0686}

}

## Exponential weight approachability, applications to calibration and regret minimization,

Vianney Perchet

Dynamic Games And Applications, 2014

https://link.springer.com/content/pdf/10.1007/s13235-014-0119-x.pdf

Basic ideas behind the *exponential weight algorithm* (designed for aggregation or minimization of regret) can be transposed into the theory of Blackwell approachability. Using them, we develop an algorithm — that we call *exponential weight approachability* — bounding the distance of average vector payoffs to some product set, with a logarithmic dependency in the dimension of the ambient space. The classic strategy of Blackwell would get instead a polynomial dependency. This result has important consequences, in several frameworks that emerged both in game theory and machine learning. The most striking application is the construction of algorithms that are calibrated with respect to the family of all balls (we treat in detail the case of the uniform norm), with dimension independent and optimal, up to logarithmic factors, rates of convergence. Calibration can also be achieved with respect to all Borel sets, covering and improving the previously known results. Exponential weight approachability can also be used to design an optimal and natural algorithm that minimizes refined notions of regret.

@article{cite-key,

Author = {Perchet, Vianney},

Doi = {10.1007/s13235-014-0119-x},

Isbn = {2153-0793},

Journal = {Dynamic Games and Applications},

Number = {1},

Pages = {136–153},

Title = {Exponential Weight Approachability, Applications to Calibration and Regret Minimization},

Url = {https://doi.org/10.1007/s13235-014-0119-x},

Volume = {5},

Year = {2015}

}

## On a Unified Framework for Approachability with Full or Partial Monitoring

V. Perchet and M. Quincampoix

Mathematics of Operations Research, 2014

We unify standard frameworks for approachability both in full or partial monitoring by defining a new abstract game, called *the purely informative game*, where the outcome at each stage is the maximal information players can obtain, represented as some probability measure. Objectives of players can be rewritten as the convergence (to some given set) of sequences of averages of these probability measures. We obtain new results extending the approachability theory developed by Blackwell moreover this new abstract framework enables us to characterize approachable sets with, as usual, a remarkably simple and clear reformulation for convex sets. Translated into the original games, those results become the first necessary and sufficient condition under which an arbitrary set is approachable, and they cover and extend previous known results for convex sets. We also investigate a specific class of games where, thanks to some unusual definition of averages and convexity, we again obtain a complete characterization of approachable sets along with rates of convergence.

@article{UnifiedFullPartial,

author = {Perchet, Vianney and Quincampoix, Marc},

title = {On a Unified Framework for Approachability with Full or Partial Monitoring},

journal = {Mathematics of Operations Research},

volume = {40},

number = {3},

pages = {596-610},

year = {2015},

doi = {10.1287/moor.2014.0686},

URL = { https://doi.org/10.1287/moor.2014.0686},

eprint = {https://doi.org/10.1287/moor.2014.0686}

}

# Eco - ML

## Auctions and Mechanism

## Adversarial learning in revenue-maximizing auctions

Nedelec, J. Baudet, V. Perchet and N. El Karoui

Technical Report

We introduce a new numerical framework to learn optimal bidding strategies in repeated auctions when the seller uses past bids to optimize her mechanism. Crucially, we do not assume that the bidders know what optimization mechanism is used by the seller. We recover essentially all state-of-the-art analytical results for the single-item framework derived previously in the setup where the bidder knows the optimization mechanism used by the seller and extend our approach to multi-item settings, in which no optimal shading strategies were previously known. Our approach yields substantial increases in bidder utility in all settings. Our approach also has a strong potential for practical usage since it provides a simple way to optimize bidding strategies on modern marketplaces where buyers face unknown data-driven mechanisms.

@misc{Adversarial Learning,

title={Adversarial learning for revenue-maximizing auctions},

author={Thomas Nedelec and Jules Baudet and Vianney Perchet and Noureddine El Karoui},

year={2019},

eprint={1909.06806},

archivePrefix={arXiv},

}

## Robust Stackelberg buyers in repeated auctions

C. Calauzènes, T. Nedelec, V. Perchet and N. El Karoui

Technical Report

We consider the practical and classical setting where the seller is using an exploration stage to learn the value distributions of the bidders before running a revenue-maximizing auction in an exploitation phase. In this two-stage process, we exhibit practical, simple and robust strategies with large utility uplifts for the bidders. We quantify precisely the seller revenue against non-discounted buyers, complementing recent studies that had focused on impatient/heavily discounted buyers. We also prove the robustness of these shading strategies to sample approximation error of the seller, to bidder’s approximation error of the competition and to possible change of the mechanisms.

@misc{Stackelberg,

title={Robust Stackelberg buyers in repeated auctions},

author={Clément Calauzènes and Thomas Nedelec and Vianney Perchet and Noureddine El Karoui},

year={2019},

eprint={1905.13031},

archivePrefix={arXiv},

primaryClass={cs.GT}

}

## Learning to bid in revenue-maximizing auctions

T. Nedelec, V. Perchet and N. El Karoui

ICML 2019

We introduce a new numerical framework to learn optimal bidding strategies in repeated auctions when the seller uses past bids to optimize her mechanism. Crucially, we do not assume that the bidders know what optimization mechanism is used by the seller. We recover essentially all state-of-the-art analytical results for the single-item framework derived previously in the setup where the bidder knows the optimization mechanism used by the seller and extend our approach to multi-item settings, in which no optimal shading strategies were previously known. Our approach yields substantial increases in bidder utility in all settings. Our approach also has a strong potential for practical usage since it provides a simple way to optimize bidding strategies on modern marketplaces where buyers face unknown data-driven mechanisms.

@InProceedings{LearningToBid,

title = {Learning to bid in revenue-maximizing auctions},

author = {Nedelec, Thomas and Karoui, Noureddine El and Perchet, Vianney},

booktitle = {Proceedings of the 36th International Conference on Machine Learning},

pages = {4781–4789},

year = {2019},

editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},

volume = {97},

series = {Proceedings of Machine Learning Research},

address = {Long Beach, California, USA},

month = {09–15 Jun},

publisher = {PMLR},

pdf = {http://proceedings.mlr.press/v97/nedelec19a/nedelec19a.pdf},

url = {http://proceedings.mlr.press/v97/nedelec19a.html

}

## Dynamic Pricing with Finitely Many Unknown Valuations

N. Cesa-Bianchi, T. Cesari and V. Perchet

ALT 2019

Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate revenue maximization in stochastic dynamic pricing when the distribution of buyers’ private values is supported on an unknown set of points in $[0,1]$ of unknown cardinality $K$. This setting can be viewed as an instance of a stochastic $K$-armed bandit problem where the location of the arms (i.e., the $K$ unknown valuations) must be learned as well.

- In the distribution-free case, we show that our setting is just as hard as $K$-armed stochastic bandits: we prove that no algorithm can achieve a regret significantly better than $\sqrt{KT}$, (where $T$ is the time horizon) and present an efficient algorithm matching this lower bound up to logarithmic factors.
- In the distribution-dependent case, we show that for all $K > 2$ our setting is strictly harder than $K$-armed stochastic bandits by proving that it is impossible to obtain regret bounds that grow logarithmically in time or slower. On the other hand, when a lower bound $\gamma > 0$ on the smallest drop in the demand curve is known, we prove an upper bound on the regret of order $\big(1/\Delta + (\log\log T)/\gamma^2\big)\big(K\log T\big)$, where $\Delta$ is the gap between the revenue of the optimal valuation and that of the second-best valuation.

This is a significant improvement on previously known regret bounds for discontinuous demand curves, that are at best of order $\big(K^{12}/\gamma^8\big) \sqrt{T}$.

- When $K=2$ in the distribution-dependent case, the hardness of our setting reduces to that of a stochastic $2$-armed bandit: we prove that an upper bound of order $(\log T)/\Delta$ (up to $\log\log$ factors) on the regret can be achieved with no information on the demand curve.
- Finally, we show a $\mathcal{O}(\sqrt{T})$ upper bound on the regret for the setting in which the buyers’ decisions are nonstochastic, and the regret is measured with respect to the best between two fixed valuations one of which is known to the seller.

@InProceedings{DynamicPricing,

title = {Dynamic Pricing with Finitely Many Unknown Valuations},

author = {Cesa-Bianchi, Nicol\`o and Cesari, Tommaso and Perchet, Vianney},

booktitle = {Proceedings of the 30th International Conference on Algorithmic Learning Theory},

pages = {247–273},

year = {2019},

editor = {Garivier, Aur\’elien and Kale, Satyen},

volume = {98},

series = {Proceedings of Machine Learning Research},

address = {Chicago, Illinois},

month = {22–24 Mar},

publisher = {PMLR},

pdf = {http://proceedings.mlr.press/v98/cesa-bianchi19a/cesa-bianchi19a.pdf},

url = {http://proceedings.mlr.press/v98/cesa-bianchi19a.html},

abstract = {Motivated by posted price auctions where buyers are grouped in an unknown number of latent types characterized by their private values for the good on sale, we investigate regret minimization in stochastic dynamic pricing when the distribution of buyers’ private values is supported on an unknown set of points in $[0,1]$ of unknown cardinality $K$.

}

}

## Explicit shading strategies for repeated truthful auctions

M. Abeille, C. Calauzenes, N. El Karoui, T. Nedelec and V. Perchet

Technical report

With the increasing use of auctions in online advertising, there has been a large effort to study seller revenue maximization, following Myerson’s seminal work, both theoretically and practically. We take the point of view of the buyer in classical auctions and ask the question of whether she has an incentive to shade her bid even in auctions that are reputed to be truthful, when aware of the revenue optimization mechanism. We show that in auctions such as the Myerson auction or a VCG with reserve price set as the monopoly price, the buyer who is aware of this information has indeed an incentive to shade. Intuitively, by selecting the revenue maximizing auction, the seller introduces a dependency on the buyers’ distributions in the choice of the auction. We study in depth the case of the Myerson auction and show that a symmetric equilibrium exists in which buyers shade non-linearly what would be their first price bid. They then end up with an expected payoff that is equal to what they would get in a first price auction with no reserve price. We conclude that a return to simple first price auctions with no reserve price or at least non-dynamic anonymous ones is desirable from the point of view of both buyers, sellers and increasing transparency.

@misc{ShadingStrategies,

title={Explicit shading strategies for repeated truthful auctions},

author={Marc Abeille and Clément Calauzènes and Noureddine El Karoui and Thomas Nedelec and Vianney Perchet},

year={2018},

eprint={1805.00256},

archivePrefix={arXiv},

}

## Thresholding at the monopoly price: an agnostic way to improve bidding strategies in revenue-maximizing auctions

T. Nedelec, C. Calauzenes, V. Perchet and N. El Karoui

Technical report

In auction theory, a Vickrey–Clarke–Groves (VCG) auction is a type of sealed-bid auction of multiple items. Bidders submit bids that report their valuations for the items, without knowing the bids of the other people in the auction. The auction system assigns the items in a socially optimal manner: it charges each individual the harm they cause to other bidders. It also gives bidders an incentive to bid their true valuations, by ensuring that the optimal strategy for each bidder is to bid their true valuations of the items. (Thanks for this text, Wikipedia.) This paper shows that the same idea works for electronic commerce applications.

@misc{Thresholding,

title={Thresholding at the monopoly price: an agnostic way to improve bidding strategies in revenue-maximizing auctions},

author={Thomas Nedelec and Clément Calauzènes and Vianney Perchet and Noureddine El Karoui},

year={2018},

eprint={1808.06979},

archivePrefix={arXiv},

primaryClass={cs.GT}

}

## Online learning in repeated auctions

Weed, V. Perchet and P. Rigollet

COLT 2016

Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially, and bidders only learn (potentially noisy) information about a good’s value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical, and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.

@InProceedings{OnlineAuctions,

title = {Online learning in repeated auctions},

author = {Jonathan Weed and Vianney Perchet and Philippe Rigollet},

booktitle = {29th Annual Conference on Learning Theory},

pages = {1562–1583},

year = {2016},

editor = {Vitaly Feldman and Alexander Rakhlin and Ohad Shamir},

volume = {49},

series = {Proceedings of Machine Learning Research},

address = {Columbia University, New York, New York, USA},

month = {23–26 Jun},

publisher = {PMLR},

pdf = {http://proceedings.mlr.press/v49/weed16.pdf},

url = {http://proceedings.mlr.press/v49/weed16.html

}

# Eco - ML

## Multi-Player Bandits

## A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players

E. Boursier, E. Kaufmann, A. Mehrabian and V. Perchet

AIStats 2019

https://arxiv.org/pdf/1902.01239.pdf

We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs, and the involved players receive zero reward. We consider the challenging *heterogeneous* setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of leveraging forced collisions for implicit communication and that of performing matching eliminations. We present a finite-time analysis of our algorithm, giving the first sublinear minimax regret bound for this problem, and prove that if the optimal assignment of players to arms is unique, our algorithm attains the optimal $O(\ln(T))$ regret, solving an open question raised at NeurIPS 2018.

@misc{PracticalMMAB,

title={A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players},

author={Etienne Boursier and Emilie Kaufmann and Abbas Mehrabian and Vianney Perchet},

year={2019},

eprint={1902.01239},

archivePrefix={arXiv},

primaryClass={stat.ML}

}

## SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits

E. Boursier and V. Perchet

NeurIPS 2019

Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage. We present a decentralized algorithm that achieves the same performance as a centralized one, in contradiction the existing lower bounds for that problem. This is possible by « hacking » the standard model by constructing a communication protocol between players that deliberately enforces collisions, allowing them to share their information at a negligible cost. This motivates the introduction of a more appropriate dynamic setting without sensing, where similar communication protocols are no longer possible. However, we show that the logarithmic growth of the regret is still achievable for this model with a new algorithm.

@incollection{SicMMAB,

title = {SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits},

author = {Boursier, Etienne and Perchet, Vianney},

booktitle = {Advances in Neural Information Processing Systems 32},

editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},

pages = {12071–12080},

year = {2019},

publisher = {Curran Associates, Inc.},

url = {http://papers.nips.cc/paper/9375-sic-mmab-synchronisation-involves-communication-in-multiplayer-multi-armed-bandits.pdf}

}

## Selfish Robustness and Equilibria in Multi-Player Bandits

E. Boursier and V. Perchet

Technical report

https://arxiv.org/pdf/2002.01197.pdf

Motivated by cognitive radios, stochastic multi-player multi-armed bandits gained a lot of interest recently. In this class of problems, several players simultaneously pull arms and encounter a collision — with 0 reward — if some of them pull the same arm at the same time. While the cooperative case where players maximize the collective reward (obediently following some fixed protocol) has been mostly considered, robustness to malicious players is a crucial and challenging concern. Existing approaches consider only the case of adversarial *jammers* whose objective is to blindly minimize the collective reward. We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare. We provide the first algorithm robust to selfish players (*a.k.a.*, Nash equilibrium) with a logarithmic regret, when the arm reward is observed. When collisions are also observed, *Grim Trigger* type of strategies enable some implicit communication-based algorithms and we construct robust algorithms in two different settings: in the homogeneous case (with a regret comparable to the centralized optimal one) and in the heterogeneous case (for an adapted and relevant notion of regret). We also provide impossibility results when only the reward is observed or when arm means vary arbitrarily among players.

@misc{RobustMMAB,

title={Selfish Robustness and Equilibria in Multi-Player Bandits},

author={Etienne Boursier and Vianney Perchet},

year={2020},

eprint={2002.01197},

archivePrefix={arXiv},

primaryClass={cs.LG}

}

# Eco - ML

## Game Theory

## Finding Robust Nash equilibria

V. Perchet

ALT 2019

When agents or decision maker have uncertainties of underlying parameters, they may want to take « robust » decisions that are optimal against the worst possible values of those parameters, leading to some max-min optimization problems. With several agents in competition, in game theory, uncertainties are even more important and robust games – or game with non-unique prior – have gained a lot of interest recently, notably in auctions. The existence of robust equilibria in those games is guaranteed using standard fixed-point theorems as in classical finite games, so we focus on the problem of finding and characterizing them. Under some linear assumption on the structure of the uncertainties, we provide a polynomial reduction of the robust Nash problem to a standard Nash problem (on some auxiliary different game). This is possible by proving the existence of a lifting transforming robust linear programs into standard linear programs. In the general case, the above direct reduction is not always possible. However, we prove how to adapt the Lemke-Howson algorithm to find robust Nash equilibria in non-degenerate games.

@InProceedings{FindingRobust,

title = {Finding Robust Nash equilibria},

author = {Perchet, Vianney},

booktitle = {Proceedings of the 31st International Conference on Algorithmic Learning Theory},

pages = {725–751},

year = {2020},

editor = {Kontorovich, Aryeh and Neu, Gergely},

volume = {117},

series = {Proceedings of Machine Learning Research},

address = {San Diego, California, USA},

month = {08 Feb–11 Feb},

publisher = {PMLR},

pdf = {http://proceedings.mlr.press/v117/perchet20a/perchet20a.pdf},

url = {http://proceedings.mlr.press/v117/perchet20a.html}

}

## A differential game on Wasserstein space. Application to weak approachability with partial monitoring

V. Perchet and M. Quincampoix

Journal of Dynamics and Game, 2019

Studying continuous time counterpart of some discrete time dynamics is now a standard and fruitful technique, as some properties hold in both setups. In game theory, this is usually done by considering differential games on Euclidean spaces. This allows to infer properties on the convergence of values of a repeated game, to deal with the various concepts of approachability, etc. In this paper, we introduce a specific but quite abstract differential game defined on the Wasserstein space of probability distributions and we prove the existence of its value. Going back to the discrete time dynamics, we derive results on weak approachability with partial monitoring: we prove that any set satisfying a suitable compatibility condition is either weakly approachable or weakly excludable. We also obtain that the value for differential games with non-anticipative strategies is the same that those defined with a new concept of strategies very suitable to make links with repeated games.

@article{WeakApproach,

title = « A differential game on Wasserstein space. Application to weak approachability with partial monitoring »,

journal = « Journal of Dynamics & Games »,

volume = « 6 »,

pages = « 65 »,

year = « 2019 »,

issn = « 2164-6066 »,

doi = « 10.3934/jdg.2019005 »,

url = « http://aimsciences.org//article/id/25ec6a42-3c8b-446c-b136-74ef3e6344cf »,

author = « Perchet, Vianney and Quincampoix, Marc »

## A minmax theorem for concave-convex mappings with no regularity assumptions

V. Perchet and G. Vigeral

Journal of Convex Analysis, 2015

We prove that zero-sum games with a concave-convex payoff mapping defined on a product of convex sets have a value as soon as the payoff mapping is bounded and one of the sets is bounded and finite dimensional. In particular, no additional regularity assumption is required, such as lower or upper semi-continuity of the function or compactness of the sets. We provide several examples that show that our assumptions are minimal.

@article{MinMax,

author = {Perchet, Vianney and Vigeral, Guillaume},

title = { A Minmax Theorem for Concave-convex Mappings with no Regularity Assumptions },

journal = { Journal of Convex Analysis },

volume = {22},

pages = {537-540},

year = {2015},

URL = { http://www.heldermann.de/JCA/JCA22/JCA222/jca22025.htm},

}

## A note on robust Nash equilibria in games with uncertainties

Single author

Rairo Operations Research, 2014

In this short note, we investigate the framework where agents or players have some uncertainties upon their payoffs or losses, the behavior (or the type, number or any other characteristics) of other players. More specifically, we introduce an extension of the concept of Nash equilibria that generalize different solution concepts called by their authors, and depending on the context, either as robust, ambiguous, partially specified or with uncertainty aversion. We provide a simple necessary and sufficient condition that guarantees its existence and we show that it is actually a selection of conjectural (or self-confirming) equilibria. We finally conclude by how this concept can and should be defined in games with partial monitoring in order to preserve existence properties.

@article{NoteRobust,

author = {Perchet, Vianney},

title = {A note on robust Nash equilibria with uncertainties},

journal = {RAIRO – Operations Research – Recherche Op\’erationnelle},

publisher = {EDP-Sciences},

volume = {48},

number = {3},

year = {2014},

pages = {365-371},

doi = {10.1051/ro/2014001},

url = {http://www.numdam.org/item/RO_2014__48_3_365_0}

}

# Optimization, RL and Others

## Optimization and Active Learning

## An adaptive stochastic optimization algorithm for resource allocation

X. Fontaine, S. Mannor and V. Perchet

ALT 2019

We consider the classical problem of sequential resource allocation where a decision maker must repeatedly divide a budget between several resources, each with diminishing returns. This can be recast as a specific stochastic optimization problem where the objective is to maximize the cumulative reward, or equivalently to minimize the regret. We construct an algorithm that is *adaptive* to the complexity of the problem, expressed in term of the regularity of the returns of the resources, measured by the exponent in the Łojasiewicz inequality (or by their universal concavity parameter). Our parameter-independent algorithm recovers the optimal rates for strongly-concave functions and the classical fast rates of multi-armed bandit (for linear reward functions). Moreover, the algorithm improves existing results on stochastic optimization in this regret minimization setting for intermediate cases.

@InProceedings{AdaptiveAllocation,

title = {An adaptive stochastic optimization algorithm for resource allocation},

author = {Fontaine, Xavier and Mannor, Shie and Perchet, Vianney},

booktitle = {Proceedings of the 31st International Conference on Algorithmic Learning Theory},

pages = {319–363},

year = {2020},

editor = {Kontorovich, Aryeh and Neu, Gergely},

volume = {117},

series = {Proceedings of Machine Learning Research},

address = {San Diego, California, USA},

month = {08 Feb–11 Feb},

publisher = {PMLR},

pdf = {http://proceedings.mlr.press/v117/fontaine20a/fontaine20a.pdf},

url = {http://proceedings.mlr.press/v117/fontaine20a.html}

}

## Utility/Privacy Trade-off through the lens of Optimal Transport

E. Boursier and V. Perchet

AIStats 2020

Strategic information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to increase some utility. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that optimize a natural trade-off between both objectives. We formalize this as an optimization problem where the objective mapping is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.

@misc{utilityprivacy,

title={Utility/Privacy Trade-off through the lens of Optimal Transport},

author={Etienne Boursier and Vianney Perchet},

year={2019},

eprint={1905.11148},

archivePrefix={arXiv},

primaryClass={stat.ML}

}

## Active Matrix Design in Linear Regression

X. Fontaine, P. Perrault, V. Perchet and M. Valko

submitted, 2019

We consider the problem of active linear regression where a decision maker has to choose between several covariates to sample in order to obtain the best estimate $\hat{\beta}$ of the parameter $\beta^{\star}$ of the linear model, in the sense of minimizing $\mathbb{E} \lVert\hat{\beta}-\beta^*\rVert^2$. Using bandit and convex optimization techniques we propose an algorithm to define the sampling strategy of the decision maker and we compare it with other algorithms. We provide theoretical guarantees of our algorithm in different settings, including a $\mathcal{O}(T^{-2})$ regret bound in the case where the covariates form a basis of the feature space, generalizing and improving existing results. Numerical experiments validate our theoretical findings.

@misc{ActiveLinear,

title={ Active Matrix Design in Linear Regression },

author={Xavier Fontaine and Pierre Perrault and Vianney Perchet and Michal Valko},

year={2019},

eprint={1906.08509},

archivePrefix={arXiv},

primaryClass={stat.ML}

}

## Bandit Optimization with Upper-Confidence Frank-Wolfe

Q. Berthet and V. Perchet

NIPS 2017

We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results.

@incollection{UCFrankWolfe,

title = {Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe},

author = {Berthet, Quentin and Perchet, Vianney},

booktitle = {Advances in Neural Information Processing Systems 30},

editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett},

pages = {2225–2234},

year = {2017},

publisher = {Curran Associates, Inc.},

url = {http://papers.nips.cc/paper/6817-fast-rates-for-bandit-optimization-with-upper-confidence-frank-wolfe.pdf}

}

## Highly-smooth zero-th order online optimization

F. Bach and V. Perchet

COLT) 2016

The minimization of convex functions which are only available through partial and noisy information is a key methodological problem in many disciplines. In this paper we consider convex optimization with noisy zero-th order information, that is noisy function evaluations at any desired point. We focus on problems with high degrees of smoothness, such as logistic regression. We show that as opposed to gradient-based algorithms, high-order smoothness may be used to improve estimation rates, with a precise dependence of our upper-bounds on the degree of smoothness. In particular, we show that for infinitely differentiable functions, we recover the same dependence on sample size as gradient-based algorithms, with an extra dimension-dependent factor. This is done for both convex and strongly-convex functions, with finite horizon and anytime algorithms. Finally, we also recover similar results in the online optimization setting.

@InProceedings{HighlySmooth,

title = {Highly-Smooth Zero-th Order Online Optimization},

author = {Francis Bach and Vianney Perchet},

booktitle = {29th Annual Conference on Learning Theory},

pages = {257–283},

year = {2016},

editor = {Vitaly Feldman and Alexander Rakhlin and Ohad Shamir},

volume = {49},

series = {Proceedings of Machine Learning Research},

address = {Columbia University, New York, New York, USA},

month = {23–26 Jun},

publisher = {PMLR},

pdf = {http://proceedings.mlr.press/v49/bach16.pdf},

url = {http://proceedings.mlr.press/v49/bach16.html}

}

# Optimization, RL and Others

## Reinforcement Learning

## Markov Decision Process for MOOC users behavioral inference,

F. Jarboui, C. Gruson-Daniel, A. Durmus, V. Rocchisiani, S.-H. Goulet Ebongue, A. Depoux, W. Kirschenmann and V. Perchet

Emoocs’19.

Studies on massive open online courses (MOOCs) users discuss the existence of typical profiles and their impact on the learning process of the students. However, defining the typical behaviors as well as classifying the users accordingly is a difficult task. In this paper we suggest two methods to model MOOC users’ behaviour given their log data. We model their behavior into a Markov Decision Process framework. We associate the user’s intentions with the MDP reward and argue that this allows us to classify them.

@InProceedings{10.1007/978-3-030-19875-6_9,

author= »Jarboui, Firas and Gruson-Daniel, C{\’e}lya and Durmus, Alain and Rocchisani, Vincent and Goulet Ebongue, Sophie-Helene and Depoux, Anneliese

and Kirschenmann, Wilfried and Perchet, Vianney »,

editor= »Calise, Mauro and Delgado Kloos, Carlos and Reich, Justin and Ruiperez-Valiente, Jose A. and Wirsing, Martin »,

title= »Markov Decision Process for MOOC Users Behavioral Inference »,

booktitle= »Digital Education: At the MOOC Crossroads Where the Interests of Academia and Business Converge »,

year= »2019″,

publisher= »Springer International Publishing »,

address= »Cham »,

pages= »70—80”,

isbn= »978-3-030-19875-6″

}

## A comparative study of counterfactual estimators

T. Nedelec, N. Le Roux and V. Perchet

‘What If?’ to ‘What Next?’ workshop, NIPS 2017

We provide a comparative study of several widely used off-policy estimators (*Empirical Average*, *Basic Importance Sampling* and *Normalized Importance Sampling*), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved.

@misc{Comparative,

title={A comparative study of counterfactual estimators},

author={Thomas Nedelec and Nicolas Le Roux and Vianney Perchet},

year={2017},

eprint={1704.00773},

archivePrefix={arXiv},

primaryClass={stat.ML},

howpublished = {appeared in Nips 2017 Workshop *What If? to What Next?*}

}

# Optimization, RL and Others

## Others

## Quantitative analysis of Dynamic Fault Trees based on the coupling of structure functions ure functions and Monte-Carlo simulation

G. Merle, J.-M. Roussel, J.-J. Lesage, V. Perchet and N. Vayatis),

Quality and Reliability Engineering International, 2014

https://arxiv.org/pdf/1905.11797.pdf

This paper focuses on the quantitative analysis of Dynamic Fault Trees (DFTs) by means of Monte Carlo simulation. In a previous article, we defined an algebraic framework allowing to determine the structure function of DFTs. We exploit this structure function and the minimal cut sequences that it allows to determine, to know the failure mode configuration of the system, which is an input of Monte Carlo simulation. We show that the results obtained are in good accordance with theoretical results and that some results, such as importance measures and sensitivity indexes, are not provided by common quantitative analysis and yet interesting. We finally illustrate our approach on a DFT example from the literature.

@article{DFT,

author = {Merle, G. and Roussel, J. -M. and Lesage, J. -J. and Perchet, V. and Vayatis, N.},

title = {Quantitative Analysis of Dynamic Fault Trees Based on the Coupling of Structure Functions and Monte Carlo Simulation},

journal = {Quality and Reliability Engineering International},

volume = {32},

number = {1},

pages = {7-18},

keywords = {Dynamic Fault Tree, quantitative analysis, Monte Carlo simulation},

doi = {10.1002/qre.1728},

url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.1728},

eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/qre.1728},

year = {2016}

}

## Repeated A/B Testing

N Cesa-Bianchi, T Cesari, Y Mansour and V. Perchet** **

Technical Report

A/B testing is of central importance in the industry, especially in the technology sector where companies run long sequences of A/B tests to optimize their products. Since the space of potential innovations is typically vast, the experimenter must make quick and good decisions without wasting too much time on a single A/B test in the sequence. In particular, discarding an innovation with a small benefit might be better in the long run than using many samples to precisely determine its value. In this work, we introduce a performance measure that captures this idea and design an efficient algorithm that performs almost as well as the best A/B strategy in a given set. As it turns out, a key technical difficulty that significantly affects the learning rates is the hardness of obtaining unbiased estimates of the strategy rewards.

@misc{RepeatedABtest,

title={Repeated A/B Testing},

author={Nicolò Cesa-Bianchi and Tommaso R. Cesari and Yishay Mansour and Vianney Perchet},

year={2019},

eprint={1905.11797},

archivePrefix={arXiv},

primaryClass={cs.LG}

}