We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion. We propose an online Q-learning style algorithm and give a guarantee on its performance evaluated in retrospect against alternative policies. Unlike previous works, the guarantee depends critically on the variability of the uncertainty in the transition probabilities, but holds regardless of arbitrary changes in rewards and transition probabilities over time. Besides its intrinsic computational efficiency, this approach requires neither prior knowledge nor estimation of the transition probabilities.
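As a point of reference, a standard Q-learning style update can be sketched in a few lines; the step size, discount factor, and toy dimensions below are illustrative assumptions, and the sketch omits the mechanisms the paper adds to handle arbitrarily varying rewards and transitions:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning style step. In the arbitrarily varying setting, the
    reward r and the successor state s_next may be generated by reward and
    transition processes that change over time; note that the update itself
    never estimates the transition probabilities."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q

# Illustrative use with 5 states and 2 actions.
Q = np.zeros((5, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=3)
```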
Outage scheduling aims to define, over a horizon of several months to years, when the different components that need maintenance should be taken out of operation. Its objective is to minimize the expected operating cost while satisfying reliability-related constraints. We propose a distributed scenario-based chance-constrained optimization formulation for this problem. To tackle the tractability issues arising in large networks, we use machine learning to build a proxy that predicts the outcomes of power-system operation processes in this context. On the IEEE-RTS79 and IEEE-RTS96 networks, our solution obtains cheaper and more reliable plans than the alternative candidates.
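A minimal sketch of the scenario-based chance-constraint check, assuming a learned proxy is available as a callable; `plan_is_acceptable`, `proxy`, and `eps` are illustrative names, not the paper's API:

```python
def plan_is_acceptable(plan, scenarios, proxy, eps=0.05):
    """Scenario approximation of a chance constraint: accept the outage
    plan if the learned proxy predicts reliable operation in at least a
    (1 - eps) fraction of the sampled scenarios. proxy(plan, scenario)
    returns True when operation is predicted to be reliable."""
    n_ok = sum(proxy(plan, sc) for sc in scenarios)
    return n_ok >= (1 - eps) * len(scenarios)

# Toy usage with a trivial stand-in proxy.
scenarios = range(100)
proxy = lambda plan, sc: sc % 10 != 0   # "reliable" in 90% of scenarios
print(plan_is_acceptable("plan-A", scenarios, proxy, eps=0.05))  # False
```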
In the context of Multiple Instance Learning (MIL), we analyze the Single Instance (SI) learning objective. We show that when the data is unbalanced and the family of classifiers is sufficiently rich, the SI method is a useful learning algorithm. In particular, we show that a larger data imbalance, a quality typically perceived as negative, in fact implies better resilience of the algorithm to the statistical dependencies among the objects in bags. In addition, our results shed new light on some known issues with the SI method in the setting of linear classifiers, and we show that these issues are significantly less likely to occur in the setting of neural networks. We demonstrate our results on a synthetic dataset, and on the COCO dataset for the problem of patch classification with weak image-level labels derived from captions.
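The SI objective itself is simple enough to sketch directly; the use of scikit-learn's logistic regression is an illustrative choice of classifier family, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_single_instance(bags, bag_labels):
    """Single Instance (SI) objective: every instance simply inherits its
    bag's label, and an ordinary single-instance classifier is trained on
    the flattened data. A linear classifier is used here for concreteness;
    the analysis also covers richer families such as neural networks."""
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), lbl) for b, lbl in zip(bags, bag_labels)])
    return LogisticRegression(max_iter=1000).fit(X, y)

# Two toy bags of 2-D instances, one positive and one negative.
bags = [np.random.randn(4, 2) + 1.0, np.random.randn(6, 2) - 1.0]
clf = train_single_instance(bags, bag_labels=[1, 0])
```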
We study the neural-linear bandit model for solving sequential decision-making problems with high-dimensional side information. Neural-linear bandits leverage the representation power of deep neural networks and combine it with efficient exploration mechanisms, designed for linear contextual bandits, on top of the last hidden layer. Since the representation is optimized during learning, information regarding exploration with the old features is lost. Here, we propose the first limited-memory neural-linear bandit that is resilient to this phenomenon, which we term catastrophic forgetting. We evaluate our method on a variety of real-world datasets, including regression, classification, and sentiment-analysis tasks, and observe that our algorithm is resilient to catastrophic forgetting and achieves superior performance.
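A minimal neural-linear sketch, assuming fixed last-layer features: linear Thompson sampling runs on top of a feature map produced by a network. The limited-memory mechanism the paper proposes is omitted, and all names are illustrative:

```python
import numpy as np

class NeuralLinearTS:
    """Linear Thompson sampling on top of last-layer features phi(x).
    In the full method the features come from a deep network that keeps
    being optimized, which is what makes old exploration statistics
    stale (the catastrophic forgetting discussed above)."""

    def __init__(self, dim, lam=1.0, noise=0.5):
        self.A = lam * np.eye(dim)   # posterior precision
        self.b = np.zeros(dim)
        self.noise = noise

    def choose(self, arm_features):  # arm_features: (n_arms, dim)
        cov = np.linalg.inv(self.A)
        theta = np.random.multivariate_normal(cov @ self.b, self.noise**2 * cov)
        return int(np.argmax(arm_features @ theta))

    def update(self, phi, reward):
        self.A += np.outer(phi, phi)
        self.b += reward * phi

# Illustrative use with 5 arms and 8-dimensional features.
bandit = NeuralLinearTS(dim=8)
arm = bandit.choose(np.random.randn(5, 8))
```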
Inspired by cognitive radio networks, we consider a setting where multiple users share several channels, modeled as a multi-user multi-armed bandit (MAB) problem. The characteristics of each channel are unknown and differ across users. Each user can choose between the channels, but her success depends on the particular channel chosen as well as on the selections of the other users: if two users select the same channel, their messages collide and neither manages to send any data. Our setting is fully distributed, so there is no central control. As in many communication systems, the users cannot set up a direct communication protocol, so information exchange must be kept to a minimum. We develop an algorithm for learning a stable configuration for the multi-user MAB problem, and we provide both convergence guarantees and experiments inspired by real communication networks, including comparisons to state-of-the-art algorithms.
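The collision model can be sketched in a few lines; the round structure and names below are illustrative and do not describe the learning algorithm itself:

```python
import numpy as np
from collections import Counter

def play_round(choices, success_prob):
    """One round of the multi-user channel model: if two or more users
    pick the same channel, their transmissions collide and all of them
    receive reward 0; otherwise user u succeeds on channel c with her
    own (unknown to her) probability success_prob[u, c]."""
    load = Counter(choices)
    rng = np.random.default_rng()
    return [0.0 if load[c] > 1 else float(rng.random() < success_prob[u, c])
            for u, c in enumerate(choices)]

# 3 users, 4 channels, user-dependent channel qualities.
success_prob = np.random.default_rng(0).random((3, 4))
print(play_round([0, 0, 2], success_prob))  # users 0 and 1 collide
```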
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL), in which a surrogate problem that restricts consecutive policies to be 'close' to one another is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish an $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL showing that better rates are attainable when the instantaneous cost or reward is regularized.
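In the tabular planning setting, KL-based trust regions lead to a per-state multiplicative (mirror-descent style) update, which can be sketched as follows; the step size `eta` and the toy numbers are illustrative assumptions:

```python
import numpy as np

def trust_region_step(pi_s, q_s, eta):
    """One per-state update: pi'(a|s) is proportional to
    pi(a|s) * exp(eta * Q(s, a)). The KL term in the surrogate is what
    produces this multiplicative form; eta plays the role of the
    adaptive scaling / step size."""
    logits = np.log(pi_s) + eta * q_s
    logits -= logits.max()          # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

pi_s = np.array([0.5, 0.3, 0.2])    # current policy at state s
q_s = np.array([1.0, 0.0, -1.0])    # Q-values at state s
print(trust_region_step(pi_s, q_s, eta=0.5))
```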
Low-Density Parity-Check (LDPC) codes are one of the most powerful classes of error-control codes known to date. These codes have been considered for many recent digital communication applications. In this dissertation, we propose stochastic decoding of state-of-the-art LDPC codes and demonstrate that it is a competitive alternative to practical LDPC decoding algorithms.

In stochastic decoding, probabilities are represented as streams of random bits using Bernoulli sequences, in which the information is contained in the statistics of the bit stream. This representation results in low-hardware-complexity processing nodes that perform computationally intensive operations. However, stochastic decoding is prone to the acute problem of latching. This problem is caused by correlated bit streams within cycles of the code's factor graph, and it significantly deteriorates the performance of stochastic LDPC decoders.

We propose edge memories, tracking forecast memories, and majority-based tracking forecast memories to address the latching problem. These units efficiently extract the evolving statistics of stochastic bit streams and rerandomize them to disrupt latching. To the best of our knowledge, these methods are the first successful methods for stochastic decoding of state-of-the-art LDPC codes.

We present novel decoder architectures and report on several hardware implementations. The most advanced reported implementation is a stochastic decoder that decodes the (2048,1723) LDPC code from the IEEE 802.3an standard. To the best of our knowledge, this decoder is the most silicon-area-efficient and, with a maximum core throughput of 61.3 Gb/s, is one of the fastest fully parallel soft-decision LDPC decoders reported in the literature. We demonstrate the performance of this decoder in low bit-error-rate regimes.

In addition to stochastic LDPC decoding, we propose the novel application of the stochastic approach to joint decoding of LDPC codes and partial-response channels, which are considered in practical magnetic-recording applications. Finally, we investigate the application of the stochastic approach to decoding linear block codes with high-density parity-check matrices on factor graphs. We consider Reed-Solomon, Bose-Chaudhuri-Hocquenghem, and block turbo codes.
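The bit-stream representation underlying stochastic decoding can be illustrated with a toy two-input check node; the stream length and probabilities below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

def bernoulli_stream(p, n):
    # A probability p is represented by a random bit stream whose
    # statistics (fraction of ones) encode p.
    return rng.random(n) < p

# A degree-3 check node reduces to a bitwise XOR of its two incoming
# streams: the output stream's mean approximates the probability that
# the parity of the inputs is 1, i.e. p1*(1-p2) + p2*(1-p1). Latching
# arises when streams like these become correlated inside graph cycles.
s1 = bernoulli_stream(0.8, 100_000)
s2 = bernoulli_stream(0.6, 100_000)
print((s1 ^ s2).mean())   # approx. 0.8*0.4 + 0.6*0.2 = 0.44
```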
The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker must choose an arm at each time step, upon which she receives a reward. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. The MAB problem has been studied extensively, particularly under the assumption that the arms' reward distributions are stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we term Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, derive theoretical guarantees, and demonstrate the algorithms in simulations.
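A minimal sketch of the rotting reward model, assuming an exponential decay profile for concreteness (the framework allows general non-increasing decay functions):

```python
import numpy as np

def rotting_pull(base_reward, n_pulls, decay=0.8):
    """Rotting-bandit reward model: the expected reward of an arm is a
    non-increasing function of how many times it has been pulled. The
    exponential decay used here is only one example profile."""
    mean = base_reward * decay ** n_pulls
    return mean + np.random.default_rng().normal(scale=0.1)  # noisy observation

# Pulling the same arm repeatedly yields a decaying expected reward.
print([round(0.9 * 0.8 ** n, 3) for n in range(5)])  # 0.9, 0.72, 0.576, ...
```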
We study classification problems where features are corrupted by noise and where the magnitude of the noise in each feature is influenced by the resources allocated to its acquisition. This is the case, for example, when multiple sensors share a common resource (power, bandwidth, attention, etc.). We develop a method for computing the optimal resource allocation for a variety of scenarios and derive theoretical bounds concerning the benefit that may arise from non-uniform allocation. We further demonstrate the effectiveness of the developed method in simulations.
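A toy version of the acquisition model, assuming the noise standard deviation shrinks as the inverse square root of the allocated resource; the proportional allocation rule is a placeholder for illustration, not the optimal allocation derived in the paper:

```python
import numpy as np

def observe(x, allocation, base_sigma=1.0, rng=None):
    """Noisy acquisition: feature i is observed with additive Gaussian
    noise of std base_sigma / sqrt(allocation[i]); spending more of the
    shared resource on a feature makes its measurement cleaner."""
    rng = rng or np.random.default_rng()
    return x + rng.normal(scale=base_sigma / np.sqrt(allocation))

def proportional_allocation(importance, budget):
    # Placeholder rule: allocate in proportion to an assumed-known
    # importance score while respecting the total budget.
    w = np.abs(importance)
    return budget * w / w.sum()

x = np.array([1.0, -2.0, 0.5])
alloc = proportional_allocation(np.array([3.0, 1.0, 1.0]), budget=10.0)
print(observe(x, alloc))
```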