    Offline Reinforcement Learning via Sequence Modeling for Vision-Based Robotic Grasping
    Citations: 0 · References: 19 · Related Papers: 10
    Abstract:
    The high cost of environmental interaction and low data efficiency limit the development of reinforcement learning for robotic grasping. This paper proposes an end-to-end robotic grasping method based on offline reinforcement learning via sequence modeling. The method conditions on the most recent n-step history to assist the agent in making decisions, with a predictive model that learns to predict actions directly from raw image inputs. Experimental results show that our method achieves a higher grasping success rate with less training data than traditional reinforcement learning algorithms in the offline setting.
    Keywords:
    Sequence modeling
    Offline reinforcement learning
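    A minimal sketch of the kind of sequence-model policy the abstract describes: a small CNN encodes each of the last n raw frames, a causal transformer attends over the n-step history, and the final token is regressed to a grasp action. The 4-step history, encoder size, 64x64 input, and 4-dimensional action are assumptions for illustration, not the authors' architecture.

# Sketch of an n-step, image-conditioned sequence policy (assumed dimensions).
import torch
import torch.nn as nn

class SequenceGraspPolicy(nn.Module):
    def __init__(self, n_steps=4, embed_dim=128, action_dim=4):
        super().__init__()
        # Per-frame image encoder: raw RGB -> embedding.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Causal self-attention over the n-step history.
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.pos = nn.Parameter(torch.zeros(1, n_steps, embed_dim))
        self.head = nn.Linear(embed_dim, action_dim)  # e.g. (x, y, z, gripper angle)

    def forward(self, frames):  # frames: (batch, n_steps, 3, H, W)
        b, n = frames.shape[:2]
        tokens = self.encoder(frames.flatten(0, 1)).view(b, n, -1) + self.pos
        mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
        hidden = self.temporal(tokens, mask=mask)
        return self.head(hidden[:, -1])  # action predicted from the latest frame

policy = SequenceGraspPolicy()
action = policy(torch.randn(2, 4, 3, 64, 64))  # two dummy 4-frame histories
print(action.shape)  # torch.Size([2, 4])

    In an offline setting this network would simply be trained by behavior-cloning-style regression of logged grasp actions against the stored image histories.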
    Offline reinforcement learning -- learning a policy from a batch of data -- is known to be hard for general MDPs. These results motivate the need to look at specific classes of MDPs where offline reinforcement learning might be feasible. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) and have limited impact on the remaining part of the state (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains including financial markets. We discuss algorithms that exploit the AIR property, and provide a theoretical analysis for an algorithm based on Fitted Q-Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real-world environments where the regularity holds.
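    For context, a minimal sketch of generic Fitted Q-Iteration on a fixed batch of transitions, the algorithm family analysed above; this is not the AIR-specific variant, and the toy data and linear function class are assumptions for illustration.

# Generic Fitted Q-Iteration over a logged batch (toy data, linear Q class).
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma, n_transitions = 3, 0.95, 500

# Batch of logged transitions (s, a, r, s') with 2-D states.
S  = rng.normal(size=(n_transitions, 2))
A  = rng.integers(n_actions, size=n_transitions)
R  = S[:, 0] + 0.1 * rng.normal(size=n_transitions)       # toy reward
S2 = S + rng.normal(scale=0.1, size=S.shape)               # toy dynamics

def features(states, actions):
    """Linear Q function class: state features crossed with a one-hot action."""
    phi = np.hstack([states, np.ones((len(states), 1))])   # bias term
    out = np.zeros((len(states), phi.shape[1] * n_actions))
    for a in range(n_actions):
        out[actions == a, a * phi.shape[1]:(a + 1) * phi.shape[1]] = phi[actions == a]
    return out

w = np.zeros(3 * n_actions)
for _ in range(50):                                        # FQI iterations
    # Greedy value at s' under the current Q estimate.
    q_next = np.stack([features(S2, np.full(n_transitions, a)) @ w
                       for a in range(n_actions)], axis=1)
    targets = R + gamma * q_next.max(axis=1)
    # Refit Q to the Bellman targets by least squares over the fixed batch.
    X = features(S, A)
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)

print("fitted weights:", np.round(w, 2))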
    The recent development of reinforcement learning (RL) has boosted the adoption of online RL for wireless radio resource management (RRM). However, online RL algorithms require direct interactions with the environment, which may be undesirable given the potential performance loss due to the unavoidable exploration in RL. In this work, we first investigate the use of offline RL algorithms in solving the RRM problem. We evaluate several state-of-the-art offline RL algorithms, including behavior constrained Q-learning (BCQ), conservative Q-learning (CQL), and implicit Q-learning (IQL), for a specific RRM problem that aims at maximizing a linear combination of sum and 5-percentile rates via user scheduling. We observe that the performance of offline RL for the RRM problem depends critically on the behavior policy used for data collection, and further propose a novel offline RL solution that leverages heterogeneous datasets collected by different behavior policies. We show that with a proper mixture of the datasets, offline RL can produce a near-optimal RL policy even when all involved behavior policies are highly suboptimal.
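    A minimal sketch of the dataset-mixing idea: transitions logged by several behavior policies are combined with chosen proportions before running any offline RL algorithm (BCQ/CQL/IQL) on the mixture. The toy scheduler names, field layout, and mixing weights are assumptions, not the paper's exact recipe.

# Mix heterogeneous logged datasets into one offline training stream (toy example).
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n, policy_quality):
    """Toy RRM-style transitions; 'reward' loosely stands in for the rate objective."""
    states  = rng.normal(size=(n, 8))            # e.g. channel gains / buffer states
    actions = rng.integers(4, size=n)            # which user to schedule
    rewards = policy_quality + 0.1 * rng.normal(size=n)
    return {"s": states, "a": actions, "r": rewards}

datasets = {
    "random_scheduler":      make_dataset(5000, policy_quality=0.2),
    "round_robin_scheduler": make_dataset(5000, policy_quality=0.5),
    "greedy_scheduler":      make_dataset(5000, policy_quality=0.7),
}
mix_weights = {"random_scheduler": 0.2, "round_robin_scheduler": 0.3,
               "greedy_scheduler": 0.5}          # assumed mixture, sums to 1

def sample_mixed_batch(batch_size=256):
    """Draw a training batch whose composition follows the mixture weights."""
    names = list(datasets)
    probs = np.array([mix_weights[n] for n in names])
    counts = rng.multinomial(batch_size, probs)
    parts = []
    for name, k in zip(names, counts):
        d = datasets[name]
        idx = rng.integers(len(d["r"]), size=k)
        parts.append(np.column_stack([d["s"][idx], d["a"][idx], d["r"][idx]]))
    return np.vstack(parts)

batch = sample_mixed_batch()
print(batch.shape)   # (256, 10): 8 state dims + action + reward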
    In goal-reaching reinforcement learning (RL), the optimal value function has a particular geometry, called quasimetric structure. This paper introduces Quasimetric Reinforcement Learning (QRL), a new RL method that utilizes quasimetric models to learn optimal value functions. Distinct from prior approaches, the QRL objective is specifically designed for quasimetrics, and provides strong theoretical recovery guarantees. Empirically, we conduct thorough analyses on a discretized MountainCar environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations.
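    A minimal sketch of a QRL-style objective under assumed simplifications (not the paper's exact parameterization): an asymmetric distance d(s, g) is pushed up on random (state, goal) pairs while observed one-step transitions are softly constrained to satisfy d(s, s') <= 1, i.e. the per-step cost.

# Toy quasimetric value learning with a soft transition constraint.
import torch
import torch.nn as nn

class ToyQuasimetric(nn.Module):
    """d(s, g) = sum(relu(W (g - s))): zero at s = g, asymmetric, triangle inequality holds."""
    def __init__(self, state_dim=4, width=32):
        super().__init__()
        self.W = nn.Linear(state_dim, width, bias=False)

    def forward(self, s, g):
        return torch.relu(self.W(g - s)).sum(dim=-1)

d = ToyQuasimetric()
opt = torch.optim.Adam(d.parameters(), lr=1e-2)

s  = torch.randn(512, 4)               # toy dataset of states
s2 = s + 0.1 * torch.randn_like(s)     # successor states (each step costs 1)

for step in range(200):
    goals = s[torch.randperm(len(s))]                   # random (state, goal) pairs
    push_apart = -d(s, goals).mean()                    # maximize estimated distances
    violation  = torch.relu(d(s, s2) - 1.0)             # transitions must stay <= 1
    loss = push_apart + 10.0 * (violation ** 2).mean()  # soft constraint penalty
    opt.zero_grad(); loss.backward(); opt.step()

print("mean d(s, s'):", d(s, s2).mean().item())
print("mean d(s, g) :", d(s, s[torch.randperm(len(s))]).mean().item())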
    With the success of offline reinforcement learning (RL), offline trained RL policies have the potential to be further improved when deployed online. A smooth transfer of the policy matters in safe real-world deployment. Besides, fast adaptation of the policy plays a vital role in practical online performance improvement. To tackle these challenges, we propose a simple yet efficient algorithm, Model-based Offline-to-Online Reinforcement learning (MOORe), which employs a prioritized sampling scheme that can dynamically adjust the offline and online data for smooth and efficient online adaptation of the policy. We provide a theoretical foundation for our algorithm's design. Experimental results on the D4RL benchmark show that our algorithm smoothly transfers from offline to online stages while enabling sample-efficient online adaptation, and also significantly outperforms existing methods.
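    A minimal sketch of a prioritized offline-to-online replay buffer in the spirit of the mixing described above. The specific rule used here (the online share grows with the amount of freshly collected data, up to a cap) is an assumption for illustration, not MOORe's exact scheme.

# Replay buffer that re-weights offline vs. online data as deployment proceeds.
import random

class OfflineOnlineBuffer:
    def __init__(self, offline_transitions, max_online_share=0.9):
        self.offline = list(offline_transitions)   # fixed logged dataset
        self.online = []                           # filled during online deployment
        self.max_online_share = max_online_share

    def add_online(self, transition):
        self.online.append(transition)

    def online_share(self):
        # Online data gets more weight as it accumulates, capped at max_online_share.
        if not self.online:
            return 0.0
        ratio = len(self.online) / (len(self.online) + len(self.offline))
        return min(self.max_online_share, 2.0 * ratio)   # the factor 2.0 boosts fresh data

    def sample(self, batch_size=256):
        k_online = round(batch_size * self.online_share())
        batch = random.choices(self.online, k=k_online) if self.online else []
        batch += random.choices(self.offline, k=batch_size - k_online)
        return batch

buf = OfflineOnlineBuffer(offline_transitions=range(10000))
for t in range(2000):
    buf.add_online(("online", t))
print("online fraction of a batch:", buf.online_share())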
    Non-learning-based motion and path planning for an Unmanned Aerial Vehicle (UAV) suffers from low computational efficiency, high memory consumption for mapping, and susceptibility to local optima. This article investigates the challenge of quadrotor control using offline reinforcement learning. By establishing a data-driven learning paradigm that operates without real-environment interaction, the proposed workflow offers a safer approach than traditional reinforcement learning, making it particularly suited for UAV control in industrial scenarios. The introduced algorithm evaluates dataset uncertainty and employs pessimistic estimation to foster offline deep reinforcement learning. Experiments highlight the algorithm's superiority over traditional online reinforcement learning methods, especially when learning from offline datasets. Furthermore, the article emphasizes the importance of a more general behavior policy. In evaluations, the trained policy demonstrated versatility by adeptly navigating diverse obstacles, underscoring its real-world applicability.
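    A minimal sketch of pessimistic value estimation from dataset uncertainty, in the spirit described above: an ensemble of Q estimates is combined as the mean minus a multiple of the ensemble spread, so actions that are poorly covered by the dataset are valued conservatively. The ensemble form, the beta coefficient, and the toy numbers are assumptions.

# Pessimistic action values from an ensemble of Q estimates (toy example).
import numpy as np

def pessimistic_q(q_ensemble, beta=1.0):
    """q_ensemble: (n_ensemble, n_actions) array of Q estimates for one state."""
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)        # large where the dataset disagrees / is sparse
    return mean - beta * std

# Action 0 is well covered (estimates agree); action 1 looks better on average
# but the ensemble disagrees, so the pessimistic rule prefers the reliable action.
q = np.array([[1.0, 1.6], [1.1, 0.2], [0.9, 2.5], [1.0, 0.1], [1.0, 2.1]])
scores = pessimistic_q(q, beta=1.0)
print("pessimistic values:", np.round(scores, 2), "-> chosen action:", scores.argmax())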
    In general, for a learner robot to acquire an optimized behavior policy through reinforcement learning in locomotion tasks with continuous state and action spaces, a large amount of trial-and-error experience is required in the environment. To overcome the low data efficiency of online reinforcement learning, offline reinforcement learning methods using offline experience datasets have been actively studied in recent years. In this study, we propose a hybrid reinforcement learning framework that can effectively utilize online experience data in addition to offline datasets, together with a Transformer-based policy network that reflects the temporal contextual information inherent in sequential experience data. In addition, to improve learning efficiency within the proposed hybrid reinforcement learning framework, a new priority sampling strategy is used to select a batch of training data from the trajectory replay buffer. We demonstrate the effectiveness and superiority of the proposed framework through various experiments on three different locomotion tasks provided by OpenAI Gym.
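    A minimal sketch of priority sampling over a trajectory replay buffer in the spirit of the strategy described above. Using the trajectory return as the priority is an assumption for illustration; the paper's exact priority rule is not reproduced here.

# Trajectory-level replay buffer with return-proportional sampling (assumed priority).
import numpy as np

rng = np.random.default_rng(0)

class TrajectoryReplayBuffer:
    def __init__(self, alpha=1.0):
        self.trajectories, self.priorities = [], []
        self.alpha = alpha                      # sharpness of prioritization

    def add(self, trajectory):
        """trajectory: list of (obs, action, reward) tuples (offline or online)."""
        ret = sum(step[2] for step in trajectory)
        self.trajectories.append(trajectory)
        self.priorities.append(max(ret, 1e-3))  # keep priorities positive

    def sample(self, batch_size=8):
        p = np.array(self.priorities) ** self.alpha
        p /= p.sum()
        idx = rng.choice(len(self.trajectories), size=batch_size, p=p)
        return [self.trajectories[i] for i in idx]

buf = TrajectoryReplayBuffer()
for _ in range(100):                            # fill with toy trajectories
    length = rng.integers(5, 20)
    buf.add([(None, None, float(rng.normal(loc=1.0))) for _ in range(length)])
print("sampled batch of", len(buf.sample()), "trajectories")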
    Patterns of operant emission produced by intermittent reinforcement schedules have been explored extensively by numerous investigators (3). The result has been a general conclusion that variable ratio reinforcement schedules are more applicable to social behavior and, in addition, produce higher and more stable operant emissions (2). While a rationale has been provided for the applicability of variable ratio reinforcement to social behavior, it is significant that there has not been an adequate explanation for the higher efficiency of variable ratio reinforcement. Such an explanation seems imperative in view of the general acceptance of the effects of differential reinforcement, which suggest that the emission of a particular operant, rather than others which would produce the same reinforcer, is some function of that operant's ability to produce the reinforcer in a greater amount, at a higher frequency, and with a higher probability. If differential reinforcement is logically faithful to the basic assumptions underlying operant theory, it seems that conclusions drawn about the efficiency of any intermittent schedule must be altered. Illustratively, given several different operants, all of which produce the same reinforcer and all linked to different intermittent reinforcement schedules, the future occurrence of any one of the operants should be a function of whether its production of the reinforcer more closely approximates continuous reinforcement than the others. Thus, under certain conditions, e.g., when intervals between reinforcements are sufficiently attenuated, an FI schedule might be much more efficient than VR schedules, VI schedules, or DRL schedules. Succinctly, regardless of the reinforcement schedule employed, the operant linked with the schedule producing the greatest amount, frequency, and probability of reinforcement should be the operant most likely to occur, and the schedule linked with that operant is defined as most efficient. Commensurately, the most important variable to be considered regarding reinforcement schedules is not the generic type, but rather the degree to which the schedule employed provides reinforcement on a continuous basis prior to the onset of satiation.
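    A minimal simulation of the argument's arithmetic: which schedule delivers more reinforcement depends on its parameters, not its generic type, so a sufficiently short fixed-interval (FI) schedule can out-deliver a lean variable-ratio (VR) schedule for a steady responder. The response rate and the specific schedule values (FI 10 s, VR 50) are assumptions for illustration.

# Reinforcers earned per hour under FI vs. VR for a constant response rate.
import random

random.seed(0)
RESPONSES_PER_MIN = 30          # assumed steady operant rate
SESSION_MIN = 60

def fi_reinforcers(interval_s):
    """FI: the first response after interval_s seconds since the last reinforcer pays off."""
    gap = 60.0 / RESPONSES_PER_MIN
    t, last_reinforcer, count = 0.0, -interval_s, 0
    while t < SESSION_MIN * 60:
        if t - last_reinforcer >= interval_s:   # this response is reinforced
            count += 1
            last_reinforcer = t
        t += gap                                 # time of the next response
    return count

def vr_reinforcers(mean_ratio):
    """VR: each response is reinforced with probability 1/mean_ratio."""
    responses = RESPONSES_PER_MIN * SESSION_MIN
    return sum(random.random() < 1.0 / mean_ratio for _ in range(responses))

print("FI 10 s:", fi_reinforcers(10), "reinforcers per hour")
print("VR 50  :", vr_reinforcers(50), "reinforcers per hour")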
    This study examined the effects of reinforcement and reinforcement plus information on both appropriate and inappropriate behavior in subjects provided with direct reinforcement and those seated adjacent to them. Four female kindergarten subjects who were of average intelligence were chosen on the basis of engaging in a relatively high percentage of inappropriate behavior. The subjects were randomly assigned to one of two pairs and, within each pair, one subject was randomly designated as the one to be administered direct reinforcement (target subject). The remaining subject in each pair (non-target subject) received no direct reinforcement but was seated adjacent to the target subject. Each pair of subjects was then exposed to seven experimental conditions: baseline, reinforcement for appropriate behavior, reversal, reinforcement for inappropriate behavior, reinforcement for appropriate behavior with information about the contingencies, reinforcement for inappropriate behavior with information about the contingencies, and reinforcement for appropriate behavior with information about the contingencies. Changes in the non-target subjects were observed as a function of witnessing a target subject receive reinforcement for appropriate behavior. When inappropriate behavior was reinforced in the target subjects, only slight changes were observed in the non-target subjects. Information about the contingencies increased the effectiveness of reinforcement in all subjects. This was particularly relevant to inappropriate behavior. The results are discussed with regard to the vicarious reinforcement literature and with regard to the efficacy of providing information along with reinforcement in order to augment it.
    To characterize the bonding behavior between reinforcement and concrete and make full use of the material properties, a large amount of research has been carried out on reinforcement-concrete systems. Existing studies mainly cover the reinforcement-concrete bond, reinforcement lap splices, and anchorage of reinforcement. The reinforcement-concrete bond test mainly measures the bond-slip curve between the two materials in order to determine the bond strength between reinforcement and concrete. The reinforcement lap test is mainly used to study the performance of the lap length of reinforcement in concrete; depending on whether the lapped bars are in contact with each other, it can take two forms: contact lap and indirect (non-contact) lap. The anchorage test of reinforcement is conducted to study how far the connection length between reinforcement and concrete can be reduced while still meeting the force requirements. According to a large number of tests, the bond strength of the reinforcement is affected by the shape of the reinforcement, the bar diameter, the thickness of the concrete protective layer (cover), the spacing of the reinforcement, the transverse reinforcement restraint, and the material properties of the reinforcement and concrete. This paper discusses the test methods, the influencing factors, and the gaps in existing research on reinforcement-concrete bond, lap, and anchorage properties.
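    A minimal worked example of how a pull-out style bond test is reduced to a bond strength figure: the average bond stress is the pull-out force spread over the bar-concrete contact area, tau = P / (pi * d * l_embed). The bar size, embedment length, and force below are illustrative assumptions, not data from the studies discussed.

# Average bond stress from a pull-out test (illustrative numbers).
import math

def average_bond_stress(pullout_force_kN, bar_diameter_mm, embedment_length_mm):
    """Average bond stress in MPa over the embedded bar surface."""
    contact_area_mm2 = math.pi * bar_diameter_mm * embedment_length_mm
    return pullout_force_kN * 1e3 / contact_area_mm2   # N / mm^2 = MPa

# Example: 16 mm bar embedded 80 mm (5 bar diameters), peak pull-out force 45 kN.
tau = average_bond_stress(45.0, 16.0, 80.0)
print(f"average bond stress ~ {tau:.1f} MPa")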