Stochastic Multi-path Routing Problem with Non-stationary Rewards: Building PayU's Dynamic Routing

2018 
Payment transaction engine at PayU processes multimillion trans- actions every day through multiple payment gateways. Routing a transaction through an appropriate payment gateway is crucial to the engine for optimizing the availability and cost. The problem is that every transaction needs to choose one of K available payment gateways characterized by an unknown probability reward distri- bution. The reward for a gateway is a combination of its health and cost factors. The reward for a gateway is only realized when transaction is processed by the gateway i.e. by its success or failure. The objective of dynamic routing is to maximize the cumulative expected rewards over some given horizon of transactions' life. To do this, the dynamic switching system needs to acquire informa- tion about gateways (exploration) while simultaneously optimizing immediate rewards by selecting the best gateway at the moment (exploitation); the price paid due to this trade o is referred to as the regret. The main objective is to minimize the regret and maximize the rewards. The basic idea is to choose a gateway according to its probability of being the best gateway. The routing problem is a direct formulation of reinforcement learning (RL) problem. In an RL problem, an agent interacts with a dynamic, stochastic, and incompletely known environment, with the goal of finding an action-selection strategy, or policy, that optimizes some long-term performance measure. Thompson Sampling algorithm has experimentally been shown to be close to optimal.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    23
    References
    2
    Citations
    NaN
    KQI
    []