Risk-Sensitive Piecewise-Linear Policy Iteration for Stochastic Shortest Path Markov Decision Processes

2020 
A Markov Decision Process (MDP) is commonly used to model a sequential decision-making problem in which an agent interacts with an uncertain environment while seeking to minimize the expected cost accumulated along the process. If the process horizon is infinite, a discount factor \(\gamma \in [0, 1]\) is used to indicate the importance the agent gives to future states. If the agent’s mission is to reach a goal state, the process becomes a Stochastic Shortest Path MDP (SSP-MDP), the de facto model used for probabilistic planning in AI. Although several efficient solutions have been proposed to solve SSP-MDPs, little research has been carried out on the treatment of “risk” in such processes. A Risk-Sensitive MDP (RS-MDP) allows modeling the agent’s risk-averse and risk-prone attitudes by including a risk factor and a discount factor in the MDP definition. The proofs of convergence of known dynamic-programming solutions adapted for RS-MDPs, such as risk-sensitive value iteration (VI) and risk-sensitive policy iteration (PI), rely on the discount factor. However, when solving an SSP-MDP we look for a proper policy, i.e. a policy that guarantees reaching the goal while minimizing the accumulated expected cost, which is naturally modeled without a discount factor. Moreover, it has been shown that the discount factor can modify the chosen risk attitude when solving a risk-sensitive SSP-MDP. Thus, in this work we aim to formally prove the convergence of the PI algorithm for a Risk-Sensitive SSP-MDP based on operators that use a piecewise-linear transformation function, without a discount factor. We also run experiments in the benchmark River domain showing how the intended risk attitude, over an interval ranging from extreme risk-averse to extreme risk-prone, varies with the discount factor \(\gamma\), i.e. how an optimal policy for a Risk-Sensitive SSP-MDP can go from being a risk-prone policy to a risk-averse one, depending on the discount factor.
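The abstract does not spell out the paper's operators, so the sketch below only illustrates the general shape of an undiscounted, risk-sensitive policy-iteration loop for an SSP-MDP with one commonly used piecewise-linear transform of the temporal difference (in the style of Mihatsch and Neuneier). The transform chi, its sign convention, the names evaluate/improve/policy_iteration, and the damping step alpha are all illustrative assumptions, not the definitions used in the paper.

```python
import numpy as np

def chi(x, k):
    """Assumed piecewise-linear transform of the temporal difference.
    Since cost is minimized, x > 0 means "worse than the current estimate":
    it is weighted by (1 + k), better outcomes by (1 - k), with k in (-1, 1).
    k > 0 -> risk-averse, k < 0 -> risk-prone, k = 0 -> risk-neutral."""
    return np.where(x > 0, (1.0 + k) * x, (1.0 - k) * x)

def evaluate(P, C, goal, policy, k, alpha=1.0, tol=1e-8, max_iter=100000):
    """Undiscounted risk-sensitive policy evaluation for an SSP-MDP.
    P[a] is the (n x n) transition matrix of action a, C[s, a] its cost.
    Iterates V(s) <- V(s) + alpha * E[chi(c + V(s') - V(s))] until the
    expected transformed temporal difference vanishes (alpha < 1 may be
    needed for extreme values of k)."""
    n = P[0].shape[0]
    V = np.zeros(n)
    for _ in range(max_iter):
        V_new = V.copy()
        for s in range(n):
            if s == goal:                      # absorbing goal state keeps value 0
                continue
            a = policy[s]
            td = C[s, a] + V - V[s]            # one entry per successor state s'
            V_new[s] = V[s] + alpha * (P[a][s] @ chi(td, k))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def improve(P, C, goal, V, k):
    """Greedy step: in each state, pick the action with the smallest
    expected transformed temporal difference."""
    n, nA = P[0].shape[0], len(P)
    policy = np.zeros(n, dtype=int)
    for s in range(n):
        if s == goal:
            continue
        q = [P[a][s] @ chi(C[s, a] + V - V[s], k) for a in range(nA)]
        policy[s] = int(np.argmin(q))
    return policy

def policy_iteration(P, C, goal, k):
    """Alternates evaluation and improvement until the policy is stable."""
    n = P[0].shape[0]
    policy = np.zeros(n, dtype=int)
    while True:
        V = evaluate(P, C, goal, policy, k)
        new_policy = improve(P, C, goal, V, k)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

With k = 0 the transform is the identity and the evaluation step reduces to the ordinary undiscounted Bellman equation for the policy; nonzero k skews the fixed point toward pessimistic (k > 0) or optimistic (k < 0) estimates of the cost-to-goal, which is the kind of risk-attitude modeling the abstract describes, here without any discount factor.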