Enhancing upper confidence bounds for trees with temporal difference values

2014 
Upper confidence bounds for trees (UCT) is one of the most popular and generally effective Monte Carlo tree search (MCTS) algorithms. However, in practice it is relatively weak when not aided by additional enhancements. Improving its performance without reducing generality is a current research challenge. We introduce a new domain-independent UCT enhancement based on the theory of reinforcement learning. Our approach estimates state values in the UCT tree by employing temporal difference (TD) learning, which is known to outperform plain Monte Carlo sampling in certain domains. We present three adaptations of the TD(λ) algorithm to the UCT's tree policy and backpropagation step. Evaluations on four games (Gomoku, Hex, Connect Four, and Tic Tac Toe) reveal that our approach increases UCT's level of play comparably to the rapid action value estimation (RAVE) enhancement. Furthermore, it proves highly compatible with a modified all moves as first heuristic, where it considerably outperforms RAVE. The findings suggest that integration of TD learning into MCTS deserves further research, which may form a new class of MCTS enhancements.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    18
    References
    5
    Citations
    NaN
    KQI
    []