Nearly Minimax Optimal Reinforcement Learning for Linear Mixture MDPs

Dongruo Zhou; Quanquan Gu; Csaba Szepesvari

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture MDPs

Dongruo Zhou , Quanquan Gu , Csaba Szepesvari

[Proceedings link] [PDF]

Session: Bandits, RL and Control 2 (B)

Session Chair: Sattar Vakili

Poster: Poster Session 3

Abstract

Abstract: We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. For the fixed-horizon episodic setting with inhomogeneous transition kernels, we propose a new, computationally efficient algorithm that uses the basis kernels to approximate value functions. We show that the new algorithm, which we call ${\text{UCRL-VTR}^{+}}$, attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the number of basis kernels, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that ${\text{UCRL-VTR}^{+}}$ is minimax optimal up to logarithmic factors. At the core of our results are (1) a weighted least squares estimator for the unknown transitional probability; and (2) a new Bernstein-type concentration inequality for self-normalized vector-valued martingales with bounded increments. Together, these new tools enable tight control of the Bellman error and lead to a nearly minimax regret. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for RL with linear function approximation.

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture MDPs

Dongruo Zhou , Quanquan Gu , Csaba Szepesvari

Summary presentation

Full presentation

Discussion