Nearly Minimax Optimal Reinforcement Learning for Linear Mixture MDPs

Dongruo Zhou , Quanquan Gu , Csaba Szepesvari

[Proceedings link] [PDF]

Session: Bandits, RL and Control 2 (B)

Session Chair: Sattar Vakili

Poster: Poster Session 3

Abstract: We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. For the fixed-horizon episodic setting with inhomogeneous transition kernels, we propose a new, computationally efficient algorithm that uses the basis kernels to approximate value functions. We show that the new algorithm, which we call ${\text{UCRL-VTR}^{+}}$, attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the number of basis kernels, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that ${\text{UCRL-VTR}^{+}}$ is minimax optimal up to logarithmic factors. At the core of our results are (1) a weighted least squares estimator for the unknown transitional probability; and (2) a new Bernstein-type concentration inequality for self-normalized vector-valued martingales with bounded increments. Together, these new tools enable tight control of the Bellman error and lead to a nearly minimax regret. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for RL with linear function approximation.

Summary presentation

Full presentation