Abstract: This work considers the sample and computational complexity of obtaining an $\varepsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state when given any state-action pair as input. We are interested in a basic and unresolved question in model-based planning: is this na\"ive ``plug-in'' approach --- where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP --- non-asymptotically minimax optimal? Our main result answers this question positively. With regard to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results, we show that using \emph{any} high-accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel ``absorbing MDP'' construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.
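As a concrete illustration (not taken from the paper), the sketch below shows the plug-in approach in a minimal Python form, under simplifying assumptions: a tabular MDP with known rewards, an oracle `sample_next_state(s, a)` standing in for the generative model, and plain value iteration standing in for the black-box planning oracle. All names and parameters are hypothetical.

```python
import numpy as np

def plug_in_plan(sample_next_state, reward, gamma, n_samples, n_iters=2000):
    """Plug-in (certainty-equivalence) planning with a generative model.

    sample_next_state(s, a) -> int : oracle returning one sample s' ~ P(.|s, a)
    reward                         : (S, A) array of known rewards
    gamma                          : discount factor in [0, 1)
    n_samples                      : oracle calls per state-action pair
    """
    S, A = reward.shape

    # Step 1: maximum-likelihood (empirical) transition model from samples.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= n_samples

    # Step 2: plan in the empirical MDP. Value iteration is used here for
    # concreteness; any high-accuracy black-box planner would serve.
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)               # greedy state values
        Q = reward + gamma * P_hat @ V  # Bellman optimality backup in the empirical MDP
    return Q.argmax(axis=1)             # greedy deterministic policy

# Example usage with a toy 2-state, 2-action MDP standing in for the oracle:
# true_P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])
# oracle = lambda s, a: np.random.choice(2, p=true_P[s, a])
# policy = plug_in_plan(oracle, reward=np.array([[1.0, 0.0], [0.0, 1.0]]),
#                       gamma=0.9, n_samples=500)
```

The value-iteration loop is only one choice of planner; as the abstract states, any sufficiently accurate black-box planning oracle applied to the empirical model suffices for the minimax error rate.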
