# For authors

Awards Accepted papers Call for papers Submission instructions Call for open problems

## AWARDS

### Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

Yuanzhi Li, Tengyu Ma and Hongyang Zhang.
We show that the (stochastic) gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $$\tilde{O}(dr^{2})$$ random linear measurements of a rank $$r$$ positive semidefinite matrix $$X^*$$, we can recover $$X^*$$ by parameterizing it by $$UU^\top$$ with $$U \in \mathbb{R}^{d\times d}$$ and minimizing the squared loss, even if $$r$$ is much less than $$d$$. We prove that starting from a small initialization, gradient descent recovers $$X^*$$ in $$\tilde{O}(\sqrt{r})$$ iterations approximately. The results solve the conjecture of Gunasekar et al. 17 under the restricted isometry property. The technique can be applied to analyzing neural networks with quadratic activations with some technical modifications.

### Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure

Matthew Brennan, Guy Bresler and Wasim Huleihel.
Recently, research in unsupervised learning has gravitated towards exploring statistical-computational gaps induced by sparsity. A recent line of work initiated in \cite{berthet2013complexity} has aimed to explain these gaps through reductions to conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the development of techniques and often led to weaker hardness results that only apply to algorithms robust to different noise distributions or that do not need to know the parameters of the problem. We introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. Our new lower bounds include:
1. Planted Independent Set: We show tight lower bounds for recovering a planted independent set of size $$k$$ in a sparse Erd\H{o}s-R\'{e}nyi graph of size $$n$$ with edge density $$\tilde{\Theta}(n^{-\alpha})$$.
2. Planted Dense Subgraph: If $$p > q$$ are the edge densities inside and outside of the community, we show the first lower bounds for the general regime $$q = \tilde{\Theta}(n^{-\alpha})$$ and $$p - q = \tilde{\Theta}(n^{-\gamma})$$ where $$\gamma \ge \alpha$$, matching the lower bounds predicted in \cite{chen2016statistical}. Our lower bounds apply to a deterministic community size $$k$$, resolving a question raised in \cite{hajek2015computational}.
3. Biclustering: We show strong lower bounds for Gaussian biclustering as a simple hypothesis testing problem to detect a uniformly at random planted flat $$k \times k$$ submatrix.
4. Sparse Rank-1 Submatrix: We show that sparse rank-1 submatrix detection is often harder than biclustering, and obtain two different tight lower bounds for these problems with different reductions from planted clique.
5. Sparse PCA: We give a reduction between rank-1 submatrix and sparse PCA to obtain tight lower bounds in the less sparse regime $$k \gg \sqrt{n}$$, when the spectral algorithm is optimal over the natural SDP. This yields the first tight characterization of a computational barrier for sparse PCA over an entire parameter regime. We also give an alternate reduction recovering the lower bounds of \cite{berthet2013complexity} and \cite{gao2017sparse} in the canonical simple hypothesis testing variant of sparse PCA, the spiked covariance model.
6. New Models: We demonstrate a subtlety in the complexity of sparse PCA and planted dense subgraph by introducing two variants of these problems, biased sparse PCA and planted stochastic block model, and showing that they have different hard regimes than the originals.
Our results demonstrate that, despite the delicate nature of average-case reductions, using natural problems as intermediates can often be beneficial, as is the case for reductions between deterministic problems. Our main technical contribution is to introduce a set of cloning techniques that maintain the level of signal in an instance of a problem while increasing the size of its planted structure. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we introduce.

### Logistic Regression: The Importance of Being Improper

Dylan Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri and Karthik Sridharan.
Learning linear predictors with the logistic loss --- both in stochastic and online settings ---is a fundamental task in learning and statistics, with direct connections to classification and boosting. Existing "fast rates" for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logisic loss is $$1$$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of mcmahan2012open when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We show that the improved dependency on predictor norm is also near-optimal. Leveraging this improved dependency on the predictor norm also yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $$\tilde{O}(\sqrt{n})$$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parameteric or even nonparametric classes.

## ACCEPTED PAPERS

### Actively Avoiding Nonsense in Generative Models

Steve Hanneke, Adam Kalai, Gautam Kamath and Christos Tzamos.
A generative model may generate utter nonsense when it is fit to maximize the likelihood of observed data. This happens due to model error,'' i.e., when the true data generating distribution does not fit within the class of generative models being learned. To address this, we propose a model of active distribution learning using a binary invalidity oracle that identifies some examples as clearly invalid, together with random positive examples sampled from the true distribution. The goal is to maximize the likelihood of the positive examples subject to the constraint of (almost) never generating examples labeled invalid by the oracle. Guarantees are agnostic compared to a class of probability distributions. We show that, while proper learning often requires exponentially many queries to the invalidity oracle, improper distribution learning can be done using polynomially many queries.

### A Faster Approximation Algorithm for the Gibbs Partition Function

We consider the problem of estimating the partition function $$Z(\beta)=\sum_x \exp(-\beta H(x))$$ of a Gibbs distribution with a Hamilton $$H(\cdot)$$, or more precisely the logarithm of the ratio $$q=\ln Z(0)/Z(\beta)$$. It has been recently shown how to approximate $$q$$ with high probability assuming the existence of an oracle that produces samples from the Gibbs distribution for a given parameter value in $$[0,\beta]$$. The current best known approach due to Huber (2015) uses $$O(q\ln n\cdot[\ln q + \ln \ln n+\varepsilon^{-2}])$$ oracle calls on average where $$\varepsilon$$ is the desired accuracy of approximation and $$H(\cdot)$$ is assumed to lie in $$\{0\}\cup[1,n]$$. We improve the complexity to $$O(q\ln n\cdot\varepsilon^{-2})$$ oracle calls. We also show that the same complexity can be achieved if exact oracles are replaced with approximate sampling oracles that are within $$O(\frac{\varepsilon^2}{q\ln n})$$ variation distance from exact oracles.Finally, we prove a lower bound of $$\Omega(q\cdot \varepsilon^{-2})$$ oracle calls under a natural model of computation.

### Exponential convergence of testing error for stochastic gradient methods

Loucas Pillaud-Vivien, Alessandro Rudi and Francis Bach.
We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed. To achieve these rates of convergence we show sharper high-probability bounds with respect to the number of observations for stochastic gradient descent.

### Size-Independent Sample Complexity of Neural Networks

Noah Golowich, Alexander Rakhlin and Ohad Shamir.
We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.

### Underdamped Langevin MCMC: A non-asymptotic analysis

Xiang Cheng, Niladri S. Chatterji, Peter Bartlett and Michael Jordan.
We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves $$\varepsilon$$ error (in 2-Wasserstein distance) in $$\mathcal{O}(\sqrt{d}/\varepsilon)$$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is $$\mathcal{O}(d/\varepsilon^2)$$ steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.

### Online Variance Reduction for Stochastic Optimization

Zalán Borsos, Andreas Krause and Kfir Y. Levy.
Modern stochastic optimization methods often rely on uniform sampling which is agnostic to the underlying characteristics of the data. This might degrade the convergence by yielding estimates that suffer from a high variance. A possible remedy is to employ non-uniform importance sampling techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting which poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling datapoints, it naturally extends to selecting coordinates or even blocks of thereof. Empirical validations underline the benefits of our method in several settings.

### Information Directed Sampling and Bandits with Heteroscedastic Noise

Johannes Kirschner and Andreas Krause.
In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy function evaluations. Typically, the observation noise is assumed to be independent of the evaluation and to satisfy a tail bound uniformly on the domain, which is a restrictive assumption for many applications. In this work, we consider the setting of heteroscedastic noise, where we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret framework, that is similar to the Bayesian analysis of Russo and Van Roy (2014), and we prove a new high-probability regret bound for general, possibly randomized policies, which depends on a quantity we refer to as regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) to minimize a surrogate of the regret-information ratio over all possible action sampling distributions. In order to construct the surrogate function, we generalize known concentration inequalities for online least squares regression in separable Hilbert spaces to the case of heteroscedastic noise. This allows us to formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding novel algorithms for Bayesian optimization. We also prove frequentist regret bounds, which in the homoscedastic case are comparable to known bounds for UCB, but can be much better when the noise is heteroscedastic. Empirically, we demonstrate in a linear setting, that some of our methods can outperform UCB and Thompson Sampling, even when the noise is homoscedastic.

### Testing Symmetric Markov Chains From a Single Trajectory

Constantinos Daskalakis, Nishanth Dikkala and Nick Gravin.
Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a single trajectory of a Markov Chain. In particular, we observe a single trajectory $$X_0 ,\ldots,X_t ,\ldots$$ of an unknown, symmetric, and finite state Markov Chain M. We do not control the starting state $$X_0$$, and we cannot restart the chain. Given our single trajectory, the goal is to test whether M is identical to a model Markov Chain $$M_0$$, or far from it under an appropriate notion of difference. We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos [Kaz78], which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain $$M_0$$ is $$O(n)$$ in the size of the state space $$n$$.

### Detection limits in the high-dimensional spiked rectangular model

Ahmed El Alaoui and Michael Jordan.
We study the problem of detecting the presence of a single unknown spike in a rectangular data matrix, in a high-dimensional regime where the spike has fixed strength and the aspect ratio of the matrix converges to a finite limit. This situation comprises Johnstone's spiked covariance model. We analyze the likelihood ratio of the spiked model against an all noise" null model of reference, and show it has asymptotically Gaussian fluctuations in a region below---but in general not up to---the so-called BBP threshold from random matrix theory. Our result parallels earlier findings of Onatski et al.\ (2013) and Johnstone-Onatski (2015) for spherical spikes. We present a probabilistic approach capable of treating generic product priors. In particular, sparsity in the spike is allowed. Our approach operates through the general principle of the cavity method from spin-glass theory. The question of the maximal parameter region where asymptotic normality is expected to hold is left open. This region, not necessarily given by BBP, is shaped by the prior in a non-trivial way. We conjecture that this is the entire paramagnetic phase of an associated spin-glass model, and is defined by the vanishing of the replica-symmetric solution of Lesieur et al. (2015).

### Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification

Max Simchowitz, Horia Mania, Stephen Tu, Michael Jordan and Benjamin Recht.
We prove that the ordinary least-squares (OLS) estimator attains nearly minimax optimal performance for the identification of linear dynamical systems from a single observed trajectory. Our upper bound analysis relies on a generalization of Mendelson’s small-ball method to dependent data, eschewing the use of standard mixing-time arguments. Our lower bounds reveal that these upper bounds match up to logarithmic factors. In particular, we capture the correct signal-to-noise behavior of the problem, showing that more unstable linear systems are easier to estimate. This behavior is qualitatively different from arguments which rely on mixing-time calculations that suggest that unstable systems are more difficult to estimate. We generalize our technique to provide bounds for a more general class of linear response time-series.

### Active Tolerant Testing

Avrim Blum and Lunjia Hu.
In this work, we give the first algorithms for tolerant testing of nontrivial classes in the active model: estimating the distance of a target function to a hypothesis class C with respect to some arbitrary distribution D, using only a small number of label queries to a polynomial-sized pool of unlabeled examples drawn from D. Specifically, we show that for the class C of unions of d intervals on the line, we can estimate the error rate of the best hypothesis in the class to an additive error epsilon from only $$O(\frac{1}{\varepsilon^6}\log \frac{1}{\varepsilon})$$ label queries to an unlabeled pool of size $$O(\frac{d}{\varepsilon^2}\log \frac{1}{\varepsilon})$$. The key point here is the number of labels needed is independent of the VC-dimension of the class. This extends the work of Balcan et al. (2012) who solved the non-tolerant testing problem for this class (distinguishing the zero-error case from the case that the best hypothesis in the class has error greater than epsilon). We also consider the related problem of estimating the performance of a given learning algorithm A in this setting. That is, given a large pool of unlabeled examples drawn from distribution D, can we, from only a few label queries, estimate how well A would perform if the entire dataset were labeled? We focus on k-Nearest Neighbor style algorithms, and also show how our results can be applied to the problem of hyperparameter tuning (selecting the best value of k for the given learning problem).

### Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods

Yan Shuo Tan and Roman Vershynin.
The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $$E$$ in $$\mathbb{R}^n$$ so that data points projected onto $$E$$ follow a non-Gaussian distribution. Vempala and Xiao (2011) proposed a local search algorithm, and showed that it was able to estimate $$E$$ accurately with polynomial time and sample complexity, if the dimension of $$E$$ is treated as a constant and with the assumption that all one-dimensional marginals of the non-Gaussian distribution over $$E$$ have non-Gaussian moments. In this paper, we propose a simple spectral algorithm called \textsc{Reweighted PCA}, and prove that it possesses the same guarantee. The principle that underlies this approach is a new characterization of multivariate Gaussian distributions.

### Calibrating Noise to Variance in Adaptive Data Analysis

Vitaly Feldman and Thomas Steinke.
Datasets are often used multiple times and each successive analysis may depend on the outcome of previous analyses. Standard techniques for ensuring generalization and statistical validity do not account for this adaptive dependence. A recent line of work studies the challenges that arise from such adaptive data reuse by considering the problem of answering a sequence of queries'' about the data distribution where each query may depend arbitrarily on answers to previous queries. The strongest results obtained for this problem rely on differential privacy -- a strong notion of algorithmic stability with the important property that it composes'' well when data is reused. However the notion is rather strict, as it requires stability under replacement of an arbitrary data element. The simplest algorithm is to add Gaussian (or Laplace) noise to distort the empirical answers. However, analysing this technique using differential privacy yields suboptimal accuracy guarantees when the queries have low variance. Here we propose a relaxed notion of stability that also composes adaptively. We demonstrate that a simple and natural algorithm based on adding noise scaled to the standard deviation of the query provides our notion of stability. This implies an algorithm that can answer statistical queries about the dataset with substantially improved accuracy guarantees for low-variance queries. The only previous approach that provides such accuracy guarantees is based on a more involved differentially private median-of-means algorithm and its analysis exploits stronger group'' stability of the algorithm.

### Accelerating Stochastic Gradient Descent for Least Squares Regression

Prateek Jain, Sham Kakade, Rahul Kidambi, Praneeth Netrapalli and Aaron Sidford.
There is widespread sentiment that fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) are not effective for the purposes of stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical errors [Paige71, Proakis74, Polyak87, Greenbaum89, Roy and Shynk90, Sharma et al. 98, dAspremont08, Devolder et al. 13,14, Yuan et al. 16]. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes this conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.

### Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints

Wenlong Mou, Liwei Wang, Xiyu Zhai and Kai Zheng.
We study the generalization errors of non-convex regularized ERM procedures using Stochastic Gradient Langevin Dynamics (SGLD). Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian theory respectively. The stability-based theory obtains a bound of $$O(L/n*\sqrt{b*T_n})$$ where $$L$$ is Lipschitz parameter, $$b$$ is inverse temperature, and $$T_N$$ is the sum of step sizes. For PAC-Bayesian theory, though the bound has a slower $$O(1/\sqrt{n})$$ rate, the contribution of each step decays exponentially through time, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the optimization trajectory. Our bounds have reasonable dependence on aggregated step sizes, and do not explicitly depend on dimensions, norms or other capacity measures of parameter, which characterize how the algorithm itself controls the statistical learning behavior in non-convex problems.

### Optimal approximation of continuous functions by very deep ReLU networks

Dmitry Yarotsky.
We prove that deep ReLU neural networks with conventional fully-connected architectures with $$W$$ weights can approximate continuous $$\nu$$-variate functions $$f$$ with uniform error not exceeding $$a_\nu\omega_f(c_\nu W^{-2/\nu}),$$ where $$\omega_f$$ is the modulus of continuity of $$f$$ and $$a_\nu, c_\nu$$ are some $$\nu$$-dependent constants. This bound is tight. Our construction is inherently deep and nonlinear: the obtained approximation rate cannot be achieved by networks with fewer than $$\Omega(W/\ln W)$$ layers or by networks with weights continuously depending on $$f$$.

### Averaged Stochastic Gradient Descent on Riemannian Manifolds

Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach and Michael Jordan.
We consider the minimization of a function defined on a Riemannian manifold M only accessible through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on M to an averaged iterate sequence with a robust and fast O(1/n) convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we show how these ideas apply to the case of streaming k-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.

### Fitting a putative manifold to noisy data

Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas and Hariharan Narayanan.
In the present work, we give a solution to the following question from manifold learning. Suppose data belonging to a high dimensional Euclidean space is drawn independently, identically distributed from a measure supported on a low dimensional twice differentiable embedded manifold, and corrupted by a small amount of gaussian noise. How can we produce a manifold whose Hausdorff distance to the true manifold is small and whose reach is not much smaller than the reach of the true manifold?

### Private Sequential Learning

John Tsitsiklis, Kuang Xu and Zhi Xu.
We formulate a private learning model to study an intrinsic tradeoff between privacy and query complexity in sequential learning. Our model involves a learner who aims to determine a scalar value, $$v^*$$, by sequentially querying an external database with binary responses. In the meantime, an adversary observes the learner's queries, though not the responses, and tries to infer from them the value of $$v^*$$. The objective of the learner is to obtain an accurate estimate of $$v^*$$ using only a small number of queries, while simultaneously protecting her privacy by making $$v^*$$ provably difficult to learn for the adversary. Our main results provide tight upper and lower bounds on the learner's query complexity as a function of desired levels of privacy and estimation accuracy. We also construct explicit query strategies whose complexity is optimal up to an additive constant.

### Optimal Errors and Phase Transitions in High-Dimensional Generalized Linear Models

Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane and Lenka Zdeborová
Generalized linear models (GLMs) arise in high-dimensional machine learning, statistics, communications and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes or benchmarks models in neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal inference and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and dimensions are large and their ratio is fixed. Non-rigorous predictions for the optimal inference and generalization errors existed for special cases of GLMs, e.g. for the perceptron in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance, and locate the associated sharp phase transitions separating learnable and non-learnable regions

### Exact and Robust Conformal Inference Methods for Predictive Machine Learning With Dependent Data

Victor Chernozhukov, Kaspar Wuthrich and Yinchu Zhu.
We extend conformal inference to general settings that allow for time series data. Our proposal is developed as a randomization method and accounts for potential serial dependence by including block structures in the permutation scheme. As a result, the proposed method retains the exact, model-free validity when the data are i.i.d. or more generally exchangeable, similar to usual conformal inference methods. When exchangeability fails, as is the case for common time series data, the proposed approach is approximately valid under weak assumptions on the conformity score.

### Nonstochastic Bandits with Composite Anonymous Feedback

Nicolò Cesa-Bianchi, Claudio Gentile and Yishay Mansour.
We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.

### Lower Bounds for Higher-Order Convex Optimization

State-of-the-art methods in mathematical optimization employ higher-order derivative information. We explore the limitations of higher-order optimization and prove that even for convex optimization, a polynomial dependence on the approximation guarantee and higher-order smoothness parameters is necessary. This refutes the hope that higher-order smoothness and higher-order derivatives can lead to dimension free polynomial time algorithms for convex optimization. As a special case, we show Nesterov's accelerated cubic regularization method and higher-order methods to be nearly tight.

### Log-concave sampling: Metropolis-Hastings algorithms are fast!

Raaz Dwivedi, Yuansi Chen, Martin Wainwright and Bin Yu.
We consider the problem of sampling from a strongly log-concave density in $$\mathbb{R}^d$$, and prove a non- asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by running a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step to ensure the correct stationary distribution. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds reveal that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most δ for a density with condition number κ, we show that MALA requires $$O(\kappa d \log(1/\delta))$$ steps, as compared to the $$O(\kappa^2d/\delta^2)$$ steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for a zeroth-order method Metropolized random walk (MRW) and show that it mixes $$O(\kappa d)$$ slower than MALA. We provide numerical examples that support our theoretical findings, and demonstrate the potential gains of Metropolis-Hastings adjustment for Langevin-type algorithms.

### Incentivizing Exploration by Heterogeneous Users

Bangrui Chen, Peter Frazier and David Kempe.
We consider the problem of incentivizing exploration with heterogeneous agents. In this problem, bandit arms provide vector-valued outcomes equal to an unknown arm-specific attribute vector, perturbed by independent normally distributed noise. Agents arrive sequentially and choose arms to pull based on their own private and heterogeneous linear utility functions over attributes and the estimates of the arms' attribute vectors derived from observations of other agents' past pulls. Agents are myopic and selfish and thus would choose the arm with maximum estimated utility. A principal, knowing only the distribution from which agents' preferences are drawn, but not the specific draws, can offer incentive payments for pulling specific arms in order to encourage agents to explore underplayed arms. The principal seeks to minimize the total expected cumulative regret incurred by agents relative to their best arms, while also making a small expected cumulative payment. We propose an algorithm that incentivizes arms played infrequently in the past whose probability of being played in the next round would be small without incentives. Under the assumption that each arm is preferred by at least a fraction $$p>0$$ of agents, we show that this algorithm achieves expected cumulative regret of $$O(N\exp(2/p) + N \log^3(T))$$, using expected cumulative payments of $$O(N^2\exp(2/p))$$. If $$p$$ is known or the distribution over agent preferences is discrete, the exponential term $$\exp(2/p)$$ can be replaced with suitable polynomials in $$N$$ and $$1/p$$. For discrete preferences, the regret dependence on $$T$$ can be eliminated entirely, giving constant (depending only polynomially on $$N$$ and $$1/p$$ expected regret and payments. This constant regret stands in contrast to the $$\Theta(\log(T))$$ dependence of regret in standard multi-armed bandit problems. It arises because even unobserved heterogeneity in agent preferences allows exploitation of arms to also explore arms fully; succinctly, heterogeneity provides free exploration.

### Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms

Ilias Diakonikolas, Jerry Li and Ludwig Schmidt.
Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms We study the problem of robustly learning multi-dimensional histograms. A $$d$$-dimensional function $$h: D \to \mathbb{R}$$ is called a $$k$$-histogram if there exists a partition of the domain $$D \subseteq \mathbb{R}^d$$ into $$k$$ axis-aligned rectangles such that $$h$$ is constant within each such rectangle. Let $$f: D \to \mathbb{R}$$ be an arbitrary $$d$$-dimensional probability density function, and suppose that $$f$$ is $$\mathrm{OPT}$$-close, in $$L_1$$-distance, to an unknown $$k$$-histogram (with unknown partition). Our goal is to output a hypothesis that is $$O(\mathrm{OPT}) + \varepsilon$$ close to $$f$$, in $$L_1$$-distance. We give an efficient algorithm for this learning problem that, for any fixed dimension, uses near-optimal sample complexity (up to logarithmic factors), and runs in sample near-linear time. Prior to our work, the case of $$d=1$$ was well-understood, but vast gaps in our understanding remained even for $$d=2$$.

### Time-Space Tradeoffs for Learning Finite Functions from Random Tests, with Applications to Polynomials

Paul Beame, Shayan Oveis Gharan and Xin Yang.
We develop an extension of recent analytic methods for obtaining time-space tradeoff lower bounds for problems of learning from uniform random test samples. With our methods we can obtain bounds for learning finite learning functions from test even when the space of tests is significantly smaller than the space of inputs and the feedback from each test can come from an arbitrary finite set. At the core of our results, we reduce the time-space complexity of learning from random tests to the question of how much the corresponding evaluation matrix amplifies the 2-norms of almost uniform'' probability distributions. To analyze the latter, we formulate it as a semidefinite program, and we analyze its dual. In order to handle multivalued feedback from tests, we apply this norm amplification analysis to complex matrices. As applications that follow from our new techniques, we show that any algorithm that learns $$m$$-variate polynomial functions of degree at most d over $$F_2$$ with success at least $$2^{-O(m)}$$ from evaluations on randomly chosen inputs either requires space $$\Omega(mn/d)$$ or $$2^{\Omega(m/d)}$$ time where $$n=(m/d)^{\Theta(d)}$$ is the dimension of the space of such polynomials. These bounds are asymptotically optimal for polynomials of arbitrary constant degree since they match the tradeoffs achieved by natural learning algorithms for the problems. We also extend these results to learning linear and quadratic polynomials over any odd prime field $$F_p$$ where we show that $$\Omega(mn\log_2 p)$$ space or time $$p^{\Omega(m)}$$ is required. To derive our bounds for learning degree d polynomials over finite fields, we show that analyzing the dual of the semidefinite program corresponds to studying the distribution of the bias of all degree $$d$$ polynomials with respect to uniformly random inputs. For example, in the case of polynomials over $$F_2$$ the distribution of bias corresponds to the weight distribution of Reed-Muller codes over $$F_2$$.

### Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability

Belinda Tzen, Tengyuan Liang and Maxim Raginsky.
We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz. For a particular local optimum of the empirical risk, with an arbitrary initialization, we show that, with high probability, one of the two mutually exclusive events will occur: either the Langevin trajectory ends up somewhere outside the $$\varepsilon$$-neighborhood of this particular optimum within a short recurrence time; or it enters this $$\varepsilon$$-neighborhood by the recurrence time and stays there until an exponentially long escape time. We call this phenomenon empirical metastability. This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the recurrence time is dimension-independent, and resembles the convergence time of deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.

### Hardness of Learning Noisy Halfspaces using Polynomial Thresholds

Arnab Bhattacharyya, Suprovat Ghoshal and Rishi Saket.
We prove the hardness of weakly learning halfspaces in the presence of adversarial noise using polynomial threshold functions (PTFs). In particular, we prove that for any constants $$d$$ in $${Z}^+$$ and $$\varepsilon > 0$$, it is NP-hard to decide: given a set of $$\{-1,1\}$$-labeled points in $${R}^n$$ whether (YES Case) there exists a halfspace that classifies (1 - eps)-fraction of the points correctly, or (NO Case) any degree-d PTF classifies at most (1/2 + eps)-fraction of the points correctly. This strengthens to all constant degrees the previous NP-hardness of learning using degree-2 PTFs shown by Diakonikolas et al. (2011). The latter result had remained the only progress over the works of Feldman et al. (2006) and Guruswami et al. (2006) ruling out weakly proper learning adversarially noisy halfspaces.

### Best of Both Worlds: Stochastic & Adversarial Best-Arm Identification

Yasin Abbasi-Yadkori, Peter Bartlett, Victor Gabillon, Alan Malek and Michal Valko.
We study a version of the bandit best-arm identification problem with potentially adversarial rewards. A simple random uniform strategy obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is sub-optimal when the rewards are sampled stochastically. Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such learner is impossible in general: to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We show a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial problems.

### Learning Patterns for Detection with Multiscale Scan Statistics

James Sharpnack.
This paper addresses detecting anomalous patterns in images, time-series, and tensor data when the location and scale of the pattern is unknown a priori. The multiscale scan statistic convolves the proposed pattern with the image at various scales and returns the maximum of the resulting tensor. Scale corrected multiscale scan statistics apply different standardizations at each scale, and the limiting distribution under the null hypothesis---that the data is only noise---is known for smooth patterns. We consider the problem of simultaneously learning and detecting the anomalous pattern from a dictionary of smooth patterns and a database of many tensors. To this end, we show that the multiscale scan statistic is a subexponential random variable, and prove a chaining lemma for standardized suprema, which may be of independent interest. Then by averaging the statistics over the database of tensors we can learn the pattern and obtain Bernstein-type error bounds. We will also provide a construction of an $$\varepsilon$$-net of the location and scale parameters, providing a computationally tractable approximation with similar error bounds.

### Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular we consider two models, one in which the task is to invert a generative neural network given access to its last layer and another in which the task is to invert a generative neural network given only compressive linear observations of its last layer. We establish that in both cases, in suitable regimes of network layer sizes and a randomness assumption on the network weights, that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points. That is, we establish that with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. Hence, there are no local minima, saddle points, or other stationary points outside these neighborhoods. These results constitute the first theoretical guarantees which establish the favorable global geometry of these non-convex optimization problems, and they bridge the gap between the empirical success of enforcing deep generative priors and a rigorous understanding of non-linear inverse problems.

### Small-loss bounds for online learning with partial information

Thodoris Lykouris, Karthik Sridharan and Eva Tardos.
We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems where the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so called "small-loss'' $$o(\alpha L^{\star})$$ regret bounds with high probability, where $$\alpha$$ is the independence number of the graph, and $$L^{\star}$$ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e. utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications such as semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), as well as learning with slowly changing (shifting) comparators. In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $$\widetilde{O}(\sqrt{dL^{\star}})$$ for actual regret, where $$d$$ is the number of actions, answering open questions of Neu. Previous bounds for bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal $$\widetilde{O}(\sqrt{\kappa L^{\star}})$$ regret guarantee for fixed feedback graphs with clique-partition number at most $$\kappa$$.

### Empirical bounds for functions with weak interactions

Andreas Maurer and Massimiliano Pontil.
We provide sharp empirical estimates of expectation, variance and normal approximation for a class of statistics whose variation in any argument does not change too much when another argument is modified. Examples of such weak interactions are furnished by $$U$$- and $$V$$-statistics, Lipschitz $$L$$- statistics and various error functionals of $$L_2$$-regularized algorithms and Gibbs algorithms.

### Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression

Shiva Kasiviswanathan and Mark Rudelson.
High-dimensional settings, where the data dimension (d) far exceeds the number of observations ($$n$$), are common in many statistical and machine learning applications. Methods based on $$\ell_1$$-relaxation, such as Lasso, are very popular for sparse recovery in these settings. Restricted Eigenvalue (RE) condition is among the weakest, and hence the most general, condition in literature imposed on the Gram matrix that guarantees nice statistical properties for the Lasso estimator. It is natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $$X \in \mathbb{R}^{n \times d}$$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $$X \Phi^\top \Phi$$, where $$\Phi \in \mathbb{R}^{m \times d}$$ is a subgaussian random matrix, with high probability, satisfies the RE condition. This construction allows incorporating a fixed matrix that has an easily verifiable condition into the design process, and allows for generation of compressed design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $$X$$.

Chi Jin, Praneeth Netrapalli and Michael Jordan.
Nesterov's accelerated gradient descent (AGD), an instance of the general family of momentum methods,'' provably achieves faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern nonconvex applications, including training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov's AGD, and shows that it escapes saddle points and finds a second-order stationary point in $$\tilde{O}(1/\varepsilon^{7/4})$$ iterations, matching the best known convergence rate, which is faster than the $$\tilde{O}(1/\varepsilon^{2})$$ iterations required by GD. To the best of our knowledge, this is the first direct acceleration (single-loop) algorithm that is provably faster than GD in general nonconvex setting---all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases on each step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

### Convex Optimization with Unbounded Nonconvex Oracles Using Simulated Annealing

Oren Mangoubi and Nisheeth Vishnoi.
We consider the problem of minimizing a convex objective function $$F$$ when one can only evaluate its noisy approximation $$\hat{F}$$. Unless one assumes some structure on the noise, $$\hat{F}$$ may be an arbitrary nonconvex function, making the task of minimizing $$F$$ intractable. To overcome this, prior work has often focused on the case when $$F(x)-\hat{F}(x)$$ is uniformly-bounded. In this paper we study the more general case when the noise has magnitude $$\alpha F(x) + \beta$$ for some $$\alpha, \beta > 0$$, and present a polynomial time algorithm that finds an approximate minimizer of $$F$$ for this noise model. Previously, Markov chains, such as the stochastic gradient Langevin dynamics, have been used to arrive at approximate solutions to these optimization problems. However, for the noise model considered in this paper, no single temperature allows such a Markov chain to both mix quickly and concentrate near the global minimizer. We bypass this by combining simulated annealing" with the stochastic gradient Langevin dynamics, and gradually decreasing the temperature of the chain in order to approach the global minimizer. As a corollary one can approximately minimize a nonconvex function that is close to a convex function; however, the closeness can deteriorate as one moves away from the optimum.

### Learning Mixtures of Linear Regressions with Nearly Optimal Complexity

Yuanzhi Li and Yingyu Liang.
Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of the several unknown linear regression components, where the identity of the generated component is also unknown. Previous works either assume strong assumptions on the data distribution or have high complexity. This paper proposes a fixed parameter tractable algorithm for the problem under general conditions, which achieves global convergence and the sample complexity scales nearly linearly in the dimension. In particular, different from previous works that require the data to be from the standard Gaussian, the algorithm allows the data from Gaussians with different covariances. When the conditional number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $$N = \tilde{O}(d)$$ as well as nearly optimal computational complexity $$\tilde{O}(Nd)$$, where $$d$$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting.

### Detecting Correlations with Little Memory and Communication

We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise correlations with optimal sample complexity, the number of required memory/communication bits is at least quadratic in the dimension. Our results substantially improve those of \cite{shamir2014fundamental}, which studied a similar question in a much more restricted setting. To the best of our knowledge, these are the first provable sample/memory/communication trade-offs for a practical estimation problem, using standard distributions, and in the natural regime where the memory/communication budget is larger than the size of a single data point. To derive our theorems, we prove a new information-theoretic result, which may be relevant for studying other information-constrained learning problems.

### Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning

Gal Dalal, Balazs Szorenyi, Gugan Thoppe and Shie Mannor.
Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as lock-in probability''. We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.

### Near-Optimal Sample Complexity Bounds for Maximum Likelihood Estimation of Multivariate Log-concave Densities

Timothy Carpenter, Ilias Diakonikolas, Anastasios Sidiropoulos and Alistair Stewart.
We study the problem of learning multivariate log-concave densities with respect to a global loss function. We obtain the \emph{first} upper bound on the sample complexity of the maximum likelihood estimator (MLE) for a log-concave density on $$\mathbb{R}^d$$, for all $$d \geq 4$$. Prior to this work, no finite sample upper bound was known for this estimator in more than $$3$$ dimensions. In more detail, we prove that for any $$d \geq 1$$ and $$\varepsilon>0$$, given $$\widetilde{O}_d((1/\varepsilon)^{(d+3)/2})$$ samples drawn from an unknown log-concave density $$f_0$$ on $$\mathbb{R}^d$$, the MLE outputs a hypothesis $$h$$ that with high probability is $$\varepsilon$$-close to $$f_0$$, in squared Hellinger loss. A sample complexity lower bound of $$\Omega_d((1/\varepsilon)^{(d+1)/2})$$ was previously known for any learning algorithm that achieves this guarantee. We thus establish that the sample complexity of the log-concave MLE is near-optimal, up to an $$\tilde{O}(1/\varepsilon)$$ factor.

Chen-Yu Wei and Haipeng Luo.
We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss \emph{and} logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.

### Efficient Convex Optimization with Membership Oracles

Yin Tat Lee, Aaron Sidford and Santosh Vempala.
We consider the problem of minimizing a convex function over a convex set given access only to an evaluation oracle for the function and a membership oracle for the set. We give a simple algorithm which solves this problem with $$O~(n^2)$$ oracle calls and $$O~(n^3)$$ additional arithmetic operations. Using this result, we obtain more efficient reductions among the five basic oracles for convex sets and functions defined by Grötschel et al. (1988).

### A General Approach to Multi-Armed Bandits Under Risk Criteria

Asaf Cassel, Assaf Zeevi and Shie Mannor.
Different risk-related criteria have received recent interest in learning problems, where typically each case is treated in a customized manner. In this paper we provide a more systematic approach to analyzing such risk criteria within a stochastic multi-armed bandit (MAB) formulation. We identify a set of general conditions that yield a simple characterization of the oracle rule (which serves as the regret benchmark), and facilitate the design of upper confidence bound (UCB) learning policies. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.

### An Optimal Learning Algorithm for Online Unconstrained Submodular Maximization

Tim Roughgarden and Joshua Wang.
We consider a basic problem at the interface of two fundamental fields: submodular optimization and online learning. In the online unconstrained submodular maximization (online USM) problem, there is a universe $$[n]=\{1,2,\ldots,n\}$$ and a sequence of $$T$$ nonnegative (not necessarily monotone) submodular functions arrive over time. The goal is to design a computationally efficient online algorithm, which chooses a subset of $$[n]$$ at each time step as a function only of the past, such that the accumulated value of the chosen subsets is as close as possible to the maximum total value of a fixed subset in hindsight. Our main result is a polynomial-time no-$$\tfrac 12$$-regret algorithm for this problem, meaning that for every sequence of nonnegative submodular functions, the algorithm's expected total value is at least $$\tfrac{1}{2}$$ times that of the best subset in hindsight, up to an error term sublinear in $$T$$. The factor of $$\tfrac 12$$ cannot be improved upon by any polynomial-time online algorithm, unless $$NP = RP$$. Prior to our work, the best result known was that picking a subset uniformly at random in every time step achieves no $$\tfrac 14$$-regret. A byproduct of our techniques is an explicit subroutine for the two-experts problem that has an unusually strong regret guarantee: the total value of its choices is comparable to twice the total value of either expert on rounds it did not pick that expert. This subroutine may be of independent interest.

### The Mean-Field Approximation: Information Inequalities, Algorithms, and Complexity

Vishesh Jain, Frederic Koehler and Elchanan Mossel.
The mean field approximation to the Ising model is a canonical variational tool that is used for analysis and inference in Ising models. We provide a simple and optimal bound for the KL error of the mean field approximation for Ising models on general graphs, and extend it to higher order Markov random fields. Our bound improves on previous bounds obtained in work in the graph limit literature by Borgs, Chayes, Lov{\'a}sz, S{\'o}s, and Vesztergombi and another recent work by Basak and Mukherjee. Our bound is tight up to lower order terms. Building on the methods used to prove the bound, along with techniques from combinatorics and optimization, we study the algorithmic problem of estimating the (variational) free energy for Ising models and general Markov random fields. For a graph $$G$$ on $$n$$ vertices and interaction matrix $$J$$ with Frobenius norm $$\| J \|_F$$, we provide algorithms that approximate the free energy within an additive error of $$\varepsilon n \|J\|_F$$ in time $$\exp(poly(1/\varepsilon))$$. We also show that approximation within $$(n \|J\|_F)^{1-\delta}$$ is NP-hard for every $$\delta > 0$$. Finally, we provide more efficient approximation algorithms, which find the optimal mean field approximation, for ferromagnetic Ising models and for Ising models satisfying Dobrushin's condition.

### Approximation beats concentration? An approximation view on inference with smooth radial kernels

Mikhail Belkin.
Positive definite kernels and their associated Reproducing Kernel Hilbert Spaces provide a mathematically compelling and practically competitive framework for learning from data. In this paper we take the approximation theory point of view to explore various aspects of smooth kernels related to their inferential properties. We analyze eigenvalue decay of kernels operators and matrices, properties of eigenfunctions/eigenvectors and Fourier'' coefficients of functions in the kernel space restricted to a discrete set of data points. We also investigate the fitting capacity of kernels, giving explicit bounds on the fat shattering dimension of the balls in Reproducing Kernel Hilbert spaces. Interestingly, the same properties that make kernels very effective approximators for functions in their native'' kernel space, also limit their capacity to represent arbitrary functions. We discuss various implications, including those for gradient descent type methods. It is important to note that most of our bounds are measure independent. Moreover, at least in moderate dimension, the bounds for eigenvalues are much tighter than the bounds which can be obtained from the usual matrix concentration results. For example, we see that the eigenvalues of kernel matrices show nearly exponential decay with constants depending only on the kernel and the domain. We call this approximation beats concentration'' phenomenon as even when the data are sampled from a probability distribution, some of their aspects are better understood in terms of approximation theory.

### Non-Convex Matrix Completion Against a Semi-Random Adversary

Yu Cheng and Rong Ge.
Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies heavily on the assumption that every entry is observed with exactly the same probability $$p$$, which is not realistic in practice. In this paper, we investigate a more realistic semi-random model, where the probability of observing each entry is at least $$p$$. Even with this mild semi-random perturbation, we can construct counter-examples where existing non-convex algorithms get stuck in bad local optima. In light of the negative results, we propose a pre-processing step that tries to re-weight the semi-random input, so that it becomes similar'' to a random input. We give a nearly-linear time algorithm for this problem, and show that after our pre-processing, all the local minima of the non-convex objective can be used to approximately recover the underlying ground-truth matrix.

### The Vertex Sample Complexity of Free Energy is Polynomial

Vishesh Jain, Frederic Koehler and Elchanan Mossel.
The free energy is a key quantity which is associated to Markov random fields. Classical results in statistical physics show how, given an analytic formula of the free energy, it is possible to compute many key quantities associated with Markov random fields including quantities such as magnetization and the location of various phase transitions. Given a massive Markov random field on $$n$$ nodes, can a small sample from it provide a rough approximation to the free energy $$F_n = \log{Z_n}$$? Results in graph limit literature by Borgs, Chayes, Lov{\'a}sz, S{\'o}s, and Vesztergombi show that for Ising models on $$n$$ nodes and interactions of strength $$\Theta(1/n)$$, an $$\varepsilon$$ approximation to $$\log Z_n / n$$ can be achieved by sampling a randomly induced model on $$2^{O(1/\varepsilon^2)}$$ nodes. We show that the sampling complexity of this problem is \emph{polynomial in }$$1/\varepsilon$$. We further show a polynomial dependence on $$\varepsilon$$ cannot be avoided. Our results are very general as they apply to higher order Markov random fields. For Markov random fields of order $$r$$, we obtain an algorithm that achieves $$\varepsilon$$ approximation using a number of samples polynomial in $$r$$ and $$1/\varepsilon$$ and running time that is $$2^{O(1/\varepsilon^2)}$$ up to polynomial factors in $$r$$ and $$\varepsilon$$. For ferromagnetic Ising models, the running time is polynomial in $$1/\varepsilon$$. Our results are intimately connected to recent research on the regularity lemma and property testing, where the interest is in finding which properties can tested within $$\varepsilon$$ error in time polynomial in $$1/\varepsilon$$. In particular, our proofs build on results from a recent work by Alon, de la Vega, Kannan and Karpinski, who also introduced the notion of polynomial vertex sample complexity. Another critical ingredient of the proof is an effective bound by the authors of the paper relating the variational free energy and the free energy.

### Efficient Algorithms for Outlier-Robust Regression

Adam Klivans, Pravesh K Kothari and Raghu Meka.
We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn iid from distribution $$D$$ and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is _certifiably_ hypercontractive. This natural property is satisfied by many well-studied distributions such as Gaussians, affine transformations of isotropic strongly log-concave distributions and, the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method for parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the input corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets.

### Action-Constrained Markov Decision Processes With Kullback-Leibler Cost

Ana Busic and Sean Meyn.
This paper concerns computation of optimal policies in which the one-step cost function contains a term that models Kullback-Leibler divergence with respect to nominal dynamics. This technique was introduced by Todorov in 2007, where it was shown under general conditions that the solution to the average-cost optimality equations reduce to a simple eigenvector problem. Since then many authors have sought to apply this technique to control problems and models of bounded rationality in economics. A crucial assumption is that the input process is essentially unconstrained. For example, if the nominal dynamics include randomness from nature (e.g., the impact of wind on a moving vehicle), then the optimal control solution does not respect the exogenous nature of this disturbance. This paper introduces a technique to solve a more general class of action-constrained MDPs. The main idea is to solve an entire parameterized family of MDPs, in which the parameter is a scalar weighting the one-step cost or reward function. The approach is new and practical even in the original unconstrained formulation.

### Fundamental Limits of Weak Recovery with Applications to Phase Retrieval

Marco Mondelli and Andrea Montanari.
In phase retrieval we want to recover an unknown signal $$x\in\mathbb C^d$$ from $$n$$ quadratic measurements of the form $$y_i = ||^2+w_i$$ where $$a_i\in \mathbb C^d$$ are known sensing vectors and $$w_i$$ is measurement noise. We ask the following weak recovery question: what is the minimum number of measurements $$n$$ needed to produce an estimator $$\hat{x}(y)$$ that is positively correlated with the signal $$x$$? We consider the case of Gaussian vectors $$a_i$$. We prove that -- in the high-dimensional limit -- a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For $$n\le d-o(d)$$ no estimator can do significantly better than random and achieve a strictly positive correlation. For $$n\ge d+o(d)$$ a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well. Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li. Both the upper and lower bound generalize beyond phase retrieval to measurements $$y_i$$ produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm.

### Cutting plane methods can be extended into nonconvex optimization

Oliver Hinder.
We show that it is possible to obtain an $$O(\varepsilon^{-4/3})$$ rate (including computational cost) for finding stationary points of nonconvex functions using cutting plane methods. This improves on the best known rate achieved by cubic regularization of $$O(\varepsilon^{-3/2})$$. Our techniques utilize the convex until proven guilty principle proposed by Carmon, Duchi, Hinder, and Sidford.

### An Analysis of the t-SNE Algorithm for Data Visualization

Sanjeev Arora, Wei Hu and Pravesh K Kothari.
A first line of attack in exploratory data analysis is _data visualization_, i.e., generating a 2-dimensional representation of data that makes _clusters_ of similar points visually identifiable. Standard JL dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications. This work gives a formal framework for the problem of _data visualization_ -- finding a 2 or 3-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the ground-truth'' clustering (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of arbitrary well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in _partially_ recovering cluster structure even when the above deterministic condition is not met.

### Adaptivity to Smoothness in X-armed bandits

Andrea Locatelli and Alexandra Carpentier.
We study the stochastic continuum-armed bandit problem from the angle of adaptivity to unknown regularity of the reward function f. We prove that there exists no strategy for the cumulative regret that adapts optimally to the smoothness of f. We show however that such minimax optimal adaptive strategies exist if the learner is given extra-information about f. Finally, we complement our positive results with matching lower bounds.

### Black-Box Reductions for Parameter-free Online Learning in Banach Spaces

Ashok Cutkosky and Francesco Orabona.
We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce optimization over a constrained domain to unconstrained optimization. All of our reductions run as fast as online gradient descent. We use our new techniques to improve upon the previously best regret bounds for parameter-free learning, and do so for arbitrary norms.

### A Data Prism: Semi-verified learning in the small-alpha regime

Michela Meister and Gregory Valiant.
We consider a simple model of unreliable or crowdsourced data where there is an underlying set of $$n$$ binary variables, each evaluator'' contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $$r$$ of the variables, and the learner is given the true value of a \emph{constant} number of variables. We show that, provided an $$\alpha$$-fraction of the evaluators are good'' (either correct, or with independent noise rate $$p < 1/2$$), then the true values of a $$(1-\varepsilon)$$ fraction of the $$n$$ underlying variables can be deduced as long as $$r > \log_{2-2p}(1/\alpha)$$. For example, if the fraction of good'' evaluators is larger than $$1/16$$ and there is no noise in their responses, then accurate recovery is possible provided each worker evaluates a random set of $$4$$ items. This result is optimal in that if $$r \leq \log_{2-2p}(1/\alpha)$$ the large dataset can contain no information. This setting can be viewed as an instance of the \emph{semi-verified} learning model introduced in CSV17, which explores the tradeoff between the number of items evaluated by each worker and the fraction of good'' evaluators. In the standard adversarial setting, our algorithm requires $$\widetilde{O}\left(n^{\log_{2-2p}(1/\alpha)}\right)$$ evaluators. However, the algorithm runs in near linear time, $$\widetilde{O}_{r,\varepsilon}(n)$$, and hence would require only a near-linear number of evaluations in the weaker model in which the adversary's responses to each $$r$$-tuple of items must be independent of the set of evaluations collected. These settings and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment. This extreme parameter regime, where the fraction of reliable data is small (inverse exponential in the amount of data provided by each source), is relevant to a number of practical settings. For example, the setting where one collects a dataset on customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged together with the preferences of the demographic of interest for a \emph{constant} (polynomial in $$1/\alpha$$ but independent of $$n$$), number of randomly selected items, to recover an accurate estimate of the entire set of preferences, even if the fraction of the original dataset contributed by the demographic of interest is inverse exponential in the number of preferences supplied by each customer. In this sense, our results can be viewed as a data prism'' allowing one to extract the behavior of specific cohorts from a large, mixed, dataset.

### Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

Yuanzhi Li, Tengyu Ma and Hongyang Zhang.
We show that the (stochastic) gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $$\tilde{O}(dr^{2})$$ random linear measurements of a rank $$r$$ positive semidefinite matrix $$X^*$$, we can recover $$X^*$$ by parameterizing it by $$UU^\top$$ with $$U \in \mathbb{R}^{d\times d}$$ and minimizing the squared loss, even if $$r$$ is much less than $$d$$. We prove that starting from a small initialization, gradient descent recovers $$X^*$$ in $$\tilde{O}(\sqrt{r})$$ iterations approximately. The results solve the conjecture of Gunasekar et al. 17 under the restricted isometry property. The technique can be applied to analyzing neural networks with quadratic activations with some technical modifications.

### A Direct Sum for Information Learners

Ido Nachum, Jonathan Shafer and Amir Yehudayoff.
How many bits of information are required to PAC learn a class of hypotheses of VC dimension d? The mathematical setting we follow is of Bassily et al. (2018) where the value of interest is the mutual information $$I(S;A(S))$$ between the input sample S and the hypothesis outputted by the learning algorithm A. We introduce a class of functions of VC dimension d over the domain X with information complexity at least $$\Omega(d \log \log (|X|/d))$$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $$d=1$$. The above result is in fact a special case of a more general phenomenon we explore. We define the notion of \emph{information complexity) of a given class of functions $$\mathcal{H}$$. Roughly speaking, it is the minimum amount of information that an algorithm for $$\mathcal{H}$$ must use to ensure consistency and properness. We prove a direct sum result for information complexity in this context; roughly speaking, the information complexity sums when combining several classes.

### Online learning over a finite action set with limited switching

Jason Altschuler and Kunal Talwar.
We study the value of switching actions in the Prediction From Experts (PFE) problem and Adversarial Multi-Armed Bandits (MAB) problem. First, we revisit the well-studied setting of PFE with switching costs. Many algorithms are known to achieve the optimal minimax order of $$O(\sqrt{T \log n})$$ in expectation for both regret and number of switches, where $$T$$ is the number of iterations and $$n$$ the number of actions. However, no high probability guarantees are known. Our main technical contribution is an algorithms whose regret and number of switches are both of this optimal order with high probability. This settles an open problem of Devroye, Lugosi and Neu(2015), and also directly implies the first high probability guarantees for several problems of interest. Next, to investigate the value of switching actions at a more granular level, we introduce the setting of switching budgets, in which the algorithm is limited to $$S \leq T$$ switches between actions. This entails a limited number of free switches, in contrast to the unlimited number of expensive switches allowed in the switching cost setting. Using the above result and several reductions, we unify previous work and completely characterize the complexity of this switching budget setting up to small polylogarithmic factors: for both the PFE and MAB problems, for all switching budgets $$S \in [1, T]$$, and for both expectation and high probability guarantees. For PFE, we show that the optimal rate is of order $$\tilde{\Theta}(\sqrt{T\log n})$$ for $$S = \Omega(\sqrt{T\log n})$$, and $$\min(\tilde{\Theta}(\tfrac{T\log n}{S}), T)$$ for $$S = O(\sqrt{T \log n})$$. Interestingly, the bandit setting does not have such a phase transition; instead we show the minimax rate decays steadily as $$\min(\tilde{\Theta}(\tfrac{T\sqrt{n}}{\sqrt{S}}), T)$$ for all ranges of $$S \in [1,T]$$. These results recover and generalize the known minimax rates for the (arbitrary) switching cost setting.

### Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent

Gautam Goel, Niangjun Chen and Adam Wierman.
We study smoothed online convex optimization, a version of online convex optimization where the learner incurs a penalty for changing her actions between rounds. Given an $$\Omega(\sqrt{d})$$ lower bound on the competitive ratio of any online algorithm, where d is the dimension of the action space, we ask under what conditions this bound can be beaten. We introduce a novel algorithmic framework for this problem, Online Balanced Descent (OBD), which works by iteratively projecting the previous point onto a carefully chosen level set of the current cost function so as to balance the switching costs and hitting costs. We demonstrate the generality of the OBD framework by showing how, with different choices of “balance,” OBD can improve upon state-of-the-art performance guarantees for both competitive ratio and regret; in particular, OBD is the first algorithm to achieve a dimension-free competitive ratio, $$3+O(1/\alpha)$$, for locally polyhedral costs, where α measures the “steepness” of the costs. We also prove bounds on the dynamic regret of OBD when the balance is performed in the dual space that are dimension-free and imply that OBD has sublinear static regret.

### Faster Rates for Convex-Concave Games

Jacob Abernethy, Kevin A. Lai, Kfir Y. Levy and Jun-Kun Wang.
We consider the use of no-regret algorithms to compute equilibria for particular classes of convex-concave games. While standard regret bounds would lead to convergence rates on the order of $$O(T^{-1/2})$$, recent work \citep{RS13,SALS15} has established $$O(1/T)$$ rates by taking advantage of a particular class of optimistic prediction algorithms. In this work we go further, showing that for a particular class of games one achieves a $$O(1/T^2)$$ rate, and we show how this applies to the Frank-Wolfe method and recovers a similar bound \citep{D15}. We also show that such no-regret techniques can even achieve a linear rate, $$O(\exp(-T))$$, for equilibrium computation under additional curvature assumptions.

### Logistic Regression: The Importance of Being Improper

Dylan Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri and Karthik Sridharan.
Learning linear predictors with the logistic loss --- both in stochastic and online settings ---is a fundamental task in learning and statistics, with direct connections to classification and boosting. Existing "fast rates" for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logisic loss is $$1$$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of mcmahan2012open when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We show that the improved dependency on predictor norm is also near-optimal. Leveraging this improved dependency on the predictor norm also yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $$\tilde{O}(\sqrt{n})$$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parameteric or even nonparametric classes.

### L1 Regression using Lewis Weights Preconditioning and Stochastic Gradient Descent

David Durfee, Kevin A. Lai and Saurabh Sawlani.
We present preconditioned stochastic gradient descent (SGD) algorithms for the $$\ell_1$$ minimization problem $$\min_{x}\|A x - b\|_1$$ in the overdetermined case, where there are far more constraints than variables. Specifically, we have $$A \in \mathbb{R}^{n \times d}$$ for $$n \gg d$$. Commonly known as the Least Absolute Deviations problem, $$\ell_1$$ regression can be used to solve many important combinatorial problems, such as minimum cut and shortest path. SGD-based algorithms are appealing for their simplicity and practical efficiency. Our primary insight is that careful preprocessing can yield preconditioned matrices $$\tilde{A}$$ with strong properties (besides good condition number and low-dimension) that allow for faster convergence of gradient descent. In particular, we precondition using Lewis weights to obtain an isotropic matrix with fewer rows and strong upper bounds on all row norms. We leverage these conditions to find a good initialization, which we use along with recent smoothing reductions and accelerated stochastic gradient descent algorithms to achieve $$\varepsilon$$ relative error in $$\tilde{O}(nnz(A) + d^{2.5} \varepsilon^{-2})$$ time with high probability, where $$nnz(A)$$ is the number of non-zeros in $$A$$. This improves over the previous best result using gradient descent for $$\ell_1$$ regression. We also match the best known running times for interior point methods in several settings. Finally, we also show that if our original matrix $$A$$ is approximately isotropic and the row norms are approximately equal, we can give an algorithm that avoids using fast matrix multiplication and obtains a running time of $$\tilde{O}(nnz(A) + s d^{1.5}\varepsilon^{-2} + d^2\varepsilon^{-2})$$, where $$s$$ is the maximum number of non-zeros in a row of $$A$$. In this setting, we beat the best interior point methods for certain parameter regimes.

### Optimal Single Sample Tests for Structured versus Unstructured Network Data

Guy Bresler and Dheeraj Nagaraj.
We study the problem of testing, using only a single sample, between mean field distributions (like Curie-Weiss, Erd\H{o}s-R\'enyi) and structured Gibbs distributions (like Ising model on sparse graphs and Exponential Random Graphs). Our goal is to test without knowing the parameter values of the underlying models: only the \emph{structure} of dependencies is known. We develop a new approach that applies to both the Ising and Exponential Random Graph settings based on a general and natural statistical test. The test can distinguish the hypotheses with high probability above a certain threshold in the (inverse) temperature parameter, and is optimal in that below the threshold no test can distinguish the hypotheses. The thresholds do not correspond to the presence of long-range order in the models. By aggregating information at a global scale, our test works even at very high temperatures. The proofs are based on distributional approximation and sharp concentration of quadratic forms, when restricted to Hamming spheres. The restriction to Hamming spheres is necessary, since otherwise any scalar statistic is useless without explicit knowledge of the temperature parameter. At the same time, this restriction radically changes the behavior of the functions under consideration, resulting in a much smaller variance than in the independent setting; this makes it impossible to directly apply standard methods (i.e., Stein's method) for concentration of weakly dependent variables. Instead, we carry out an additional tensorization argument using a Markov chain that respects the symmetry of the Hamming sphere.

### A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

Jalaj Bhandari, Daniel Russo and Raghav Singal.
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a \emph{simple and explicit finite time analysis} of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature.

### Privacy-preserving Prediction

Cynthia Dwork and Vitaly Feldman.
Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression. We first describe a simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all prediction private algorithms.

### An Estimate Sequence for Geodesically Convex Optimization

Hongyi Zhang and Suvrit Sra.
In this paper, we propose a Riemannian version of Nesterov's Accelerated Gradient algorithm (R-AGD), and show that for geodesically smooth and strongly convex problems, within a neighborhood whose radius depends on the condition number as well as the sectional curvature of the manifold, R-AGD converges to the minimum with acceleration. Unlike the algorithm in (Liu et al., 2017) which requires the exact solution to a nonlinear equation which in turn may be intractable, our algorithm is constructive and efficiently implementable. Our proof exploits a new estimate sequence and a novel bound on the nonlinear metric distortion, both ideas may be of independent interest.

### The Externalities of Exploration and How Data Diversity Helps Exploitation

Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan and Zhiwei Steven Wu.
Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain individuals or groups. Motivated by these concerns, we initiate the study of the externalities of exploration---the undesirable side effects that the presence of one party may impose on another---under the linear contextual bandits model. We introduce the notion of a group externality, measuring the extent to which the presence one population of users (the majority) impacts the rewards of another (the minority). We show that this impact can, in some cases, be negative, and that, in a certain sense, no algorithm can avoid it. We then move on to study externalities at the individual level, interpreting the act of exploration as an externality imposed on the current user of a system by future users. This drives us to ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm that always chooses the action that currently looks optimal. We improve on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most $$\tilde{O}(T^{1/3})$$. Returning to group-level effects, we show that under the same conditions, negative group externalities essentially vanish if one runs the greedy algorithm. Together, our results uncover a sharp contrast between the high externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.

### Efficient Contextual Bandits in Non-stationary Worlds

Haipeng Luo, Chen-Yu Wei, Alekh Agarwal and John Langford.
Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $$\mathcal{O}(\sqrt{ST})$$ if there are $$T$$ rounds with $$S$$ stationary periods, or more generally $$\mathcal{O}(\Delta^{1/3}T^{2/3})$$ where $$\Delta$$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity $$S$$ or $$\Delta$$, we also develop a parameter free algorithm achieving regret $$\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$$. This improves and generalizes the best existing result $$\Delta^{0.18}T^{0.82}$$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.

### Langevin Monte Carlo and JKO splitting

Espen Bernton.
Algorithms based on discretizing Langevin diffusion are popular tools for sampling from high-dimensional distributions. We develop novel connections between such Monte Carlo algorithms, the theory of Wasserstein gradient flow, and the operator splitting approach to solving PDEs. In particular, we show that a proximal version of the Unadjusted Langevin Algorithm corresponds to a scheme that alternates between solving the gradient flows of two specific functionals on the space of probability measures. Using this perspective, we derive some new non-asymptotic results on the convergence properties of this algorithm.

### Subpolynomial trace reconstruction for random strings and arbitrary deletion probability

Nina Holden, Robin Pemantle and Yuval Peres.
The deletion-insertion channel takes as input a string $$x$$ of $$n$$ bits and outputs a string where bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $$x$$ from many independent outputs (called traces'') of the deletion-insertion channel applied to $$x$$. We show that if $$x$$ is chosen uniformly at random, then $$\exp(O(\log^{1/3} n))$$ traces suffice to reconstruct $$x$$ with high probability. The earlier upper bounds were $$\exp(O(\log^{1/2} n))$$ for the deletion channel with deletion probability $$q <1/2$$, and $$\exp(O( n^{1/3}))$$ for the deletion channel with $$q \ge 1/2$$ or when insertions are allowed. A key ingredient in our proof is a delicate two-step alignment procedure where we estimate the location in each trace corresponding to a given bit of $$x$$. The alignment is done by viewing the strings as random walks, and comparing the increments in the walk associated with the input string and the trace, respectively

### An explicit analysis of the entropic penalty in linear programming

Jonathan Weed.
Solving linear programs by using entropic penalization is an old idea with applications throughout optimization and statistics. Recently, this technique has attracted new interest in machine learning, where it has been used as the basis for the fastest-known algorithms for the optimal transport problem. Crucial to these applications has been an analysis of how quickly solutions to the penalized program approach true optima to the original linear program. Cominetti and San Martín showed that this convergence is exponentially fast; however, their proof does not yield explicit constants, which makes it difficult to determine how this rate depends on various parameters of the original linear program. In this work, we give a new proof of this exponential convergence. Our proof yields user-friendly constants, and has the virtue of being extremely simple. We also show examples where our analysis is tight and give an application to the linear assignment problem.

### Efficient Active Learning of Sparse Halfspaces

Chicheng Zhang.
We study the problem of efficient PAC active learning of d-dimensional linear classifiers, where the goal is to learn a classifier with low error, using as few label queries as possible. Given the extra assumption that there is a t-sparse linear separator that peforms well on the data (t << d), we would like our active learning algorithm to be {\em attribute efficient}, i.e. to have label complexities sublinear in d. In this paper, we provide an efficient algorithm that achieve this goal. Under certain distributional assumptions on the data, our algorithm achieves a label complexity of $$O(t \text{polylog}(d, \frac 1 \varepsilon))$$. In contrast, existing algorithms in this setting either are computationally inefficient, or have label complexity bounds polynomial in $$d$$ or $$\frac 1 \varepsilon$$.

### Marginal Singularity, and the Benefits of Labels in Covariate-Shift

Guillaume Martinet and Samory Kpotufe.
We present new minimax results that concisely capture the relative benefits of source and target labeled data, under \emph{covariate-shift}. Namely, we show that the benefits of target labels are controlled by a \emph{transfer-exponent} $$\gamma$$ that encodes how \emph{singular} $$Q$$ is locally w.r.t. $$P$$, and interestingly allows situations where transfer did not seem possible under previous insights. In fact, our new minimax analysis -- in terms of $$\gamma$$ -- reveals a \emph{continuum of regimes} ranging from situations where target labels have little benefit, to regimes where target labels dramatically improve classification. We then show that a recently proposed semi-supervised procedure can be extended to adapt to unknown $$\gamma$$, and therefore requests target labels only when beneficial, while achieving minimax transfer rates.

### Learning Single Index Models in Gaussian Space

Rishabh Dudeja and Daniel Hsu.
We consider regression problems where the response is a smooth but non-linear function of a $$k$$-dimensional projection of $$p$$ normally-distributed covariates, contaminated with additive Gaussian noise. The goal is to recover the range of the $$k$$-dimensional projection, i.e., the index space. This model is called the multi-index model, and the $$k=1$$ case is called the single-index model. For the single-index model, we characterize the population landscape of a natural semi-parametric maximum likelihood objective in terms of the link function and prove that it has no spurious local minima. We also propose and analyze an efficient iterative procedure that recovers the index space up to error $$\varepsilon$$ using a sample size $$\tilde{O}(p^{O(R^2/\mu)} + p/\varepsilon^2)$$, where $$R$$ and $$\mu$$, respectively, parameterize the smoothness of the link function and the signal strength. When a multi-index model is incorrectly specified as a single-index model, we prove that essentially the same procedure, with sample size $$\tilde{O}(p^{O(kR^2/\mu)} + p/\varepsilon^2)$$, returns a vector that is $$\varepsilon$$-close to being completely in the index space.

### Hidden Integrality and Exponential Rates of SDP Relaxations for Sub-Gaussian Mixture Models

Yingjie Fei and Yudong Chen.
We consider the problem of finding discrete clustering structures under Sub-Gaussian Mixture Models. We establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded by the error of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models. A special case of this result shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our result establishes sufficient conditions for the SDP to correctly recover the cluster memberships of at least $$(1-\delta)$$ fraction of the points for any $$\delta\in(0,1)$$. Error bounds for estimating cluster centers can also be derived directly from our results.

### Counting Motifs with Graph Sampling

Jason Klusowski and Yihong Wu.
Applied researchers often construct a network from data that has been collected from a random sample of nodes, with the goal to infer properties of the parent network from the sampled version. Two of the most widely used sampling schemes are \emph{subgraph sampling}, where we sample each vertex independently with probability $$p$$ and observe the subgraph induced by the sampled vertices, and \emph{neighborhood sampling}, where we additionally observe the edges between the sampled vertices and their neighbors. Despite their ubiquity, theoretical understanding of these sampling models in the context of statistical estimation has been lacking. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for parent graph $$G$$ with maximal degree $$d$$, for any connected motif $$h$$ on $$k$$ vertices, to estimate the number of copies of $$h$$ in $$G$$, denoted by $$s=s(h,G)$$, with a multiplicative error of $$\varepsilon$$,
-For subgraph sampling, the optimal sampling ratio $$p$$ is $$\Theta_{k}\big(\max\{(s\varepsilon^2)^{-\frac{1}{k}}, \; \frac{d^{k-1}}{s\varepsilon^{2}}\}\big)$$, which only depends on the size of the motif but \emph{not} its actual topology. Furthermore, we show that Horvitz-Thompson type estimators are universally optimal for any connected motifs.
- For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieve the sampling ratio $$O_{k}\big(\min\{( \frac{d}{s\varepsilon^2})^{\frac{1}{k-1}}, \; \sqrt{\frac{d^{k-2}}{s\varepsilon^2}}\}\big)$$, which again only depends on the size of $$h$$. This is shown to be optimal for all motifs up to $$4$$ vertices and cliques of all sizes.
The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These theoretical results allow us to quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.

### Approximate Nearest Neighbors in Limited Space

Piotr Indyk and Tal Wagner.
We consider the $$(1+\varepsilon)$$-approximate nearest neighbor search problem: given a set $$X$$ of $$n$$ points in a $$d$$-dimensional space, build a data structure that, given any query point $$y$$, finds a point $$x \in X$$ whose distance to $$y$$ is at most $$(1+\varepsilon) \min_{x \in X} \|x-y\|$$ for an accuracy parameter $$\varepsilon \in (0,1)$$. Our main result is a data structure that occupies only $$O(\varepsilon^{-2} n \log(n) \log(1/\varepsilon))$$ bits of space, assuming all point coordinates are integers in the range $$\{-n^{O(1)} \ldots n^{O(1)}\}$$, i.e., the coordinates have $$O(\log n)$$ bits of precision. This improves over the best previously known space bound of $$O(\varepsilon^{-2} n \log(n)^2)$$, obtained via the randomized dimensionality reduction method of \cite{johnson1984extensions}. We also consider the more general problem of estimating all distances from a collection of query points to all data points $$X$$, and provide almost tight upper and lower bounds for the space complexity of this problem.

### Breaking the $$1/\sqrt{n}$$ Barrier: Faster Rates for Permutation-based Models in Polynomial Time

Cheng Mao, Ashwin Pananjady and Martin Wainwright.
Many applications, including rank aggregation and crowd-labeling, can be modeled in terms of a bivariate isotonic matrix with unknown permutations acting on its rows and columns. We consider the problem of estimating such a matrix based on noisy observations of a subset of its entries, and design and analyze polynomial-time algorithms that improve upon the state of the art. In particular, our results imply that any such $$n \times n$$ matrix can be estimated efficiently in the normalized Frobenius norm at rate $$\widetilde O(n^{-3/4})$$, thus narrowing the gap between $$\widetilde O(n^{-1})$$ and $$\widetilde O(n^{-1/2})$$, which were hitherto the rates of the most statistically and computationally efficient methods, respectively.

### Reductions in Fair Learning and Optimization

Daniel Alabi, Nicole Immorlica and Adam Kalai.
We show that minimizing arbitrary continuous functions of linear losses over a constant number of (possibly overlapping) groups is no harder than linear optimization. Our reduction is efficient (polynomial-time) for a constant number of groups. Of particular interest is its application to learning problems such as fair binary classification with an arbitrary composite loss function that can depend continuously on the error rates in different overlapping groups (or group false positive and false negative rates). We also illustrate applications to fair regression and combinatorial optimization problems such as fair facility location. There are two key ideas: first is to repeatedly use the linear optimization oracle to simulate a projection oracle in projected gradient descent, and second is a gridding approach used to contend with the multiple local (non-global) minima that arbitrary Lipschitz-continuous functions may possess.

### The Many Faces of Exponential Weights in Online Learning

Dirk van der Hoeven, Tim van Erven and Wojciech Kotłowski.
A standard introduction to online learning might place Online Gradient Descent at its center and then proceed to develop generalizations and extensions like Online Mirror Descent and second-order methods. Here we explore the alternative approach of putting exponential weights (EW) first. We show that many standard methods and their regret bounds then follow as a special case by plugging in suitable surrogate losses and playing the EW posterior mean. For instance, we easily recover Online Gradient Descent by using EW with a Gaussian prior on linearized losses, and, more generally, all instances of Online Mirror Descent based on regular Bregman divergences also correspond to EW with a prior that depends on the mirror map. Furthermore, appropriate quadratic surrogate losses naturally give rise to Online Gradient Descent for strongly convex losses and to Online Newton Step. We further interpret several recent adaptive methods (iProd, Squint, and a variation of Coin Betting for experts) as a series of closely related reductions to exp-concave surrogate losses that are then handled by Exponential Weights. Finally, a benefit of our EW interpretation is that it opens up the possibility of sampling from the EW posterior distribution instead of playing the mean. As already observed by Bubeck and Eldan (2015), this recovers the best-known rate in Online Bandit Linear Optimization.

### Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure

Matthew Brennan, Guy Bresler and Wasim Huleihel.
Recently, research in unsupervised learning has gravitated towards exploring statistical-computational gaps induced by sparsity. A recent line of work initiated in \cite{berthet2013complexity} has aimed to explain these gaps through reductions to conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the development of techniques and often led to weaker hardness results that only apply to algorithms robust to different noise distributions or that do not need to know the parameters of the problem. We introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. Our new lower bounds include:
1. Planted Independent Set: We show tight lower bounds for recovering a planted independent set of size $$k$$ in a sparse Erd\H{o}s-R\'{e}nyi graph of size $$n$$ with edge density $$\tilde{\Theta}(n^{-\alpha})$$.
2. Planted Dense Subgraph: If $$p > q$$ are the edge densities inside and outside of the community, we show the first lower bounds for the general regime $$q = \tilde{\Theta}(n^{-\alpha})$$ and $$p - q = \tilde{\Theta}(n^{-\gamma})$$ where $$\gamma \ge \alpha$$, matching the lower bounds predicted in \cite{chen2016statistical}. Our lower bounds apply to a deterministic community size $$k$$, resolving a question raised in \cite{hajek2015computational}.
3. Biclustering: We show strong lower bounds for Gaussian biclustering as a simple hypothesis testing problem to detect a uniformly at random planted flat $$k \times k$$ submatrix.
4. Sparse Rank-1 Submatrix: We show that sparse rank-1 submatrix detection is often harder than biclustering, and obtain two different tight lower bounds for these problems with different reductions from planted clique.
5. Sparse PCA: We give a reduction between rank-1 submatrix and sparse PCA to obtain tight lower bounds in the less sparse regime $$k \gg \sqrt{n}$$, when the spectral algorithm is optimal over the natural SDP. This yields the first tight characterization of a computational barrier for sparse PCA over an entire parameter regime. We also give an alternate reduction recovering the lower bounds of \cite{berthet2013complexity} and \cite{gao2017sparse} in the canonical simple hypothesis testing variant of sparse PCA, the spiked covariance model.
6. New Models: We demonstrate a subtlety in the complexity of sparse PCA and planted dense subgraph by introducing two variants of these problems, biased sparse PCA and planted stochastic block model, and showing that they have different hard regimes than the originals.
Our results demonstrate that, despite the delicate nature of average-case reductions, using natural problems as intermediates can often be beneficial, as is the case for reductions between deterministic problems. Our main technical contribution is to introduce a set of cloning techniques that maintain the level of signal in an instance of a problem while increasing the size of its planted structure. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we introduce.

### Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem

Andre Wibisono.
We study sampling as optimization in the space of measures. We focus on gradient flow-based optimization with the Langevin dynamics as a case study. We investigate the source of the bias of the unadjusted Langevin algorithm (ULA) in discrete time, and consider how to remove or reduce the bias. We point out the difficulty is that the heat flow is exactly solvable, but neither its forward nor backward method is implementable in general, except for Gaussian data. We propose the symmetrized Langevin algorithm, which should have a smaller bias than ULA, at the price of implementing a proximal gradient step in space. We show SLA is in fact consistent for Gaussian target measure, whereas ULA is not. We also illustrate various algorithms explicitly for Gaussian target measure.

### Online Learning: Sufficient Statistics and the Burkholder Method

Dylan Foster, Alexander Rakhlin and Karthik Sridharan.
We uncover a fairly general principle: If regret can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that only depends on these sufficient statistics, not the entire data sequence. This function can be used algorithmically to achieve the regret bound, and the online strategy is only required to keep these sufficient statistics in memory. The characterization is achieved by bringing the full power of the Burkholder Method --- originally developed for certifying probabilistic martingale inequalities --- to bear on the online learning setting. To demonstrate the power of the proposed method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We present a linear-time/space prediction strategy for parameter free supervised learning with linear classes and general smooth norms.

### Minimax Bounds on Stochastic Batched Convex Optimization

John Duchi, Feng Ruan and Chulhee Yun.
We study the stochastic batched convex optimization problem, in which we use many \emph{parallel} observations to optimize a convex function given limited rounds of interaction. In each of $$M$$ rounds, an algorithm may query for information at n points, and after issuing all $$n$$ queries, it receives unbiased noisy function and/or (sub)gradient evaluations at the $$n$$ points. After $$M$$ such rounds, the algorithm must output an estimator. We provide lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for the collections of Lipschitz convex and smooth strongly convex functions. The rates we provide exhibit two interesting phenomena: (1) in terms of the batch size $$n$$, the rate of convergence of batched algorithms (nearly) achieves the conventional fully sequential rate once $$M = O(d \log \log n)$$, where d is the dimension of the domain, while (2) the rate may exponentially degrade as the dimension d increases, in distinction from fully sequential settings.

### Geometric Lower Bounds for Distributed Parameter Estimation under Communication Constraints

Yanjun Han, Ayfer Ozgur and Tsachy Weissman.
We consider parameter estimation in distributed networks, where each node in the network observes an independent sample from an underlying distribution and has $$k$$ bits to communicate its sample to a centralized processor which computes an estimate of a desired parameter of the distribution. We develop lower bounds for the minimax risk of estimating the underlying parameter under squared $$\ell_2$$ loss for a large class of distributions. Our results show that under mild regularity conditions, the communication constraint reduces the effective sample size by a factor of $$d$$ when $$k$$ is small, where $$d$$ is the dimension of the estimated parameter. Furthermore, this penalty reduces at most exponentially with increasing $$k$$, which is the case for some models, e.g., estimating high-dimensional distributions. For other models however, we show that the sample size reduction is re-mediated only linearly with increasing $$k$$ when some sub-Gaussian structure is available. We apply our results to the distributed setting with product Bernoulli model, multinomial model, and both dense/sparse Gaussian location models which recover or strengthen existing results. Our approach significantly deviates from existing approaches for developing information-theoretic lower bounds for communication-efficient estimation. We circumvent the need for strong data processing inequalities used in prior work and develop a geometric approach which builds on a new representation of the communication constraint. This approach allows us to strengthen and generalize existing results with tight logarithmic factors, as well as simpler and more transparent proofs.

### Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance

Yanjun Han, Jiantao Jiao and Tsachy Weissman.
We present \emph{Local Moment Matching (LMM)}, a unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. We construct an efficiently computable estimator that achieves the minimax rates in estimating the distribution up to permutation, and show that the plug-in approach of our unlabeled distribution estimator is universal" in estimating symmetric functionals of discrete distributions. Instead of doing best polynomial approximation explicitly as in existing literature of functional estimation, the plug-in approach conducts polynomial approximation implicitly and attains the optimal sample complexity for the entropy, power sum and support size functionals.

### Iterate averaging as regularization for stochastic gradient descent

Gergely Neu and Lorenzo Rosasco.
We propose and analyze a variant of the classic Polyak--Ruppert averaging scheme, broadly used in stochastic gradient methods. Rather than a uniform average of the iterates, we consider a weighted average, with weights decaying in a geometric fashion. In the context of linear least squares regression, we show that this averaging scheme has a the same regularizing effect, and indeed is asymptotically equivalent, to ridge regression. In particular, we derive-finite sample bounds for the proposed approach that match the best known results for regularized stochastic gradient methods.

### Smoothed Analysis for Efficient Semi-definite Programming

Srinadh Bhojanapalli, Nicolas Boumal, Prateek Jain and Praneeth Netrapalli.
Semi-definite programs are important in learning and combinatorial optimization with numerous applications. In pursuit of low rank solutions and low complexity algorithms, we consider the Burer-Monteiro factorization approach for solving SDPs. Existing results for such factorization based technique either apply to a very restricted class of problems \citep{boumal2016non} or do not provide polynomial time computational complexity \citep{journee2010low}. In this work, we show that Burer-Monteiro factorization can be applied to \emph{any} canonical form SDP with a feasible solution, and show that we can find approximate global optima for any rank-constrained SDP as long as the number of constraints scale sub-quadratically with the desired rank of the optimal solution. Our result leverages a simple penalty function based formulation of SDP along with smooth analysis of perturbed SDP to avoid worst-case cost matrices. Applications to two important problems, MAXCUT and matrix-completion, demonstrate generality of our technique and how we can ensure polynomial time algorithm despite producing optimal low-rank solutions.

### Learning from Unreliable Datasets

Themis Gouleakis, Christos Tzamos and Manolis Zampetakis.
A wide range of learning tasks require human input in labeling massive data. The collected data though are usually low quality and contain inaccuracies and errors. As a result, modern science and business face the problem of learning from unreliable data sets. In this work, we provide a generic approach that is based on verification of only few records of the data set to guarantee high quality learning outcomes for various optimization objectives. Our method, identifies small sets of critical records and verifies their validity. We show that many problems only need $$\text{poly}(1/\varepsilon)$$ verifications, to ensure that the output of the computation is at most a factor of $$(1 \pm \varepsilon)$$ away from the truth. For any given instance, we provide an instance optimal solution that verifies the minimum possible number of records to approximately certify correctness. Then using this instance optimal formulation of the problem we prove our main result: ”every function that satisfies some Lipschitz continuity condition can be certified with a small number of verifications”. We show that the required Lipschitz continuity condition is satisfied even by some NP-complete problems, which illustrates the generality and importance of this theorem. In case this certification step fails, an invalid record will be identified. Removing these records and repeating until success, guarantees that the result will be accurate and will depend only on the verified records. Surprisingly, as we show, for several computation tasks more efficient methods are possible. These methods always guarantee that the produced result is not affected by the invalid records, since any invalid record that affects the output will be detected and verified.

## OPEN PROBLEMS

### The Dependence of Sample Complexity Lower Bounds on Planning Horizon

Nan Jiang and Alekh Agarwal
In reinforcement learning (RL), problems with long planning horizons are perceived as very challenging. The recent advances in PAC RL, however, show that the sample complexity of RL does not depend on planning horizon except at a superficial level. How can we explain such a difference? In this document, we argue that the technical assumptions in the upper bounds might have hidden away the challenges of long horizons. We therefore ask the question of whether one can prove a lower bound with horizon dependence when such assumptions are removed, and provide a few observations on the desired characteristics of the lower bound construction.

### Improper Learning of Mixtures of Gaussians

We ask whether there exists an efficient unsupervised learning algorithm for mixture of Gaussians in the over-complete case (number of mixtures is larger than the dimension). The notion of learning is taken to be worst-case compression-based, to allow for improper learning.

## Call for papers

The 31st Annual Conference on Learning Theory (COLT 2018) will take place in Stockholm, Sweden, on July 5-9, 2018 (with a welcome reception on the 4th), immediately before ICML 2018, which takes place in the same city. We invite submissions of papers addressing theoretical aspects of machine learning and related topics. We strongly support a broad definition of learning theory, including, but not limited to:
-Design and analysis of learning algorithms
-Statistical and computational complexity of learning
-Optimization methods for learning
-Unsupervised, semi-supervised, online and active learning
-Interactions with other mathematical fields
-Interactions with statistical physics
-Artificial neural networks, including deep learning
-High-dimensional and non-parametric statistics
-Learning with algebraic or combinatorial structure
-Geometric and topological data analysis
-Bayesian methods in learning
-Planning and control, including reinforcement learning
-Learning with system constraints: e.g. privacy, memory or communication budget
-Learning from complex data: e.g., networks, time series, etc.
-Learning in other settings: e.g. social, economic, and game-theoretic

Submissions by authors who are new to COLT are encouraged. While the primary focus of the conference is theoretical, the authors may support their analysis by including relevant experimental results.
All accepted papers will be presented in a single track at the conference. At least one of each paper’s authors should be present at the conference to present the work. Accepted papers will be published electronically in the Proceedings of Machine Learning Research (PMLR). The authors of accepted papers will have the option of opting-out of the proceedings in favor of a 1-page extended abstract. The full paper reviewed for COLT will then be placed on the arXiv repository.

### Paper Awards:

COLT will award both best paper and best student paper awards. To be eligible for the best student paper award, the primary contributor(s) must be full-time students at the time of submission. For eligible papers, authors must indicate at submission time if they wish their paper to be considered for a student paper award. The program committee may decline to make these awards, or may split them among several papers.

### Dual Submissions:

Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.

### Formatting:

Submissions are limited to 12 PMLR-formatted pages, plus unlimited additional pages for references and appendices. All details, proofs and derivations required to substantiate the results must be included in the submission, possibly in the appendices. However, the contribution, novelty and significance of submissions will be judged primarily based on the main text (without appendices), and so enough details, including proof details, must be provided in the main text to convince the reviewers of the submissions’ merits. Formatting and submission instructions can be found here.

### Rebuttal Phase:

As in previous years, there will be a rebuttal phase during the review process. Initial reviews will be sent to authors before final decisions have been made. Authors will have the opportunity to provide a short response on the PC’s initial evaluation.

### Open Problems Session:

We also invite submission of open problems (see below).

### Important Dates:

-Paper submission deadline: February 16, 2018, 11:00 PM EST
-Author feedback: April 9-15, 2018
-Conference: July 6-9, 2018 (welcome reception on the 5th)

### Submission:

Papers should be submitted through EasyChair at https://easychair.org/conferences/?conf=colt2018

## Submissions Instructions

### Formatting

Submissions are limited to 12 single-column JMLR-formatted pages, plus additional pages for references and appendices.
Please use the following: LaTeX style files and template.

### Dual submission policy

Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.

### Anonymization

COLT reviewing is not double-blind, and program committee members will have access to author identities. However, in order to aid in minimizing the effect of initial impressions based on author identity, and in order to allow program committee members to seek double-blind reviews from sub-referees, submissions should NOT include author names or other identifying information. Use the [anon] option, as in the LaTeX template, to suppress author names from appearing in the submission. To the extent possible, avoid including directly identifying information. You should still include all relevant references, discussion, and scientific content, even if this might provide significant hints as to the author identity. You should generally refer to your own prior work in third person. Avoid including acknowledgments in the submission (these can be added later). It is expressly allowed to make available online a non-anonymized version of the submission or any other material that might reveal the author identities, e.g. on the author’s web pages, on arXiv, or on any other site.