Associate Professor
MIT, Mathematics
Philippe Rigollet works at the intersection of statistics, machine learning, and optimization, focusing primarily on the design and analysis of statistical methods for high-dimensional problems. His recent research focuses on the statistical limitations of learning under computational constraints.
He has received an NSF CAREER award, a Howard B. Wentz Jr. Junior Faculty Award at Princeton and the best paper award at COLT 2013.
He received a PhD in mathematical statistics in under the supervision of Alexandre Tsybakov. He has held positions as a visiting assistant professor at the Georgia Institute of Technology and then as an assistant professor at Princeton University.
He is currently associate editor for Bernoulli, Electronic Journal of Statistics, Mathematical Statistics and Learning, the SIAM Journal on Mathematics of Data Science and serves as Program Committe co-Chair (with Sebastien Bubeck) for COLT 2018.
MIT, Mathematics
MIT, Mathematics
Princeton University, ORFE
Georgia Tech, Mathematics
Ph.D. in Mathematics
University of Paris VI
B. Sc. in Applied Mathematics
University of Paris VI
B. Sc. in Statistics
Institute for Statistics (ISUP)
The information era has witnessed an explosion in the collection of data in a wide range of applications such as biology, sociology, marketing and finance. Along with this expansion come new theoretical and algorithmic challenges. Statistical learning theory is a field that precisely provides a theoretical framework for the design and analysis of predictive algorithms.
Rigollet's scientific contributions are broad and span many problems from statistical learning theory. They are articulated around one common theme: optimality. This question is a driving force behind the design of new and efficient statistical procedures throughout his research. Conversely, the empirical behavior of certain methods has led me to consider refined notions of optimality to discriminate between good and poor performance. This back-and-forth between theory and methodology, together with statistical intuition is paramount to advance on both fronts.
If you are a prospective Ph.D. student interested in working with me, please apply through MIT Department of Mathematics and indicate Statistics as your primary field interest.
If you are a prospective Postdoc interested in working with me, there are several positions you may apply to
Lucy Xia (co-advised by J. Fan)
Current position: Stein Fellow, Department of Statistics, Stanford
Quentin Berthet
Current position: Lecturer, University of Cambridge
Xin Tong (co-advised by J. Fan)
Current position: Assistant Professor, Marshall School of Business, Unversity of Southern California
Afonso Bandeira
Current position: Assistant Professor, Courant Institute and Center for Data Science, NYU
Irene Waldspurger
Current position: CNRS Researcher, Universite Paris Dauphine
In our papers on sparse PCA, we have redefined the classical notion of optimality to incorporate the computational aspects of statistical methods and showed the existence of a gap between the old and the new notion. This is the first result of its kind and it opens the door for a much better understanding of many more statistical problems, especially in high dimensions. In particular, it allows a better discrimination between statistical procedures that are currently sorted out using ad-hoc and often unrealistic assumptions.
Practically, our approach relies on the use of polynomial time reductions. While such ideas have been developed by theoretical computer scientists for decades, virtually all the current literature on reductions is about worst case complexity, while statistical problems fall naturally in the framework of average case complexity. Unfortunately, the latter is still poorly understood, as current reduction techniques tend to break the fragile distributional assumptions attached to a computational problem. Building on insight from statistics, we have introduced a notion of robustness that allows us to bypass this limitation and, for the first time, to present an average case reduction between two problems that are not artificial.
Looking forward, we aim to (i) push this analysis to other high-dimensional statstical problems and (ii) develop a toolbox to derive computational lower bounds in a more systematic fashion.
Consider the following basic question: given two populations (arms), at each time t=1,...,T, sample from only one of the populations and receive a random reward dictated by the properties of the sampled population. The objective is to devise a policy that maximizes the expected cumulative reward over the T rounds. While the terminology ``arm" comes from a well known analogy with slot machines the original motivation for this questions comes from clinical trials where each arm corresponds to a treatment and rewards measure the efficacy of this treatment on a patient. When devising a strategy to solve this problem, one faces the so-called exploration versus exploitation dilemma. Indeed, there is a clear tradeoff between discovering which treatment is the most effective (exploration) and administering the best treatment to as many patients as possible (exploitation).
In practice, it may not be possible to process past rewards after each patient to decide which treatment to administer next. Rather, clinical trials are performed in a small number of phases, typically two: in the first phase treatments are allocated randomly during a pilot study and in the second phase, treatments are re-allocated according to the results of the pilot study. In this two-phase approach, the pilot study focuses on exploration while he second one focuses on exploitation. This contrasts with the basic problem described above that consists in T phases, one per patient. This begs the following questions: how does one reunite the two frameworks? Do extra phases significantly improve upon the basic two-phase approach?
These questiosn extend beyond the framework of bandit problems to the case of inference, in particular for multiple hypothesis testing where the goal is to identify significant features.
A canonical example fo this research project is the multi-reference alignment problem where one observes noisy shifted versions of an unknown signal. The goal is to estimate the signal up to a gloabl shift. The algebraic stucture here is rather simple: it is the group of periodic shifts of vectors of a given length.
This problem has receive attention from the computational point of view where connections to the Unique Games Conjecture were established and exploited. However, the the statistical understanding of this problem is rather limited and we study optimal rates of estimations in this context.
Isotonic regression is a standard problem in shape-constrained estimation where the goal is to estimate an unknown nondecreasing regression function \(f\) from independent pairs \((x_i, y_i)\) where \(\mathbb{E}[y_i]=f(x_i), i=1, \ldots n\). While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart where one is given only the unordered sets \(\{x_1, \ldots, x_n\}\) and \(\{y_1, \ldots, y_n\}\). In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moments conditions on \(y_i\) and to give an efficient algorithm achieving optimal rates. Both upper and lower bounds employ moment-matching arguments that are also pertinent to learning mixtures of distributions and deconvolution.
We propose a new method to estimate Wasserstein distances and optimal transport plans between two probability distributions from samples in high-dimension. Unlike plug-in rules that simply replace the true distributions by their empirical counterpart, we propose a new regularization method that leverages the clustered nature of data that is often encountered in practice. Our theoretical results indicate that this method overcomes the curse of dimensionality. This theoretical guarantee is supported by numerical experiments on high-dimensional data for various tasks, including domain adaptation in single-cell RNA sequencing data.
Independent component analysis (ICA) is a cornerstone of modern data analysis. Its goal is to recover a latent random vector S with independent components from samples of X=AS where A is an unknown mixing matrix. Critically, all existing methods for ICA rely on and exploit strongly the assumption that S is not Gaussian as otherwise A becomes unidentifiable. In this paper, we show that in fact one can handle the case of Gaussian components by imposing structure on the matrix A. Specifically, we assume that A is sparse and generic in the sense that it is generated from a sparse Bernoulli-Gaussian ensemble. Under this condition, we give an efficient algorithm to recover the columns of A given only the covariance matrix of X as input even when S has several Gaussian components.
Understanding the molecular programs that guide cellular differentiation during development is a major goal of modern biology. Here, we introduce an approach, WADDINGTON-OT, based on the mathematics of optimal transport, for inferring developmental landscapes, probabilistic cellular fates and dynamic trajectories from large-scale single-cell RNA-seq (scRNA-seq) data collected along a time course. We demonstrate the power of WADDINGTON-OT by applying the approach to study 65,781 scRNA-seq profiles collected at 10 time points over 16 days during reprogramming of fibroblasts to iPSCs. We construct a high-resolution map of reprogramming that rediscovers known features; uncovers new alternative cell fates including neural- and placental-like cells; predicts the origin and fate of any cell class; highlights senescent-like cells that may support reprogramming through paracrine signaling; and implicates regulatory models in particular trajectories. Of these findings, we highlight Obox6, which we experimentally show enhances reprogramming efficiency. Our approach provides a general framework for investigating cellular differentiation.
The growing role of data-driven approaches to scientific discovery has unveiled a large class of models that involve latent transformations with a rigid algebraic constraint. Among them, multi-reference alignment (MRA) is a simple model that captures fundamental aspects of the statistical and algorithmic challenges arising from this new paradigm. In this model, an unknown signal is subject to two types of corruption: a latent cyclic shift and the more traditional additive white noise. The goal is to recover the signal at a certain precision from independent samples. While at high signal-to-noise ratio (SNR), the number of observations needed to recover a generic signal is proportional to 1/SNR, we show that it rises to 1/SNR^3 in the more realistic low SNR regime. We propose an algorithm that achieves this optimal dependence on the SNR. Furthermore, we extend our results to cover a heterogeneous MRA model where the samples come from a mixture of signals, as is often the case in applications such as Cryo-Electron Microscopy, where molecules may have different conformations. We provide the first known procedure that provably achieves signal recovery in the low SNR regime for heterogeneous MRA.
This paper describes optimal rates of adaptive estimation of a vector in the multi-reference alignment model, a problem with important applications in fields such as signal processing, image processing, and computer vision, among others. We describe how this model can be viewed as a multivariate Gaussian mixture model under the constraint that the centers belong to the orbit of a group. This enables us to derive matching upper and lower bounds that feature an interesting dependence on the signal-to-noise ratio of the model. Both upper and lower bounds are articulated around a tight local control of Kullback-Leibler divergences that showcases the central role of moment tensors in this problem.
We call a learner super-teachable if a teacher can trim down an iid training set while making the learner learn even better. We provide sharp super-teaching guarantees on two learners: the maximum likelihood estimator for the mean of a Gaussian, and the large margin classifier in 1D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems.
There has been a recent surge of interest in studying permutation-based models for ranking from pairwise comparison data. Despite being structurally richer and more robust than parametric ranking models, permutation-based models are less well understood statistically and generally lack efficient learning algorithms. In this work, we study a prototype of permutation-based ranking models, namely, the noisy sorting model. We establish the optimal rates of learning the model under two sampling procedures. Furthermore, we provide a fast algorithm to achieve near-optimal rates if the observations are sampled independently. Along the way, we discover properties of the symmetric group which are of theoretical interest.
This course is mainly about learning a regression function from a collection of observations. In this chapter, after defining this task formally, we give an overview of the course and the questions around regression. We adopt the statistical learning point of view where the task of prediction prevails. Last updated in June 2017.
Computing optimal transport distances such as the earth mover's distance is a fundamental problem in machine learning, statistics, and computer vision. Despite the recent introduction of several algorithms with good empirical performance, it is unknown whether general optimal transport distances can be approximated in near-linear time. This paper demonstrates that this ambitious goal is in fact achieved by Cuturi's Sinkhorn Distances, and provides guidance towards parameter tuning for this algorithm. This result relies on a new analysis of Sinkhorn iterations that also directly suggests a new algorithm Greenkhorn with the same theoretical guarantees. Numerical simulations illustrate that Greenkhorn significantly outperforms the classical Sinkhorn algorithm in practice.
Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. Our contribution is twofold: (i) we establish the optimal sample complexity achievable in this problem and show that it is governed by a natural parameter, which we call the \emph{cycle sparsity}; (ii) we propose a provably fast combinatorial algorithm that implements the method of moments efficiently and achieves optimal sample complexity. Finally, we give experimental results that confirm our theoretical findings.
eterminantal point processes (DPPs) have wide-ranging applications in machine learning, where they are used to enforce the notion of diversity in subset selection problems. Many estimators have been proposed, but surprisingly the basic properties of the maximum likelihood estimator (MLE) have received little attention. The difficulty is that it is a non-concave maximization problem, and such functions are notoriously difficult to understand in high dimensions, despite their importance in modern machine learning. Here we study both the local and global geometry of the expected log-likelihood function. We prove several rates of convergence for the MLE and give a complete characterization of the case where these are parametric. We also exhibit a potential curse of dimensionality where the asymptotic variance of the MLE scales exponentially with the dimension of the problem. Moreover, we exhibit an exponential number of saddle points, and give evidence that these may be the only critical points.
We prove that Kendall's Rank correlation matrix converges to the Marcenko-Pastur law, under the assumption that the observations are i.i.d random vectors X_1, …, X_n with components that are independent and absolutely continuous with respect to the Lebesgue measure. This is the first result on the empirical spectral distribution of a multivariate U-statistic.
We consider the problem associated to recovering the block structure of an Ising model given independent observations on the binary hypercube. This new model, called the Ising blockmodel, is a perturbation of the mean field approximation of the Ising model known as the Curie-Weiss model: the sites are partitioned into two blocks of equal size and the interaction between those of the same block is stronger than across blocks, to account for more order within each block. We study probabilistic, statistical and computational aspects of this model in the high-dimensional case when the number of sites may be much larger than the sample size.
Given a matrix the seriation problem consists in permuting its rows in such way that all its columns have the same shape, for example, they are monotone increasing. We propose a statistical approach to this problem where the matrix of interest is observed with noise and study the corresponding minimax rate of estimation of the matrices. Specifically, when the columns are either unimodal or monotone, we show that the least squares estimator is optimal up to logarithmic factors and adapts to matrices with a certain natural structure. Finally, we propose a computationally efficient estimator in the monotonic case and study its performance both theoretically and experimentally. Our work is at the intersection of shape constrained estimation and recent work that involves permutation learning, such as graph denoising and ranking.
Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.
Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, H\"older smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid such as eigenvector delocalization and spectral decay. We also present extensions to more than two dimensions as well as several other graphs.
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
Invited book review for the Journal of the American Statistical Association.
Invited comment on the discussion paper "Hypothesis testing by convex optimization" by Alexander Goldenshluger, Anatoli Juditsky and Arkadi Nemirovski.
High-dimensional statistical tests often ignore correlations to gain simplicity and stability leading to null distributions that depend on functionals of correlation matrices such as their Frobenius norm and other \(\ell_r\) norms. Motivated by the computation of critical values of such tests, we investigate the difficulty of estimation the functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. Our results are further illustrated in simulated data as well as an empirical study of data arising in financial econometrics.
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches gives already close to minimax optimal regret bounds and we also evaluate the number of trials in each batch. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
We consider the problem of aggregating a general collection of affine estimators for fixed design regression. Relevant examples include some commonly used statistical estimators such as least squares, ridge and robust least squares estimators. Dalalyan and Salmon (2012) have established that, for this problem, exponentially weighted (EW) model selection aggregation leads to sharp oracle inequalities in expectation, but similar bounds in deviation were not previously known. While results (Dai, Rigollet, Zhang, 2012) indicate that the same aggregation scheme may not satisfy sharp oracle inequalities with high probability, we prove that a weaker notion of oracle inequality for EW that holds with high probability. Moreover, using a generalization of the newly introduced \(Q\)-aggregation scheme we also prove sharp oracle inequalities that hold with high probability. Finally, we apply our results to universal aggregation and show that our proposed estimator leads simultaneously to all the best known bounds for aggregation, including \(\ell_q\)-aggregation, \(q \in (0,1)\), with high probability.
We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary functions (learners) together with the prior, we generalize the results obtained by Dai, Rigollet and Zhang (2012) for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the Q-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques somewhat depart from traditional proofs by making most of the standard arguments on the Laplace transform of the empirical process to be controlled.
In the context of sparse principal component detection, we bring evidence towards the existence of a statistical price to pay for computational efficiency. We measure the performance of a test by the smallest signal strength that it can detect and we propose a computationally efficient method based on semidefinite programming. We also prove that the statistical performance of this test cannot be strictly improved by any computationally efficient method. Our results can be viewed as complexity theoretic lower bounds conditionally on the assumptions that some instances of the planted clique problem cannot be solved in randomized polynomial time.
We study the stochastic multi-armed bandit problem when one knows the value \(\mu^{(\star)}\) of an optimal arm, as a well as a positive lower bound on the smallest positive gap \(\Delta\). We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows \(\Delta\), and bounded regret of order \(1/\Delta\) is not possible if one only knows \(\mu^{(\star)}\).
We perform a finite sample analysis of the detection levels for sparse principal components of a high-dimensional covariance matrix. Our minimax optimal test is based on a sparse eigenvalue statistic. Alas, computing this test is known to be NP-complete in general, and we describe a computationally efficient alternative test using convex relaxations. Our relaxation is also proved to detect sparse principal components at near optimal detection levels, and it performs well on simulated datasets. Moreover, using polynomial time reductions from theoretical computer science, we bring significant evidence that our results cannot be improved, thus revealing an inherent trade off between statistical and computational performance.
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably localized static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.
Discussion of ``Minimax Estimation of Large Covariance Matrices under L1-Norm" by Tony Cai and Harrison Zhou.
Given a finite family of functions, the goal of model selection is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.
In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it does not require identifiability of the parameter or even that the model on the systematic component is true. It is shown that this problem can be solved by constrained and/or penalized likelihood maximization and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally all the bounds are proved to be optimal in a minimax sense.
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss \(\varphi\). Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its \(\varphi\)-type I error is below a pre-specified level and (ii), it has \(\varphi\)-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error.
In high-dimensional linear regression, the goal pursued here is to estimate an unknown regression function using linear combinations of a suitable set of covariates. One of the key assumptions for the success of any statistical procedure in this setup is to assume that the linear combination is sparse in some sense, for example, that it involves only few covariates. We consider a general, non necessarily linear, regression with Gaussian noise and study a related question that is to find a linear combination of approximating functions, which is at the same time sparse and has small mean squared error (MSE). We introduce a new estimation procedure, called \emph{Exponential Screening} that shows remarkable adaptation properties. It adapts to the linear combination that optimally balances MSE and sparsity, whether the latter is measured in terms of the number of non-zero entries in the combination (\(\ell_0\) norm) or in terms of the global weight of the combination (\(\ell_1\) norm). The power of this adaptation result is illustrated by showing that Exponential Screening solves optimally and simultaneously all the problems of aggregation in Gaussian regression that have been discussed in the literature. Moreover, we show that the performance of the Exponential Screening estimator cannot be improved in a minimax sense, even if the optimal sparsity is known in advance. The theoretical and numerical superiority of Exponential Screening compared to state-of-the-art sparse procedures is also discussed.
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i), its probability of type~I error is below a pre-specified level and (ii), it has probability of type ~II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical objective subject to an empirical constraint. The novelty of the method is that the classifier output by this problem is shown to satisfy the original constraint on type~I error. This strict enforcement of the constraint has interesting consequences on the control of the type~II error and we develop new techniques to handle this situation. Finally, connections with chance constrained optimization are evident and are investigated.
Me consider a bandit problem which involves sequential sampling from two populations (arms). Each arm produces a noisy reward realization which depends on an observable random covariate. The goal is to maximize cumulative expected reward. We derive general lower bounds on the performance of any admissible policy, and develop an algorithm whose performance achieves the order of said lower bound up to logarithmic terms. This is done by decomposing the global problem into suitably ``localized'' bandit problems. Proofs blend ideas from nonparametric statistics and traditional methods used in the bandit literature.
In the context of density level set estimation, we study the convergence of general plug-in methods under two main assumptions on the density for a given level \(\lambda\). More precisely, it is assumed that the density (i) is smooth in a neighborhood of \(\lambda\) and (ii) has \(\gamma\)-exponent at level \(\lambda\). Condition (i) ensures that the density can be estimated at a standard nonparametric rate and condition (ii) is similar to Tsybakov's margin assumption which is stated for the classification framework. Under these assumptions, we derive optimal rates of convergence for plug-in estimators. Explicit convergence rates are given for plug-in estimators based on kernel density estimators when the underlying measure is the Lebesgue measure. Lower bounds proving optimality of the rates in a minimax sense when the density is Holder smooth are also provided.
Given a finite collection of estimators or classifiers, we study the problem of model selection type aggregation, that is, we construct a new estimator or classifier, called aggregate, which is nearly as good as the best among them with respect to a given risk criterion. We define our aggregate by a simple recursive procedure which solves an auxiliary stochastic linear programming problem related to the original nonlinear one and constitutes a special case of the mirror averaging algorithm. We show that the aggregate satisfies sharp oracle inequalities under some general assumptions. The results are applied to several problems including regression, classification and density estimation.
We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples.
We study the problem of finding the best linear and convex combination of M estimators of a density with respect to the mean squared risk. We suggest aggregation procedures and we prove sharp oracle inequalities for their risks, i.e., oracle inequalities with leading constant 1. We also obtain lower bounds showing that these procedures attain optimal rates of aggregation. As an example, we consider aggregation of multivariate kernel density estimators with different bandwidths. We show that linear and convex aggregates mimic the kernel oracles in asymptotically exact sense. We prove that, for Pinsker's kernel, the proposed aggregates are sharp asymptotically minimax simultaneously over a large scale of Sobolev classes of densities. Finally, we provide simulations demonstrating performance of the convex aggregation procedure.
We study the problem of nonparametric estimation of a probability density of unknown smoothness in \(L_2(\mathbb{R})\). Expressing mean integrated squared error (MISE) in the Fourier domain, we show that it is close to mean squared error in the Gaussian sequence model. Then applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality. Two consequences of this oracle inequality are that the proposed estimator is sharp minimax adaptive over a scale of Sobolev classes of densities, and that its MISE is asymptotically smaller than or equal to that of kernel density estimators with any bandwidth provided that the kernel belongs to a large class of functions including many standard kernels.
Short note on aggregation published in Oberwolfach Reports following the Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.
We study the problem of the nonparametric estimation of a probability density in \(L_2(\mathbb{R})\). Expressing the mean integrated squared error in the Fourier domain, we show that it is close to the mean squared error in the Gaussian sequence model. Then, applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality and a kernel oracle inequality. As a consequence, the proposed estimator is sharp minimax adaptive (i.e. up to a constant) on a scale of Sobolev classes of densities.
The goal of this course is to cover in a cohesive and systematic way the tools from high-dimensional probability that have become pervasive to a variety of applications, primarily in data science.
This class offers an in-depth introduction to the theoretical foundations of statistical methods that are useful in many applications. The goal is to understand the role of mathematics in the research and development of efficient statistical methods.
This course offers an introduction to the finite sample analysis of high-dimensional statistical methods. The goal is to present various proof techniques for state-of-the-art methods in regression, matrix estimation and principal component analysis (PCA) as well as optimality guarantees. The course ends with research questions that are currently open.
Topics covered include: 1. Sub-Gaussian concentration, 2. High-dimensional linear regression, 3. Misspecified linear models, 4. Matrix estimation, 5. Minimax lower bounds
This course assumes introductory knowledge of statistics and probability.
This class offers an in-depth introduction to the theoretical foundations of statistical methods that are useful in many applications. The goal is to understand the role of mathematics in the research and development of efficient statistical methods.
Broadly speaking, Machine Learning refers to the automated identification of patterns in data. As such it has been a fertile ground for new statistical and algorithmic developments. The purpose of this course is to provide a mathematically rigorous introduction to these developments with emphasis on methods and their analysis. The class will be split in three main parts: (1) The statistical theory of machine learning, (2) Algorithms and convexity and (3) Online learning.
The course website is accesible via this link.
This course offers an introduction to the finite sample analysis of high-dimensional statistical methods. The goal is to present various proof techniques for state-of-the-art methods in regression, matrix estimation and principal component analysis (PCA) as well as optimality guarantees. The course ends with research questions that are currently open.
Topics covered include: 1. Sub-Gaussian concentration, 2. High-dimensional regression, 3. Matrix estimation, 4. Sparse PCA, 5. Minimax lower bounds, 6. Computational limitations.
This course assumes introductory knowledge of statistics, probability and computational complexity.
A first introduction to probability and statistics. This course provides the background to understand and produce rigorous statistical analysis including estimation, confidence intervals, hypothesis testing and regression. Applicability and limitations of these methods are illustrated in the light of modern data sets and manipulation of the statistical software R. Precepts are based on real data analysis.
A graduate-level introduction to statistical theory and methods and some of the most important and commonly-used principles of statistical inference. Covers the statistical theory and methods for point estimation, confidence intervals, and hypothesis testing.
A theoretical introduction to statistical learning problems related to regression and classification. Topics covered include: Principle Component Analysis, nonparametric estimation, sparse regression, and Classification and Statistical learning,
I am/have been giving time to my community by serving in the following qualities.
I would be happy to talk to you if you need my assistance in your research or academic curriculum. Though I have limited time for students, please send me an email with a detailed explanantion of your interests.
2-279 in the Department of Mathematics
E17-467 in the Institute for Data Systems and Society
The best way to reach me is to send me an email at rigo...@math.mit.edu