Academic Positions

  • Present 2016

    Associate Professor

    MIT, Mathematics

  • 2016 2015

    Assistant Professor

    MIT, Mathematics

  • 2014 2008

    Assistant Professor

    Princeton University, ORFE

  • 2008 2007

    Visiting Assistant Professor

    Georgia Tech, Mathematics

Education & Training

  • Ph.D. 2006

    Ph.D. in Mathematics

    University of Paris VI

  • B. Sc.2002

    B. Sc. in Applied Mathematics

    University of Paris VI

  • B. Sc..2001

    B. Sc. in Statistics

    Institute for Statistics (ISUP)


  • 2018
    MIT-Skoltech Seed Fund
    Statistical Optimal Transport.
  • 2018
    Chan-Zuckerberg Initiative: Human Cell Atlas
    Reconstruction of cellular trajectories via optimal transport.
  • 2017
    NSF grant TRIPODS-1740751
    TRIPODS: Institute for Foundations of Data Science.
  • 2017
    NSF grant DMS-1712596
    Statistical Estimation with Algebraic Structure.
  • 2017
    Office of Naval Research
    Statistical Learning Theory of Complex Causal Models.
  • 2016
    DARPA -- Fundamentals of Learning
    Foundations of Scalable Statistical Learning.
  • 2016
    NEC Corporation Fund for Research in Computers and Communications
    Awarded by the MIT School of Science to support research on "Average-Case Complexity of Statistical Problems".
  • 2015
    Invited lecturer. Machine Learning Summer School.
    The machine learning summer school (MLSS) series promulgates modern methods of statistical machine learning and inference. Machine learning summer schools present topics which are at the core of modern Machine Learning, from fundamentals to state-of-the-art practice. The speakers are leading experts in their field who talk with enthusiasm about their subjects. More information about MLSS here.
  • 2013
    Best Paper Award. Conference on Learning Theory (COLT)
    Berthet and Rigollet received "best paper award" at the Conference On Learning Theory (COLT) for their paper "Complexity Theoretic Lower Bounds for Sparse Principal Component Detection". In this work, they show that employing computationally efficient statistical methods in the context of sparse principal component analysis leads to an ineluctable deterioration of statistical performance, regardless of the efficient method employed. In particular, their paper draw new connections between statistical learning and computational complexity. An extended version of the paper is available here.
  • 2013
    Howard B. Wentz Jr. Junior Faculty Award
    The Howard B. Wentz, Jr. Junior Faculty Award from the School of Engineering and Applied Science (SEAS) at Princeton University “intended to recognize and assist promising junior faculty members”.
  • 2013
    NSF grant DMS-1317308
    Statistical and Computational Tradeoffs in High Dimensional Learning.
  • 2011
    NSF CAREER Award DMS-1053987
    Large Scale Stochastic Optimization and Statistics.
  • 2010
    250th Anniversary Fund for Innovation in Undergraduate Education
    Competitive award given from the office of the President at Princeton University. The criteria for the award are: innovation and excellence, the positive impact of the course on the undergraduate curriculum, and its effectiveness in demonstrating the faculty's continuing commitment to undergraduate teaching.
  • 2009
    NSF grant DMS-0906424
    Optimal sequential allocation in dynamic environments.
  • 2009
    Dean's commendation list for outstanding teaching.
    Awarded on the basis of outstanding teaching evaluations for the course "Statistical Learning Theory and Nonparametric Estimation".


Nilin Abrahamsen

Ph.D Student


Jan-Christian Huetter

Ph.D Student


Jonathan Weed

Ph.D Student


Cheng Mao

Ph.D Student


Victor-Emmanuel Brunel

Instructor in Statistics


Andrej Risteski

Wiener Postdoc


Elina Robeva

Instructor in Statistics


Goeff Schiebinger



Marco Avella Medina



Join us!


If you are a prospective Ph.D. student interested in working with me, please apply through MIT Department of Mathematics and indicate Statistics as your primary field interest.


If you are a prospective Postdoc interested in working with me, there are several positions you may apply to

  • Norbert Wiener Fellowship via the Statistics and Data Science Center (deadline Feb. 1, 2018)
  • Postdoctoral Fellowship on applications of optimal transport to the analysis of single-cell genomics data. Skills required: mathematical (optimal transport, gradient flows, statistics), computational (algorithms, coding), interest in biology. Please send me an email for more details.


  • Ph.D.2015

    Lucy Xia (co-advised by J. Fan)

    Current position: Stein Fellow, Department of Statistics, Stanford

  • Ph.D.2014

    Quentin Berthet

    Current position: Lecturer, University of Cambridge

  • Ph.D. 2012

    Xin Tong (co-advised by J. Fan)

    Current position: Assistant Professor, Marshall School of Business, Unversity of Southern California

  • Post-Doc 2016

    Afonso Bandeira

    Current position: Assistant Professor, Courant Institute and Center for Data Science, NYU

  • Post-Doc2016

    Irene Waldspurger

    Current position: CNRS Researcher, Universite Paris Dauphine

Current Research Projects

  • image

    Statistical and Computational Tradeoffs in High-dimensional Statistics

    Computational limitations of statistical problems have largely been ignored or simply overcome by ad hoc relaxations techniques. If optimal methods cannot be computed in reasonable time, what is the best possible statistical performance of a computationally efficient procedure?

    In our papers on sparse PCA, we have redefined the classical notion of optimality to incorporate the computational aspects of statistical methods and showed the existence of a gap between the old and the new notion. This is the first result of its kind and it opens the door for a much better understanding of many more statistical problems, especially in high dimensions. In particular, it allows a better discrimination between statistical procedures that are currently sorted out using ad-hoc and often unrealistic assumptions.

    Practically, our approach relies on the use of polynomial time reductions. While such ideas have been developed by theoretical computer scientists for decades, virtually all the current literature on reductions is about worst case complexity, while statistical problems fall naturally in the framework of average case complexity. Unfortunately, the latter is still poorly understood, as current reduction techniques tend to break the fragile distributional assumptions attached to a computational problem. Building on insight from statistics, we have introduced a notion of robustness that allows us to bypass this limitation and, for the first time, to present an average case reduction between two problems that are not artificial.

    Looking forward, we aim to (i) push this analysis to other high-dimensional statstical problems and (ii) develop a toolbox to derive computational lower bounds in a more systematic fashion.

  • image

    Multiphase inference

    Traditional sequential allocation problems such as bandit problems allow the user to collect information after each action. In practice, for example in clinical trial, such a feedaback cannot exist and one proceed in phases or stages. The question is then to find the size of each pahse and analyze the benefit of adding extra phases to the problem.

    Consider the following basic question: given two populations (arms), at each time t=1,...,T, sample from only one of the populations and receive a random reward dictated by the properties of the sampled population. The objective is to devise a policy that maximizes the expected cumulative reward over the T rounds. While the terminology ``arm" comes from a well known analogy with slot machines the original motivation for this questions comes from clinical trials where each arm corresponds to a treatment and rewards measure the efficacy of this treatment on a patient. When devising a strategy to solve this problem, one faces the so-called exploration versus exploitation dilemma. Indeed, there is a clear tradeoff between discovering which treatment is the most effective (exploration) and administering the best treatment to as many patients as possible (exploitation).

    In practice, it may not be possible to process past rewards after each patient to decide which treatment to administer next. Rather, clinical trials are performed in a small number of phases, typically two: in the first phase treatments are allocated randomly during a pilot study and in the second phase, treatments are re-allocated according to the results of the pilot study. In this two-phase approach, the pilot study focuses on exploration while he second one focuses on exploitation. This contrasts with the basic problem described above that consists in T phases, one per patient. This begs the following questions: how does one reunite the two frameworks? Do extra phases significantly improve upon the basic two-phase approach?

    These questiosn extend beyond the framework of bandit problems to the case of inference, in particular for multiple hypothesis testing where the goal is to identify significant features.

  • image

    Algebraic regularization

    A key aspect of high-dimensional statistics, specifically when the number of paramters to estimate is larger than the sample size, consists in assuming a lower dimensinal structure. For example, the sparsity assumption often employes assumes a lower dimensional linear structure. In several examples, the data follows a simple algebraic structure that can be exploited to bypass the curse of dimensionality.

    A canonical example fo this research project is the multi-reference alignment problem where one observes noisy shifted versions of an unknown signal. The goal is to estimate the signal up to a gloabl shift. The algebraic stucture here is rather simple: it is the group of periodic shifts of vectors of a given length.

    This problem has receive attention from the computational point of view where connections to the Unique Games Conjecture were established and exploited. However, the the statistical understanding of this problem is rather limited and we study optimal rates of estimations in this context.

Filter by type:

Sort by year:

Uncoupled isotonic regression via minimum Wasserstein deconvolution

Philippe Rigollet and Jonathan Weed (2018)
Preprint arXiv:1806.10648


Isotonic regression is a standard problem in shape-constrained estimation where the goal is to estimate an unknown nondecreasing regression function \(f\) from independent pairs \((x_i, y_i)\) where \(\mathbb{E}[y_i]=f(x_i), i=1, \ldots n\). While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart where one is given only the unordered sets \(\{x_1, \ldots, x_n\}\) and \(\{y_1, \ldots, y_n\}\). In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moments conditions on \(y_i\) and to give an efficient algorithm achieving optimal rates. Both upper and lower bounds employ moment-matching arguments that are also pertinent to learning mixtures of distributions and deconvolution.

Statistical Optimal Transport via Geodesic Hubs

Aden Forrow, Jan-Christian Hütter, Mor Nitzan, Geoffrey Schiebinger, Philippe Rigollet and Jonathan Weed (2018)
Preprint arXiv:1806.07348


We propose a new method to estimate Wasserstein distances and optimal transport plans between two probability distributions from samples in high-dimension. Unlike plug-in rules that simply replace the true distributions by their empirical counterpart, we propose a new regularization method that leverages the clustered nature of data that is often encountered in practice. Our theoretical results indicate that this method overcomes the curse of dimensionality. This theoretical guarantee is supported by numerical experiments on high-dimensional data for various tasks, including domain adaptation in single-cell RNA sequencing data.

Sparse Gaussian ICA

Nilin Abrahamsen and Philippe Rigollet (2018)
Preprint arXiv:1804.00408


Independent component analysis (ICA) is a cornerstone of modern data analysis. Its goal is to recover a latent random vector S with independent components from samples of X=AS where A is an unknown mixing matrix. Critically, all existing methods for ICA rely on and exploit strongly the assumption that S is not Gaussian as otherwise A becomes unidentifiable. In this paper, we show that in fact one can handle the case of Gaussian components by imposing structure on the matrix A. Specifically, we assume that A is sparse and generic in the sense that it is generated from a sparse Bernoulli-Gaussian ensemble. Under this condition, we give an efficient algorithm to recover the columns of A given only the covariance matrix of X as input even when S has several Gaussian components.

Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming.

Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Siyan Liu, Stacie Lin, Peter Berube, Lia Lee, Jenny Chen, Justin Brumbaugh, Philippe Rigollet, Konrad Hochedlinger, Rudolf Jaenisch, Aviv Regev, Eric Lander (2017)
Preprint bioRxiv:191056


Understanding the molecular programs that guide cellular differentiation during development is a major goal of modern biology. Here, we introduce an approach, WADDINGTON-OT, based on the mathematics of optimal transport, for inferring developmental landscapes, probabilistic cellular fates and dynamic trajectories from large-scale single-cell RNA-seq (scRNA-seq) data collected along a time course. We demonstrate the power of WADDINGTON-OT by applying the approach to study 65,781 scRNA-seq profiles collected at 10 time points over 16 days during reprogramming of fibroblasts to iPSCs. We construct a high-resolution map of reprogramming that rediscovers known features; uncovers new alternative cell fates including neural- and placental-like cells; predicts the origin and fate of any cell class; highlights senescent-like cells that may support reprogramming through paracrine signaling; and implicates regulatory models in particular trajectories. Of these findings, we highlight Obox6, which we experimentally show enhances reprogramming efficiency. Our approach provides a general framework for investigating cellular differentiation.

The sample complexity of multi-reference alignment

Amelia Perry, Jonathan Weed, Afonso Bandeira, Philippe Rigollet and Amit Singer (2017)
Preprint arXiv:1707.00943


The growing role of data-driven approaches to scientific discovery has unveiled a large class of models that involve latent transformations with a rigid algebraic constraint. Among them, multi-reference alignment (MRA) is a simple model that captures fundamental aspects of the statistical and algorithmic challenges arising from this new paradigm. In this model, an unknown signal is subject to two types of corruption: a latent cyclic shift and the more traditional additive white noise. The goal is to recover the signal at a certain precision from independent samples. While at high signal-to-noise ratio (SNR), the number of observations needed to recover a generic signal is proportional to 1/SNR, we show that it rises to 1/SNR^3 in the more realistic low SNR regime. We propose an algorithm that achieves this optimal dependence on the SNR. Furthermore, we extend our results to cover a heterogeneous MRA model where the samples come from a mixture of signals, as is often the case in applications such as Cryo-Electron Microscopy, where molecules may have different conformations. We provide the first known procedure that provably achieves signal recovery in the low SNR regime for heterogeneous MRA.

Optimal rates of estimation for multi-reference alignment

Afonso Bandeira, Philippe Rigollet and Jonathan Weed (2017)
Preprint arXiv:1702.08546


This paper describes optimal rates of adaptive estimation of a vector in the multi-reference alignment model, a problem with important applications in fields such as signal processing, image processing, and computer vision, among others. We describe how this model can be viewed as a multivariate Gaussian mixture model under the constraint that the centers belong to the orbit of a group. This enables us to derive matching upper and lower bounds that feature an interesting dependence on the signal-to-noise ratio of the model. Both upper and lower bounds are articulated around a tight local control of Kullback-Leibler divergences that showcases the central role of moment tensors in this problem.

Teacher improves learning by selecting a training subset

Yuzhe Ma, Robert Nowak, Philippe Rigollet , Xuezhou Zhang and Xiaojin Zhu (2018)
Conference Proceedings of AISTATS 2018. PMLR 84:1366-1375


We call a learner super-teachable if a teacher can trim down an iid training set while making the learner learn even better. We provide sharp super-teaching guarantees on two learners: the maximum likelihood estimator for the mean of a Gaussian, and the large margin classifier in 1D. For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. Empirical experiments show that our algorithm is able to find good super-teaching sets for both regression and classification problems.

Minimax rates and efficient algorithms for noisy sorting

Cheng Mao, Jonathan Weed and Philippe Rigollet (2018)
Conference Proceedings of ALT 2018. PMLR 83:821-847


There has been a recent surge of interest in studying permutation-based models for ranking from pairwise comparison data. Despite being structurally richer and more robust than parametric ranking models, permutation-based models are less well understood statistically and generally lack efficient learning algorithms. In this work, we study a prototype of permutation-based ranking models, namely, the noisy sorting model. We establish the optimal rates of learning the model under two sampling procedures. Furthermore, we provide a fast algorithm to achieve near-optimal rates if the observations are sampled independently. Along the way, we discover properties of the symmetric group which are of theoretical interest.

High Dimensional Statistics

Philippe Rigollet and Jan-Christian Hütter (2017)
Lecture Notes


This course is mainly about learning a regression function from a collection of observations. In this chapter, after defining this task formally, we give an overview of the course and the questions around regression. We adopt the statistical learning point of view where the task of prediction prevails. Last updated in June 2017.

Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration

Jason Altschuler, Jonathan Weed and Philippe Rigollet (2017)
Conference Proceedings of NIPS 2017 (to appear) Spotlight


Computing optimal transport distances such as the earth mover's distance is a fundamental problem in machine learning, statistics, and computer vision. Despite the recent introduction of several algorithms with good empirical performance, it is unknown whether general optimal transport distances can be approximated in near-linear time. This paper demonstrates that this ambitious goal is in fact achieved by Cuturi's Sinkhorn Distances, and provides guidance towards parameter tuning for this algorithm. This result relies on a new analysis of Sinkhorn iterations that also directly suggests a new algorithm Greenkhorn with the same theoretical guarantees. Numerical simulations illustrate that Greenkhorn significantly outperforms the classical Sinkhorn algorithm in practice.

Learning determinantal point processes with moments and cycles

John Urschel, Victor-Emmanuel Brunel, Ankur Moitra and Philippe Rigollet (2017)
Conference Proceedings of ICML 2017, PMLR 70:3511-3520.


Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. Our contribution is twofold: (i) we establish the optimal sample complexity achievable in this problem and show that it is governed by a natural parameter, which we call the \emph{cycle sparsity}; (ii) we propose a provably fast combinatorial algorithm that implements the method of moments efficiently and achieves optimal sample complexity. Finally, we give experimental results that confirm our theoretical findings.

Maximum likelihood estimation of determinantal point processes

Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet and John Urschel (2017)
Conference Proceedings of COLT 2017, PMLR 65:343-345.


eterminantal point processes (DPPs) have wide-ranging applications in machine learning, where they are used to enforce the notion of diversity in subset selection problems. Many estimators have been proposed, but surprisingly the basic properties of the maximum likelihood estimator (MLE) have received little attention. The difficulty is that it is a non-concave maximization problem, and such functions are notoriously difficult to understand in high dimensions, despite their importance in modern machine learning. Here we study both the local and global geometry of the expected log-likelihood function. We prove several rates of convergence for the MLE and give a complete characterization of the case where these are parametric. We also exhibit a potential curse of dimensionality where the asymptotic variance of the MLE scales exponentially with the dimension of the problem. Moreover, we exhibit an exponential number of saddle points, and give evidence that these may be the only critical points.

Marcenko-Pastur Law for Kendall's Tau

Afonso Bandeira, Asad Lohdia and Philippe Rigollet (2017)
Journal Electron. Commun. Probab., 22(32), 1--7.


We prove that Kendall's Rank correlation matrix converges to the Marcenko-Pastur law, under the assumption that the observations are i.i.d random vectors X_1, …, X_n with components that are independent and absolutely continuous with respect to the Lebesgue measure. This is the first result on the empirical spectral distribution of a multivariate U-statistic.

Exact recovery in the Ising blockmodel

Quentin Berthet, Philippe Rigollet and Piyush Srivastava (2016)
Journal Ann. Statist. (to appear)


We consider the problem associated to recovering the block structure of an Ising model given independent observations on the binary hypercube. This new model, called the Ising blockmodel, is a perturbation of the mean field approximation of the Ising model known as the Curie-Weiss model: the sites are partitioned into two blocks of equal size and the interaction between those of the same block is stronger than across blocks, to account for more order within each block. We study probabilistic, statistical and computational aspects of this model in the high-dimensional case when the number of sites may be much larger than the sample size.

Optimal rates of Statistical Seriation

Nicolas Flammarion, Cheng Mao and Philippe Rigollet (2016)
Journal Bernoulli (to appear)


Given a matrix the seriation problem consists in permuting its rows in such way that all its columns have the same shape, for example, they are monotone increasing. We propose a statistical approach to this problem where the matrix of interest is observed with noise and study the corresponding minimax rate of estimation of the matrices. Specifically, when the columns are either unimodal or monotone, we show that the least squares estimator is optimal up to logarithmic factors and adapts to matrices with a certain natural structure. Finally, we propose a computationally efficient estimator in the monotonic case and study its performance both theoretically and experimentally. Our work is at the intersection of shape constrained estimation and recent work that involves permutation learning, such as graph denoising and ranking.

Online learning in repeated auctions

Jonathan Weed, Vianney Perchet and Philippe Rigollet (2016)
Conference Proceedings of COLT 2016, PMLR 49:1562-1583.


Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.

Optimal rates for total variation denoising

Jan-Christian Hütter and Philippe Rigollet (2016)
Conference Proceedings of COLT 2016, PMLR 49:1115-1146.


Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates of estimation for a large class of image models such as bi-isotonic, H\"older smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid such as eigenvector delocalization and spectral decay. We also present extensions to more than two dimensions as well as several other graphs.

Batched Bandit Problems

Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg (2016)
Journal Ann. Statist., 44(2), 660-681.


Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.

Review of the book "Introduction to High-Dimensional Statistics" by C. Giraud.

Philippe Rigollet (2015)
Other JASA Book Reviews, 110(512), 1821


Invited book review for the Journal of the American Statistical Association.

Comment on "Hypothesis testing by convex optimization"

Philippe Rigollet (2015)
Other Electron. J. Statist., 9(2), 1723-1726


Invited comment on the discussion paper "Hypothesis testing by convex optimization" by Alexander Goldenshluger, Anatoli Juditsky and Arkadi Nemirovski.

Estimation of Functionals of Sparse Covariance Matrices

Jianqing Fan, Philippe Rigollet, and Weichen Wang (2015)
Journal Ann. Statist., 43(6), 2706-2737


High-dimensional statistical tests often ignore correlations to gain simplicity and stability leading to null distributions that depend on functionals of correlation matrices such as their Frobenius norm and other \(\ell_r\) norms. Motivated by the computation of critical values of such tests, we investigate the difficulty of estimation the functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. Our results are further illustrated in simulated data as well as an empirical study of data arising in financial econometrics.

Batched Bandit Problems

Vianney Perchet, Philippe Rigollet , Sylvain Chassang and Erik Snowberg (2015)
Conference Proceedings of COLT 2015, PMLR 40:1456-1456.


Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches gives already close to minimax optimal regret bounds and we also evaluate the number of trials in each batch. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.

Aggregation of Affine Estimators

Dong Dai, Philippe Rigollet, Lucy Xia, and Tong Zhang (2014)
Journal Electon. J. Stat., 8, 302-327.


We consider the problem of aggregating a general collection of affine estimators for fixed design regression. Relevant examples include some commonly used statistical estimators such as least squares, ridge and robust least squares estimators. Dalalyan and Salmon (2012) have established that, for this problem, exponentially weighted (EW) model selection aggregation leads to sharp oracle inequalities in expectation, but similar bounds in deviation were not previously known. While results (Dai, Rigollet, Zhang, 2012) indicate that the same aggregation scheme may not satisfy sharp oracle inequalities with high probability, we prove that a weaker notion of oracle inequality for EW that holds with high probability. Moreover, using a generalization of the newly introduced \(Q\)-aggregation scheme we also prove sharp oracle inequalities that hold with high probability. Finally, we apply our results to universal aggregation and show that our proposed estimator leads simultaneously to all the best known bounds for aggregation, including \(\ell_q\)-aggregation, \(q \in (0,1)\), with high probability.

Optimal learning with Q-aggregation

Guillaume Lecue and Philippe Rigollet (2014)
Journal Ann. Statist., 42(1), 211-224.


We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary functions (learners) together with the prior, we generalize the results obtained by Dai, Rigollet and Zhang (2012) for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the Q-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques somewhat depart from traditional proofs by making most of the standard arguments on the Laplace transform of the empirical process to be controlled.

Complexity Theoretic Lower Bounds for Sparse Principal Component Detection

Quentin Berthet and Philippe Rigollet (2013)
ConferenceProceedings of COLT 2013, PMLR 30:1046-1066. Best Paper Award


In the context of sparse principal component detection, we bring evidence towards the existence of a statistical price to pay for computational efficiency. We measure the performance of a test by the smallest signal strength that it can detect and we propose a computationally efficient method based on semidefinite programming. We also prove that the statistical performance of this test cannot be strictly improved by any computationally efficient method. Our results can be viewed as complexity theoretic lower bounds conditionally on the assumptions that some instances of the planted clique problem cannot be solved in randomized polynomial time.

Bounded regret in stochastic multi-armed bandits

Sebastien Bubeck, Vianney Perchet and Philippe Rigollet (2013)
ConferenceProceedings of COLT 2013, PMLR 30:122-134.


We study the stochastic multi-armed bandit problem when one knows the value \(\mu^{(\star)}\) of an optimal arm, as a well as a positive lower bound on the smallest positive gap \(\Delta\). We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows \(\Delta\), and bounded regret of order \(1/\Delta\) is not possible if one only knows \(\mu^{(\star)}\).

Optimal detection of sparse principal components in high dimension

Quentin Berthet and Philippe Rigollet (2013)
JournalAnn. Statist., 41(1), 1780-1815.


We perform a finite sample analysis of the detection levels for sparse principal components of a high-dimensional covariance matrix. Our minimax optimal test is based on a sparse eigenvalue statistic. Alas, computing this test is known to be NP-complete in general, and we describe a computationally efficient alternative test using convex relaxations. Our relaxation is also proved to detect sparse principal components at near optimal detection levels, and it performs well on simulated datasets. Moreover, using polynomial time reductions from theoretical computer science, we bring significant evidence that our results cannot be improved, thus revealing an inherent trade off between statistical and computational performance.

The multi-armed bandit problem with covariates

Vianney Perchet and Philippe Rigollet (2013)
JournalAnn. Statist., 41(2), 693-721.


We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably localized static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.

Sparse estimation by exponential weighting

Philippe Rigollet and Alexandre Tsybakov (2012)
JournalStatist. Sci., 27(4), 558-575


Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.

Estimation of Covariance Matrices under Sparsity Constraints

Philippe Rigollet and Alexandre Tsybakov (2012)
JournalStatist. Sinica, 22(4), 1319-1378.


Discussion of ``Minimax Estimation of Large Covariance Matrices under L1-Norm" by Tony Cai and Harrison Zhou.

Deviation Optimal Learning using Greedy Q-aggregation

Dong Dai, Philippe Rigollet and Tong Zhang (2012)
JournalAnn. Statist., 40(3), 1878-1905.


Given a finite family of functions, the goal of model selection is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.

Kullback-Leibler aggregation and misspecified generalized linear models

Philippe Rigollet (2012)
JournalAnn. Statist., 40(2), 639-665.


In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it does not require identifiability of the parameter or even that the model on the systematic component is true. It is shown that this problem can be solved by constrained and/or penalized likelihood maximization and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally all the bounds are proved to be optimal in a minimax sense.

Neyman-Pearson classification, convexity and stochastic constraints

Philippe Rigollet and Xin Tong (2011)
JournalJ. Mach. Learn. Res., 12(Oct):2831-2855.


Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss \(\varphi\). Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its \(\varphi\)-type I error is below a pre-specified level and (ii), it has \(\varphi\)-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error.

Exponential Screening and optimal rates of sparse estimation

Philippe Rigollet and Alexandre Tsybakov (2011)
JournalAnn. Statist., 39(2), 731-771.


In high-dimensional linear regression, the goal pursued here is to estimate an unknown regression function using linear combinations of a suitable set of covariates. One of the key assumptions for the success of any statistical procedure in this setup is to assume that the linear combination is sparse in some sense, for example, that it involves only few covariates. We consider a general, non necessarily linear, regression with Gaussian noise and study a related question that is to find a linear combination of approximating functions, which is at the same time sparse and has small mean squared error (MSE). We introduce a new estimation procedure, called \emph{Exponential Screening} that shows remarkable adaptation properties. It adapts to the linear combination that optimally balances MSE and sparsity, whether the latter is measured in terms of the number of non-zero entries in the combination (\(\ell_0\) norm) or in terms of the global weight of the combination (\(\ell_1\) norm). The power of this adaptation result is illustrated by showing that Exponential Screening solves optimally and simultaneously all the problems of aggregation in Gaussian regression that have been discussed in the literature. Moreover, we show that the performance of the Exponential Screening estimator cannot be improved in a minimax sense, even if the optimal sparsity is known in advance. The theoretical and numerical superiority of Exponential Screening compared to state-of-the-art sparse procedures is also discussed.

Neyman-Pearson classification under a strict constraint

Philippe Rigollet and Xin Tong (2011)
ConferenceProceedings of COLT 2011, PMLR 19:595-614.


Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i), its probability of type~I error is below a pre-specified level and (ii), it has probability of type ~II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical objective subject to an empirical constraint. The novelty of the method is that the classifier output by this problem is shown to satisfy the original constraint on type~I error. This strict enforcement of the constraint has interesting consequences on the control of the type~II error and we develop new techniques to handle this situation. Finally, connections with chance constrained optimization are evident and are investigated.

Optimal rates of sparse estimation and universal aggregation

Philippe Rigollet and Alexandre Tsybakov (2010)
OtherOberwolfach reports, 7(1), 924-927 In: Modern Nonparametric Statistics: Going Beyond Asymptotic Minimax, Mar.-Apr. 2010.

Nonparametric Bandits with Covariates

Philippe Rigollet and Assaf Zeevi (2010)
ConferenceProceedings of COLT 2010. Omnipress, 54-66.


Me consider a bandit problem which involves sequential sampling from two populations (arms). Each arm produces a noisy reward realization which depends on an observable random covariate. The goal is to maximize cumulative expected reward. We derive general lower bounds on the performance of any admissible policy, and develop an algorithm whose performance achieves the order of said lower bound up to logarithmic terms. This is done by decomposing the global problem into suitably ``localized'' bandit problems. Proofs blend ideas from nonparametric statistics and traditional methods used in the bandit literature.

Optimal rates for plug-in estimators of density level sets

Philippe Rigollet and Regis Vert (2009)
JournalBernoulli, 15(4), 1154-1178.


In the context of density level set estimation, we study the convergence of general plug-in methods under two main assumptions on the density for a given level \(\lambda\). More precisely, it is assumed that the density (i) is smooth in a neighborhood of \(\lambda\) and (ii) has \(\gamma\)-exponent at level \(\lambda\). Condition (i) ensures that the density can be estimated at a standard nonparametric rate and condition (ii) is similar to Tsybakov's margin assumption which is stated for the classification framework. Under these assumptions, we derive optimal rates of convergence for plug-in estimators. Explicit convergence rates are given for plug-in estimators based on kernel density estimators when the underlying measure is the Lebesgue measure. Lower bounds proving optimality of the rates in a minimax sense when the density is Holder smooth are also provided.

Learning by mirror averaging

Anatoli Juditsky, Philippe Rigollet and Alexandre Tsybakov (2009)
JournalAnn. Statist., 36(5), 2183-2206.


Given a finite collection of estimators or classifiers, we study the problem of model selection type aggregation, that is, we construct a new estimator or classifier, called aggregate, which is nearly as good as the best among them with respect to a given risk criterion. We define our aggregate by a simple recursive procedure which solves an auxiliary stochastic linear programming problem related to the original nonlinear one and constitutes a special case of the mirror averaging algorithm. We show that the aggregate satisfies sharp oracle inequalities under some general assumptions. The results are applied to several problems including regression, classification and density estimation.

Generalization error bounds in semi-supervised classification under the cluster assumption

Philippe Rigollet (2009)
JournalJ. Mach. Learn. Res., 8(Jul), 1369-1392.


We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples.

Linear and convex aggregation of density estimators

Philippe Rigollet and Alexandre Tsybakov (2007)
Journal Math. Methods of Statist., 15(3), 260-280


We study the problem of finding the best linear and convex combination of M estimators of a density with respect to the mean squared risk. We suggest aggregation procedures and we prove sharp oracle inequalities for their risks, i.e., oracle inequalities with leading constant 1. We also obtain lower bounds showing that these procedures attain optimal rates of aggregation. As an example, we consider aggregation of multivariate kernel density estimators with different bandwidths. We show that linear and convex aggregates mimic the kernel oracles in asymptotically exact sense. We prove that, for Pinsker's kernel, the proposed aggregates are sharp asymptotically minimax simultaneously over a large scale of Sobolev classes of densities. Finally, we provide simulations demonstrating performance of the convex aggregation procedure.

Adaptive density estimation using the blockwise Stein method

Philippe Rigollet (2006)
Journal Bernoulli, 12(2), 351-370


We study the problem of nonparametric estimation of a probability density of unknown smoothness in \(L_2(\mathbb{R})\). Expressing mean integrated squared error (MISE) in the Fourier domain, we show that it is close to mean squared error in the Gaussian sequence model. Then applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality. Two consequences of this oracle inequality are that the proposed estimator is sharp minimax adaptive over a scale of Sobolev classes of densities, and that its MISE is asymptotically smaller than or equal to that of kernel density estimators with any bandwidth provided that the kernel belongs to a large class of functions including many standard kernels.

Mirror averaging, aggregation and model selection

Anatoli Juditsky, Philippe Rigollet and Alexandre Tsybakov (2005)
Other Oberwolfach reports, 2(4), 2688-2691. In: Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.


Short note on aggregation published in Oberwolfach Reports following the Meeting on Statistical and Probabilistic Methods of Model Selection, October 2005.

Inegalites d'oracle pour l'estimation d'une densite de probabilite

Philippe Rigollet (2005)
Journal C. R. Math. Acad. Sci. Paris, 340(1), 59-62


We study the problem of the nonparametric estimation of a probability density in \(L_2(\mathbb{R})\). Expressing the mean integrated squared error in the Fourier domain, we show that it is close to the mean squared error in the Gaussian sequence model. Then, applying a modified version of Stein's blockwise method, we obtain a linear monotone oracle inequality and a kernel oracle inequality. As a consequence, the proposed estimator is sharp minimax adaptive (i.e. up to a constant) on a scale of Sobolev classes of densities.

Currrent Teaching

  • 2018 Spring

    High dimensional Probability & Applications (18.657)

    The goal of this course is to cover in a cohesive and systematic way the tools from high-dimensional probability that have become pervasive to a variety of applications, primarily in data science.

    See the course webpage for more information.

Teaching History (Selected Courses)

  • 2017 Fall

    Fundamentals of Statistics (18.650)

    This class offers an in-depth introduction to the theoretical foundations of statistical methods that are useful in many applications. The goal is to understand the role of mathematics in the research and development of efficient statistical methods.

  • 2017 Spring

    High-dimensional Statistics (18.657)

    This course offers an introduction to the finite sample analysis of high-dimensional statistical methods. The goal is to present various proof techniques for state-of-the-art methods in regression, matrix estimation and principal component analysis (PCA) as well as optimality guarantees. The course ends with research questions that are currently open.

    Topics covered include: 1. Sub-Gaussian concentration, 2. High-dimensional linear regression, 3. Misspecified linear models, 4. Matrix estimation, 5. Minimax lower bounds
    This course assumes introductory knowledge of statistics and probability.

    Download the updated lecture notes here

  • 2016 Fall

    Statistics for Applications (18.650)

    This class offers an in-depth introduction to the theoretical foundations of statistical methods that are useful in many applications. The goal is to understand the role of mathematics in the research and development of efficient statistical methods.

  • 2015 Fall

    Mathematics of Machine Learning

    Broadly speaking, Machine Learning refers to the automated identification of patterns in data. As such it has been a fertile ground for new statistical and algorithmic developments. The purpose of this course is to provide a mathematically rigorous introduction to these developments with emphasis on methods and their analysis. The class will be split in three main parts: (1) The statistical theory of machine learning, (2) Algorithms and convexity and (3) Online learning.

    The course website is accesible via this link.

  • 2015 Spring

    High-dimensional statistics (MIT)

    This course offers an introduction to the finite sample analysis of high-dimensional statistical methods. The goal is to present various proof techniques for state-of-the-art methods in regression, matrix estimation and principal component analysis (PCA) as well as optimality guarantees. The course ends with research questions that are currently open.

    Topics covered include: 1. Sub-Gaussian concentration, 2. High-dimensional regression, 3. Matrix estimation, 4. Sparse PCA, 5. Minimax lower bounds, 6. Computational limitations.
    This course assumes introductory knowledge of statistics, probability and computational complexity.

    Download the lecture notes here

  • 2013 2008

    Fundamentals of Statistics (Princeton University)

    A first introduction to probability and statistics. This course provides the background to understand and produce rigorous statistical analysis including estimation, confidence intervals, hypothesis testing and regression. Applicability and limitations of these methods are illustrated in the light of modern data sets and manipulation of the statistical software R. Precepts are based on real data analysis.

  • 2014 2009

    Mathematical Statistics (Princeton University)

    A graduate-level introduction to statistical theory and methods and some of the most important and commonly-used principles of statistical inference. Covers the statistical theory and methods for point estimation, confidence intervals, and hypothesis testing.

  • 2014 2009

    Statistical Learning and Nonparametric Estimation (Princeton University)

    A theoretical introduction to statistical learning problems related to regression and classification. Topics covered include: Principle Component Analysis, nonparametric estimation, sparse regression, and Classification and Statistical learning,

Associate Editor

  • SIAM Journal on Mathematics of Data Science (2018-present)
  • Mathematical Statistics and Learning (2018-present)
  • Electronic Journal of Statistics (2016-present)
  • Bernoulli (2013-present)
  • Statistical Inference for Stochastic Processes (2015-16)
  • Journal of Statistical Planning and Inference (2012-15)

Conferences Program Committee

  • Conference on Learning Theory, COLT (2012-present)

At My Office

2-279 in the Department of Mathematics

E17-467 in the Institute for Data Systems and Society


The best way to reach me is to send me an email at

photo courtesy of Peter Vanderwarker