ScreeNOT: Optimal singular value thresholding and principal component selection in correlated noise

Date

Tue June 6th 2023, 4:30pm

Location

Sloan 380C

Speaker

Elad Romanov, Stanford Statistics

Principal Component Analysis (PCA) is a fundamental and ubiquitous tool in statistics and data analysis. The bare-bones idea is this. Given a data set of n points y_1, ..., y_n, form their sample covariance S. Eigenvectors corresponding to large eigenvalues --- namely directions along which the variation within the data set is large --- are usually thought of as "important" or "signal-bearing"; in contrast, weak directions are often interpreted as "noise", and discarded along the proceeding steps of the data analysis pipeline. Principal component (PC) selection is an important methodological question: how large should an eigenvalue be so as to be considered "informative"?

Our main deliverable is ScreeNOT: a novel, mathematically-grounded procedure for PC selection. It is intended as a fully algorithmic replacement for the heuristic and somewhat vaguely-defined procedures that practitioners often use--for example, the popular "scree test".

Towards tackling PC selection systematically, we model the data matrix as a low-rank signal plus noise matrix Y = X + Z; accordingly, PC selection is cast as an estimation problem for the unknown low-rank signal matrix X, with the class of permissible estimators being singular value thresholding rules. We consider a formulation of the problem under the spiked model. This asymptotic setting captures some important qualitative features observed across numerous real-world data sets: most of the singular values of Y are arranged neatly in a "bulk", with very few large outlying singular values exceeding the bulk edge. We propose an adaptive algorithm that, given a data matrix, finds the optimal truncation threshold in a data-driven manner under essentially arbitrary noise conditions: we only require that Z has a compactly supported limiting spectral distribution--which may be a priori unknown. Under the spiked model, our algorithm is shown to have rather strong oracle optimality properties: not only does it attain the best error asymptotically, but it also achieves (w.h.p.) the best error --- compared to all alternative thresholds --- at finite n.

This is joint work with Matan Gavish (Hebrew University of Jerusalem) and David Donoho (Stanford).