Nonlinear embedding methods in modern dimension reduction: Theory and practice
Learning and representing low-dimensional structures from noisy, possibly high-dimensional data is an indispensable component of modern data science. Recently, a special class of nonlinear embedding methods has become particularly influential, most notably the t-distributed stochastic neighbor embedding (t-SNE) and the uniform manifold approximation and projection (UMAP). Despite their empirical success across many research fields, these algorithms are often criticized for their lack of theoretical grounding, unclear interpretation, and sensitivity to tuning parameters. In this talk, we start by presenting a novel theoretical framework for understanding and explaining the exceptional performance of t-SNE and related algorithms for visualizing high-dimensional clustered data. Our theory uncovers the intrinsic mechanism, the large-sample limits, and several fundamental principles behind these algorithms; it also carries practical implications for real-world applications, such as enabling efficient selection of tuning parameters, establishing sound analytical practice, and avoiding common interpretive pitfalls. Recognizing the limitations of current nonlinear embedding methods, we then introduce new approaches and ideas that may lead to more accountable and reliable dimension reduction and visualization.
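To make the role of the main t-SNE tuning parameter concrete, the following is a minimal NumPy sketch (not the full algorithm, and not the speakers' method) of the affinity step in which the perplexity parameter enters: each point is assigned a Gaussian kernel bandwidth, found by binary search, so that the entropy of its conditional neighbor distribution equals log(perplexity). The function name `calibrate_bandwidths` and all constants are illustrative choices, not part of the talk.

```python
import numpy as np

def calibrate_bandwidths(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Per-point binary search for Gaussian kernel precisions so that each
    conditional distribution p_{j|i} has entropy log(perplexity) -- a
    minimal sketch of the affinity-calibration step used in t-SNE."""
    n = X.shape[0]
    # Squared pairwise Euclidean distances.
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    target = np.log(perplexity)
    P = np.zeros((n, n))
    for i in range(n):
        lo, hi = 0.0, np.inf
        beta = 1.0  # precision, i.e. 1 / (2 * sigma_i**2)
        d = np.delete(D[i], i)  # distances to all other points
        for _ in range(max_iter):
            w = np.exp(-d * beta)
            p = w / w.sum()
            # Shannon entropy (in nats) of the conditional distribution.
            H = -np.sum(p * np.log(np.maximum(p, 1e-300)))
            if abs(H - target) < tol:
                break
            if H > target:  # distribution too spread out: sharpen kernel
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
            else:           # distribution too concentrated: widen kernel
                hi = beta
                beta = (beta + lo) / 2.0
        P[i, np.arange(n) != i] = p
    return P

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
P = calibrate_bandwidths(X, perplexity=15.0)
```

In effect, perplexity acts as a smooth proxy for the number of neighbors each point "sees", which is why its choice strongly shapes the resulting embedding.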