Spectral methods in machine learning and new strategies for very large datasets

13 January 2009

journal article
Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences of the United States of America

Vol. 106 (2), 369-374
https://doi.org/10.1073/pnas.0810600105

Abstract

Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here 2 new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these-based on sampling-leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach-based on sorting-provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.

Keywords

This publication has 9 references indexed in Scilit:

Randomized algorithms for the low-rank approximation of matrices
Proceedings of the National Academy of Sciences of the United States of America, 2007
Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps
Proceedings of the National Academy of Sciences of the United States of America, 2005
Spectral grouping using the nystrom method
Ieee Transactions On Pattern Analysis and Machine Intelligence, 2004
Monte Carlo Statistical Methods
Springer Texts in Statistics, 2004
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation
Neural Computation, 2003
Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data
Proceedings of the National Academy of Sciences of the United States of America, 2003
A Global Geometric Framework for Nonlinear Dimensionality Reduction
Science, 2000
Normalized cuts and image segmentation
Ieee Transactions On Pattern Analysis and Machine Intelligence, 2000
An identity for the Schur complement of a matrix
Proceedings of the American Mathematical Society, 1969

Cited by 77 articles