Paper title
A New Basis for Sparse Principal Component Analysis
Authors
Abstract
Previous versions of sparse principal component analysis (PCA) have presumed that the eigen-basis (a $p \times k$ matrix) is approximately sparse. We propose a method that presumes the $p \times k$ matrix becomes approximately sparse after a $k \times k$ rotation. The simplest version of the algorithm initializes with the leading $k$ principal components. Then, the principal components are rotated with a $k \times k$ orthogonal rotation to make them approximately sparse. Finally, soft-thresholding is applied to the rotated principal components. This approach differs from prior approaches because it uses an orthogonal rotation to approximate a sparse basis. One consequence is that a sparse component need not be a leading eigenvector, but rather a mixture of them. In this way, we propose a new (rotated) basis for sparse PCA. In addition, our approach avoids "deflation" and the multiple tuning parameters that deflation requires. Our sparse PCA framework is versatile; for example, it extends naturally to a two-way analysis of a data matrix for simultaneous dimensionality reduction of rows and columns. We provide evidence showing that, for the same level of sparsity, the proposed sparse PCA method is more stable and explains more variance than alternative methods. Through three applications -- sparse coding of images, analysis of transcriptome sequencing data, and large-scale clustering of social networks -- we demonstrate the modern usefulness of sparse PCA in exploring multivariate data.
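The three steps of the simplest version (leading $k$ principal components, a $k \times k$ orthogonal rotation, then soft-thresholding) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say which rotation criterion is used, so varimax is assumed here as one common choice, and the function names (`varimax`, `soft_threshold`, `rotated_sparse_pca`) and the threshold value are hypothetical.

```python
import numpy as np


def varimax(V, max_iter=100, tol=1e-8):
    """Return a k x k orthogonal matrix R that approximately maximizes
    the varimax criterion of V @ R (ASSUMED rotation criterion)."""
    p, k = V.shape
    R = np.eye(k)
    obj = 0.0
    for _ in range(max_iter):
        L = V @ R
        # Gradient of the varimax criterion with respect to R.
        G = V.T @ (L**3 - L * (L**2).mean(axis=0))
        U, s, Wt = np.linalg.svd(G)
        R = U @ Wt  # closest orthogonal matrix to the gradient
        new_obj = s.sum()
        if new_obj - obj <= tol * max(new_obj, 1.0):
            break
        obj = new_obj
    return R


def soft_threshold(A, t):
    """Shrink every entry of A toward zero by t; entries within t of zero become 0."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)


def rotated_sparse_pca(X, k, t):
    """Sketch of the three steps: PCA, orthogonal rotation, soft-thresholding.

    X is an n x p data matrix; returns the p x k thresholded loadings Y
    and the k x k rotation R. The threshold t is a hypothetical tuning value.
    """
    Xc = X - X.mean(axis=0)                      # center the columns
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                 # leading k principal components (p x k)
    R = varimax(V)                               # k x k orthogonal rotation
    Y = soft_threshold(V @ R, t)                 # sparse, rotated components
    return Y, R


# Usage on synthetic data with a block-sparse loading structure (illustrative only).
rng = np.random.default_rng(0)
B = np.kron(np.eye(3), np.ones((1, 4)))          # 3 x 12 block-sparse loadings
X = rng.normal(size=(200, 3)) @ B + 0.1 * rng.normal(size=(200, 12))
Y, R = rotated_sparse_pca(X, k=3, t=0.1)
print("fraction of zero loadings:", (Y == 0).mean())
```

Because the rotation mixes the leading principal components before thresholding, each column of `Y` is a thresholded mixture of eigenvectors rather than a single eigenvector. All $k$ components are computed together, so no deflation step appears in this sketch.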