基于快速地標(biāo)采樣的大規(guī)模譜聚類算法
doi: 10.11999/JEIT160260
基金項(xiàng)目:
國(guó)家973計(jì)劃(2012CB315905), 國(guó)家自然科學(xué)基金(61502527, 61379150)
Large Scale Spectral Clustering Based on Fast Landmark Sampling
Funds:
The National 973 Program of China (2012CB315905), The National Natural Science Foundation of China (61502527, 61379150)
-
摘要: 為避免傳統(tǒng)譜聚類算法高復(fù)雜度的應(yīng)用局限,基于地標(biāo)表示的譜聚類算法利用地標(biāo)點(diǎn)與數(shù)據(jù)集各點(diǎn)間的相似度矩陣,有效降低了譜嵌入的計(jì)算復(fù)雜度。在大數(shù)據(jù)集情況下,現(xiàn)有的隨機(jī)抽取地標(biāo)點(diǎn)的方法會(huì)影響聚類結(jié)果的穩(wěn)定性,k均值中心點(diǎn)方法面臨收斂時(shí)間未知、反復(fù)讀取數(shù)據(jù)的問(wèn)題。該文將近似奇異值分解應(yīng)用于基于地標(biāo)點(diǎn)的譜聚類,設(shè)計(jì)了一種快速地標(biāo)點(diǎn)采樣算法。該算法利用由近似奇異向量矩陣行向量的長(zhǎng)度計(jì)算的抽樣概率來(lái)進(jìn)行抽樣,同隨機(jī)抽樣策略相比,保證了聚類結(jié)果的穩(wěn)定性和精度,同k均值中心點(diǎn)策略相比降低了算法復(fù)雜度。同時(shí)從理論上分析了抽樣結(jié)果對(duì)原始數(shù)據(jù)的信息保持性,并對(duì)算法的性能進(jìn)行了實(shí)驗(yàn)驗(yàn)證。
-
關(guān)鍵詞:
- 地標(biāo)點(diǎn)采樣 /
- 大數(shù)據(jù) /
- 譜聚類 /
- 近似奇異值分解
Abstract: The applicability of traditional spectral clustering is limited by its high complexity in large-scale data sets. Through construction of affinity matrix between landmark points and data points, the Landmark-based Spectral Clustering (LSC) algorithm can significantly reduce the computational complexity of spectral embedding. It is vital for clustering results to apply the suitable strategies of the generation of landmark points. While considering big data problems, the existing generation strategies of landmark points face some deficiencies: the unstable results of random sampling, along with the unknown convergence time and the repeatability of data reading in k-means centers method. In this paper, a rapid landmark-sampling spectral clustering algorithm based on the approximate singular value decomposition is designed, which makes the sampling probability of each landmark point decided by the row norm of the approximate singular vector matrix. Compared with LSC algorithm based on random sampling, the clustering result of new algorithm is more stable and accurate; compared with LSC algorithm based on k-means centers, the new algorithm reduces the computational complexity. Moreover, the preservation of information in original data is analyzed for the landmark-sampling results theoretically. At the same time, the performance of new approach is verified by the experiments in some public data sets. -
何清, 李寧, 羅文娟, 等. 大數(shù)據(jù)下的機(jī)器學(xué)習(xí)算法綜述[J]. 模式識(shí)別與人工智能, 2014, 27(4): 327-336. HE Qing, LI Ning, LUO Wenjuan, et al. A survey of machine learning algorithms for big data[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(4): 327-336. DING S, JIA H, ZHANG L, et al. Research of semi-supervised spectral clustering algorithm based on pairwise constraints[J]. Neural Computing and Applications, 2014, 24(1): 211-219. doi: 10.1007/s00521-012-1207-8. NG A Y, JORDAN M I, and WEISS Y. On spectral clustering: Analysis and an algorithm[C]. Neural Information Processing Systems: Natural and Synthetic, Vancouver, Canada, 2001: 849-856. FOWLKES C, BELONGIE S, CHUNG F, et al. Spectral grouping using the Nystrom method[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(2): 214-225. doi: 10.1109/TPAMI.2004.1262185. LI M, KWOK J T, and LU B L. Making large-scale Nystrm approximation possible[C]. Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010: 631-638. LI M, BI W, KWORK J T, et al. Large-scale Nystrm kernel matrix approximation using randomized SVD[J]. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(1): 152-164. doi: 10.1109/TNNLS.2014.2359798. 丁世飛, 賈洪杰, 史忠植. 基于自適應(yīng)Nystrm 采樣的大數(shù)據(jù)譜聚類算法[J]. 軟件學(xué)報(bào), 2014, 25(9): 2037-2049. doi: 10.13328/j.cnki.jos.004643. DING Shifei, JIA Hongjie, and SHI Zhongzhi. Spectral clustering algorithm based on adaptive Nystrm sampling for big data analysis[J]. Journal of Software, 2014, 25(9): 2037-2049. doi: 10.13328/j.cnki.jos.004643. YAN D, HUANG L, and JORDAN M I. Fast approximate spectral clustering[C]. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009: 907-916. doi: 10.1145/1557019.1557118. CHEN X and CAI D. Large scale spectral clustering with landmark-based representation[C]. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 2011: 313-318. CAI D and CHEN X. Large scale spectral clustering via landmark-based sparse representation[J]. IEEE Transactions on Cybernetics, 2015, 45(8): 1669-1680. doi: 10.1109/TCYB. 2014.2358564. BOUTSIDIS C, ZOUZIAS A, MAHONEY M W, et al. Randomized dimensionality reduction for-means clustering[J]. IEEE Transactions on Information Theory, 2015, 61(2): 1045-1062. doi: 10.1109/TIT.2014.2375327. COHEN M, ELDER S, MUSCO C, et al. Dimensionality reduction for k-means clustering and low rank approximation[C]. Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, Portland, OR, USA, 2015: 163-172. doi: 10.1145/2746539.2746569. KHOA N L D and CHAWLA S. A scalable approach to spectral clustering with SDD solvers[J]. Journal of Intelligent Information Systems, 2015, 44(2): 289-308. doi: 10.1007/ s10844-013-0285-0. FRIEZE A, KANNAN R, and VEMPALA S. Fast Monte-Carlo algorithms for finding low-rank approximations[C]. Proceedings of the 39th Annual Symposium on Foundations of Computer Science, Palo Alto, California, USA, 1998: 370-378. doi: 10.1109/SFCS. 1998.743487. DRINEAS P, MAHONEY M W, and MUTHUKRISHNAN S. Sampling algorithms for l2 regression and applications[C]. Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, Miami, Florida, USA, 2006: 1127-1136. DRINES P, MAHONEY M W, and MUTHUKRISHNAN S. Subspace sampling and relative-error matrix approximation: Column-based methods[C]. 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems and 10th International Workshop on Randomization and Computation, Barcelona, Spain, 2006: 316-326. doi: 10.1007/11830924_30. BOUTSIDIS C, DRINEAS P, and MAGDON-ISMAIL M. Near-optimal column-based matrix reconstruction [J]. SIAM Journal on Computing, 2014, 43(2): 687-717. doi: 10.1137/12086755X. HALKO N, MARTINSSON P G, and TROPP J A. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions[J]. SIAM Review, 2011, 53(2): 217-288. doi: 10.1137/090771806. SARLOIS T. Improved approximation algorithms for large matrices via random projections[C]. Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, Berkeley, California, USA, 2006: 143-152. doi: 10.1109/FOCS.2006.37. CHEN W Y, SONG Y, BAI H, et al. Parallel spectral clustering in distributed systems[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(3): 568-586. doi: 10.1109/TPAMI.2010.88. AFAHAD A, ALSHATRI N, TARI Z, et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis[J]. IEEE Transactions on Emerging Topics in Computing, 2014, 2(3): 267-279. doi: 10.1109/TETC. 2014.2330519. STREHL A and GHOSH J. Cluster ensemblesA knowledge reuse framework for combining multiple partitions[J]. The Journal of Machine Learning Research, 2003, 3: 583-617. doi: 10.1162/153244303321897735. -
計(jì)量
- 文章訪問(wèn)數(shù): 1837
- HTML全文瀏覽量: 204
- PDF下載量: 548
- 被引次數(shù): 0