Knowledge Clustering and Statistics Based on MapReduce
doi: 10.11999/JEIT150247
Funds:
The National Natural Science Foundation of China (61202004, 61472192), The Special Fund for Fast Sharing of Science Papers in the Net Era by CSTD (2013116), The Natural Science Fund of Higher Education of Jiangsu Province (14KJB520014)
Abstract: The massive scale and coarse classification granularity of resources in online literature knowledge bases cause learners to suffer cognitive disorientation and knowledge overload when they retrieve and read papers. This paper proposes a knowledge clustering and statistics mechanism based on MapReduce. First, a MapReduce-based co-occurrence matrix construction algorithm (MR-CoMatrix) is presented. Second, the co-occurrence matrix is combined with a similarity coefficient to build a similarity matrix. Third, the similarity matrix is standardized with Z scores. Finally, the similarity matrix is clustered with Ward's method (the sum-of-squared-deviations method), which yields a tree-shaped dendrogram of knowledge clusters. Building on the clustering result, a MapReduce-based knowledge statistics algorithm (MR-Statistics) is proposed to compute statistics on the knowledge attributes of each cluster. The experimental results show that applying MR-CoMatrix and MR-Statistics to an online literature knowledge base achieves satisfactory clustering accuracy and computational efficiency, and realizes fine-grained knowledge clustering and multi-dimensional statistics while reducing time cost.
Key words:
- Data mining /
- Clustering /
- Knowledge /
- Co-occurrence matrix /
- Statistics /
- MapReduce
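The first three steps named in the abstract (co-occurrence matrix, similarity matrix, Z-score standardization) can be illustrated with a minimal in-memory sketch. It is not the authors' MR-CoMatrix implementation: the map and reduce functions only mimic the MapReduce pattern in a single process, the toy keyword sets in `PAPERS` are hypothetical, and the Ochiai coefficient is an assumed choice, since the abstract does not say which similarity coefficient is used. The Ward's-method clustering step is left out.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

# Hypothetical toy input: one keyword set per paper.
PAPERS = [
    {"clustering", "mapreduce", "data mining"},
    {"clustering", "mapreduce"},
    {"clustering", "data mining"},
    {"statistics", "mapreduce"},
]


def map_cooccurrence(keywords):
    """Map step: emit ((a, b), 1) per keyword pair; (a, a) counts occurrences."""
    terms = sorted(keywords)
    for term in terms:
        yield (term, term), 1
    for a, b in combinations(terms, 2):
        yield (a, b), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def build_cooccurrence(papers):
    """Run the map step over every paper, then reduce all emitted pairs."""
    emitted = (kv for paper in papers for kv in map_cooccurrence(paper))
    return reduce_counts(emitted)


def ochiai_similarity(counts):
    """Assumed coefficient: c_ab / sqrt(c_aa * c_bb), as a full symmetric matrix."""
    terms = sorted({a for a, _ in counts} | {b for _, b in counts})
    sim = {}
    for a in terms:
        for b in terms:
            key = (a, b) if a <= b else (b, a)  # counts are stored with a <= b
            sim[(a, b)] = counts.get(key, 0) / sqrt(counts[(a, a)] * counts[(b, b)])
    return terms, sim


def z_scores(terms, sim):
    """Standardize each column of the similarity matrix to mean 0 and std 1."""
    z = {}
    for b in terms:
        column = [sim[(a, b)] for a in terms]
        mu, sigma = mean(column), pstdev(column)
        for a in terms:
            z[(a, b)] = (sim[(a, b)] - mu) / sigma if sigma else 0.0
    return z


counts = build_cooccurrence(PAPERS)
terms, sim = ochiai_similarity(counts)
z = z_scores(terms, sim)
```

In a real MapReduce job the map function would run on separate input splits and the framework would group the emitted keys before the reduce function sums them; the standardized matrix `z` would then be the input to hierarchical clustering.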