The Clustering Analysis Technology for Information Retrieval
Abstract: Rapid advances in information retrieval (IR) and search engine technology have greatly improved recall, but gains in precision and in the efficiency with which users obtain information remain modest. Document clustering and automatic multi-document keyword extraction can help address this problem. The basic idea is to cluster a subset of the documents returned by a search engine and to extract keywords automatically for each cluster, so that users can judge whether the documents in each cluster are relevant to their retrieval needs. This paper proposes the concepts of document relevancy and cluster relevancy, and computes cluster relevancy from word frequency information together with the word concept computation model of HOWNET; this relevancy serves as the criterion for merging clusters. Simulated information-acquisition experiments show that document retrieval efficiency improves considerably.
Keywords:
- Document clustering; keyword extraction; HOWNET; document relevancy
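The merging idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the HOWNET concept model is not reproduced here, and plain cosine similarity over pooled term frequencies stands in for cluster relevancy. The function names, the `threshold` value, and the `top_k` keyword count are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_documents(docs, threshold=0.3, top_k=3):
    """Greedily merge the most relevant pair of clusters until no pair reaches threshold.

    docs: list of token lists. Returns a list of (member_indices, keywords).
    Assumption: relevancy is term-frequency cosine only (no concept similarity).
    """
    # Each cluster starts as one document: (member indices, pooled term counts).
    clusters = [([i], Counter(tokens)) for i, tokens in enumerate(docs)]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest relevancy score.
        (i, j), score = max(
            (((i, j), cosine(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if score < threshold:
            break
        members = clusters[i][0] + clusters[j][0]
        counts = clusters[i][1] + clusters[j][1]
        # j > i, so deleting j first keeps index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append((members, counts))
    # The most frequent terms of each cluster serve as its keywords.
    return [(sorted(m), [t for t, _ in c.most_common(top_k)]) for m, c in clusters]
```

Calling `cluster_documents` on a handful of tokenized search results returns each cluster's member indices together with its most frequent terms, which a user could scan to decide whether a cluster is worth opening. A concept-based word similarity could replace the exact term matching inside `cosine` to approximate the role the abstract assigns to HOWNET.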