A Classifier Learning Method Based on Tree-Augmented Naïve Bayes
doi: 10.11999/JEIT180886
1. School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China
2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha 410114, China
Abstract: The Tree-Augmented Naïve Bayes (TAN) structure forces every attribute node to take the class node and exactly one other attribute node as parents, and it ignores the differing degrees of correlation between individual attributes and the class, which degrades classification accuracy. To improve the classification accuracy of TAN, this paper first extends the TAN structure so that an attribute node may have no parent, or at most one attribute node as parent. It then proposes a learning method that builds a tree-like Bayesian classification model with a decomposable scoring function: low-order Conditional Independence (CI) tests first remove irrelevant attributes, and a greedy search guided by an improved Bayesian Information Criterion (BIC) scoring function then selects the parent of each attribute node, yielding the classification model. Compared with Naïve Bayes (NB) and TAN, the resulting classifier performs better on multiple classification metrics, which demonstrates the advantages of this learning method.

Key words:
- Bayesian classifier /
- Tree-Augmented Naïve Bayes (TAN) /
- Scoring function
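The abstract refers to an improved BIC scoring function, whose formula is not reproduced in this excerpt. As context, the standard decomposable BIC score that such variants build on assigns each node its conditional log-likelihood minus a $(\log N)/2$ complexity penalty. A minimal sketch for one discrete node follows; the list-of-dicts data layout and the function name are illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter

def bic_node_score(data, child, parents):
    """Standard BIC contribution of one discrete node:
    sum over counts of N(pa, x) * log P(x | pa), minus
    (log N / 2) times the number of free parameters."""
    n = len(data)
    # joint counts of (parent configuration, child value), and parent-config counts
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_cfg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / parent_cfg[pa]) for (pa, _), c in joint.items())
    child_card = len({row[child] for row in data})
    n_params = (child_card - 1) * max(len(parent_cfg), 1)
    return loglik - 0.5 * math.log(n) * n_params
```

Because the score decomposes over nodes, a search procedure only needs to re-score the one node whose parent set changed.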
Table 1  Algorithm description

Input: variable set V, sample data set D
Output: SETAN structure
Step 1  For each $X_i \in V$: $I(C,X_i) = {\rm{Calc\_MI}}(C,X_i)$  # compute the mutual information between each attribute and the class
Step 2  Store each $I(C,X_i)$ in a list, sorted in descending order
Step 3  For each $I(C,X_i) > \varepsilon$ in List
            $S_1 = S_1 \cup \{ X_i\}$
            Add path $C - X_i$ to graph $E$  # if no edge exists yet, add an edge from class C to attribute $X_i$
            $S_2 = S_2 \cup \{ X_j\}$, $X_j \in \{ I(C,X_j) < \varepsilon \}$
            Add path $X_i - X_j$ to graph $E$  # nodes whose mutual information is below the threshold $\varepsilon$ are added to set $S_2$
            Remove $I(C,X_i)$ from List
        End for
Step 4  For each $E' \in E$
            ${\rm{Score}}(E') = {\rm{Calc\_BIC}}(E')$  # compute the improved BIC score
            K2-Search of the optimal SETAN structure  # search for the optimal structure using the scoring function
        End for
Step 5  Return $G = (V',E')$ with the best BIC score
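The outline in Table 1 can be sketched in Python. This is a loose reconstruction under stated assumptions, not the authors' implementation: the names `mutual_information` and `learn_structure`, the list-of-dicts data layout, and the pluggable `bic` callback are all hypothetical, and the K2-style search is reduced to a single greedy pass over higher-ranked attributes.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def learn_structure(data, attrs, cls, eps, bic):
    """Sketch of Table 1: rank attributes by I(C, X_i), drop those
    below the threshold eps, then greedily give each surviving
    attribute at most one attribute parent, keeping a candidate
    only when the supplied BIC callback improves the score."""
    mi = {a: mutual_information([r[a] for r in data], [r[cls] for r in data])
          for a in attrs}
    survivors = sorted((a for a in attrs if mi[a] > eps), key=mi.get, reverse=True)
    parents = {a: [cls] for a in survivors}  # every kept node gets the class parent
    for i, a in enumerate(survivors):
        base = bic(data, a, parents[a])
        best = None
        for b in survivors[:i]:  # candidate parents: attributes ranked above a
            score = bic(data, a, parents[a] + [b])
            if score > base:
                base, best = score, b
        if best is not None:
            parents[a].append(best)
    return parents
```

With this shape, an attribute whose mutual information with the class falls below `eps` never enters the model, and each surviving attribute ends up with the class parent plus zero or one attribute parent, matching the extended structure described in the abstract.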
Table 2  Results of the network structure learning experiments

$\xi$   | Asia network (A / D / R) | Alarm network (A / D / R)
0.01    | 0 / 3 / 5                | 0 / 20 / 25
0.001   | 1 / 1 / 7                | 3 / 2 / 45
0.0001  | 3 / 0 / 8                | 15 / 3 / 45
Table 3  Experimental data set information

Dataset  | Samples | Class distribution   | Attributes | Classes | Missing values
Balance  | 625     | 49/288/288           | 4          | 3       | No
Car      | 1728    | 1210/384/69/65       | 6          | 4       | No
Connect  | 67558   | 44473/16635/6449     | 42         | 3       | No
Mushroom | 8124    | 4208/3916            | 22         | 2       | Yes
Nursery  | 12960   | 4320/2/328/4266/4044 | 8          | 5       | No
SPECT    | 80      | 40/40                | 22         | 2       | No
Cancer   | 286     | 85/201               | 9          | 2       | No
Votes    | 435     | 168/267              | 16         | 2       | Yes
Table 4  Threshold $\varepsilon$ information

Dataset | Threshold $\varepsilon$ | Average accuracy
Balance | 0.01/0.05/0.10          | 0.915/0.914/0.910
Connect | 0.01/0.05/0.10          | 0.767/0.764/0.760
SPECT   | 0.01/0.05/0.10          | 0.740/0.738/0.733
Cancer  | 0.01/0.05/0.10          | 0.710/0.710/0.698
Table 5  Comparison of classification metrics for NB, TAN, and SETAN

Dataset  | Algorithm | Accuracy | F1    | Recall | Precision | AUC
Balance  | NB        | 0.914    | 0.876 | 0.914  | 0.842     | 0.961
         | TAN       | 0.861    | 0.834 | 0.861  | 0.836     | 0.904
         | SETAN     | 0.914    | 0.876 | 0.914  | 0.842     | 0.962
Car      | NB        | 0.857    | 0.849 | 0.857  | 0.854     | 0.976
         | TAN       | 0.908    | 0.911 | 0.908  | 0.920     | 0.983
         | SETAN     | 0.946    | 0.947 | 0.946  | 0.947     | 0.988
Connect  | NB        | 0.721    | 0.681 | 0.721  | 0.681     | 0.807
         | TAN       | 0.763    | 0.722 | 0.763  | 0.731     | 0.864
         | SETAN     | 0.764    | 0.724 | 0.764  | 0.735     | 0.866
Mushroom | NB        | 0.958    | 0.958 | 0.958  | 0.960     | 0.998
         | TAN       | 0.999    | 1.000 | 0.999  | 1.000     | 1.000
         | SETAN     | 1.000    | 1.000 | 1.000  | 1.000     | 1.000
Nursery  | NB        | 0.903    | 0.894 | 0.903  | 0.906     | 0.982
         | TAN       | 0.928    | 0.920 | 0.928  | 0.929     | 0.991
         | SETAN     | 0.937    | 0.927 | 0.937  | 0.937     | 0.993
SPECT    | NB        | 0.738    | 0.735 | 0.738  | 0.745     | 0.802
         | TAN       | 0.713    | 0.709 | 0.713  | 0.724     | 0.668
         | SETAN     | 0.738    | 0.736 | 0.738  | 0.741     | 0.755
Cancer   | NB        | 0.734    | 0.727 | 0.734  | 0.723     | 0.702
         | TAN       | 0.706    | 0.692 | 0.706  | 0.687     | 0.667
         | SETAN     | 0.710    | 0.700 | 0.710  | 0.695     | 0.624
Votes    | NB        | 0.901    | 0.902 | 0.901  | 0.905     | 0.973
         | TAN       | 0.940    | 0.940 | 0.940  | 0.941     | 0.986
         | SETAN     | 0.949    | 0.950 | 0.949  | 0.950     | 0.985