A Classifier Learning Method Based on Tree-Augmented Naïve Bayes
doi: 10.11999/JEIT180886
1. School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410114, China
2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha 410114, China
Abstract: The Tree-Augmented Naïve Bayes (TAN) structure forces every attribute node to take the class node and exactly one other attribute node as parents, and it ignores the differing degrees of correlation between individual attributes and the class, which degrades classification accuracy. To improve the classification accuracy of TAN, this paper first extends the TAN structure so that an attribute node may have no parent, or at most one attribute node as parent. It then proposes a learning method that builds a tree-like Bayesian classification model with a decomposable scoring function: low-order Conditional Independence (CI) tests first remove irrelevant attributes, and a greedy search guided by an improved Bayesian Information Criterion (BIC) scoring function then selects the parent of each attribute node, yielding the classification model. Compared with Naïve Bayes (NB) and TAN, the resulting classifier performs better on multiple classification metrics, which demonstrates the advantages of this learning method.

Key words:
- Bayesian classifier /
- Tree-Augmented Naïve Bayes (TAN) /
- Scoring function
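The abstract refers to an improved BIC scoring function, whose formula is not reproduced in this excerpt. As context, the standard decomposable BIC score that such variants build on assigns each node its conditional log-likelihood minus a $(\log N)/2$ complexity penalty. A minimal sketch for one discrete node follows; the list-of-dicts data layout and the function name are illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter

def bic_node_score(data, child, parents):
    """Standard BIC contribution of one discrete node:
    sum over counts of N(pa, x) * log P(x | pa), minus
    (log N / 2) times the number of free parameters."""
    n = len(data)
    # joint counts of (parent configuration, child value), and parent-config counts
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_cfg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / parent_cfg[pa]) for (pa, _), c in joint.items())
    child_card = len({row[child] for row in data})
    n_params = (child_card - 1) * max(len(parent_cfg), 1)
    return loglik - 0.5 * math.log(n) * n_params
```

Because the score decomposes over nodes, a search procedure only needs to re-score the one node whose parent set changed.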
Table 1  Algorithm description

Input: variable set V, sample data set D
Output: SETAN structure
Step 1  For each $X_i \in V$: $I(C,X_i) = {\rm{Calc\_MI}}(C,X_i)$  # compute the mutual information between each attribute and the class
Step 2  Store each $I(C,X_i)$ in a list, sorted in descending order
Step 3  For each $I(C,X_i) > \varepsilon$ in List
            $S_1 = S_1 \cup \{ X_i\}$
            Add path $C - X_i$ to graph $E$  # if no edge exists yet, add an edge from class C to attribute $X_i$
            $S_2 = S_2 \cup \{ X_j\}$, $X_j \in \{ I(C,X_j) < \varepsilon \}$
            Add path $X_i - X_j$ to graph $E$  # nodes whose mutual information is below the threshold $\varepsilon$ are added to set $S_2$
            Remove $I(C,X_i)$ from List
        End for
Step 4  For each $E' \in E$
            ${\rm{Score}}(E') = {\rm{Calc\_BIC}}(E')$  # compute the improved BIC score
            K2-Search of the optimal SETAN structure  # search for the optimal structure using the scoring function
        End for
Step 5  Return $G = (V',E')$ with the best BIC score
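The outline in Table 1 can be sketched in Python. This is a loose reconstruction under stated assumptions, not the authors' implementation: the names `mutual_information` and `learn_structure`, the list-of-dicts data layout, and the pluggable `bic` callback are all hypothetical, and the K2-style search is reduced to a single greedy pass over higher-ranked attributes.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def learn_structure(data, attrs, cls, eps, bic):
    """Sketch of Table 1: rank attributes by I(C, X_i), drop those
    below the threshold eps, then greedily give each surviving
    attribute at most one attribute parent, keeping a candidate
    only when the supplied BIC callback improves the score."""
    mi = {a: mutual_information([r[a] for r in data], [r[cls] for r in data])
          for a in attrs}
    survivors = sorted((a for a in attrs if mi[a] > eps), key=mi.get, reverse=True)
    parents = {a: [cls] for a in survivors}  # every kept node gets the class parent
    for i, a in enumerate(survivors):
        base = bic(data, a, parents[a])
        best = None
        for b in survivors[:i]:  # candidate parents: attributes ranked above a
            score = bic(data, a, parents[a] + [b])
            if score > base:
                base, best = score, b
        if best is not None:
            parents[a].append(best)
    return parents
```

With this shape, an attribute whose mutual information with the class falls below `eps` never enters the model, and each surviving attribute ends up with the class parent plus zero or one attribute parent, matching the extended structure described in the abstract.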
Table 2  Results of the network structure learning experiments

$\xi$   | Asia network (A / D / R) | Alarm network (A / D / R)
0.01    | 0 / 3 / 5                | 0 / 20 / 25
0.001   | 1 / 1 / 7                | 3 / 2 / 45
0.0001  | 3 / 0 / 8                | 15 / 3 / 45
Table 3  Experimental data set information

Dataset  | Samples | Class distribution   | Attributes | Classes | Missing values
Balance  | 625     | 49/288/288           | 4          | 3       | No
Car      | 1728    | 1210/384/69/65       | 6          | 4       | No
Connect  | 67558   | 44473/16635/6449     | 42         | 3       | No
Mushroom | 8124    | 4208/3916            | 22         | 2       | Yes
Nursery  | 12960   | 4320/2/328/4266/4044 | 8          | 5       | No
SPECT    | 80      | 40/40                | 22         | 2       | No
Cancer   | 286     | 85/201               | 9          | 2       | No
Votes    | 435     | 168/267              | 16         | 2       | Yes
Table 4  Threshold $\varepsilon$ information

Dataset | Threshold $\varepsilon$ | Average accuracy
Balance | 0.01/0.05/0.10          | 0.915/0.914/0.910
Connect | 0.01/0.05/0.10          | 0.767/0.764/0.760
SPECT   | 0.01/0.05/0.10          | 0.740/0.738/0.733
Cancer  | 0.01/0.05/0.10          | 0.710/0.710/0.698
Table 5  Comparison of classification metrics for NB, TAN, and SETAN

Dataset  | Algorithm | Accuracy | F1    | Recall | Precision | AUC
Balance  | NB        | 0.914    | 0.876 | 0.914  | 0.842     | 0.961
         | TAN       | 0.861    | 0.834 | 0.861  | 0.836     | 0.904
         | SETAN     | 0.914    | 0.876 | 0.914  | 0.842     | 0.962
Car      | NB        | 0.857    | 0.849 | 0.857  | 0.854     | 0.976
         | TAN       | 0.908    | 0.911 | 0.908  | 0.920     | 0.983
         | SETAN     | 0.946    | 0.947 | 0.946  | 0.947     | 0.988
Connect  | NB        | 0.721    | 0.681 | 0.721  | 0.681     | 0.807
         | TAN       | 0.763    | 0.722 | 0.763  | 0.731     | 0.864
         | SETAN     | 0.764    | 0.724 | 0.764  | 0.735     | 0.866
Mushroom | NB        | 0.958    | 0.958 | 0.958  | 0.960     | 0.998
         | TAN       | 0.999    | 1.000 | 0.999  | 1.000     | 1.000
         | SETAN     | 1.000    | 1.000 | 1.000  | 1.000     | 1.000
Nursery  | NB        | 0.903    | 0.894 | 0.903  | 0.906     | 0.982
         | TAN       | 0.928    | 0.920 | 0.928  | 0.929     | 0.991
         | SETAN     | 0.937    | 0.927 | 0.937  | 0.937     | 0.993
SPECT    | NB        | 0.738    | 0.735 | 0.738  | 0.745     | 0.802
         | TAN       | 0.713    | 0.709 | 0.713  | 0.724     | 0.668
         | SETAN     | 0.738    | 0.736 | 0.738  | 0.741     | 0.755
Cancer   | NB        | 0.734    | 0.727 | 0.734  | 0.723     | 0.702
         | TAN       | 0.706    | 0.692 | 0.706  | 0.687     | 0.667
         | SETAN     | 0.710    | 0.700 | 0.710  | 0.695     | 0.624
Votes    | NB        | 0.901    | 0.902 | 0.901  | 0.905     | 0.973
         | TAN       | 0.940    | 0.940 | 0.940  | 0.941     | 0.986
         | SETAN     | 0.949    | 0.950 | 0.949  | 0.950     | 0.985