Speaker Recognition Based on Multimodal Generative Adversarial Nets with Triplet-loss
doi: 10.11999/JEIT190154
-
Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China
-
Abstract: To exploit the correlation between face and audio in speaker recognition, a multimodal Generative Adversarial Network (GAN) is designed to map face features and audio features into a more closely connected common space. A triplet loss is then used to further constrain the relationship between the two modalities, reducing the feature distance between cross-modal samples of the same identity while enlarging the distance between cross-modal samples of different identities. Finally, the cosine distance between the common-space features of the two modalities is computed to judge whether a face and a voice match, and Softmax is used to recognize the speaker's identity. Experimental results show that the proposed method effectively improves speaker recognition accuracy.
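The triplet constraint described above can be illustrated with a minimal PyTorch-style sketch. This is an illustration under stated assumptions, not the paper's implementation: the anchor is a face embedding in the common space, the positive is the same speaker's audio embedding, the negative is a different speaker's audio embedding, and the names and the margin value are invented for the example.

# Minimal sketch of the cross-modal triplet loss (illustrative, not the
# paper's code). face_emb, pos_audio_emb, neg_audio_emb, and margin are
# assumptions made for this example.
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(face_emb, pos_audio_emb, neg_audio_emb, margin=0.5):
    """Pull a face embedding toward the same speaker's audio embedding,
    push it away from a different speaker's audio embedding."""
    # Cosine distance = 1 - cosine similarity, consistent with the paper's
    # use of cosine distance in the common space.
    d_pos = 1.0 - F.cosine_similarity(face_emb, pos_audio_emb)  # same identity
    d_neg = 1.0 - F.cosine_similarity(face_emb, neg_audio_emb)  # different identity
    return F.relu(d_pos - d_neg + margin).mean()

# Usage: a batch of 8 samples with 128-dim common-space embeddings.
face = torch.randn(8, 128)
audio_same = torch.randn(8, 128)
audio_diff = torch.randn(8, 128)
loss = cross_modal_triplet_loss(face, audio_same, audio_diff)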
-
Key words:
- Speaker recognition
- Cross-modal
- Generative Adversarial Network (GAN)
- Triplet-loss
-
Table 1  Identification accuracy of different features (%)
Feature                   ID accuracy
Audio common feature      95.57
Face common feature       99.41
Concatenated feature      99.59
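Table 1 reflects the two evaluation steps in the abstract, sketched below under illustrative assumptions (the feature dimension, speaker count, and classifier shape are not from the paper): face-voice matching uses the cross-modal cosine distance between common-space features, while identification feeds the concatenated face and audio features (the best row in Table 1) to a Softmax classifier.

# Sketch of matching and identification on common-space features
# (illustrative assumptions only; dimensions are invented).
import torch
import torch.nn as nn
import torch.nn.functional as F

n_speakers, dim = 1000, 128
face_feat = torch.randn(1, dim)   # face common-space feature
audio_feat = torch.randn(1, dim)  # audio common-space feature

# 1) Face-voice matching: cosine similarity in the common space;
#    a higher score means the face and voice more likely match.
match_score = F.cosine_similarity(face_feat, audio_feat)

# 2) Identification: concatenate the two common features and classify
#    with a Softmax layer over speaker identities.
classifier = nn.Linear(2 * dim, n_speakers)
logits = classifier(torch.cat([face_feat, audio_feat], dim=1))
speaker_probs = F.softmax(logits, dim=1)
predicted_id = speaker_probs.argmax(dim=1)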
-