Spatial Smoothing Regularization for Bi-directional Long Short-Term Memory Models
doi: 10.11999/JEIT180314
-
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
-
2. University of Chinese Academy of Sciences, Beijing 100049, China
-
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
-
Abstract:
The Bi-directional Long Short-Term Memory (BLSTM) model has become the mainstream acoustic model structure for speech recognition, owing to its strong temporal sequence modeling ability and good training stability; its cells and gates allow it to take rich context and long time dependence into account during training. However, the structure also brings a larger computation cost and parameter count, so the network overfits easily during training and fails to reach the expected recognition performance. In practice, common remedies include multi-task learning and adding an L2 regularization term to the objective function. This paper proposes a spatial smoothing method: the activation vector of a BLSTM layer is reorganized into a 2-D grid, a filter transform over the grid extracts its spatial information, and smoothing that spatial information serves as an auxiliary optimization objective which, together with the conventional loss function, forms the learning criterion for the network parameters. Experiments on a conversational telephone speech recognition task show a relative 4% Word Error Rate (WER) reduction over the baseline model. The complementarity of L2 regularization and spatial smoothing is explored further: applying both techniques together yields a relative 8.6% WER reduction.
-
Keywords:
- Speech signal processing
- Spatial smoothing
- Bi-directional Long Short-Term Memory (BLSTM)
- Regularization
- Overfitting
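To make the method concrete, below is a minimal sketch of the spatial smoothing penalty, assuming PyTorch. The 32 x 32 grid shape and the Laplacian-style high-pass kernel are illustrative assumptions on our part; the abstract only specifies reorganizing the activations into a 2-D grid, applying a filter transform, and adding the smoothness measure to the objective.

```python
import torch
import torch.nn.functional as F

def spatial_smoothing_penalty(h, grid=(32, 32)):
    """h: (batch, hidden_dim) activations (or cell states) of one BLSTM
    layer, with hidden_dim == grid[0] * grid[1]."""
    batch = h.size(0)
    img = h.view(batch, 1, *grid)  # reorganize the activation vector into a 2-D grid
    # High-pass (Laplacian) filter: responds strongly where neighbouring
    # grid cells differ, i.e. where the grid is NOT smooth.
    kernel = torch.tensor([[ 0., -1.,  0.],
                           [-1.,  4., -1.],
                           [ 0., -1.,  0.]]).view(1, 1, 3, 3)
    rough = F.conv2d(img, kernel, padding=1)
    return rough.pow(2).mean()  # energy of the high-pass output = roughness

# Hypothetical training criterion: conventional loss plus the weighted penalty.
# Table 2 suggests a smoothing weight c around 0.0009-0.0010 works best.
# loss = F.cross_entropy(logits, targets) + c * spatial_smoothing_penalty(h)
```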
-
Table 1 Results of spatial smoothing at different positions

Position   Weight (c)   CallHm WER (%)   Swbd WER (%)   Total WER (%)
None       None         20.0             10.3           15.2
P1         0.0020       19.9             10.4           15.2
P1         0.0010       19.9             10.0           15.0
P1         0.0007       20.0             10.3           15.2
P2         0.0020       19.7             10.0           14.9
P2         0.0010       19.7             9.8            14.8
P2         0.0007       19.9             9.8            15.0
P3         0.0020       20.1             10.3           15.2
P3         0.0010       20.0             9.8            15.0
P3         0.0007       20.0             10.1           15.1
P4         0.0010       20.9             10.6           15.8
P4         0.0007       20.6             10.3           15.5
P4         0.0006       20.5             10.6           15.6
Table 2 Spatial smoothing results on the cell state $c_t$ under different weights

Spatial smoothing weight (c)   CallHm WER (%)   Swbd WER (%)   Total WER (%)
None                           20.0             10.3           15.2
0.0100                         20.3             10.4           15.4
0.0010                         19.7             9.8            14.8
0.0009                         19.3             9.8            14.6
0.0008                         19.6             9.7            14.7
0.0007                         19.9             9.8            15.0
Table 3 Results after adding L2 regularization to the network

L2 regularization   Spatial smoothing   CallHm WER (%)   Swbd WER (%)   Total WER (%)
No                  No                  20.0             10.3           15.2
No                  Yes                 19.3             9.8            14.6
Yes                 No                  19.0             9.5            14.3
Yes                 Yes                 18.5             9.3            13.9
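Table 3 indicates the two regularizers are complementary: each helps on its own, and together they give the lowest WER (13.9%). One way to write the combined training criterion (our notation, not the paper's: $\lambda_2$ is the L2 weight, $c$ the spatial smoothing weight, and $S(\cdot)$ the smoothness penalty computed on the 2-D grid of cell states $\boldsymbol{c}_t$):

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta)
  + \lambda_{2}\,\lVert \theta \rVert_{2}^{2}
  + c \sum_{t} S(\boldsymbol{c}_{t})
```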