Spatial Smoothing Regularization for Bi-directional Long Short-Term Memory Models
doi: 10.11999/JEIT180314
-
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
-
2. University of Chinese Academy of Sciences, Beijing 100049, China
-
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
-
Abstract:
The Bi-directional Long Short-Term Memory (BLSTM) model has become the mainstream acoustic model structure for speech recognition, owing to its strong temporal sequence modeling ability and good training stability; its cells and gates allow it to take rich context and long time dependence into account during training. However, the structure also brings a larger computation cost and parameter count, so the network overfits easily during training and fails to reach the expected recognition performance. In practice, common remedies include multi-task learning and adding an L2 regularization term to the objective function. This paper proposes a spatial smoothing method: the activation vector of a BLSTM layer is reorganized into a 2-D grid, a filter transform over the grid extracts its spatial information, and smoothing that spatial information serves as an auxiliary optimization objective which, together with the conventional loss function, forms the learning criterion for the network parameters. Experiments on a conversational telephone speech recognition task show a relative 4% Word Error Rate (WER) reduction over the baseline model. The complementarity of L2 regularization and spatial smoothing is explored further: applying both techniques together yields a relative 8.6% WER reduction.
-
Keywords:
- Speech signal processing
- Spatial smoothing
- Bi-directional Long Short-Term Memory (BLSTM)
- Regularization
- Overfitting
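To make the method concrete, below is a minimal sketch of the spatial smoothing penalty, assuming PyTorch. The 32 x 32 grid shape and the Laplacian-style high-pass kernel are illustrative assumptions on our part; the abstract only specifies reorganizing the activations into a 2-D grid, applying a filter transform, and adding the smoothness measure to the objective.

```python
import torch
import torch.nn.functional as F

def spatial_smoothing_penalty(h, grid=(32, 32)):
    """h: (batch, hidden_dim) activations (or cell states) of one BLSTM
    layer, with hidden_dim == grid[0] * grid[1]."""
    batch = h.size(0)
    img = h.view(batch, 1, *grid)  # reorganize the activation vector into a 2-D grid
    # High-pass (Laplacian) filter: responds strongly where neighbouring
    # grid cells differ, i.e. where the grid is NOT smooth.
    kernel = torch.tensor([[ 0., -1.,  0.],
                           [-1.,  4., -1.],
                           [ 0., -1.,  0.]]).view(1, 1, 3, 3)
    rough = F.conv2d(img, kernel, padding=1)
    return rough.pow(2).mean()  # energy of the high-pass output = roughness

# Hypothetical training criterion: conventional loss plus the weighted penalty.
# Table 2 suggests a smoothing weight c around 0.0009-0.0010 works best.
# loss = F.cross_entropy(logits, targets) + c * spatial_smoothing_penalty(h)
```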
-
Table 1 Results of spatial smoothing at different positions

Position   Weight (c)   CallHm WER (%)   Swbd WER (%)   Total WER (%)
None       None         20.0             10.3           15.2
P1         0.0020       19.9             10.4           15.2
P1         0.0010       19.9             10.0           15.0
P1         0.0007       20.0             10.3           15.2
P2         0.0020       19.7             10.0           14.9
P2         0.0010       19.7             9.8            14.8
P2         0.0007       19.9             9.8            15.0
P3         0.0020       20.1             10.3           15.2
P3         0.0010       20.0             9.8            15.0
P3         0.0007       20.0             10.1           15.1
P4         0.0010       20.9             10.6           15.8
P4         0.0007       20.6             10.3           15.5
P4         0.0006       20.5             10.6           15.6
Table 2 Spatial smoothing results on the cell state $c_t$ under different weights

Spatial smoothing weight (c)   CallHm WER (%)   Swbd WER (%)   Total WER (%)
None                           20.0             10.3           15.2
0.0100                         20.3             10.4           15.4
0.0010                         19.7             9.8            14.8
0.0009                         19.3             9.8            14.6
0.0008                         19.6             9.7            14.7
0.0007                         19.9             9.8            15.0
Table 3 Results after adding L2 regularization to the network

L2 regularization   Spatial smoothing   CallHm WER (%)   Swbd WER (%)   Total WER (%)
No                  No                  20.0             10.3           15.2
No                  Yes                 19.3             9.8            14.6
Yes                 No                  19.0             9.5            14.3
Yes                 Yes                 18.5             9.3            13.9
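Table 3 indicates the two regularizers are complementary: each helps on its own, and together they give the lowest WER (13.9%). One way to write the combined training criterion (our notation, not the paper's: $\lambda_2$ is the L2 weight, $c$ the spatial smoothing weight, and $S(\cdot)$ the smoothness penalty computed on the 2-D grid of cell states $\boldsymbol{c}_t$):

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta)
  + \lambda_{2}\,\lVert \theta \rVert_{2}^{2}
  + c \sum_{t} S(\boldsymbol{c}_{t})
```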