Phoneme recognition method based on fusion network with low signal-to-noise ratio

Citation:	HUANG Huibo, SHAO Yubin, LONG Hua, DU Qingzhi. Phoneme recognition method based on fusion network with low signal-to-noise ratio[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2024, 36(4): 786-796

Authors:	HUANG Huibo, SHAO Yubin, LONG Hua, DU Qingzhi

Affiliation:	Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, P. R. China

Funding:	Yunnan Provincial Key Laboratory of Media Convergence Project (220235205)

Received:	2023-06-10

Revised:	2024-05-10

Abstract:	To address the low accuracy of phoneme recognition at low signal-to-noise ratios, a new recognition method is proposed. First, Fbank features are extracted from the speech and fed into the A-R-B-CTC model, which is built from a multi-head attention mechanism, ResNet, BLSTM, and CTC, for phoneme recognition. Then, Wave-U-Net is used to perform image denoising on the speech features Fbank, MFCC, GFCC, and the log spectrum, and it is found that denoising the Fbank features yields a lower phoneme error rate. Experiments are carried out on the THCHS30 dataset in a 0 dB white noise environment. The results show that, before Fbank denoising, the proposed A-R-B-CTC model reduces the average phoneme error rate by 4.38%, 2.5%, and 1.96% compared with the BLSTM-CTC, ResNet-BLSTM-CTC, and Transformer models, respectively. After Fbank denoising, the phoneme error rates of all four models drop markedly, and the proposed A-R-B-CTC model still performs well relative to the other three. Good results are also achieved at other signal-to-noise ratios.

Keywords:	phoneme recognition; Wave-U-Net; end-to-end; multi-head self-attention; Transformer model
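
The abstract refers to two evaluation ingredients that are easy to illustrate: mixing white noise into speech at a target signal-to-noise ratio (0 dB in the reported experiments) and scoring output with the phoneme error rate. The sketch below is only an illustration under stated assumptions, not the authors' code. The function names `add_white_noise` and `phoneme_error_rate`, the Gaussian noise model, the sine-wave stand-in for speech, and the toy phoneme labels are all hypothetical. The error rate uses the usual definition, edit distance divided by reference length, which the abstract does not spell out.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the target SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_signal / P_noise); solve for the noise gain.
    gain = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(signal, noise)]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance between phoneme sequences divided by reference length."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


if __name__ == "__main__":
    # Assumption: a 440 Hz tone at 16 kHz stands in for a speech waveform.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    noisy = add_white_noise(clean, snr_db=0.0)
    # Hypothetical phoneme labels, one substitution out of four.
    reference = ["n", "i3", "h", "ao3"]
    hypothesis = ["n", "i2", "h", "ao3"]
    print(len(noisy), phoneme_error_rate(reference, hypothesis))
```

At 0 dB the added noise carries the same average power as the signal, which is why recognition at this level is treated as a low signal-to-noise condition.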

