DNN-based emotional speech synthesis by speaker adaptation
Citation: ZHI Pengpeng, YANG Hongwu, SONG Nan. DNN-based emotional speech synthesis by speaker adaptation[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2018, 30(5): 673-679.
Authors: ZHI Pengpeng, YANG Hongwu, SONG Nan
Affiliation: College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, P. R. China
Funding: National Natural Science Foundation of China (11664036, 61263036); Science and Technology Innovation Team Project of Higher Education Institutions of Gansu Province (2017C-03)
Abstract: To improve the quality of synthesized emotional speech, this paper proposes a deep neural network (DNN)-based emotional speech synthesis method that applies speaker adaptation to a multi-speaker, multi-emotion training corpus. A text analyzer first derives context-dependent labels from the sentences, while the WORLD vocoder extracts acoustic features from the corresponding recordings. A speaker-independent DNN average voice model is then trained on the context-dependent labels and acoustic features. Finally, speaker adaptation with the target speaker's recordings in the target emotion yields a speaker-dependent DNN model of that emotion, which is used to synthesize the target emotional speech. Subjective evaluations show that the proposed method obtains higher opinion scores than the traditional hidden Markov model (HMM)-based method, and objective tests show that the spectrum of the synthesized emotional speech is closer to that of the original speech. The proposed method therefore improves both the naturalness and the emotional expressiveness of synthesized emotional speech.

Keywords: emotional speech synthesis; deep neural network; speaker adaptive training; WORLD vocoder; hidden Markov model
Received: 27 January 2018
Revised: 13 September 2018
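The abstract above describes a two-stage pipeline: WORLD-based acoustic feature extraction, followed by training a speaker-independent DNN average voice model that is then adapted to the target speaker and emotion. The Python sketch below is a minimal illustration of those steps under stated assumptions, not the authors' implementation: it assumes the pyworld, soundfile and torch packages are available, treats the context-dependent linguistic labels as numeric vectors encoded elsewhere, and uses hypothetical dimensions and data loaders in the usage outline.

import numpy as np
import pyworld as pw          # WORLD vocoder bindings
import soundfile as sf        # waveform I/O
import torch
import torch.nn as nn

def extract_world_features(wav_path, frame_period_ms=5.0):
    """Extract F0, spectral envelope and aperiodicity with the WORLD vocoder."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)                                  # pyworld expects float64 samples
    f0, t = pw.harvest(x, fs, frame_period=frame_period_ms)   # F0 contour
    sp = pw.cheaptrick(x, f0, t, fs)                          # spectral envelope
    ap = pw.d4c(x, f0, t, fs)                                 # aperiodicity
    return f0, sp, ap, fs

class AcousticDNN(nn.Module):
    """Feed-forward acoustic model mapping linguistic label vectors to acoustic frames."""
    def __init__(self, in_dim, out_dim, hidden=512, n_layers=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs, lr):
    """Frame-level MSE training; used both for the speaker-independent average
    voice model and, with a smaller learning rate, for adaptation (fine-tuning)
    on the target speaker's emotional recordings."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for linguistic, acoustic in loader:                   # batches of (label, feature) frames
            opt.zero_grad()
            loss = loss_fn(model(linguistic), acoustic)
            loss.backward()
            opt.step()
    return model

# Usage outline (dimensions and loaders are hypothetical placeholders):
#   avg_model = train(AcousticDNN(in_dim=467, out_dim=187),
#                     multi_speaker_loader, epochs=25, lr=1e-3)          # average voice model
#   adapted = train(avg_model, target_emotion_loader, epochs=5, lr=1e-4) # speaker/emotion adaptation
#   # At synthesis time the predicted features are converted back to a waveform
#   # with pw.synthesize(f0, sp, ap, fs, frame_period_ms).

Note that the paper describes the adaptation step as a speaker-adaptive transform of the average voice model; plain fine-tuning is used here only as a stand-in to keep the sketch short.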
