MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion with Increased Controllability via Multiple Guidances
#voice-conversion #masked-modeling #classifier-free-guidance #zero-shot

✅ 6.5/10 | Top 50%
Academic quality 5.0/7 | Topic value 1.5/2 | Reproducibility bonus 0.0 | Confidence: medium

👥 Authors and Affiliations
First author: Junhyeok Lee (Johns Hopkins University, Center for Language and Speech Processing)
Corresponding author: Najim Dehak (Johns Hopkins University, Center for Language and Speech Processing)
Author list:
Junhyeok Lee (Johns Hopkins University, Center for Language and Speech Processing)
Helin Wang (Johns Hopkins University, Center for Language and Speech Processing)
Yaohan Guan (Johns Hopkins University, Center for Language and Speech Processing)
Thomas Thebaud (Johns Hopkins University, Center for Language and Speech Processing)
Laureano Moro-Velazquez (Johns Hopkins University, Center for Language and Speech Processing)
Jesús Villalba (Johns Hopkins University, Center for Language and Speech Processing)
Najim Dehak (Johns Hopkins University, Center for Language and Speech Processing)

💡 Blunt Take
The highlight of this paper is its unprecedented control flexibility: the design lets users "turn knobs" at inference time to balance timbre, pitch, and phonemes instead of being locked into a single mode. The weaknesses are just as clear. The MaskVCT-Spk mode, which pushes timbre imitation to the extreme, has intelligibility (WER) nearly twice as bad as the strongest baseline, and the paper says little about how to choose the "knob" weights (the CFG coefficients) systematically; the reported values read more like the outcome of trial and error. ...
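
The "knobs" in the review are classifier-free guidance (CFG) coefficients. As a rough illustration only, the sketch below assumes a common multi-condition CFG form in which each condition's logits are pushed away from the unconditional logits by its own weight. This is not taken from the paper, whose exact formulation may differ; the function name `combine_guidance`, the condition names, and all numbers are hypothetical.

```python
import numpy as np

def combine_guidance(uncond, conds, weights):
    """Generic multi-condition CFG (assumed form, not the paper's):
    guided = uncond + sum_k w_k * (cond_k - uncond)
    """
    uncond = np.asarray(uncond, dtype=float)
    guided = uncond.copy()
    for name, w in weights.items():
        # Each weight scales how far that condition moves the logits.
        guided += w * (np.asarray(conds[name], dtype=float) - uncond)
    return guided

# Toy logits over a 3-token vocabulary (made-up values).
uncond = [0.0, 0.0, 0.0]
conds = {
    "speaker": [1.0, 0.0, 0.0],
    "pitch": [0.0, 2.0, 0.0],
    "linguistic": [0.0, 0.0, 3.0],
}

# Emphasise speaker similarity, keep a little pitch guidance, drop linguistic.
weights = {"speaker": 2.0, "pitch": 0.5, "linguistic": 0.0}
guided = combine_guidance(uncond, conds, weights)
print(guided)  # [2. 1. 0.]
```

Setting every weight to zero recovers the unconditional logits, and a single weight of 1 recovers that condition's logits, which is why the weights behave like independent dials. Choosing them well is exactly the part the review finds under-discussed.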