Multi-Task Learning

A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges

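#speech-recognition #speech-synthesis #self-supervised-learning #multimodal-models #multi-task-learning
5.4/10 | Novelty 1.5/2 | Rigor 1.2/1.5 | Experiments 0.7/1.5 | Clarity 0.9/1 | Impact 0.7/1.5 | Open source 0/1.5 | Reproducibility 0.1/0.5 | Engineering 0.3/1.5
📝 5.4/10 | bottom 50% | #speech-recognition | #self-supervised-learning | #speech-synthesis #multimodal-models | arxiv
👥 Authors and affiliations: Wen Liang (Columbia University, Red Hat); Li Siyan (Columbia University); Zackary Rackauckas (RoleGaku); Julia Hirschberg (Columbia University)
💡 Blunt take: This survey tries to build a clear taxonomy and research roadmap for "automated presentation coaching", a topic that looks narrow but actually spans several active fields (CAPT, TTS, L2 language learning). The ambition is commendable, but the execution leaves room for improvement. ...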

Speech/Music/Audio Paper Digest 2026-06-29

16 papers analyzed
⚡ Today's overview: 📥 fetched 16 papers → 🔬 in-depth analysis complete
🏷️ Trending topics (topic | count | distribution)
#speech-recognition | 4 | ████
#speech-synthesis | 2 | ██
#speaker-recognition | 2 | ██
#speech-quality-assessment | 1 | █
#data-augmentation | 1 | █
#speech-emotion-recognition | 1 | █
#multimodal-models | 1 | █
#speech-enhancement | 1 | █
📊 Paper score ranking (16 papers, in descending order of score)
Rank. Paper | Total | Tier | Primary task
🥇 Screening Matters: A Comparative Study of Conventional | 8.4 | top 25% | #speech-quality-assessment
🥈 From General-Purpose Audio Tagging to Spatially Grounde | 8.3 | top 50% | #data-augmentation
🥉 HPRO: Hierarchical Progressive Reward Optimization via | 8.2 | top 50% | #speech-synthesis
4. Learning from Annotation Uncertainty: Entropy-Aware Cur | 7.4 | top 50% | #speech-emotion-recognition
5. MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thin | 7.4 | top 25% | #multimodal-models
6. A Comparison of Fusion Techniques for Multi-Modal Human | 7.3 | top 50% | -
7. Do Speech Emphasis Models Generalize across Languages a | 7.0 | top 25% | #speech-recognition
8. Advancing Speaker-Based Vocal Effort Classification wit | 6.8 | top 50% | #speech-enhancement
9. HybridCodec: Modeling Discrete and Continuous Represent | 6.5 | top 50% | #speech-synthesis
10. Grammar-Guided Hierarchical Parsing for Long-form Audio | 6.2 | top 50% | #audio-event-detection
11. Room for Error: Large-Scale Simulation of Over-the-Air | 6.2 | top 50% | #speech-recognition
12. What Was That Again? Certified Robustness for Automatic | 6.2 | top 50% | -
13. Dialogue to Detection: A Multimodal Hybrid NLP Pipeline | 6.0 | bottom 50% | #speaker-recognition
14. From Black-Box to Clinical Insight: A Multi-Stage Expla | 6.0 | top 50% | #speech-recognition
15. DG^VoiC: Speaker Clustering for Fraud Investigation und | 5.7 | top 50% | #speaker-recognition
16. A Survey of Automated Presentation Coaching: Systems, M | 5.4 | bottom 50% | #speech-recognition
📋 Paper list
🥇 Screening Matters: A Comparative Study of Conventional and Crowdsourced Listening Tests
8.4/10 | Novelty 1.4/2 | Rigor 1.3/1.5 | Experiments 1.2/1.5 | Clarity 1/1 | Impact 1.2/1.5 | Open source 0.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1.3/1.5 ...

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

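#Transformer #multi-task-learning #multimodal-models
7.7/10 | Novelty 1.5/2 | Rigor 1.2/1.5 | Experiments 1.3/1.5 | Clarity 1/1 | Impact 0.9/1.5 | Open source 0/1.5 | Reproducibility 0.5/0.5 | Engineering 1.3/1.5
✅ 7.7/10 | top 25% | audio separation | #Transformer | #multi-task-learning #multimodal-models | arxiv
👥 Authors and affiliations: Adhiraj Banerjee, Vipul Arora (Department of Electrical Engineering, Indian Institute of Technology Kanpur)
💡 Blunt take: The paper proposes a model built on a clear idea: take the compact representation of an already-trained audio compression model (DAC) and the text features of a strong text-audio alignment model (CLAP), and use a lightweight Transformer masker to perform efficient text-guided audio separation. The approach does deliver a significant advantage in computational efficiency, especially for edge deployment, where GMACs drop sharply. The "first" claim, however, should be treated with caution: work such as CodecFormer has already explored NACs for separation, and the core contribution here is adding text guidance. The experimental evaluation is comprehensive, covering multiple datasets and different prompt granularities. The main problems are: 1) code and model weights are not released, which seriously weakens reproducibility and the basis for community validation; 2) the discussion in Section 3.3 of "why the NAC latent space is better" is overlong, and parts of it (such as the link to the RVQ hierarchy) read more like speculation than causal conclusions established by rigorous experiments; 3) the core conclusion that "masking beats generation" is supported by Table 3, but the control (CodecFormer) is a fixed-class separation model and is not fully equivalent to the text-guided setting, which somewhat undermines the fairness of the comparison. Overall, this is solid incremental work that addresses a specific and important deployment-efficiency problem, but it lacks open-source code and deeper theoretical analysis.
📌 Core summary: CodecSep is presented as the first model to combine a neural audio codec (NAC) with text guidance for universal audio source separation. It uses a pretrained DAC as the codec backbone with its parameters frozen, and uses text embeddings produced by CLAP to modulate a Transformer masker through FiLM conditioning. The masker operates on the compact latent space produced by the DAC encoder and predicts source masks, which makes separation efficient. The method surpasses AudioSep in separation fidelity (SI-SDR) while maintaining competitive perceptual quality (ViSQOL), and reduces compute cost by about 54x under code-stream deployment.
🔗 Open-source details
Code: the paper gives no code link.
Model weights: not mentioned in the paper.
Datasets: dnr-v2 (Divide and Remaster v2.0), AudioCaps, ESC-50, Clotho-v2, AudioSet-eval, and VGGSound are each cited in the paper without a download link. LibriSpeech, FMA (Free Music Archive), and FSD50K are mentioned as components of dnr-v2, also without download links.
Demo: not mentioned in the paper.
Reproduction materials: the paper mentions no separate reproduction package (such as pretrained checkpoints or complete training configuration files). Section 4.3, "Training", describes the training configuration in detail (optimizer, learning rate, hardware environment, and so on), but provides no links to ready-to-use materials.
Open-source projects cited in the paper: CLAP (Contrastive Language-Audio Pretraining), DAC (Descript Audio Codec), CodecFormer, SDCodec, AudioSep, TDANet, DPTNet, SepFormer, Wave-UNet, Demucs, MM-DenseLSTM, DCCRN, and Spleeter are cited without code repository links. Torchprofile, the tool used to compute MACs, is the only one with a GitHub link in the paper (https://github.com/zhijian-liu/torchprofile).
🏗️ Method overview and architecture: CodecSep uses an encoder-masker-decoder architecture that operates in DAC's latent space. ...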
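To make the "FiLM-conditioned masking in a latent space" idea from the summary more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`film`, `mask_latent`), the array shapes, and the random weights are all hypothetical, the latents and text embedding are random stand-ins for DAC and CLAP outputs, and a plain sigmoid replaces the Transformer masker.

```python
import numpy as np

def film(features, text_emb, w_gamma, w_beta):
    """FiLM: per-channel scale and shift of `features`, predicted from `text_emb`."""
    gamma = text_emb @ w_gamma  # (channels,) per-channel scale
    beta = text_emb @ w_beta    # (channels,) per-channel shift
    return gamma * features + beta  # broadcasts over the frame axis

def mask_latent(latent, text_emb, w_gamma, w_beta):
    """Predict a mask in (0, 1) from FiLM-modulated latents and apply it to the latent."""
    logits = film(latent, text_emb, w_gamma, w_beta)
    mask = 1.0 / (1.0 + np.exp(-logits))  # ASSUMPTION: sigmoid stands in for the Transformer masker
    return mask * latent

# Hypothetical toy inputs; real latents would come from a frozen codec encoder
# and the text embedding from a text-audio model.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8))          # 4 frames, 8 latent channels
text_emb = rng.standard_normal(16)            # 16-dimensional prompt embedding
w_gamma = rng.standard_normal((16, 8)) * 0.1  # hypothetical FiLM projection weights
w_beta = rng.standard_normal((16, 8)) * 0.1
separated = mask_latent(latent, text_emb, w_gamma, w_beta)
print(separated.shape)
```

In a system of the kind the summary describes, the masked latent would then be passed to the frozen codec decoder to reconstruct the separated waveform; that step is omitted here.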

Speech/Music/Audio Paper Digest 2026-06-26

22 papers analyzed
⚡ Today's overview: 📥 fetched 22 papers → 🔬 in-depth analysis complete
🏷️ Trending topics (topic | count | distribution)
#speech-recognition | 3 | ███
#speech-quality-assessment | 2 | ██
#speech-synthesis | 2 | ██
#diffusion-models | 1 | █
singing assessment | 1 | █
audio codec | 1 | █
audio event detection | 1 | █
audio separation | 1 | █
📊 Paper score ranking (21 papers, in descending order of score)
Rank. Paper | Total | Tier | Primary task
🥇 DNSMOS-C: Improving End-to-end Speech Quality Models vi | 9.3 | top 50% | #speech-quality-assessment
🥈 UnityShots: Memory-Driven Multi-Shot Audio-Video Genera | 8.9 | top 25% | #diffusion-models
🥉 Listening Like a Judge: A Music-Aware Framework for Aut | 8.8 | top 25% | singing assessment
4. Elastic Time: Dynamic Frame Rate Bottlenecks for Neural | 8.3 | top 50% | audio codec
5. Soroll-IA: A Weakly Labeled Audio Dataset for Real-Worl | 8.3 | top 25% | audio event detection
6. A Large-Scale Database and Predictive Model of Listener | 8.1 | top 25% | #speech-quality-assessment
7. SamaVaani: Auditing and Debiasing Multilingual Clinical | 7.8 | top 25% | #speech-recognition
8. CodecSep: Prompt-Driven Universal Sound Separation on N | 7.7 | top 25% | audio separation
9. VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinfo | 7.6 | top 50% | #speech-synthesis
10. What We are Missing in Multimodal LLM Evaluation? | 7.0 | top 50% | -
11. RedVox: Safety and Fairness Gaps in Speech Models Acros | 6.8 | top 50% | #benchmarking
12. WQ-Fusion: Dynamic Gated Attention for Cross-Domain Aud | 6.7 | top 50% | #audio-classification
13. Thinking While Speaking: Inference-Time Knowledge Trans | 6.7 | bottom 50% | #knowledge-distillation
14. When Does Quality-Aware Multimodal Fusion Matter? A Lea | 6.6 | top 50% | #speech-emotion-recognition
15. voxmap-studio: An open-source speaker diarization annot | 6.5 | top 50% | #speaker-diarization
16. FBK's Long-form SpeechLLMs for IWSLT 2026 Instructi | 6.5 | top 50% | #speech-recognition
17. wav2tok 2.0: Scalable Audio Tokenization Maintaining Ex | 6.4 | top 50% | #speech-retrieval
18. Generative AI and Copyright Infringement: A Legal-Techn | 6.0 | top 50% | #music-generation
19. Closing the Quality Gap in Low-Resource Text-to-Speech: | 6.0 | bottom 50% | #speech-synthesis
20. Neural Speaker Diarization via Multilingual Training: E | 5.5 | top 50% | #speech-separation
21. Low Resource Multimodal Translation of Nepali Spoken Wo | 5.3 | bottom 50% | #speech-recognition
22. Phonetic and semantic analyses of spoken corpora of Bei | N/A | - | -
📋 Paper list
🥇 DNSMOS-C: Improving End-to-end Speech Quality Models via Contrastive Learning
9.3/10 | Novelty 1.3/2 | Rigor 1.2/1.5 | Experiments 1.4/1.5 | Clarity 1/1 | Impact 1.3/1.5 | Open source 1.2/1.5 | Reproducibility 0.5/0.5 | Engineering 1.4/1.5 ...

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

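#speech-recognition #speech-synthesis #speech-translation #multi-task-learning #large-language-models
7.1/10 | Novelty 1.3/2 | Rigor 1.2/1.5 | Experiments 1.2/1.5 | Clarity 1/1 | Impact 0.9/1.5 | Open source 0.2/1.5 | Reproducibility 0.5/0.5 | Engineering 0.8/1.5
✅ 7.1/10 | top 50% | #speech-recognition | #multi-task-learning | #speech-synthesis #speech-translation | arxiv
👥 Authors and affiliations: Tomoya Mizumoto, Yusuke Fujita (SB Intuitions Inc.). Email: tomoya.mizumoto@sbintuitions.co.jp, yusuke.fujita@sbintuitions.co.jp
💡 Blunt take: This paper reads like a rigorous "ablation report". It answers one question precisely: when training a speech encoder, does adding a translation task help, and by how much? The answer is "it helps, and bidirectional translation helps more than unidirectional translation". Its strength is a very clean, controlled experimental design with clear, direct conclusions. That clarity also exposes its limitation: the study is strictly confined to the paradigm of "plugging a pretrained encoder into a frozen LLM" and does not explore more flexible architectures (such as end-to-end training). The 130k hours of training data is modest by the standards of today's large-model era, so the work looks more like validating an idea than challenging SOTA. The complete absence of open-source releases is a pity for peers who want to reproduce the work or build on it.
📌 Core summary: The core research question is whether introducing a translation task (especially bidirectional translation) when pretraining a speech encoder improves its integration with a frozen large language model. The authors argue that conventional ASR-based encoders learn language-specific representations, which are structurally misaligned with the LLM's unified semantic space. To address this, they add a cross-lingual translation task during pretraining, specifically requiring the model to translate in both directions between English and other languages, to force the encoder to learn language-agnostic semantic representations. The experiments compare three pretraining objectives: ASR only, ASR + unidirectional translation (X→en), and ASR + bidirectional translation (X↔en). The results show that bidirectional translation pretraining (X↔en) brings significant and consistent gains on tasks such as speech translation and intent classification, generalizes to language pairs unseen during pretraining, and does not hurt emotion recognition, which depends on acoustic information. The paper attributes this advantage to the bidirectional objective providing a more symmetric and more thorough path to semantic abstraction. ...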

Speech/Music/Audio Paper Digest 2026-06-25

27 papers analyzed
⚡ Today's overview: 📥 fetched 27 papers → 🔬 in-depth analysis complete
🏷️ Trending topics (topic | count | distribution)
#speech-recognition | 6 | ██████
#speech-synthesis | 5 | █████
#speech-enhancement | 2 | ██
#music-generation | 1 | █
#speech-translation | 1 | █
#speech-deepfake-detection | 1 | █
#self-supervised-learning | 1 | █
#end-to-end | 1 | █
📊 Paper score ranking (27 papers, in descending order of score)
Rank. Paper | Total | Tier | Primary task
🥇 Fully Differentiable Neural Forced Alignment via Soft D | 8.3 | top 25% | -
🥈 Attractive and Repulsive Pattern Control in Sequence Ge | 8.1 | top 25% | #music-generation
🥉 STEB: A Speech-to-Speech Translation Expressiveness Ben | 7.8 | top 50% | #speech-translation
4. Supervised Post-training of Speech Foundation Models fo | 7.6 | top 50% | #speech-deepfake-detection
5. Joint Residual Reweighting for Classifier Free Guidance | 7.5 | top 50% | #speech-synthesis
6. Velocity Prediction in Automatic Guitar Transcription | 7.5 | top 25% | -
7. SE-AGCNet: An End-to-End Framework for Joint Speech Enh | 7.4 | top 50% | #speech-enhancement
8. MJEPA: A Simple and Scalable Joint-Embedding Predictive | 7.4 | top 25% | #self-supervised-learning
9. Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese | 7.3 | top 50% | #speech-synthesis
10. One Model, Many Latencies: Universal Speech Enhancement | 7.2 | top 50% | #speech-enhancement
11. From Sounds to Scenes: A Benchmark for Evaluating Conte | 7.2 | top 50% | #speech-recognition
12. Wan-Streamer v0.1: End-to-end Real-time Interactive Fou | 7.2 | top 25% | #speech-synthesis
13. Does Translation-Enhanced Speech Encoder Pre-training A | 7.1 | top 50% | #speech-recognition
14. Adaptive Oscillatory Inductive Bias for Modeling Sharp | 7.0 | top 50% | #speech-synthesis
15. End-to-End Voice Intent Recognition for Spontaneous Hum | 7.0 | top 50% | #end-to-end
16. Real-Time Voice AI Hears but Does Not Listen | 7.0 | top 50% | -
17. FoleySet: A Multi-Level Human-Annotated Foley Sound Dat | 7.0 | top 50% | #audio-classification
18. EmotionAI: A Privacy-Preserving Computational Intellige | 6.9 | top 50% | #speech-emotion-recognition
19. Frequency-Aware Self-Supervised Music Representation Le | 6.8 | top 50% | #music-information-retrieval
20. BCoughBench: Benchmarking Respiratory Acoustic Foundati | 6.7 | top 50% | #benchmarking
21. SpeechEQ: Benchmarking Emotional Intelligence Quotient | 6.7 | top 25% | #spoken-dialogue-systems
22. Graph-Based Phonetic Error Correction of Noisy ASR | 6.7 | top 50% | #speech-recognition
23. What Does a Pathological Speech Assessment Model Know a | 6.4 | top 50% | #speech-intelligibility-assessment
24. Phoneme-Level Mispronunciation Screening in Polish-Spea | 6.2 | top 50% | #speech-recognition
25. Error-Aware TF-IDF Retrieval-Augmented Generation for A | 6.1 | top 50% | #speech-recognition
26. Evaluating Japanese Dialect Robustness Across Speech an | 5.8 | top 50% | #speech-recognition
27. CrossAccent-TTS: Cross-Lingual Accent-Intensity Control | 5.5 | top 50% | #speech-synthesis
📋 Paper list
🥇 Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming
8.3/10 | Novelty 1.4/2 | Rigor 1.3/1.5 | Experiments 1.0/1.5 | Clarity 1/1 | Impact 1.1/1.5 | Open source 1.2/1.5 | Reproducibility 0.5/0.5 | Engineering 0.8/1.5 ...

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages

#speech-recognition #multilingual #multi-task-learning
6.5/10 | Novelty 1/2 | Rigor 1/1.5 | Experiments 1.5/1.5 | Clarity 1/1 | Impact 0.5/1.5 | Open source 0/1.5 | Reproducibility 0.5/0.5 | Engineering 1/1.5
✅ 6.5/10 | top 50% | #speech-recognition | #multi-task-learning | #multilingual | arxiv
👥 Authors and affiliations: Pavan Kumar J¹, Agneedh Basu², Pranav Bhat², Sujith Pulikodan², Visruth Sanka², Nihar Desai², Prasanta Kumar Ghosh²
¹ AI & Robotics Technology Park (ARTPARK), I-Hub @ IISc, Bangalore, India
² Department of Electrical Engineering, Indian Institute of Science, Bangalore, India
Email: pavanjk@artpark.in ...

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

#speech-synthesis #self-supervised-learning #speech-enhancement #multi-task-learning #contrastive-learning
7.7/10 | Novelty 1.3/2 | Rigor 1/1.5 | Experiments 1.1/1.5 | Clarity 1/1 | Impact 1/1.5 | Open source 0.8/1.5 | Reproducibility 0.5/0.5 | Engineering 1/1.5
✅ 7.7/10 | top 25% | #speech-synthesis | #self-supervised-learning | #speech-enhancement #multi-task-learning | arxiv
👥 Authors and affiliations: SooHwan Eom, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Chang D. Yoo
Affiliations: 1 Korea Advanced Institute of Science and Technology, South Korea; 2 University of Illinois Urbana-Champaign, United States ...

Speech/Music/Audio Paper Digest 2026-06-19

40 papers analyzed
⚡ Today's overview: 📥 fetched 40 papers → 🔬 in-depth analysis complete
🏷️ Trending topics (topic | count | distribution)
#speech-synthesis | 10 | ██████████
#speech-recognition | 8 | ████████
#voice-conversion | 2 | ██
#speech-enhancement | 2 | ██
#self-supervised-learning | 2 | ██
#speaker-verification | 1 | █
#model-compression | 1 | █
#multimodal-models | 1 | █
📊 Paper score ranking (40 papers, in descending order of score)
Rank. Paper | Total | Tier | Primary task
🥇 FlowEdit: Associative Memory for Lifelong Pronunciation | 10.0 | top 25% | #speech-synthesis
🥈 Low-Burden Data Augmentation for Dysarthric ASR via Zer | 8.7 | top 25% | #speech-recognition
🥉 S-JEPA : Soft Clustering Anchors for Self-Supervised Sp | 8.7 | top 25% | #speech-recognition
4. Personalized Keyword Spotting for User-Defined Keywords | 8.6 | top 25% | #speaker-verification
5. FlowFake: Liquid Networks for Audio Deepfake Detection | 8.5 | top 25% | #model-compression
6. Systematic Study of Dysarthric Speech Recognition: Spec | 8.3 | top 50% | #speech-recognition
7. PerceptionDLM: Parallel Region Perception with Multimod | 8.1 | top 25% | #multimodal-models
8. RIVET: Robust Idempotent Voice Attribute Editing | 8.0 | top 50% | #voice-conversion
9. Repurposing a Speech Classifier for Guided Diffusion-Ba | 7.9 | top 50% | #speech-synthesis
10. Exploring Feature Extraction Technique Parameters for A | 7.9 | top 50% | #audio-event-detection
11. Transcript-Free Flow-Matching Text-to-Speech via Speech | 7.7 | top 25% | #speech-synthesis
12. How Do Instructions Shape Speech? Cross-Attention Attri | 7.7 | top 50% | #speech-synthesis
13. Hybrid Diffusion Transformer for Instruction-Guided Aud | 7.6 | top 50% | #Transformer
14. Improving Code-Switching ASR with Code-Mixing Guided Sy | 7.6 | top 25% | #speech-recognition
15. PolSeT: Polish Semantics of Timbre Dataset | 7.5 | bottom 50% | -
16. IHBench: Evaluating Post-Interruption Recovery in Voice | 7.5 | top 25% | #spoken-dialogue-systems
17. A Survey of Full-Duplex Spoken Dialogue Systems: Archit | 7.4 | top 50% | #speech-synthesis
18. PhysDrift: Bridging the Embodiment Gap in Humanoid Co-S | 7.4 | top 50% | #speech-synthesis
19. PrefSQA: Pairwise Preference Prediction for Speech Qual | 7.3 | top 50% | #speech-quality-assessment
20. Latency-Configurable Streaming Speech Enhancement via A | 7.2 | top 50% | #speech-enhancement
21. A Comparative Study of Pretrained Transformer Models fo | 7.2 | top 50% | #speech-recognition
22. Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, C | 7.2 | top 50% | -
23. Stuttering Classification and Segmentation with Attenti | 7.0 | top 50% | -
24. Time-Unconditional Generative Speech Enhancement via Au | 7.0 | top 25% | #speech-enhancement
25. Investigating Human-Model Discrepancies in Speech Quali | 6.9 | top 25% | #speech-synthesis
26. Prismriver: Formalization of Music Theory and Algorithm | 6.9 | top 50% | -
27. NEST: Narrative Event Structures in Time for Long Video | 6.8 | top 50% | -
28. Cross-Dataset, Age, and Gender Generalization: A Compre | 6.7 | top 50% | #speech-recognition
29. Exploring Pre-training Benefits on Phoneme Addition thr | 6.7 | top 50% | -
30. Analyzing Language and Geographical Variation in Speech | 6.5 | top 50% | #speech-recognition
31. Improving End-to-End Speech Recognition for Dysarthric | 6.5 | top 50% | #speech-recognition
32. Segment-Level Mandarin Chinese Speech-Based Cognitive I | 6.5 | top 50% | #contrastive-learning
33. Light-weight Pronunciation Assessment via Discrete Spee | 6.4 | top 50% | #self-supervised-learning
34. ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Co | 6.2 | top 50% | #speech-synthesis
35. Zero-VC: Zero-Lookahead Streaming Voice Conversion via | 6.1 | top 50% | #voice-conversion
36. MixProLAP: Mixture-Induced Uncertainty Modeling for Pro | 5.7 | top 50% | #audio-retrieval
37. MaineCoon: Pursuing A Real-Time Audio-Visual Social Wor | 5.7 | top 50% | #speech-synthesis
38. Leveraging systems' non-linearity to tackle the sca | 5.5 | bottom 50% | #data-augmentation
39. Interpreting Content and Speaker Characteristics in Fac | 5.0 | bottom 50% | #speech-synthesis
40. Beyond Speaker Independence: Evaluating Cross-Lingual A | 4.9 | bottom 50% | #self-supervised-learning
📋 Paper list
🥇 FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
10.0/10 | Novelty 2/2 | Rigor 1.5/1.5 | Experiments 1.5/1.5 | Clarity 1/1 | Impact 1.5/1.5 | Open source 1.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1.5/1.5 ...

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction

#speech-synthesis #speech-recognition #streaming #multi-task-learning #self-supervised-learning #parameter-efficient-fine-tuning #real-time-processing
7.9/10 | Novelty 1.5/2 | Rigor 1.2/1.5 | Experiments 1.2/1.5 | Clarity 1/1 | Impact 1.2/1.5 | Open source 0/1.5 | Reproducibility 0.5/0.5 | Engineering 1.3/1.5
✅ 7.9/10 | top 50% | #speech-synthesis | #multi-task-learning | #speech-recognition #streaming | arxiv
👥 Authors and affiliations: Tristan Tsoi, Jiajun Deng, Yingke Zhu, Huu Quyen Dang, Tianxiang Cao, Nikita Kuzmin, Tao Zhong, Simon Lui
Huawei Central Media Technology Institute, The Chinese University of Hong Kong, Nanyang Technological University ...