多语言 | 语音/音乐/音频论文速递

语音/音乐/音频论文速递 2026-06-11

语音/音乐/音频论文速递 2026-06-11 共分析 36 篇论文 ⚡ 今日概览 📥 抓取 36 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #语音识别 7篇 ███████ #语音合成 7篇 ███████ #基准测试 2篇 ██ #音乐信息检索 2篇 ██ #语音情感识别 2篇 ██ #低资源 1篇 █ #音频问答 1篇 █ #音频质量评估 1篇 █ 📊 论文评分排行榜（36 篇，按分数降序）排名论文总分分档主任务 🥇 Massive Open-Vocabulary Keyword Spotting 9.8分前50% #语音识别 🥈 Tight Boundary Prediction in Speaker Diarization Using 9.6分前25% #低资源 🥉 RAIL: Rethinking Auditory Intelligence in Large Audio-L 9.6分前10% #音频问答 4. Quality Adaptive Angular Margin Learning for Respirator 9.5分前50% #音频质量评估 5. CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched 9.2分前50% #多语言 6. Gumbel-BEARD: Automatic Layer Selection for Self-Superv 9.1分前25% #语音识别 7. PianoKontext: Expressive Performance Rendering from Dea 9.1分前50% #音乐生成 8. Benchmarking Neural Speech Compression from a Rate-Dist 9.0分前25% #基准测试 9. Fast-SDE: Efficient Single-Microphone Sound Source Dist 8.8分前50% - 10. Evaluating Bias in Phoneme-Based Automatic Speech Recog 8.8分前50% #语音识别 11. Real-Time Language Model Jamming: A Case Study for Live 8.7分前25% #音乐信息检索 12. HALO: Half-Frame-Rate Adaptive Learnable Operator for L 8.4分前50% #语音增强 13. The Dynamics of Human and AI-Generated Language: How Se 8.1分前25% #语音合成 14. UR-BERT: Scaling Text Encoders for Massively Multilingu 8.1分前25% #语音合成 15. SARA: A Dual-Stream VAE for High-Fidelity Speech Genera 7.9分前25% #语音合成 16. SpAArSIST: Sparsified AASIST for Efficient and Reliable 7.7分前50% #模型压缩 17. Interpreting and Steering a Text-to-Speech Language Mod 7.7分前25% #语音合成 18. Which Speech Representation Better Matches Text-Native 7.5分前50% #语音识别 19. MA-DLE: Speech-based Automatic Depression Level Estimat 7.5分前25% #语音情感识别 20. The Hidden Cost of Pairwise Verification in Synthetic S 7.5分前50% #语音合成 21. Sensitivity Analysis of Generative Spatial Audio Metric 7.2分前50% #音频生成 22. Snapping Matters: Context-Aware Onset Refinement for Au 7.1分前25% #音乐信息检索 23. Feature-Aligned Speech Watermarking for Robustness to R 7.1分前25% #鲁棒性 24. Context-Aware Multimodal Claim Verification in Spoken D 7.1分前50% #多模态模型 25. Afrispeech Semantics: Evaluating Audio Semantic Reasoni 7.0分前50% #数据集 26. Lung-SRAD: Spectral-Aware Regularized Audio DASS with D 6.8分前50% #对比学习 27. Lip Forcing: Few-Step Autoregressive Diffusion for Real 6.8分前50% #语音合成 28. Frozen Multimodal Embeddings for Personality and Cognit 6.7分前50% #语音情感识别 29. Fast Speech Foundation Model Distillation Using Interle 6.6分前50% #知识蒸馏 30. Steering Where to Listen: Instruction-Based Activation 6.5分前50% - 31. Pretrained self-supervised speech models can recognize 6.5分前50% #语音识别 32. Towards Data-free and Training-free Compression for Spe 6.4分前50% #语音识别 33. Additive Noise, Shift Recovery, and Signed Signals in t 6.1分前50% #信号处理基础 34. I Understand How You Feel: Enhancing Deeper Emotional S 5.8分前50% #语音识别 35. Overcoming State Inertia in Full-Duplex Spoken Language 5.5分前50% #基准测试 36. BadRobot: Jailbreaking Embodied LLM Agents in the Physi 5.2分后50% #语音合成 📋 论文列表 🥇 Massive Open-Vocabulary Keyword Spotting 9.8/10 | 创新 1.6/2 | 严谨 1.5/1.5 | 实验 1.5/1.5 | 清晰 1/1 | 影响 0.7/1.5 | 开源 1.5/1.5 | 复现 0.5/0.5 | 工程 1.5/1.5 ...

Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning

📄 Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning #多语言 8/10 | 创新 1.5/2 | 严谨 1.2/1.5 | 实验 1.5/1.5 | 清晰 1/1 | 影响 1/1.5 | 开源 1/1.5 | 复现 0.5/0.5 | 工程 0.3/1.5 🔥 8/10 | 前25% | #多语言 | #多语言 | arxiv 👥 作者与机构 Cristian-Teodor Neamtu, Serban Mihalache, Stefan Smeu, Dan Oneata, Horia Cucu, Dragos Burileanu ( affiliations: 1Politehnica University of Bucharest, Romania; 2Bitdefender, Romania - note: the text lists affiliations but not explicit in the provided snippet, inferred from context) ...

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

📄 GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models #数据集 #基准测试 #多语言 #多模态模型 #低资源 7.9/10 | 创新 1.5/2 | 严谨 1.2/1.5 | 实验 1.1/1.5 | 清晰 1/1 | 影响 1.3/1.5 | 开源 0/1.5 | 复现 0.5/0.5 | 工程 1.3/1.5 ✅ 7.9/10 | 前25% | #语音识别 | #数据集 | #基准测试 #多语言 | arxiv 👥 作者与机构作者：Ryner Tan, Wenxuan Zhang 机构：Singapore University of Technology and Design (新加坡科技设计大学) 💡 毒舌点评审稿人：一位匿名的顶会审稿人。这论文瞄准了LALM评估中一个真实存在的痛点——缺乏自然、多语言、多文化的测试场景，这个动机值得肯定。作者们收集数据、设计问题、进行质量控制的工作看起来也相当扎实。然而，这终究是一个“评测集”工作，而非提出新的模型或算法。在当前这个“Benchmark疲劳”的时代，如果只是提供一个新的数据集，其边际贡献需要仔细掂量。论文的最大亮点或许在于“自然发生音频”和“文化根基问题”的结合，但实验分析部分（尤其是错误案例分析）的缺失，使得这种结合的优势没能被充分证明。整体而言，这是一篇稳妥的、必要的工作，但距离“令人兴奋”或“突破性”还有差距。 ...

语音/音乐/音频论文速递 2026-06-10

语音/音乐/音频论文速递 2026-06-10 共分析 45 篇论文 ⚡ 今日概览 📥 抓取 45 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #语音识别 13篇 █████████████ #数据增强 3篇 ███ #自监督学习 2篇 ██ #语音合成 2篇 ██ #多模态模型 1篇 █ #语音对话系统 1篇 █ #语音生成 1篇 █ #参数高效微调 1篇 █ 📊 论文评分排行榜（45 篇，按分数降序）排名论文总分分档主任务 🥇 ViP-VL: Vietnamese Self-supervised Speech Pretraining M 9.7分前25% #语音识别 🥈 Spatial-Omni: Spatial Audio Understanding Integration i 9.4分前25% #多模态模型 🥉 Multi-Faceted Interactivity Alignment in Full-Duplex Sp 9.3分前25% #语音对话系统 4. OmniCap-IF: Benchmarking and Improving Instruction Foll 9.1分前25% #语音生成 5. RAT: Reference-Augmented Training for ASV Anti-Spoofing 8.8分前25% #数据增强 6. Recovering the Zipfian Distribution in Unsupervised Ter 8.7分前50% #自监督学习 7. LLM can Read Spectrogram: Encoder-free Speech-Language 8.6分前25% #语音识别 8. ParaBridge: Bridging Paralinguistic Perception and Dial 8.6分前25% #参数高效微调 9. Time-frequency localization of bird calls in dense soun 8.5分前25% #信号处理基础 10. Ethical and Technical Limits of Deepfake Speech Dataset 8.4分前25% - 11. Speech Meets ELF: Audio Conditional Continuous-Target D 8.3分前25% #语音识别 12. DeRA-MOS: Optimizing Text-to-Music Evaluation via Decou 8.2分前25% #音乐评估 13. Anchoring the Unknown: Open-Set Model Attribution via P 8.0分前25% #多语言 14. ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refi 8.0分前25% #语音质量评估 15. ContextCodec: Content-Focused Context Guidance for Ultr 7.9分前25% #语音编码 16. GlobeAudio: A Multilingual Multicultural Benchmark for 7.9分前25% #语音识别 17. Dual-Branch Gated Fusion for Open-Set Audio Deepfake So 7.8分前25% #音频深度伪造检测 18. Data Journalist Agent: Transforming Data into Verifiabl 7.7分前25% - 19. GC-LoRA: Gated Convolutional LoRA for Parameter-Efficie 7.6分前25% #语音识别 20. What Do Deepfake Speech Detectors Actually Hear? 7.6分前25% - 21. KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyw 7.6分前25% #关键词检测 22. Entropy-Aware Domain-Routed Mixture-of-Experts Speech-L 7.5分前25% #语音识别 23. Linguistically Augmented Audio Speech Data (LinguAS) 7.5分后50% #语音伪造检测 24. AudioProcessBench: Benchmark for Identifying Process Er 7.5分前50% - 25. Cross-Modal Knowledge Distillation without Paired Data: 7.5分前50% #语音识别 26. AuRA: Internalizing Audio Understanding into LLMs as Lo 7.5分前25% #语音问答 27. TRADE: Transducer-Augmented Decoder for Speech LLM 7.4分前25% #语音识别 28. Inside the Latent Flow: Causal Deciphering of Attention 7.3分前50% #语音分离 29. Optimality of FSQ Tokens for Continuous Diffusion for C 7.3分前50% #语音合成 30. Speech Encoder Fusion for LLM-based Automatic Speech Re 7.2分后50% #语音识别 31. Enhancing Multilingual LLM-based ASR with Mixture of Ex 7.0分前50% - 32. Phoneme-First Prediction for LLM-Based Speech Recogniti 6.9分前50% #语音识别 33. Profy: Interpretable Visualization of Expertise-Depende 6.9分前50% #音乐信息检索 34. Optimizing 2D Input Representations and Sub-phase Fusio 6.8分前50% #数据增强 35. SSL-GMMVC: Interpretable Voice Conversion via Locally L 6.8分前50% #语音转换 36. Deploying Speech-Driven 3D Facial Animation in Unreal E 6.6分前50% #语音合成 37. RespiraMFM: A Multimodal Foundation Model with Contrast 6.5分前50% #对比学习 38. From Senses to Decisions: The Information Flow of Audit 6.5分前50% #语音识别 39. Speaker Group Encoding in Self-supervised Speech Recogn 6.5分前50% #语音识别 40. Towards Robust Arabic Speech Emotion Recognition with D 6.4分前50% #语音情感识别 41. Multilingual Word-Level Forced Alignment with Self-Supe 6.3分前50% #自监督学习 42. Overview of ESDD2: Environment-Aware Speech and Sound D 6.3分前50% #数据增强 43. Towards Deep Contextual Reasoning from Broad Descriptio 6.2分前50% #语音识别 44. A Lightweight Dual-Factor Acoustic Authentication Syste 6.0分前50% #说话人验证 45. Automated Pronunciation Evaluation for Korean Toddler S 6.0分前50% #说话人日志 📋 论文列表 🥇 ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning 9.7/10 | 创新 1.5/2 | 严谨 1.3/1.5 | 实验 1.3/1.5 | 清晰 1/1 | 影响 1.1/1.5 | 开源 1.5/1.5 | 复现 0.5/0.5 | 工程 1.5/1.5 ...

A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification

📄 A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification #自监督学习 #对比学习 #低资源 #多语言 8.9/10 | 创新 1.5/2 | 严谨 1.2/1.5 | 实验 1.5/1.5 | 清晰 1/1 | 影响 1.2/1.5 | 开源 0.5/1.5 | 复现 0.5/0.5 | 工程 1.5/1.5 🔥 8.9/10 | 前50% | #自监督学习 | #自监督学习 | #对比学习 #低资源 | arxiv 👥 作者与机构 Agneedh Basu1, Pavan Kumar J1, Sujith P1, Visruth Sanka1, Nihar Desai1, Prasanta Kumar Ghosh2 ...

A study on the impact of region specific data on the performance of Indic ASR

📄 A study on the impact of region specific data on the performance of Indic ASR #语音识别 #低资源 #多语言 7.2/10 | 创新 1.2/2 | 严谨 1.1/1.5 | 实验 1.2/1.5 | 清晰 1/1 | 影响 1/1.5 | 开源 0.5/1.5 | 复现 0.5/0.5 | 工程 0.7/1.5 ✅ 7.2/10 | 前50% | #语音识别 | #低资源 | #多语言 | arxiv 👥 作者与机构作者：Agneedh Basu, Pavan Kumar J, Pranav Bhat, Sujith Pulikodan, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh。机构：AI & Robotics Technology Park (ARTPARK), I-Hub @ IISc, Bangalore, India； Department of Electrical Engineering, Indian Institute of Science, Bangalore, India。 ...

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

📄 Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages #语音识别 #低资源 #多语言 #自回归模型 6.2/10 | 创新 1.3/2 | 严谨 1/1.5 | 实验 1.1/1.5 | 清晰 1/1 | 影响 0.5/1.5 | 开源 0.2/1.5 | 复现 0.5/0.5 | 工程 0.6/1.5 ✅ 6.2/10 | 后50% | #语音识别 | #低资源 | #多语言 #自回归模型 | arxiv 👥 作者与机构作者：Venkata Kumar Tripathi, Chowdam Kumar, Pankaj Wasnik 机构：Media Analysis Group, Sony Research India 邮箱：kumud.tripathi@sony.com, chowdam.kumar@sony.com, pankaj.wasnik@sony.com 💡 毒舌点评这篇论文切中了多语言ASR中一个真实且重要的痛点：Whisper等模型在达罗毗荼语上的表现显著落后于印地语等。作者通过语言学分析将问题归因于形态复杂性导致的解码器注意力失衡，这个动机是合理且有启发性的。提出的Weighted-Attention和Self-Conditioning是直接针对这一问题的工程化尝试，方法本身是合理且可理解的。然而，最大的问题在于贡献的“天花板”较低。两个模块都是对现有Transformer解码器的微小调整（门控和残差连接），创新深度有限。实验规模（仅微调解码器、使用Medium模型、8种印度语言+2种泛化语言）和与当前最强基线（如Whisper-large-v3或专有SOTA）的差距分析不足，使得结论的说服力打了折扣。更关键的是，完全未开源，对于一项声称解决“公平性”问题的工作来说，这限制了其社会影响力和可复现性。总的来说，这是一篇扎实的、解决特定问题的工作，但离顶会论文所期望的突破性贡献仍有距离。 ...

语音/音乐/音频论文速递 2026-06-09

语音/音乐/音频论文速递 2026-06-09 共分析 48 篇论文 ⚡ 今日概览 📥 抓取 48 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #语音合成 10篇 ██████████ #语音识别 9篇 █████████ #自监督学习 3篇 ███ #多模态模型 3篇 ███ #语音增强 2篇 ██ #音频生成 2篇 ██ #说话人验证 2篇 ██ #大语言模型 1篇 █ 📊 论文评分排行榜（48 篇，按分数降序）排名论文总分分档主任务 🥇 A Finetuned SpeechLLM for Joint Multi-Granular L2 Asses 10.0分前25% #大语言模型 🥈 G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior 9.3分前50% #语音增强 🥉 HoliDubber: Holistic Video Dubbing for Complex Acoustic 9.0分前10% #语音合成 4. Probing Token Spaces under Generator Shift in AI-Genera 9.0分前10% #音频编码 5. A Comparative Study of Pre-trained Speech Encoders and 8.9分前50% #自监督学习 6. AVI-Bench: Toward Human-like Audio-Visual Intelligence 8.8分前25% #语音识别 7. Liberating LLM Capabilities in Full-Duplex Speech Model 8.7分前25% #多模态模型 8. MeCo: One-Step MeanFlow-based Corrector for Multi-Chann 8.4分前25% #语音分离 9. Your U-Net Dereverberation Model is Secretly an RIR Enc 8.3分前50% #对比学习 10. Predictive Fixed-Filter Active Noise Control (PFANC) Us 8.3分前25% - 11. TLDR: Compressing Audio Tokens for Efficient Autoregres 8.2分前25% #语音合成 12. Subtitle-Aligned Fine-Tuning of Whisper for Swiss Germa 8.2分前25% #语音识别 13. Discovering Functionally Selective Brain Regions with a 8.2分前25% #多模态模型 14. Parameter-Efficient Continual Learning for Automatic Sp 8.1分前25% #语音识别 15. OmniMem: Perturbation-aware Memory Compression for Stre 8.0分前25% #高效推理 16. OpenBibleTTS: Large-Scale Speech Resources and TTS Mode 8.0分前25% #语音合成 17. FlashTTS: Fast Streaming TTS with MTP Acceleration and 7.9分前25% #语音合成 18. Multi-View Speech Representation Learning for Parkinson 7.9分前50% #自监督学习 19. Is Text All You Need? Text as a Universal Information B 7.6分前50% #语音识别 20. End-to-End Training for Discrete Token LLM based TTS Sy 7.6分前50% #语音合成 21. Conan-embedding-v3: Fusing Modality-Specific Models for 7.6分前25% #音频检索 22. Cross-Modal Masking for Robust Silent Speech Synthesis 7.5分前50% #语音合成 23. Rethinking Depth: A study of the Recursive-Transformer 7.5分前25% #语音识别 24. What Makes Synthetic Speech Sound Sarcastic? A Prosody- 7.5分前25% #语音合成 25. FXplorer: A Map-Based Interface for Exploratory Audio E 7.5分前25% #音频生成 26. Assessing the Energy and Carbon Emissions of Neural Spe 7.4分前50% #说话人验证 27. Exploring the Scale and Diversity of Speech Anti-spoofi 7.4分前50% #数据增强 28. From A to B to A: Palindromic Zero-Shot Voice Conversio 7.3分前50% - 29. A study on the impact of region specific data on the pe 7.2分前50% #语音识别 30. Speaker-Invariant Representation Learning for Spoofing 7.1分前25% #对抗训练 31. BareWave: Waveform-Native Flow-Matching Text-to-Speech 7.0分前50% #语音合成 32. SMC-ITA: Sequential Monte Carlo Inference-Time Alignmen 7.0分前50% #音频生成 33. Quality-Diversity Search in Sound Generation: Investiga 7.0分前50% - 34. Can LLMs understand LilyPond? A benchmark for symbolic 7.0分前50% #音乐生成 35. NüshuVoice: Reviving the Voice of Endangered Nüshu with 7.0分前50% #语音合成 36. Factors affecting ASR performance: A study using state 6.9分前50% #语音识别 37. MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice 6.9分前50% #语音转换 38. Few-shot Class-variable Incremental Audio Classificatio 6.9分前50% #音频分类 39. A Hierarchical Feature Engineering Framework for Automa 6.8分前50% - 40. Fast and Robust On-Device Speaker Diarization: Relative 6.6分前50% #说话人分离 41. On Low-Bit Quantization Errors in Speaker Verification: 6.6分前50% #说话人验证 42. Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Ne 6.5分后50% #语音合成 43. TinyGiantALM: A Compact Audio-Language Model for Intent 6.4分前50% #多模态模型 44. Overcoming Decoder Inconsistencies in Whisper for Dravi 6.2分后50% #语音识别 45. Bridging Traditional Explainability Methods and Multimo 5.4分后50% #语音识别 46. Sound Field Interpolation Using Physics-Informed Extrem 5.3分后50% #语音增强 47. A Comparison of SSL-Based Feature Extractors and Back-E 5.0分后50% #自监督学习 48. AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining 4.5分后50% #音频事件检测 📋 论文列表 🥇 A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales 10.0/10 | 创新 2.0/2 | 严谨 1.5/1.5 | 实验 1.5/1.5 | 清晰 1/1 | 影响 1.5/1.5 | 开源 1.0/1.5 | 复现 0.5/0.5 | 工程 1.5/1.5 ...

dots.tts Technical Report

📄 dots.tts Technical Report #语音合成 #流匹配 #自回归模型 #多语言 #低资源 #数据增强 #模型压缩 9/10 | 创新 1.5/2 | 严谨 1.2/1.5 | 实验 1/1.5 | 清晰 1/1 | 影响 1.2/1.5 | 开源 1.5/1.5 | 复现 0.5/0.5 | 工程 1.1/1.5 🔥 9/10 | 前25% | #语音合成 | #数据增强 | #流匹配 #自回归模型 | arxiv 👥 作者与机构作者：Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu。机构：dots团队，小红书公司（Xiaohongshu Inc.），上海交通大学X-LANCE实验室。 ...

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

📄 Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations #语音合成 #自监督学习 #多语言 #语音编码 8.4/10 | 创新 1.5/2 | 严谨 1.3/1.5 | 实验 1.3/1.5 | 清晰 1/1 | 影响 1/1.5 | 开源 0.8/1.5 | 复现 0.5/0.5 | 工程 1/1.5 🔥 8.4/10 | 前25% | #语音合成 | #自监督学习 | #多语言 #语音编码 | arxiv 👥 作者与机构作者：Naman Kothari, Arjun Gangwar, Adarsh S, Umesh 机构：National Institute of Technology, Trichy; Indian Institute of Technology, Madras ...