Benchmark | Speech/Music/Audio Paper Digest

VoxWatermark: A Large-Scale Benchmark for Audio Watermark Detection under Perturbations

📄 VoxWatermark: A Large-Scale Benchmark for Audio Watermark Detection under Perturbations
#robustness #benchmark #multilingual
9.4/10 | Novelty 1.5/2 | Rigor 1.2/1.5 | Experiments 1.5/1.5 | Clarity 1/1 | Impact 1.2/1.5 | Open source 1.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1/1.5
🔥 9.4/10 | top 50% | #robustness | #benchmark | #multilingual | arxiv
👥 Authors and affiliations
Authors: Farnaz Sedaghati, Yuxi Wang, Zicheng Weng, Wei Rao
Affiliations: 1 University of Tehran, Iran; 2 Nanyang Technological University, Singapore ...

Speech/Music/Audio Paper Digest 2026-06-16

Speech/Music/Audio Paper Digest 2026-06-16
62 papers analyzed
⚡ Today's overview: 📥 62 papers fetched → 🔬 in-depth analysis complete
🏷️ Hot topics
Topic | Count | Distribution
#speech-recognition | 9 papers | █████████
#speech-synthesis | 6 papers | ██████
#multimodal-models | 5 papers | █████
#self-supervised-learning | 4 papers | ████
#audio-generation | 3 papers | ███
#generative-models | 2 papers | ██
#speech-generation | 2 papers | ██
#music-information-retrieval | 2 papers | ██
📊 Paper score leaderboard (62 papers, sorted by score in descending order)
Rank | Paper | Score | Tier | Main task
🥇 | TuneJury: An Open Metric for Improving Music Generation | 9.7 | top 25% | #multimodal-models
🥈 | Acoustic, VOC, and Multimodal Stress Source Localizatio | 9.7 | top 50% | #sound-source-localization
🥉 | VoxWatermark: A Large-Scale Benchmark for Audio Waterma | 9.4 | top 50% | #robustness
4 | Phonetically Explainable Speech Deepfake Detection | 9.0 | top 50% | #speech-forgery-detection
5 | FreeSonic: Training-Free Temporal-Aware Decoupled Atten | 9.0 | top 25% | #audio-generation
6 | MambAdapter: Lightweight Mamba-Based Adapters for Param | 8.9 | top 25% | #speech-recognition
7 | XAI-Grounded Explanation Generation for Speech Deepfake | 8.9 | top 25% | #multimodal-models
8 | Unified Audio Generation and Editing via Joint Conditio | 8.7 | top 25% | #audio-generation
9 | AdaTT: Text-Guided Instrument Timbre Transfer with Targ | 8.7 | top 25% | #audio-generation
10 | DuraMark: Duration-Embedded Watermarking in LLM-based T | 8.7 | top 25% | #generative-models
11 | When the Same Musical Knowledge Forgets Differently: A | 8.6 | top 10% | -
12 | Probing Low Frame Rate Degradation in Neural Audio Code | 8.6 | top 25% | #speech-generation
13 | Rhythm of the Deep: A Computational-Linguistic Test of | 8.5 | top 25% | #self-supervised-learning
14 | Beyond Artifacts: Towards Generalizable Synthetic Song | 8.4 | top 25% | #music-information-retrieval
15 | Acoustic Prompting via Stage-wise Modulation for Few-Sh | 8.3 | top 50% | #audio-classification
16 | ArtNet: A JEPA-Like Articulatory Predictive Framework f | 8.3 | top 50% | #speech-recognition
17 | MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Rep | 8.3 | top 25% | #audio-classification
18 | Bridging the SEA Gap: An Initial Benchmark for Neural A | 8.2 | top 25% | #speech-synthesis
19 | An Empirical Study on Learning Latent Representations f | 8.2 | bottom 50% | #speech-synthesis
20 | From Physics to Representation: Audio Learning with Syn | 8.2 | top 25% | #self-supervised-learning
21 | An Asymmetric Formula for Interval Consonance and its R | 8.0 | top 25% | #music-information-retrieval
22 | Universal adaptive beamforming: A Bayesian approach | 8.0 | top 50% | #adaptive-filtering
23 | Learning Input-Channel Permutation Equivariance for Mul | 7.9 | top 50% | #music-source-separation
24 | Stabilizing Short Duration Speaker Verification through | 7.9 | top 50% | #speaker-verification
25 | AUDEDIT: Inversion-Free Text-Guided Editing with Pretra | 7.8 | top 25% | #generative-models
26 | Interpretable and Frugal Learning Systems Employing Mul | 7.8 | top 25% | -
27 | MuVAP: Multimodal Multiparty Voice Activity Projection | 7.8 | top 25% | #spoken-dialogue-systems
28 | Dynamic Prosody Prediction in LLM-based TTS for Improvi | 7.6 | top 25% | #speech-synthesis
29 | Scaling Human and G2P Supervision for Robust Phonetic T | 7.6 | top 25% | #speech-recognition
30 | SPRI: SVD-Partitioned Residual Initialization for Data- | 7.6 | top 25% | #speech-translation
31 | CraBERT: Efficient Phoneme Encoder Pre-Training via Cas | 7.5 | top 50% | #speech-synthesis
32 | Pixel-TTS: Image based Text Rendering for Robust Text-t | 7.5 | top 50% | #speech-synthesis
33 | AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Op | 7.4 | top 50% | #speech-recognition
34 | Spectro-Temporal Interference Confounds Phase Encoding | 7.4 | top 50% | #self-supervised-learning
35 | Teacher-Student Structure for Domain Adaptation in Ense | 7.4 | top 50% | #multimodal-models
36 | SciText2Eq: Assessing LLMs for Explainable Equation Gen | 7.3 | top 50% | #large-language-models
37 | Confidence Score Guided Incremental and Speaker Adaptiv | 7.2 | top 50% | #speech-recognition
38 | Geometrically Constrained Decentralized Independent Vec | 7.2 | top 50% | #speech-separation
39 | Dual-Granularity Orthogonal Disentanglement for General | 7.2 | top 50% | #curriculum-learning
40 | Data-Driven Decoding of Russell's Circumplex Model | 7.2 | top 50% | #speech-emotion-recognition
41 | Connecting Speech to Words through Images | 7.1 | top 50% | #unsupervised-learning
42 | Bridging the Usability Gap: Lessons from Interpreting S | 7.1 | top 50% | #speech-translation
43 | TMASC: Transmasculine Attitude and Speech Corpus | 7.0 | top 50% | -
44 | MUNI: Multimodal Unified Latent Diffusion for Coherent | 6.9 | top 50% | #speech-generation
45 | Decoding while Adapting: Zero-Shot Online Speaker Adapt | 6.8 | top 50% | #speech-recognition
46 | Joycent: Diffusion-based Accent TTS without Accented Ph | 6.8 | top 50% | #speech-synthesis
47 | Semi-Supervised Speech Confidence Detection using Pseud | 6.8 | top 50% | -
48 | Robust Spoofed Speech Detection via Temporal Pyramid Mo | 6.7 | top 50% | #audio-deepfake-detection
49 | From Awareness to Adherence: Bridging the Context Gap i | 6.7 | top 50% | #speech-recognition
50 | ArtBoost: Synthetic Articulatory Data Augmentation for | 6.5 | top 50% | #speech-recognition
51 | DDPO-VC: Speaker De-Identification via Diffusion Denois | 6.5 | top 50% | #voice-conversion
52 | NVMOS: Non-Verbal Vocalization Quality Assessment in Sp | 6.2 | top 50% | #self-supervised-learning
53 | Unifying Acoustic Features and Text with Multimodal LLM | 6.2 | top 50% | #multimodal-models
54 | ROMPAR: Morphological Completion and Demographic Unlear | 6.2 | top 50% | #speech-recognition
55 | EChO-Agent: Evidence Chain Orchestration Agent for Audi | 6.1 | top 50% | #audio-question-answering
56 | Beyond Classification: A Cough Regression Benchmark for | 6.0 | top 50% | #audio-event-detection
57 | Towards Robust Generative Speech Enhancement Using Vect | 5.9 | top 50% | #speech-enhancement
58 | Fast When, Careful Who: Dual-Process Multiparty Turn-Ta | 5.9 | top 50% | #voice-activity-detection
59 | MAF: Multimodal Adaptive Few-shot Prompting for Sentime | 5.9 | top 50% | #multimodal-models
60 | An auscultation location specific study on the relation | 5.8 | top 50% | -
61 | Closed-Loop Triplet Synergistic Generation for Long-For | 5.5 | top 50% | -
62 | LLM-Based Synthetic Ground Truth Generation for Audio-B | 5.3 | bottom 50% | #data-augmentation
📋 Paper list
🥇 TuneJury: An Open Metric for Improving Music Generation Preference Alignment
9.7/10 | Novelty 1.5/2 | Rigor 1.3/1.5 | Experiments 1.4/1.5 | Clarity 1.0/1 | Impact 1.5/1.5 | Open source 1.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1.0/1.5 ...

Who Spoke When in Multi-Conversation: Target Speaker Tagging Task and Benchmark

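📄 Who Spoke When in Multi-Conversation: Target Speaker Tagging Task and Benchmark
#speaker-recognition #benchmark
8.6/10 | Novelty 1.5/2 | Rigor 1.2/1.5 | Experiments 1.5/1.5 | Clarity 1/1 | Impact 1.3/1.5 | Open source 0.8/1.5 | Reproducibility 0.5/0.5 | Engineering 0.8/1.5
🔥 8.6/10 | top 50% | #speaker-recognition | #benchmark | arxiv
👥 Authors and affiliations
Authors: Minjae Lee, Hee-Soo Heo, Youngki Kwon, Han-Gyu Kim, You Jin Kim, Bong-Jin Lee
Affiliations: NAVER Cloud Corporation, NAVER Corporation
💡 Sharp-tongued review
This paper reads like a well-designed "word problem": it pinpoints a real pain point in deploying speaker recognition (having to answer both "who spoke when" and "who is the speaker") and tailors an exam (the TST task) and an exam paper (TST-Bench) to it. Its strengths are a clear problem definition and a thoroughly designed exam paper (large-scale, controllable, with global labels), and its experiments show that "exam technique" (purpose-built system design) beats "plugging into formulas" (stacking modules). However, the "solution method" (the system itself) is mostly a sensible assembly and tuning of existing techniques, so its originality is somewhat lacking. Synthetic data solves the privacy and controllability problems, but its gap from real conversations (read speech vs. dialogue, no natural interruptions or overlaps, and so on) is an "elephant in the room" that bears repeating; the paper discusses it adequately but offers limited remedies. Overall, it is a solid engineering contribution that gives the community a much-needed standardized evaluation platform, but its methodological depth and breadth fall a step short of a "top-venue breakthrough." ...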

Speech/Music/Audio Paper Digest 2026-06-15

Speech/Music/Audio Paper Digest 2026-06-15
26 papers analyzed
⚡ Today's overview: 📥 26 papers fetched → 🔬 in-depth analysis complete
🏷️ Hot topics
Topic | Count | Distribution
#speech-recognition | 4 papers | ████
#speech-synthesis | 4 papers | ████
#speaker-recognition | 3 papers | ███
#data-augmentation | 2 papers | ██
#audio-question-answering | 2 papers | ██
#speech-enhancement | 1 paper | █
#music-information-retrieval | 1 paper | █
#reinforcement-learning | 1 paper | █
📊 Paper score leaderboard (26 papers, sorted by score in descending order)
Rank | Paper | Score | Tier | Main task
🥇 | Listening with Attention: Entropy-Guided Explainability | 9.6 | top 25% | #speech-recognition
🥈 | MaskedFOP: Polyglot Speaker Identification under Missin | 9.2 | top 25% | #speaker-recognition
🥉 | HIDVAS: A Hearing Instrument Dataset in Various Acousti | 9.0 | top 25% | #speech-enhancement
4 | BayLing-Duplex: Native Full-Duplex Speech Dialogue with | 9.0 | top 10% | #speech-synthesis
5 | Moonlight in Latent Space: Chirality and Structural Cor | 8.7 | top 50% | #music-information-retrieval
6 | Who Spoke When in Multi-Conversation: Target Speaker Ta | 8.6 | top 50% | #speaker-recognition
7 | Learning to Hear Hesitation: Continual Learning for Dis | 8.3 | top 25% | #speech-recognition
8 | The Holistic Storage of Verb+Up Phrases in Text-based a | 8.2 | top 50% | #speech-recognition
9 | OmniVideo-100K: A Dataset for Audio-Visual Reasoning th | 8.2 | top 50% | #data-augmentation
10 | Orchestra-o1: Omnimodal Agent Orchestration | 8.1 | top 50% | #reinforcement-learning
11 | Unsupervised Approaches for Global Prosodic Embedding E | 7.8 | top 25% | #speech-synthesis
12 | Instantaneous Pitch Estimation via Wave-U-Net-Based Fun | 7.7 | top 25% | #data-augmentation
13 | A Deep Zero-Inflated Model of North Atlantic Right Whal | 7.6 | top 50% | #probabilistic-graphical-models
14 | FAConformer: Frequency-Aware Convolutional Transformer | 7.5 | top 25% | #Transformer
15 | From Self-Supervised Speech Models to Mixture-of-Expert | 7.5 | top 50% | #self-supervised-learning
16 | The Perceived Fragility of Explanations in Audio Models | 7.5 | top 25% | -
17 | A Multi-Domain Feature Fusion Framework for Generalizab | 7.4 | top 50% | #multimodal-models
18 | AudioDER: A Deduplication-Enhanced Reasoning Dataset fo | 7.3 | top 50% | #audio-question-answering
19 | Beyond task performance: Decoding bioacoustic embedding | 7.1 | top 50% | -
20 | Explainable and Trustworthy Speech Emotion Recognition | 7.0 | top 50% | #speech-emotion-recognition
21 | FoleyGenEx: Unified Video-to-Audio Generation with Mult | 7.0 | top 50% | #speech-synthesis
22 | Spatio-Temporal Audio Language Modeling for Dynamic Sou | 6.9 | top 25% | #audio-question-answering
23 | Mask, Sample, Revise: A Revisable CTMC Inference Stack | 6.8 | top 25% | #speech-synthesis
24 | MoDiCoL: A Modular Diagnostic Continual Learning Datase | 6.5 | top 50% | #speech-recognition
25 | Multimodal Speaker Identification in Classroom Environm | 6.0 | top 50% | #speaker-recognition
26 | Efficiency-Performance Trade-offs in Neural Speaker Dia | 5.1 | bottom 50% | #speaker-diarization
📋 Paper list
🥇 Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models
9.6/10 | Novelty 1.5/2 | Rigor 1.4/1.5 | Experiments 1.5/1.5 | Clarity 1/1 | Impact 1.5/1.5 | Open source 1.0/1.5 | Reproducibility 0.5/0.5 | Engineering 1.2/1.5 ...

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

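📄 Benchmarking Neural Speech Compression from a Rate-Distortion Perspective
#benchmark
9/10 | Novelty 1.5/2 | Rigor 1.4/1.5 | Experiments 1.4/1.5 | Clarity 1/1 | Impact 1.4/1.5 | Open source 0.8/1.5 | Reproducibility 0.5/0.5 | Engineering 1/1.5
🔥 9/10 | top 25% | #benchmark | #benchmark | arxiv
👥 Authors and affiliations
Authors: Jun Xu, Zhengxue Cheng, Fengxi Zhang, Yuhan Liu, Li Song (corresponding author), Wenjun Zhang
Affiliation: School of Information Science and Electronic Engineering, Shanghai Jiao Tong University
💡 Sharp-tongued review
The workload here is solid: the paper gives a valuable survey of the state of neural speech codecs and proposes a concrete method. But the "Benchmarking" framing is somewhat overstated. It reads more like an "improvement" or "method" paper whose core contribution is the proposed ECC model, not a neutral, comprehensive benchmarking platform (the code and unified evaluation framework are not open-sourced). The results are good, but the baselines are mainly released models that may not have been optimized for the same datasets and training settings, which weakens the claim of a "fair benchmark." The innovations (such as entropy skipping) are practical but not breakthroughs in principle. The writing is somewhat long-winded, and the figures could be more intuitive. Overall, it is a competent, even above-average piece of work, but still short of a top-venue landmark paper.
📌 Core summary
Starting from rate-distortion theory, the paper systematically analyzes the "decoupling of representation learning and probabilistic modeling" that is widespread in current neural speech codecs. To address it, the paper first builds a unified framework for learned speech coding and gives a taxonomic analysis of recent mainstream codecs. The authors then propose the Entropy-Constrained Codec (ECC), whose core innovations are: 1) scalar quantization combined with a learnable probabilistic entropy model, trained end to end; 2) channel-wise context modeling with a latent residual prediction mechanism; 3) an entropy skipping mechanism that needs no extra transmitted information, improving coding efficiency. Extensive experiments show that ECC achieves better low-bitrate rate-distortion performance than traditional and neural baselines across multiple public datasets and evaluation metrics.
🔗 Open-source details
Code: the paper provides no link to an ECC code repository, but it does link to open-source implementations of several baseline models.
Model weights: the paper does not mention a specific link for obtaining the ECC model weights.
Datasets:
LibriTTS: used for training and evaluation.
VCTK: used for out-of-domain evaluation.
AISHELL-3: used for cross-lingual generalization evaluation.
(The paper does not provide download links for these datasets, but they are publicly available standard datasets.)
Demo: project page: https://avery-xu.github.io/ECC-demo/
Reproduction materials: the paper provides detailed training configurations and hyperparameters (see Table II of the paper), but no official training scripts or complete configuration files.
Open-source projects cited in the paper (partial):
SoundStream: https://github.com/google/lyra
EnCodec: https://github.com/facebookresearch/encodec
DAC: https://github.com/descriptinc/descript-audio-codec
SNAC: https://github.com/hubertsiuzdak/snac
FunCodec: https://github.com/modelscope/FunCodec
SpeechTokenizer: https://github.com/ZhangXInFD/SpeechTokenizer
Mimi: https://github.com/kyutai-labs/moshi
BigCodec: https://github.com/Aria-K-Alethia/BigCodec
SemantiCodec: https://github.com/haoheliu/SemantiCodec-inference
TAAE: https://github.com/Stability-AI/stable-codec
🏗️ Method overview and architecture
The core idea of ECC is to include the bitrate directly in the training objective as a differentiable term, thereby jointly optimizing the encoder, quantizer, and entropy model to produce latent representations that are easy to compress. ...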
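The idea in the method overview above, a bitrate term sitting next to the distortion in the training objective, can be illustrated with a minimal sketch of a rate-distortion loss of the form D + λR. This is not the paper's code. The function names `rate_bits` and `rd_loss`, the use of mean squared error as the distortion, and the weight `lam` are all assumptions made for illustration. A real codec would compute both terms on tensors in an autodiff framework so that gradients reach the encoder and the entropy model.

```python
import math


def rate_bits(probabilities):
    """Estimated code length in bits: the sum of -log2(p) over the
    probabilities an entropy model assigns to the quantized symbols."""
    return sum(-math.log2(p) for p in probabilities)


def rd_loss(original, reconstruction, probabilities, lam=0.01):
    """Rate-distortion loss D + lam * R (illustrative form, assumed here).

    D is the mean squared error between the two signals and R is the
    estimated bit count from rate_bits().
    """
    distortion = sum(
        (x - y) ** 2 for x, y in zip(original, reconstruction)
    ) / len(original)
    return distortion + lam * rate_bits(probabilities)


# Hypothetical toy values: a perfect reconstruction whose two symbols
# each get probability 0.5 from the entropy model (1 bit per symbol).
loss = rd_loss([1.0, 2.0], [1.0, 2.0], [0.5, 0.5], lam=0.1)
```

Raising `lam` in this sketch penalizes symbols the entropy model finds unlikely, which is the sense in which such an objective favors latents that are cheap to code.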

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

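📄 Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
#benchmark
5.5/10 | Novelty 0.8/2 | Rigor 1/1.5 | Experiments 0.9/1.5 | Clarity 1/1 | Impact 0.8/1.5 | Open source 0/1.5 | Reproducibility 0.3/0.5 | Engineering 0.7/1.5
📝 5.5/10 | top 50% | #benchmark | #benchmark | arxiv
👥 Authors and affiliations
Authors: Cheng-Kuang Chang (co-first author), Kai-Wei Chang (co-first author), Alexander H. Liu, James Glass
Affiliation: MIT CSAIL
💡 Sharp-tongued review
A paper with an interesting angle, extending activation steering from text-only LLMs to multimodal full-duplex models. The core observation of "state inertia" is intuitive and reasonably insightful, and the ZBB benchmark is designed to hit exactly where current models fall short in fine-grained temporal understanding. However, the heart of the method, constructing the perception vector, leans too heavily on heuristically defined states (generation/perception states) and threshold choices, and its "training-free" advantage may be offset in real deployments by the dependence on an energy detector. Experiments cover only three models, and the gains vary by model (Raon-SpeechChat's improvement is large in percentage terms, but its absolute value is too low), so the generality of the conclusions is doubtful. Most regrettably, the paper open-sources no code, models, or datasets, which severely limits its verifiability and impact. Overall, this is an early exploratory work with clear concepts and acceptable experiments, but it lacks in-depth validation and engineering deployment details.
📌 Core summary
The paper studies the delay in internal state transitions that full-duplex spoken language models exhibit when handling user interruptions, which the authors name "state inertia." Analysis of the models' hidden representations reveals a "perception state" aligned with the user input stream and a "generation state" aligned with the model output stream. When an interruption occurs, the transition from the generation state to the perception state lags, causing the model to lose key information early in the user's input. To quantify this, the authors propose a zero-buffer benchmark, which tests a model's instantaneous understanding by placing the key semantic words at the very start of the interrupting utterance. Finally, they propose a fine-tuning-free activation steering method that accelerates the state transition by injecting a "perception vector." Experiments on three open-source FD-SLMs show that the method effectively improves performance on the zero-buffer benchmark.
🔗 Open-source details
Code: the paper does not mention any code repository link. Although it describes in detail the methods and parameters for activation steering, affinity computation, and dataset construction (Appendix A), it provides no code for reproducing these analyses or experiments.
Model weights: the paper provides no download links for the weights of the three evaluated full-duplex spoken language models (PersonaPlex, Moshi, Raon-SpeechChat). It only states that they are open-source models, without specifying versions or where to obtain them.
Datasets: the paper does not say whether the datasets it constructs (turn-taking interaction dataset, interruption analysis dataset, zero-buffer benchmark dataset) are open-sourced or available for download. Appendix A describes in detail how they were created.
Demo: not mentioned in the paper.
Reproduction materials: the paper provides no complete reproduction guide, training scripts, or checkpoints.
Open-source projects cited in the paper (not the paper's own contributions):
Dia2-2B (TTS model): https://huggingface.co/nari-labs/Dia2-2B
Parakeet-TDT-0.6B-v2 (ASR model): https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
Claude Opus 4.5 (LLM used for data generation): mentioned by name only in the paper, with no link provided.
Activation steering references: several prior works are cited, but no specific project links are listed.
🏗️ Method overview and architecture
The paper's method is organized around three levels: problem diagnosis, benchmark construction, and intervention. At its core, it uses the model's hidden representations for analysis and manipulation. ...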
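To make "injecting a perception vector" concrete, here is a minimal sketch of generic activation steering on plain Python lists. The excerpt does not say how the paper builds its vector, so the difference-of-means construction below is an assumption borrowed from common activation-steering practice, not the authors' procedure. The names `mean_vector`, `steering_vector`, `steer`, and the strength `alpha` are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(perception_states, generation_states):
    """Assumed construction: mean hidden state collected while the model
    is 'perceiving' minus the mean collected while it is 'generating'."""
    p = mean_vector(perception_states)
    g = mean_vector(generation_states)
    return [a - b for a, b in zip(p, g)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Hypothetical 2-dimensional hidden states, for illustration only.
v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], v, alpha=0.5)
```

In a real model the addition would be applied to a chosen layer's activations at inference time, for example through a forward hook. Which layer, which frames, and what value of `alpha` to use are choices this sketch leaves open.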

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

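📄 RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark
#benchmark #multimodal-models
9.6/10 | Novelty 1.5/2 | Rigor 1.3/1.5 | Experiments 1.4/1.5 | Clarity 0.9/1 | Impact 1.5/1.5 | Open source 1.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1/1.5
🔥 9.6/10 | top 10% | #audio-question-answering | #benchmark | #multimodal-models | arxiv
👥 Authors and affiliations
The authors come from several institutions, including:
The University of Melbourne: Hongyu Jin, Siyi Wang, Yang Xiao, Jiaheng Dong, Kaiyuan Peng, Eun-Jung Holden, Ting Dang (corresponding author)
Alexandru Ioan Cuza University of Iași: Georgiana Juravle
Wuhan University: Shihong Tan, Gongping Huang
The University of Hong Kong: Shanquan Chen
The University of Auckland: Hong Jia
Monash University: James Bailey
💡 Sharp-tongued review
This paper is like giving audio AI a full "cognitive checkup" rather than just checking whether it can transcribe or classify. The authors cleverly borrow the well-established CHC theory from psychology and decompose evaluation from simple task performance into five abilities: perception, reasoning, memory, efficiency, and knowledge. That is far more principled than the patchwork benchmarks on the market. The large-scale "checkup report" on 26 models does reveal that current LALMs are "lopsided": reciting from memory (knowledge) is fine, but truly understanding complex scenes, remembering long conversations, and thinking both quickly and well are still far off. The findings that reasoning and memory are strongly correlated and that efficiency has little to do with model size are particularly interesting. ...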

Speech/Music/Audio Paper Digest 2026-06-11

Speech/Music/Audio Paper Digest 2026-06-11
36 papers analyzed
⚡ Today's overview: 📥 36 papers fetched → 🔬 in-depth analysis complete
🏷️ Hot topics
Topic | Count | Distribution
#speech-recognition | 7 papers | ███████
#speech-synthesis | 7 papers | ███████
#benchmark | 2 papers | ██
#music-information-retrieval | 2 papers | ██
#speech-emotion-recognition | 2 papers | ██
#low-resource | 1 paper | █
#audio-question-answering | 1 paper | █
#audio-quality-assessment | 1 paper | █
📊 Paper score leaderboard (36 papers, sorted by score in descending order)
Rank | Paper | Score | Tier | Main task
🥇 | Massive Open-Vocabulary Keyword Spotting | 9.8 | top 50% | #speech-recognition
🥈 | Tight Boundary Prediction in Speaker Diarization Using | 9.6 | top 25% | #low-resource
🥉 | RAIL: Rethinking Auditory Intelligence in Large Audio-L | 9.6 | top 10% | #audio-question-answering
4 | Quality Adaptive Angular Margin Learning for Respirator | 9.5 | top 50% | #audio-quality-assessment
5 | CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched | 9.2 | top 50% | #multilingual
6 | Gumbel-BEARD: Automatic Layer Selection for Self-Superv | 9.1 | top 25% | #speech-recognition
7 | PianoKontext: Expressive Performance Rendering from Dea | 9.1 | top 50% | #music-generation
8 | Benchmarking Neural Speech Compression from a Rate-Dist | 9.0 | top 25% | #benchmark
9 | Fast-SDE: Efficient Single-Microphone Sound Source Dist | 8.8 | top 50% | -
10 | Evaluating Bias in Phoneme-Based Automatic Speech Recog | 8.8 | top 50% | #speech-recognition
11 | Real-Time Language Model Jamming: A Case Study for Live | 8.7 | top 25% | #music-information-retrieval
12 | HALO: Half-Frame-Rate Adaptive Learnable Operator for L | 8.4 | top 50% | #speech-enhancement
13 | The Dynamics of Human and AI-Generated Language: How Se | 8.1 | top 25% | #speech-synthesis
14 | UR-BERT: Scaling Text Encoders for Massively Multilingu | 8.1 | top 25% | #speech-synthesis
15 | SARA: A Dual-Stream VAE for High-Fidelity Speech Genera | 7.9 | top 25% | #speech-synthesis
16 | SpAArSIST: Sparsified AASIST for Efficient and Reliable | 7.7 | top 50% | #model-compression
17 | Interpreting and Steering a Text-to-Speech Language Mod | 7.7 | top 25% | #speech-synthesis
18 | Which Speech Representation Better Matches Text-Native | 7.5 | top 50% | #speech-recognition
19 | MA-DLE: Speech-based Automatic Depression Level Estimat | 7.5 | top 25% | #speech-emotion-recognition
20 | The Hidden Cost of Pairwise Verification in Synthetic S | 7.5 | top 50% | #speech-synthesis
21 | Sensitivity Analysis of Generative Spatial Audio Metric | 7.2 | top 50% | #audio-generation
22 | Snapping Matters: Context-Aware Onset Refinement for Au | 7.1 | top 25% | #music-information-retrieval
23 | Feature-Aligned Speech Watermarking for Robustness to R | 7.1 | top 25% | #robustness
24 | Context-Aware Multimodal Claim Verification in Spoken D | 7.1 | top 50% | #multimodal-models
25 | Afrispeech Semantics: Evaluating Audio Semantic Reasoni | 7.0 | top 50% | #dataset
26 | Lung-SRAD: Spectral-Aware Regularized Audio DASS with D | 6.8 | top 50% | #contrastive-learning
27 | Lip Forcing: Few-Step Autoregressive Diffusion for Real | 6.8 | top 50% | #speech-synthesis
28 | Frozen Multimodal Embeddings for Personality and Cognit | 6.7 | top 50% | #speech-emotion-recognition
29 | Fast Speech Foundation Model Distillation Using Interle | 6.6 | top 50% | #knowledge-distillation
30 | Steering Where to Listen: Instruction-Based Activation | 6.5 | top 50% | -
31 | Pretrained self-supervised speech models can recognize | 6.5 | top 50% | #speech-recognition
32 | Towards Data-free and Training-free Compression for Spe | 6.4 | top 50% | #speech-recognition
33 | Additive Noise, Shift Recovery, and Signed Signals in t | 6.1 | top 50% | #signal-processing-fundamentals
34 | I Understand How You Feel: Enhancing Deeper Emotional S | 5.8 | top 50% | #speech-recognition
35 | Overcoming State Inertia in Full-Duplex Spoken Language | 5.5 | top 50% | #benchmark
36 | BadRobot: Jailbreaking Embodied LLM Agents in the Physi | 5.2 | bottom 50% | #speech-synthesis
📋 Paper list
🥇 Massive Open-Vocabulary Keyword Spotting
9.8/10 | Novelty 1.6/2 | Rigor 1.5/1.5 | Experiments 1.5/1.5 | Clarity 1/1 | Impact 0.7/1.5 | Open source 1.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1.5/1.5 ...

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

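📄 GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models
#dataset #benchmark #multilingual #multimodal-models #low-resource
7.9/10 | Novelty 1.5/2 | Rigor 1.2/1.5 | Experiments 1.1/1.5 | Clarity 1/1 | Impact 1.3/1.5 | Open source 0/1.5 | Reproducibility 0.5/0.5 | Engineering 1.3/1.5
✅ 7.9/10 | top 25% | #speech-recognition | #dataset | #benchmark #multilingual | arxiv
👥 Authors and affiliations
Authors: Ryner Tan, Wenxuan Zhang
Affiliation: Singapore University of Technology and Design
💡 Sharp-tongued review
Reviewer: an anonymous top-venue reviewer. The paper targets a real pain point in LALM evaluation, the lack of natural, multilingual, multicultural test scenarios, and that motivation deserves credit. The authors' work on data collection, question design, and quality control also looks quite solid. Still, this is ultimately an "evaluation set" paper rather than a new model or algorithm. In the current era of "benchmark fatigue," the marginal contribution of yet another dataset needs careful weighing. The paper's biggest highlight is perhaps the combination of "naturally occurring audio" and "culturally grounded questions," but the lack of experimental analysis (especially error case analysis) means the advantage of this combination is not fully demonstrated. Overall, this is a safe and necessary piece of work, but it falls short of "exciting" or "groundbreaking." ...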

Speech/Music/Audio Paper Digest 2026-06-10

Speech/Music/Audio Paper Digest 2026-06-10
45 papers analyzed
⚡ Today's overview: 📥 45 papers fetched → 🔬 in-depth analysis complete
🏷️ Hot topics
Topic | Count | Distribution
#speech-recognition | 13 papers | █████████████
#data-augmentation | 3 papers | ███
#self-supervised-learning | 2 papers | ██
#speech-synthesis | 2 papers | ██
#multimodal-models | 1 paper | █
#spoken-dialogue-systems | 1 paper | █
#speech-generation | 1 paper | █
#parameter-efficient-fine-tuning | 1 paper | █
📊 Paper score leaderboard (45 papers, sorted by score in descending order)
Rank | Paper | Score | Tier | Main task
🥇 | ViP-VL: Vietnamese Self-supervised Speech Pretraining M | 9.7 | top 25% | #speech-recognition
🥈 | Spatial-Omni: Spatial Audio Understanding Integration i | 9.4 | top 25% | #multimodal-models
🥉 | Multi-Faceted Interactivity Alignment in Full-Duplex Sp | 9.3 | top 25% | #spoken-dialogue-systems
4 | OmniCap-IF: Benchmarking and Improving Instruction Foll | 9.1 | top 25% | #speech-generation
5 | RAT: Reference-Augmented Training for ASV Anti-Spoofing | 8.8 | top 25% | #data-augmentation
6 | Recovering the Zipfian Distribution in Unsupervised Ter | 8.7 | top 50% | #self-supervised-learning
7 | LLM can Read Spectrogram: Encoder-free Speech-Language | 8.6 | top 25% | #speech-recognition
8 | ParaBridge: Bridging Paralinguistic Perception and Dial | 8.6 | top 25% | #parameter-efficient-fine-tuning
9 | Time-frequency localization of bird calls in dense soun | 8.5 | top 25% | #signal-processing-fundamentals
10 | Ethical and Technical Limits of Deepfake Speech Dataset | 8.4 | top 25% | -
11 | Speech Meets ELF: Audio Conditional Continuous-Target D | 8.3 | top 25% | #speech-recognition
12 | DeRA-MOS: Optimizing Text-to-Music Evaluation via Decou | 8.2 | top 25% | #music-evaluation
13 | Anchoring the Unknown: Open-Set Model Attribution via P | 8.0 | top 25% | #multilingual
14 | ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refi | 8.0 | top 25% | #speech-quality-assessment
15 | ContextCodec: Content-Focused Context Guidance for Ultr | 7.9 | top 25% | #speech-coding
16 | GlobeAudio: A Multilingual Multicultural Benchmark for | 7.9 | top 25% | #speech-recognition
17 | Dual-Branch Gated Fusion for Open-Set Audio Deepfake So | 7.8 | top 25% | #audio-deepfake-detection
18 | Data Journalist Agent: Transforming Data into Verifiabl | 7.7 | top 25% | -
19 | GC-LoRA: Gated Convolutional LoRA for Parameter-Efficie | 7.6 | top 25% | #speech-recognition
20 | What Do Deepfake Speech Detectors Actually Hear? | 7.6 | top 25% | -
21 | KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyw | 7.6 | top 25% | #keyword-spotting
22 | Entropy-Aware Domain-Routed Mixture-of-Experts Speech-L | 7.5 | top 25% | #speech-recognition
23 | Linguistically Augmented Audio Speech Data (LinguAS) | 7.5 | bottom 50% | #speech-forgery-detection
24 | AudioProcessBench: Benchmark for Identifying Process Er | 7.5 | top 50% | -
25 | Cross-Modal Knowledge Distillation without Paired Data: | 7.5 | top 50% | #speech-recognition
26 | AuRA: Internalizing Audio Understanding into LLMs as Lo | 7.5 | top 25% | #spoken-question-answering
27 | TRADE: Transducer-Augmented Decoder for Speech LLM | 7.4 | top 25% | #speech-recognition
28 | Inside the Latent Flow: Causal Deciphering of Attention | 7.3 | top 50% | #speech-separation
29 | Optimality of FSQ Tokens for Continuous Diffusion for C | 7.3 | top 50% | #speech-synthesis
30 | Speech Encoder Fusion for LLM-based Automatic Speech Re | 7.2 | bottom 50% | #speech-recognition
31 | Enhancing Multilingual LLM-based ASR with Mixture of Ex | 7.0 | top 50% | -
32 | Phoneme-First Prediction for LLM-Based Speech Recogniti | 6.9 | top 50% | #speech-recognition
33 | Profy: Interpretable Visualization of Expertise-Depende | 6.9 | top 50% | #music-information-retrieval
34 | Optimizing 2D Input Representations and Sub-phase Fusio | 6.8 | top 50% | #data-augmentation
35 | SSL-GMMVC: Interpretable Voice Conversion via Locally L | 6.8 | top 50% | #voice-conversion
36 | Deploying Speech-Driven 3D Facial Animation in Unreal E | 6.6 | top 50% | #speech-synthesis
37 | RespiraMFM: A Multimodal Foundation Model with Contrast | 6.5 | top 50% | #contrastive-learning
38 | From Senses to Decisions: The Information Flow of Audit | 6.5 | top 50% | #speech-recognition
39 | Speaker Group Encoding in Self-supervised Speech Recogn | 6.5 | top 50% | #speech-recognition
40 | Towards Robust Arabic Speech Emotion Recognition with D | 6.4 | top 50% | #speech-emotion-recognition
41 | Multilingual Word-Level Forced Alignment with Self-Supe | 6.3 | top 50% | #self-supervised-learning
42 | Overview of ESDD2: Environment-Aware Speech and Sound D | 6.3 | top 50% | #data-augmentation
43 | Towards Deep Contextual Reasoning from Broad Descriptio | 6.2 | top 50% | #speech-recognition
44 | A Lightweight Dual-Factor Acoustic Authentication Syste | 6.0 | top 50% | #speaker-verification
45 | Automated Pronunciation Evaluation for Korean Toddler S | 6.0 | top 50% | #speaker-diarization
📋 Paper list
🥇 ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning
9.7/10 | Novelty 1.5/2 | Rigor 1.3/1.5 | Experiments 1.3/1.5 | Clarity 1/1 | Impact 1.1/1.5 | Open source 1.5/1.5 | Reproducibility 0.5/0.5 | Engineering 1.5/1.5 ...