声源定位 | 语音/音乐/音频论文速递

ForestIR: Physics-Informed Forest Sound Simulation for Array-Based Bioacoustic Remote Sensing

📄 ForestIR: Physics-Informed Forest Sound Simulation for Array-Based Bioacoustic Remote Sensing #声源定位 7.2/10 | 创新 1/2 | 严谨 1/1.5 | 实验 0.8/1.5 | 清晰 0.8/1 | 影响 0.6/1.5 | 开源 1.5/1.5 | 复现 0.5/0.5 | 工程 1/1.5 ✅ 7.2/10 | 前50% | #声源定位 | #声源定位 | arxiv 👥 作者与机构第一作者：Xin Shen（杜克大学统计科学系）通讯作者：Xin Shen（杜克大学统计科学系）、Otso Ovaskainen（于韦斯屈莱大学生物与环境科学系）、David B. Dunson（杜克大学统计科学系）作者列表：Xin Shen（杜克大学统计科学系）、Jennifer N. Kampe（杜克大学统计科学系；于韦斯屈莱大学生物与环境科学系）、Changwoo J. Lee（杜克大学统计科学系）、Braden Scherting（杜克大学统计科学系）、Panu Somervuo（赫尔辛基大学生物与环境科学学院）、Ari Lehtiö（于韦斯屈莱大学数字服务部）、Sandro von Brandenburg（于韦斯屈莱大学数字服务部）、Ossi Nokelainen（于韦斯屈莱大学生物与环境科学系；于韦斯屈莱大学开放科学中心）、Otso Ovaskainen（于韦斯屈莱大学生物与环境科学系）、David B. Dunson（杜克大学统计科学系） 💡 毒舌点评本文为森林声源定位任务量身打造了一个物理驱动模拟器，将树干散射、地面反射、枝叶散射以及大气声速变化等要素打包成可组件化配置的管线，并给出了与野外实测较为一致的衰减曲线和鸟叫相似度——这是少数敢于用真数据验证的模拟工具。但模型底层依然是经典几何声学加单次散射的老套路，地面反射只靠常数乘子，波效应和多次散射一概忽略，这使得它更像一个“实验室沙盘”而非高保真数字孪生，在复杂林分条件下的外推性存疑。模拟验证也仅在无树的冬季雪地上进行，其对真实森林环境的保真度仍未得到严格检验。 ...

Learning-based Physics-Constrained Neural Kernel for Sound Field Estimation With Source-Position-Dependent Directional Weighting

📄 Learning-based Physics-Constrained Neural Kernel for Sound Field Estimation With Source-Position-Dependent Directional Weighting #声源定位 #空间音频 #低资源 #预训练 5.2/10 | 创新 1/2 | 严谨 1.2/1.5 | 实验 0.6/1.5 | 清晰 0.6/1 | 影响 0.8/1.5 | 开源 0/1.5 | 复现 0.1/0.5 | 工程 0.9/1.5 📝 5.2/10 | 后50% | #声源定位 | #预训练 | #空间音频 #低资源 | arxiv 👥 作者与机构第一作者：Mattia Marella（National Institute of Informatics, Tokyo, Japan / University of Ferrara, Ferrara, Italy）通讯作者：未明确标注，推测为Shoichi Koyama（同为NII，且为项目资助获得者）全部作者：Mattia Marella（NII / Univ. Ferrara）、Shoichi Koyama（NII） 💡 毒舌点评这篇文章试图用一个直白且合理的想法——把源位置喂进INR让方向权重学会跨源共享——来解决物理约束神经核单快照过拟合的问题。想法本身没有毛病，方向权重朝向镜像源聚焦的可视化也算亮点。但通篇实验在一个玩具级的模拟房间里打转，声称可推广到“practical measurements”却毫无实测数据支撑，跨房间泛化更是只字不提，这跟只在MNIST上验证一个声称能解决通用视觉问题的方法有什么本质区别？致命的是，代码、模型、数据一概没有，训练细节缺失到让人怀疑作者自己能不能把实验复现出来。放在NeurIPS/ICML的bench上，这篇工作目前的状态顶多算个workshop poster。 ...

语音/音乐/音频论文速递 2026-07-08

语音/音乐/音频论文速递 2026-07-08 共分析 26 篇论文 ⚡ 今日概览 📥 抓取 26 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #语音属性识别 3篇 ███ #音频分类 3篇 ███ #语音合成 3篇 ███ #语音识别 3篇 ███ #声源定位 2篇 ██ #音乐生成 2篇 ██ #语音交互 1篇 █ #音频事件检测 1篇 █ 📊 论文评分排行榜（26 篇，按分数降序）排名论文总分分档主任务 🥇 Hierarchical Acoustic-Semantic Modeling: Modality Separ 9.2分前10% #语音交互 🥈 Propose and Attend: Training-free MLLM Grounding Confid 8.2分前25% #音频事件检测 🥉 Music I Care About: Automated Multimodal Benchmarking o 7.8分前25% #音乐理解 4. Escaping the Procrustean Bed: Groupwise Orthogonal Conn 7.8分前25% #语音属性识别 5. TriA Pipeline: A Large-Scale Automatic Audio Annotation 7.4分前50% #音频分类 6. InsideSSL: Understanding Self-Supervised Speech Represe 7.4分前50% #语音属性识别 7. Precise Video-to-Audio Generation with Cross-Modal Alig 7.4分前50% #音视频生成 8. WordVoice: Explicit and Decoupled Multi-Dimensional Wor 7.2分前50% #语音合成 9. ForestIR: Physics-Informed Forest Sound Simulation for 7.2分前50% #声源定位 10. Uncovering Latent Depression Severity for Binary Depres 7.0分前50% #音视频理解 11. Determinantal point process sampling for bioacoustic ac 6.9分前50% #音频分类 12. From Sinhala to Dhivehi: Cross-Lingual Transfer Learnin 6.6分前50% #语音识别 13. Goodbye Equal Error Rate, Hello Local Information Discl 6.5分前50% #语音转换 14. BlueMagpie-TTS: A Token-Efficient Tokenizer, Language M 6.5分前50% #语音合成 15. Fréchet Distance Loss on Speech Representations for Tex 6.5分前50% #语音合成 16. NAVER LABS System Re-implementation for the IWSLT 2026 6.4分前50% #语音翻译 17. Few-Shot Class-Incremental Audio Classification Using P 6.3分前50% #音频分类 18. Gemma 4 Technical Report 6.2分前50% #语音识别 19. Revisiting the Relation Between Language Model Perplexi 6.0分前50% #语音识别 20. Multimodal Video-to-Music Recommendation via Semantic R 5.4分后50% #音乐检索 21. Designing Maintainable Hybrid Generative Systems: A Qua 5.3分后50% #音乐生成 22. Learning-based Physics-Constrained Neural Kernel for So 5.2分后50% #声源定位 23. Distributed Multichannel Wiener Filtering for Topology- 5.1分后50% #语音增强 24. Flow Matching-Based Speech Source Separation with Best- 4.9分后50% #语音分离 25. Umm… With Transformers? Insights from Filled Pause Us 4.8分后50% #语音属性识别 26. From Textural Counterpoint to Feature Encoding: A Multi 2.1分后50% #音乐生成 📋 论文列表 🥇 Hierarchical Acoustic-Semantic Modeling: Modality Separation and Semantic Coherence for Full-Duplex SLMs 9.2/10 | 创新 1.8/2 | 严谨 1.2/1.5 | 实验 1.1/1.5 | 清晰 0.8/1 | 影响 1.3/1.5 | 开源 1.2/1.5 | 复现 0.3/0.5 | 工程 1.5/1.5 ...

语音/音乐/音频论文速递 2026-07-07

语音/音乐/音频论文速递 2026-07-07 共分析 58 篇论文 ⚡ 今日概览 📥 抓取 58 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #语音识别 11篇 ███████████ #语音伪造检测 5篇 █████ #音频理解 4篇 ████ #语音交互 3篇 ███ #音频事件检测 3篇 ███ #语音转换 3篇 ███ #音视频理解 3篇 ███ #语音合成 3篇 ███ 📊 论文评分排行榜（58 篇，按分数降序）排名论文总分分档主任务 🥇 Doppelganger: Sound Effects and Their Synthetic Twins 9.1分前10% #音频检索 🥈 SPEARBench: A Benchmark for Naturalness Evaluation in S 8.9分前25% #语音交互 🥉 Metronome: Bound the Cache, Keep the Beat for Real-Time 8.7分前25% #语音交互 4. Auto-AEG: Scalable Data Construction for Open-Vocabular 8.3分前25% #音频事件检测 5. RABBiT: Rapidly adaptive BOLD foundation model via brai 8.1分前25% #音频理解 6. TRACE-EVC: Text-Guided Relative Affective Control for Z 8.0分前25% #语音转换 7. Parallelized Autoregressive Decoding for Omni-Modal Den 8.0分前25% #音视频理解 8. Speaker-Disentangled Chunk-Wise Regression for Syllabic 7.9分前25% #语音编码 9. Speaker-Aware Temporal Aggregation Strategies on Segmen 7.9分前25% #语音属性识别 10. REDDIT: Correcting Model-Generated Timestamp Drift in A 7.8分前25% #语音识别 11. Deriving Benchmarking Datasets from Long-Form Recording 7.7分前25% #基准测试 12. ProPS: Prompted Profile Synthesis for Natural Language- 7.6分前25% #语音合成 13. DELTA-TTS: Adapting Autoregressive Model into Diffusion 7.5分前25% #语音合成 14. TokAN: Accent Normalization Using Self-Supervised Speec 7.5分前25% #语音转换 15. Listen, Think, Transcribe: Continuous Latent Test-Time 7.5分前25% #语音识别 16. \(C^3\)ASD: Multi-Level Consistency-Driven Representation 7.5分前25% #音视频理解 17. Training-Free Model Selection and Domain-Aware Score Ca 7.3分前50% #音频事件检测 18. CHILDES-Aligned: A Curated Children's Speech Datase 7.2分前50% #语音识别 19. Taste-aware music retrieval from audio embeddings 6.9分前50% #音乐检索 20. Lights, Camera, Carbon: Architectural Scaling Laws for 6.9分前50% #音视频生成 21. Unified Audio Intelligence Without Regressing on Text I 6.8分前50% #音频交互 22. Ranking the Impact of Contextual Specialization in Neur 6.7分前50% #语音增强 23. SynSFX: Multi-Model Sound Effects Synthesis Dataset for 6.5分前50% #音频伪造检测 24. Evaluating the Effect of Linguistic Relatedness on Cros 6.5分前50% #语音识别 25. MOSAIC: Interpretable Multi-Token Cross-Attention of Bi 6.3分前50% #语音伪造检测 26. CARD: Cross-component Audio Representation Distillation 6.3分前50% #音频字幕生成 27. Probing Low-Level Acoustic Attribute Encoding in CLAP A 6.2分前50% #音频理解 28. Trajectory Variance: AnUnsupervised Measure of Developm 6.2分前50% #音频理解 29. Adaptive Diversity-Uncertainty Active Learning with Red 6.2分前50% #音频事件检测 30. Adaptive Loss Balancing for Multi-Task Bioacoustic Clas 6.1分前50% #音频分类 31. An Intervention-Based Framework for Shortcut Diagnosis 6.1分前50% #语音伪造检测 32. QuaSR: Quality-Aware Sample Reweighting for Pacific Ind 6.0分前50% #语音识别 33. CaReCoS: A Spectrogram based Visual Benchmark for Cardi 6.0分前50% #音频理解 34. Open-Set Source Tracing as Compositional Factors via St 6.0分前50% #语音伪造检测 35. Context-Aware ASR for Mandarin Technical Lectures 6.0分前50% #语音识别 36. Streaming Neural Speech Codecs through Time-Invariant R 6.0分前50% #语音编码 37. Physiological Noise Augmentation Improves Non-Invasive 6.0分前50% #语音识别 38. DuplexChat: Constructing Speaker-Separated Full-Duplex 5.9分前50% #语音交互 39. Noisy Environment Adaptation of Neural Speech Codec via 5.9分前50% #语音增强 40. NouveauVoice: Generating Novel Pseudo Speakers for Voic 5.9分前50% #语音转换 41. OmniFocus: Query-Guided Modality-Balanced Token Compres 5.9分前50% #音视频问答 42. Jointly Improving Dialect Identification and ASR in Ind 5.8分前50% #语音识别 43. S-DiverSe: Spanish Diverse Speech 5.8分前50% #语音识别 44. Towards Robust Uncertainty-Aware Speaker Modeling 5.7分前50% #说话人验证 45. Towards Language-Agnostic Speech Inversion 5.6分前50% #语音属性识别 46. Layer-wise Cross-Lingual Depression Detection from Spee 5.5分前50% #语音情感识别 47. Wan-Streamer v0.2: Higher Resolution, Same Latency 5.4分后50% #音视频交互 48. Mixture-Constrained Max Pooling Improves Separation-Bas 5.3分后50% #音频分类 49. Reinforcement Learning for Data-Efficient Code-Switched 5.3分后50% #语音识别 50. Physics-Informed Direction-of-Arrival Estimation Over D 5.3分后50% #声源定位 51. Sampling Bias Compensation for Robust Evaluation of Aud 4.9分后50% #音频分类 52. UniSkip-Mamba: A Frequency-Aware State Space Model for 4.8分后50% #音视频理解 53. Progressive Refinement: An Iterative Pseudo-Labeling Ap 4.6分后50% #语音识别 54. Weakly Guided and Autoregressive Beamformer Parameteriz 4.3分后50% #语音分离 55. DETECT-3B-Omni is Agnostic of Content and Demographics 4.2分后50% #语音伪造检测 56. Towards Digital Preservation of Efik: TTS for a Low-Res 4.0分后50% #语音合成 57. Quantum-Inspired Harmonic Decision Models: A Computatio 2.3分后50% #音乐生成 58. Information-Geometric Superposed Vowel Evaluation: Part 1.9分后50% #语音伪造检测 📋 论文列表 🥇 Doppelganger: Sound Effects and Their Synthetic Twins 9.1/10 | 创新 1.5/2 | 严谨 1.4/1.5 | 实验 1.4/1.5 | 清晰 0.9/1 | 影响 0.8/1.5 | 开源 1.5/1.5 | 复现 0.5/0.5 | 工程 1.1/1.5 ...

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

📄 JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments #声源定位 #多模态模型 #空间音频 #参数高效微调 #数据集 8.1/10 | 创新 1.5/2 | 严谨 1.2/1.5 | 实验 1.2/1.5 | 清晰 0.8/1 | 影响 1/1.5 | 开源 1/1.5 | 复现 0.4/0.5 | 工程 1/1.5 🔥 8.1/10 | 前25% | #声源定位 | #多模态模型 | #空间音频 #参数高效微调 | arxiv 👥 作者与机构第一作者：Zhan Liu（清华大学、腾讯AI Lab）通讯作者：Chao Zhang（清华大学）作者列表：Zhan Liu（清华大学、腾讯AI Lab）、Changli Tang（清华大学）、Yuxin Wang（香港科技大学）、Zhiyuan Zhu（浙江大学）、Youjun Chen（香港中文大学）、Yiwen Shao（腾讯AI Lab）、Tianzi Wang（腾讯AI Lab）、Lei Ke（腾讯AI Lab）、Zengrui Jin（清华大学）、Chao Zhang（清华大学） 💡 毒舌点评本文提出了在3D模拟物理环境中进行联合音视频定位与推理的框架 JAEGER，其核心贡献 Neural IV 和 SpatialSceneQA 数据集为空间音频理解研究提供了有价值的工具和基准。亮点在于系统性整合了 RGB-D 视觉与多通道 FOA，并在附录中通过 SimpleFuse 基线实验初步证明了其架构设计的有效性，而非仅依赖于多模态输入的堆砌。然而，实验设计存在明显的“避重就轻”：正文主表（Table 2）回避了 SimpleFuse 基线，将其置于附录，这使得核心主张——即架构的优越性——在主叙述中缺乏最直接的量化支撑。此外，3D 视觉接地任务中，专门针对 3D 的模型 N3D-VLM 竟获得 0.0 IoU，这一零样本、无适配的对比方式极不公正，更像是对基线的“处决”而非“比较”。更严重的是，多说话人推理任务在正文中汇报了接近 100% 的准确率，营造出任务已被解决的假象，而论文在附录中承认，当干扰项增至 4-6 个时性能迅速下降，这种对任务天花板效应（ceiling effect）的深度分析本应是正文的核心内容，却被掩盖于近乎完美的数字之下。 ...

Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition

📄 Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition #空间音频 #声源定位 #语音增强 #语音识别 4.1/10 | 创新 0.8/2 | 严谨 0.6/1.5 | 实验 0.4/1.5 | 清晰 0.8/1 | 影响 0.7/1.5 | 开源 0/1.5 | 复现 0/0.5 | 工程 0.8/1.5 📝 4.1/10 | 后50% | #声源定位 | #空间音频 | #语音增强 #语音识别 | arxiv 👥 作者与机构第一作者：Pengyuan Shao（University College London, Department of Computer Science）通讯作者：未明确说明，根据作者顺序推断为 Dimitrios Kanoulas（University College London, Department of Computer Science）作者列表：Pengyuan Shao（University College London, Department of Computer Science）、Dimitrios Kanoulas（University College London, Department of Computer Science） 💡 毒舌点评这篇综述选题有现实意义，试图将空间语音感知系统的三大组件进行统一综述，但在顶会级别看来，其贡献仅停留在文献整理和概念归纳层面。全文没有任何定量元分析、方法对比实验或新基准/工具，不发布数据集也不开源代码。所谓的"系统级评价"、“语义可靠性"等概念始终停留在愿景，缺乏可操作的量化定义或评测方案。对于希望直接拿来评估或改进自己系统的研究者而言，这篇综述提供不了太多硬核见解。 ...

Speaker head orientation estimation with a single microphone array using phase spectrogram features

📄 Speaker head orientation estimation with a single microphone array using phase spectrogram features #声源定位 #端到端 #多通道 #鲁棒性 #数据集 5.8/10 | 创新 1.2/2 | 严谨 1/1.5 | 实验 1/1.5 | 清晰 0.7/1 | 影响 0.6/1.5 | 开源 0/1.5 | 复现 0.3/0.5 | 工程 1/1.5 📝 5.8/10 | 前50% | #声源定位 | #端到端 | #多通道 #鲁棒性 | arxiv 👥 作者与机构第一作者：Balint Turi（坦佩雷大学，未在论文中明确标注）通讯作者：未明确说明作者列表：Balint Turi、Archontis Politis、Parthasaarathy Sudarsanam、Tuomas Virtanen（均来自坦佩雷大学，音频信号处理领域） 💡 毒舌点评这项工作用高维STFT相位替代传统手工特征来估计说话人头朝向，配合仿真预训练与真实微调的范式，在多种噪声条件下确实稳定地甩开了之前的基线。然而，全文除了给出一个粗略的模型架构和部分超参数外，没有提供任何代码、权重或可直接使用的数据集；最关键的网络组件消融实验完全缺失，所谓“SOTA”的可复现性和可靠性因此大打折扣。此外，对推理延迟、模型大小、阵列拓扑变化等工程关键问题只字未提，使一项号称面向实际部署的工作显得有些不够落地。 📌 核心摘要问题：使用单个小型麦克风阵列（如6通道、半径4.5cm的环形阵）估计说话人在混响室内的水平朝向（0°–360°），要求泛化到未知说话人、未知房间和多种噪声环境。方法核心：以各通道STFT相位（经sin/cos编码消除±π不连续性）堆叠为高维多通道特征，送入由2D CNN（空间下采样）、双向GRU（时序建模）和多头自注意力（全局上下文）组成的端到端网络，最终在单位圆上回归 [cosθ, sinθ] 并用 atan2 恢复连续角度。新颖性：首次将高维STFT相位作为头朝向估计的唯一输入特征，证明其在表达声源方向性方面优于人工特征（ILD/ITD等）和原始波形；并采用“大规模仿真预训练+少量真实数据微调”的跨域策略，解决了高维特征在真实标注稀缺场景下的学习问题。实验结果：在仿真混响干净条件下MAE=19.9°，0–10 dB强噪声下MAE=29.5°，远优于基于原始波形的44.8°/75.1°和基于ITD/ILD的52.7°/82.8°。在真实数据（8方向分类）上，预训练+微调达到73.2%准确率，超过DoV基线（65.4%）。用户+房间个性化微调后MAE可降至11.3°。混响对STFT相位方法反而有利，误差分布更均匀。实际意义：为资源受限的智能音箱、会议系统、驾驶员监控等场景提供了一种硬件要求低、对噪声和混响鲁棒的纯音频头朝向感知方案，支持用户级个性化适配。主要局限：（1）零样本跨说话人/跨房间的泛化能力仍显不足，个性化微调提升巨大从反面说明了这一点；（2）无任何开源资源（代码/模型/数据），可复现性极差；（3）缺少对网络各组件（CNN、GRU、Attention）的消融实验以及对不同阵列拓扑、麦克风失效、动态朝向等工程边界条件的分析；（4）未评估推理延迟与计算开销。 🔗 开源详情代码：未提供任何代码链接，文中无相关声明。模型权重：未提供。数据集：使用了剑桥VCTK语料库、WHAM噪声数据集和文献[3]中的公开8方位真实录音数据集。论文仅给出了引用，未提供数据集的直接下载、预处理脚本或生成的仿真数据集。 Demo：未提及。复现材料：未提供详细训练配置文件、模型定义或实验记录。论文中引用的开源项目：Pyroomacoustics（https://github.com/LCAV/pyroomacoustics） 🏗️ 方法概述和架构系统流程由语音活动检测（VAD）、特征提取和深度神经网络回归三部分组成。输入为单说话人的一段多通道语音（最多3秒），首先通过文献[7]中的VAD模块去除首尾静音段，仅保留活动语音帧。 ...

语音/音乐/音频论文速递 2026-07-03

语音/音乐/音频论文速递 2026-07-03 共分析 31 篇论文 ⚡ 今日概览 📥 抓取 31 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #音频分类 4篇 ████ #声源定位 4篇 ████ #语音识别 4篇 ████ #语音交互 3篇 ███ #语音合成 3篇 ███ #音视频理解 2篇 ██ #语音增强 2篇 ██ #音乐理解 1篇 █ 📊 论文评分排行榜（31 篇，按分数降序）排名论文总分分档主任务 🥇 Unlocking Speech-Text Compositional Powers: Instruction 8.5分前25% #语音交互 🥈 Decomposer: Learning to Decompile Symbolic Music to Pro 8.4分前25% #音乐理解 🥉 A global predicted-fMRI drive signal from TRIBE does no 7.7分前25% #音视频理解 4. Cross Domain Few-Shot Class-Incremental Audio Classific 7.4分前50% #音频分类 5. Self-Supervised Test-Time Tuning for Packet Loss Concea 7.4分前50% #音频修复 6. Reasoning LLM Improves Speaker Recognition in Long-form 7.2分前50% #音视频理解 7. SelectTSL: Prompt-Guided Selective Target Sound Localiz 7.1分前50% #声源定位 8. Enhancing Acoustic-to-Articulatory Inversion with Multi 7.0分前50% #语音交互 9. TurnNat: Automatic Evaluation of Turn-Taking Naturalnes 7.0分前50% #语音交互 10. Audio-Based Understanding of Audiobook Narration Appeal 6.9分前50% #语音属性识别 11. H-SAGE: Holistic Speaker-Aware Guided Experts for MoE-b 6.9分前50% #语音识别 12. An Efficient vLLM-Based Inference Pipeline for Unified 6.8分前50% #语音合成 13. Few-Shot Open-Set Audio Classification Using Attention 6.8分前50% #音频分类 14. Beyond Words: Towards Effective Modeling of Non-Verbal 6.4分前50% #语音识别 15. LMPAN: A Lightweight Multi-Path Alignment Network for J 6.2分前50% #语音增强 16. NAVER LABS Europe Submission to the Instruction-followi 6.2分前50% #语音翻译 17. Pmeta-TLA: Backdoor Attacks for Speech Classification M 6.0分前50% #语音唤醒 18. Neural Audio Codec with Adjustable Token Temporal Resol 5.8分前50% - 19. SPARCLE: SPeaker-aware Aligned Representations via Cont 5.8分前50% #语音合成 20. Speaker head orientation estimation with a single micro 5.8分前50% #声源定位 21. Towards a Phonology-Informed Evaluation of Multilingual 5.7分前50% #语音质量评估 22. Rethinking Speech-LLM Integration for ASR: Effective Jo 5.6分前50% #语音识别 23. RT-Tango: Real-Time Distributed Binaural Speech Enhance 5.5分前50% #语音增强 24. Quantifying the Uncertainty of Blindly Estimated Room E 5.2分后50% #音频检索 25. CNN Models for Microphone Array Covariance Matrix Upsam 5.0分后50% #声源定位 26. A Multi-Branch Hierarchy-Aware Framework for Heterogene 4.9分后50% #音频分类 27. From Monolingual to Multilingual: Evaluating Mamba for 4.8分后50% #语音识别 28. DRL-CLBA: A Clean Label Backdoor Attack for Speech Clas 4.7分后50% #音频分类 29. Spatial Speech Perception Systems: A Survey of Sound So 4.1分后50% #声源定位 30. UT-AISTimprt submission for ICME 2026 Grand Challenge o 4.1分后50% #音乐生成 31. Using embeddings to predict spoken word duration and pi 4.0分后50% #语音合成 📋 论文列表 🥇 Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning 8.5/10 | 创新 1.6/2 | 严谨 1.3/1.5 | 实验 1.3/1.5 | 清晰 0.8/1 | 影响 1.2/1.5 | 开源 1.1/1.5 | 复现 0.4/0.5 | 工程 0.8/1.5 ...

What Do Neural Networks Learn for TDOA Estimation? A Cross-Architecture Probing Study

📄 What Do Neural Networks Learn for TDOA Estimation? A Cross-Architecture Probing Study #声源定位 7.7/10 | 创新 1.5/2 | 严谨 1.2/1.5 | 实验 1.3/1.5 | 清晰 1/1 | 影响 0.6/1.5 | 开源 1/1.5 | 复现 0.5/0.5 | 工程 0.6/1.5 ✅ 7.7/10 | 前50% | #声源定位 | #声源定位 | arxiv 👥 作者与机构作者：Kang, Wang, Shi, Ashizawa, Yen, Nakadai (注：原文作者列表中包含 Yaozhong Jiang, Runwu, Takeshi, Benjamin, Kazuhiro，但署名单位一致) 机构：Department of Systems and Control Engineering, Institute of Science Tokyo, Japan ...

语音/音乐/音频论文速递 2026-06-23

语音/音乐/音频论文速递 2026-06-23 共分析 83 篇论文 ⚡ 今日概览 📥 抓取 83 篇 → 🔬 深度分析完成 🏷️ 热门方向方向数量分布 #语音识别 19篇 ███████████████ #语音合成 14篇 ██████████████ #音乐生成 3篇 ███ #说话人验证 3篇 ███ #语音增强 3篇 ███ #对比学习 2篇 ██ #自监督学习 2篇 ██ #音频水印 2篇 ██ 📊 论文评分排行榜（83 篇，按分数降序）排名论文总分分档主任务 🥇 CoughPhase-CLR: Designing an acoustics-informed foundat 10.0分前10% #对比学习 🥈 Libretto: Giving LLM Agents a Sense of Musical Structur 9.2分前50% #音乐生成 🥉 Speaker Identity in Non-Verbal Vocalizations: Condition 9.1分前25% #说话人验证 4. PHAST-Net: Attention-Guided, Physics-Informed Network f 9.0分前10% #音乐信息检索 5. Domain-incremental audio classification using domain-sp 9.0分前50% #音频分类 6. MSU-Bench: Towards Speaker-Centric Understanding in Con 9.0分前10% - 7. How Well Do Self-Supervised Speech Models Encode Age an 9.0分前50% #自监督学习 8. CAAD: Contrastive Audio-Aware Distillation for Efficien 8.9分前25% #语音识别 9. STAR-VAE: Structured Topology-Aware Regularization for 8.8分前25% #音频生成 10. An Evaluation Framework for Text-to-Speech Voice Recons 8.8分前25% #语音合成 11. An Analysis of Untrained Deep Reservoir Networks for Au 8.8分前50% #音频事件检测 12. Towards Detecting Neural Audio Codec Synthesized Heart 8.7分前50% #自监督学习 13. Bridging the Age Gap: Towards Detecting Neural Audio Co 8.6分前50% #语音伪造检测 14. ATCCaps: A Call-Sign-Aware Speech Dataset for Air Traff 8.6分前25% #语音识别 15. InstructFX2FX: A Multi-turn Text-to-Preset Demo for Ite 8.6分前50% #对比学习 16. When EER Hides Deployment Failure: Auditing Threshold T 8.6分前25% - 17. CapRiCorn-1K: A Comprehensive Benchmark for Video Capti 8.6分前50% #语音识别 18. Compiling Differentiable Audio Graphs to Real-Time DSP 8.5分前25% - 19. Improving Text-to-Music Generation with Human Preferenc 8.5分前50% #音乐生成 20. Don't Listen to Me: A Lightweight, Low-Latency Mode 8.4分前50% #语音增强 21. HALAS: A Human-Annotated Dataset of Hallucinations of M 8.4分前50% #语音识别 22. Benchmarking Large Language Models for Grapheme-to-Phon 8.4分前25% #语音合成 23. Cross-lingual Retrieval-Augmented Classification for Dy 8.4分前25% #语音识别 24. Bagpiper-TTS: Natural Language Guided Universal Speech 8.4分前25% #语音合成 25. Using Phonological-Level Wav2Vec2 for Mandarin Automati 8.3分前25% #语音识别 26. Word Lengthening as a Function of Utterance Position: A 8.1分前25% #语音合成 27. LambdaMark: Semantic Audio Watermarking for Robustness 8.0分前25% #音频水印 28. OpenWER: Improving Cross-Lingual ASR Evaluation and Ena 8.0分前50% #语音识别 29. AudioCALM: Continuous Autoregressive Language Modeling 7.9分前25% #语音合成 30. AOR-Bench: Do Large Audio Language Models Over-Refuse P 7.9分前50% #音频问答 31. Gradient-Based Learning of Parametric Engine Sound Repr 7.8分前50% #参数高效微调 32. Toward Open-Set Speaker Attribute Prediction with Keywo 7.8分前25% #多模态模型 33. Time-Frequency Weighted Losses for Phoneme Reconstructi 7.8分前25% #语音增强 34. An implicitization-based solution to the minimal 4s/6r 7.8分前50% - 35. CORTIS: Text-Only Adaptation of Spoken Language Models 7.7分前50% #语音识别 36. What Do Neural Networks Learn for TDOA Estimation? A Cr 7.7分前50% #声源定位 37. Kiwano: A Cutting-Edge Open-Source Toolkit for Speaker 7.6分前50% #说话人验证 38. Learning to Evade: Adaptive Attacks on Audio Watermarki 7.6分前50% #音频水印 39. Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via R 7.6分前25% #语音合成 40. From Text Metrics to Model Internals: A Study of Whispe 7.5分前50% #语音识别 41. Bridging Self-Supervised Learning and Speech Enhancemen 7.5分前25% #语音增强 42. Integrating Facial Generation into Full-Duplex Spoken D 7.5分前25% - 43. ESPnet3: Infrastructure for Scalable Speech and Audio R 7.5分前25% #语音识别 44. On the Effect of Segmentation Width and Cluster Size on 7.4分前25% #语音合成 45. The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion 7.3分前50% #语音识别 46. FlowTTS-GRPO: Online Reinforcement Learning with Multi- 7.2分前50% - 47. DisSpeech: Low-Resource Controllable Mandarin Stuttered 7.2分前25% #语音合成 48. SDP-Codec: A Speaker-Decoupled Speech Codec with Pitch 7.2分前50% #语音编码 49. Synthesizing the Lombard Effect: Multi-Level Control of 7.2分前50% #语音合成 50. Scaling Audio Models Efficiently: A Joint Study of Comp 7.2分前50% #语音识别 51. Online Predictive Coding for Dual-Mode Self-Supervised 7.2分前50% #语音识别 52. Exploiting Neural Audio Codec Latents for Adversarial A 7.2分前50% #生成对抗网络 53. Audio Editing in the Era of Foundation Models: A Survey 7.0分前25% - 54. Adding Robust Code-Switching Capabilities to High Perfo 7.0分前50% #语音识别 55. Unlocking In-Context Learning in Audio-Language Models 7.0分前50% #联邦学习 56. Backdoor Attacks on Speech Emotion Recognition via TTS- 7.0分前50% #语音情感识别 57. LK Jam: System Architecture and Implementation of a Rea 7.0分前50% #音乐生成 58. An Acoustic Landmark Database of the English Lexicon vi 6.9分前50% #语音合成 59. Learning from Audio-Dependency Errors: Data Curation St 6.9分前50% #音频问答 60. The Watermark Shortcut: How Provenance Marking Sabotage 6.8分前50% #数据增强 61. LISE : Listenable Interpretable Speaker Embeddings 6.8分前50% #说话人验证 62. PIVOTSBench: Evaluating Fine-Grained Interpersonal Rela 6.8分前50% #基准测试 63. AugCodec: A Low-Bitrate Disentangled Neural Speech Code 6.7分前50% #数据增强 64. Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark 6.7分前50% #语音识别 65. Physics-Informed Neural Operator for Speech Production 6.7分前50% #语音合成 66. Streaming T5-based Text-to-Speech Synthesis with Limite 6.7分前25% #语音合成 67. ProsoCodec: Prosody-Oriented Speech Codec for Voice Con 6.6分前50% #语音转换 68. Beyond ROC-AUC: Operating-Point Performance Reporting f 6.6分前50% - 69. ISCSLP 2026 CoT-TTS Challenge: Chain-of-Thought Reasoni 6.6分前50% #语音合成 70. A DDSP Framework for Adaptive Room Equalization 6.5分前50% #自适应滤波 71. EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional 6.5分前50% - 72. Interleaved Speech Language Models Latently Work In Tex 6.4分前50% #语音识别 73. DSSCNet: A Transfer Learning Framework for Cross-Corpus 6.3分前50% #迁移学习 74. Sea-Scan: High-Accuracy, ML-based Dark Vessel Detection 6.3分前50% - 75. Catching Lies Without Sending the Video: Privacy-Preser 6.2分前50% #多模态模型 76. MindAlign: Decoding Inner Speech from fMRI Signals via 5.8分前50% #语音识别 77. Acoustic Landmark Detector based on Conformer and HuBER 5.5分前50% #语音识别 78. Explainable AI in Speaker Recognition – Attention Map 5.5分前50% #说话人识别 79. Imitation Learning for Elder-Facing Speech Synthesis 5.5分前50% #语音合成 80. Improving Engine Sound Analysis in Hot-Test Environment 4.9分后50% #音频降噪 81. Direct Raw Audio Signal Processing via Reservoir Comput 4.5分后50% #语音识别 82. A Generalized Formalism of Auto-Regressive Decoding for 4.1分后50% #自回归模型 83. Noise-Driven Instrument Based on Coherent Quantum and S 3.8分后50% - 📋 论文列表 🥇 CoughPhase-CLR: Designing an acoustics-informed foundation model for coughing sound classification 10.0/10 | 创新 2/2 | 严谨 1.5/1.5 | 实验 1.5/1.5 | 清晰 1/1 | 影响 1.5/1.5 | 开源 1.3/1.5 | 复现 0.5/0.5 | 工程 1.0/1.5 ...