<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>音频理解 on 语音/音频论文速递</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E9%9F%B3%E9%A2%91%E7%90%86%E8%A7%A3/</link>
    <description>Recent content in 音频理解 on 语音/音频论文速递</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Tue, 21 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E9%9F%B3%E9%A2%91%E7%90%86%E8%A7%A3/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-deepthinker-progressive-reasoning-aware/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-deepthinker-progressive-reasoning-aware/</guid>
      <description>This paper addresses the lack of explicit, high-quality reasoning in large audio language models (LALMs). Existing approaches are either limited by the quality of supervised data or rely on coarse rewards, yielding chains of thought that are well-formed but lack acoustic grounding. The authors propose the **Audio-DeepThinker** framework, with three core contributions: 1) a **hybrid reasoning-similarity reward** combining LLM…</description>
    </item>
    <item>
      <title>A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven&#39;s Piano and Cello Sonatas</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-a-manual-bar-by-bar-tempo-measurement-protocol/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-a-manual-bar-by-bar-tempo-measurement-protocol/</guid>
      <description>This paper addresses the systematic failures of existing automated tempo-extraction tools on historical polyphonic chamber-music recordings, in particular Beethoven's piano and cello sonatas. In collaboration with a VLSI engineer, the authors design and validate a formalized manual bar-by-bar tempo measurement protocol. The protocol…</description>
    </item>
    <item>
      <title>Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-beyond-transcription-unified-audio-schema-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-beyond-transcription-unified-audio-schema-for/</guid>
      <description>This paper tackles a core problem with current audio large language models (AudioLLMs): poor performance on fine-grained acoustic-perception tasks. The authors argue that the prevailing training paradigm centered on automatic speech recognition (ASR), by mapping audio to plain-text transcripts, systematically discards…</description>
    </item>
    <item>
      <title>Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-elastic-net-regularization-and-gabor-dictionary/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-elastic-net-regularization-and-gabor-dictionary/</guid>
      <description>This paper addresses multi-class classification of heart sound (PCG) signals to support automated diagnosis of cardiovascular disease. The core contribution is a feature-extraction framework that combines an **optimized Gabor dictionary** with **elastic net regularization**, paired with a **CNN-LSTM deep learning network**…</description>
    </item>
    <item>
      <title>Enhancing time-frequency resolution with optimal transport and barycentric fusion of multiple spectrogram</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-enhancing-time-frequency-resolution-with-optimal/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-enhancing-time-frequency-resolution-with-optimal/</guid>
      <description>**Core problem**: spectrograms produced by the short-time Fourier transform (STFT) are constrained by the uncertainty principle and cannot achieve excellent time and frequency resolution simultaneously. Traditional fusion methods (e.g., geometric averaging) require the input spectrograms to share an aligned grid and offer limited performance. **Core method**: this paper proposes a…</description>
    </item>
    <item>
      <title>Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-few-shot-and-pseudo-label-guided-speech-quality/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-few-shot-and-pseudo-label-guided-speech-quality/</guid>
      <description>This paper addresses the performance bottleneck of non-intrusive speech quality assessment when labeled data is limited. The authors propose the GatherMOS framework, whose core idea is to use a large language model (such as GPT-5) as a meta-evaluator that, through carefully designed text prompts, fuses multiple classes of heterogeneous signals, including…</description>
    </item>
    <item>
      <title>Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listen-pause-and-reason-toward-perception/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listen-pause-and-reason-toward-perception/</guid>
      <description>This paper addresses reasoning failures of large audio language models in complex audio scenes caused by perception errors. Inspired by auditory scene analysis, the authors propose a perception-grounded hybrid reasoning framework. First, they build a new dataset named PAQA via a hierarchical decoupling strategy (distinguishing…</description>
    </item>
    <item>
      <title>On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</guid>
      <description>Targeting the imbalanced performance of existing speech variational autoencoders (VAEs) across unified speech reconstruction, understanding, and generation (understanding being especially weak), this paper systematically studies the design space of distillation loss functions. The authors explore three ways of distilling knowledge from self-supervised learning (SSL) models into the VAE…</description>
    </item>
    <item>
      <title>SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-spotsound-enhancing-large-audio-language-models/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-spotsound-enhancing-large-audio-language-models/</guid>
      <description>This paper addresses the weakness of large audio language models in **fine-grained temporal grounding of audio events**. Because training data lacks precise timestamps and existing benchmarks are too simple, current models are unreliable at localizing brief events in long audio (a "needle in a haystack" problem). To this end, the authors propose **SpotSound**…</description>
    </item>
    <item>
      <title>Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-towards-fine-grained-temporal-perception-post/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-towards-fine-grained-temporal-perception-post/</guid>
      <description>This paper addresses the weakness of large audio-language models (LALMs) in fine-grained temporal perception, such as precisely localizing the start and end times of sound events. The authors propose the **TimePro-RL** framework, built on a two-step strategy: first, they introduce the **audio-side time prompt (AS…</description>
    </item>
    <item>
      <title>Transformer Based Machine Fault Detection From Audio Input</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-transformer-based-machine-fault-detection-from/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-transformer-based-machine-fault-detection-from/</guid>
      <description>This paper explores the potential advantages of Transformer-based architectures over conventional convolutional neural networks (CNNs) for detecting machine faults from audio. **The problem**: the inductive biases inherent to CNNs on spectrogram inputs, such as locality and translation invariance, may not…</description>
    </item>
    <item>
      <title>VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</guid>
      <description>This paper addresses a fundamental but overlooked problem in speech processing: systematically identifying the post-processing effects applied to a speech recording, along with their parameters. In practice, speech almost always undergoes effects such as denoising and compression, yet existing datasets lack such precise annotations, hindering research. To this end, the authors…</description>
    </item>
    <item>
      <title>VoxSafeBench: Not Just What Is Said, but Who, How, and Where</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxsafebench-not-just-what-is-said-but-who-how/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxsafebench-not-just-what-is-said-but-who-how/</guid>
      <description>This paper addresses a key problem: when speech language models (SLMs) enter shared multi-user environments, safety-alignment strategies based on text content alone are insufficient; audio context such as speaker identity, paralinguistic traits, and the acoustic scene can fundamentally change the nature of a request. To this end, the authors propose…</description>
    </item>
    <item>
      <title>Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</guid>
      <description>This paper examines a core tension that arises when fusing visual information into the audio tokenizer of end-to-end audio language models: understanding improves while reconstruction quality degrades. Through systematic experiments, the authors reveal three key findings: the fusion position (before or after quantization) is critical; in…</description>
    </item>
  </channel>
</rss>
