<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Paper Digest on 语音/音频论文速递</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/categories/%E8%AE%BA%E6%96%87%E9%80%9F%E9%80%92/</link>
    <description>Recent content in Paper Digest on 语音/音频论文速递</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/categories/%E8%AE%BA%E6%96%87%E9%80%9F%E9%80%92/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Accelerating Regularized Attention Kernel Regression for Spectrum Cartography</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-accelerating-regularized-attention-kernel/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-accelerating-regularized-attention-kernel/</guid>
      <description>Spectrum cartography | 8.5/10</description>
    </item>
    <item>
      <title>ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-asap-an-azimuth-priority-strip-based-search/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-asap-an-azimuth-priority-strip-based-search/</guid>
      <description>Sound source localization | 7.5/10</description>
    </item>
    <item>
      <title>Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-isolated-utterances-cue-guided-interaction/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-isolated-utterances-cue-guided-interaction/</guid>
      <description>Multimodal models | 7.5/10</description>
    </item>
    <item>
      <title>ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-clawmark-a-living-world-benchmark-for-multi-turn/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-clawmark-a-living-world-benchmark-for-multi-turn/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-linguistic-rhythmic-and-spectral-feature/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-linguistic-rhythmic-and-spectral-feature/</guid>
      <description>Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh</description>
    </item>
    <item>
      <title>Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cutscene-agent-an-llm-agent-framework-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cutscene-agent-an-llm-agent-framework-for/</guid>
      <description>Generative models | 8.5/10</description>
    </item>
    <item>
      <title>Generative UI as an Accessibility Bridge: Lessons from C2C E-Commerce</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generative-ui-as-an-accessibility-bridge-lessons/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generative-ui-as-an-accessibility-bridge-lessons/</guid>
      <description>Accessibility | 6.5/10</description>
    </item>
    <item>
      <title>Huí Sù: Co-constructing a Dual Feedback Apparatus</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hu-s-co-constructing-a-dual-feedback-apparatus/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hu-s-co-constructing-a-dual-feedback-apparatus/</guid>
      <description>Music generation | 5.5/10</description>
    </item>
    <item>
      <title>Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-human-1-by-josh-talks-a-full-duplex/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-human-1-by-josh-talks-a-full-duplex/</guid>
      <description>Spoken dialogue systems | 7.5/10</description>
    </item>
    <item>
      <title>Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-independent-component-based-encoding-models-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-independent-component-based-encoding-models-of/</guid>
      <description>Neural encoding | 7.5/10</description>
    </item>
    <item>
      <title>Korean aegyo speech shows systematic F1 increase to signal childlike qualities</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-korean-aegyo-speech-shows-systematic-f1-increase/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-korean-aegyo-speech-shows-systematic-f1-increase/</guid>
      <description>Speech emotion recognition | 6.0/10</description>
    </item>
    <item>
      <title>Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-shared-private-branch-imbalance-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-shared-private-branch-imbalance-via/</guid>
      <description>Multimodal models | 7.5/10</description>
    </item>
    <item>
      <title>ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ml-san-multi-level-speaker-adaptive-network-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ml-san-multi-level-speaker-adaptive-network-for/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmeb-v3-measuring-the-performance-gaps-of-omni/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmeb-v3-measuring-the-performance-gaps-of-omni/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>Monitoring exposure-length variations in submarine power cables using distributed fiber-optic sensing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-monitoring-exposure-length-variations-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-monitoring-exposure-length-variations-in/</guid>
      <description>Audio event detection | 6.5/10</description>
    </item>
    <item>
      <title>Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mutual-forcing-dual-mode-self-evolution-for-fast/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mutual-forcing-dual-mode-self-evolution-for-fast/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-nemotron-3-nano-omni-efficient-and-open/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-nemotron-3-nano-omni-efficient-and-open/</guid>
      <description>Multimodal models | 8.5/10</description>
    </item>
    <item>
      <title>Praxy Voice: Voice-Prompt Recovery &#43; BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-praxy-voice-voice-prompt-recovery-bups-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-praxy-voice-voice-prompt-recovery-bups-for/</guid>
      <description>Speech synthesis | 8.0/10</description>
    </item>
    <item>
      <title>PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-psp-an-interpretable-per-dimension-accent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-psp-an-interpretable-per-dimension-accent/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>RAS: a Reliability Oriented Metric for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ras-a-reliability-oriented-metric-for-automatic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ras-a-reliability-oriented-metric-for-automatic/</guid>
      <description>Speech recognition | 7.5/10</description>
    </item>
    <item>
      <title>Robust Accent Identification via Voice Conversion and Non-Timbral Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-robust-accent-identification-via-voice-conversion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-robust-accent-identification-via-voice-conversion/</guid>
      <description>Speech recognition | 7.5/10</description>
    </item>
    <item>
      <title>Step-Audio-R1.5 Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-step-audio-r15-technical-report/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-step-audio-r15-technical-report/</guid>
      <description>Spoken dialogue systems | 8.0/10</description>
    </item>
    <item>
      <title>SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphonygen-3d-hierarchical-orchestral-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphonygen-3d-hierarchical-orchestral-generation/</guid>
      <description>Music generation | 7.5/10</description>
    </item>
    <item>
      <title>The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-unet-based-fusion-and-exponential-moving-average/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-unet-based-fusion-and-exponential-moving-average/</guid>
      <description>Speaker verification | 7.5/10</description>
    </item>
    <item>
      <title>Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-unrequited-emotions-investigating-the-gaps-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-unrequited-emotions-investigating-the-gaps-in/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-walking-through-uncertainty-an-empirical-study-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-walking-through-uncertainty-an-empirical-study-of/</guid>
      <description>Audio question answering | 7.5/10</description>
    </item>
    <item>
      <title>WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/</guid>
      <description>Speech recognition | 6.5/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-29</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29/</guid>
      <description>A total of 29 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>A Functorial Formulation of Neighborhood Aggregating Deep Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-a-functorial-formulation-of-neighborhood/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-a-functorial-formulation-of-neighborhood/</guid>
      <description>Theoretical analysis | 6.5/10</description>
    </item>
    <item>
      <title>All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-all-that-glitters-is-not-audio-rethinking-text/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-all-that-glitters-is-not-audio-rethinking-text/</guid>
      <description>Audio question answering | 6.5/10</description>
    </item>
    <item>
      <title>An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-an-event-based-sequence-modeling-approach-to/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-an-event-based-sequence-modeling-approach-to/</guid>
      <description>Music understanding | 7.5/10</description>
    </item>
    <item>
      <title>CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-cineagi-character-consistent-movie-creation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-cineagi-character-consistent-movie-creation/</guid>
      <description>Cross-modal | 8.0/10</description>
    </item>
    <item>
      <title>Come Together: Analyzing Popular Songs Through Statistical Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-come-together-analyzing-popular-songs-through/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-come-together-analyzing-popular-songs-through/</guid>
      <description>Music information retrieval | 6.5/10</description>
    </item>
    <item>
      <title>Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-comparison-of-semg-encoding-accuracy-across/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-comparison-of-semg-encoding-accuracy-across/</guid>
      <description>Speech biomarkers | 8.0/10</description>
    </item>
    <item>
      <title>Explainable AI in Speaker Recognition -- Making Latent Representations Understandable</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-explainable-ai-in-speaker-recognition-making/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-explainable-ai-in-speaker-recognition-making/</guid>
      <description>Speaker recognition | 7.5/10</description>
    </item>
    <item>
      <title>Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</guid>
      <description>Audio-visual | 8.5/10</description>
    </item>
    <item>
      <title>HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-headrouter-dynamic-head-weight-routing-for-task/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-headrouter-dynamic-head-weight-routing-for-task/</guid>
      <description>Large audio models | 8.0/10</description>
    </item>
    <item>
      <title>Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-latent-hysteresis-graph-odes-modeling-coupled/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-latent-hysteresis-graph-odes-modeling-coupled/</guid>
      <description>Graph neural networks | 8.0/10</description>
    </item>
    <item>
      <title>Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-listening-with-time-precise-temporal-awareness/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-listening-with-time-precise-temporal-awareness/</guid>
      <description>Audio scene understanding | 8.0/10</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-magic-tts-fine-grained-controllable-speech/</guid>
      <description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-meta-ensemble-learning-with-diverse-data-splits/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-meta-ensemble-learning-with-diverse-data-splits/</guid>
      <description>Audio classification | 8.0/10</description>
    </item>
    <item>
      <title>Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-opening-the-design-space-two-years-of-performance/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-opening-the-design-space-two-years-of-performance/</guid>
      <description>Music generation | 6.5/10</description>
    </item>
    <item>
      <title>Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-predictive-directional-selective-fixed-filter/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-predictive-directional-selective-fixed-filter/</guid>
      <description>Sound source localization | 7.5/10</description>
    </item>
    <item>
      <title>Psychologically-Grounded Graph Modeling for Interpretable Depression Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-psychologically-grounded-graph-modeling-for/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-psychologically-grounded-graph-modeling-for/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>RAS: a Reliability Oriented Metric for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-ras-a-reliability-oriented-metric-for-automatic/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-ras-a-reliability-oriented-metric-for-automatic/</guid>
      <description>Speech recognition | 7.5/10</description>
    </item>
    <item>
      <title>Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-robust-audio-text-retrieval-via-cross-modal/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-robust-audio-text-retrieval-via-cross-modal/</guid>
      <description>Audio retrieval | 7.5/10</description>
    </item>
    <item>
      <title>RTCFake: Speech Deepfake Detection in Real-Time Communication</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-rtcfake-speech-deepfake-detection-in-real-time/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-rtcfake-speech-deepfake-detection-in-real-time/</guid>
      <description>Speech deepfake detection | 7.0/10</description>
    </item>
    <item>
      <title>Scaling Properties of Continuous Diffusion Spoken Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-scaling-properties-of-continuous-diffusion-spoken/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-scaling-properties-of-continuous-diffusion-spoken/</guid>
      <description>Speech generation | 8.0/10</description>
    </item>
    <item>
      <title>Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-spectro-temporal-modulation-representation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-spectro-temporal-modulation-representation/</guid>
      <description>Speech deepfake detection | 6.5/10</description>
    </item>
    <item>
      <title>Speech Enhancement Based on Drifting Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-speech-enhancement-based-on-drifting-models/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-speech-enhancement-based-on-drifting-models/</guid>
      <description>Speech enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-tts-prism-a-perceptual-reasoning-and/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-tts-prism-a-perceptual-reasoning-and/</guid>
      <description>Speech synthesis evaluation | 7.0/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-28</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28/</guid>
      <description>A total of 24 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-advancing-automatic-speech-recognition-using/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-advancing-automatic-speech-recognition-using/</guid>
      <description>Speech recognition | 7.0/10</description>
    </item>
    <item>
      <title>Audio Effect Estimation with DNN-Based Prediction and Search Algorithm</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-audio-effect-estimation-with-dnn-based-prediction/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-audio-effect-estimation-with-dnn-based-prediction/</guid>
      <description>Music understanding | 8.0/10</description>
    </item>
    <item>
      <title>Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-audio-video-verbal-analysis-avva-for-capturing/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-audio-video-verbal-analysis-avva-for-capturing/</guid>
      <description>Audio question answering | 6.0/10</description>
    </item>
    <item>
      <title>Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-beyond-acoustic-sparsity-and-linguistic-bias-a/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-beyond-acoustic-sparsity-and-linguistic-bias-a/</guid>
      <description>Mispronunciation detection | 8.5/10</description>
    </item>
    <item>
      <title>DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-dm-asr-diarization-aware-multi-speaker-asr-with/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-dm-asr-diarization-aware-multi-speaker-asr-with/</guid>
      <description>Speaker recognition | 8.0/10</description>
    </item>
    <item>
      <title>Earable Platform with Integrated Simultaneous EEG Sensing and Auditory Stimulation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-earable-platform-with-integrated-simultaneous-eeg/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-earable-platform-with-integrated-simultaneous-eeg/</guid>
      <description>Audio event detection | 5.5/10</description>
    </item>
    <item>
      <title>Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-full-duplex-interaction-in-spoken-dialogue/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-full-duplex-interaction-in-spoken-dialogue/</guid>
      <description>Spoken dialogue systems | 6.5/10</description>
    </item>
    <item>
      <title>Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-identifying-and-typifying-demographic-unfairness/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-identifying-and-typifying-demographic-unfairness/</guid>
      <description>Speech recognition | 7.0/10</description>
    </item>
    <item>
      <title>Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-listening-with-time-precise-temporal-awareness/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-listening-with-time-precise-temporal-awareness/</guid>
      <description>Audio scene understanding | 8.0/10</description>
    </item>
    <item>
      <title>Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven&#39;s Piano and Cello Sonatas, 1930--2012</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-spectrographic-portamento-gradient-analysis-a/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-spectrographic-portamento-gradient-analysis-a/</guid>
      <description>Music information retrieval | 7.5/10</description>
    </item>
    <item>
      <title>Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-transformer-based-rhythm-quantization-of/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-transformer-based-rhythm-quantization-of/</guid>
      <description>Music information retrieval | 8.0/10</description>
    </item>
    <item>
      <title>TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-tts-prism-a-perceptual-reasoning-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-tts-prism-a-perceptual-reasoning-and/</guid>
      <description>Speech quality assessment | 7.5/10</description>
    </item>
    <item>
      <title>UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</guid>
      <description>Audio generation | 8.5/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-27</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27/</guid>
      <description>A total of 13 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25-magic-tts-fine-grained-controllable-speech/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25-momo-a-framework-for-seamless-physical-verbal-and/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25-momo-a-framework-for-seamless-physical-verbal-and/</guid>
      <description>Robot skill learning | 7.5/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-25</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25/</guid>
      <description>A total of 2 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>&#34;This Wasn&#39;t Made for Me&#34;: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-this-wasnt-made-for-me-recentering-user/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-this-wasnt-made-for-me-recentering-user/</guid>
      <description>Speech recognition | 7.0/10</description>
    </item>
    <item>
      <title>ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-atrie-adaptive-tuning-for-robust-inference-and/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-atrie-adaptive-tuning-for-robust-inference-and/</guid>
      <description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-audita-a-new-dataset-to-audit-humans-vs-ai-skill/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-audita-a-new-dataset-to-audit-humans-vs-ai-skill/</guid>
      <description>Audio question answering | 6.5/10</description>
    </item>
    <item>
      <title>Beyond Rules: Towards Basso Continuo Personal Style Identification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-beyond-rules-towards-basso-continuo-personal/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-beyond-rules-towards-basso-continuo-personal/</guid>
      <description>Music understanding | 7.0/10</description>
    </item>
    <item>
      <title>DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-diarizen-explained-a-tutorial-for-the-open-source/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-diarizen-explained-a-tutorial-for-the-open-source/</guid>
      <description>Speaker diarization | 6.5/10</description>
    </item>
    <item>
      <title>Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-dilated-cnns-for-periodic-signal-processing-a-low/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-dilated-cnns-for-periodic-signal-processing-a-low/</guid>
      <description>Speech enhancement | 6.5/10</description>
    </item>
    <item>
      <title>Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-do-llm-decoders-listen-fairly-benchmarking-how/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-do-llm-decoders-listen-fairly-benchmarking-how/</guid>
      <description>Speech recognition | 7.5/10</description>
    </item>
    <item>
      <title>Evaluation of Automatic Speech Recognition Using Generative Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-evaluation-of-automatic-speech-recognition-using/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-evaluation-of-automatic-speech-recognition-using/</guid>
      <description>Speech recognition | 7.5/10</description>
    </item>
    <item>
      <title>Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-full-duplex-interaction-in-spoken-dialogue/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-full-duplex-interaction-in-spoken-dialogue/</guid>
      <description>Spoken dialogue systems | 6.5/10</description>
    </item>
    <item>
      <title>Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-hierarchical-policy-optimization-for-simultaneous/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-hierarchical-policy-optimization-for-simultaneous/</guid>
      <description>Speech translation | 7.5/10</description>
    </item>
    <item>
      <title>Low-Rank Adaptation Redux for Large Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-low-rank-adaptation-redux-for-large-models/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-low-rank-adaptation-redux-for-large-models/</guid>
      <description>Large language models | 5.5/10</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-magic-tts-fine-grained-controllable-speech/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Materialistic RIR: Material Conditioned Realistic RIR Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-materialistic-rir-material-conditioned-realistic/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-materialistic-rir-material-conditioned-realistic/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>MER 2026: From Discriminative Emotion Recognition to Generative Emotion Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-mer-2026-from-discriminative-emotion-recognition/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-mer-2026-from-discriminative-emotion-recognition/</guid>
      <description>Speech emotion recognition | 6.0/10</description>
    </item>
    <item>
      <title>Misinformation Span Detection in Videos via Audio Transcripts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-misinformation-span-detection-in-videos-via-audio/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-misinformation-span-detection-in-videos-via-audio/</guid>
      <description>Audio security | 7.5/10</description>
    </item>
    <item>
      <title>Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-phonological-subspace-collapse-is-aetiology/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-phonological-subspace-collapse-is-aetiology/</guid>
      <description>Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers</description>
    </item>
    <item>
      <title>Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-preferences-of-a-voice-first-nation-large-scale/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-preferences-of-a-voice-first-nation-large-scale/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Prosody as Supervision: Bridging the Non-Verbal--Verbal for Multilingual Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-prosody-as-supervision-bridging-the-non-verbal/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-prosody-as-supervision-bridging-the-non-verbal/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>Sema: Semantic Transport for Real-Time Multimodal Agents</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-sema-semantic-transport-for-real-time-multimodal/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-sema-semantic-transport-for-real-time-multimodal/</guid>
      <description>Real-time processing | 6.5/10</description>
    </item>
    <item>
      <title>Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-time-vs-layer-locating-predictive-cues-for/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-time-vs-layer-locating-predictive-cues-for/</guid>
      <description>Speech biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>Music generation | 7.0/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-24</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24/</guid>
      <description>A total of 21 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-aligning-stuttered-speech-research-with-end-user/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-aligning-stuttered-speech-research-with-end-user/</guid>
      <description>1.  **Problem**: Current research on stuttered-speech technology is systematically disconnected from the actual needs of people who stutter (PWS) and speech-language pathologists (SLPs); research priorities, task definitions, and evaluation methods are insufficiently user-centered. 2.  **Core method**: A two-part analysis: 1) a scoping review of 228 relevant papers, proposing a taxonomy of research tasks and surveying the state of the field; 2) a survey of 70 stakeholders…</description>
    </item>
    <item>
      <title>ATIR: Towards Audio-Text Interleaved Contextual Retrieval</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-atir-towards-audio-text-interleaved-contextual/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-atir-towards-audio-text-interleaved-contextual/</guid>
      <description>This paper addresses the limitation that existing audio-text retrieval methods cannot handle queries and documents in which audio and text appear interleaved (e.g., multi-turn dialogue, mixed inputs). The authors define the audio-text interleaved contextual retrieval (ATIR) task and build a large-scale benchmark of roughly 88k sample pairs. To address the efficiency and accuracy problems caused by redundant audio tokens when directly applying multimodal large language models (MLLMs), the paper…</description>
    </item>
    <item>
      <title>Before the Mic: Physical-Layer Voiceprint Anonymization with Acoustic Metamaterials</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-before-the-mic-physical-layer-voiceprint/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-before-the-mic-physical-layer-voiceprint/</guid>
      <description>Addressing the problem that, in public settings (such as meetings and talks), untrusted recording devices can leak a speaker's voiceprint with no remedy after the fact, this paper proposes EchoMask, the first physical-layer real-time voiceprint anonymization system based on acoustic metamaterials. Its core method selectively perturbs a specific low-frequency band (300-700 Hz), which is critical for speaker identification, via a carefully designed passive acoustic structure before the sound ever reaches the microphone…</description>
    </item>
    <item>
      <title>Centering Ecological Goals in Automated Identification of Individual Animals</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-centering-ecological-goals-in-automated/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-centering-ecological-goals-in-automated/</guid>
      <description>This paper tackles a key question: why do the high-accuracy algorithms reported in recent years for automated identification of individual animals (from images or sound) so rarely become routine tools in ecological practice? Its core method is an evaluation and deployment framework that "centers ecological goals", stressing that the usefulness of automated identification depends on the specific ecological question being served, the available data, and the practical consequences of each error type. Unlike prior work that focuses mainly on algorithmic accuracy…</description>
    </item>
    <item>
      <title>CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-cointeract-physically-consistent-human-object/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-cointeract-physically-consistent-human-object/</guid>
      <description>1. **Problem**: When generating human-object interaction (HOI) videos, existing video diffusion models often exhibit collapsed hand/face structures and physical interpenetration between humans and objects, rooted in the models' lack of understanding of 3D spatial relations and interaction structure. 2. **Core method**: The CoInteract framework, built around a "spatially-structured co-generation" paradigm: within a shared DiT backbone, the RGB appearance stream is jointly trained with auxiliary…</description>
    </item>
    <item>
      <title>Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-deep-hierarchical-knowledge-loss-for-fault/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-deep-hierarchical-knowledge-loss-for-fault/</guid>
      <description>1.  **Problem**: Traditional fault intensity diagnosis treats each fault as an independent label, ignoring the inherent hierarchical dependencies among physical states (e.g., "cavitation" is the parent class of "incipient cavitation", "stable cavitation", and so on), which limits model performance and robustness. 2.  **Core method**: A general framework named DHK, centered on two new loss functions: a **hierarchical tree loss**…</description>
    </item>
    <item>
      <title>Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-embedding-based-intrusive-evaluation-metrics-for/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-embedding-based-intrusive-evaluation-metrics-for/</guid>
      <description>1. **Problem**: The objective metric commonly used in musical source separation (MSS), BSS-Eval, correlates poorly with human perceptual ratings, making model evaluation unreliable. 2. **Core method**: Two embedding-based intrusive evaluation metrics: the mean squared error between the target and separated signals computed in the embedding space of a pretrained MERT model (MSE_MERT), and a per-track Fréchet audio…</description>
    </item>
    <item>
      <title>Enhancing ASR Performance in the Medical Domain for Dravidian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-enhancing-asr-performance-in-the-medical-domain/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-enhancing-asr-performance-in-the-medical-domain/</guid>
      <description>This paper addresses two major challenges for medical-domain automatic speech recognition (ASR) in Dravidian languages (Telugu and Kannada): scarce annotated data and rich morphology. Its core method is a "confidence-aware training framework" that applies a hybrid confidence scoring mechanism (combining static perceptual, acoustic-similarity, and WER scores with dynamic model entropy) to training data that mixes real and synthetic speech…</description>
    </item>
    <item>
      <title>Enhancing Speaker Verification with Whispered Speech via Post-Processing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-enhancing-speaker-verification-with-whispered/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-enhancing-speaker-verification-with-whispered/</guid>
      <description>1. **Problem**: Because whispered speech lacks vocal-fold vibration, its acoustic characteristics differ sharply from those of normal speech, severely degrading existing speaker verification systems. This is a real challenge when users whisper to protect privacy or cannot phonate normally due to illness. 2. **Core method**: On top of a pretrained speaker verification backbone (ReDimNet-B6), add a lightweight encoder-decoder structure…</description>
    </item>
    <item>
      <title>Environmental Sound Deepfake Detection Using Deep-Learning Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-environmental-sound-deepfake-detection-using-deep/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-environmental-sound-deepfake-detection-using-deep/</guid>
      <description>1.  **Problem**: Environmental sound deepfake detection (ESDD), covering both acoustic scenes and sound events, is under-explored, and it remains unclear whether scene and event deepfake detection require different models. 2.  **Core method**: A deep-learning framework whose core is a pretrained audio model (BEATs) as the feature extractor, combined with a three-stage training strategy (including…</description>
    </item>
    <item>
      <title>Explicit Dropout: Deterministic Regularization for Transformer Architectures</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-explicit-dropout-deterministic-regularization-for/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-explicit-dropout-deterministic-regularization-for/</guid>
      <description>This paper addresses the fact that conventional dropout relies on random masking, making its regularization effect opaque and hard to control precisely. Its core method is a deterministic formulation that recasts dropout as an explicit regularization term added directly to the training loss, deriving regularization expressions for the attention mechanism (Q, K, V) and the feed-forward network in Transformer architectures. Compared with existing methods,…</description>
    </item>
    <item>
      <title>FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-fastturn-unifying-acoustic-and-streaming-semantic/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-fastturn-unifying-acoustic-and-streaming-semantic/</guid>
      <description>For full-duplex spoken dialogue systems, which must decide with low latency and high accuracy whether the user has finished speaking (turn detection), this paper proposes the unified FastTurn framework. Its core method feeds the fast partial semantic information from streaming CTC decoding, together with acoustic features extracted by a Conformer encoder, through adapters into a large language model (LLM) for reasoning, and finally fuses the acoustic and semantic features for turn prediction.</description>
    </item>
    <item>
      <title>FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-flip-towards-understanding-and-interpreting/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-flip-towards-understanding-and-interpreting/</guid>
      <description>This paper addresses the interpretability of multilingual, multimodal sentence embeddings (such as SONAR and LaBSE). Its core method is a model called Factorized Linear Projection (FLiP), which linearly projects embedding vectors into vocabulary space to extract keywords, serving as a proxy task for understanding what an embedding encodes. Compared with earlier non-factorized linear probing (e.g., LiP) and SpLiCE, FLiP's keyword extraction…</description>
    </item>
    <item>
      <title>Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-indic-codecfake-meets-satyam-towards-detecting/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-indic-codecfake-meets-satyam-towards-detecting/</guid>
      <description>1.  **Problem**: Existing work on detecting speech deepfakes synthesized by neural audio codecs (CodecFake) focuses mainly on English and Chinese; the highly diverse Indic languages lack both large-scale benchmark datasets and effective detection methods. 2.  **Method**: The authors build the first large-scale Indic-language CodecFake dataset (ICF) and propose SATYAM…</description>
    </item>
    <item>
      <title>MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-momo-a-framework-for-seamless-physical-verbal-and/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-momo-a-framework-for-seamless-physical-verbal-and/</guid>
      <description>1. **Problem**: Industrial robots must frequently adapt to new tasks and environments, but existing skill-adjustment approaches (such as manual reprogramming) are unfriendly to non-expert users, and no single interaction modality can efficiently handle every kind of adjustment. 2. **Core method**: The MOMO framework integrates three complementary interaction modalities: kinesthetic contact (for precise spatial corrections), natural language (for high-level semantic modifications), and a graphical interface (for parameter…</description>
    </item>
    <item>
      <title>MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-move-translating-laughter-and-tears-via-mixture/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-move-translating-laughter-and-tears-via-mixture/</guid>
      <description>This paper addresses the widespread loss of non-verbal vocalizations (e.g., laughter, crying) and emotional information in speech-to-speech translation (S2ST) systems, which seriously harms the naturalness and accuracy of cross-lingual communication. The authors make three core contributions: first, a scalable automated data-synthesis pipeline that generates a large-scale, high-quality, expressive English-Chinese parallel S2ST corpus, overcoming the training-data scarcity bottleneck</description>
    </item>
    <item>
      <title>ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-onote-benchmarking-omnimodal-notation-processing/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-onote-benchmarking-omnimodal-notation-processing/</guid>
      <description>1. **Problem**: Current multimodal large models are seriously deficient at omnimodal notation processing (ONP): research is fragmented, models show severe notation bias (favoring staff notation), and evaluation commonly relies on unreliable "LLM-as-a-Judge" methods that mask systematic failures in music-theory reasoning. 2.</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-qwen35-omni-technical-report/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-qwen35-omni-technical-report/</guid>
      <description>This paper introduces Qwen3.5-Omni, an omni-modal large language model supporting text, image, audio, and audio-video inputs. To address existing models' shortcomings in real-time interaction, cross-modal reasoning, and tool use, its core method adopts a "Thinker-Talker" architecture with a Mixture-of-Experts (MoE) design for efficiency. Relative to the previous generation, the main innovations are: 1) scaling the model to hundreds of billions</description>
    </item>
    <item>
      <title>Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</guid>
      <description>1. **Problem**: Training a single unified ASR model that offers both high-accuracy offline transcription and low-latency streaming recognition is highly challenging; traditional methods degrade sharply at low latency. 2. **Core method**: A unified Transducer framework combining chunked attention (with right context) and dynamic chunk convolution (DCConv) to serve both modes. The key innovation is a mode-consistency regularization loss</description>
    </item>
    <item>
      <title>SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-sand-the-challenge-on-speech-analysis-for/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-sand-the-challenge-on-speech-analysis-for/</guid>
      <description>1. **Problem addressed**: Early diagnosis and monitoring of neurodegenerative diseases (especially amyotrophic lateral sclerosis, ALS) lack large-scale, clinically annotated speech datasets and a standardized framework for evaluating algorithms. 2. **Core method**: The SAND challenge is built and released, centered on an extended speech dataset with longitudinal data from ALS patients and healthy controls (VOC-A</description>
    </item>
    <item>
      <title>Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-self-noise-reduction-for-capacitive-sensors-via/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-self-noise-reduction-for-capacitive-sensors-via/</guid>
      <description>1. **Problem**: The self-noise of capacitive sensors (e.g., ECM microphones) stems mainly from the thermal noise of the gate resistor (Rm) that establishes the DC bias in the preamplifier. This resistor simultaneously sets the low-pass cutoff of the noise and the high-pass cutoff of the signal, creating a hard-to-reconcile noise-bandwidth trade-off. 2. **Core method**: PDS-Amp (photoelectric DC servo amplifier) is proposed, which uses an external-photoelectric-effect-based</description>
    </item>
    <item>
      <title>SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-speechparaling-bench-a-comprehensive-benchmark/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-speechparaling-bench-a-comprehensive-benchmark/</guid>
      <description>1. **Problem**: Evaluations of large audio-language models' ability to generate and understand paralinguistics (e.g., emotion, tone, timbre) suffer from incomplete feature coverage and subjective, unscalable evaluation methods. 2. **Method**: SpeechParaling-Bench, a comprehensive benchmark of over 1,000 parallel Chinese-English spoken queries covering more than 100 fine-grained paralinguistic features. The benchmark</description>
    </item>
    <item>
      <title>Tadabur: A Large-Scale Quran Audio Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-tadabur-a-large-scale-quran-audio-dataset/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-tadabur-a-large-scale-quran-audio-dataset/</guid>
      <description>1. **Problem**: Existing Quranic speech datasets fall seriously short in scale, reciter diversity, audio quality, and annotation depth, limiting research progress on Quranic ASR, reciter identification, and related tasks. 2. **Core method**: The Tadabur dataset and its construction pipeline, whose core is the "Ayah Alignment Module" (AAM): it combines WhisperX for initial transcription and then uses</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-text-to-speech-with-chain-of-details-modeling/</guid>
      <description>1. **Problem**: In existing discrete-token TTS models, the "coarse-to-fine" generation paradigm amounts mainly to converting semantic tokens into acoustic tokens, with no explicit modeling of the temporal dynamics inherent to speech. 2. **Core method**: The Chain-of-Details (CoD) framework decomposes speech generation into multiple</description>
    </item>
    <item>
      <title>Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-towards-streaming-target-speaker-extraction-via/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-towards-streaming-target-speaker-extraction-via/</guid>
      <description>1. **Problem**: Existing generative target speaker extraction (TSE) methods (e.g., diffusion and autoregressive models) rely on global context and are hard to use directly in real-time streaming scenarios; forcing them to fit causes severe performance degradation. 2. **Core method**: The first autoregressive (AR) framework for streaming TSE, centered on a "chunk-wise interleaved splicing paradigm" that splits the mixed speech into chunks</description>
    </item>
    <item>
      <title>Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-utterance-level-methods-for-identifying-reliable/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-utterance-level-methods-for-identifying-reliable/</guid>
      <description>1. **Problem**: ASR error rates for child speech are high, hurting applications such as language learning and reading assistance. Traditional confidence-estimation methods can fail on noisy, highly variable child speech. A post-transcription (utterance-level) method is needed to automatically identify which ASR outputs are reliable, reducing the manual review burden. 2. **Core method**: Two methods based on</description>
    </item>
    <item>
      <title>X-VC: Zero-shot Streaming Voice Conversion in Codec Space</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-x-vc-zero-shot-streaming-voice-conversion-in/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-x-vc-zero-shot-streaming-voice-conversion-in/</guid>
      <description>1. **Problem**: Zero-shot voice conversion must deliver both high-quality transfer of speaker characteristics and low-latency streaming inference, a challenge not yet well solved. 2. **Core method**: The X-VC system performs one-step conversion in the latent space of the pretrained SAC speech codec. Its core is a dual-conditioned acoustic converter that jointly processes the codec latents of the source speech and the frame-level mel</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-23</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23/</guid>
      <description>27 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-aprvos-1st-place-winner-of-5th-pvuw-mevis-audio/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-aprvos-1st-place-winner-of-5th-pvuw-mevis-audio/</guid>
      <description>This paper reports APRVOS, a winning system designed for the MEVIS_Audio task (audio-conditioned referring video object segmentation). **Problem**: traditional text-referring segmentation models cannot directly handle speech input that is noisy, incomplete, and may describe objects absent from the video. **Method**: a four-stage pipeline that first uses VibeVoice-ASR to convert the speech</description>
    </item>
    <item>
      <title>ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-atrie-adaptive-tuning-for-robust-inference-and/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-atrie-adaptive-tuning-for-robust-inference-and/</guid>
      <description>Addressing existing speech-synthesis systems' difficulty in simultaneously maintaining character-identity consistency and accurate emotional expression when generating persona-driven, emotionally rich speech, this paper proposes the ATRIE framework. Its core is the **Persona-Prosody Dual-Track (P2-DT) architecture**, which decouples speech generation into a static **timbre track** (preserving an identity anchor via scalar quantization) and a dynamic **prosody</description>
    </item>
    <item>
      <title>Audio Spoof Detection with GaborNet</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-audio-spoof-detection-with-gabornet/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-audio-spoof-detection-with-gabornet/</guid>
      <description>This paper addresses the frequency-leakage problem of the traditional SincNet front-end in audio spoof detection, caused by truncating the finite-length sinc functions. The authors replace SincNet with a learnable Gabor filterbank (GaborNet) and integrate it into two advanced end-to-end detection architectures, RawNet2 and RawGAT-ST. The paper also explores LEAF (Learnable F</description>
    </item>
    <item>
      <title>BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-beat-tokenizing-and-generating-symbolic-music-by/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-beat-tokenizing-and-generating-symbolic-music-by/</guid>
      <description>For symbolic music generation, where mainstream event-based tokenization handles temporal regularity only implicitly and forces models to additionally learn the time grid, this paper proposes **BEAT**, a novel grid-based tokenization framework. Its core idea is to discretize music uniformly in time into beats as the basic unit and, within each beat, for each pitch</description>
    </item>
    <item>
      <title>Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-benign-fine-tuning-breaks-safety-alignment-in/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-benign-fine-tuning-breaks-safety-alignment-in/</guid>
      <description>This paper presents the first systematic study of how fine-tuning on benign (harmless) audio data breaks the safety alignment of audio LLMs. **Problem**: does the routine fine-tuning users perform to improve model performance inadvertently break the model's safety guardrails? **Method**: the authors propose an embedding-space-proximity filtering framework that, along semantic, acoustic, and hybrid dimensions, selectively uses benign data lying close to harmful content in representation space</description>
    </item>
    <item>
      <title>Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-comparison-of-semg-encoding-accuracy-across/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-comparison-of-semg-encoding-accuracy-across/</guid>
      <description>This paper aims to select a better intermediate representation target for silent speech interfaces (SSI). It systematically compares articulatory features (SPARC) with traditional phoneme one-hot encodings at predicting surface electromyography (sEMG) signal envelopes. Core findings: 1) across vocalized, mouthed, and subvocalized modes, SPARC features encode significantly more accurately than phoneme features; 2) encoding performance is comparable in the vocalized and mouthed modes, while the subvocalized mode</description>
    </item>
    <item>
      <title>Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-deep-supervised-contrastive-learning-of-pitch/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-deep-supervised-contrastive-learning-of-pitch/</guid>
      <description>This paper tackles the hard problem of mapping continuously varying fundamental-frequency (F0) contours to the discrete, invariant pitch-accent categories of Seoul Korean (e.g., LHLH, HHLH). Traditional methods are vulnerable to F0 measurement noise and speaker differences. The authors propose **Dual-Glob**, a deep supervised contrastive-learning framework. Its core is a **dual-branch (clean-view and augmented-view) encoder** that, in a shared</description>
    </item>
    <item>
      <title>Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-detecting-hallucinations-in-speechllms-at/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-detecting-hallucinations-in-speechllms-at/</guid>
      <description>This paper addresses the "hallucinations" SpeechLLMs produce at inference time, i.e., fluent text that does not match the input audio. Existing methods rely on expensive gold-standard outputs, while text-LLM techniques cannot capture audio-specific signals. The authors propose four lightweight attention-map-based metrics (AudioRatio, AudioConsistency, AudioEnt</description>
    </item>
    <item>
      <title>Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-disentangling-damage-from-operational-variability/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-disentangling-damage-from-operational-variability/</guid>
      <description>For the core challenge in structural health monitoring that damage signals are easily masked by environmental and operational variability, this paper proposes a **label-free, self-supervised disentangled representation-learning framework**. It uses a dual-stream autoencoder architecture, ensures information completeness via a **time-series reconstruction loss**, and applies a **VICReg self-supervised loss** (on baseline-period data assumed to have a constant damage state) to force the damage-sensitive representation (`z_dmg`) to remain invariant to operational</description>
    </item>
    <item>
      <title>Environmental Sound Deepfake Detection Using Deep-Learning Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-environmental-sound-deepfake-detection-using-deep/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-environmental-sound-deepfake-detection-using-deep/</guid>
      <description>For the emerging task of deepfake detection for environmental sounds (e.g., sound events, acoustic scenes), this paper proposes a systematic deep-learning framework. The **core contribution** is an extensive experimental study that evaluates different spectrograms (MEL, CQT, Gammatone), multiple CNN architectures (ResNet, Inception, etc.), and pretrained models (BEATs) on this task, and verifies</description>
    </item>
    <item>
      <title>HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-halluaudio-a-comprehensive-benchmark-for/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-halluaudio-a-comprehensive-benchmark-for/</guid>
      <description>This paper addresses the lack of systematic evaluation tools for the pervasive "hallucination" problem in large audio-language models (LALMs), i.e., generating content inconsistent with the audio evidence. The authors build and release **HalluAudio**, the first large-scale, multi-domain (speech, environmental sound, music), multi-task (binary classification, multiple choice, attribute verification, open-ended generation) human-verified benchmark for audio hallucination detection, containing over</description>
    </item>
    <item>
      <title>MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-move-translating-laughter-and-tears-via-mixture/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-move-translating-laughter-and-tears-via-mixture/</guid>
      <description>MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation</description>
    </item>
    <item>
      <title>MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-mtr-duplexbench-towards-a-comprehensive/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-mtr-duplexbench-towards-a-comprehensive/</guid>
      <description>This paper addresses a key flaw in how full-duplex speech language models (FD-SLMs) are currently evaluated: the lack of systematic assessment of multi-round, continuous conversation. Existing benchmarks mostly cover single-round interaction or specific conversational behaviors (e.g., interruption), overlooking whether models keep core capabilities such as instruction following and safety consistent across multi-round contexts. The authors propose **MTR-DuplexBench**, a brand-new multi-round full-duplex</description>
    </item>
    <item>
      <title>NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-nvbench-a-benchmark-for-speech-synthesis-with-non/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-nvbench-a-benchmark-for-speech-synthesis-with-non/</guid>
      <description>This paper addresses a key but neglected problem in speech synthesis (TTS): how to standardize the evaluation of a system's ability to generate non-verbal vocalizations (NVV, e.g., laughter, sighs). The authors propose **NVBench**, a bilingual (English/Chinese) benchmark with a **unified taxonomy of 45 NVV classes**. Its core methods include: 1) a high-quality, balanced evaluation dataset of 50 examples per class and 4,500 in total</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</guid>
      <description>This technical report gives a comprehensive introduction to Qwen3.5-Omni, an omni-modal large language model that unifies understanding and generation of text, image, audio, and audio-video content. **Problem**: existing models' limitations in real-time interaction, cross-modal reasoning, and autonomous agent behavior. **Method**: a "Thinker-Talker" architecture with several key innovations: 1) both the Thinker and the Talker use hybrid attention</description>
    </item>
    <item>
      <title>Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-reducing-the-offline-streaming-gap-for-unified/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-reducing-the-offline-streaming-gap-for-unified/</guid>
      <description>This paper tackles the challenge of training a single automatic speech recognition (ASR) model that efficiently supports both high-accuracy offline transcription and low-latency streaming recognition; existing unified models degrade noticeably in low-latency streaming mode. The authors propose a unified RNN-Transducer (RNNT) framework whose core combines **chunk-restricted attention with right context** and **dynamic chunk convolution (DCCo</description>
    </item>
    <item>
      <title>Tadabur: A Large-Scale Quran Audio Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-tadabur-a-large-scale-quran-audio-dataset/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-tadabur-a-large-scale-quran-audio-dataset/</guid>
      <description>This paper addresses the lack of large-scale, diverse, finely annotated datasets for Quranic speech research. The authors propose the **Tadabur** dataset and its automated construction pipeline, which first collects audio from public platforms and uses a large language model (Gemini) to extract standardized metadata (e.g., surah, reciter) from unstructured text. The core step is the **Ayah Alignment</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</guid>
      <description>For text-to-speech (TTS), this paper proposes a new framework called Chain-of-Details (CoD). **Problem**: existing TTS methods fall short at modeling the temporal dynamics of speech generation (the gradual progression from coarse timing to fine acoustic detail). **Method**: decompose speech generation into multiple stages of increasing temporal resolution, with each stage using</description>
    </item>
    <item>
      <title>Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-towards-streaming-target-speaker-extraction-via/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-towards-streaming-target-speaker-extraction-via/</guid>
      <description>This paper addresses the core problem that generative target speaker extraction (TSE) models degrade severely in streaming real-time applications because they rely on global context. The authors propose the first streaming TSE framework built on an autoregressive language model (LauraGPT). Its key innovation is the "chunk-wise interleaved splicing paradigm": chunks of the mixed audio are interleaved with the corresponding discrete-code chunks of the target speech as model input, strictly guaranteeing that inference</description>
    </item>
    <item>
      <title>UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-uaf-a-unified-audio-front-end-llm-for-full-duplex/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-uaf-a-unified-audio-front-end-llm-for-full-duplex/</guid>
      <description>**Core contribution**: This paper proposes the first unified audio front-end LLM (UAF) designed specifically for full-duplex speech interaction. Breaking the traditional cascaded front-end paradigm, it models voice activity detection (VAD), speaker recognition (SR), automatic speech recognition (ASR), turn detection (TD), question answering (QA), and other tasks as a single autoregressive sequence-prediction problem.  **Key method**: the model adopts an "audio</description>
    </item>
    <item>
      <title>Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-voice-of-india-a-large-scale-benchmark-for-real/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-voice-of-india-a-large-scale-benchmark-for-real/</guid>
      <description>This paper addresses the core problems that existing Indic ASR benchmarks do not reflect real-world scenarios and that their evaluation methods are unfair. The authors build the large-scale "Voice of India" benchmark, sourced from unscripted telephone conversations of 36,000 speakers, covering 15 major Indian languages and 139 regional clusters, 536 hours in total. The key innovation is adopting spelling-variant-aware</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-22</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22/</guid>
      <description>21 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>A novel LSTM music generator based on the fractional time-frequency feature extraction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-a-novel-lstm-music-generator-based-on-the/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-a-novel-lstm-music-generator-based-on-the/</guid>
      <description>This paper proposes a novel AI music-generation system based on the fractional Fourier transform (FrFT) and long short-term memory (LSTM) networks. **Core goal**: use the FrFT to extract richer music-signal features in the fractional domain (a rotated representation of the time-frequency plane) than traditional time- or frequency-domain features, addressing traditional LSTMs' weakness at capturing music's complex time-frequency structure. **Key method**: apply the FrFT transform to the input music signal</description>
    </item>
    <item>
      <title>A state-space representation of the boundary integral equation for room acoustic modelling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-a-state-space-representation-of-the-boundary/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-a-state-space-representation-of-the-boundary/</guid>
      <description>This paper addresses the problem that traditional room-acoustic modeling methods (e.g., boundary element methods, delay networks, geometrical acoustics) stand apart from one another and lack a unified theoretical foundation. The authors propose a new framework named the **boundary integral operator state space (BIOSS)**. Its core is to recast the boundary integral equation describing the sound field as a state-space model in which **the state is the sound-pressure distribution function on the room boundary** and the system dynamics are governed by</description>
    </item>
    <item>
      <title>Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-aligning-language-models-for-lyric-to-melody/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-aligning-language-models-for-lyric-to-melody/</guid>
      <description>This paper addresses the "constraint violation" problem in lyric-to-melody generation with large language models: models trained via supervised fine-tuning (SFT) often produce musically infeasible output (e.g., bizarre rhythms, out-of-range pitches). The **core contribution** is an automated alignment framework based on rule constraints that needs no human annotation. The **key method** has three steps: first, SFT the pretrained LLM to obtain basic generation ability; second, using</description>
    </item>
    <item>
      <title>Anonymization, Not Elimination: Utility-Preserved Speech Anonymization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-anonymization-not-elimination-utility-preserved/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-anonymization-not-elimination-utility-preserved/</guid>
      <description>For the central tension in speech-data privacy protection between "privacy leakage" and "utility loss", this paper proposes a novel two-stage framework. First, to address the limited identity diversity and poor controllability of speaker anonymization (protecting "who is speaking"), it proposes a flow-matching-based speaker-embedding anonymizer (F3-VA) that generates diverse new identities well separated from the original speaker. Second, for content anonymization (protecting "what</description>
    </item>
    <item>
      <title>ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-artifactnet-detecting-ai-generated-music-via/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-artifactnet-detecting-ai-generated-music-via/</guid>
      <description>This paper addresses the poor generalization that plagues AI-generated-music detection. Current mainstream methods (e.g., CLAM, SpecTTTra) learn the sonic characteristics of AI music and degrade sharply when facing unseen generators. The authors' core hypothesis: today's mainstream AI music generators (e.g., Suno, Udio) all rely on the residual vector quantization of neural audio codecs (e.g., EnCodec)</description>
    </item>
    <item>
      <title>Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-cogito-towards-deep-audio-reasoning-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-cogito-towards-deep-audio-reasoning-in/</guid>
      <description>This paper addresses large audio-language models' (LALMs) insufficient ability and opaque reasoning on complex audio-reasoning tasks. The **core contribution** is a fully open-source solution named **Audio-Cogito**, centered on **Cogito-Pipe**, a four-stage automated data-construction pipeline for generating high-quality, diverse audio chain-of-thought (CoT) data</description>
    </item>
    <item>
      <title>Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-deepthinker-progressive-reasoning-aware/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-deepthinker-progressive-reasoning-aware/</guid>
      <description>This paper addresses large audio-language models' (LALMs) lack of explicit, high-quality reasoning. Existing methods are either limited by the quality of supervised data or use coarse rewards, producing chains of thought that are well-formed but acoustically ungrounded. The authors propose the **Audio-DeepThinker** framework with three core contributions: 1) a **hybrid reasoning-similarity reward** combining LLM evaluation</description>
    </item>
    <item>
      <title>AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-avrt-audio-visual-reasoning-transfer-through/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-avrt-audio-visual-reasoning-transfer-through/</guid>
      <description>This paper addresses the core challenge that multimodal large models lack high-quality training data for joint audio-visual reasoning. The **core contribution** is the AVRT framework, which synthesizes multimodal reasoning data by composing the abilities of single-modality expert models. The **key method** has two steps: 1) **data generation**: a dedicated visual teacher (Kimi-VL-Thinking) and audio teacher (Audio Flami</description>
    </item>
    <item>
      <title>Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-benign-fine-tuning-breaks-safety-alignment-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-benign-fine-tuning-breaks-safety-alignment-in/</guid>
      <description>This paper is the first systematic study of **how fine-tuning on benign audio data damages the safety alignment of audio LLMs**. The core question: when users fine-tune a model on entirely harmless audio data to improve performance, do they accidentally weaken its ability to refuse harmful instructions? The authors propose an **embedding-space-proximity filtering framework** that computes the distance between benign and harmful audio in the model's internal space or an external reference encoder's space</description>
    </item>
    <item>
      <title>BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-bhashasutra-a-task-centric-unified-survey-of/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-bhashasutra-a-task-centric-unified-survey-of/</guid>
      <description>This paper addresses the pain point that Indian-language NLP research resources are scattered and lack a unified overview. The authors present the first task-centric unified taxonomy, systematically organizing and integrating over 200 datasets, 50 benchmarks, and more than 100 models, tools, and systems, covering everything from core language processing (e.g., tokenization, POS tagging) to text classification, generation and translation, information retrieval, speech and multimodality, and even socio-cultural tasks (</description>
    </item>
    <item>
      <title>ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-claricodec-optimising-neural-speech-codes-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-claricodec-optimising-neural-speech-codes-for/</guid>
      <description>For ultra-low-bitrate (200 bps) scenarios such as satellite and underwater communication, where traditional neural speech codecs sacrifice intelligibility by optimizing for reconstruction quality, this paper proposes ClariCodec. Its core method redefines the encoder's quantization as a stochastic policy and uses reinforcement learning (RL) with word error rate (WER) as the reward signal to fine-tune the encoder, while freezing the decoder and the rest of the acoustic reconstruction pipeline. Experiments show</description>
    </item>
    <item>
      <title>Coexisting Tempo Traditions in Beethoven&#39;s Piano and Cello Sonatas: A K-means Clustering Analysis of Recorded Performances, 1930-2012</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-coexisting-tempo-traditions-in-beethovens-piano/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-coexisting-tempo-traditions-in-beethovens-piano/</guid>
      <description>This paper challenges the single regression model widely used in empirical research on musical performance, which often portrays historical tempo change as a unidirectional, uniform process; the authors argue that such a model obscures the coexistence of multiple performance traditions. Using bar-by-bar tempo data from more than one hundred movement recordings (1930-2012) of Beethoven's five sonatas for piano and cello (Op. 5, 69, 102), the study performs a K-means</description>
    </item>
    <item>
      <title>FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-flip-towards-understanding-and-interpreting/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-flip-towards-understanding-and-interpreting/</guid>
      <description>This paper proposes **FLiP**, a **factorized linear projection model** for **understanding and interpreting** multilingual, multimodal sentence-embedding spaces (e.g., SONAR, LaBSE, Gemini). The core idea is to turn embedding-space interpretation into a **linear keyword-extraction task**: a simple linear projection recovers, from a sentence-embedding vector, the words that make up the sentence. Experiments show that well-trained</description>
    </item>
    <item>
      <title>FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-freezeempath-efficient-training-for-empathetic/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-freezeempath-efficient-training-for-empathetic/</guid>
      <description>This paper tackles three difficulties in training empathetic spoken chatbots: **scarce empathetic speech data, weak model generalization, and degradation of the LLM's general abilities caused by fine-tuning**. The authors propose **FreezeEmpath**, an efficient end-to-end training framework. Its core method **freezes the base LLM** and adopts a **semantic-emotion decoupled encoding strategy**, in which an independent semantic adapter and an emotion extractor extract, from</description>
    </item>
    <item>
      <title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-from-reactive-to-proactive-assessing-the/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-from-reactive-to-proactive-assessing-the/</guid>
      <description>This paper addresses the problem that existing voice-agent evaluation benchmarks focus mainly on reactive responses while ignoring proactive perception and intervention. The authors propose **ProVoice-Bench**, the first benchmark framework dedicated to evaluating proactive voice agents. Through a multi-stage data-synthesis pipeline spanning digital-state construction, scenario synthesis, dialogue generation, acoustic simulation, and dialogue assembly, it builds 1,182 high-quality</description>
    </item>
    <item>
      <title>Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-hard-to-be-heard-phoneme-level-asr-analysis-of/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-hard-to-be-heard-phoneme-level-asr-analysis-of/</guid>
      <description>This paper establishes the first speech-recognition (ASR) benchmarks for Archi and Rutul, two endangered East Caucasian languages with extremely complex phonologies and extremely scarce resources. The authors consolidate and standardize existing linguistic documentation, creating speech-text datasets of about 50 minutes and 1 hour 20 minutes. They evaluate several state-of-the-art ASR models (wav2vec2, Whisper, Qwen2-Audi</description>
    </item>
    <item>
      <title>HCFD: A Benchmark for Audio Deepfake Detection in Healthcare</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-hcfd-a-benchmark-for-audio-deepfake-detection-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-hcfd-a-benchmark-for-audio-deepfake-detection-in/</guid>
      <description>For detecting speech deepfakes generated by neural audio codecs in healthcare, this paper proposes a brand-new research task (HCFD) and benchmark dataset (HCFK). It finds that existing deepfake-detection models trained on healthy speech degrade significantly on pathological speech. The paper first verifies that pretrained audio models (e.g., PaSST) cope better with the variability pathological speech introduces. More importantly, it proposes</description>
    </item>
    <item>
      <title>ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-iclad-in-context-learning-with-comparison/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-iclad-in-context-learning-with-comparison/</guid>
      <description>For the core problem that audio-deepfake-detection models generalize poorly in the wild, this paper proposes a brand-new paradigm named ICLAD. The framework uses the in-context learning ability of audio-language models (ALMs) to achieve training-free rapid adaptation. Its core is an innovative **pairwise comparison reasoning** strategy: in the offline stage, the ALM is guided to generate, for each sample, both "real" and "fake"</description>
    </item>
    <item>
      <title>Incremental learning for audio classification with Hebbian Deep Neural Networks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-incremental-learning-for-audio-classification/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-incremental-learning-for-audio-classification/</guid>
      <description>For incremental (continual) learning in audio classification, this paper proposes a biologically inspired solution. The core goal is to overcome deep models' "catastrophic forgetting" of old knowledge when learning new tasks. The authors are the first to combine **Hebbian learning** (an unsupervised, feedback-free learning rule based on synchronized neuron activation) with **incremental learning**, and design a **kernel plasticity** mechanism that analyzes the training</description>
    </item>
    <item>
      <title>Latent Fourier Transform</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</guid>
      <description>This paper addresses existing music-generation models' difficulty in precisely controlling musical patterns at **arbitrary time scales**. The authors propose the **Latent Fourier Transform (LatentFT)** framework, whose core is to apply the discrete Fourier transform to the **sequence of latent vectors** encoded by a diffusion autoencoder, yielding a "latent spectrum". By randomly masking frequencies of the latent spectrum during training, it forces the decoder</description>
    </item>
    <item>
      <title>LLM-Codec: Neural Audio Codec Meets Language Model Objectives</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-llm-codec-neural-audio-codec-meets-language-model/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-llm-codec-neural-audio-codec-meets-language-model/</guid>
      <description>This paper addresses a fundamental conflict in speech language models (SLMs): neural audio codecs are optimized for waveform reconstruction while language models are optimized for sequence prediction, and this objective mismatch makes the generated discrete speech tokens high-entropy and hard to predict. The authors propose the LLM-Codec training framework, which, without changing the codec or language-model architectures, introduces two language-model-oriented regularization objectives</description>
    </item>
    <item>
      <title>MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mimiclm-zero-shot-voice-imitation-through/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mimiclm-zero-shot-voice-imitation-through/</guid>
      <description>This paper addresses the core bottleneck of scarce high-quality parallel training data in zero-shot voice imitation. Traditional methods either rely on complex disentanglement architectures or use synthetic speech as the training target, capping output quality at the synthesis system's ability. The authors propose a new framework named **MimicLM**, whose key innovation is a **"role-swapped" data-construction strategy**: TTS-generated speech serves as the **training</description>
    </item>
    <item>
      <title>MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mint-bench-a-comprehensive-multilingual-benchmark/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mint-bench-a-comprehensive-multilingual-benchmark/</guid>
      <description>This paper addresses the lack of systematic evaluation tools for instruction-following text-to-speech (TTS); current evaluation suffers from incomplete coverage, coarse diagnostic granularity, and weak multilingual support. The authors propose **MINT-Bench**, a comprehensive multilingual benchmark. Its core methods include: 1) a **hierarchical multi-axis taxonomy** built on 10 atomic acoustic attributes, systematically organizing instructions from simple to complex (</description>
    </item>
    <item>
      <title>MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-move-translating-laughter-and-tears-via-mixture/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-move-translating-laughter-and-tears-via-mixture/</guid>
      <description>This paper addresses the widespread loss of non-verbal vocalizations (e.g., laughter, crying) and emotional prosody in speech-to-speech translation (S2ST) systems, which severely limits the naturalness and pragmatic accuracy of cross-lingual communication. The authors make three major contributions: 1) a **scalable expressive data-synthesis pipeline** that automatically generates high-quality, emotion-annotated S2ST training pairs, overcoming the data-scarcity bottleneck; 2) **MoVE (Mixture of Vocalization</description>
    </item>
    <item>
      <title>Neural Encoding Detection is Not All You Need for Synthetic Speech Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-neural-encoding-detection-is-not-all-you-need-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-neural-encoding-detection-is-not-all-you-need-for/</guid>
      <description>The core contribution of this survey is to **expose and argue a key misconception in current synthetic-speech detection: the over-reliance on "neural encoding detection"**. The paper first systematically reviews three families of data-driven methods, based on SincNet, self-supervised learning (SSL), and neural encoding detection, and points out that today's best-performing SSL models mainly capture traces introduced by the vocoder at the waveform-generation stage, rather than</description>
    </item>
    <item>
      <title>NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-nim4-asr-towards-efficient-robust-and/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-nim4-asr-towards-efficient-robust-and/</guid>
      <description>This paper presents NIM4-ASR, an efficient, robust, and customizable real-time speech-recognition framework for production environments. The work addresses three challenges in deploying existing LLM-based ASR: 1) severe performance loss in lightweight models (limited down-scalability); 2) hallucinations under acoustically challenging conditions; 3) the lack of a production-ready hotword-customization mechanism. The authors propose a principled multi-stage</description>
    </item>
    <item>
      <title>Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-omni-embed-audio-leveraging-multimodal-llms-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-omni-embed-audio-leveraging-multimodal-llms-for/</guid>
      <description>This paper addresses the performance drop of current audio-text retrieval models under **real, diverse user queries**. The authors point out that existing benchmarks (e.g., AudioCaps, Clotho) rely on descriptive caption-style queries, far removed from real-world search behavior, which is short and varied (questions, commands, keywords, exclusionary queries). The paper makes two core contributions: 1) **Omni</description>
    </item>
    <item>
      <title>Prosody as Supervision: Bridging the Non-Verbal--Verbal for Multilingual Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-prosody-as-supervision-bridging-the-non-verbal/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-prosody-as-supervision-bridging-the-non-verbal/</guid>
      <description>This paper addresses the core bottleneck of scarce annotated data in low-resource multilingual speech emotion recognition (SER). The authors propose a disruptive paradigm: **recasting SER as an unsupervised "non-verbal-to-verbal" transfer problem**. The core hypothesis is that the prosodic emotional cues carried by non-verbal vocalizations (e.g., laughing, crying) are purer and more cross-lingual than those in speech, and can therefore serve as a better supervision source. The authors design **NOVA-</description>
    </item>
    <item>
      <title>SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-self-emo-emotional-self-evolution-from/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-self-emo-emotional-self-evolution-from/</guid>
      <description>This paper addresses how emotion recognition in conversation (ERC) and emotional expression are limited by the scarcity and static nature of high-quality annotated data. The **core contribution** is **SELF-EMO**, a psychologically motivated self-evolution framework. The **key method** builds a role-playing self-play paradigm in which the model acts as both "emotion recognizer" and "dialogue responder", driven by a "generate-filter-reuse" data</description>
    </item>
    <item>
      <title>Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-still-between-us-evaluating-and-improving-voice/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-still-between-us-evaluating-and-improving-voice/</guid>
      <description>This paper addresses speech language models' (SLMs) inability to effectively distinguish the primary user from third-party interruptions (TPI) in real scenarios, which causes contextual-understanding failures. The authors first create **TPI-Train**, a training dataset of 88,000 samples whose core design is "speaker-aware hard negatives",</description>
    </item>
    <item>
      <title>VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-vibe-voice-induced-open-ended-bias-evaluation-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-vibe-voice-induced-open-ended-bias-evaluation-for/</guid>
      <description>This paper addresses the insufficient evaluation of social bias in large audio-language models (LALMs) on open-ended generation tasks. Existing benchmarks mostly rely on synthetic speech and multiple-choice questions (MCQ), and cannot capture the stereotypes models naturally reveal in real interaction. The authors propose the **VIBE** framework, whose core is to feed **real human voice recordings** into the model and use **open-ended generation tasks** (e.g., story writing, personalized recommendation</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>For existing video-to-music (V2M) generation models' lack of fine-grained control over creator intent such as style and theme, this paper proposes Video-Robin, a video-scoring framework that incorporates text prompts. Its core method decouples generation into two stages: first, a multimodal autoregressive planning head (AR-Head) integrates video frames and text prompts, and via a semantic language model, finite scalar quantization (FSQ), and residual</description>
    </item>
    <item>
      <title>VoxSafeBench: Not Just What Is Said, but Who, How, and Where</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-voxsafebench-not-just-what-is-said-but-who-how/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-voxsafebench-not-just-what-is-said-but-who-how/</guid>
      <description>This paper addresses how current evaluations of speech language models' (SLMs) social alignment are neither comprehensive nor deep. Existing benchmarks either cover only basic audio understanding or study a single risk in isolation, and cannot distinguish whether a model fails because it "does not understand" or because it "applies understanding in the wrong place". The authors propose **VoxSafeBench**, the first benchmark to jointly evaluate SLMs across the three social-alignment dimensions of **safety, fairness, and privacy**</description>
    </item>
    <item>
      <title>Where Do Self-Supervised Speech Models Become Unfair?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-where-do-self-supervised-speech-models-become/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-where-do-self-supervised-speech-models-become/</guid>
      <description>This paper investigates where, across a model's layers, the unfairness of self-supervised speech models (S3Ms) arises. The team applies a lightweight linear-probe method to the per-layer embeddings of several S3Ms (e.g., WavLM, Wav2Vec2, BEST-RQ, Whisper), jointly evaluating overall performance on speaker identification (SID) and automatic speech recognition (ASR) as well as performance across different speaker groups</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-21</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21/</guid>
      <description>34 speech/AI papers analyzed</description>
    </item>
    <item>
      <title>ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-actormind-emulating-human-actor-reasoning-for/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-actormind-emulating-human-actor-reasoning-for/</guid>
      <description>This paper addresses how existing role-playing research is confined to the text modality, neglecting the speech modality that dominates everyday communication. The authors first **define the "speech role-playing" task**: based on the character, scene, and dialogue history, the model must generate spontaneous responses with personalized vocal characteristics (e.g., specific emotion, intonation). They then build **ActorMindBench**, a benchmark based on 《老</description>
    </item>
    <item>
      <title>ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-artifactnet-detecting-ai-generated-music-via/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-artifactnet-detecting-ai-generated-music-via/</guid>
      <description>This paper addresses poor generalization and low parameter efficiency in AI-generated-music detection. The authors propose a new framework named **ArtifactNet**, whose key innovation is to **recast the problem as "forensic physics"**: directly extracting and analyzing the physical traces (residues) that neural audio codecs inevitably leave in generated audio. The method uses a lightweight **Bounded-mas</description>
    </item>
    <item>
      <title>AST: Adaptive, Seamless, and Training-Free Precise Speech Editing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ast-adaptive-seamless-and-training-free-precise/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ast-adaptive-seamless-and-training-free-precise/</guid>
      <description>For existing speech-editing methods' reliance on task-specific training and poor temporal consistency in unedited regions, this paper proposes AST (Adaptive, Seamless, and Training-free), a precise speech-editing framework built on a pretrained TTS model of the AM-FM (autoregressive plus flow-matching) paradigm. AST first inverts the original speech into latent space via an inverse Euler ODE solver</description>
    </item>
    <item>
      <title>Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-beyond-monologue-interactive-talking-listening/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-beyond-monologue-interactive-talking-listening/</guid>
      <description>This paper addresses the core challenge of moving from one-way "monologue" avatar generation to natural "full-duplex" interactive generation. **Core problem**: existing methods either react rigidly because of strict frame alignment, or break lip sync by introducing global attention. **Key method**: a unified attention architecture based on multi-head Gaussian kernels (MHGK), which assigns different attention heads Gaussian receptive fields ranging from narrow to wide, enabling</description>
    </item>
    <item>
      <title>BlasBench: An Open Benchmark for Irish Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-blasbench-an-open-benchmark-for-irish-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-blasbench-an-open-benchmark-for-irish-speech/</guid>
      <description>This paper addresses the lack of a unified, reliable evaluation standard for Irish speech recognition (ASR). Existing work and benchmarks either ignore Irish-specific text conventions (e.g., preserving fada diacritics, initial consonant mutation) or use differing datasets and normalization methods, making results incomparable. The authors propose **BlasBench**, an open evaluation framework centered on an **Irish</description>
    </item>
    <item>
      <title>Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-discrete-token-modeling-for-multi-stem-music/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-discrete-token-modeling-for-multi-stem-music/</guid>
      <description>This paper proposes a generative framework for multi-stem music source separation whose key innovation is recasting the separation task as **conditional discrete token generation**. Whereas traditional methods estimate continuous signals directly in the time-frequency domain, this method first uses the **HCodec** neural audio codec to convert audio waveforms into discrete acoustic and semantic token sequences. Then a **Conformer**-based conditional encoder, from the mixed audio</description>
    </item>
    <item>
      <title>Elucidating the SNR-t Bias of Diffusion Probabilistic Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-elucidating-the-snr-t-bias-of-diffusion/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-elucidating-the-snr-t-bias-of-diffusion/</guid>
      <description>The core contribution of this paper is identifying and systematically analyzing a foundational problem in diffusion probabilistic models (DPMs): the SNR-timestep (SNR-t) bias. The bias refers to the mismatch between the actual SNR of a denoised sample at inference and the SNR theoretically corresponding to its assigned timestep t; this misalignment arises because the strict coupling enforced during training is broken by accumulated error at inference. Through detailed experiments (sliding-window tests, forward-versus-reverse process comparisons), the authors reveal</description>
    </item>
    <item>
      <title>Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-full-duplex-bench-v3-benchmarking-tool-use-for/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-full-duplex-bench-v3-benchmarking-tool-use-for/</guid>
      <description>Targeting the lack of realism (reliance on synthetic speech) and the simplicity of tasks (single-step calls) in current full-duplex voice-agent evaluation, this paper proposes the **Full-Duplex-Bench-v3 (FDB-v3)** benchmark. Its core innovation is using **100 real human recordings** (annotated with five kinds of disfluency) and designing, across four task domains, tasks that require **multi-step chained API calls**</description>
    </item>
    <item>
      <title>Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-generalizable-audio-visual-navigation-via/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-generalizable-audio-visual-navigation-via/</guid>
      <description>This paper addresses the core problem that audio-visual navigation (AVN) agents generalize poorly to unseen environments and unheard sound categories. The authors attribute the performance drop of existing methods to two factors: first, audio representations entangle semantic and spatial information, leading to inaccurate localization of unheard sounds; second, the reinforcement-learning policy overfits the dynamics and layout of the training environments. The paper proposes a plug-and-play framework named BDATP. At the perception level</description>
    </item>
    <item>
      <title>HARNESS: Lightweight Distilled Arabic Speech Foundation Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</guid>
      <description>For Arabic speech recognition, dialect identification, and emotion recognition, where general multilingual/English models underperform and large models are hard to deploy, this paper proposes HArnESS, an Arabic-centric family of self-supervised speech models. The authors adopt a HuBERT-style iterative self-distillation framework, first training, on large-scale Arabic-English bilingual data (about 23K hours), a 24-layer teacher model H</description>
    </item>
    <item>
      <title>Hierarchical Codec Diffusion for Video-to-Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</guid>
      <description>For the visual-speech information asymmetry in video-to-speech (VTS) generation, this paper argues that existing methods ignore speech's hierarchical structure from coarse-grained semantics to fine-grained prosody, so visual conditions cannot be aligned precisely with speech representations. The authors propose HiCoDiT (Hierarchical Codec Diffusion Transformer),</description>
    </item>
    <item>
      <title>Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-interactive-asr-towards-human-like-interaction/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-interactive-asr-towards-human-like-interaction/</guid>
      <description>Targeting two blind spots of traditional ASR (the WER metric's insensitivity to semantic errors, and systems' inability to correct errors through natural interaction), this paper proposes the Interactive ASR framework. First, the authors introduce S²ER (Sentence-level Semantic Error Rate), using LLM-as-a-Judge binary judgments of whether the recognition result and the reference text</description>
    </item>
    <item>
      <title>Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-joint-centric-dual-contrastive-alignment-with/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-joint-centric-dual-contrastive-alignment-with/</guid>
      <description>This paper addresses a key challenge in audio-text multimodal representation learning: achieving effective cross-modal alignment under low-resource, long-sequence conditions with severely imbalanced modality dimensions (high-dimensional audio, low-dimensional text), while preserving each modality's specific information. The authors propose the HILBERT framework, which first extracts segment-level features using frozen pretrained audio (e.g., HuBERT) and text (e.g., T5) encoders,</description>
    </item>
    <item>
      <title>MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-moshirag-asynchronous-knowledge-retrieval-for/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-moshirag-asynchronous-knowledge-retrieval-for/</guid>
      <description>This paper addresses the core problem of weak factuality in full-duplex speech language models (e.g., Moshi) without sacrificing their high interactivity. **Problem**: full-duplex models can interrupt and respond in real time, but because their training data is far smaller in scale than text, their knowledge and factual accuracy are weak. **Method**: MoshiRAG, a modular framework that introduces into the Moshi model a special `&amp;lt;ret&amp;gt;` retrieval-trigger token</description>
    </item>
    <item>
      <title>MUSCAT: MUltilingual, SCientific ConversATion Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/</guid>
      <description>This paper presents MUSCAT, a new benchmark for evaluating automatic speech recognition (ASR) in multilingual scientific conversation. The dataset contains 6 bilingual conversation recordings (about 65 minutes and 9,066 words in total), pairing English with German, Turkish, Chinese, and Vietnamese; each conversation is recorded using a Meeting Owl 3, a ReSpeaker USB microphone array, and a Me</description>
    </item>
    <item>
      <title>NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-naijas2st-a-multi-accent-benchmark-for-speech-to/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-naijas2st-a-multi-accent-benchmark-for-speech-to/</guid>
      <description>This paper addresses the severe scarcity of high-quality, multi-accent parallel speech data facing African low-resource languages in speech-translation (S2ST and S2TT) research. The authors build the **NaijaS2ST** dataset, covering parallel speech between English and Hausa, Igbo, Yoruba, and Nigerian Pidgin, about 50 hours per language, capturing real speaker and accent diversity. On this dataset, the paper</description>
    </item>
    <item>
      <title>NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-nvbench-a-benchmark-for-speech-synthesis-with-non/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-nvbench-a-benchmark-for-speech-synthesis-with-non/</guid>
      <description>This paper addresses the lack of a standardized evaluation framework for non-verbal vocalizations (NVV, e.g., laughter, sighs, crying) in speech synthesis (TTS). The authors propose NVBench, a bilingual (English/Chinese) benchmark. Its core methods include: 1) a unified taxonomy covering 45 NVV types; 2) a type-balanced, high-quality bilingual evaluation dataset; 3) a multi-axis evaluation protocol,</description>
    </item>
    <item>
      <title>PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ps-tts-phonetic-synchronization-in-text-to-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ps-tts-phonetic-synchronization-in-text-to-speech/</guid>
      <description>This paper addresses the difficulty in automated dubbing (AD) of synchronizing the target speech with the source speech in both duration and lip shape. Its core contribution is a two-stage text-rewriting method integrated into a TTS system: first, **isochrony** rewriting via a language model ensures the target speech duration matches the source; second, **phonetic synchronization (PS)** is introduced, using dynamic time warping (DTW) and vowel distances learned from the training data,</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-qwen35-omni-technical-report/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-qwen35-omni-technical-report/</guid>
      <description>Qwen3.5-Omni is an omni-modal large language model aiming to unify understanding, reasoning, generation, and action. It **addresses** existing models' limitations in real-time interaction, long-context audio-video processing, streaming speech-generation stability, and multilingual support. **Methodologically**, it builds on the Thinker-Talker architecture, introduces Hybrid MoE for efficiency, and adopts explicit timestamps in place of sparse</description>
    </item>
    <item>
      <title>Spatial-Aware Conditioned Fusion for Audio-Visual Navigation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-spatial-aware-conditioned-fusion-for-audio-visual/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-spatial-aware-conditioned-fusion-for-audio-visual/</guid>
      <description>For two problems in audio-visual navigation (AVN), ambiguous spatial intent toward the goal and visual features lacking auditory-conditioned guidance, this paper proposes the Spatial-Aware Conditioned Fusion (SACF) framework. The framework first designs the Spatially Discretized Localization Descriptor (SDLD),</description>
    </item>
    <item>
      <title>Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-temporal-contrastive-decoding-a-training-free/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-temporal-contrastive-decoding-a-training-free/</guid>
      <description>Unified large audio-language models (LALMs) suffer a "temporal smoothing bias" during autoregressive decoding: brief, transient acoustic cues (e.g., a phone ringing, a plucked string) are easily drowned out by language priors and temporally smooth context, so outputs lack audio specificity. This paper proposes Temporal Contrastive Decoding (TCD), a completely training-free method that takes effect only at inference</description>
    </item>
    <item>
      <title>The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-the-acoustic-camouflage-phenomenon-re-evaluating/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-the-acoustic-camouflage-phenomenon-re-evaluating/</guid>
      <description>This study examines the utility of paralinguistic acoustic features (pitch, jitter, pauses, etc.) in corporate earnings calls for predicting catastrophic stock-price drops. Based on the MAEC dataset, the authors extract features in two modalities: on the text side, FinBERT computes the sentiment-polarity difference (Sentiment Delta) between the scripted opening remarks and the improvised Q&amp;amp;A; on the audio side, variance features of clinical vocal-stress markers are extracted (pitch variance, jitter</description>
    </item>
    <item>
      <title>TinyMU: A Compact Audio-Language Model for Music Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-tinymu-a-compact-audio-language-model-for-music/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-tinymu-a-compact-audio-language-model-for-music/</guid>
      <description>To address the problems that existing large audio-language models (LALMs) have billions of parameters, are costly to train and run, and are hard to deploy on edge devices, this paper proposes TinyMU, a compact music-language model with only 229M parameters. The authors build the MusicSkills-3.5M dataset of 3.5 million music question-answering samples spanning multiple-choice, binary-judgment, and open-ended formats, combined with…</description>
    </item>
    <item>
      <title>VoxMind: An End-to-End Agentic Spoken Dialogue System</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-voxmind-an-end-to-end-agentic-spoken-dialogue/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-voxmind-an-end-to-end-agentic-spoken-dialogue/</guid>
      <description>End-to-end spoken dialogue models have advanced rapidly in natural interaction but generally lack the agent capabilities needed for complex tasks (tool calling, planning, reasoning). This paper first formalizes the four dimensions of an &amp;#34;end-to-end speech agent&amp;#34;: Profile, Memory, Planning, and Action Execution, filling a gap in theoretical standards for the field.</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-20</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20/</guid>
      <description>24 speech/AI papers analyzed in total</description>
    </item>
    <item>
      <title>A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven&#39;s Piano and Cello Sonatas</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-a-manual-bar-by-bar-tempo-measurement-protocol/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-a-manual-bar-by-bar-tempo-measurement-protocol/</guid>
      <description>This paper addresses the systematic failure of existing automated beat-tracking tools on historical polyphonic chamber-music recordings, in particular Beethoven's piano and cello sonatas. Working with a VLSI engineer, the authors design and validate a formalized manual bar-by-bar tempo measurement protocol. The protocol…</description>
    </item>
    <item>
      <title>Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-adaptive-test-time-scaling-for-zero-shot/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-adaptive-test-time-scaling-for-zero-shot/</guid>
      <description>This paper addresses the wasted inference compute of “one-size-fits-all” reasoning in zero-shot respiratory audio classification. It proposes TRIAGE, a three-tier adaptive inference pipeline: the first tier (Tier-L) performs fast label-text similarity matching; if confidence is insufficient, the query escalates to the second tier…</description>
    </item>
    <item>
      <title>An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/</guid>
      <description>This paper targets the core tension in real-time interactive speech synthesis between **high inference latency** and **fragile acoustic quality (especially high-frequency detail)**. Traditional pipelines rely on compute-intensive neural vocoders for waveform reconstruction, and acoustic models based on continuous regression tend to over-smooth the spectrum. To address this,…</description>
    </item>
    <item>
      <title>Audio Source Separation in Reverberant Environments using $β$-divergence based Nonnegative Factorization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-audio-source-separation-in-reverberant/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-audio-source-separation-in-reverberant/</guid>
      <description>For multichannel audio source separation in reverberant environments, this paper proposes a new parameter-estimation method based on β-divergence nonnegative factorization. Traditional methods rely on the expectation-maximization (EM) algorithm to estimate source spectral variances and spatial covariance matrices; this paper instead exploits basis matrices carrying prior information about the source spectra (which can…</description>
    </item>
    <item>
      <title>Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-audio-cogito-towards-deep-audio-reasoning-in/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-audio-cogito-towards-deep-audio-reasoning-in/</guid>
      <description>This paper addresses large audio-language models' (LALMs) inadequate performance on complex audio-reasoning tasks and their reliance on expensive closed-source data. The authors propose a fully open-source solution named **Audio-Cogito**, whose core is **Cogito-Pip…</description>
    </item>
    <item>
      <title>AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-avid-a-benchmark-for-omni-modal-audio-visual/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-avid-a-benchmark-for-omni-modal-audio-visual/</guid>
      <description>This paper addresses the lack of systematic evaluation of current omni-modal large models' ability to understand audio-visual inconsistencies. Existing benchmarks either focus only on audio-visually aligned events or are limited to detecting low-level deepfake artifacts, and cannot assess a model's understanding of semantic-level contradictions in long videos. To this end, the authors…</description>
    </item>
    <item>
      <title>Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-beyond-transcription-unified-audio-schema-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-beyond-transcription-unified-audio-schema-for/</guid>
      <description>This paper addresses the core problem that current audio large language models (AudioLLMs) perform poorly on fine-grained acoustic-perception tasks. The authors point out that the mainstream ASR-centric training paradigm, by mapping audio to plain-text transcripts, systematically discards para…</description>
    </item>
    <item>
      <title>ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-claricodec-optimising-neural-speech-codes-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-claricodec-optimising-neural-speech-codes-for/</guid>
      <description>This paper addresses the severe loss of speech intelligibility in extremely bandwidth-constrained scenarios (e.g., 200 bps) such as satellite and underwater communication. Traditional codecs aim at waveform reconstruction and, at ultra-low bitrates, spend precious bits on unnecessary acoustic detail rather than on core semantic information. To this end,…</description>
    </item>
    <item>
      <title>Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-classical-machine-learning-baselines-for-deepfake/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-classical-machine-learning-baselines-for-deepfake/</guid>
      <description>This paper addresses the lack of transparent, interpretable baselines in deepfake audio detection. The team builds a complete detection pipeline on the Fake-or-Real (FoR) dataset using classical machine-learning methods. From the high-fidelity (44.1 kHz) and tele…</description>
    </item>
    <item>
      <title>Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-comparison-of-window-shapes-and-lengths-in-short/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-comparison-of-window-shapes-and-lengths-in-short/</guid>
      <description>Because heart sound (PCG) signals are non-stationary, short-time features for classification are extracted over sliding windows, yet the choice of window shape and length has lacked systematic study; this paper presents an experimental evaluation of that choice. Using a bidirectional long short-term memory network (biL…</description>
    </item>
    <item>
      <title>Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-contextual-biasing-for-asr-in-speech-llm-with/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-contextual-biasing-for-asr-in-speech-llm-with/</guid>
      <description>This paper addresses speech LLMs' (SLLMs) poor performance when recognizing “bias words” that are rare or unseen in the training data. Traditional methods rely on supplying exact phoneme sequences for bias words (generated by a G2P system), which demands expertise from users and suffers from poor tool compatibility. To this end,…</description>
    </item>
    <item>
      <title>ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-controlfoley-unified-and-controllable-video-to/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-controlfoley-unified-and-controllable-video-to/</guid>
      <description>This paper presents ControlFoley, a unified and controllable video-to-audio generation framework that addresses two weaknesses of existing methods: weak text control under cross-modal conflict, and the entanglement of timbre and timing information in reference-audio control. Its core contributions include: 1) a joint visual-encoding paradigm…</description>
    </item>
    <item>
      <title>CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-cosyncdit-cognitive-synchronous-diffusion/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-cosyncdit-cognitive-synchronous-diffusion/</guid>
      <description>Targeting the pain point in movie dubbing (visual voice cloning) that timbre fidelity and lip synchronization are hard to achieve at once, this paper proposes a flow-matching-based Cognitive Synchronous Diffusion Transformer (CoSyncDiT) framework. Inspired by the cognitive process of professional dubbing actors, the method takes the noise-to-speech…</description>
    </item>
    <item>
      <title>Diffusion Language Models for Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-diffusion-language-models-for-speech-recognition/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-diffusion-language-models-for-speech-recognition/</guid>
      <description>This paper explores a new way of applying diffusion language models (DLMs) to automatic speech recognition (ASR). The central goal is to exploit diffusion models' bidirectional attention and parallel generation to improve the accuracy of ASR candidate hypotheses produced by conventional encoders (e.g., CTC). The paper mainly…</description>
    </item>
    <item>
      <title>Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-dual-axis-generative-reward-model-toward-semantic/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-dual-axis-generative-reward-model-toward-semantic/</guid>
      <description>This paper addresses the core challenge of making full-duplex spoken dialogue models (SDMs) interact like humans. Existing automatic metrics are superficial (e.g., behavior statistics or turn-timing prediction accuracy) and cannot provide reliable reward signals for reinforcement learning, while human evaluation is costly and hard to scale. To this end, the authors propose…</description>
    </item>
    <item>
      <title>Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-elastic-net-regularization-and-gabor-dictionary/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-elastic-net-regularization-and-gabor-dictionary/</guid>
      <description>This paper addresses multi-class classification of heart sound (PCG) signals to aid automated diagnosis of cardiovascular disease. Its core contribution is a feature-extraction framework combining an **optimized Gabor dictionary** with **elastic net regularization**, paired with a **CNN-LSTM deep learning network…</description>
    </item>
    <item>
      <title>Enhancing time-frequency resolution with optimal transport and barycentric fusion of multiple spectrogram</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-enhancing-time-frequency-resolution-with-optimal/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-enhancing-time-frequency-resolution-with-optimal/</guid>
      <description>**Core problem**: spectrograms produced by the short-time Fourier transform (STFT) are bound by the uncertainty principle and cannot achieve excellent time and frequency resolution simultaneously. Traditional fusion methods (e.g., geometric averaging) require the input spectrogram grids to be aligned and offer limited performance. **Core method**: this paper proposes a…</description>
    </item>
    <item>
      <title>Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-few-shot-and-pseudo-label-guided-speech-quality/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-few-shot-and-pseudo-label-guided-speech-quality/</guid>
      <description>This paper addresses the performance bottleneck of non-intrusive speech quality assessment when labeled data is limited. The authors propose the GatherMOS framework, whose core idea is to use a large language model (e.g., GPT-5) as a meta-evaluator that, through carefully designed text prompts, fuses several kinds of heterogeneous signals, including hand…</description>
    </item>
    <item>
      <title>Four Decades of Digital Waveguides</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-four-decades-of-digital-waveguides/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-four-decades-of-digital-waveguides/</guid>
      <description>This paper offers a comprehensive review of four decades of digital waveguide physical modeling: its development, core applications, and latest advances. The central problem it addresses is how to simulate acoustic wave propagation efficiently while preserving physical accuracy, so as to meet the demands of real-time audio processing (e.g., virtual instruments…</description>
    </item>
    <item>
      <title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</guid>
      <description>This paper addresses the problem that current voice-agent evaluation focuses excessively on passive responses while neglecting proactive interaction. To this end, the authors propose **ProVoice-Bench**, the first benchmark framework dedicated to evaluating proactive voice agents. The framework comprises four novel tasks that…</description>
    </item>
    <item>
      <title>Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-geo2sound-a-scalable-geo-aligned-framework-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-geo2sound-a-scalable-geo-aligned-framework-for/</guid>
      <description>This paper introduces **Geo2Sound**, a new task and framework for generating geographically consistent and realistic soundscapes from satellite imagery. **The problem**: existing image-to-audio models face three challenges with top-down satellite views: the lack of struct…</description>
    </item>
    <item>
      <title>Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-hijacking-large-audio-language-models-via-context/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-hijacking-large-audio-language-models-via-context/</guid>
      <description>This paper reveals a new class of security threat against large audio-language models (LALMs): **context-agnostic and imperceptible auditory prompt-injection attacks**. By merely tampering with input audio (e.g., meeting recordings, music clips), an attacker can, without the user's knowledge, hijack the model's behav…</description>
    </item>
    <item>
      <title>Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listen-pause-and-reason-toward-perception/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listen-pause-and-reason-toward-perception/</guid>
      <description>This paper addresses reasoning failures of large audio-language models in complex audio scenes caused by perception errors. Inspired by auditory scene analysis, the authors propose a perception-grounded hybrid reasoning framework. First, they build a new dataset named PAQA via a hierarchical decoupling strategy (distinguishing…</description>
    </item>
    <item>
      <title>Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listening-deepfake-detection-a-new-perspective/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listening-deepfake-detection-a-new-perspective/</guid>
      <description>This paper is the first to propose “listening deepfake detection”, a new task that identifies forged reactions of people in videos while they are listening (not speaking), filling a gap left by existing research that focuses mainly on “speaking” scenarios. To address the scarcity of data for this task, the authors build the first dedicated dataset, Li…</description>
    </item>
    <item>
      <title>MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-moshirag-asynchronous-knowledge-retrieval-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-moshirag-asynchronous-knowledge-retrieval-for/</guid>
      <description>This paper presents MoshiRAG, the first full-duplex speech language model with integrated retrieval-augmented generation. **The problem**: full-duplex speech models struggle to stay factually accurate while remaining interactive in real time. **Core method**: building on the Moshi model, it designs…</description>
    </item>
    <item>
      <title>On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</guid>
      <description>Addressing the imbalanced performance of existing speech variational autoencoders (VAEs) across unified speech reconstruction, understanding, and generation (understanding being especially weak), this paper systematically studies the design space of distillation loss functions. The authors explore three ways of distilling knowledge from self-supervised learning (SSL) models into the VAE…</description>
    </item>
    <item>
      <title>ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-prosdd-learning-prosodic-representations-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-prosdd-learning-prosodic-representations-for/</guid>
      <description>This paper addresses the poor generalization of current speech deepfake detection (SDD) systems when facing expressive and emotional synthetic-speech attacks. Existing methods over-rely on forged data and tend to learn dataset-specific artifacts rather than transferable characteristics of natural speech. To this end, the authors…</description>
    </item>
    <item>
      <title>Room compensation for loudspeaker reproduction using a supporting source</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-room-compensation-for-loudspeaker-reproduction/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-room-compensation-for-loudspeaker-reproduction/</guid>
      <description>Addressing the limitation that traditional room compensation can only correct the spectrum (timbre) and cannot control spatial perception (e.g., the sense of distance), this paper proposes a novel compensation method. By introducing a delayed, spectrally filtered supporting loudspeaker, it selectively adds energy to the room's reverberant sound field, thereby…</description>
    </item>
    <item>
      <title>Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-sky-ear-an-unmanned-aerial-vehicle-enabled-victim/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-sky-ear-an-unmanned-aerial-vehicle-enabled-victim/</guid>
      <description>Addressing the occlusion and high energy consumption that plague vision systems in UAV search-and-rescue missions, this paper presents “Sky-Ear”, an audio-driven victim detection and localization system. The core method is a two-stage processing framework built on a circular microphone array: in the “sentinel stage”, the system uses a single…</description>
    </item>
    <item>
      <title>SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-speakerrpl-v2-robust-open-set-speaker/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-speakerrpl-v2-robust-open-set-speaker/</guid>
      <description>This paper addresses robustness in open-set speaker identification: given only a few enrollment samples per target speaker, the system must both accurately identify known speakers and reliably reject unknown ones. Building on the earlier SpeakerRPL V1 framework, the authors propose three key improvements:…</description>
    </item>
    <item>
      <title>SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-spotsound-enhancing-large-audio-language-models/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-spotsound-enhancing-large-audio-language-models/</guid>
      <description>This paper addresses large audio-language models' weakness in **fine-grained temporal grounding of audio events**. Because training data lacks precise timestamps and benchmarks are overly simple, existing models are unreliable at locating brief events in long audio (a “needle in a haystack”). To this end, the authors propose **S…</description>
    </item>
    <item>
      <title>StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-streammark-a-deep-learning-based-semi-fragile/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-streammark-a-deep-learning-based-semi-fragile/</guid>
      <description>Against the audio-deepfake threat posed by generative AI, this paper proposes a proactive defense framework named StreamMark. The framework is a deep-learning-based semi-fragile audio watermarking system whose core innovation is a redefined watermarking objective: rather than pursuing absolute robustness to all transformations…</description>
    </item>
    <item>
      <title>TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-tokense-a-mamba-based-discrete-token-speech/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-tokense-a-mamba-based-discrete-token-speech/</guid>
      <description>Addressing cochlear implant users' difficulty understanding speech in noisy and reverberant environments, this paper proposes a speech enhancement framework named TokenSE. The framework's core innovation is shifting the speech-enhancement task from the traditional time-frequency or waveform domain to be performed in the discrete token space of a neural audio codec…</description>
    </item>
    <item>
      <title>Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-tora3-trajectory-guided-audio-video-generation/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-tora3-trajectory-guided-audio-video-generation/</guid>
      <description>Addressing the unrealistic motion, desynchronization between sound and motion events, and mismatch between sound intensity and motion intensity found in existing audio-video (AV) generation models, this paper proposes the Tora3 framework. Its core innovation is **treating object trajectories as a shared kinematic prior that links the visual and auditory modalities**…</description>
    </item>
    <item>
      <title>Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-towards-fine-grained-temporal-perception-post/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-towards-fine-grained-temporal-perception-post/</guid>
      <description>This paper addresses large audio-language models' (LALMs) weakness in fine-grained temporal perception, such as precisely locating the start and end times of sound events. The authors propose the **TimePro-RL** framework, whose core is a two-step strategy: first, an **audio-side time prompt (AS…</description>
    </item>
    <item>
      <title>Transformer Based Machine Fault Detection From Audio Input</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-transformer-based-machine-fault-detection-from/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-transformer-based-machine-fault-detection-from/</guid>
      <description>This paper investigates the potential advantages of Transformer-based architectures over conventional convolutional neural networks (CNNs) for detecting machine faults from audio. **The problem**: the inductive biases inherent in how CNNs process spectrograms, such as locality and translation invariance, may not…</description>
    </item>
    <item>
      <title>UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-unipase-a-generative-model-for-universal-speech/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-unipase-a-generative-model-for-universal-speech/</guid>
      <description>This paper addresses the core tension that generative models for universal speech enhancement (USE) face between “high perceptual quality” and “low content hallucination”. The authors propose the UniPASE framework, which extends their earlier low-hallucination PASE model to handle distortions including noise, reverberation, packet loss, wind…</description>
    </item>
    <item>
      <title>VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</guid>
      <description>This paper addresses a fundamental but overlooked problem in speech processing: how to systematically identify the post-processing effects a speech recording has undergone, along with their parameters. In practice, speech almost always passes through effects such as denoising and compression, yet existing datasets lack precise annotations of this kind, hindering research. To this end, the authors…</description>
    </item>
    <item>
      <title>VoxSafeBench: Not Just What Is Said, but Who, How, and Where</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxsafebench-not-just-what-is-said-but-who-how/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxsafebench-not-just-what-is-said-but-who-how/</guid>
      <description>This paper addresses a key problem: once speech language models (SLMs) enter shared multi-user environments, safety alignment based on text content alone is insufficient, because audio context, including speaker identity, paralinguistic cues, and the acoustic scene, can fundamentally change the nature of a request. To this end, the authors propose…</description>
    </item>
    <item>
      <title>WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-wavalign-enhancing-intelligence-and/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-wavalign-enhancing-intelligence-and/</guid>
      <description>This paper addresses the core challenge that end-to-end spoken dialogue models struggle to improve intelligence (IQ) and expressiveness (EQ) at the same time. The authors find that applying uniform preference optimization (e.g., DPO, GRPO) directly to mixed text-speech sequences is problematic: sparse preference signals get drowned in dense…</description>
    </item>
    <item>
      <title>Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-who-is-speaking-or-who-is-depressed-a-controlled/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-who-is-speaking-or-who-is-depressed-a-controlled/</guid>
      <description>The core contribution of this paper is to systematically expose and quantify the pervasive “speaker identity leakage” problem in speech-based depression detection models. The authors argue that many models reporting high accuracy may owe their performance largely to memorizing speaker identity (voiceprints) rather than to depression-related acoustic…</description>
    </item>
    <item>
      <title>Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</guid>
      <description>This paper takes a close look at a common trade-off in end-to-end audio language models: fusing visual information into the audio tokenizer improves understanding but degrades reconstruction quality. Through systematic experiments, the authors reveal three key findings: the fusion position (before or after quantization) matters greatly; in the disc…</description>
    </item>
    <item>
      <title>X-VC: Zero-shot Streaming Voice Conversion in Codec Space</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-x-vc-zero-shot-streaming-voice-conversion-in/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-x-vc-zero-shot-streaming-voice-conversion-in/</guid>
      <description>This paper addresses the core challenge in zero-shot voice conversion of achieving **high-fidelity speaker transfer** and **low-latency streaming inference** at once. The authors propose the **X-VC** system, whose core innovation is to **operate in the latent space of a pretrained neural codec (SAC) with a one-step…</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-19</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19/</guid>
      <description>42 speech/AI papers analyzed in total</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-18</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-18/</link>
      <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-18/</guid>
      <description>39 speech/AI papers analyzed in total</description>
    </item>
  </channel>
</rss>
