<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Benchmarking on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E5%9F%BA%E5%87%86%E6%B5%8B%E8%AF%95/</link>
    <description>Recent content in Benchmarking on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E5%9F%BA%E5%87%86%E6%B5%8B%E8%AF%95/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Framework for Controlled Multi-Speaker Audio Synthesis for Robustness Evaluation of Speaker Diarisation Systems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-framework-for-controlled-multi-speaker-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-framework-for-controlled-multi-speaker-audio/</guid>
      <description>Speaker diarisation | 7.5/10</description>
    </item>
    <item>
      <title>A Superb-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-superb-style-benchmark-of-self-supervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-superb-style-benchmark-of-self-supervised/</guid>
      <description>Audio deepfake detection | 7.0/10</description>
    </item>
    <item>
      <title>Aligning Generative Speech Enhancement with Perceptual Feedback</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aligning-generative-speech-enhancement-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aligning-generative-speech-enhancement-with/</guid>
      <description>Speech enhancement | 7.5/10</description>
    </item>
    <item>
      <title>AMBISONIC-DML: A Benchmark Dataset for Dynamic Higher-Order Ambisonics Music with Motion-Aligned Stems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ambisonic-dml-a-benchmark-dataset-for-dynamic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ambisonic-dml-a-benchmark-dataset-for-dynamic/</guid>
      <description>Dataset | 7.5/10</description>
    </item>
    <item>
      <title>AQUA-Bench: Beyond finding answers to knowing when there are None in Audio Question Answering</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aqua-bench-beyond-finding-answers-to-knowing-when/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aqua-bench-beyond-finding-answers-to-knowing-when/</guid>
      <description>Audio question answering | 7.0/10</description>
    </item>
    <item>
      <title>AR-BSNet: Towards Ultra-Low Complexity Autoregressive Target Speaker Extraction With Band-Split Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ar-bsnet-towards-ultra-low-complexity/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ar-bsnet-towards-ultra-low-complexity/</guid>
      <description>Speech separation | 7.0/10</description>
    </item>
    <item>
      <title>Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-assessing-identity-leakage-in-talking-face/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-assessing-identity-leakage-in-talking-face/</guid>
      <description>Talking face generation | 7.5/10</description>
    </item>
    <item>
      <title>Audio-Visual Deepfake Generation and Detection: An Exploratory Survey</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-visual-deepfake-generation-and-detection-an/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-visual-deepfake-generation-and-detection-an/</guid>
      <description>Audio deepfake detection | 6.5/10</description>
    </item>
    <item>
      <title>Auditory Illusion Benchmark for Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auditory-illusion-benchmark-for-large-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auditory-illusion-benchmark-for-large-audio/</guid>
      <description>Model evaluation | 7.0/10</description>
    </item>
    <item>
      <title>Benchmarking Music Autotagging with MGPHot Expert Annotations vs. Generic Tag Datasets</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-benchmarking-music-autotagging-with-mgphot-expert/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-benchmarking-music-autotagging-with-mgphot-expert/</guid>
      <description>Music information retrieval | 7.5/10</description>
    </item>
    <item>
      <title>Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-face-swapping-a-diffusion-based-digital/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-face-swapping-a-diffusion-based-digital/</guid>
      <description>Audio deepfake detection | 8.1/10</description>
    </item>
    <item>
      <title>Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-can-large-audio-language-models-understand-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-can-large-audio-language-models-understand-audio/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-clawmark-a-living-world-benchmark-for-multi-turn/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-clawmark-a-living-world-benchmark-for-multi-turn/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>Combining SSL Speech Features, Contextual Transformers and Mamba Models for Realistic Audio Spoofing Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-ssl-speech-features-contextual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-ssl-speech-features-contextual/</guid>
      <description>Audio deepfake detection | 7.5/10</description>
    </item>
    <item>
      <title>Content-Preserving Speech Representation Learning Via Adaptive Segment-Level Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-preserving-speech-representation-learning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-preserving-speech-representation-learning/</guid>
      <description>Speech recognition | 7.5/10</description>
    </item>
    <item>
      <title>CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cova-text-guided-composed-video-retrieval-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cova-text-guided-composed-video-retrieval-for/</guid>
      <description>Cross-modal retrieval | 6.5/10</description>
    </item>
    <item>
      <title>Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-cultural-bias-in-mel-scale-representations/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-cultural-bias-in-mel-scale-representations/</guid>
      <description>Speech recognition | 7.0/10</description>
    </item>
    <item>
      <title>Cross-Lingual Interleaving for Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-interleaving-for-speech-language/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-interleaving-for-speech-language/</guid>
      <description>Large speech models | 7.5/10</description>
    </item>
    <item>
      <title>Do Bias Benchmarks Generalise? Evidence from Voice-Based Evaluation of Gender Bias in SpeechLLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-bias-benchmarks-generalise-evidence-from-voice/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-bias-benchmarks-generalise-evidence-from-voice/</guid>
      <description>Model evaluation | 8.0/10</description>
    </item>
    <item>
      <title>EchoFake: A Replay-Aware Dataset For Practical Speech Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-echofake-a-replay-aware-dataset-for-practical/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-echofake-a-replay-aware-dataset-for-practical/</guid>
      <description>Audio deepfake detection | 8.5/10</description>
    </item>
    <item>
      <title>Enhancing Audio Question-Answering Performance Through Log-Likelihood Guided Reward Functions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-audio-question-answering-performance/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-audio-question-answering-performance/</guid>
      <description>Audio question answering | 8.5/10</description>
    </item>
    <item>
      <title>Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-bias-in-spoken-dialogue-llms-for-real/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-bias-in-spoken-dialogue-llms-for-real/</guid>
      <description>Model evaluation | 7.0/10</description>
    </item>
    <item>
      <title>Evaluating Compositional Structure in Audio Representations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-compositional-structure-in-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-compositional-structure-in-audio/</guid>
      <description>Model evaluation | 7.0/10</description>
    </item>
    <item>
      <title>Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-emotion-recognition-in-spoken-language/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-emotion-recognition-in-spoken-language/</guid>
      <description>Speech emotion recognition | 7.5/10</description>
    </item>
    <item>
      <title>Evaluating Pretrained Speech Embedding Systems for Dysarthria Detection Across Heterogenous Datasets</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-pretrained-speech-embedding-systems/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-pretrained-speech-embedding-systems/</guid>
      <description>Speech biomarkers | 7.5/10</description>
    </item>
    <item>
      <title>Face-Voice Association with Inductive Bias for Maximum Class Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-face-voice-association-with-inductive-bias-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-face-voice-association-with-inductive-bias-for/</guid>
      <description>Speaker verification | 7.0/10</description>
    </item>
    <item>
      <title>Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fake-speech-wild-detecting-deepfake-speech-on/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fake-speech-wild-detecting-deepfake-speech-on/</guid>
      <description>Speech spoofing detection | 7.0/10</description>
    </item>
    <item>
      <title>FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-flowse-grpo-training-flow-matching-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-flowse-grpo-training-flow-matching-speech/</guid>
      <description>Speech enhancement | 7.5/10</description>
    </item>
    <item>
      <title>FoleyBench: A Benchmark for Video-to-Audio Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-foleybench-a-benchmark-for-video-to-audio-models/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-foleybench-a-benchmark-for-video-to-audio-models/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>From Human Speech to Ocean Signals: Transferring Speech Large Models for Underwater Acoustic Target Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-human-speech-to-ocean-signals-transferring/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-human-speech-to-ocean-signals-transferring/</guid>
      <description>Underwater acoustic target recognition | 7.0/10</description>
    </item>
    <item>
      <title>Game-Time: Evaluating Temporal Dynamics in Spoken Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-game-time-evaluating-temporal-dynamics-in-spoken/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-game-time-evaluating-temporal-dynamics-in-spoken/</guid>
      <description>Spoken dialogue systems | 7.5/10</description>
    </item>
    <item>
      <title>Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hashing-baseline-rethinking-hashing-in-the-age-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hashing-baseline-rethinking-hashing-in-the-age-of/</guid>
      <description>Audio retrieval, audio classification | 8.0/10</description>
    </item>
    <item>
      <title>ICASSP 2026 - Benchmarking Paper List</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-010/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-010/</guid>
      <description>5 ICASSP 2026 papers in the benchmarking area</description>
    </item>
    <item>
      <title>Leveraging prediction entropy for Automatic prompt weighting in Zero-Shot Audio-Language Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-prediction-entropy-for-automatic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-prediction-entropy-for-automatic/</guid>
      <description>Audio classification | 7.5/10</description>
    </item>
    <item>
      <title>LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-longspeech-a-scalable-benchmark-for-transcription/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-longspeech-a-scalable-benchmark-for-transcription/</guid>
      <description>Benchmarking | 7.8/10</description>
    </item>
    <item>
      <title>Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-measuring-prosody-diversity-in-zero-shot-tts-a/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-measuring-prosody-diversity-in-zero-shot-tts-a/</guid>
      <description>Speech synthesis | 8.0/10</description>
    </item>
    <item>
      <title>Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-shared-private-branch-imbalance-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-shared-private-branch-imbalance-via/</guid>
      <description>Multimodal models | 7.5/10</description>
    </item>
    <item>
      <title>MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmeb-v3-measuring-the-performance-gaps-of-omni/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmeb-v3-measuring-the-performance-gaps-of-omni/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>Multi-Layer Attentive Probing Improves Transfer of Audio Representations for Bioacoustics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-layer-attentive-probing-improves-transfer/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-layer-attentive-probing-improves-transfer/</guid>
      <description>Bioacoustics | 7.5/10</description>
    </item>
    <item>
      <title>MusiCRS: Benchmarking Audio-Centric Conversational Recommendation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musicrs-benchmarking-audio-centric-conversational/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musicrs-benchmarking-audio-centric-conversational/</guid>
      <description>Music recommendation | 7.5/10</description>
    </item>
    <item>
      <title>PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-psp-an-interpretable-per-dimension-accent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-psp-an-interpretable-per-dimension-accent/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>RHO-PERFECT: Correlation Ceiling for Subjective Evaluation Datasets</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rho-perfect-correlation-ceiling-for-subjective/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rho-perfect-correlation-ceiling-for-subjective/</guid>
      <description>Model evaluation | 7.5/10</description>
    </item>
    <item>
      <title>RMODGDF: A Robust STFT-Derived Feature for Musical Instrument Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rmodgdf-a-robust-stft-derived-feature-for-musical/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rmodgdf-a-robust-stft-derived-feature-for-musical/</guid>
      <description>Music information retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Savgbench: Benchmarking Spatially Aligned Audio-Video Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-savgbench-benchmarking-spatially-aligned-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-savgbench-benchmarking-spatially-aligned-audio/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>SED: Structural Entropy Based Speech Discretization for Discrete Token-Based ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sed-structural-entropy-based-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sed-structural-entropy-based-speech/</guid>
      <description>Speech recognition | 6.5/10</description>
    </item>
    <item>
      <title>SingMOS-Pro: An Comprehensive Benchmark For Singing Quality Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-singmos-pro-an-comprehensive-benchmark-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-singmos-pro-an-comprehensive-benchmark-for/</guid>
      <description>Singing voice synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sp-mcqa-evaluating-intelligibility-of-tts-beyond/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sp-mcqa-evaluating-intelligibility-of-tts-beyond/</guid>
      <description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Step-Audio-R1.5 Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-step-audio-r15-technical-report/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-step-audio-r15-technical-report/</guid>
      <description>Spoken dialogue systems | 8.0/10</description>
    </item>
    <item>
      <title>Streamingbench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streamingbench-assessing-the-gap-for-mllms-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streamingbench-assessing-the-gap-for-mllms-to/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stylebench-evaluating-speech-language-models-on/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stylebench-evaluating-speech-language-models-on/</guid>
      <description>Benchmarking | 8.5/10</description>
    </item>
    <item>
      <title>SURE: Synergistic Uncertainty-Aware Reasoning for Multimodal Emotion Recognition in Conversations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sure-synergistic-uncertainty-aware-reasoning-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sure-synergistic-uncertainty-aware-reasoning-for/</guid>
      <description>Speech emotion recognition | 7.5/10</description>
    </item>
    <item>
      <title>Taming Audio VAEs via Target-KL Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-taming-audio-vaes-via-target-kl-regularization/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-taming-audio-vaes-via-target-kl-regularization/</guid>
      <description>Audio generation | 6.5/10</description>
    </item>
    <item>
      <title>Target Speaker Anonymization in Multi-Speaker Recordings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-target-speaker-anonymization-in-multi-speaker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-target-speaker-anonymization-in-multi-speaker/</guid>
      <description>Speech anonymization | 7.6/10</description>
    </item>
    <item>
      <title>TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tau-a-benchmark-for-cultural-sound-understanding/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tau-a-benchmark-for-cultural-sound-understanding/</guid>
      <description>Audio question answering | 7.5/10</description>
    </item>
    <item>
      <title>TextlessRAG: End-to-End Visual Document RAG by Speech without Text</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-textlessrag-end-to-end-visual-document-rag-by/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-textlessrag-end-to-end-visual-document-rag-by/</guid>
      <description>Spoken question answering | 8.5/10</description>
    </item>
    <item>
      <title>The 3rd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing aid Speech Intelligibility Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-3rd-clarity-prediction-challenge-a-machine/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-3rd-clarity-prediction-challenge-a-machine/</guid>
      <description>Speech enhancement | 7.5/10</description>
    </item>
    <item>
      <title>The Muse Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-muse-benchmark-probing-music-perception-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-muse-benchmark-probing-music-perception-and/</guid>
      <description>Music understanding | 8.5/10</description>
    </item>
    <item>
      <title>The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion to Singing Style Conversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-singing-voice-conversion-challenge-2025-from/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-singing-voice-conversion-challenge-2025-from/</guid>
      <description>Singing voice conversion | 7.0/10</description>
    </item>
    <item>
      <title>The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-orthographically-informed-evaluation-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-orthographically-informed-evaluation-of/</guid>
      <description>Speech recognition | 7.0/10</description>
    </item>
    <item>
      <title>Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-walking-through-uncertainty-an-empirical-study-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-walking-through-uncertainty-an-empirical-study-of/</guid>
      <description>Audio question answering | 7.5/10</description>
    </item>
    <item>
      <title>When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-silence-matters-the-impact-of-irrelevant/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-silence-matters-the-impact-of-irrelevant/</guid>
      <description>Model evaluation | 7.0/10</description>
    </item>
    <item>
      <title>When Voice Matters: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-voice-matters-a-controlled-study-of-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-voice-matters-a-controlled-study-of-audio/</guid>
      <description>模型评估 | 7.0/10</description>
    </item>
    <item>
      <title>Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-why-do-speech-language-models-fail-to-generate/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-why-do-speech-language-models-fail-to-generate/</guid>
      <description>语音生成 | 7.0/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-29</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29/</guid>
      <description>共分析 29 篇语音/AI 论文</description>
    </item>
    <item>
      <title>Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-listening-with-time-precise-temporal-awareness/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-listening-with-time-precise-temporal-awareness/</guid>
      <description>音频场景理解 | 8.0/10</description>
    </item>
    <item>
      <title>Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-full-duplex-interaction-in-spoken-dialogue/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-full-duplex-interaction-in-spoken-dialogue/</guid>
      <description>语音对话系统 | 6.5/10</description>
    </item>
    <item>
      <title>Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-listening-with-time-precise-temporal-awareness/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-listening-with-time-precise-temporal-awareness/</guid>
      <description>音频场景理解 | 8.0/10</description>
    </item>
    <item>
      <title>TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-tts-prism-a-perceptual-reasoning-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-tts-prism-a-perceptual-reasoning-and/</guid>
      <description>语音质量评估 | 7.5/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-27</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27/</guid>
      <description>共分析 13 篇语音/AI 论文</description>
    </item>
    <item>
      <title>AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-audita-a-new-dataset-to-audit-humans-vs-ai-skill/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-audita-a-new-dataset-to-audit-humans-vs-ai-skill/</guid>
      <description>音频问答 | 6.5/10</description>
    </item>
    <item>
      <title>Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-do-llm-decoders-listen-fairly-benchmarking-how/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-do-llm-decoders-listen-fairly-benchmarking-how/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Evaluation of Automatic Speech Recognition Using Generative Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-evaluation-of-automatic-speech-recognition-using/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-evaluation-of-automatic-speech-recognition-using/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-full-duplex-interaction-in-spoken-dialogue/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-full-duplex-interaction-in-spoken-dialogue/</guid>
      <description>语音对话系统 | 6.5/10</description>
    </item>
    <item>
      <title>MER 2026: From Discriminative Emotion Recognition to Generative Emotion Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-mer-2026-from-discriminative-emotion-recognition/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-mer-2026-from-discriminative-emotion-recognition/</guid>
      <description>语音情感识别 | 6.0/10</description>
    </item>
    <item>
      <title>Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-preferences-of-a-voice-first-nation-large-scale/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-preferences-of-a-voice-first-nation-large-scale/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>音乐生成 | 7.0/10</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-24</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24/</guid>
      <description>共分析 21 篇语音/AI 论文</description>
    </item>
    <item>
      <title>ATIR: Towards Audio-Text Interleaved Contextual Retrieval</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-atir-towards-audio-text-interleaved-contextual/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-atir-towards-audio-text-interleaved-contextual/</guid>
      <description>这篇论文旨在解决现有音频-文本检索方法无法处理查询和文档中音频与文本交错出现（如多轮对话、混合输入）的局限性。为此，作者定义了音频-文本交错上下文检索（ATIR）任务，并构建了一个包含约8.8万对样本的大规模基准。为解决直接应用多模态大语言模型（MLLM）时音频token冗余导致的效率和精度问题，论</description>
    </item>
    <item>
      <title>Environmental Sound Deepfake Detection Using Deep-Learning Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-environmental-sound-deepfake-detection-using-deep/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-environmental-sound-deepfake-detection-using-deep/</guid>
      <description>1.  **问题**：针对环境声音（包括声音场景和声音事件）的深度伪造检测（ESDD）任务，现有研究不足，且尚不清楚声音场景与声音事件的伪造检测是否需要不同模型。 2.  **方法核心**：提出一个深度学习框架，核心是采用预训练的音频模型（BEATs）作为特征提取器，并结合一种三阶段训练策略（包含对</description>
    </item>
    <item>
      <title>ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-onote-benchmarking-omnimodal-notation-processing/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-onote-benchmarking-omnimodal-notation-processing/</guid>
      <description>1.  **问题**：当前多模态大模型在音乐符号处理（Omnimodal Notation Processing, ONP）领域存在严重缺陷：研究碎片化、模型存在严重的符号偏差（偏向五线谱）、且普遍依赖不可靠的“LLM-as-a-Judge”评估方法，掩盖了模型在音乐理论推理上的系统性失败。 2. </description>
    </item>
    <item>
      <title>SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-sand-the-challenge-on-speech-analysis-for/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-sand-the-challenge-on-speech-analysis-for/</guid>
      <description>1.  **解决的问题**：针对神经退行性疾病（特别是肌萎缩侧索硬化症ALS）的早期诊断和监测，缺乏大规模、有临床标注的语音数据集，以及标准化的算法评估框架。 2.  **方法核心**：构建并发布了名为SAND的挑战赛，其核心是提供一个扩展的、包含纵向数据的ALS患者与健康对照语音数据集（VOC-A</description>
    </item>
    <item>
      <title>SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-speechparaling-bench-a-comprehensive-benchmark/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-speechparaling-bench-a-comprehensive-benchmark/</guid>
      <description>1.  **问题**：现有大型音频语言模型在副语言（如情绪、语气、音色）生成与理解能力上的评估存在特征覆盖不全、评估方法主观且不可扩展的问题。 2.  **方法**：提出了SpeechParaling-Bench，一个包含1000余个中英平行语音查询、覆盖超过100个细粒度副语言特征的综合基准。基准</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-23</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23/</guid>
      <description>共分析 27 篇语音/AI 论文</description>
    </item>
    <item>
      <title>HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-halluaudio-a-comprehensive-benchmark-for/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-halluaudio-a-comprehensive-benchmark-for/</guid>
      <description>这篇论文旨在解决大型音频语言模型（LALM）中普遍存在的“幻觉”问题（即生成与音频证据不符的内容）缺乏系统性评估工具的难题。为此，作者构建并发布了**HalluAudio**，这是首个大规模、多领域（语音、环境声、音乐）、多任务（二分类、多选、属性验证、开放生成）的人工验证音频幻觉检测基准，包含超过</description>
    </item>
    <item>
      <title>MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-mtr-duplexbench-towards-a-comprehensive/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-mtr-duplexbench-towards-a-comprehensive/</guid>
      <description>这篇论文旨在解决当前全双工语音语言模型（FD-SLMs）评测体系的一个关键缺陷：缺乏对多轮、连续对话能力的系统性评估。现有基准多关注单轮交互或特定对话特性（如打断），忽略了模型在多轮语境下维持指令遵循、安全等核心能力的一致性。为此，作者提出了**MTR-DuplexBench**，一个全新的多轮全双</description>
    </item>
    <item>
      <title>NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-nvbench-a-benchmark-for-speech-synthesis-with-non/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-nvbench-a-benchmark-for-speech-synthesis-with-non/</guid>
      <description>这篇论文旨在解决语音合成（TTS）领域中一个关键但被忽视的问题：如何标准化评估系统生成非语言声音（NVV，如笑声、叹息）的能力。作者提出了**NVBench**，一个包含**45类NVV统一分类体系**的双语（英/中）基准。其核心方法包括：1）构建了一个每类50例、总计4500例的高质量平衡评估数据</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</guid>
      <description>本文针对文本转语音（TTS）任务，提出了一种名为“细节链”（Chain-of-Details, CoD）的新框架。**要解决的问题**是现有TTS方法在建模语音生成的时域动态（从粗略时序到精细声学细节的渐进过程）方面存在不足。**使用的方法**是将语音生成分解为多个时间分辨率递增的阶段，在每个阶段使</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-22</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22/</guid>
      <description>共分析 21 篇语音/AI 论文</description>
    </item>
    <item>
      <title>Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-benign-fine-tuning-breaks-safety-alignment-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-benign-fine-tuning-breaks-safety-alignment-in/</guid>
      <description>这篇论文首次系统研究了**良性音频数据微调对音频大模型安全对齐的破坏性影响**。核心问题是：用户出于提升性能的目的，在完全无害的音频数据上微调模型，是否会意外削弱其拒绝有害指令的能力？作者提出了一个**基于嵌入空间邻近性的过滤框架**，通过计算良性音频与有害音频在模型内部或外部参考编码器空间中的距离</description>
    </item>
    <item>
      <title>BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-bhashasutra-a-task-centric-unified-survey-of/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-bhashasutra-a-task-centric-unified-survey-of/</guid>
      <description>这篇论文旨在解决印度语言NLP研究资源分散、缺乏统一概览的痛点。作者首次提出了一个以任务为中心的统一分类体系，系统性地梳理和整合了超过200个数据集、50个基准测试以及100多个模型、工具和系统，覆盖了从核心语言处理（如分词、词性标注）到文本分类、生成翻译、信息检索、语音与多模态，乃至社会文化任务（</description>
    </item>
    <item>
      <title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-from-reactive-to-proactive-assessing-the/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-from-reactive-to-proactive-assessing-the/</guid>
      <description>本文旨在解决现有语音代理评估基准主要关注被动响应，而忽略其主动感知与干预能力的问题。作者提出了**ProVoice-Bench**，这是首个专门用于评估主动式语音代理的基准测试框架。该框架通过一个包含数字状态构建、场景合成、对话生成、声学模拟和对话组装的多阶段数据合成管道，构建了包含1182个高质量</description>
    </item>
    <item>
      <title>HCFD: A Benchmark for Audio Deepfake Detection in Healthcare</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-hcfd-a-benchmark-for-audio-deepfake-detection-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-hcfd-a-benchmark-for-audio-deepfake-detection-in/</guid>
      <description>本文针对医疗健康领域中神经音频编解码器生成的语音深伪检测问题，提出了一个全新的研究任务（HCFD）和基准数据集（HCFK）。研究发现，在健康语音上训练的现有深伪检测模型在病态语音上性能显著下降。为此，论文首先验证了预训练音频模型（如PaSST）能更好地应对病理语音带来的变异性。更重要的是，本文提出了</description>
    </item>
    <item>
      <title>MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mint-bench-a-comprehensive-multilingual-benchmark/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mint-bench-a-comprehensive-multilingual-benchmark/</guid>
      <description>这篇论文旨在解决指令跟随文本转语音（TTS）领域缺乏系统化评估工具的问题。当前评估存在覆盖不全、诊断粒度粗、多语言支持弱等缺陷。为此，作者提出了**MINT-Bench**，一个全面的多语言基准测试。其核心方法包括：1）一个基于10种原子声学属性的**分层多轴分类法**，系统性地组织了从简单到复杂（</description>
    </item>
    <item>
      <title>Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-omni-embed-audio-leveraging-multimodal-llms-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-omni-embed-audio-leveraging-multimodal-llms-for/</guid>
      <description>这篇论文旨在解决当前音频-文本检索模型在**真实、多样化用户查询**下性能下降的问题。作者指出，现有基准测试（如AudioCaps, Clotho）依赖描述性标题式查询，与真实世界中简短、多变的搜索行为（如问题、命令、关键词、排除性查询）存在巨大差距。为此，论文提出了两大核心贡献：1) **Omni</description>
    </item>
    <item>
      <title>Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-still-between-us-evaluating-and-improving-voice/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-still-between-us-evaluating-and-improving-voice/</guid>
      <description>本文旨在解决语音语言模型（SLMs）在真实场景中无法有效区分主要用户与第三方插入语音（Third-Party Interruption, TPI）的问题，这会导致上下文理解失败。为此，作者首先创建了 **TPI-Train**，一个包含8.8万个样本的训练数据集，其核心设计是“说话人感知的难负例”，</description>
    </item>
    <item>
      <title>VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-vibe-voice-induced-open-ended-bias-evaluation-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-vibe-voice-induced-open-ended-bias-evaluation-for/</guid>
      <description>这篇论文旨在解决大型音频语言模型（LALM）在开放生成任务中社会偏见评估不足的问题。现有基准多依赖合成语音和选择题（MCQ），无法捕捉模型在真实交互中自然流露的刻板印象。为此，作者提出了**VIBE**框架，其核心是使用**真实人声录音**输入模型，并通过**开放生成任务**（如故事创作、个性化推荐</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>本文针对现有视频到音乐（V2M）生成模型缺乏对创作者风格、主题等细粒度意图控制的问题，提出了Video-Robin，一个结合文本提示的视频配乐框架。其核心方法是将生成过程解耦为两个阶段：首先，一个多模态自回归规划头（AR-Head）整合视频帧和文本提示，通过语义语言模型、有限标量量化（FSQ）和残差</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-21</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21/</guid>
      <description>共分析 34 篇语音/AI 论文</description>
    </item>
    <item>
      <title>ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-actormind-emulating-human-actor-reasoning-for/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-actormind-emulating-human-actor-reasoning-for/</guid>
      <description>这篇论文旨在解决现有角色扮演研究局限于文本模态，而忽视了日常交流中主导的语音模态的问题。为此，作者首先**定义了“语音角色扮演”任务**，要求模型能根据角色、场景和对话历史，生成带有个性化语音特征（如特定情感、语调）的自发性回应。为此，他们构建了**ActorMindBench**，这是一个基于《老</description>
    </item>
    <item>
      <title>Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-full-duplex-bench-v3-benchmarking-tool-use-for/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-full-duplex-bench-v3-benchmarking-tool-use-for/</guid>
      <description>这篇论文针对当前全双工语音代理评估缺乏真实性（依赖合成语音）和任务简单性（单步调用）的问题，提出了**Full-Duplex-Bench-v3 (FDB-v3)** 基准。该基准的核心创新在于使用**100条真实人类录音**（含五种不流畅性注释），在四个任务域中设计了需要**多步API链式调用**的</description>
    </item>
    <item>
      <title>HARNESS: Lightweight Distilled Arabic Speech Foundation Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</guid>
      <description>这篇论文针对阿拉伯语语音识别、方言识别和情感识别中通用多语言/英语模型性能不足、且大模型难以部署的问题，提出了 HArnESS——一个以阿拉伯语为中心的自监督语音模型家族。作者采用 HuBERT 风格的迭代自蒸馏框架，先在大规模阿拉伯语-英语双语数据（约 23K 小时）上训练 24 层的教师模型 H</description>
    </item>
    <item>
      <title>MUSCAT: MUltilingual, SCientific ConversATion Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/</guid>
      <description>本文提出了 MUSCAT，一个用于评估多语言科学对话场景下自动语音识别（ASR）性能的新基准。数据集包含 6 组双语对话录音（共约 65 分钟，9,066 词），涉及英语与德语、土耳其语、中文、越南语的配对对话；每组对话使用 Meeting Owl 3、ReSpeaker USB 麦克风阵列和 Me</description>
    </item>
    <item>
      <title>NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-naijas2st-a-multi-accent-benchmark-for-speech-to/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-naijas2st-a-multi-accent-benchmark-for-speech-to/</guid>
      <description>这篇论文旨在解决非洲低资源语言在语音翻译（S2ST和S2TT）研究中面临的高质量、多口音平行语音数据严重匮乏的核心瓶颈。为此，作者构建了**NaijaS2ST**数据集，涵盖豪萨语、伊博语、约鲁巴语和尼日利亚皮钦语与英语的平行语音，每种语言约50小时，捕获了真实的说话者与口音多样性。基于此数据集，论</description>
    </item>
    <item>
      <title>Spatial-Aware Conditioned Fusion for Audio-Visual Navigation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-spatial-aware-conditioned-fusion-for-audio-visual/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-spatial-aware-conditioned-fusion-for-audio-visual/</guid>
      <description>本论文针对音频-视觉导航（AVN）中目标空间意图模糊、视觉特征缺乏听觉条件引导两大问题，提出了 Spatial-Aware Conditioned Fusion（SACF）框架。该框架首先设计了 Spatially Discretized Localization Descriptor（SDLD），</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-20</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20/</guid>
      <description>共分析 24 篇语音/AI 论文</description>
    </item>
    <item>
      <title>AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-avid-a-benchmark-for-omni-modal-audio-visual/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-avid-a-benchmark-for-omni-modal-audio-visual/</guid>
      <description>这篇论文旨在解决当前全模态大模型在音视频不一致性理解能力上缺乏系统性评估的问题。现有基准要么只关注音视频对齐事件，要么局限于检测深度伪造中的低级伪影，无法评估模型对长视频中语义级矛盾的理解。为此，作者</description>
    </item>
    <item>
      <title>Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-classical-machine-learning-baselines-for-deepfake/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-classical-machine-learning-baselines-for-deepfake/</guid>
      <description>本文旨在解决深度伪造音频检测领域缺乏透明、可解释基线的问题。研究团队采用经典机器学习方法，在Fake-or-Real (FoR) 数据集上构建了一个完整的检测流程。他们从高保真（44.1 kHz）和电</description>
    </item>
    <item>
      <title>Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-comparison-of-window-shapes-and-lengths-in-short/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-comparison-of-window-shapes-and-lengths-in-short/</guid>
      <description>本文针对心音信号（PCG）分类任务中，因信号非平稳性而采用滑动窗口分段提取特征时，窗函数形状和长度选择缺乏系统性研究的问题，进行了一项实验性评估。作者使用双向长短期记忆网络（biL</description>
    </item>
    <item>
      <title>ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-controlfoley-unified-and-controllable-video-to/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-controlfoley-unified-and-controllable-video-to/</guid>
      <description>本文提出了ControlFoley，一个统一且可控的视频到音频生成框架，旨在解决现有方法在跨模态冲突下文本控制力弱、以及参考音频控制中音色与时间信息纠缠的问题。其核心贡献包括：1）提出联合视觉编码范式</description>
    </item>
    <item>
      <title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</guid>
      <description>本文旨在解决当前语音代理评估中过度关注被动响应，而忽视其主动交互能力的问题。为此，作者提出了首个专门评估主动语音代理的基准测试框架 **ProVoice-Bench**。该框架包含四个新颖的任务，用以</description>
    </item>
    <item>
      <title>Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-geo2sound-a-scalable-geo-aligned-framework-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-geo2sound-a-scalable-geo-aligned-framework-for/</guid>
      <description>这篇论文提出了一个名为 **Geo2Sound** 的新任务和框架，旨在从卫星图像生成地理上一致且逼真的声音景观。**要解决的问题**是现有图像到音频模型在处理自上而下的卫星视图时面临三大挑战：缺乏结</description>
    </item>
    <item>
      <title>SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-spotsound-enhancing-large-audio-language-models/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-spotsound-enhancing-large-audio-language-models/</guid>
      <description>本文旨在解决大型音频语言模型在**细粒度音频事件时间定位**上的不足。现有模型因训练数据缺乏精确时间戳、基准测试过于简单，导致在长音频中定位短暂事件（“大海捞针”）时表现不可靠。为此，作者提出了**S</description>
    </item>
    <item>
      <title>VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</guid>
      <description>本文旨在解决语音处理中一个基础但被忽视的问题：如何系统化地识别语音音频所经过的后期处理效果及其参数。现实中，语音几乎都经过了降噪、压缩等效果处理，但现有数据集缺乏此类精确标注，阻碍了相关研究。为此，作</description>
    </item>
    <item>
      <title>VoxSafeBench: Not Just What Is Said, but Who, How, and Where</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxsafebench-not-just-what-is-said-but-who-how/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxsafebench-not-just-what-is-said-but-who-how/</guid>
      <description>这篇论文旨在解决一个关键问题：当语音大模型（SLM）进入多用户共享环境时，仅基于文本内容的安全对齐策略是不足的，说话人身份、副语言特征和声学场景等音频上下文信息会根本性地改变请求的性质。为此，作者提出</description>
    </item>
    <item>
      <title>Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-who-is-speaking-or-who-is-depressed-a-controlled/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-who-is-speaking-or-who-is-depressed-a-controlled/</guid>
      <description>这篇论文的核心贡献在于系统性地揭示并量化了语音抑郁症检测模型中普遍存在的“说话人身份泄露”问题。作者指出，当前许多报告高准确率的模型，其性能可能严重依赖于对说话人身份（声纹）的记忆，而非对抑郁相关声学</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-19</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19/</guid>
      <description>共分析 42 篇语音/AI 论文</description>
    </item>
    <item>
      <title>语音/音频论文速递 2026-04-18</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-18/</link>
      <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-18/</guid>
      <description>共分析 39 篇语音/AI 论文</description>
    </item>
  </channel>
</rss>
