<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Datasets on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E6%95%B0%E6%8D%AE%E9%9B%86/</link>
    <description>Recent content in Datasets on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E6%95%B0%E6%8D%AE%E9%9B%86/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>3D Mesh Grid Room Impulse Responses Measured with A Linear Microphone Array And Suppression of Frame Reflections</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-3d-mesh-grid-room-impulse-responses-measured-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-3d-mesh-grid-room-impulse-responses-measured-with/</guid>
      <description>Spatial Audio | 8.3/10</description>
    </item>
    <item>
      <title>A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-dataset-of-robot-patient-and-doctor-patient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-dataset-of-robot-patient-and-doctor-patient/</guid>
      <description>Spoken Dialogue Systems | 7.5/10</description>
    </item>
    <item>
      <title>A New Method and Dataset for Classroom Teaching Stage Segmentation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-new-method-and-dataset-for-classroom-teaching/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-new-method-and-dataset-for-classroom-teaching/</guid>
      <description>Classroom Stage Segmentation | 6.5/10</description>
    </item>
    <item>
      <title>A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-study-of-data-selection-strategies-for-pre/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-study-of-data-selection-strategies-for-pre/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>ACAVCaps: Enabling Large-Scale Training for Fine-Grained and Diverse Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acavcaps-enabling-large-scale-training-for-fine/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acavcaps-enabling-large-scale-training-for-fine/</guid>
      <description>Audio Classification | 8.5/10</description>
    </item>
    <item>
      <title>AI-Generated Music Detection in Broadcast Monitoring</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ai-generated-music-detection-in-broadcast/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ai-generated-music-detection-in-broadcast/</guid>
      <description>Audio Deepfake Detection | 7.0/10</description>
    </item>
    <item>
      <title>AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aishell6-whisper-a-chinese-mandarin-audio-visual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aishell6-whisper-a-chinese-mandarin-audio-visual/</guid>
      <description>Speech Recognition | 8.3/10</description>
    </item>
    <item>
      <title>Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aligning-language-models-for-lyric-to-melody/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aligning-language-models-for-lyric-to-melody/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>AMBISONIC-DML: A Benchmark Dataset for Dynamic Higher-Order Ambisonics Music with Motion-Aligned Stems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ambisonic-dml-a-benchmark-dataset-for-dynamic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ambisonic-dml-a-benchmark-dataset-for-dynamic/</guid>
      <description>Datasets | 7.5/10</description>
    </item>
    <item>
      <title>AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-animalclap-taxonomy-aware-language-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-animalclap-taxonomy-aware-language-audio/</guid>
      <description>Audio Classification | 8.0/10</description>
    </item>
    <item>
      <title>Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-text System</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attention2probability-attention-driven/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attention2probability-attention-driven/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Audio-Visual Deepfake Generation and Detection: An Exploratory Survey</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-visual-deepfake-generation-and-detection-an/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-visual-deepfake-generation-and-detection-an/</guid>
      <description>Audio Deepfake Detection | 6.5/10</description>
    </item>
    <item>
      <title>AUDIOCARDS: Structured Metadata Improves Audio Language Models for Sound Design</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiocards-structured-metadata-improves-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiocards-structured-metadata-improves-audio/</guid>
      <description>Audio Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>AVO-65: A Large-Scale Hierarchical Audio-Visual Object Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-avo-65-a-large-scale-hierarchical-audio-visual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-avo-65-a-large-scale-hierarchical-audio-visual/</guid>
      <description>Audio-Visual | 7.0/10</description>
    </item>
    <item>
      <title>BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on POP and Classical Music</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bachi-boundary-aware-symbolic-chord-recognition/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bachi-boundary-aware-symbolic-chord-recognition/</guid>
      <description>Music Information Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-face-swapping-a-diffusion-based-digital/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-face-swapping-a-diffusion-based-digital/</guid>
      <description>Audio Deepfake Detection | 8.1/10</description>
    </item>
    <item>
      <title>Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-global-emotion-fine-grained-emotional/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-global-emotion-fine-grained-emotional/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>BioSEN: A Bio-Acoustic Signal Enhancement Network for Animal Vocalizations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-biosen-a-bio-acoustic-signal-enhancement-network/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-biosen-a-bio-acoustic-signal-enhancement-network/</guid>
      <description>Bioacoustics | 7.5/10</description>
    </item>
    <item>
      <title>Bleed No More: Generative Interference Reduction for Musical Recordings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bleed-no-more-generative-interference-reduction/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bleed-no-more-generative-interference-reduction/</guid>
      <description>Music Source Separation | 7.0/10</description>
    </item>
    <item>
      <title>CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-castella-long-audio-dataset-with-captions-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-castella-long-audio-dataset-with-captions-and/</guid>
      <description>Audio Retrieval | 8.5/10</description>
    </item>
    <item>
      <title>Clue2Emo: A Brain-Inspired Framework for Open-Vocabulary Multimodal Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-clue2emo-a-brain-inspired-framework-for-open/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-clue2emo-a-brain-inspired-framework-for-open/</guid>
      <description>Speech Emotion Recognition | 8.5/10</description>
    </item>
    <item>
      <title>CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-Spoofing Countermeasures</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-compspoof-a-dataset-and-joint-learning-framework/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-compspoof-a-dataset-and-joint-learning-framework/</guid>
      <description>Audio Deepfake Detection | 7.0/10</description>
    </item>
    <item>
      <title>Confidence-Based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-based-filtering-for-speech-dataset/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-based-filtering-for-speech-dataset/</guid>
      <description>Speech Enhancement | 6.5/10</description>
    </item>
    <item>
      <title>Content Leakage in Librispeech and its Impact on the Privacy Evaluation of Speaker Anonymization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-leakage-in-librispeech-and-its-impact-on/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-leakage-in-librispeech-and-its-impact-on/</guid>
      <description>Speaker Anonymization | 7.5/10</description>
    </item>
    <item>
      <title>CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cova-text-guided-composed-video-retrieval-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cova-text-guided-composed-video-retrieval-for/</guid>
      <description>Cross-Modal Retrieval | 6.5/10</description>
    </item>
    <item>
      <title>Cross-Lingual Interleaving for Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-interleaving-for-speech-language/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-interleaving-for-speech-language/</guid>
      <description>Large Speech Models | 7.5/10</description>
    </item>
    <item>
      <title>Denoising Of Stochastic Ray Tracing Room Impulse Responses</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-denoising-of-stochastic-ray-tracing-room-impulse/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-denoising-of-stochastic-ray-tracing-room-impulse/</guid>
      <description>Spatial Audio | 7.5/10</description>
    </item>
    <item>
      <title>Detecting and Attributing Synthetic Spanish Speech: The HISPASpoof Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-detecting-and-attributing-synthetic-spanish/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-detecting-and-attributing-synthetic-spanish/</guid>
      <description>Speech Spoofing Detection | 7.5/10</description>
    </item>
    <item>
      <title>Do Bias Benchmarks Generalise? Evidence from Voice-Based Evaluation of Gender Bias in Speechllms</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-bias-benchmarks-generalise-evidence-from-voice/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-bias-benchmarks-generalise-evidence-from-voice/</guid>
      <description>Model Evaluation | 8.0/10</description>
    </item>
    <item>
      <title>Do You Hear What I Mean? Quantifying the Instruction-Perception GAP in Instruction-Guided Expressive Text-to-Speech Systems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-you-hear-what-i-mean-quantifying-the/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-you-hear-what-i-mean-quantifying-the/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-easy-turn-integrating-acoustic-and-linguistic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-easy-turn-integrating-acoustic-and-linguistic/</guid>
      <description>Spoken Dialogue Systems | 7.0/10</description>
    </item>
    <item>
      <title>EchoFake: A Replay-Aware Dataset For Practical Speech Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-echofake-a-replay-aware-dataset-for-practical/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-echofake-a-replay-aware-dataset-for-practical/</guid>
      <description>Audio Deepfake Detection | 8.5/10</description>
    </item>
    <item>
      <title>EEG and Eye-Tracking Driven Dynamic Target Speaker Extraction with Spontaneous Attention Switching</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-eeg-and-eye-tracking-driven-dynamic-target/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-eeg-and-eye-tracking-driven-dynamic-target/</guid>
      <description>Speech Separation | 7.0/10</description>
    </item>
    <item>
      <title>Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emilia-nv-a-non-verbal-speech-dataset-with-word/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emilia-nv-a-non-verbal-speech-dataset-with-word/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Enabling Multi-Species Bird Classification on Low-Power Bioacoustic Loggers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enabling-multi-species-bird-classification-on-low/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enabling-multi-species-bird-classification-on-low/</guid>
      <description>Bioacoustics | 8.0/10</description>
    </item>
    <item>
      <title>Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-bias-in-spoken-dialogue-llms-for-real/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-bias-in-spoken-dialogue-llms-for-real/</guid>
      <description>Model Evaluation | 7.0/10</description>
    </item>
    <item>
      <title>Evaluating Compositional Structure in Audio Representations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-compositional-structure-in-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-compositional-structure-in-audio/</guid>
      <description>Model Evaluation | 7.0/10</description>
    </item>
    <item>
      <title>Evaluating Disentangled Representations for Controllable Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-disentangled-representations-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-disentangled-representations-for/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-emotion-recognition-in-spoken-language/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-emotion-recognition-in-spoken-language/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Evaluating High-Resolution Piano Sustain Pedal Depth Estimation with Musically Informed Metrics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-high-resolution-piano-sustain-pedal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-high-resolution-piano-sustain-pedal/</guid>
      <description>Music Information Retrieval | 8.0/10</description>
    </item>
    <item>
      <title>Evaluating Pretrained Speech Embedding Systems for Dysarthria Detection Across Heterogenous Datasets</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-pretrained-speech-embedding-systems/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-evaluating-pretrained-speech-embedding-systems/</guid>
      <description>Speech Biomarkers | 7.5/10</description>
    </item>
    <item>
      <title>Generalizability of Predictive and Generative Speech Enhancement Models to Pathological Speakers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generalizability-of-predictive-and-generative/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generalizability-of-predictive-and-generative/</guid>
      <description>Speech Enhancement | 7.0/10</description>
    </item>
    <item>
      <title>Generative Audio Extension and Morphing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generative-audio-extension-and-morphing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generative-audio-extension-and-morphing/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Hair Noise Analysis and Mitigation for Smart Glasses Audio Captures</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hair-noise-analysis-and-mitigation-for-smart/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hair-noise-analysis-and-mitigation-for-smart/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>HiFi-HARP: A High-Fidelity 7th-Order Ambisonic Room Impulse Response Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hifi-harp-a-high-fidelity-7th-order-ambisonic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hifi-harp-a-high-fidelity-7th-order-ambisonic/</guid>
      <description>Datasets | 7.5/10</description>
    </item>
    <item>
      <title>High-Fidelity Speech Enhancement Via Discrete Audio Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-high-fidelity-speech-enhancement-via-discrete/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-high-fidelity-speech-enhancement-via-discrete/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-how-to-label-resynthesized-audio-the-dual-role-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-how-to-label-resynthesized-audio-the-dual-role-of/</guid>
      <description>Audio Deepfake Detection | 7.5/10</description>
    </item>
    <item>
      <title>Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-human-1-by-josh-talks-a-full-duplex/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-human-1-by-josh-talks-a-full-duplex/</guid>
      <description>Spoken Dialogue Systems | 7.5/10</description>
    </item>
    <item>
      <title>ICASSP 2026 - Dataset Paper List</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-030/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-030/</guid>
      <description>A total of 3 ICASSP 2026 papers in the Datasets category</description>
    </item>
    <item>
      <title>Interpretable Music Harmonic Analysis Through Multilinear Mixture of Experts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-interpretable-music-harmonic-analysis-through/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-interpretable-music-harmonic-analysis-through/</guid>
      <description>Music Understanding | 7.5/10</description>
    </item>
    <item>
      <title>Leveraging Large Speech Language Models as Evaluators for Expressive Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-large-speech-language-models-as/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-large-speech-language-models-as/</guid>
      <description>Speech Emotion Recognition | 6.5/10</description>
    </item>
    <item>
      <title>LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-longspeech-a-scalable-benchmark-for-transcription/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-longspeech-a-scalable-benchmark-for-transcription/</guid>
      <description>Benchmarking | 7.8/10</description>
    </item>
    <item>
      <title>LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Conversational ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lotusdis-a-thai-far-field-meeting-corpus-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lotusdis-a-thai-far-field-meeting-corpus-for/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Marco-Voice: A Unified Framework for Expressive Speech Synthesis with Voice Cloning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-marco-voice-a-unified-framework-for-expressive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-marco-voice-a-unified-framework-for-expressive/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>MCF: Text LLMS for Multimodal Emotional Causality</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mcf-text-llms-for-multimodal-emotional-causality/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mcf-text-llms-for-multimodal-emotional-causality/</guid>
      <description>Emotion Analysis | 8.0/10</description>
    </item>
    <item>
      <title>MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mnv-17-a-high-quality-performative-mandarin/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mnv-17-a-high-quality-performative-mandarin/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Multimodal LLMs as Expert Speech Annotators: Acoustic Macro-Descriptors for Parkinson&#39;s Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-llms-as-expert-speech-annotators/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-llms-as-expert-speech-annotators/</guid>
      <description>Speech Biomarkers | 6.5/10</description>
    </item>
    <item>
      <title>Multimodal Transformer with Multiperspective Training for Predicting Self-Expression Skills from Video Interview</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-transformer-with-multiperspective/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-transformer-with-multiperspective/</guid>
      <description>Multimodal Models | 7.0/10</description>
    </item>
    <item>
      <title>MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musetok-symbolic-music-tokenization-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musetok-symbolic-music-tokenization-for/</guid>
      <description>Music Generation | 8.5/10</description>
    </item>
    <item>
      <title>No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-no-verifiable-reward-for-prosody-toward/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-no-verifiable-reward-for-prosody-toward/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>OV-INSTRUCTTTS: Towards Open-Vocabulary Instruct Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ov-instructtts-towards-open-vocabulary-instruct/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ov-instructtts-towards-open-vocabulary-instruct/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>Perceptual Quality Assessment for Stylized Talking Heads</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-perceptual-quality-assessment-for-stylized/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-perceptual-quality-assessment-for-stylized/</guid>
      <description>Model Evaluation | 7.5/10</description>
    </item>
    <item>
      <title>Pianoroll-Event: A Novel Score Representation for Symbolic Music</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pianoroll-event-a-novel-score-representation-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pianoroll-event-a-novel-score-representation-for/</guid>
      <description>Music Generation | 6.5/10</description>
    </item>
    <item>
      <title>Random Matrix-Driven Graph Representation Learning For Bioacoustic Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-random-matrix-driven-graph-representation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-random-matrix-driven-graph-representation/</guid>
      <description>Bioacoustics | 7.5/10</description>
    </item>
    <item>
      <title>RAS: a Reliability Oriented Metric for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ras-a-reliability-oriented-metric-for-automatic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ras-a-reliability-oriented-metric-for-automatic/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Reliable AI via Age-Balanced Validation: Fair Model Selection for Parkinson’s Detection from Voice</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reliable-ai-via-age-balanced-validation-fair/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reliable-ai-via-age-balanced-validation-fair/</guid>
      <description>Speech Biomarkers | 7.5/10</description>
    </item>
    <item>
      <title>Representation-Based Data Quality Audits for Audio</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-representation-based-data-quality-audits-for-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-representation-based-data-quality-audits-for-audio/</guid>
      <description>Datasets | 7.5/10</description>
    </item>
    <item>
      <title>Rethinking Entity Disambiguation in Complex Modalities</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rethinking-entity-disambiguation-in-complex/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rethinking-entity-disambiguation-in-complex/</guid>
      <description>Entity Disambiguation | 8.0/10</description>
    </item>
    <item>
      <title>Rethinking Music Captioning with Music Metadata LLMS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rethinking-music-captioning-with-music-metadata/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rethinking-music-captioning-with-music-metadata/</guid>
      <description>Music Understanding | 7.0/10</description>
    </item>
    <item>
      <title>RFM-Editing: Rectified Flow Matching for Text-Guided Audio Editing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rfm-editing-rectified-flow-matching-for-text/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rfm-editing-rectified-flow-matching-for-text/</guid>
      <description>Audio Editing | 7.5/10</description>
    </item>
    <item>
      <title>RHO-PERFECT: Correlation Ceiling for Subjective Evaluation Datasets</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rho-perfect-correlation-ceiling-for-subjective/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rho-perfect-correlation-ceiling-for-subjective/</guid>
      <description>Model Evaluation | 7.5/10</description>
    </item>
    <item>
      <title>S2Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s2voice-style-aware-autoregressive-modeling-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s2voice-style-aware-autoregressive-modeling-with/</guid>
      <description>Singing Voice Conversion | 7.0/10</description>
    </item>
    <item>
      <title>SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/</guid>
      <description>Brain-Computer Interface | 8.0/10</description>
    </item>
    <item>
      <title>Scalable Evaluation for Audio Identification Via Synthetic Latent Fingerprint Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scalable-evaluation-for-audio-identification-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scalable-evaluation-for-audio-identification-via/</guid>
      <description>Audio Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Sing What You Fit: A Perception-Based Dataset and Benchmark for Vocal-Song Suitability Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sing-what-you-fit-a-perception-based-dataset-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sing-what-you-fit-a-perception-based-dataset-and/</guid>
      <description>Music Information Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>SingMOS-Pro: A Comprehensive Benchmark For Singing Quality Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-singmos-pro-an-comprehensive-benchmark-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-singmos-pro-an-comprehensive-benchmark-for/</guid>
      <description>Singing Voice Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sp-mcqa-evaluating-intelligibility-of-tts-beyond/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sp-mcqa-evaluating-intelligibility-of-tts-beyond/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speechct-clip-distilling-text-image-knowledge-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speechct-clip-distilling-text-image-knowledge-to/</guid>
      <description>Medical AI | 7.5/10</description>
    </item>
    <item>
      <title>Spring Reverb Emulation with Hybrid Gated Convolutional Networks and State Space Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spring-reverb-emulation-with-hybrid-gated/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spring-reverb-emulation-with-hybrid-gated/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Still Thinking or Stopped Talking? Dialogue Silence Intention Classification Using Multimodal Large Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-still-thinking-or-stopped-talking-dialogue/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-still-thinking-or-stopped-talking-dialogue/</guid>
      <description>Spoken Dialogue Systems | 6.5/10</description>
    </item>
    <item>
      <title>StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streammark-a-deep-learning-based-semi-fragile/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streammark-a-deep-learning-based-semi-fragile/</guid>
      <description>Audio Deepfake Detection | 8.0/10</description>
    </item>
    <item>
      <title>Symphony Rendering: Midi and Composer-Conditioned Auto Orchestration with Flow-Matching Transformers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphony-rendering-midi-and-composer-conditioned/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphony-rendering-midi-and-composer-conditioned/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphonygen-3d-hierarchical-orchestral-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphonygen-3d-hierarchical-orchestral-generation/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synparaspeech-automated-synthesis-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synparaspeech-automated-synthesis-of/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>TAGARELA - A Portuguese Speech Dataset from Podcasts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</guid>
      <description>Speech Recognition, Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tau-a-benchmark-for-cultural-sound-understanding/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tau-a-benchmark-for-cultural-sound-understanding/</guid>
      <description>Audio Question Answering | 7.5/10</description>
    </item>
    <item>
      <title>Text2Move: Text-To-Moving Sound Generation via Trajectory Prediction and Temporal Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-text2move-text-to-moving-sound-generation-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-text2move-text-to-moving-sound-generation-via/</guid>
      <description>Spatial Audio | 8.0/10</description>
    </item>
    <item>
      <title>The 3rd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing aid Speech Intelligibility Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-3rd-clarity-prediction-challenge-a-machine/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-3rd-clarity-prediction-challenge-a-machine/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion to Singing Style Conversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-singing-voice-conversion-challenge-2025-from/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-singing-voice-conversion-challenge-2025-from/</guid>
      <description>Singing Voice Conversion | 7.0/10</description>
    </item>
    <item>
      <title>The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-time-vs-layer-locating-predictive-cues-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-time-vs-layer-locating-predictive-cues-for/</guid>
      <description>Speech Quality Assessment | 7.5/10</description>
    </item>
    <item>
      <title>TinyMU: A Compact Audio-Language Model for Music Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tinymu-a-compact-audio-language-model-for-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tinymu-a-compact-audio-language-model-for-music/</guid>
      <description>Music Understanding | 7.5/10</description>
    </item>
    <item>
      <title>TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Framework for Ü-Tsang, Amdo and Kham Speech Dataset Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tmd-tts-a-unified-tibetan-multi-dialect-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tmd-tts-a-unified-tibetan-multi-dialect-text-to/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-robust-dysarthric-speech-recognition-llm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-robust-dysarthric-speech-recognition-llm/</guid>
      <description>Speech Recognition | 9.0/10</description>
    </item>
    <item>
      <title>Training Dynamics-Aware Multi-Factor Curriculum Learning for Target Speaker Extraction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-dynamics-aware-multi-factor-curriculum/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-dynamics-aware-multi-factor-curriculum/</guid>
      <description>Speech Separation | 7.0/10</description>
    </item>
    <item>
      <title>Training Flow Matching Models with Reliable Labels via Self-Purification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-flow-matching-models-with-reliable/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-flow-matching-models-with-reliable/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Unsupervised Discovery and Analysis of the Vocal Repertoires and Patterns of Select Corvid Species</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-unsupervised-discovery-and-analysis-of-the-vocal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-unsupervised-discovery-and-analysis-of-the-vocal/</guid>
      <description>Bioacoustics | 7.5/10</description>
    </item>
    <item>
      <title>UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uti-llm-a-personalized-articulatory-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uti-llm-a-personalized-articulatory-speech/</guid>
      <description>Spoken Dialogue Systems | 7.5/10</description>
    </item>
    <item>
      <title>Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-visual-keys-to-symphonies-latent-diffusion-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-visual-keys-to-symphonies-latent-diffusion-for/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vitex-visual-texture-control-for-multi-track/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vitex-visual-texture-control-for-multi-track/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wav2lev-predicting-levenshtein-edit-operation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wav2lev-predicting-levenshtein-edit-operation/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-listening-with-time-precise-temporal-awareness/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-listening-with-time-precise-temporal-awareness/</guid>
      <description>Audio Scene Understanding | 8.0/10</description>
    </item>
    <item>
      <title>RTCFake: Speech Deepfake Detection in Real-Time Communication</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-rtcfake-speech-deepfake-detection-in-real-time/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-rtcfake-speech-deepfake-detection-in-real-time/</guid>
      <description>Speech Spoofing Detection | 7.0/10</description>
    </item>
    <item>
      <title>TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-tts-prism-a-perceptual-reasoning-and/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-tts-prism-a-perceptual-reasoning-and/</guid>
      <description>Speech Synthesis Evaluation | 7.0/10</description>
    </item>
    <item>
      <title>Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-listening-with-time-precise-temporal-awareness/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-listening-with-time-precise-temporal-awareness/</guid>
      <description>Audio Scene Understanding | 8.0/10</description>
    </item>
    <item>
      <title>Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven&#39;s Piano and Cello Sonatas, 1930--2012</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-spectrographic-portamento-gradient-analysis-a/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-spectrographic-portamento-gradient-analysis-a/</guid>
      <description>Music Information Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-audita-a-new-dataset-to-audit-humans-vs-ai-skill/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-audita-a-new-dataset-to-audit-humans-vs-ai-skill/</guid>
      <description>Audio Question Answering | 6.5/10</description>
    </item>
    <item>
      <title>Beyond Rules: Towards Basso Continuo Personal Style Identification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-beyond-rules-towards-basso-continuo-personal/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-beyond-rules-towards-basso-continuo-personal/</guid>
      <description>Music Understanding | 7.0/10</description>
    </item>
    <item>
      <title>Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-full-duplex-interaction-in-spoken-dialogue/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-full-duplex-interaction-in-spoken-dialogue/</guid>
      <description>Spoken Dialogue Systems | 6.5/10</description>
    </item>
    <item>
      <title>Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-time-vs-layer-locating-predictive-cues-for/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-time-vs-layer-locating-predictive-cues-for/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-aligning-stuttered-speech-research-with-end-user/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-aligning-stuttered-speech-research-with-end-user/</guid>
      <description>1.  **Problem**: Current research on stuttered-speech technology is systematically disconnected from the actual needs of people who stutter (PWS) and speech-language pathologists (SLPs); research priorities, task definitions, and evaluation methods are not sufficiently user-centered. 2.  **Core method**: A two-part combined analysis: 1) a scoping review of 228 relevant papers, proposing a taxonomy of research tasks and analyzing the state of the field; 2) a survey of 70 stakeholders</description>
    </item>
    <item>
      <title>Centering Ecological Goals in Automated Identification of Individual Animals</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-centering-ecological-goals-in-automated/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-centering-ecological-goals-in-automated/</guid>
      <description>This paper addresses a key question: why have the high-accuracy algorithms reported in recent years for automated identification of individual animals (from images or sound) rarely been translated into routine tools in ecological practice? Its core contribution is an evaluation and deployment framework that centers ecological goals, emphasizing that the usefulness of automated identification depends on the specific ecological question it serves, the available data, and the practical consequences of different error types. Unlike prior work that focuses mainly on algorithmic accuracy</description>
    </item>
    <item>
      <title>MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-move-translating-laughter-and-tears-via-mixture/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-move-translating-laughter-and-tears-via-mixture/</guid>
      <description>This paper addresses the problem that speech-to-speech translation (S2ST) systems commonly lose the non-verbal vocalizations (e.g., laughter, crying) and emotional information in the source speech, severely harming the naturalness and accuracy of cross-lingual communication. The authors make three core contributions: first, a scalable automated data-synthesis pipeline that generates a large-scale, high-quality English-Chinese expressive S2ST parallel corpus, overcoming the bottleneck of scarce training data…</description>
    </item>
    <item>
      <title>SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-sand-the-challenge-on-speech-analysis-for/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-sand-the-challenge-on-speech-analysis-for/</guid>
      <description>1. **Problem addressed**: For the early diagnosis and monitoring of neurodegenerative diseases (particularly amyotrophic lateral sclerosis, ALS), there is a lack of large-scale, clinically annotated speech datasets and of standardized frameworks for evaluating algorithms. 2. **Core method**: The SAND challenge is built and released; at its core is an extended, longitudinal speech dataset of ALS patients and healthy controls (VOC-A…</description>
    </item>
    <item>
      <title>Tadabur: A Large-Scale Quran Audio Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-tadabur-a-large-scale-quran-audio-dataset/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-tadabur-a-large-scale-quran-audio-dataset/</guid>
      <description>1. **Problem**: Existing Quranic speech datasets fall seriously short in scale, reciter diversity, audio quality, and annotation depth, limiting research progress on tasks such as Quranic ASR and reciter identification. 2. **Core method**: The Tadabur dataset and its construction pipeline. At the pipeline's core is an "Ayah Alignment Module" (AAM) that combines WhisperX for initial transcription and then uses…</description>
    </item>
    <item>
      <title>BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-beat-tokenizing-and-generating-symbolic-music-by/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-beat-tokenizing-and-generating-symbolic-music-by/</guid>
      <description>For symbolic music generation, mainstream event-based tokenization handles timing only implicitly, forcing models to additionally learn the temporal grid. This paper proposes **BEAT**, a new grid-based tokenization framework whose core idea is to discretize music uniformly in time with the "beat" as the basic unit; within each beat, each pitch…</description>
    </item>
    <item>
      <title>Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-deep-supervised-contrastive-learning-of-pitch/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-deep-supervised-contrastive-learning-of-pitch/</guid>
      <description>This paper tackles the difficult problem of mapping continuously varying fundamental-frequency (F0) contours to the discrete, invariant pitch-accent categories of Seoul Korean (e.g., LHLH, HHLH); traditional methods are easily affected by F0 measurement noise and speaker variation. The authors propose **Dual-Glob**, a deep supervised contrastive learning framework whose core is a **dual-branch encoder (clean view and augmented view)** that, in a shared…</description>
    </item>
    <item>
      <title>Tadabur: A Large-Scale Quran Audio Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-tadabur-a-large-scale-quran-audio-dataset/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-tadabur-a-large-scale-quran-audio-dataset/</guid>
      <description>This paper addresses the lack of large-scale, diverse, finely annotated datasets for Quranic speech research. The authors introduce the **Tadabur** dataset and its automated construction pipeline, which first collects audio from public platforms and uses a large language model (Gemini) to extract standardized metadata (e.g., surah, reciter) from unstructured text. The core step is the **Ayah Alignment…</description>
    </item>
    <item>
      <title>A novel LSTM music generator based on the fractional time-frequency feature extraction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-a-novel-lstm-music-generator-based-on-the/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-a-novel-lstm-music-generator-based-on-the/</guid>
      <description>This paper proposes a novel AI music-generation system based on the fractional Fourier transform (FrFT) and long short-term memory networks (LSTM). The **core goal** is to use the FrFT to extract, in the fractional-order domain (a rotated representation of the time-frequency plane), richer features of the music signal than conventional time- or frequency-domain analysis provides, compensating for plain LSTMs' weakness in capturing music's complex time-frequency structure. The **key method** applies an FrFT transform to the input music signal…</description>
    </item>
    <item>
      <title>Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-cogito-towards-deep-audio-reasoning-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-cogito-towards-deep-audio-reasoning-in/</guid>
      <description>This paper addresses the limited capability and opaque reasoning of large audio language models (LALMs) on complex audio-reasoning tasks. The **core contribution** is **Audio-Cogito**, a fully open-source solution centered on **Cogito-Pipe**, a four-stage automated data-construction pipeline for generating high-quality, diverse audio chain-of-thought (CoT) data…</description>
    </item>
    <item>
      <title>AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-avrt-audio-visual-reasoning-transfer-through/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-avrt-audio-visual-reasoning-transfer-through/</guid>
      <description>This paper tackles the core challenge that multimodal large models lack high-quality training data for joint audio-visual reasoning. The **core contribution** is the AVRT framework, which synthesizes multimodal reasoning data by composing the capabilities of single-modality expert models. The **key method** has two steps: 1) **data generation**: using a dedicated visual teacher (Kimi-VL-Thinking) and an audio teacher (Audio Flami…</description>
    </item>
    <item>
      <title>BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-bhashasutra-a-task-centric-unified-survey-of/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-bhashasutra-a-task-centric-unified-survey-of/</guid>
      <description>This paper addresses the pain point that research resources for Indian-language NLP are scattered and lack a unified overview. The authors present the first task-centric unified taxonomy, systematically organizing and consolidating more than 200 datasets, 50 benchmarks, and over 100 models, tools, and systems, covering everything from core language processing (e.g., tokenization, POS tagging) to text classification, generation and translation, information retrieval, speech and multimodality, and even socio-cultural tasks (…</description>
    </item>
    <item>
      <title>Coexisting Tempo Traditions in Beethoven&#39;s Piano and Cello Sonatas: A K-means Clustering Analysis of Recorded Performances, 1930-2012</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-coexisting-tempo-traditions-in-beethovens-piano/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-coexisting-tempo-traditions-in-beethovens-piano/</guid>
      <description>This paper challenges the single regression model commonly used in empirical studies of music performance, which tends to portray historical tempo change as a unidirectional, uniform process; the authors argue that such a model obscures the coexistence of multiple performance traditions. The study takes bar-by-bar tempo data from more than one hundred recorded movements (1930-2012) of five Beethoven sonatas for piano and cello (Op. 5, 69, 102) and applies K-mean…</description>
    </item>
    <item>
      <title>Latent Fourier Transform</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</guid>
      <description>This paper addresses the difficulty existing music-generation models have in precisely controlling musical patterns at **arbitrary time scales**. The authors propose the **Latent Fourier Transform (LatentFT)** framework, whose core is to apply the discrete Fourier transform to the **sequence of latent vectors** produced by a diffusion autoencoder, yielding a "latent spectrum." Randomly masking frequencies of this latent spectrum during training forces the decoder…</description>
    </item>
    <item>
      <title>Neural Encoding Detection is Not All You Need for Synthetic Speech Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-neural-encoding-detection-is-not-all-you-need-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-neural-encoding-detection-is-not-all-you-need-for/</guid>
      <description>The core contribution of this survey is to **expose and substantiate a key misconception in current synthetic speech detection: over-reliance on "neural encoding detection."** The paper first systematically reviews three families of data-driven methods, based on SincNet, self-supervised learning (SSL), and neural encoding detection, and argues that today's best-performing SSL models mainly capture artifacts introduced by the vocoder at the waveform-generation stage, rather than…</description>
    </item>
    <item>
      <title>AST: Adaptive, Seamless, and Training-Free Precise Speech Editing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ast-adaptive-seamless-and-training-free-precise/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ast-adaptive-seamless-and-training-free-precise/</guid>
      <description>Addressing existing speech-editing methods' reliance on task-specific training and their poor temporal consistency in unedited regions, this paper proposes AST (Adaptive, Seamless, and Training-free), a precise speech-editing framework built on a pretrained TTS model in the AM-FM (autoregressive + flow matching) paradigm. AST first inverts the original speech into the latent space via an inverse Euler ODE solver…</description>
    </item>
    <item>
      <title>Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-beyond-monologue-interactive-talking-listening/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-beyond-monologue-interactive-talking-listening/</guid>
      <description>This paper addresses the core challenge of moving avatar generation from one-way "monologue" style to natural "full-duplex" interaction. The **core problem** is that existing methods either react rigidly because of strict frame alignment or break lip synchronization by introducing global attention. The **key method** is a unified attention architecture based on multi-head Gaussian kernels (MHGK): by assigning Gaussian receptive fields ranging from narrow to wide across attention heads, the mechanism allows…</description>
    </item>
    <item>
      <title>TinyMU: A Compact Audio-Language Model for Music Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-tinymu-a-compact-audio-language-model-for-music/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-tinymu-a-compact-audio-language-model-for-music/</guid>
      <description>Addressing the problems that existing large audio language models (LALMs) have billions of parameters, costly training and inference, and poor suitability for edge deployment, this paper proposes TinyMU, a compact music language model with only 229M parameters. The authors build the MusicSkills-3.5M dataset, containing 3.5 million music question-answering samples spanning multiple-choice, binary-judgment, and open-ended formats, combined with…</description>
    </item>
    <item>
      <title>VoxMind: An End-to-End Agentic Spoken Dialogue System</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-voxmind-an-end-to-end-agentic-spoken-dialogue/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-voxmind-an-end-to-end-agentic-spoken-dialogue/</guid>
      <description>End-to-end spoken dialogue models have advanced rapidly in natural interaction but generally lack the agent capabilities needed for complex tasks (tool calling, planning, reasoning). This paper first formalizes the four dimensions of an "end-to-end speech agent" (Profile, Memory, Planning, and Action Execution), filling the gap in theoretical standards for this field.</description>
    </item>
    <item>
      <title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</guid>
      <description>This paper addresses the problem that current evaluations of voice agents focus too heavily on reactive responses while overlooking their capacity for proactive interaction. The authors propose **ProVoice-Bench**, the first benchmark framework dedicated to evaluating proactive voice agents. The framework contains four novel tasks for…</description>
    </item>
    <item>
      <title>Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-geo2sound-a-scalable-geo-aligned-framework-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-geo2sound-a-scalable-geo-aligned-framework-for/</guid>
      <description>This paper introduces **Geo2Sound**, a new task and framework for generating geographically consistent and realistic soundscapes from satellite imagery. **The problem addressed** is that existing image-to-audio models face three major challenges when handling top-down satellite views: a lack of…</description>
    </item>
    <item>
      <title>Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listening-deepfake-detection-a-new-perspective/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listening-deepfake-detection-a-new-perspective/</guid>
      <description>This paper is the first to propose "listening deepfake detection," a new task aimed at identifying forged reactions of on-screen people while they are listening (rather than speaking), filling a gap left by existing work that concentrates on "speaking" scenarios. To address the scarcity of data for this task, the authors build the first dedicated dataset, Li…</description>
    </item>
    <item>
      <title>VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-voxeffects-a-speech-oriented-audio-effects/</guid>
      <description>This paper addresses a fundamental but overlooked problem in speech processing: how to systematically identify the post-processing effects, and their parameters, that a speech recording has undergone. In practice, speech is almost always processed with effects such as denoising and compression, yet existing datasets lack such precise annotations, hindering related research. To this end, the authors…</description>
    </item>
  </channel>
</rss>
