<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Pre-training on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E9%A2%84%E8%AE%AD%E7%BB%83/</link>
    <description>Recent content in Pre-training on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E9%A2%84%E8%AE%AD%E7%BB%83/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-study-of-data-selection-strategies-for-pre/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-study-of-data-selection-strategies-for-pre/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>A Task-Aware Dual-Level Self-Supervised Learning Method for Effective Sound Event Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-task-aware-dual-level-self-supervised-learning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-task-aware-dual-level-self-supervised-learning/</guid>
      <description>Sound Event Detection | 7.5/10</description>
    </item>
    <item>
      <title>ACAVCaps: Enabling Large-Scale Training for Fine-Grained and Diverse Audio Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acavcaps-enabling-large-scale-training-for-fine/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acavcaps-enabling-large-scale-training-for-fine/</guid>
      <description>Audio Classification | 8.5/10</description>
    </item>
    <item>
      <title>Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-llm-based-multi-channel-multi-speaker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-llm-based-multi-channel-multi-speaker/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Adversarial Fine-Tuning on Speech Foundation Model with Vulnerable Attention Consistency Regularization for Robust Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adversarial-fine-tuning-on-speech-foundation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adversarial-fine-tuning-on-speech-foundation/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-event-based-sequence-modeling-approach-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-event-based-sequence-modeling-approach-to/</guid>
      <description>Music Information Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>An Unsupervised Alignment Feature Fusion System for Spoken Language-Based Dementia Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-unsupervised-alignment-feature-fusion-system/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-unsupervised-alignment-feature-fusion-system/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-conditioned-diffusion-llms-for-asr-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-conditioned-diffusion-llms-for-asr-and/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Audio-Guided Multimodal Approach for Fine-Grained Alignment and Boundary Modeling in Active Speaker Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-guided-multimodal-approach-for-fine-grained/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-guided-multimodal-approach-for-fine-grained/</guid>
      <description>Speaker Detection | 7.5/10</description>
    </item>
    <item>
      <title>Benchmarking Music Autotagging with MGPHot Expert Annotations vs. Generic Tag Datasets</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-benchmarking-music-autotagging-with-mgphot-expert/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-benchmarking-music-autotagging-with-mgphot-expert/</guid>
      <description>Music Information Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>Bimodal Fusion Framework for Dynamic Facial Expression Recognition In-The-Wild</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bimodal-fusion-framework-for-dynamic-facial/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bimodal-fusion-framework-for-dynamic-facial/</guid>
      <description>Speech Emotion Recognition | 7.0/10</description>
    </item>
    <item>
      <title>BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-birq-bi-level-self-labeling-random-quantization/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-birq-bi-level-self-labeling-random-quantization/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Break-the-Beat! Controllable MIDI-to-Drum audio synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-break-the-beat-controllable-midi-to-drum-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-break-the-beat-controllable-midi-to-drum-audio/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Bridging the Semantic Gap: Cross-Attentive Fusion for Joint Acoustic-Semantic Speech Quality Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridging-the-semantic-gap-cross-attentive-fusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridging-the-semantic-gap-cross-attentive-fusion/</guid>
      <description>Speech Quality Assessment | 8.5/10</description>
    </item>
    <item>
      <title>CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-castella-long-audio-dataset-with-captions-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-castella-long-audio-dataset-with-captions-and/</guid>
      <description>Audio Retrieval | 8.5/10</description>
    </item>
    <item>
      <title>Combining SSL Speech Features, Contextual Transformers and Mamba Models for Realistic Audio Spoofing Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-ssl-speech-features-contextual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-ssl-speech-features-contextual/</guid>
      <description>Audio Deepfake Detection | 7.5/10</description>
    </item>
    <item>
      <title>Contrastive Timbre Representations for Musical Instrument And Synthesizer Retrieval</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-contrastive-timbre-representations-for-musical/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-contrastive-timbre-representations-for-musical/</guid>
      <description>Audio Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>Cross-Lingual Interleaving for Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-interleaving-for-speech-language/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-interleaving-for-speech-language/</guid>
      <description>Speech Language Models | 7.5/10</description>
    </item>
    <item>
      <title>DisContSE: Single-Step Diffusion Speech Enhancement based on Joint Discrete and Continuous Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discontse-single-step-diffusion-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discontse-single-step-diffusion-speech/</guid>
      <description>Speech Enhancement | 8.5/10</description>
    </item>
    <item>
      <title>Do Foundational Audio Encoders Understand Music Structure?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-foundational-audio-encoders-understand-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-foundational-audio-encoders-understand-music/</guid>
      <description>Music Information Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Does the Pre-Training of an Embedding Influence its Encoding of Age?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-does-the-pre-training-of-an-embedding-influence/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-does-the-pre-training-of-an-embedding-influence/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Domain Partitioning Meets Parameter-Efficient Fine-Tuning: A Novel Method for Improved Language-Queried Audio Source Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-domain-partitioning-meets-parameter-efficient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-domain-partitioning-meets-parameter-efficient/</guid>
      <description>Audio Source Separation | 7.5/10</description>
    </item>
    <item>
      <title>Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-easy-turn-integrating-acoustic-and-linguistic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-easy-turn-integrating-acoustic-and-linguistic/</guid>
      <description>Spoken Dialogue Systems | 7.0/10</description>
    </item>
    <item>
      <title>Efficient Audio-Visual Inference Via Token Clustering And Modality Fusion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-efficient-audio-visual-inference-via-token/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-efficient-audio-visual-inference-via-token/</guid>
      <description>Audio Question Answering | 7.5/10</description>
    </item>
    <item>
      <title>Efficient Depression Detection from Speech via Language-Independent Prompt-Driven Reprogramming</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-efficient-depression-detection-from-speech-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-efficient-depression-detection-from-speech-via/</guid>
      <description>Speech Biomarkers | 7.5/10</description>
    </item>
    <item>
      <title>Emotional Dimension Control in Language Model-Based Text-To-Speech: Spanning a Broad Spectrum of Human Emotions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotional-dimension-control-in-language-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotional-dimension-control-in-language-model/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation Guided Structured Pruning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-speaker-verification-with-w2v-bert-20/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-speaker-verification-with-w2v-bert-20/</guid>
      <description>Speaker Verification | 7.5/10</description>
    </item>
    <item>
      <title>Enhancing Speech Intelligibility Prediction for Hearing Aids with Complementary Speech Foundation Model Representations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-speech-intelligibility-prediction-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-speech-intelligibility-prediction-for/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Exploring How Audio Effects Alter Emotion with Foundation Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-exploring-how-audio-effects-alter-emotion-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-exploring-how-audio-effects-alter-emotion-with/</guid>
      <description>Music Understanding | 7.0/10</description>
    </item>
    <item>
      <title>FUSEMOS: Perceptual Evaluation of Text-to-Music Generation with Dual-Encoder Fusion and Ranking-Aware Composite Loss</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fusemos-perceptual-evaluation-of-text-to-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fusemos-perceptual-evaluation-of-text-to-music/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Gen-SER: When the Generative Model Meets Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gen-ser-when-the-generative-model-meets-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gen-ser-when-the-generative-model-meets-speech/</guid>
      <description>Speech Emotion Recognition | 6.5/10</description>
    </item>
    <item>
      <title>GLAP: General Contrastive Audio-Text Pretraining Across Domains and Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-glap-general-contrastive-audio-text-pretraining/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-glap-general-contrastive-audio-text-pretraining/</guid>
      <description>Audio Retrieval | 8.5/10</description>
    </item>
    <item>
      <title>GLUE: Gradient-free Learning to Unify Experts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-glue-gradient-free-learning-to-unify-experts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-glue-gradient-free-learning-to-unify-experts/</guid>
      <description>Transfer Learning | 6.5/10</description>
    </item>
    <item>
      <title>Graph-Biased EEG Transformers for Silent Speech Decoding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-graph-biased-eeg-transformers-for-silent-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-graph-biased-eeg-transformers-for-silent-speech/</guid>
      <description>Speech Biomarkers | 6.5/10</description>
    </item>
    <item>
      <title>Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hashing-baseline-rethinking-hashing-in-the-age-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hashing-baseline-rethinking-hashing-in-the-age-of/</guid>
      <description>Audio Retrieval / Audio Classification | 8.0/10</description>
    </item>
    <item>
      <title>Hierarchical Activity Recognition and Captioning from Long-Form Audio</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-activity-recognition-and-captioning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-activity-recognition-and-captioning/</guid>
      <description>Sound Event Detection | 7.5/10</description>
    </item>
    <item>
      <title>High-Fidelity Speech Enhancement Via Discrete Audio Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-high-fidelity-speech-enhancement-via-discrete/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-high-fidelity-speech-enhancement-via-discrete/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-Based Single-Channel Speech Enhancement</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-i-dccrn-vae-an-improved-deep-representation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-i-dccrn-vae-an-improved-deep-representation/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>ICASSP 2026 - Pre-training Paper List</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-138/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-138/</guid>
      <description>1 ICASSP 2026 paper in the Pre-training area</description>
    </item>
    <item>
      <title>Improving Anomalous Sound Detection with Attribute-Aware Representation from Domain-Adaptive Pre-Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-anomalous-sound-detection-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-anomalous-sound-detection-with/</guid>
      <description>Sound Event Detection | 8.0/10</description>
    </item>
    <item>
      <title>Leveraging Large Speech Language Models as Evaluators for Expressive Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-large-speech-language-models-as/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-large-speech-language-models-as/</guid>
      <description>Speech Emotion Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Leveraging Multiple Speech Enhancers for Non-Intrusive Intelligibility Prediction for Hearing-Impaired Listeners</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-multiple-speech-enhancers-for-non/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-multiple-speech-enhancers-for-non/</guid>
      <description>Model Evaluation | 7.5/10</description>
    </item>
    <item>
      <title>Leveraging Segment-Level Speech Representations for LLM-Based Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-segment-level-speech-representations/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-segment-level-speech-representations/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mispronunciation-detection-and-diagnosis-without/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mispronunciation-detection-and-diagnosis-without/</guid>
      <description>Speech Assessment | 8.0/10</description>
    </item>
    <item>
      <title>Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-attention-sinks-and-massive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-attention-sinks-and-massive/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Mixture-of-Experts Based Soft-Label Learning for Multi-Label Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixture-of-experts-based-soft-label-learning-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixture-of-experts-based-soft-label-learning-for/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmaudiosep-taming-video-to-audio-generative-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmaudiosep-taming-video-to-audio-generative-model/</guid>
      <description>Speech Separation | 8.0/10</description>
    </item>
    <item>
      <title>Modeling Inter-Segment Relationships in Speech for Dementia Detection with Audio Spectrogram Transformers and Graph Attention Networks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-modeling-inter-segment-relationships-in-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-modeling-inter-segment-relationships-in-speech/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-msf-ser-enriching-acoustic-modeling-with-multi/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-msf-ser-enriching-acoustic-modeling-with-multi/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-Token Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mtp-s2ut-enhancing-speech-to-speech-translation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mtp-s2ut-enhancing-speech-to-speech-translation/</guid>
      <description>Speech Translation | 8.5/10</description>
    </item>
    <item>
      <title>Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-channel-speech-enhancement-for-cocktail/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-channel-speech-enhancement-for-cocktail/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Multi-Task Learning For Speech Quality Assessment Using ASR-Derived Entropy Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-task-learning-for-speech-quality-assessment/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-task-learning-for-speech-quality-assessment/</guid>
      <description>Speech Quality Assessment | 7.5/10</description>
    </item>
    <item>
      <title>Multilingual Supervised Pretraining with Lm-Assisted Decoding for Visual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multilingual-supervised-pretraining-with-lm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multilingual-supervised-pretraining-with-lm/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Multimodal Transformer with Multiperspective Training for Predicting Self-Expression Skills from Video Interview</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-transformer-with-multiperspective/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-transformer-with-multiperspective/</guid>
      <description>Multimodal Models | 7.0/10</description>
    </item>
    <item>
      <title>MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musetok-symbolic-music-tokenization-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musetok-symbolic-music-tokenization-for/</guid>
      <description>Music Generation | 8.5/10</description>
    </item>
    <item>
      <title>Noise-Robust AV-ASR Using Visual Features both in the Whisper Encoder and Decoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-robust-av-asr-using-visual-features-both-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-robust-av-asr-using-visual-features-both-in/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>On deepfake voice detection - It’s all in the presentation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-on-deepfake-voice-detection-its-all-in-the/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-on-deepfake-voice-detection-its-all-in-the/</guid>
      <description>Audio Deepfake Detection | 8.0/10</description>
    </item>
    <item>
      <title>Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-online-register-for-dual-mode-self-supervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-online-register-for-dual-mode-self-supervised/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>PADAM: Perceptual Audio Defect Assessment Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-padam-perceptual-audio-defect-assessment-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-padam-perceptual-audio-defect-assessment-model/</guid>
      <description>Audio Classification | 7.0/10</description>
    </item>
    <item>
      <title>Probing the Hidden Talent of ASR foundation models for L2 English Oral Assessment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-probing-the-hidden-talent-of-asr-foundation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-probing-the-hidden-talent-of-asr-foundation/</guid>
      <description>Pre-training | 7.5/10</description>
    </item>
    <item>
      <title>Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for Voicemos 2024</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-quality-assessment-of-noisy-and-enhanced-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-quality-assessment-of-noisy-and-enhanced-speech/</guid>
      <description>Speech Quality Assessment | 7.0/10</description>
    </item>
    <item>
      <title>RASD-SR: A Robust Anomalous Sound Detection Framework with Score Recalibration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rasd-sr-a-robust-anomalous-sound-detection/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rasd-sr-a-robust-anomalous-sound-detection/</guid>
      <description>Anomalous Sound Detection | 8.5/10</description>
    </item>
    <item>
      <title>Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reading-between-the-waves-robust-topic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reading-between-the-waves-robust-topic/</guid>
      <description>Audio Classification | 7.0/10</description>
    </item>
    <item>
      <title>Reasoning Driven Captions to Assist Noise Robust Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reasoning-driven-captions-to-assist-noise-robust/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reasoning-driven-captions-to-assist-noise-robust/</guid>
      <description>Speech Emotion Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Recovering Performance in Speech Emotion Recognition from Discrete Tokens Via Multi-Layer Fusion and Paralinguistic Feature Integration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-recovering-performance-in-speech-emotion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-recovering-performance-in-speech-emotion/</guid>
      <description>Speech Emotion Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Reference-Aware SFM Layers for Intrusive Intelligibility Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reference-aware-sfm-layers-for-intrusive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reference-aware-sfm-layers-for-intrusive/</guid>
      <description>Speech Assessment | 7.5/10</description>
    </item>
    <item>
      <title>SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/</guid>
      <description>Brain-Computer Interface | 8.0/10</description>
    </item>
    <item>
      <title>SAUNA: Song-Level Audio &amp; User-Listening Data Neural Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sauna-song-level-audio-user-listening-data-neural/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sauna-song-level-audio-user-listening-data-neural/</guid>
      <description>Music Information Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scaling-multi-talker-asr-with-speaker-agnostic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scaling-multi-talker-asr-with-speaker-agnostic/</guid>
      <description>Speech Recognition | 8.5/10</description>
    </item>
    <item>
      <title>SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-se-dicow-self-enrolled-diarization-conditioned/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-se-dicow-self-enrolled-diarization-conditioned/</guid>
      <description>Speech Recognition | 8.5/10</description>
    </item>
    <item>
      <title>Shared Representation Learning for Reference-Guided Targeted Sound Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-shared-representation-learning-for-reference/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-shared-representation-learning-for-reference/</guid>
      <description>Sound Event Detection | 8.5/10</description>
    </item>
    <item>
      <title>SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-slap-scalable-language-audio-pretraining-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-slap-scalable-language-audio-pretraining-with/</guid>
      <description>Audio Retrieval | 8.0/10</description>
    </item>
    <item>
      <title>SmoothCLAP: Soft-Target Enhanced Contrastive Language-Audio Pretraining for Affective Computing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-smoothclap-soft-target-enhanced-contrastive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-smoothclap-soft-target-enhanced-contrastive/</guid>
      <description>Speech Emotion Recognition | 6.5/10</description>
    </item>
    <item>
      <title>SONAR: Self-Distilled Continual Pre-Training for Domain Adaptive Audio Representation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sonar-self-distilled-continual-pre-training-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sonar-self-distilled-continual-pre-training-for/</guid>
      <description>Sound Event Detection | 7.0/10</description>
    </item>
    <item>
      <title>SPAM: Style Prompt Adherence Metric for Prompt-Based TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spam-style-prompt-adherence-metric-for-prompt/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spam-style-prompt-adherence-metric-for-prompt/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speaking-clearly-a-simplified-whisper-based-codec/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speaking-clearly-a-simplified-whisper-based-codec/</guid>
      <description>Speech Coding | 7.5/10</description>
    </item>
    <item>
      <title>Speech Emotion Recognition based on Hierarchical Transformer with Shifted Windows</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speech-emotion-recognition-based-on-hierarchical/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speech-emotion-recognition-based-on-hierarchical/</guid>
      <description>Speech Emotion Recognition | 8.0/10</description>
    </item>
    <item>
      <title>SpeechMapper: Speech-To-Text Embedding Projector for LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speechmapper-speech-to-text-embedding-projector/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speechmapper-speech-to-text-embedding-projector/</guid>
      <description>Speech LLM | 7.0/10</description>
    </item>
    <item>
      <title>Syncspeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>TAGARELA - A Portuguese Speech Dataset from Podcasts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</guid>
      <description>Speech Recognition, Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>TASU: Text-only Alignment for Speech Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tasu-text-only-alignment-for-speech-understanding/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tasu-text-only-alignment-for-speech-understanding/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Test Time Adaptation for Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-test-time-adaptation-for-speech-emotion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-test-time-adaptation-for-speech-emotion/</guid>
      <description>Speech Emotion Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Text2Move: Text-To-Moving Sound Generation via Trajectory Prediction and Temporal Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-text2move-text-to-moving-sound-generation-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-text2move-text-to-moving-sound-generation-via/</guid>
      <description>Spatial Audio | 8.0/10</description>
    </item>
    <item>
      <title>The 3rd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing aid Speech Intelligibility Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-3rd-clarity-prediction-challenge-a-machine/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-3rd-clarity-prediction-challenge-a-machine/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>The Synergistic Role of Audio and Large Video-Language Model in Source-Free Video Domain Adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-synergistic-role-of-audio-and-large-video/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-synergistic-role-of-audio-and-large-video/</guid>
      <description>Domain Adaptation | 7.0/10</description>
    </item>
    <item>
      <title>Thinking While Listening: Simple Test Time Scaling for Audio Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-thinking-while-listening-simple-test-time-scaling/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-thinking-while-listening-simple-test-time-scaling/</guid>
      <description>Audio Classification | 6.5/10</description>
    </item>
    <item>
      <title>Timbre-Based Pretraining with Pseudo-Labels for Multi-Instrument Automatic Music Transcription</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-timbre-based-pretraining-with-pseudo-labels-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-timbre-based-pretraining-with-pseudo-labels-for/</guid>
      <description>Music Information Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Tldiffgan: A Latent Diffusion-Gan Framework with Temporal Information Fusion for Anomalous Sound Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tldiffgan-a-latent-diffusion-gan-framework-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tldiffgan-a-latent-diffusion-gan-framework-with/</guid>
      <description>Sound Event Detection | 7.5/10</description>
    </item>
    <item>
      <title>Tpeformer: Temporal Patch Embedding Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tpeformer-temporal-patch-embedding-transformer/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tpeformer-temporal-patch-embedding-transformer/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Training-Free Inference-Time Scaling for Audio Source Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-free-inference-time-scaling-for-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-free-inference-time-scaling-for-audio/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Tri-Attention Fusion: Joint Temporal-Spectral and Bidirectional Modeling for Speech Spoofing Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tri-attention-fusion-joint-temporal-spectral-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tri-attention-fusion-joint-temporal-spectral-and/</guid>
      <description>Speech Spoofing Detection | 7.0/10</description>
    </item>
    <item>
      <title>WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavesp-net-learnable-wavelet-domain-sparse-prompt/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavesp-net-learnable-wavelet-domain-sparse-prompt/</guid>
      <description>Speech Spoofing Detection | 8.0/10</description>
    </item>
    <item>
      <title>WavLink: Compact Audio–Text Embeddings with a Global Whisper Token</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavlink-compact-audiotext-embeddings-with-a/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavlink-compact-audiotext-embeddings-with-a/</guid>
      <description>Audio Retrieval | 8.0/10</description>
    </item>
    <item>
      <title>An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-an-event-based-sequence-modeling-approach-to/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-an-event-based-sequence-modeling-approach-to/</guid>
      <description>Music Understanding | 7.5/10</description>
    </item>
    <item>
      <title>Scaling Properties of Continuous Diffusion Spoken Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-scaling-properties-of-continuous-diffusion-spoken/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-scaling-properties-of-continuous-diffusion-spoken/</guid>
      <description>Speech Generation | 8.0/10</description>
    </item>
    <item>
      <title>DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-diarizen-explained-a-tutorial-for-the-open-source/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-diarizen-explained-a-tutorial-for-the-open-source/</guid>
      <description>Speaker Diarization | 6.5/10</description>
    </item>
    <item>
      <title>Misinformation Span Detection in Videos via Audio Transcripts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-misinformation-span-detection-in-videos-via-audio/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-misinformation-span-detection-in-videos-via-audio/</guid>
      <description>Audio Security | 7.5/10</description>
    </item>
    <item>
      <title>Environmental Sound Deepfake Detection Using Deep-Learning Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-environmental-sound-deepfake-detection-using-deep/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-environmental-sound-deepfake-detection-using-deep/</guid>
      <description>1.  **Problem**: Deepfake detection for environmental sounds (ESDD), covering both acoustic scenes and sound events, remains under-studied, and it is unclear whether detecting fakes of acoustic scenes and of sound events requires separate models. 2.  **Core method**: A deep-learning framework built around a pre-trained audio model (BEATs) as the feature extractor, combined with a three-stage training strategy (including …</description>
    </item>
    <item>
      <title>Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-indic-codecfake-meets-satyam-towards-detecting/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-indic-codecfake-meets-satyam-towards-detecting/</guid>
      <description>1.  **Problem**: Existing research on detecting neural-audio-codec speech deepfakes (CodecFake) focuses mainly on English and Chinese; the highly diverse Indic languages lack both large-scale benchmark datasets and effective detection methods. 2.  **Method**: The authors build the first large-scale Indic-language CodecFake dataset (ICF) and propose a method named SATYA…</description>
    </item>
    <item>
      <title>MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-move-translating-laughter-and-tears-via-mixture/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-move-translating-laughter-and-tears-via-mixture/</guid>
      <description>This paper addresses the problem that speech-to-speech translation (S2ST) systems commonly lose the non-verbal vocalizations (such as laughter and crying) and emotional information of the source speech, which severely hurts the naturalness and accuracy of cross-lingual communication. The authors make three core contributions: first, a scalable automated data-synthesis pipeline that produces a large-scale, high-quality expressive English-Chinese parallel S2ST corpus, overcoming the training-data scarcity bottleneck…</description>
    </item>
    <item>
      <title>Environmental Sound Deepfake Detection Using Deep-Learning Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-environmental-sound-deepfake-detection-using-deep/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-environmental-sound-deepfake-detection-using-deep/</guid>
      <description>For the emerging task of deepfake detection on environmental sounds (e.g., sound events and acoustic scenes), this paper proposes a systematic deep-learning framework. **Core contribution**: through extensive experiments, it systematically evaluates different spectrograms (Mel, CQT, Gammatone), multiple CNN architectures (ResNet, Inception, etc.), and a pre-trained model (BEATs) on this task, and verifies …</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</guid>
      <description>This technical report gives a comprehensive introduction to Qwen3.5-Omni, an omni-modal large language model that can uniformly understand and generate text, image, audio, and audiovisual content. **Problem addressed**: the limitations of existing models in real-time interaction, cross-modal reasoning, and autonomous agent behavior. **Approach**: a Thinker-Talker architecture with several key innovations: 1) both the Thinker and the Talker adopt hybrid atten…</description>
    </item>
    <item>
      <title>ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-prosdd-learning-prosodic-representations-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-prosdd-learning-prosodic-representations-for/</guid>
      <description>This paper tackles the core problem that current speech deepfake detection (SDD) systems generalize poorly against expressive and emotional synthetic-speech attacks. Existing methods over-rely on spoofed data and tend to learn dataset-specific artifacts rather than transferable characteristics of natural speech. To this end, the authors…</description>
    </item>
  </channel>
</rss>
