<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>知识蒸馏 on 语音/音频论文速递</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E7%9F%A5%E8%AF%86%E8%92%B8%E9%A6%8F/</link>
    <description>Recent content in 知识蒸馏 on 语音/音频论文速递</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E7%9F%A5%E8%AF%86%E8%92%B8%E9%A6%8F/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Advancing Speech Summarization in Multi-Modal LLMs with Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-speech-summarization-in-multi-modal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-speech-summarization-in-multi-modal/</guid>
      <description>Audio QA | 7.0/10</description>
    </item>
    <item>
      <title>AFT: An Exemplar-Free Class Incremental Learning Method for Environmental Sound Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aft-an-exemplar-free-class-incremental-learning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aft-an-exemplar-free-class-incremental-learning/</guid>
      <description>Audio classification | 7.0/10</description>
    </item>
    <item>
      <title>AMBER2: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-amber2-dual-ambiguity-aware-emotion-recognition/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-amber2-dual-ambiguity-aware-emotion-recognition/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>APKD: Aligned And Paced Knowledge Distillation Towards Lightweight Heterogeneous Multimodal Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-apkd-aligned-and-paced-knowledge-distillation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-apkd-aligned-and-paced-knowledge-distillation/</guid>
      <description>Emotion recognition | 7.5/10</description>
    </item>
    <item>
      <title>Attention-Weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied To Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attention-weighted-centered-kernel-alignment-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attention-weighted-centered-kernel-alignment-for/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>Attentive Masked Self-Distillation for Respiratory Sound Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attentive-masked-self-distillation-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attentive-masked-self-distillation-for/</guid>
      <description>Audio classification | 7.5/10</description>
    </item>
    <item>
      <title>AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auv-teaching-audio-universal-vector-quantization/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auv-teaching-audio-universal-vector-quantization/</guid>
      <description>Audio generation | 8.0/10</description>
    </item>
    <item>
      <title>Cross-Architecture Knowledge Distillation of WavLM for Lightweight Speaker Verification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-architecture-knowledge-distillation-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-architecture-knowledge-distillation-of/</guid>
      <description>Speaker verification | 8.0/10</description>
    </item>
    <item>
      <title>Cross-Modal Knowledge Distillation for Speech Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-knowledge-distillation-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-knowledge-distillation-for-speech/</guid>
      <description>Speech LLMs | 7.0/10</description>
    </item>
    <item>
      <title>Curriculum Learning with Contrastive Loss for Lightweight Speaker Verification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-curriculum-learning-with-contrastive-loss-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-curriculum-learning-with-contrastive-loss-for/</guid>
      <description>Speaker verification | 6.5/10</description>
    </item>
    <item>
      <title>DBFT-SD: Weakly Supervised Multimodal Detection of Sensitive Audio-Visual Content</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dbft-sd-weakly-supervised-multimodal-detection-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dbft-sd-weakly-supervised-multimodal-detection-of/</guid>
      <description>Audio event detection | 8.0/10</description>
    </item>
    <item>
      <title>Distilling Attention Knowledge for Speaker Verification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-distilling-attention-knowledge-for-speaker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-distilling-attention-knowledge-for-speaker/</guid>
      <description>Speaker verification | 8.0/10</description>
    </item>
    <item>
      <title>EchoRAG: A Two-Stage Framework for Audio-Text Retrieval and Temporal Grounding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-echorag-a-two-stage-framework-for-audio-text/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-echorag-a-two-stage-framework-for-audio-text/</guid>
      <description>Audio retrieval | 7.5/10</description>
    </item>
    <item>
      <title>EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-edgespot-efficient-and-high-performance-few-shot/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-edgespot-efficient-and-high-performance-few-shot/</guid>
      <description>Voice activity detection | 7.5/10</description>
    </item>
    <item>
      <title>Enabling Multi-Species Bird Classification on Low-Power Bioacoustic Loggers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enabling-multi-species-bird-classification-on-low/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enabling-multi-species-bird-classification-on-low/</guid>
      <description>Bioacoustics | 8.0/10</description>
    </item>
    <item>
      <title>Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation Guided Structured Pruning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-speaker-verification-with-w2v-bert-20/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhancing-speaker-verification-with-w2v-bert-20/</guid>
      <description>Speaker verification | 7.5/10</description>
    </item>
    <item>
      <title>FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-focalcodec-stream-streaming-low-bitrate-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-focalcodec-stream-streaming-low-bitrate-speech/</guid>
      <description>Speech coding | 8.0/10</description>
    </item>
    <item>
      <title>From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-hallucination-to-articulation-language-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-hallucination-to-articulation-language-model/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>GLUE: Gradient-free Learning to Unify Experts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-glue-gradient-free-learning-to-unify-experts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-glue-gradient-free-learning-to-unify-experts/</guid>
      <description>Transfer learning | 6.5/10</description>
    </item>
    <item>
      <title>Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-int-meanflow-few-step-speech-generation-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-int-meanflow-few-step-speech-generation-with/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-to-align-with-unbalanced-optimal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-to-align-with-unbalanced-optimal/</guid>
      <description>Speech recognition | 6.5/10</description>
    </item>
    <item>
      <title>Lightweight and Generalizable Acoustic Scene Representations Via Contrastive Fine-Tuning and Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lightweight-and-generalizable-acoustic-scene/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lightweight-and-generalizable-acoustic-scene/</guid>
      <description>Audio scene understanding | 8.0/10</description>
    </item>
    <item>
      <title>MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large Audio-Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mi-fuse-label-fusion-for-unsupervised-domain/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mi-fuse-label-fusion-for-unsupervised-domain/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mutual-forcing-dual-mode-self-evolution-for-fast/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mutual-forcing-dual-mode-self-evolution-for-fast/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>Prompt-Guided Mixture-of-Experts for Robust Multimodal Sentiment Analysis with Missing Modalities</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prompt-guided-mixture-of-experts-for-robust/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prompt-guided-mixture-of-experts-for-robust/</guid>
      <description>Speech emotion recognition | 8.5/10</description>
    </item>
    <item>
      <title>S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s-sondo-self-supervised-knowledge-distillation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s-sondo-self-supervised-knowledge-distillation/</guid>
      <description>Audio classification | 7.0/10</description>
    </item>
    <item>
      <title>Salad-VAE: Semantic Audio Compression with Language-Audio Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-salad-vae-semantic-audio-compression-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-salad-vae-semantic-audio-compression-with/</guid>
      <description>Audio compression | 7.5/10</description>
    </item>
    <item>
      <title>Semantic Anchor Transfer from Short to Long Speech in a Distillation-Based Summarization Framework</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-semantic-anchor-transfer-from-short-to-long/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-semantic-anchor-transfer-from-short-to-long/</guid>
      <description>Speech summarization | 7.5/10</description>
    </item>
    <item>
      <title>SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sightsound-r1-cross-modal-reasoning-distillation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sightsound-r1-cross-modal-reasoning-distillation/</guid>
      <description>Audio QA | 7.5/10</description>
    </item>
    <item>
      <title>Sounds that Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sounds-that-shape-audio-driven-3d-mesh-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sounds-that-shape-audio-driven-3d-mesh-generation/</guid>
      <description>Audio generation | 7.0/10</description>
    </item>
    <item>
      <title>SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spade-structured-pruning-and-adaptive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spade-structured-pruning-and-adaptive/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speechct-clip-distilling-text-image-knowledge-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speechct-clip-distilling-text-image-knowledge-to/</guid>
      <description>Medical AI | 7.5/10</description>
    </item>
    <item>
      <title>STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</guid>
      <description>Speech recognition | 8.0/10</description>
    </item>
    <item>
      <title>Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization Via Neural Audio Codec and Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stream-voice-anon-enhancing-utility-of-real-time/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stream-voice-anon-enhancing-utility-of-real-time/</guid>
      <description>Speaker anonymization | 7.0/10</description>
    </item>
    <item>
      <title>Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-target-speaker-llm-asr-with-speaker-aware-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-target-speaker-llm-asr-with-speaker-aware-speech/</guid>
      <description>Speech recognition | 8.8/10</description>
    </item>
    <item>
      <title>Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teacher-guided-pseudo-supervision-and-cross-modal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teacher-guided-pseudo-supervision-and-cross-modal/</guid>
      <description>Audio-visual | 7.0/10</description>
    </item>
    <item>
      <title>Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-Wise Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teaching-audio-models-to-reason-a-unified/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teaching-audio-models-to-reason-a-unified/</guid>
      <description>Audio QA | 7.0/10</description>
    </item>
    <item>
      <title>Teaching the Teachers: Boosting Unsupervised Domain Adaptation In Speech Recognition By Ensemble Update</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teaching-the-teachers-boosting-unsupervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teaching-the-teachers-boosting-unsupervised/</guid>
      <description>Speech recognition | 7.0/10</description>
    </item>
    <item>
      <title>Temporal Distillation for Music Representation Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-temporal-distillation-for-music-representation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-temporal-distillation-for-music-representation/</guid>
      <description>Music information retrieval | 7.5/10</description>
    </item>
    <item>
      <title>The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-impact-of-audio-watermarking-on-audio-anti/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-impact-of-audio-watermarking-on-audio-anti/</guid>
      <description>Audio deepfake detection | 8.5/10</description>
    </item>
    <item>
      <title>The Synergistic Role of Audio and Large Video-Language Model in Source-Free Video Domain Adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-synergistic-role-of-audio-and-large-video/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-synergistic-role-of-audio-and-large-video/</guid>
      <description>Domain adaptation | 7.0/10</description>
    </item>
    <item>
      <title>Triage Knowledge Distillation for Speaker Verification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-triage-knowledge-distillation-for-speaker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-triage-knowledge-distillation-for-speaker/</guid>
      <description>Speaker verification | 7.5/10</description>
    </item>
    <item>
      <title>What the student learns in knowledge distillation: A subspace view and evidence on Convolutional Recurrent Network</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-what-the-student-learns-in-knowledge-distillation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-what-the-student-learns-in-knowledge-distillation/</guid>
      <description>Speech enhancement | 6.5/10</description>
    </item>
    <item>
      <title>Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</guid>
      <description>Audio-visual | 8.5/10</description>
    </item>
    <item>
      <title>Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-beyond-acoustic-sparsity-and-linguistic-bias-a/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-beyond-acoustic-sparsity-and-linguistic-bias-a/</guid>
      <description>Mispronunciation detection | 8.5/10</description>
    </item>
    <item>
      <title>ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-atrie-adaptive-tuning-for-robust-inference-and/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-atrie-adaptive-tuning-for-robust-inference-and/</guid>
      <description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-atrie-adaptive-tuning-for-robust-inference-and/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-atrie-adaptive-tuning-for-robust-inference-and/</guid>
      <description>To address the difficulty existing speech synthesis systems have in simultaneously preserving persona identity and emotional expressiveness when generating persona-driven, emotionally rich speech, this paper proposes the ATRIE framework. Its core is the **Persona-Prosody Dual-Track (P2-DT) architecture**, which decouples speech generation into a static **timbre track** (an identity anchor preserved via scalar quantization) and a dynamic **prosody…</description>
    </item>
    <item>
      <title>Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-cogito-towards-deep-audio-reasoning-in/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-audio-cogito-towards-deep-audio-reasoning-in/</guid>
      <description>This paper addresses large audio language models' (LALMs) limited capability and opaque reasoning on complex audio reasoning tasks. Its **core contribution** is a fully open-source solution named **Audio-Cogito**, built around **Cogito-Pipe**, a four-stage automated data-construction pipeline for generating high-quality, diverse audio chain-of-thought (CoT) reasoning data…</description>
    </item>
    <item>
      <title>AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-avrt-audio-visual-reasoning-transfer-through/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-avrt-audio-visual-reasoning-transfer-through/</guid>
      <description>This paper tackles the core challenge that multimodal large models lack high-quality training data for joint audio-visual reasoning. Its **core contribution** is the AVRT framework, which synthesizes multimodal reasoning data by composing the capabilities of single-modality expert models. The **key method** has two steps: 1) **data generation**: using a dedicated vision teacher (Kimi-VL-Thinking) and an audio teacher (Audio Flami…</description>
    </item>
    <item>
      <title>HARNESS: Lightweight Distilled Arabic Speech Foundation Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</guid>
      <description>Targeting the underperformance of general multilingual/English models on Arabic speech recognition, dialect identification, and emotion recognition, as well as the difficulty of deploying large models, this paper proposes HArnESS, a family of Arabic-centric self-supervised speech models. The authors adopt a HuBERT-style iterative self-distillation framework, first training, on large-scale Arabic-English bilingual data (about 23K hours), a 24-layer teacher model H…</description>
    </item>
    <item>
      <title>Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-audio-cogito-towards-deep-audio-reasoning-in/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-audio-cogito-towards-deep-audio-reasoning-in/</guid>
      <description>This paper aims to solve large audio language models' (LALMs) limited capability on complex audio reasoning tasks and their reliance on expensive closed-source data. The authors propose a fully open-source solution named **Audio-Cogito**, centered on **Cogito-Pip…</description>
    </item>
    <item>
      <title>On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</guid>
      <description>Addressing the imbalanced performance of existing speech variational autoencoders (VAEs) on unified speech reconstruction, understanding, and generation (understanding in particular being weak), this paper systematically studies the design space of distillation loss functions. The authors explore three ways of distilling self-supervised learning (SSL) model knowledge into the VA…</description>
    </item>
    <item>
      <title>Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</guid>
      <description>This paper examines the core tension, common in end-to-end audio language models, that fusing visual information into an audio tokenizer improves understanding while degrading reconstruction quality. Through systematic experiments, the authors reveal three key findings: the fusion position (before vs. after quantization) is critical; in…</description>
    </item>
  </channel>
</rss>
