<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Unified Audio Models on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E7%BB%9F%E4%B8%80%E9%9F%B3%E9%A2%91%E6%A8%A1%E5%9E%8B/</link>
    <description>Recent content in Unified Audio Models on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E7%BB%9F%E4%B8%80%E9%9F%B3%E9%A2%91%E6%A8%A1%E5%9E%8B/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiogen-omni-a-unified-multimodal-diffusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiogen-omni-a-unified-multimodal-diffusion/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auv-teaching-audio-universal-vector-quantization/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auv-teaching-audio-universal-vector-quantization/</guid>
      <description>Audio generation | 8.0/10</description>
    </item>
    <item>
      <title>Hierarchical Activity Recognition and Captioning from Long-Form Audio</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-activity-recognition-and-captioning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-activity-recognition-and-captioning/</guid>
      <description>Audio event detection | 7.5/10</description>
    </item>
    <item>
      <title>InstructAudio: Unified Speech and Music Generation with Natural Language Instruction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instructaudio-unified-speech-and-music-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instructaudio-unified-speech-and-music-generation/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</guid>
      <description>Speech recognition | 8.0/10</description>
    </item>
    <item>
      <title>UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</guid>
      <description>Audio generation | 8.5/10</description>
    </item>
    <item>
      <title>Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</guid>
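      <description>1. **Problem**: Training a single unified ASR model that delivers both high-accuracy offline transcription and low-latency streaming recognition is highly challenging; conventional methods degrade sharply at low latency. 2. **Core method**: The paper proposes a unified Transducer framework that combines chunked attention (with right context) and dynamic chunk convolution (DCConv) to support both modes. The key innovation is the introduction of a mode-consistency regularization loss</description>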
    </item>
    <item>
      <title>UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-uaf-a-unified-audio-front-end-llm-for-full-duplex/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-uaf-a-unified-audio-front-end-llm-for-full-duplex/</guid>
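      <description>**Core contribution**: This paper proposes the first unified audio front-end large model (UAF) designed specifically for full-duplex speech interaction. It breaks with the conventional cascaded front-end processing paradigm and models multiple tasks, including voice activity detection (VAD), speaker recognition (SR), automatic speech recognition (ASR), turn detection (TD), and question answering (QA), as a single autoregressive sequence prediction problem.  **Key method**: The model adopts an “audio</description>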
    </item>
    <item>
      <title>Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-beyond-transcription-unified-audio-schema-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-beyond-transcription-unified-audio-schema-for/</guid>
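      <description>This paper addresses a core problem: current audio large language models (AudioLLMs) perform poorly on fine-grained acoustic perception tasks. The authors point out that the dominant training paradigm, centered on automatic speech recognition (ASR), maps audio to plain-text transcripts and thereby systematically discards</description>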
    </item>
    <item>
      <title>On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-on-the-distillation-loss-functions-of-speech-vae/</guid>
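      <description>This paper addresses the imbalanced performance of existing speech variational autoencoders (VAEs) across unified speech reconstruction, understanding, and generation tasks (understanding is especially weak) by systematically studying the design space of distillation loss functions. The authors explore three ways of distilling knowledge from self-supervised learning (SSL) models into the VA</description>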
    </item>
  </channel>
</rss>
