<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Multimodal on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E5%A4%9A%E6%A8%A1%E6%80%81/</link>
    <description>Recent content in Multimodal on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E5%A4%9A%E6%A8%A1%E6%80%81/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cutscene-agent-an-llm-agent-framework-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cutscene-agent-an-llm-agent-framework-for/</guid>
      <description>Generative models | 8.5/10</description>
    </item>
    <item>
      <title>MirrorTalk: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mirrortalk-forging-personalized-avatars-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mirrortalk-forging-personalized-avatars-via/</guid>
      <description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-temporally-heterogeneous-graph-contrastive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-temporally-heterogeneous-graph-contrastive/</guid>
      <description>Audio event detection | 8.5/10</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-29</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29/</guid>
      <description>29 speech/AI papers analyzed in total</description>
    </item>
    <item>
      <title>CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-cointeract-physically-consistent-human-object/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-cointeract-physically-consistent-human-object/</guid>
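      <description>1. **Problem**: When generating human-object interaction (HOI) videos, existing video diffusion models often suffer from collapsed hand/face structure and physical interpenetration between humans and objects. The root cause is that the models lack an understanding of 3D spatial relationships and interaction structure. 2. **Core method**: The paper proposes the CoInteract framework, built around a "spatially-structured co-generation" paradigm. Within a shared DiT backbone, it jointly trains an RGB appearance stream and an auxiliary</description>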
    </item>
    <item>
      <title>FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-flip-towards-understanding-and-interpreting/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-flip-towards-understanding-and-interpreting/</guid>
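      <description>This paper addresses the interpretability of multilingual, multimodal sentence embeddings (such as SONAR and LaBSE). Its core method is a model called Factorized Linear Projection (FLiP), which linearly projects embedding vectors into the vocabulary space to extract keywords, serving as a proxy task for understanding what the embeddings contain. Compared with earlier non-factorized linear probing methods (such as LiP) and with SpLiCE, FLiP in keyword extraction</description>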
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-23</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23/</guid>
      <description>27 speech/AI papers analyzed in total</description>
    </item>
    <item>
      <title>CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-cosyncdit-cognitive-synchronous-diffusion/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-cosyncdit-cognitive-synchronous-diffusion/</guid>
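      <description>To address the difficulty of achieving both timbre fidelity and lip synchronization in movie dubbing (visual voice cloning), this paper proposes the Cognitive Synchronous Diffusion Transformer (CoSyncDiT), a framework based on flow matching. Inspired by the cognitive process of professional voice actors, the method takes the noise-to-speech</description>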
    </item>
    <item>
      <title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-from-reactive-to-proactive-assessing-the/</guid>
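      <description>This paper addresses the problem that current voice agent evaluation focuses too heavily on reactive responses while neglecting proactive interaction ability. To this end, the authors propose **ProVoice-Bench**, the first benchmark framework dedicated to evaluating proactive voice agents. The framework comprises four novel tasks, used to</description>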
    </item>
    <item>
      <title>Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listening-deepfake-detection-a-new-perspective/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-listening-deepfake-detection-a-new-perspective/</guid>
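      <description>This paper is the first to propose the new task of "listening deepfake detection", which aims to identify forged reactions of people in videos while they are listening (not speaking), filling a gap left by existing research that focuses mainly on "speaking" scenarios. To address the scarcity of data for this task, the authors build the first dedicated dataset, Li</description>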
    </item>
    <item>
      <title>Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-why-your-tokenizer-fails-in-information-fusion-a/</guid>
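      <description>This paper examines in depth a core tension, common in end-to-end audio language models, that arises when visual information is incorporated into the audio tokenizer: "understanding improves but reconstruction quality degrades". Through systematic experiments, the authors reveal three key findings: the fusion position (before or after quantization) is critical; in</description>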
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-19</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19/</guid>
      <description>42 speech/AI papers analyzed in total</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-18</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-18/</link>
      <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-18/</guid>
      <description>39 speech/AI papers analyzed in total</description>
    </item>
  </channel>
</rss>
