<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Diffusion Models on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E6%89%A9%E6%95%A3%E6%A8%A1%E5%9E%8B/</link>
    <description>Recent content in Diffusion Models on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E6%89%A9%E6%95%A3%E6%A8%A1%E5%9E%8B/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A State-Dependent Markov Diffusion Process for Generative Speech Enhancement</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-state-dependent-markov-diffusion-process-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-state-dependent-markov-diffusion-process-for/</guid>
      <description>Speech Enhancement | 6.5/10</description>
    </item>
    <item>
      <title>Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-are-modern-speech-enhancement-systems-vulnerable/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-are-modern-speech-enhancement-systems-vulnerable/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Asynchrony-Aware Decoupled Multimodal Control for Cued Speech Video Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-asynchrony-aware-decoupled-multimodal-control-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-asynchrony-aware-decoupled-multimodal-control-for/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Audience-Aware Co-speech Gesture Generation in Public Speaking via Anticipation Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audience-aware-co-speech-gesture-generation-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audience-aware-co-speech-gesture-generation-in/</guid>
      <description>Audio Generation | 8.0/10</description>
    </item>
    <item>
      <title>Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-conditioned-diffusion-llms-for-asr-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-conditioned-diffusion-llms-for-asr-and/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiogen-omni-a-unified-multimodal-diffusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiogen-omni-a-unified-multimodal-diffusion/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Automatic Music Mixing Using a Generative Model of Effect Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-automatic-music-mixing-using-a-generative-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-automatic-music-mixing-using-a-generative-model/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Bayesian Signal Separation Via Plug-and-Play Diffusion-Within-Gibbs Sampling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bayesian-signal-separation-via-plug-and-play/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bayesian-signal-separation-via-plug-and-play/</guid>
      <description>Speech Separation | 7.5/10</description>
    </item>
    <item>
      <title>Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-face-swapping-a-diffusion-based-digital/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-face-swapping-a-diffusion-based-digital/</guid>
      <description>Audio Deepfake Detection | 8.1/10</description>
    </item>
    <item>
      <title>Bone-Conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bone-conduction-guided-multimodal-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bone-conduction-guided-multimodal-speech/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Break-the-Beat! Controllable MIDI-to-Drum audio synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-break-the-beat-controllable-midi-to-drum-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-break-the-beat-controllable-midi-to-drum-audio/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Bridging the Measurement–Simulation Gap in Room Acoustics with Real2sim Diffusion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridging-the-measurementsimulation-gap-in-room/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridging-the-measurementsimulation-gap-in-room/</guid>
      <description>Sound Source Localization | 8.5/10</description>
    </item>
    <item>
      <title>Cardiobridge-DM: Bridging Cross-Cohort Heart Sound Synthesis via Rhythm-Aware Semi-Supervised Diffusion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cardiobridge-dm-bridging-cross-cohort-heart-sound/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cardiobridge-dm-bridging-cross-cohort-heart-sound/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Conditional Diffusion Models for Mental Health-Preserving Voice Conversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-conditional-diffusion-models-for-mental-health/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-conditional-diffusion-models-for-mental-health/</guid>
      <description>Voice Conversion | 8.0/10</description>
    </item>
    <item>
      <title>Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-continuous-token-diffusion-for-speaker-referenced/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-continuous-token-diffusion-for-speaker-referenced/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation from Lead Sheet</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-d3pia-a-discrete-denoising-diffusion-model-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-d3pia-a-discrete-denoising-diffusion-model-for/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>DGSDNet: Dual-Graph Spectral Diffusion Network for Incomplete Multimodal Emotion Recognition in Conversations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dgsdnet-dual-graph-spectral-diffusion-network-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dgsdnet-dual-graph-spectral-diffusion-network-for/</guid>
      <description>Speech Emotion Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Diff-vs: Efficient Audio-Aware Diffusion U-Net for Vocals Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diff-vs-efficient-audio-aware-diffusion-u-net-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diff-vs-efficient-audio-aware-diffusion-u-net-for/</guid>
      <description>Speech Separation | 7.5/10</description>
    </item>
    <item>
      <title>Diffemotalk: Audio-Driven Facial Animation with Fine-Grained Emotion Control via Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diffemotalk-audio-driven-facial-animation-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diffemotalk-audio-driven-facial-animation-with/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Diffusion Timbre Transfer via Mutual Information Guided Inpainting</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diffusion-timbre-transfer-via-mutual-information/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diffusion-timbre-transfer-via-mutual-information/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Direct Preference Optimization For Speech Autoregressive Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-preference-optimization-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-preference-optimization-for-speech/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>DisContSE: Single-Step Diffusion Speech Enhancement based on Joint Discrete and Continuous Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discontse-single-step-diffusion-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discontse-single-step-diffusion-speech/</guid>
      <description>Speech Enhancement | 8.5/10</description>
    </item>
    <item>
      <title>Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-diffusion-for-generative-modeling-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-diffusion-for-generative-modeling-of/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Disentangling Physiology from Fidelity: Latent-Guided Diffusion Models for Cross-Modal Cardiac Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-disentangling-physiology-from-fidelity-latent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-disentangling-physiology-from-fidelity-latent/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>DISSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dissr-disentangling-speech-representation-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dissr-disentangling-speech-representation-for/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ditse-high-fidelity-generative-speech-enhancement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ditse-high-fidelity-generative-speech-enhancement/</guid>
      <description>Speech Enhancement | 8.5/10</description>
    </item>
    <item>
      <title>DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ditsinger-scaling-singing-voice-synthesis-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ditsinger-scaling-singing-voice-synthesis-with/</guid>
      <description>Singing Voice Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dmp-tts-disentangled-multi-modal-prompting-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dmp-tts-disentangled-multi-modal-prompting-for/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Do We Need EMA for Diffusion-Based Speech Enhancement? Toward A Magnitude-Preserving Network Architecture</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-we-need-ema-for-diffusion-based-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-we-need-ema-for-diffusion-based-speech/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>DOMA: Leveraging Diffusion Language Models with Adaptive Prior for Intent Classification and Slot Filling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-doma-leveraging-diffusion-language-models-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-doma-leveraging-diffusion-language-models-with/</guid>
      <description>Spoken Dialogue Systems | 8.5/10</description>
    </item>
    <item>
      <title>Emotion-Aligned Generation in Diffusion Text to Speech Models Via Preference-Guided Optimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotion-aligned-generation-in-diffusion-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotion-aligned-generation-in-diffusion-text-to/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fac-facodec-controllable-zero-shot-foreign-accent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fac-facodec-controllable-zero-shot-foreign-accent/</guid>
      <description>Voice Conversion | 8.0/10</description>
    </item>
    <item>
      <title>Feedback-Driven Retrieval-Augmented Audio Generation with Large Audio Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-feedback-driven-retrieval-augmented-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-feedback-driven-retrieval-augmented-audio/</guid>
      <description>Audio Generation | 6.5/10</description>
    </item>
    <item>
      <title>FODGE : High-Fidelity Dance Generation via Full-Body Optimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fodge-high-fidelity-dance-generation-via-full/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fodge-high-fidelity-dance-generation-via-full/</guid>
      <description>Audio Generation | 6.5/10</description>
    </item>
    <item>
      <title>Gdiffuse: Diffusion-Based Speech Enhancement with Noise Model Guidance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gdiffuse-diffusion-based-speech-enhancement-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gdiffuse-diffusion-based-speech-enhancement-with/</guid>
      <description>Speech Enhancement | 7.0/10</description>
    </item>
    <item>
      <title>Generalizability of Predictive and Generative Speech Enhancement Models to Pathological Speakers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generalizability-of-predictive-and-generative/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generalizability-of-predictive-and-generative/</guid>
      <description>Speech Enhancement | 7.0/10</description>
    </item>
    <item>
      <title>Generating Moving 3d Soundscapes with Latent Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generating-moving-3d-soundscapes-with-latent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generating-moving-3d-soundscapes-with-latent/</guid>
      <description>Spatial Audio | 7.5/10</description>
    </item>
    <item>
      <title>Generative Audio Extension and Morphing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generative-audio-extension-and-morphing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-generative-audio-extension-and-morphing/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>GLA-GRAD&#43;&#43;: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gla-grad-an-improved-griffin-lim-guided-diffusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gla-grad-an-improved-griffin-lim-guided-diffusion/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Constrative and Generative Pretraining</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gms-cavp-improving-audio-video-correspondence/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gms-cavp-improving-audio-video-correspondence/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Improving Automatic Speech Recognition by Mitigating Distortions Introduced by Speech Enhancement Under Drone Noise</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-automatic-speech-recognition-by/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-automatic-speech-recognition-by/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>InstructAudio: Unified Speech and Music Generation with Natural Language Instruction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instructaudio-unified-speech-and-music-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instructaudio-unified-speech-and-music-generation/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Instrument Generation Through Distributional Flow Matching and Test-Time Search</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instrument-generation-through-distributional-flow/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instrument-generation-through-distributional-flow/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>KSDIFF: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ksdiff-keyframe-augmented-speech-aware-dual-path/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ksdiff-keyframe-augmented-speech-aware-dual-path/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>LAFUFU: Latent Acoustic Features For Ultra-Fast Utterance Restoration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lafufu-latent-acoustic-features-for-ultra-fast/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lafufu-latent-acoustic-features-for-ultra-fast/</guid>
      <description>Speech Enhancement | 8.0/10</description>
    </item>
    <item>
      <title>Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-linearity-in-audio-consistency/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-linearity-in-audio-consistency/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Leveraging Diffusion U-Net Features for Predominant Instrument Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-diffusion-u-net-features-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-diffusion-u-net-features-for/</guid>
      <description>Music Information Retrieval | 8.0/10</description>
    </item>
    <item>
      <title>Low-Resource Guidance for Controllable Latent Audio Diffusion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-low-resource-guidance-for-controllable-latent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-low-resource-guidance-for-controllable-latent/</guid>
      <description>Music Generation | 8.5/10</description>
    </item>
    <item>
      <title>MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation Without Vector Quantization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mag-multi-modal-aligned-autoregressive-co-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mag-multi-modal-aligned-autoregressive-co-speech/</guid>
      <description>Audio Generation | 8.0/10</description>
    </item>
    <item>
      <title>MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Membership Inference Attack against Music Diffusion Models via Generative Manifold Perturbation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-membership-inference-attack-against-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-membership-inference-attack-against-music/</guid>
      <description>Audio Security | 7.5/10</description>
    </item>
    <item>
      <title>MirrorTalk: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mirrortalk-forging-personalized-avatars-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mirrortalk-forging-personalized-avatars-via/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Mitigating Data Replication in Text-to-Audio Generative Diffusion Models Through Anti-Memorization Guidance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-data-replication-in-text-to-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-data-replication-in-text-to-audio/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Mix2Morph: Learning Sound Morphing from Noisy Mixes</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mix2morph-learning-sound-morphing-from-noisy-mixes/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mix2morph-learning-sound-morphing-from-noisy-mixes/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mutual-forcing-dual-mode-self-evolution-for-fast/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mutual-forcing-dual-mode-self-evolution-for-fast/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Noise-to-Notes: Diffusion-Based Generation and Refinement for Automatic Drum Transcription</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-to-notes-diffusion-based-generation-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-to-notes-diffusion-based-generation-and/</guid>
      <description>Music Information Retrieval | 8.0/10</description>
    </item>
    <item>
      <title>PG-SE: Predictive Acceleration and Correction for Generative Speech Enhancement</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pg-se-predictive-acceleration-and-correction-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pg-se-predictive-acceleration-and-correction-for/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>PICOAUDIO2: Temporal Controllable Text-to-Audio Generation with Natural Language Description</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-picoaudio2-temporal-controllable-text-to-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-picoaudio2-temporal-controllable-text-to-audio/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>PRoADS: Provably Secure And Robust Audio Diffusion Steganography With Latent Optimization And Backward Euler Inversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-proads-provably-secure-and-robust-audio-diffusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-proads-provably-secure-and-robust-audio-diffusion/</guid>
      <description>Audio Security | 6.5/10</description>
    </item>
    <item>
      <title>PromptSep: Generative Audio Separation Via Multimodal Prompting</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-promptsep-generative-audio-separation-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-promptsep-generative-audio-separation-via/</guid>
      <description>Speech Separation | 7.5/10</description>
    </item>
    <item>
      <title>RAP: Real-Time Audio-Driven Portrait Animation with Video Diffusion Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rap-real-time-audio-driven-portrait-animation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rap-real-time-audio-driven-portrait-animation/</guid>
      <description>Audio-Visual | 7.0/10</description>
    </item>
    <item>
      <title>RFM-Editing: Rectified Flow Matching for Text-Guided Audio Editing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rfm-editing-rectified-flow-matching-for-text/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rfm-editing-rectified-flow-matching-for-text/</guid>
      <description>Audio Editing | 7.5/10</description>
    </item>
    <item>
      <title>S-PRESSO: Ultra Low Bitrate Sound Effect Compression with Diffusion Autoencoders and Offline Quantization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s-presso-ultra-low-bitrate-sound-effect/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s-presso-ultra-low-bitrate-sound-effect/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saga-sr-semantically-and-acoustically-guided/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saga-sr-semantically-and-acoustically-guided/</guid>
      <description>Audio enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Savgbench: Benchmarking Spatially Aligned Audio-Video Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-savgbench-benchmarking-spatially-aligned-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-savgbench-benchmarking-spatially-aligned-audio/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>Scalable Evaluation for Audio Identification Via Synthetic Latent Fingerprint Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scalable-evaluation-for-audio-identification-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scalable-evaluation-for-audio-identification-via/</guid>
      <description>Audio retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Shortcut Flow Matching for Speech Enhancement: Step-Invariant Flows via Single Stage Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-shortcut-flow-matching-for-speech-enhancement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-shortcut-flow-matching-for-speech-enhancement/</guid>
      <description>Speech enhancement | 7.0/10</description>
    </item>
    <item>
      <title>SIRUP: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sirup-a-diffusion-based-virtual-upmixer-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sirup-a-diffusion-based-virtual-upmixer-of/</guid>
      <description>Sound source localization | 7.0/10</description>
    </item>
    <item>
      <title>Sounds that Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sounds-that-shape-audio-driven-3d-mesh-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sounds-that-shape-audio-driven-3d-mesh-generation/</guid>
      <description>Audio generation | 7.0/10</description>
    </item>
    <item>
      <title>Staged Diffusion with Hybrid Mixture-of-Experts (MOE) for Multimodal Sentiment Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-staged-diffusion-with-hybrid-mixture-of-experts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-staged-diffusion-with-hybrid-mixture-of-experts/</guid>
      <description>Speech emotion recognition | 8.0/10</description>
    </item>
    <item>
      <title>Stemphonic: All-At-Once Flexible Multi-Stem Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stemphonic-all-at-once-flexible-multi-stem-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stemphonic-all-at-once-flexible-multi-stem-music/</guid>
      <description>Music generation | 7.7/10</description>
    </item>
    <item>
      <title>StereoFoley: Object-Aware Stereo Audio Generation from Video</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stereofoley-object-aware-stereo-audio-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stereofoley-object-aware-stereo-audio-generation/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>Str-DiffSep: Streamable Diffusion Model for Speech Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-str-diffsep-streamable-diffusion-model-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-str-diffsep-streamable-diffusion-model-for-speech/</guid>
      <description>Speech separation | 7.5/10</description>
    </item>
    <item>
      <title>Structure-Aware Diffusion Schrödinger Bridge</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-structure-aware-diffusion-schrdinger-bridge/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-structure-aware-diffusion-schrdinger-bridge/</guid>
      <description>Dataset alignment | 7.7/10</description>
    </item>
    <item>
      <title>StyHarmo: Efficient Style-Specific Video Generation with Music Synchronization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-styharmo-efficient-style-specific-video/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-styharmo-efficient-style-specific-video/</guid>
      <description>Video generation | 6.5/10</description>
    </item>
    <item>
      <title>Style-Disentangled Diffusion for Controllable and Identity-Generalized Speech-Driven Body Motion Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-style-disentangled-diffusion-for-controllable-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-style-disentangled-diffusion-for-controllable-and/</guid>
      <description>Speech-driven motion generation | 7.0/10</description>
    </item>
    <item>
      <title>TAG: Structured Temporal Audio Generation via LLM-Guided Manual Scription and Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tag-structured-temporal-audio-generation-via-llm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tag-structured-temporal-audio-generation-via-llm/</guid>
      <description>Audio generation | 7.5/10</description>
    </item>
    <item>
      <title>Taming Audio VAEs via Target-KL Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-taming-audio-vaes-via-target-kl-regularization/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-taming-audio-vaes-via-target-kl-regularization/</guid>
      <description>Audio generation | 6.5/10</description>
    </item>
    <item>
      <title>Tldiffgan: A Latent Diffusion-Gan Framework with Temporal Information Fusion for Anomalous Sound Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tldiffgan-a-latent-diffusion-gan-framework-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tldiffgan-a-latent-diffusion-gan-framework-with/</guid>
      <description>Audio event detection | 7.5/10</description>
    </item>
    <item>
      <title>Towards Multi-View Hierarchical Video-to-Piano Generation with MIDI Guidance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-multi-view-hierarchical-video-to-piano/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-multi-view-hierarchical-video-to-piano/</guid>
      <description>Music generation | 7.0/10</description>
    </item>
    <item>
      <title>Training-Free Multimodal Guidance for Video to Audio Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-free-multimodal-guidance-for-video-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-free-multimodal-guidance-for-video-to/</guid>
      <description>Audio generation | 8.0/10</description>
    </item>
    <item>
      <title>Virtual Consistency for Audio Editing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-virtual-consistency-for-audio-editing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-virtual-consistency-for-audio-editing/</guid>
      <description>Music generation | 8.0/10</description>
    </item>
    <item>
      <title>Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-visual-keys-to-symphonies-latent-diffusion-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-visual-keys-to-symphonies-latent-diffusion-for/</guid>
      <description>Music generation | 7.5/10</description>
    </item>
    <item>
      <title>ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vitex-visual-texture-control-for-multi-track/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vitex-visual-texture-control-for-multi-track/</guid>
      <description>Music generation | 7.0/10</description>
    </item>
    <item>
      <title>VividTalker: A Modular Framework for Expressive 3D Talking Avatars with Controllable Gaze and Blink</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vividtalker-a-modular-framework-for-expressive-3d/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vividtalker-a-modular-framework-for-expressive-3d/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>VMSP: Video-to-Music Generation with Two-Stage Alignment and Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vmsp-video-to-music-generation-with-two-stage/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vmsp-video-to-music-generation-with-two-stage/</guid>
      <description>Music generation | 7.0/10</description>
    </item>
    <item>
      <title>VT-Heads: Voice Cloning and Talking Head Generation from Text Based on V-DiT</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vt-heads-voice-cloning-and-talking-head/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vt-heads-voice-cloning-and-talking-head/</guid>
      <description>Video generation | 6.5/10</description>
    </item>
    <item>
      <title>Wave-Trainer-Fit: Neural Vocoder With Trainable Prior And Fixed-Point Iteration Towards High-Quality Speech Generation From SSL Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wave-trainer-fit-neural-vocoder-with-trainable/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wave-trainer-fit-neural-vocoder-with-trainable/</guid>
      <description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Wavenext 2: Convnext-Based Fast Neural Vocoders with Residual Denoising and Sub-Modeling for Gan And Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavenext-2-convnext-based-fast-neural-vocoders/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavenext-2-convnext-based-fast-neural-vocoders/</guid>
      <description>Speech synthesis | 9.0/10</description>
    </item>
    <item>
      <title>CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-cineagi-character-consistent-movie-creation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-cineagi-character-consistent-movie-creation/</guid>
      <description>Cross-modal | 8.0/10</description>
    </item>
    <item>
      <title>Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</guid>
      <description>Audio-visual | 8.5/10</description>
    </item>
    <item>
      <title>Scaling Properties of Continuous Diffusion Spoken Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-scaling-properties-of-continuous-diffusion-spoken/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-scaling-properties-of-continuous-diffusion-spoken/</guid>
      <description>Speech generation | 8.0/10</description>
    </item>
    <item>
      <title>Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</guid>
      <description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</guid>
      <description>Audio generation | 8.5/10</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>Music generation | 7.0/10</description>
    </item>
    <item>
      <title>CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-cointeract-physically-consistent-human-object/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-cointeract-physically-consistent-human-object/</guid>
      <description>1. **Problem**: When generating human-object interaction (HOI) videos, existing video diffusion models often suffer from hand/face structural collapse and physical interpenetration between humans and objects, rooted in the models' lack of understanding of 3D spatial relations and interaction structure. 2. **Core method**: The CoInteract framework, built around a "spatially structured co-generation" paradigm: within a shared DiT backbone, it jointly trains an RGB appearance stream and an auxiliary…</description>
    </item>
    <item>
      <title>Anonymization, Not Elimination: Utility-Preserved Speech Anonymization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-anonymization-not-elimination-utility-preserved/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-anonymization-not-elimination-utility-preserved/</guid>
      <description>Targeting the core tension in speech data privacy between privacy leakage and loss of data utility, this paper proposes a novel two-stage framework. First, to address the limited identity diversity and poor controllability of speaker anonymization (protecting "who is speaking"), it proposes a flow-matching-based speaker embedding anonymizer (F3-VA), which generates diverse new identities well separated from the original speaker. Second, to address content anonymization (protecting…</description>
    </item>
    <item>
      <title>Latent Fourier Transform</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</guid>
      <description>This paper aims to solve the difficulty existing music generation models have in precisely controlling musical patterns at **arbitrary time scales**. The authors propose the **Latent Fourier Transform (LatentFT)** framework, whose core idea is to apply the discrete Fourier transform to the **sequence of latent vectors** encoded by a diffusion autoencoder, yielding a "latent spectrum". Randomly masking frequencies of the latent spectrum during training forces the decoding…</description>
    </item>
    <item>
      <title>Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-beyond-monologue-interactive-talking-listening/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-beyond-monologue-interactive-talking-listening/</guid>
      <description>This paper tackles the core challenge of moving from one-way "monologue" avatar generation to natural "full-duplex" interactive generation. **Core problem**: existing methods either react rigidly due to strict frame alignment, or break lip sync by introducing global attention. **Key method**: a unified attention architecture based on multi-head Gaussian kernels (MHGK), which assigns Gaussian receptive fields ranging from narrow to wide across different attention heads, enabling…</description>
    </item>
    <item>
      <title>Elucidating the SNR-t Bias of Diffusion Probabilistic Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-elucidating-the-snr-t-bias-of-diffusion/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-elucidating-the-snr-t-bias-of-diffusion/</guid>
      <description>The core contribution of this paper is identifying and systematically analyzing a fundamental issue in diffusion probabilistic models (DPMs): the SNR-timestep (SNR-t) bias. The bias refers to the mismatch between the actual SNR of a denoised sample at inference and the SNR theoretically corresponding to its assigned timestep t; this misalignment arises because the strict coupling imposed during training is broken by accumulated errors at inference. Through thorough experiments (sliding-window tests, comparisons of the forward and reverse processes), the authors reveal…</description>
    </item>
    <item>
      <title>Hierarchical Codec Diffusion for Video-to-Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</guid>
      <description>Targeting the visual-speech information asymmetry in Video-to-Speech (VTS) generation, this paper argues that existing methods ignore the hierarchical structure of speech from coarse-grained semantics to fine-grained prosody, so visual conditions cannot be precisely aligned with speech representations. To this end, the authors propose HiCoDiT (Hierarchical Codec Diffusion Transformer),…</description>
    </item>
    <item>
      <title>ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-controlfoley-unified-and-controllable-video-to/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-controlfoley-unified-and-controllable-video-to/</guid>
      <description>This paper presents ControlFoley, a unified and controllable video-to-audio generation framework, aiming to solve existing methods' weak text control under cross-modal conflicts and the entanglement of timbre and temporal information in reference-audio control. Its core contributions include: 1) a joint visual encoding paradigm…</description>
    </item>
    <item>
      <title>CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-cosyncdit-cognitive-synchronous-diffusion/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-cosyncdit-cognitive-synchronous-diffusion/</guid>
      <description>Targeting the difficulty of jointly achieving timbre fidelity and lip sync in movie dubbing (visual voice cloning), this paper proposes a flow-matching-based Cognitive Synchronous Diffusion Transformer (CoSyncDiT) framework. Inspired by the cognitive process of professional dubbing actors, the method takes the noise-to-speech…</description>
    </item>
    <item>
      <title>Diffusion Language Models for Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-diffusion-language-models-for-speech-recognition/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-diffusion-language-models-for-speech-recognition/</guid>
      <description>This paper explores applying diffusion language models (DLMs) to automatic speech recognition (ASR). Its central goal is to exploit diffusion models' bidirectional attention and parallel generation to improve the accuracy of ASR candidate hypotheses produced by a conventional encoder (e.g., CTC). The paper mainly…</description>
    </item>
    <item>
      <title>Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-tora3-trajectory-guided-audio-video-generation/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-tora3-trajectory-guided-audio-video-generation/</guid>
      <description>Targeting unrealistic motion, desynchronization between sound and motion events, and mismatch between sound intensity and motion intensity in existing audio-video (AV) generation models, this paper proposes the Tora3 framework. Its core innovation is to **treat object trajectories as a shared kinematic prior linking the visual and auditory modalities**…</description>
    </item>
  </channel>
</rss>
