<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Generative Models on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B/</link>
    <description>Recent content in Generative Models on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Generative-First Neural Audio Autoencoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-generative-first-neural-audio-autoencoder/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-generative-first-neural-audio-autoencoder/</guid>
      <description>Music Generation | 8.5/10</description>
    </item>
    <item>
      <title>Adaptive Deterministic Flow Matching for Target Speaker Extraction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adaptive-deterministic-flow-matching-for-target/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adaptive-deterministic-flow-matching-for-target/</guid>
      <description>Target Speaker Extraction | 8.0/10</description>
    </item>
    <item>
      <title>Bleed No More: Generative Interference Reduction for Musical Recordings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bleed-no-more-generative-interference-reduction/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bleed-no-more-generative-interference-reduction/</guid>
      <description>Music Source Separation | 7.0/10</description>
    </item>
    <item>
      <title>Combining Multi-Order Attention and Multi-Resolution Discriminator for High-Fidelity Neural Vocoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-multi-order-attention-and-multi/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-multi-order-attention-and-multi/</guid>
      <description>Speech Synthesis | 6.5/10</description>
    </item>
    <item>
      <title>Confidence-Based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-based-filtering-for-speech-dataset/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-based-filtering-for-speech-dataset/</guid>
      <description>Speech Enhancement | 6.5/10</description>
    </item>
    <item>
      <title>Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cutscene-agent-an-llm-agent-framework-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cutscene-agent-an-llm-agent-framework-for/</guid>
      <description>Generative Models | 8.5/10</description>
    </item>
    <item>
      <title>ECSA: Dual-Branch Emotion Compensation for Emotion-Consistent Speaker Anonymization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ecsa-dual-branch-emotion-compensation-for-emotion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ecsa-dual-branch-emotion-compensation-for-emotion/</guid>
      <description>Speaker Anonymization | 8.5/10</description>
    </item>
    <item>
      <title>EmoTri-RL: Emotion- and Cause-Aware Reinforcement Learning for Multi-Modal Empathetic Dialogue</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotri-rl-emotion-and-cause-aware-reinforcement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotri-rl-emotion-and-cause-aware-reinforcement/</guid>
      <description>Speech Emotion Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Enhanced Generative Machine Listener</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhanced-generative-machine-listener/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-enhanced-generative-machine-listener/</guid>
      <description>Audio Classification | 7.0/10</description>
    </item>
    <item>
      <title>Etude: Piano Cover Generation with a Three-Stage Approach — Extract, Structuralize, and Decode</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-etude-piano-cover-generation-with-a-three-stage/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-etude-piano-cover-generation-with-a-three-stage/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>Gen-SER: When the Generative Model Meets Speech Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gen-ser-when-the-generative-model-meets-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gen-ser-when-the-generative-model-meets-speech/</guid>
      <description>Speech Emotion Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Hanui: Harnessing Distributional Discrepancies for Singing Voice Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hanui-harnessing-distributional-discrepancies-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hanui-harnessing-distributional-discrepancies-for/</guid>
      <description>Audio Deepfake Detection | 8.0/10</description>
    </item>
    <item>
      <title>HCGAN: Harmonic-Coupled Generative Adversarial Network for Speech Super-Resolution in Low-Bandwidth Scenarios</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hcgan-harmonic-coupled-generative-adversarial/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hcgan-harmonic-coupled-generative-adversarial/</guid>
      <description>Speech Enhancement | 8.0/10</description>
    </item>
    <item>
      <title>Hierarchical Tokenization of Multimodal Music Data for Generative Music Retrieval</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-tokenization-of-multimodal-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-tokenization-of-multimodal-music/</guid>
      <description>Music Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Huí Sù: Co-constructing a Dual Feedback Apparatus</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hu-s-co-constructing-a-dual-feedback-apparatus/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hu-s-co-constructing-a-dual-feedback-apparatus/</guid>
      <description>Music Generation | 5.5/10</description>
    </item>
    <item>
      <title>LLAC: Learned Lossless Audio Codec</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-llac-learned-lossless-audio-codec/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-llac-learned-lossless-audio-codec/</guid>
      <description>Lossless Audio Coding | 7.5/10</description>
    </item>
    <item>
      <title>MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mage-a-coarse-to-fine-speech-enhancer-with-masked/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mage-a-coarse-to-fine-speech-enhancer-with-masked/</guid>
      <description>Speech Enhancement | 8.0/10</description>
    </item>
    <item>
      <title>MeanFlowSE: One-Step Generative Speech Enhancement via Conditional Mean Flow</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanflowse-one-step-generative-speech-enhancement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanflowse-one-step-generative-speech-enhancement/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>MeanSE: Efficient Generative Speech Enhancement with Mean Flows</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanse-efficient-generative-speech-enhancement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanse-efficient-generative-speech-enhancement/</guid>
      <description>Speech Enhancement | 6.5/10</description>
    </item>
    <item>
      <title>MECap-R1: Emotion-Aware Policy with Reinforcement Learning for Multimodal Emotion Captioning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mecap-r1-emotion-aware-policy-with-reinforcement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mecap-r1-emotion-aware-policy-with-reinforcement/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Noise-to-Notes: Diffusion-Based Generation and Refinement for Automatic Drum Transcription</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-to-notes-diffusion-based-generation-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-to-notes-diffusion-based-generation-and/</guid>
      <description>Music Information Retrieval | 8.0/10</description>
    </item>
    <item>
      <title>ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-Based Neural Speech Codec</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-paragse-parallel-generative-speech-enhancement/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-paragse-parallel-generative-speech-enhancement/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>PG-SE: Predictive Acceleration and Correction for Generative Speech Enhancement</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pg-se-predictive-acceleration-and-correction-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pg-se-predictive-acceleration-and-correction-for/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prosody-guided-harmonic-attention-for-phase/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prosody-guided-harmonic-attention-for-phase/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>PSTalker: Realistic 3D Talking Head Synthesis via a Semantic-Aware Audio-Driven Point-Based Shape</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pstalker-realistic-3d-talking-head-synthesis-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pstalker-realistic-3d-talking-head-synthesis-via/</guid>
      <description>Talking Head Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-recom-realistic-co-speech-motion-generation-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-recom-realistic-co-speech-motion-generation-with/</guid>
      <description>Audio Generation | 7.0/10</description>
    </item>
    <item>
      <title>SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saga-sr-semantically-and-acoustically-guided/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saga-sr-semantically-and-acoustically-guided/</guid>
      <description>Audio Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Timbre-Based Pretraining with Pseudo-Labels for Multi-Instrument Automatic Music Transcription</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-timbre-based-pretraining-with-pseudo-labels-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-timbre-based-pretraining-with-pseudo-labels-for/</guid>
      <description>Music Information Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Tldiffgan: A Latent Diffusion-Gan Framework with Temporal Information Fusion for Anomalous Sound Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tldiffgan-a-latent-diffusion-gan-framework-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tldiffgan-a-latent-diffusion-gan-framework-with/</guid>
      <description>Audio Event Detection | 7.5/10</description>
    </item>
    <item>
      <title>Two-Stage Language Model Framework for Acoustic Echo Cancellation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-two-stage-language-model-framework-for-acoustic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-two-stage-language-model-framework-for-acoustic/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Uncertainty-Aware 3D Emotional Talking Face Synthesis with Emotion Prior Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uncertainty-aware-3d-emotional-talking-face/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uncertainty-aware-3d-emotional-talking-face/</guid>
      <description>Audio-Visual | 8.0/10</description>
    </item>
    <item>
      <title>Wave-Trainer-Fit: Neural Vocoder With Trainable Prior And Fixed-Point Iteration Towards High-Quality Speech Generation From SSL Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wave-trainer-fit-neural-vocoder-with-trainable/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wave-trainer-fit-neural-vocoder-with-trainable/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</guid>
      <description>This paper targets the text-to-speech (TTS) task and proposes a new framework called Chain-of-Details (CoD). The **problem to solve** is that existing TTS methods fall short in modeling the temporal dynamics of speech generation, i.e. the progressive process from coarse timing to fine-grained acoustic detail. The **method used** decomposes speech generation into multiple stages of increasing temporal resolution, where each stage</description>
    </item>
    <item>
      <title>Latent Fourier Transform</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-latent-fourier-transform/</guid>
      <description>This paper addresses the difficulty existing music generation models have in precisely controlling musical patterns at **arbitrary time scales**. The authors propose the **Latent Fourier Transform (LatentFT)** framework, whose core idea is to apply the discrete Fourier transform to the **sequence of latent vectors** produced by a diffusion autoencoder, yielding a "latent spectrum". By randomly masking frequencies of the latent spectrum during training, the decoder is forced</description>
    </item>
    <item>
      <title>Elucidating the SNR-t Bias of Diffusion Probabilistic Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-elucidating-the-snr-t-bias-of-diffusion/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-elucidating-the-snr-t-bias-of-diffusion/</guid>
      <description>The core contribution of this paper is to identify and systematically analyze a fundamental issue in diffusion probabilistic models (DPMs): the SNR-timestep (SNR-t) bias. This bias refers to the mismatch between the actual SNR of a denoised sample at inference time and the SNR theoretically associated with its assigned timestep t; the misalignment arises because the strict coupling enforced during training is broken by accumulated errors at inference. Through detailed experiments (sliding-window tests, comparisons of the forward and reverse processes) the authors reveal</description>
    </item>
    <item>
      <title>ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-claricodec-optimising-neural-speech-codes-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-claricodec-optimising-neural-speech-codes-for/</guid>
      <description>This paper tackles the severe loss of speech intelligibility in extremely bandwidth-constrained scenarios (e.g. 200 bps) such as satellite and underwater communication. Conventional codecs aim at waveform reconstruction and, at ultra-low bitrates, spend precious bits on unnecessary acoustic detail rather than on the core semantic information. To this end,</description>
    </item>
    <item>
      <title>Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-dual-axis-generative-reward-model-toward-semantic/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-dual-axis-generative-reward-model-toward-semantic/</guid>
      <description>This paper addresses the core challenge of achieving human-like interaction with full-duplex spoken dialogue models (SDMs). Existing automated evaluation metrics remain superficial (e.g. behavioral statistics or turn-taking timing accuracy) and cannot provide a reliable reward signal for reinforcement learning, while human evaluation is expensive and hard to scale. To this end, the authors</description>
    </item>
    <item>
      <title>UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-unipase-a-generative-model-for-universal-speech/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-unipase-a-generative-model-for-universal-speech/</guid>
      <description>This paper addresses the core tension faced by generative models in universal speech enhancement (USE): high perceptual quality and low content hallucination are hard to achieve at the same time. The authors propose the UniPASE framework, which extends their earlier low-hallucination PASE model to handle distortions including noise, reverberation, packet loss, wind</description>
    </item>
  </channel>
</rss>
