<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Autoregressive Models on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E8%87%AA%E5%9B%9E%E5%BD%92%E6%A8%A1%E5%9E%8B/</link>
    <description>Recent content in Autoregressive Models on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E8%87%AA%E5%9B%9E%E5%BD%92%E6%A8%A1%E5%9E%8B/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adaptive-rotary-steering-with-joint/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adaptive-rotary-steering-with-joint/</guid>
      <description>Speech Separation | 8.5/10</description>
    </item>
    <item>
      <title>Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aligning-language-models-for-lyric-to-melody/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aligning-language-models-for-lyric-to-melody/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-event-based-sequence-modeling-approach-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-event-based-sequence-modeling-approach-to/</guid>
      <description>Music Information Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>AR-BSNet: Towards Ultra-Low Complexity Autoregressive Target Speaker Extraction With Band-Split Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ar-bsnet-towards-ultra-low-complexity/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ar-bsnet-towards-ultra-low-complexity/</guid>
      <description>Speech Separation | 7.0/10</description>
    </item>
    <item>
      <title>BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridgecode-a-dual-speech-representation-paradigm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridgecode-a-dual-speech-representation-paradigm/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>Chunkwise Aligners for Streaming Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Compression meets Sampling: LZ78-SPA for Efficient Symbolic Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-compression-meets-sampling-lz78-spa-for-efficient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-compression-meets-sampling-lz78-spa-for-efficient/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Confidence-Guided Error Correction for Disordered Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-guided-error-correction-for-disordered/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-guided-error-correction-for-disordered/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-continuous-token-diffusion-for-speaker-referenced/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-continuous-token-diffusion-for-speaker-referenced/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>DisContSE: Single-Step Diffusion Speech Enhancement based on Joint Discrete and Continuous Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discontse-single-step-diffusion-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discontse-single-step-diffusion-speech/</guid>
      <description>Speech Enhancement | 8.5/10</description>
    </item>
    <item>
      <title>Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-diffusion-for-generative-modeling-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-diffusion-for-generative-modeling-of/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>DSRMS-TransUnet: A Decentralized Non-Shifted Transunet for Shallow Water Acoustic Source Range Estimation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dsrms-transunet-a-decentralized-non-shifted/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dsrms-transunet-a-decentralized-non-shifted/</guid>
      <description>Sound Source Localization | 8.0/10</description>
    </item>
    <item>
      <title>Etude: Piano Cover Generation with a Three-Stage Approach — Extract, Structuralize, and Decode</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-etude-piano-cover-generation-with-a-three-stage/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-etude-piano-cover-generation-with-a-three-stage/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>Frame-Stacked Local Transformers for Efficient Multi-Codebook Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-frame-stacked-local-transformers-for-efficient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-frame-stacked-local-transformers-for-efficient/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Gelina: Unified Speech and Gesture Synthesis Via Interleaved Token Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gelina-unified-speech-and-gesture-synthesis-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gelina-unified-speech-and-gesture-synthesis-via/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-Based TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hd-ppt-hierarchical-decoding-of-content-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hd-ppt-hierarchical-decoding-of-content-and/</guid>
      <description>Speech Synthesis | 8.0/10</description>
    </item>
    <item>
      <title>High-Fidelity Speech Enhancement Via Discrete Audio Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-high-fidelity-speech-enhancement-via-discrete/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-high-fidelity-speech-enhancement-via-discrete/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-autoregressive-modeling-of-multi-talker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-autoregressive-modeling-of-multi-talker/</guid>
      <description>Speech Recognition, Speech Translation | 7.0/10</description>
    </item>
    <item>
      <title>Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lattice-guided-consistency-regularization-of-dual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lattice-guided-consistency-regularization-of-dual/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanvc-lightweight-and-streaming-zero-shot-voice/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanvc-lightweight-and-streaming-zero-shot-voice/</guid>
      <description>Voice Conversion | 7.5/10</description>
    </item>
    <item>
      <title>MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Melos: Sentence-To-Section Training with Multi-Task Learning for LLM-Driven Song Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-melos-sentence-to-section-training-with-multi/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-melos-sentence-to-section-training-with-multi/</guid>
      <description>Music Generation | 6.5/10</description>
    </item>
    <item>
      <title>Modeling Strategies For Speech Enhancement in The Latent Space of a Neural Audio Codec</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-modeling-strategies-for-speech-enhancement-in-the/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-modeling-strategies-for-speech-enhancement-in-the/</guid>
      <description>Speech Enhancement | 8.0/10</description>
    </item>
    <item>
      <title>Pianoroll-Event: A Novel Score Representation for Symbolic Music</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pianoroll-event-a-novel-score-representation-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pianoroll-event-a-novel-score-representation-for/</guid>
      <description>Music Generation | 6.5/10</description>
    </item>
    <item>
      <title>Principled Coarse-Grained Acceptance For Speculative Decoding In Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-principled-coarse-grained-acceptance-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-principled-coarse-grained-acceptance-for/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Retrieval-Based Speculative Decoding For Autoregressive Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-retrieval-based-speculative-decoding-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-retrieval-based-speculative-decoding-for/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>S2Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s2voice-style-aware-autoregressive-modeling-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-s2voice-style-aware-autoregressive-modeling-with/</guid>
      <description>Singing Voice Conversion | 7.0/10</description>
    </item>
    <item>
      <title>SLM-SS: Speech Language Model for Generative Speech Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-slm-ss-speech-language-model-for-generative/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-slm-ss-speech-language-model-for-generative/</guid>
      <description>Speech Separation | 7.5/10</description>
    </item>
    <item>
      <title>Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization Via Neural Audio Codec and Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stream-voice-anon-enhancing-utility-of-real-time/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stream-voice-anon-enhancing-utility-of-real-time/</guid>
      <description>Speaker Anonymization | 7.0/10</description>
    </item>
    <item>
      <title>SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphonygen-3d-hierarchical-orchestral-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-symphonygen-3d-hierarchical-orchestral-generation/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Syncspeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-mimi-a-transformer-based-mimi-decoder-for-real/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-mimi-a-transformer-based-mimi-decoder-for-real/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-text2midi-inferalign-improving-symbolic-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-text2midi-inferalign-improving-symbolic-music/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>Time-Shifted Token Scheduling for Symbolic Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-time-shifted-token-scheduling-for-symbolic-music/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-time-shifted-token-scheduling-for-symbolic-music/</guid>
      <description>Music Generation | 8.5/10</description>
    </item>
    <item>
      <title>Tokenchain: A Discrete Speech Chain via Semantic Token Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tokenchain-a-discrete-speech-chain-via-semantic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tokenchain-a-discrete-speech-chain-via-semantic/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-via-score-to-performance-efficient-human/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-via-score-to-performance-efficient-human/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>VoXtream: Full-Stream Text-To-Speech With Extremely Low Latency</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxtream-full-stream-text-to-speech-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxtream-full-stream-text-to-speech-with/</guid>
      <description>Speech Synthesis | 8.5/10</description>
    </item>
    <item>
      <title>When Noise Lowers the Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-noise-lowers-the-loss-rethinking-likelihood/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-noise-lowers-the-loss-rethinking-likelihood/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-an-event-based-sequence-modeling-approach-to/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-an-event-based-sequence-modeling-approach-to/</guid>
      <description>Music Understanding | 7.5/10</description>
    </item>
    <item>
      <title>Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-opening-the-design-space-two-years-of-performance/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-opening-the-design-space-two-years-of-performance/</guid>
      <description>Music Generation | 6.5/10</description>
    </item>
    <item>
      <title>Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-text-to-speech-with-chain-of-details-modeling/</guid>
      <description>1. **Problem**: In existing discrete-token TTS models, the "coarse-to-fine" generation paradigm amounts mainly to converting semantic tokens into acoustic tokens, without explicitly modeling the temporal dynamics inherent in speech. 2. **Core method**: Proposes the Chain-of-Details (CoD) framework, which decomposes speech generation into multiple</description>
    </item>
    <item>
      <title>Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-towards-streaming-target-speaker-extraction-via/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-towards-streaming-target-speaker-extraction-via/</guid>
      <description>1.  **Problem addressed**: Existing generative target speaker extraction (TSE) methods (e.g., diffusion or autoregressive models) rely on global context, making them hard to apply directly to real-time streaming scenarios; forcing them into streaming causes severe performance degradation. 2.  **Core method**: Proposes the first autoregressive (AR) framework for streaming TSE, built around a "chunk-wise interleaved splicing paradigm" that splits the speech mixture into chunks</description>
    </item>
    <item>
      <title>BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-beat-tokenizing-and-generating-symbolic-music-by/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-beat-tokenizing-and-generating-symbolic-music-by/</guid>
      <description>To address the problem that mainstream event-based tokenization for symbolic music generation handles temporal regularity only implicitly, forcing the model to additionally learn the time grid, this paper proposes **BEAT**, a novel grid-based tokenization framework. Its core idea is to discretize music uniformly in time into beats as the basic unit, and within each beat, each pitch</description>
    </item>
    <item>
      <title>Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-towards-streaming-target-speaker-extraction-via/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-towards-streaming-target-speaker-extraction-via/</guid>
      <description>This paper tackles the core problem that generative target speaker extraction (TSE) models degrade severely in streaming real-time applications because they depend on global context. The authors propose the first streaming TSE framework built on an autoregressive language model (LauraGPT). Its core innovation is a "chunk-wise interleaved splicing paradigm": chunks of the audio mixture are interleaved with the corresponding chunks of discrete target-speech codes as model input, which strictly guarantees the</description>
    </item>
    <item>
      <title>MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mimiclm-zero-shot-voice-imitation-through/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mimiclm-zero-shot-voice-imitation-through/</guid>
      <description>This paper targets the core bottleneck of zero-shot voice imitation: the scarcity of high-quality parallel training data. Conventional approaches either rely on complex disentanglement architectures or use synthesized speech as the training target, capping output quality at the capability of the synthesis system. The authors propose a new framework, **MimicLM**, whose core innovation is a **"role-swapping" data construction strategy**: TTS-generated speech serves as the</description>
    </item>
    <item>
      <title>Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-video-robin-autoregressive-diffusion-planning-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-video-robin-autoregressive-diffusion-planning-for/</guid>
      <description>To address existing video-to-music (V2M) generation models' lack of fine-grained control over a creator's intent, such as style and theme, this paper proposes Video-Robin, a video soundtrack framework that incorporates text prompts. Its core method decouples generation into two stages: first, a multimodal autoregressive planning head (AR-Head) integrates video frames and text prompts through a semantic language model, finite scalar quantization (FSQ), and residual</description>
    </item>
    <item>
      <title>Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-towards-fine-grained-temporal-perception-post/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-towards-fine-grained-temporal-perception-post/</guid>
      <description>This paper aims to address the weakness of large audio-language models (LALMs) in fine-grained temporal perception, such as precisely localizing the onset and offset times of sound events. The authors propose the **TimePro-RL** framework, whose core is a two-step strategy: first, proposing the **audio-side time prompt (AS</description>
    </item>
  </channel>
</rss>
