<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>流式处理 on 语音/音频论文速递</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E6%B5%81%E5%BC%8F%E5%A4%84%E7%90%86/</link>
    <description>Recent content in 流式处理 on 语音/音频论文速递</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E6%B5%81%E5%BC%8F%E5%A4%84%E7%90%86/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Generative-First Neural Audio Autoencoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-generative-first-neural-audio-autoencoder/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-generative-first-neural-audio-autoencoder/</guid>
      <description>Music Generation | 8.5/10</description>
    </item>
    <item>
      <title>An Efficient Neural Network for Modeling Human Auditory Neurograms for Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-efficient-neural-network-for-modeling-human/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-efficient-neural-network-for-modeling-human/</guid>
      <description>Speech Enhancement | 7.0/10</description>
    </item>
    <item>
      <title>Chunk-Wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunk-wise-attention-transducers-for-fast-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunk-wise-attention-transducers-for-fast-and/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Chunkwise Aligners for Streaming Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ctc-did-ctc-based-arabic-dialect-identification/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ctc-did-ctc-based-arabic-dialect-identification/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Direct Simultaneous Translation Activation for Large Audio-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-simultaneous-translation-activation-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-simultaneous-translation-activation-for/</guid>
      <description>Speech Translation | 6.0/10</description>
    </item>
    <item>
      <title>Do we really need self-attention for streaming automatic speech recognition?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-we-really-need-self-attention-for-streaming/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-we-really-need-self-attention-for-streaming/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>EEND-SAA: Enrollment-Less Main Speaker Voice Activity Detection Using Self-Attention Attractors</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-eend-saa-enrollment-less-main-speaker-voice/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-eend-saa-enrollment-less-main-speaker-voice/</guid>
      <description>Voice Activity Detection | 7.5/10</description>
    </item>
    <item>
      <title>Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-entropy-guided-grvq-for-ultra-low-bitrate-neural/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-entropy-guided-grvq-for-ultra-low-bitrate-neural/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Equipping Large Language Model with Directional Speech Understanding Capabilities</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-equipping-large-language-model-with-directional/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-equipping-large-language-model-with-directional/</guid>
      <description>Speech Recognition, Speech Translation | 7.0/10</description>
    </item>
    <item>
      <title>FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fastenhancer-speed-optimized-streaming-neural/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fastenhancer-speed-optimized-streaming-neural/</guid>
      <description>Speech Enhancement | 8.5/10</description>
    </item>
    <item>
      <title>FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-focalcodec-stream-streaming-low-bitrate-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-focalcodec-stream-streaming-low-bitrate-speech/</guid>
      <description>Speech Coding | 8.0/10</description>
    </item>
    <item>
      <title>IBPCodec: A Low-Bitrate Lightweight Speech Codec With Inter-Band Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ibpcodec-a-low-bitrate-lightweight-speech-codec/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ibpcodec-a-low-bitrate-lightweight-speech-codec/</guid>
      <description>Speech Coding | 7.0/10</description>
    </item>
    <item>
      <title>Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-int-meanflow-few-step-speech-generation-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-int-meanflow-few-step-speech-generation-with/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Integrating Speaker Embeddings and LLM-Derived Semantic Representations for Streaming Speaker Diarization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-integrating-speaker-embeddings-and-llm-derived/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-integrating-speaker-embeddings-and-llm-derived/</guid>
      <description>Speaker Diarization | 6.5/10</description>
    </item>
    <item>
      <title>Lightweight Phoneme-Conditioned Bandwidth Extension for Body-Conducted Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lightweight-phoneme-conditioned-bandwidth/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lightweight-phoneme-conditioned-bandwidth/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Low-Bandwidth High-Fidelity Speech Transmission with Generative Latent Joint Source-Channel Coding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-low-bandwidth-high-fidelity-speech-transmission/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-low-bandwidth-high-fidelity-speech-transmission/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanvc-lightweight-and-streaming-zero-shot-voice/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-meanvc-lightweight-and-streaming-zero-shot-voice/</guid>
      <description>Voice Conversion | 7.5/10</description>
    </item>
    <item>
      <title>Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-online-register-for-dual-mode-self-supervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-online-register-for-dual-mode-self-supervised/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Phrased: Phrase Dictionary Biasing for Speech Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-phrased-phrase-dictionary-biasing-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-phrased-phrase-dictionary-biasing-for-speech/</guid>
      <description>Speech Translation | 7.5/10</description>
    </item>
    <item>
      <title>Real-Time Streaming MEL Vocoding with Generative Flow Matching</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-real-time-streaming-mel-vocoding-with-generative/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-real-time-streaming-mel-vocoding-with-generative/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/</guid>
      <description>Brain-Computer Interface | 8.0/10</description>
    </item>
    <item>
      <title>SpatialNet-Echo: Real-Time Acoustic Echo Cancellation via Integrated Narrow-Band and Cross-Band Processing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spatialnet-echo-real-time-acoustic-echo/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spatialnet-echo-real-time-acoustic-echo/</guid>
      <description>Speech Enhancement | 7.5/10</description>
    </item>
    <item>
      <title>Spike-Driven Low-Power Speech Bandwidth Extension</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spike-driven-low-power-speech-bandwidth-extension/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spike-driven-low-power-speech-bandwidth-extension/</guid>
      <description>Speech Enhancement | 8.0/10</description>
    </item>
    <item>
      <title>Str-DiffSep: Streamable Diffusion Model for Speech Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-str-diffsep-streamable-diffusion-model-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-str-diffsep-streamable-diffusion-model-for-speech/</guid>
      <description>Speech Separation | 7.5/10</description>
    </item>
    <item>
      <title>Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streaming-speech-recognition-with-decoder-only/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streaming-speech-recognition-with-decoder-only/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>SynaSpot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synaspot-a-lightweight-streaming-multi-modal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synaspot-a-lightweight-streaming-multi-modal/</guid>
      <description>Keyword Spotting | 7.5/10</description>
    </item>
    <item>
      <title>Syncspeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</guid>
      <description>Speech Synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-train-short-infer-long-speech-llm-enables-zero/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-train-short-infer-long-speech-llm-enables-zero/</guid>
      <description>Speaker Diarization | 9.0/10</description>
    </item>
    <item>
      <title>VChangeCodec: An Ultra Low-Complexity Neural Speech Codec with Built-In Voice Changer for Customized Real-Time Communication</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vchangecodec-an-ultra-low-complexity-neural/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vchangecodec-an-ultra-low-complexity-neural/</guid>
      <description>Voice Conversion, Speech Enhancement | 8.0/10</description>
    </item>
    <item>
      <title>VoXtream: Full-Stream Text-To-Speech With Extremely Low Latency</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxtream-full-stream-text-to-speech-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxtream-full-stream-text-to-speech-with/</guid>
      <description>Speech Synthesis | 8.5/10</description>
    </item>
    <item>
      <title>WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-hallo-live-real-time-streaming-joint-audio-video/</guid>
      <description>Audio-Visual | 8.5/10</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-magic-tts-fine-grained-controllable-speech/</guid>
      <description>Speech Synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-hierarchical-policy-optimization-for-simultaneous/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-hierarchical-policy-optimization-for-simultaneous/</guid>
      <description>Speech Translation | 7.5/10</description>
    </item>
    <item>
      <title>FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-fastturn-unifying-acoustic-and-streaming-semantic/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-fastturn-unifying-acoustic-and-streaming-semantic/</guid>
      <description>This paper tackles a core difficulty in full-duplex spoken dialogue systems: deciding with low latency and high accuracy whether the user has finished speaking (turn detection), and proposes the unified FastTurn framework. Its core method feeds fast partial semantic information from streaming CTC decoding, together with acoustic features extracted by a Conformer encoder, through adapters into a large language model (LLM) for reasoning, and finally fuses the acoustic and semantic features for turn prediction.</description>
    </item>
    <item>
      <title>Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</guid>
      <description>1. **Problem**: Training a unified ASR model that achieves both high-accuracy offline transcription and low-latency streaming recognition is highly challenging; conventional approaches degrade sharply at low latency. 2. **Core method**: A unified Transducer framework that combines chunked attention (with right context) and dynamic chunk convolution (DCConv) to serve both modes. The key innovation is a mode-consistency regularization loss…</description>
    </item>
    <item>
      <title>Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-towards-streaming-target-speaker-extraction-via/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-towards-streaming-target-speaker-extraction-via/</guid>
      <description>1. **Problem**: Existing generative target speaker extraction (TSE) methods (e.g., diffusion or autoregressive models) depend on global context and are hard to apply directly to real-time streaming; forcing the adaptation causes severe performance degradation. 2. **Core method**: The first autoregressive (AR) framework for streaming TSE, built on a "chunk-wise interleaved splicing paradigm" that splits the mixed speech into chunks…</description>
    </item>
    <item>
      <title>X-VC: Zero-shot Streaming Voice Conversion in Codec Space</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-x-vc-zero-shot-streaming-voice-conversion-in/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-x-vc-zero-shot-streaming-voice-conversion-in/</guid>
      <description>1. **Problem**: Zero-shot voice conversion must deliver both high-quality speaker transfer and low-latency streaming inference, a challenge not yet well solved. 2. **Core method**: The proposed X-VC system performs one-step conversion in the latent space of the pretrained SAC speech codec. At its core is a dual-conditioned acoustic converter that jointly processes the codec latents of the source speech and the frame-level mel…</description>
    </item>
    <item>
      <title>Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-towards-streaming-target-speaker-extraction-via/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-towards-streaming-target-speaker-extraction-via/</guid>
      <description>This paper addresses the core problem that generative target speaker extraction (TSE) models degrade severely in streaming real-time applications because they rely on global context. The authors propose the first streaming TSE framework built on an autoregressive language model (LauraGPT). Its key innovation, a "chunk-wise interleaved splicing paradigm", interleaves chunks of the mixed audio with the corresponding chunks of discrete target-speech codes as model input, strictly guaranteeing that inference…</description>
    </item>
    <item>
      <title>UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-uaf-a-unified-audio-front-end-llm-for-full-duplex/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-uaf-a-unified-audio-front-end-llm-for-full-duplex/</guid>
      <description>**Core contribution**: This paper presents UAF, the first unified audio front-end large model designed for full-duplex speech interaction. Breaking with the traditional cascaded front-end paradigm, it models voice activity detection (VAD), speaker recognition (SR), automatic speech recognition (ASR), turn detection (TD), and question answering (QA) as a single autoregressive sequence-prediction problem.  **Key method**: The model adopts…</description>
    </item>
    <item>
      <title>NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-nim4-asr-towards-efficient-robust-and/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-nim4-asr-towards-efficient-robust-and/</guid>
      <description>This paper presents NIM4-ASR, an efficient, robust, and customizable real-time speech recognition framework aimed at production environments. The work targets three challenges in deploying existing LLM-based ASR: 1) severe performance loss in lightweight models (limited down-scalability); 2) hallucinations under acoustically challenging conditions; 3) the lack of a production-ready hotword customization mechanism. To this end, the authors propose a principled multi-stage…</description>
    </item>
    <item>
      <title>MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-moshirag-asynchronous-knowledge-retrieval-for/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-moshirag-asynchronous-knowledge-retrieval-for/</guid>
      <description>This paper aims to fix the weak factuality of full-duplex speech language models (such as Moshi) without sacrificing their high interactivity. **Problem**: Full-duplex models can interrupt and respond in real time, but because their training data is far smaller in scale than text, their knowledge and factual accuracy are weak. **Method**: MoshiRAG, a modular framework, introduces into the Moshi model a special `&amp;lt;ret&amp;gt;` retrieval trigger token…</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-qwen35-omni-technical-report/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-qwen35-omni-technical-report/</guid>
      <description>Qwen3.5-Omni is an omni-modal large language model designed to unify understanding, reasoning, generation, and action. It **addresses** the limitations of existing models in real-time interaction, long-context audio-visual processing, streaming speech-generation stability, and multilingual support. **In terms of method**, it builds on the Thinker-Talker architecture, introduces Hybrid MoE for efficiency, and adopts explicit timestamps in place of sparse…</description>
    </item>
    <item>
      <title>An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/</guid>
      <description>This paper targets the core tension in real-time interactive speech synthesis between **high inference latency** and **degraded acoustic quality (especially high-frequency detail)**. Conventional pipelines rely on computationally heavy neural vocoders for waveform reconstruction, and acoustic models based on continuous regression tend to over-smooth the spectrum. To this end…</description>
    </item>
    <item>
      <title>MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-moshirag-asynchronous-knowledge-retrieval-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-moshirag-asynchronous-knowledge-retrieval-for/</guid>
      <description>This paper presents MoshiRAG, the first full-duplex speech language model with integrated retrieval-augmented generation. **The problem addressed** is that full-duplex speech models struggle with factual accuracy while maintaining real-time interactivity. **The core method** builds on the Moshi model, designing…</description>
    </item>
    <item>
      <title>X-VC: Zero-shot Streaming Voice Conversion in Codec Space</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-x-vc-zero-shot-streaming-voice-conversion-in/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-x-vc-zero-shot-streaming-voice-conversion-in/</guid>
      <description>This paper addresses the core challenge in zero-shot voice conversion of combining **high-fidelity speaker transfer** with **low-latency streaming inference**. The authors propose the **X-VC** system, whose key innovation is **performing one-step conversion in the latent space of a pretrained neural codec (SAC)**…</description>
    </item>
  </channel>
</rss>
