<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>语音合成 on 语音/音频论文速递</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E8%AF%AD%E9%9F%B3%E5%90%88%E6%88%90/</link>
    <description>Recent content in 语音合成 on 语音/音频论文速递</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E8%AF%AD%E9%9F%B3%E5%90%88%E6%88%90/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>ARCHI-TTS: A Flow-Matching-Based Text-to-Speech Model with Self-Supervised Semantic Aligner and Accelerated Inference</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-archi-tts-a-flow-matching-based-text-to-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-archi-tts-a-flow-matching-based-text-to-speech/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Asynchrony-Aware Decoupled Multimodal Control for Cued Speech Video Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-asynchrony-aware-decoupled-multimodal-control-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-asynchrony-aware-decoupled-multimodal-control-for/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiogen-omni-a-unified-multimodal-diffusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audiogen-omni-a-unified-multimodal-diffusion/</guid>
      <description>音频生成 | 7.5/10</description>
    </item>
    <item>
      <title>Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-global-emotion-fine-grained-emotional/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-global-emotion-fine-grained-emotional/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridgecode-a-dual-speech-representation-paradigm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridgecode-a-dual-speech-representation-paradigm/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Combining Multi-Order Attention and Multi-Resolution Discriminator for High-Fidelity Neural Vocoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-multi-order-attention-and-multi/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-combining-multi-order-attention-and-multi/</guid>
      <description>语音合成 | 6.5/10</description>
    </item>
    <item>
      <title>Confidence-Based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-based-filtering-for-speech-dataset/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-based-filtering-for-speech-dataset/</guid>
      <description>语音增强 | 6.5/10</description>
    </item>
    <item>
      <title>Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-continuous-token-diffusion-for-speaker-referenced/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-continuous-token-diffusion-for-speaker-referenced/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>CosyAccent: Duration-Controllable Accent Normalization using Source-Synthesis Training Data</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cosyaccent-duration-controllable-accent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cosyaccent-duration-controllable-accent/</guid>
      <description>语音转换 | 7.8/10</description>
    </item>
    <item>
      <title>Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-f5-tts-towards-language-agnostic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-lingual-f5-tts-towards-language-agnostic/</guid>
      <description>语音克隆 | 7.5/10</description>
    </item>
    <item>
      <title>DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-daien-tts-disentangled-audio-infilling-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-daien-tts-disentangled-audio-infilling-for/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-deep-dubbing-end-to-end-auto-audiobook-system/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-deep-dubbing-end-to-end-auto-audiobook-system/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Direct Preference Optimization For Speech Autoregressive Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-preference-optimization-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-preference-optimization-for-speech/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-diffusion-for-generative-modeling-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-diffusion-for-generative-modeling-of/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dmp-tts-disentangled-multi-modal-prompting-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dmp-tts-disentangled-multi-modal-prompting-for/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Do You Hear What I Mean? Quantifying the Instruction-Perception GAP in Instruction-Guided Expressive Text-to-Speech Systems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-you-hear-what-i-mean-quantifying-the/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-you-hear-what-i-mean-quantifying-the/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>ECSA: Dual-Branch Emotion Compensation for Emotion-Consistent Speaker Anonymization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ecsa-dual-branch-emotion-compensation-for-emotion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ecsa-dual-branch-emotion-compensation-for-emotion/</guid>
      <description>语音匿名化 | 8.5/10</description>
    </item>
    <item>
      <title>EMG-to-Speech with Fewer Channels</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emg-to-speech-with-fewer-channels/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emg-to-speech-with-fewer-channels/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emilia-nv-a-non-verbal-speech-dataset-with-word/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emilia-nv-a-non-verbal-speech-dataset-with-word/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emorl-tts-reinforcement-learning-for-fine-grained/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emorl-tts-reinforcement-learning-for-fine-grained/</guid>
      <description>语音合成 | 8.5/10</description>
    </item>
    <item>
      <title>EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emoshift-lightweight-activation-steering-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emoshift-lightweight-activation-steering-for/</guid>
      <description>语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Emotion-Aligned Generation in Diffusion Text to Speech Models Via Preference-Guided Optimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotion-aligned-generation-in-diffusion-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotion-aligned-generation-in-diffusion-text-to/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Emotional Damage: Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotional-damage-investigating-safety/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotional-damage-investigating-safety/</guid>
      <description>音频安全 | 7.5/10</description>
    </item>
    <item>
      <title>Emotional Dimension Control in Language Model-Based Text-To-Speech: Spanning a Broad Spectrum of Human Emotions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotional-dimension-control-in-language-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emotional-dimension-control-in-language-model/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-entropy-guided-grvq-for-ultra-low-bitrate-neural/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-entropy-guided-grvq-for-ultra-low-bitrate-neural/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Erasing Your Voice Before it’s Heard: Training-Free Speaker Unlearning for Zero-Shot Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-erasing-your-voice-before-its-heard-training-free/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-erasing-your-voice-before-its-heard-training-free/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>FED-PISA: Federated Voice Cloning Via Personalized Identity-Style Adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fed-pisa-federated-voice-cloning-via-personalized/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-fed-pisa-federated-voice-cloning-via-personalized/</guid>
      <description>语音克隆 | 8.0/10</description>
    </item>
    <item>
      <title>Frame-Stacked Local Transformers for Efficient Multi-Codebook Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-frame-stacked-local-transformers-for-efficient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-frame-stacked-local-transformers-for-efficient/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-hallucination-to-articulation-language-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-hallucination-to-articulation-language-model/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Gelina: Unified Speech and Gesture Synthesis Via Interleaved Token Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gelina-unified-speech-and-gesture-synthesis-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gelina-unified-speech-and-gesture-synthesis-via/</guid>
      <description>语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>GLA-GRAD&#43;&#43;: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gla-grad-an-improved-griffin-lim-guided-diffusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gla-grad-an-improved-griffin-lim-guided-diffusion/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Group Relative Policy Optimization for Text-to-Speech with Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-group-relative-policy-optimization-for-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-group-relative-policy-optimization-for-text-to/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-Based TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hd-ppt-hierarchical-decoding-of-content-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hd-ppt-hierarchical-decoding-of-content-and/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Hierarchical Discrete Flow Matching For Multi-Codebook Codec-Based Text-To-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-discrete-flow-matching-for-multi/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hierarchical-discrete-flow-matching-for-multi/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-how-to-label-resynthesized-audio-the-dual-role-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-how-to-label-resynthesized-audio-the-dual-role-of/</guid>
      <description>音频深度伪造检测 | 7.5/10</description>
    </item>
    <item>
      <title>IBPCodec : A Low-Bitrate Lightweight Speech Codec With Inter-Band Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ibpcodec-a-low-bitrate-lightweight-speech-codec/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ibpcodec-a-low-bitrate-lightweight-speech-codec/</guid>
      <description>语音编码 | 7.0/10</description>
    </item>
    <item>
      <title>ICASSP 2026 - 语音合成 Paper List</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-061/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-061/</guid>
      <description>63 ICASSP 2026 papers in the 语音合成 category</description>
    </item>
    <item>
      <title>InstructAudio: Unified Speech and Music Generation with Natural Language Instruction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instructaudio-unified-speech-and-music-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-instructaudio-unified-speech-and-music-generation/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-int-meanflow-few-step-speech-generation-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-int-meanflow-few-step-speech-generation-with/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Learning Vocal-Tract Area And Radiation With A Physics-Informed Webster Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-vocal-tract-area-and-radiation-with-a/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-vocal-tract-area-and-radiation-with-a/</guid>
      <description>歌唱语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Leveraging Text-to-Speech and Voice Conversion as Data Augmentation for Alzheimer&#39;s Disease Detection from Spontaneous Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-text-to-speech-and-voice-conversion-as/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-text-to-speech-and-voice-conversion-as/</guid>
      <description>语音生物标志物 | 7.0/10</description>
    </item>
    <item>
      <title>LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lp-cfm-perceptual-invariance-aware-conditional/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lp-cfm-perceptual-invariance-aware-conditional/</guid>
      <description>语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Marco-Voice: A Unified Framework for Expressive Speech Synthesis with Voice Cloning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-marco-voice-a-unified-framework-for-expressive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-marco-voice-a-unified-framework-for-expressive/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-measuring-prosody-diversity-in-zero-shot-tts-a/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-measuring-prosody-diversity-in-zero-shot-tts-a/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/</guid>
      <description>语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Mind Your [m]S, Cross Your [t]S: a Large-Scale Phonetic Analysis of Speech Reproduction in Modern Speech Generators</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mind-your-ms-cross-your-ts-a-large-scale-phonetic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mind-your-ms-cross-your-ts-a-large-scale-phonetic/</guid>
      <description>语音伪造检测 | 7.0/10</description>
    </item>
    <item>
      <title>MirrorTalk: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mirrortalk-forging-personalized-avatars-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mirrortalk-forging-personalized-avatars-via/</guid>
      <description>语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-intra-speaker-variability-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-intra-speaker-variability-in/</guid>
      <description>说话人日志 | 7.0/10</description>
    </item>
    <item>
      <title>NCF-TTS: Enhancing Flow Matching Based Text-To-Speech with Neighborhood Consistency Flow</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ncf-tts-enhancing-flow-matching-based-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ncf-tts-enhancing-flow-matching-based-text-to/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Neuromamba: Adaptive Frequency Filtering with a Pyramid Mamba for sEEG-driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-neuromamba-adaptive-frequency-filtering-with-a/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-neuromamba-adaptive-frequency-filtering-with-a/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-no-verifiable-reward-for-prosody-toward/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-no-verifiable-reward-for-prosody-toward/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Optimizing Speech Language Models for Acoustic Consistency</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-optimizing-speech-language-models-for-acoustic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-optimizing-speech-language-models-for-acoustic/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>OV-INSTRUCTTTS: Towards Open-Vocabulary Instruct Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ov-instructtts-towards-open-vocabulary-instruct/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ov-instructtts-towards-open-vocabulary-instruct/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pfluxtts-hybrid-flow-matching-tts-with-robust/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pfluxtts-hybrid-flow-matching-tts-with-robust/</guid>
      <description>语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Phonological Tokenizer: Prosody-Aware Phonetic Token Via Multi-Objective Fine-Tuning with Differentiable K-Means</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-phonological-tokenizer-prosody-aware-phonetic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-phonological-tokenizer-prosody-aware-phonetic/</guid>
      <description>语音表示学习 | 8.0/10</description>
    </item>
    <item>
      <title>Praxy Voice: Voice-Prompt Recovery &#43; BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-praxy-voice-voice-prompt-recovery-bups-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-praxy-voice-voice-prompt-recovery-bups-for/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>Principled Coarse-Grained Acceptance For Speculative Decoding In Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-principled-coarse-grained-acceptance-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-principled-coarse-grained-acceptance-for/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prosody-guided-harmonic-attention-for-phase/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prosody-guided-harmonic-attention-for-phase/</guid>
      <description>语音合成 | 8.0/10</description>
    </item>
    <item>
      <title>PRSA: Preventing Malicious Speaker Recognition and Speech Synthesis Simultaneously with Adversarial Examples</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prsa-preventing-malicious-speaker-recognition-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-prsa-preventing-malicious-speaker-recognition-and/</guid>
      <description>语音匿名化 | 7.0/10</description>
    </item>
    <item>
      <title>PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-psp-an-interpretable-per-dimension-accent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-psp-an-interpretable-per-dimension-accent/</guid>
      <description>基准测试 | 7.5/10</description>
    </item>
    <item>
      <title>PSTalker: Realistic 3D Talking Head Synthesis via a Semantic-Aware Audio-Driven Point-Based Shape</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pstalker-realistic-3d-talking-head-synthesis-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pstalker-realistic-3d-talking-head-synthesis-via/</guid>
      <description>说话人合成 | 7.5/10</description>
    </item>
    <item>
      <title>QFOCUS: Controllable Synthesis for Automated Speech Stress Editing to Deliver Human-Like Emphatic Intent</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-qfocus-controllable-synthesis-for-automated/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-qfocus-controllable-synthesis-for-automated/</guid>
      <description>语音合成 | 7.5/10</description>
    </item>
    <item>
      <title>Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-quantifying-speaker-embedding-phonological-rule/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-quantifying-speaker-embedding-phonological-rule/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Real-Time Streaming MEL Vocoding with Generative Flow Matching</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-real-time-streaming-mel-vocoding-with-generative/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-real-time-streaming-mel-vocoding-with-generative/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Residual Tokens Enhance Masked Autoencoders for Speech Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-residual-tokens-enhance-masked-autoencoders-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-residual-tokens-enhance-masked-autoencoders-for/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Retrieval-Based Speculative Decoding For Autoregressive Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-retrieval-based-speculative-decoding-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-retrieval-based-speculative-decoding-for/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-roco-robust-code-for-fast-and-effective-proactive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-roco-robust-code-for-fast-and-effective-proactive/</guid>
<description>Audio security | 7.5/10</description>
    </item>
    <item>
      <title>RRPO: Robust Reward Policy Optimization for LLM-Based Emotional TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rrpo-robust-reward-policy-optimization-for-llm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rrpo-robust-reward-policy-optimization-for-llm/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SFM-TTS: Lightweight and Rapid Speech Synthesis with Flexible Shortcut Flow Matching</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sfm-tts-lightweight-and-rapid-speech-synthesis/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sfm-tts-lightweight-and-rapid-speech-synthesis/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-Scale Dataset Cleansing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sidon-fast-and-robust-open-source-multilingual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sidon-fast-and-robust-open-source-multilingual/</guid>
<description>Speech enhancement | 8.5/10</description>
    </item>
    <item>
      <title>SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sp-mcqa-evaluating-intelligibility-of-tts-beyond/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sp-mcqa-evaluating-intelligibility-of-tts-beyond/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spade-structured-pruning-and-adaptive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spade-structured-pruning-and-adaptive/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SPAM: Style Prompt Adherence Metric for Prompt-Based TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spam-style-prompt-adherence-metric-for-prompt/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spam-style-prompt-adherence-metric-for-prompt/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speech-quality-based-localization-of-low-quality/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-speech-quality-based-localization-of-low-quality/</guid>
<description>Speech quality assessment | 7.0/10</description>
    </item>
    <item>
      <title>STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</guid>
<description>Speech recognition | 8.0/10</description>
    </item>
    <item>
      <title>Syncspeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-syncspeech-efficient-and-low-latency-text-to/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synparaspeech-automated-synthesis-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synparaspeech-automated-synthesis-of/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Synthetic yet Striking? Assessing Vocal Charisma in TTS via Perceptual and Algorithmic Measures</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synthetic-yet-striking-assessing-vocal-charisma/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synthetic-yet-striking-assessing-vocal-charisma/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>T-Cache: Fast Inference For Masked Generative Transformer-Based TTS Via Prompt-Aware Feature Caching</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-cache-fast-inference-for-masked-generative/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-cache-fast-inference-for-masked-generative/</guid>
<description>Speech synthesis | 9.0/10</description>
    </item>
    <item>
      <title>T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-mimi-a-transformer-based-mimi-decoder-for-real/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-mimi-a-transformer-based-mimi-decoder-for-real/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>TAGARELA - A Portuguese Speech Dataset from Podcasts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</guid>
<description>Speech recognition, speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-task-vector-in-tts-toward-emotionally-expressive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-task-vector-in-tts-toward-emotionally-expressive/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Framework for Ü-Tsang, Amdo and Kham Speech Dataset Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tmd-tts-a-unified-tibetan-multi-dialect-text-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tmd-tts-a-unified-tibetan-multi-dialect-text-to/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Training Flow Matching Models with Reliable Labels via Self-Purification</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-flow-matching-models-with-reliable/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-training-flow-matching-models-with-reliable/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Understanding the Strengths and Weaknesses of SSL Models for Audio Deepfake Model Attribution</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-understanding-the-strengths-and-weaknesses-of-ssl/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-understanding-the-strengths-and-weaknesses-of-ssl/</guid>
<description>Audio deepfake detection | 7.0/10</description>
    </item>
    <item>
      <title>VividTalker: A Modular Framework for Expressive 3D Talking Avatars with Controllable Gaze and Blink</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vividtalker-a-modular-framework-for-expressive-3d/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vividtalker-a-modular-framework-for-expressive-3d/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxmorph-scalable-zero-shot-voice-identity/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxmorph-scalable-zero-shot-voice-identity/</guid>
<description>Voice cloning | 9.0/10</description>
    </item>
    <item>
      <title>VoXtream: Full-Stream Text-To-Speech With Extremely Low Latency</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxtream-full-stream-text-to-speech-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voxtream-full-stream-text-to-speech-with/</guid>
<description>Speech synthesis | 8.5/10</description>
    </item>
    <item>
      <title>Wave-Trainer-Fit: Neural Vocoder With Trainable Prior And Fixed-Point Iteration Towards High-Quality Speech Generation From SSL Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wave-trainer-fit-neural-vocoder-with-trainable/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wave-trainer-fit-neural-vocoder-with-trainable/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Wavenext 2: Convnext-Based Fast Neural Vocoders with Residual Denoising and Sub-Modeling for Gan And Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavenext-2-convnext-based-fast-neural-vocoders/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wavenext-2-convnext-based-fast-neural-vocoders/</guid>
<description>Speech synthesis | 9.0/10</description>
    </item>
    <item>
      <title>When Voice Matters: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-voice-matters-a-controlled-study-of-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-voice-matters-a-controlled-study-of-audio/</guid>
<description>Model evaluation | 7.0/10</description>
    </item>
    <item>
      <title>ZSV2C-MLLM: Zero-Shot Visual Voice Cloning Via Multimodal Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-zsv2c-mllm-zero-shot-visual-voice-cloning-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-zsv2c-mllm-zero-shot-visual-voice-cloning-via/</guid>
<description>Voice cloning | 6.5/10</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-magic-tts-fine-grained-controllable-speech/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-talker-t2av-joint-talking-audio-video-generation/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-tts-prism-a-perceptual-reasoning-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-tts-prism-a-perceptual-reasoning-and/</guid>
<description>Speech quality assessment | 7.5/10</description>
    </item>
    <item>
      <title>UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-unisonate-a-unified-model-for-speech-music-and/</guid>
<description>Audio generation | 8.5/10</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25-magic-tts-fine-grained-controllable-speech/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Speech/Audio Paper Digest 2026-04-25</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-25/</guid>
<description>2 speech/AI papers analyzed in total</description>
    </item>
    <item>
      <title>ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-atrie-adaptive-tuning-for-robust-inference-and/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-atrie-adaptive-tuning-for-robust-inference-and/</guid>
<description>Speech synthesis | 7.0/10</description>
    </item>
    <item>
      <title>MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-magic-tts-fine-grained-controllable-speech/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-magic-tts-fine-grained-controllable-speech/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-preferences-of-a-voice-first-nation-large-scale/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-preferences-of-a-voice-first-nation-large-scale/</guid>
<description>Speech synthesis | 7.5/10</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-qwen35-omni-technical-report/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-qwen35-omni-technical-report/</guid>
<description>This paper introduces Qwen3.5-Omni, an omni-modal large language model supporting text, image, audio, and audio-video inputs. To address the shortcomings of existing models in real-time interaction, cross-modal reasoning, and tool use, its core approach adopts a "Thinker-Talker" architecture and introduces a Mixture-of-Experts (MoE) design for efficiency. Compared with its predecessor, the main innovations are: 1) model scale extended to hundreds of billions</description>
    </item>
    <item>
      <title>SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-speechparaling-bench-a-comprehensive-benchmark/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-speechparaling-bench-a-comprehensive-benchmark/</guid>
<description>1.  **Problem**: Evaluation of large audio language models' ability to generate and understand paralinguistics (e.g., emotion, tone, timbre) suffers from incomplete feature coverage and subjective, non-scalable evaluation methods. 2.  **Method**: Proposes SpeechParaling-Bench, a comprehensive benchmark of over 1,000 Chinese-English parallel speech queries covering more than 100 fine-grained paralinguistic features. The benchmark</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-text-to-speech-with-chain-of-details-modeling/</guid>
<description>1. **Problem**: In existing discrete-token TTS models, the "coarse-to-fine" generation paradigm is realized mainly as a conversion from semantic tokens to acoustic tokens, with no explicit modeling of the temporal dynamics inherent in speech. 2. **Core method**: Proposes the Chain-of-Details (CoD) framework, which decomposes speech generation into multiple</description>
    </item>
    <item>
      <title>ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-atrie-adaptive-tuning-for-robust-inference-and/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-atrie-adaptive-tuning-for-robust-inference-and/</guid>
<description>To address the difficulty existing speech synthesis systems have in simultaneously maintaining persona identity consistency and accurate emotional expression when generating persona-driven, emotionally rich speech, this paper proposes the ATRIE framework. Its core is the **Persona-Prosody Dual-Track (P2-DT) architecture**, which decouples speech generation into a static **timbre track** (preserving the identity anchor via scalar quantization) and a dynamic **prosody</description>
    </item>
    <item>
      <title>NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-nvbench-a-benchmark-for-speech-synthesis-with-non/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-nvbench-a-benchmark-for-speech-synthesis-with-non/</guid>
<description>This paper addresses a key but overlooked problem in text-to-speech (TTS): how to standardize evaluation of a system's ability to generate non-verbal vocalizations (NVVs, e.g., laughter, sighs). The authors propose **NVBench**, a bilingual (English/Chinese) benchmark built on a **unified taxonomy of 45 NVV classes**. Its core methods include: 1) constructing a high-quality, balanced evaluation set of 50 examples per class, 4,500 in total</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</guid>
<description>This technical report gives a comprehensive introduction to Qwen3.5-Omni, an omni-modal large language model that unifies understanding and generation of text, image, audio, and audio-video content. **The problem addressed** is the limitations of existing models in real-time interaction, cross-modal reasoning, and autonomous agent behavior. **The approach** builds on a "Thinker-Talker" architecture with several key innovations: 1) both the Thinker and the Talker adopt hybrid attention</description>
    </item>
    <item>
      <title>Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/</guid>
<description>For the text-to-speech (TTS) task, this paper proposes a new framework named Chain-of-Details (CoD). **The problem addressed** is that existing TTS methods inadequately model the temporal dynamics of speech generation (the progressive process from coarse timing to fine acoustic detail). **The approach** decomposes speech generation into multiple stages of increasing temporal resolution, at each stage using</description>
    </item>
    <item>
      <title>MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mint-bench-a-comprehensive-multilingual-benchmark/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-mint-bench-a-comprehensive-multilingual-benchmark/</guid>
<description>This paper addresses the lack of systematic evaluation tools for instruction-following text-to-speech (TTS). Current evaluations suffer from incomplete coverage, coarse diagnostic granularity, and weak multilingual support. The authors therefore propose **MINT-Bench**, a comprehensive multilingual benchmark. Its core methods include: 1) a **hierarchical multi-axis taxonomy** based on 10 atomic acoustic attributes, systematically organizing instructions from simple to complex</description>
    </item>
    <item>
      <title>AST: Adaptive, Seamless, and Training-Free Precise Speech Editing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ast-adaptive-seamless-and-training-free-precise/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ast-adaptive-seamless-and-training-free-precise/</guid>
<description>To address existing speech-editing methods' reliance on task-specific training and their poor temporal consistency in unedited regions, this paper proposes AST (Adaptive, Seamless, and Training-free), a precise speech-editing framework built on a pretrained AM-FM (autoregressive + flow-matching) paradigm TTS model. AST first inverts the original speech into the latent space via an inverse Euler ODE solver</description>
    </item>
    <item>
      <title>Hierarchical Codec Diffusion for Video-to-Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</guid>
<description>Targeting the visual-speech modality information asymmetry in Video-to-Speech (VTS) generation, this paper argues that existing methods ignore the hierarchical structure of speech from coarse-grained semantics to fine-grained prosody, so visual conditioning cannot be precisely aligned with the speech representation. The authors therefore propose HiCoDiT (Hierarchical Codec Diffusion Transformer),</description>
    </item>
    <item>
      <title>PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ps-tts-phonetic-synchronization-in-text-to-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-ps-tts-phonetic-synchronization-in-text-to-speech/</guid>
<description>This paper tackles the difficulty in automated dubbing (AD) of synchronizing target and source speech in both duration and lip shape. Its core contribution is a two-stage text-rewriting method integrated into a TTS system: first, **isochrony** rewriting via a language model ensures the target speech duration matches the source; second, **phonetic synchronization (PS)** is introduced, using dynamic time warping (DTW) and vowel distances learned from training data,</description>
    </item>
    <item>
      <title>An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/</guid>
<description>This paper addresses the core tension in real-time interactive speech synthesis between **high inference latency** and **fragile acoustic quality (especially high-frequency detail)**. Traditional pipelines rely on computationally intensive neural vocoders for waveform reconstruction, and acoustic models based on continuous regression tend to over-smooth the spectrum.</description>
    </item>
  </channel>
</rss>
