端到端 on 语音/音频论文速递

端到端 on 语音/音频论文速递 https://nanless.github.io/audio-paper-digest-blog/tags/%E7%AB%AF%E5%88%B0%E7%AB%AF/ Recent content in 端到端 on 语音/音频论文速递 Hugo zh-cn Wed, 29 Apr 2026 00:00:00 +0000 A Speech-Driven Paradigm for Physics-Informed Modeling of Coupled Micro-Speakers https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-speech-driven-paradigm-for-physics-informed/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-speech-driven-paradigm-for-physics-informed/ 音频生成 | 7.0/10 Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adapting-diarization-conditioned-whisper-for-end/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adapting-diarization-conditioned-whisper-for-end/ 语音识别 | 7.5/10 Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-llm-based-multi-channel-multi-speaker/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-llm-based-multi-channel-multi-speaker/ 语音识别 | 7.5/10 ALMA-Chor: Leveraging Audio-Lyric Alignment with Mamba for Chorus Detection https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-alma-chor-leveraging-audio-lyric-alignment-with/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-alma-chor-leveraging-audio-lyric-alignment-with/ 音乐信息检索 | 7.0/10 An End-to-End Multimodal System for Subtitle Recognition and Chinese-Japanese Translation in Short Dramas https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-end-to-end-multimodal-system-for-subtitle/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-end-to-end-multimodal-system-for-subtitle/ 多模态模型 | 7.0/10 An Envelope Separation Aided Multi-Task Learning Model for Blind Source Counting and Localization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-envelope-separation-aided-multi-task-learning/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-envelope-separation-aided-multi-task-learning/ 声源定位 | 6.5/10 Audio Deepfake Detection at the First Greeting: "Hi!" https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-deepfake-detection-at-the-first-greeting-hi/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-deepfake-detection-at-the-first-greeting-hi/ 音频深度伪造检测 | 7.5/10 Audio-to-Score Jazz Solo Transcription with the Rhythm Perceiver https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-to-score-jazz-solo-transcription-with-the/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-to-score-jazz-solo-transcription-with-the/ 音乐信息检索 | 7.5/10 Auditory-Inspired Transformer for Binaural Speech Enhancement and Spatial Cue Preservation https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auditory-inspired-transformer-for-binaural-speech/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auditory-inspired-transformer-for-binaural-speech/ 语音增强 | 7.0/10 CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-calm-joint-contextual-acoustic-linguistic/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-calm-joint-contextual-acoustic-linguistic/ 语音识别 | 7.5/10 Chunk-Wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunk-wise-attention-transducers-for-fast-and/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunk-wise-attention-transducers-for-fast-and/ 语音识别 | 7.5/10 Chunkwise Aligners for Streaming Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/ 语音识别 | 7.5/10 Content Anonymization for Privacy in Long-Form Audio https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-anonymization-for-privacy-in-long-form/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-anonymization-for-privacy-in-long-form/ 语音匿名化 | 7.5/10 Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-deep-dubbing-end-to-end-auto-audiobook-system/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-deep-dubbing-end-to-end-auto-audiobook-system/ 语音合成 | 7.5/10 Direct Transfer of Prosody in Speech-to-speech Translation using Disentangled Speech Tokens https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-transfer-of-prosody-in-speech-to-speech/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-direct-transfer-of-prosody-in-speech-to-speech/ 语音翻译 | 7.5/10 Discrete-Continuous Fusion With Adaptive Hierarchical Features For Audio Deepfake Detection https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-continuous-fusion-with-adaptive/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-discrete-continuous-fusion-with-adaptive/ 音频深度伪造检测 | 8.0/10 DSRMS-TransUnet: A Decentralized Non-Shifted Transunet for Shallow Water Acoustic Source Range Estimation https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dsrms-transunet-a-decentralized-non-shifted/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dsrms-transunet-a-decentralized-non-shifted/ 声源定位 | 8.0/10 Dual-Strategy-Enhanced Conbimamba for Neural Speaker Diarization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dual-strategy-enhanced-conbimamba-for-neural/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dual-strategy-enhanced-conbimamba-for-neural/ 说话人分离 | 8.0/10 E2E-AEC: Implementing An End-To-End Neural Network Learning Approach for Acoustic Echo Cancellation https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-e2e-aec-implementing-an-end-to-end-neural-network/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-e2e-aec-implementing-an-end-to-end-neural-network/ 语音增强 | 7.5/10 EEND-SAA: Enrollment-Less Main Speaker Voice Activity Detection Using Self-Attention Attractors https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-eend-saa-enrollment-less-main-speaker-voice/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-eend-saa-enrollment-less-main-speaker-voice/ 语音活动检测 | 7.5/10 Exploring SSL Discrete Tokens for Multilingual Automatic Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-exploring-ssl-discrete-tokens-for-multilingual/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-exploring-ssl-discrete-tokens-for-multilingual/ 语音识别 | 7.5/10 HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection with Multichannel Audio and Multiscale Visual Cues https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-havt-ivd-heterogeneity-aware-cross-modal-network/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-havt-ivd-heterogeneity-aware-cross-modal-network/ 音频事件检测 | 8.0/10 HCGAN: Harmonic-Coupled Generative Adversarial Network for Speech Super-Resolution in Low-Bandwidth Scenarios https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hcgan-harmonic-coupled-generative-adversarial/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hcgan-harmonic-coupled-generative-adversarial/ 语音增强 | 8.0/10 HVAC-EAR: Eavesdropping Human Speech Using HVAC Systems https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hvac-ear-eavesdropping-human-speech-using-hvac/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hvac-ear-eavesdropping-human-speech-using-hvac/ 音频安全 | 8.5/10 HyFlowSE: Hybrid End-To-End Flow-Matching Speech Enhancement via Generative-Discriminative Learning https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hyflowse-hybrid-end-to-end-flow-matching-speech/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-hyflowse-hybrid-end-to-end-flow-matching-speech/ 语音增强 | 8.0/10 Improving Contextual Asr Via Multi-Grained Fusion With Large Language Models https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-contextual-asr-via-multi-grained-fusion/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-contextual-asr-via-multi-grained-fusion/ 语音识别 | 8.5/10 Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-autoregressive-modeling-of-multi-talker/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-autoregressive-modeling-of-multi-talker/ 语音识别语音翻译 | 7.0/10 Joint Deep Secondary Path Estimation and Adaptive Control for Active Noise Cancellation https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-deep-secondary-path-estimation-and-adaptive/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-deep-secondary-path-estimation-and-adaptive/ 语音增强 | 7.5/10 Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-Task Multi-Scale Network https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-estimation-of-piano-dynamics-and-metrical/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-estimation-of-piano-dynamics-and-metrical/ 音乐理解 | 7.5/10 K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-k-function-joint-pronunciation-transcription-and/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-k-function-joint-pronunciation-transcription-and/ 语音识别 | 7.5/10 Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating for Robust Code-Switching ASR https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-language-infused-retrieval-augmented-ctc-with/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-language-infused-retrieval-augmented-ctc-with/ 语音识别 | 8.0/10 Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lattice-guided-consistency-regularization-of-dual/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lattice-guided-consistency-regularization-of-dual/ 语音识别 | 8.0/10 Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-to-align-with-unbalanced-optimal/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-to-align-with-unbalanced-optimal/ 语音识别 | 6.5/10 Lightweight Implicit Neural Network for Binaural Audio Synthesis https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lightweight-implicit-neural-network-for-binaural/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lightweight-implicit-neural-network-for-binaural/ 空间音频 | 7.0/10 Lingometer: On-Device Personal Speech Word Counting System https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lingometer-on-device-personal-speech-word/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lingometer-on-device-personal-speech-word/ 语音活动检测 | 8.0/10 Low-Bandwidth High-Fidelity Speech Transmission with Generative Latent Joint Source-Channel Coding https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-low-bandwidth-high-fidelity-speech-transmission/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-low-bandwidth-high-fidelity-speech-transmission/ 语音增强 | 7.5/10 MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mela-tts-joint-transformer-diffusion-model-with/ 语音合成 | 7.0/10 Mixture of Experts for Recognizing Depression from Interview and Reading Tasks https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixture-of-experts-for-recognizing-depression/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixture-of-experts-for-recognizing-depression/ 语音生物标志物 | 6.0/10 MSANET: Multi-Scale Semantic Aggregation Network for Brain-Assisted Speech Enhancement in Multi-Speaker Conditions https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-msanet-multi-scale-semantic-aggregation-network/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-msanet-multi-scale-semantic-aggregation-network/ 语音增强 | 7.5/10 Musicdetr: A Position-Aware Spectral Note Detection Model for Singing Transcription https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musicdetr-a-position-aware-spectral-note/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-musicdetr-a-position-aware-spectral-note/ 歌唱语音转录 | 8.5/10 Peeking Into the Future for Contextual Biasing https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-peeking-into-the-future-for-contextual-biasing/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-peeking-into-the-future-for-contextual-biasing/ 语音识别 | 7.0/10 Polynomial Mixing for Efficient Self-Supervised Speech Encoders https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-polynomial-mixing-for-efficient-self-supervised/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-polynomial-mixing-for-efficient-self-supervised/ 语音识别 | 8.0/10 Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-purification-before-fusion-toward-mask-free/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-purification-before-fusion-toward-mask-free/ 语音识别 | 7.5/10 QFOCUS: Controllable Synthesis for Automated Speech Stress Editing to Deliver Human-Like Emphatic Intent https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-qfocus-controllable-synthesis-for-automated/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-qfocus-controllable-synthesis-for-automated/ 语音合成 | 7.5/10 Revisiting Direct Speech-to-Text Translation with Speech LLMS: Better Scaling than Cot Prompting? https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-revisiting-direct-speech-to-text-translation-with/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-revisiting-direct-speech-to-text-translation-with/ 语音翻译 | 7.5/10 RLBR: Reinforcement Learning with Biasing Rewards for Contextual Speech Large Language Models https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rlbr-reinforcement-learning-with-biasing-rewards/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rlbr-reinforcement-learning-with-biasing-rewards/ 语音识别 | 8.0/10 SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-saasdnet-an-eeg-based-streaming-auditory/ 脑机接口 | 8.0/10 Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scaling-multi-talker-asr-with-speaker-agnostic/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scaling-multi-talker-asr-with-speaker-agnostic/ 语音识别 | 8.5/10 Semantic Anchor Transfer from Short to Long Speech in a Distillation-Based Summarization Framework https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-semantic-anchor-transfer-from-short-to-long/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-semantic-anchor-transfer-from-short-to-long/ 语音摘要 | 7.5/10 Session-Level Spoken Language Assessment with A Multimodal Foundation Model Via Multi-Target Learning https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-session-level-spoken-language-assessment-with-a/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-session-level-spoken-language-assessment-with-a/ 语音评估 | 7.5/10 SpatialNet-Echo: Real-Time Acoustic Echo Cancellation via Integrated Narrow-Band and Cross-Band Processing https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spatialnet-echo-real-time-acoustic-echo/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spatialnet-echo-real-time-acoustic-echo/ 语音增强 | 7.5/10 Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streaming-speech-recognition-with-decoder-only/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streaming-speech-recognition-with-decoder-only/ 语音识别 | 7.0/10 StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streammark-a-deep-learning-based-semi-fragile/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streammark-a-deep-learning-based-semi-fragile/ 音频深度伪造检测 | 8.0/10 Sunac: Source-Aware Unified Neural Audio Codec https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sunac-source-aware-unified-neural-audio-codec/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sunac-source-aware-unified-neural-audio-codec/ 音频生成 | 7.5/10 T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-mimi-a-transformer-based-mimi-decoder-for-real/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-t-mimi-a-transformer-based-mimi-decoder-for-real/ 语音合成 | 7.0/10 Task-Oriented Sound Privacy Preservation for Sound Event Detection Via End-to-End Adversarial Multi-Task Learning https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-task-oriented-sound-privacy-preservation-for/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-task-oriented-sound-privacy-preservation-for/ 音频事件检测 | 7.5/10 TextlessRAG: End-to-End Visual Document RAG by Speech without Text https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-textlessrag-end-to-end-visual-document-rag-by/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-textlessrag-end-to-end-visual-document-rag-by/ 语音问答 | 8.5/10 Tokenchain: A Discrete Speech Chain via Semantic Token Modeling https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tokenchain-a-discrete-speech-chain-via-semantic/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tokenchain-a-discrete-speech-chain-via-semantic/ 语音识别 | 7.0/10 Toward Robust And Efficient Beat Tracking Via Beat-Aware Attention https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-toward-robust-and-efficient-beat-tracking-via/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-toward-robust-and-efficient-beat-tracking-via/ 音乐理解 | 8.5/10 Tpeformer: Temporal Patch Embedding Transformer https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tpeformer-temporal-patch-embedding-transformer/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tpeformer-temporal-patch-embedding-transformer/ 语音情感识别 | 7.5/10 Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-train-short-infer-long-speech-llm-enables-zero/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-train-short-infer-long-speech-llm-enables-zero/ 说话人分离 | 9.0/10 Tri-Attention Fusion: Joint Temporal-Spectral and Bidirectional Modeling for Speech Spoofing Detection https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tri-attention-fusion-joint-temporal-spectral-and/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tri-attention-fusion-joint-temporal-spectral-and/ 语音伪造检测 | 7.0/10 UJCodec: An End-to-end Unet-Style Codec for Joint Speech Compression and Enhancement https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ujcodec-an-end-to-end-unet-style-codec-for-joint/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ujcodec-an-end-to-end-unet-style-codec-for-joint/ 语音增强 | 7.5/10 UMA-SPLIT: Unimodal Aggregation for Both English and Mandarin Non-Autoregressive Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uma-split-unimodal-aggregation-for-both-english/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uma-split-unimodal-aggregation-for-both-english/ 语音识别 | 7.5/10 USVexplorer: Robust Detection of Ultrasonic Vocalizations with Cross Species Generalization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-usvexplorer-robust-detection-of-ultrasonic/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-usvexplorer-robust-detection-of-ultrasonic/ 音频事件检测 | 8.0/10 VBx for End-to-End Neural and Clustering-Based Diarization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vbx-for-end-to-end-neural-and-clustering-based/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vbx-for-end-to-end-neural-and-clustering-based/ 说话人分离 | 8.5/10 VChangeCodec: An Ultra Low-Complexity Neural Speech Codec with Built-In Voice Changer for Customized Real-Time Communication https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vchangecodec-an-ultra-low-complexity-neural/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vchangecodec-an-ultra-low-complexity-neural/ 语音转换语音增强 | 8.0/10 WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/ 语音识别 | 6.5/10 β-AVSDNET: A Novel End-To-End Neural Network Architecture For Audio-Visual Speaker Diarization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-avsdnet-a-novel-end-to-end-neural-network/ Wed, 29 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-avsdnet-a-novel-end-to-end-neural-network/ 说话人分离 | 7.5/10 Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-full-duplex-interaction-in-spoken-dialogue/ Mon, 27 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-full-duplex-interaction-in-spoken-dialogue/ 语音对话系统 | 6.5/10 Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/ Thu, 23 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/ 1. **问题**：训练一个既能高精度离线转录又能低延迟流式识别的统一ASR模型极具挑战性，传统方法在低延迟下性能会急剧下降。 2. **方法核心**：提出一个统一的Transducer框架，结合分块注意力（含右上下文）和动态块卷积（DCConv）来适配两种模式。核心创新是引入了模式一致性正则化损失 Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-deep-supervised-contrastive-learning-of-pitch/ Wed, 22 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-deep-supervised-contrastive-learning-of-pitch/ 这篇论文旨在解决将连续变化的基频（F0）曲线映射到首尔韩语中离散、不变的音高重音类别（如LHLH, HHLH）这一难题。传统方法易受F0测量噪声和说话人差异的影响。为此，作者提出了**Dual-Glob**，一个深度监督对比学习框架。其核心是通过一个**双分支（干净视图和增强视图）编码器**，在共享 Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/ Wed, 22 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-text-to-speech-with-chain-of-details-modeling/ 本文针对文本转语音（TTS）任务，提出了一种名为“细节链”（Chain-of-Details, CoD）的新框架。**要解决的问题**是现有TTS方法在建模语音生成的时域动态（从粗略时序到精细声学细节的渐进过程）方面存在不足。**使用的方法**是将语音生成分解为多个时间分辨率递增的阶段，在每个阶段使 MUSCAT: MUltilingual, SCientific ConversATion Benchmark https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/ Mon, 20 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/ 本文提出了 MUSCAT，一个用于评估多语言科学对话场景下自动语音识别（ASR）性能的新基准。数据集包含 6 组双语对话录音（共约 65 分钟，9,066 词），涉及英语与德语、土耳其语、中文、越南语的配对对话；每组对话使用 Meeting Owl 3、ReSpeaker USB 麦克风阵列和 Me VoxMind: An End-to-End Agentic Spoken Dialogue System https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-voxmind-an-end-to-end-agentic-spoken-dialogue/ Mon, 20 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-voxmind-an-end-to-end-agentic-spoken-dialogue/ 端到端语音对话模型在自然交互上进步迅速，但普遍缺乏处理复杂任务的agent能力（工具调用、规划、推理）。本文首先形式化定义了"端到端语音智能体"的四大维度——画像（Profile）、记忆（Memory）、规划（Planning）与执行（Action Execution），填补了该领域理论标准的空白。 An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/ Sun, 19 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-an-ultra-low-latency-end-to-end-streaming-speech/ 这篇论文旨在解决实时交互式语音合成中**推理延迟高**与**声学质量（尤其是高频细节）易受损**的核心矛盾。传统流水线依赖计算密集的神经声码器进行波形重建，且基于连续回归的声学模型易导致频谱过平滑。为 WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-wavalign-enhancing-intelligence-and/ Sun, 19 Apr 2026 00:00:00 +0000 https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-wavalign-enhancing-intelligence-and/ 这篇论文旨在解决端到端语音对话模型在智能（IQ）和表达力（EQ）上难以同时提升的核心挑战。作者发现，直接对混合文本-语音序列应用统一的偏好优化（如DPO、GRPO）会导致问题：稀疏的偏好信号被淹没在密