<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Speech Recognition on 语音/音频论文速递</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E8%AF%AD%E9%9F%B3%E8%AF%86%E5%88%AB/</link>
    <description>Recent content in Speech Recognition on 语音/音频论文速递</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E8%AF%AD%E9%9F%B3%E8%AF%86%E5%88%AB/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-dataset-of-robot-patient-and-doctor-patient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-dataset-of-robot-patient-and-doctor-patient/</guid>
      <description>Spoken Dialogue Systems | 7.5/10</description>
    </item>
    <item>
      <title>A Personalized Real-Time Proactive Voice Memory Assistant</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-personalized-real-time-proactive-voice-memory/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-personalized-real-time-proactive-voice-memory/</guid>
      <description>Real-Time Processing | 7.0/10</description>
    </item>
    <item>
      <title>A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-study-of-data-selection-strategies-for-pre/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-study-of-data-selection-strategies-for-pre/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-text-to-text-alignment-algorithm-for-better/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-text-to-text-alignment-algorithm-for-better/</guid>
      <description>Model Evaluation | 7.5/10</description>
    </item>
    <item>
      <title>AccLID: Accent-aware Language Identification for Robust Multilingual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acclid-accent-aware-language-identification-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acclid-accent-aware-language-identification-for/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adapting-diarization-conditioned-whisper-for-end/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adapting-diarization-conditioned-whisper-for-end/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Advanced modeling of interlanguage speech intelligibility benefit with L1-L2 multi-task learning using differentiable K-means for accent-robust discrete token-based ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advanced-modeling-of-interlanguage-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advanced-modeling-of-interlanguage-speech/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-llm-based-multi-channel-multi-speaker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-llm-based-multi-channel-multi-speaker/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Advancing Semi-Supervised Child Speech Recognition with Omni-Temporal Classification under Label Noise</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-semi-supervised-child-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-advancing-semi-supervised-child-speech/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Adversarial Fine-Tuning on Speech Foundation Model with Vulnerable Attention Consistency Regularization for Robust Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adversarial-fine-tuning-on-speech-foundation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-adversarial-fine-tuning-on-speech-foundation/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aishell6-whisper-a-chinese-mandarin-audio-visual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-aishell6-whisper-a-chinese-mandarin-audio-visual/</guid>
      <description>Speech Recognition | 8.3/10</description>
    </item>
    <item>
      <title>An End-to-End Multimodal System for Subtitle Recognition and Chinese-Japanese Translation in Short Dramas</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-end-to-end-multimodal-system-for-subtitle/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-end-to-end-multimodal-system-for-subtitle/</guid>
      <description>Multimodal Models | 7.0/10</description>
    </item>
    <item>
      <title>Ara-BEST-RQ: Multi-Dialectal Arabic SSL</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ara-best-rq-multi-dialectal-arabic-ssl/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ara-best-rq-multi-dialectal-arabic-ssl/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-text System</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attention2probability-attention-driven/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-attention2probability-attention-driven/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-conditioned-diffusion-llms-for-asr-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-conditioned-diffusion-llms-for-asr-and/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Bayesian Low-Rank Factorization for Robust Model Adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bayesian-low-rank-factorization-for-robust-model/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bayesian-low-rank-factorization-for-robust-model/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>BBPE16: UTF-16-Based Byte-Level Byte-Pair Encoding for Improved Multilingual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bbpe16-utf-16-based-byte-level-byte-pair-encoding/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bbpe16-utf-16-based-byte-level-byte-pair-encoding/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>BEST-RQ-based Self-Supervised Learning for Whisper Domain Adaptation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-best-rq-based-self-supervised-learning-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-best-rq-based-self-supervised-learning-for/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-birq-bi-level-self-labeling-random-quantization/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-birq-bi-level-self-labeling-random-quantization/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Bridging the Front-End and Back-End for Robust ASR via Cross-Attention-Based U-Net</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridging-the-front-end-and-back-end-for-robust/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bridging-the-front-end-and-back-end-for-robust/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-calm-joint-contextual-acoustic-linguistic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-calm-joint-contextual-acoustic-linguistic/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-can-large-audio-language-models-understand-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-can-large-audio-language-models-understand-audio/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>CCST: Cross-Modal and Consistency-Aware Self-Training for Source-Free Unsupervised Domain Adaptation in Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ccst-cross-modal-and-consistency-aware-self/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ccst-cross-modal-and-consistency-aware-self/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Chunk-Wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunk-wise-attention-transducers-for-fast-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunk-wise-attention-transducers-for-fast-and/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Chunkwise Aligners for Streaming Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-chunkwise-aligners-for-streaming-speech/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Confidence-Guided Error Correction for Disordered Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-guided-error-correction-for-disordered/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-confidence-guided-error-correction-for-disordered/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Content-Preserving Speech Representation Learning Via Adaptive Segment-Level Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-preserving-speech-representation-learning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-content-preserving-speech-representation-learning/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-contextual-biasing-for-asr-in-speech-llm-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-contextual-biasing-for-asr-in-speech-llm-with/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-cultural-bias-in-mel-scale-representations/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-cultural-bias-in-mel-scale-representations/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Cross-Modal Bottleneck Fusion for Noise Robust Audio-Visual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-bottleneck-fusion-for-noise-robust/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-bottleneck-fusion-for-noise-robust/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ctc-did-ctc-based-arabic-dialect-identification/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ctc-did-ctc-based-arabic-dialect-identification/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Decoder-Only Conformer with Modality-Aware Sparse Mixtures of Experts for ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-decoder-only-conformer-with-modality-aware-sparse/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-decoder-only-conformer-with-modality-aware-sparse/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Do we really need self-attention for streaming automatic speech recognition?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-we-really-need-self-attention-for-streaming/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-we-really-need-self-attention-for-streaming/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Domain-Aware Scheduling for ASR Fine-Tuning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-domain-aware-scheduling-for-asr-fine-tuning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-domain-aware-scheduling-for-asr-fine-tuning/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emilia-nv-a-non-verbal-speech-dataset-with-word/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-emilia-nv-a-non-verbal-speech-dataset-with-word/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Equipping Large Language Model with Directional Speech Understanding Capabilities</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-equipping-large-language-model-with-directional/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-equipping-large-language-model-with-directional/</guid>
      <description>Speech Recognition, Speech Translation | 7.0/10</description>
    </item>
    <item>
      <title>Exploring SSL Discrete Tokens for Multilingual Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-exploring-ssl-discrete-tokens-for-multilingual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-exploring-ssl-discrete-tokens-for-multilingual/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>FinHuBERT: Hierarchical Feature Imitating Networks for Low-Resource Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-finhubert-hierarchical-feature-imitating-networks/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-finhubert-hierarchical-feature-imitating-networks/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-flexi-lora-with-input-adaptive-ranks-efficient/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-flexi-lora-with-input-adaptive-ranks-efficient/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Frontend Token Enhancement for Token-Based Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-frontend-token-enhancement-for-token-based-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-frontend-token-enhancement-for-token-based-speech/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gloria-gated-low-rank-interpretable-adaptation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-gloria-gated-low-rank-interpretable-adaptation/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Grey-Box Prompt Tuning With Graph Alignment for Speech-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-grey-box-prompt-tuning-with-graph-alignment-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-grey-box-prompt-tuning-with-graph-alignment-for/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-Resource Transfer</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-how-far-do-ssl-speech-models-listen-for-tone/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-how-far-do-ssl-speech-models-listen-for-tone/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>ICASSP 2026 - Speech Recognition Paper List</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-078/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-078/</guid>
      <description>A total of 102 ICASSP 2026 papers on Speech Recognition</description>
    </item>
    <item>
      <title>Identifying the Minimal and Maximal Phonetic Subspace of Speech Representations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-identifying-the-minimal-and-maximal-phonetic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-identifying-the-minimal-and-maximal-phonetic/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Impact of Phonetics on Speaker Identity in Adversarial Voice Attack</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-impact-of-phonetics-on-speaker-identity-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-impact-of-phonetics-on-speaker-identity-in/</guid>
      <description>Speaker Verification | 7.0/10</description>
    </item>
    <item>
      <title>Improving Automatic Speech Recognition by Mitigating Distortions Introduced by Speech Enhancement Under Drone Noise</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-automatic-speech-recognition-by/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-automatic-speech-recognition-by/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>Improving Contextual ASR via Multi-Grained Fusion with Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-contextual-asr-via-multi-grained-fusion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-improving-contextual-asr-via-multi-grained-fusion/</guid>
      <description>Speech Recognition | 8.5/10</description>
    </item>
    <item>
      <title>In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word level timestamp predictions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-in-sync-adaptation-of-speech-aware-large-language/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-in-sync-adaptation-of-speech-aware-large-language/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Input-Adaptive Differentiable Filterbanks via Hypernetworks for Robust Speech Processing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-input-adaptive-differentiable-filterbanks-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-input-adaptive-differentiable-filterbanks-via/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Inverse-Hessian Regularization for Continual Learning in ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-inverse-hessian-regularization-for-continual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-inverse-hessian-regularization-for-continual/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Investigating The Effect Of Sentence-Level Syntactic Structure On Information Loss In The Human Auditory System</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-investigating-the-effect-of-sentence-level/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-investigating-the-effect-of-sentence-level/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-autoregressive-modeling-of-multi-talker/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-joint-autoregressive-modeling-of-multi-talker/</guid>
      <description>Speech Recognition, Speech Translation | 7.0/10</description>
    </item>
    <item>
      <title>K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-k-function-joint-pronunciation-transcription-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-k-function-joint-pronunciation-transcription-and/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating for Robust Code-Switching ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-language-infused-retrieval-augmented-ctc-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-language-infused-retrieval-augmented-ctc-with/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lattice-guided-consistency-regularization-of-dual/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lattice-guided-consistency-regularization-of-dual/</guid>
      <description>Speech Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-to-align-with-unbalanced-optimal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-learning-to-align-with-unbalanced-optimal/</guid>
      <description>Speech Recognition | 6.5/10</description>
    </item>
    <item>
      <title>LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-less-large-language-model-enhanced-semi/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-less-large-language-model-enhanced-semi/</guid>
      <description>Speech Recognition, Speech Translation | 7.5/10</description>
    </item>
    <item>
      <title>Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-audio-visual-data-to-reduce-the/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-audio-visual-data-to-reduce-the/</guid>
      <description>Speech Recognition | 6.0/10</description>
    </item>
    <item>
      <title>Leveraging Segment-Level Speech Representations for LLM-Based Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-segment-level-speech-representations/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-segment-level-speech-representations/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Leveraging Text-to-Speech and Voice Conversion as Data Augmentation for Alzheimer&#39;s Disease Detection from Spontaneous Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-text-to-speech-and-voice-conversion-as/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-text-to-speech-and-voice-conversion-as/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Linguard: Authenticating Speech Recordings Using Speech Recognition and Watermark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-linguard-authenticating-speech-recordings-using/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-linguard-authenticating-speech-recordings-using/</guid>
      <description>Audio Security | 6.5/10</description>
    </item>
    <item>
      <title>Listen, But Don&#39;t Leak: Sensitive Data Protection for Privacy Aware Automatic Speech Recognition with Acoustic Triggers</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-listen-but-dont-leak-sensitive-data-protection/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-listen-but-dont-leak-sensitive-data-protection/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>LLM-Based Post-ASR Error Correction for Disordered Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-llm-based-post-asr-error-correction-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-llm-based-post-asr-error-correction-for/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-longspeech-a-scalable-benchmark-for-transcription/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-longspeech-a-scalable-benchmark-for-transcription/</guid>
      <description>基准测试 | 7.8/10</description>
    </item>
    <item>
      <title>LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Conversational ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lotusdis-a-thai-far-field-meeting-corpus-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-lotusdis-a-thai-far-field-meeting-corpus-for/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Medical ASR Enhancement by Domain-Specific Reinforcement Fine-Tuning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-medical-asr-enhancement-by-domain-specific/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-medical-asr-enhancement-by-domain-specific/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mind-the-shift-using-delta-ssl-embeddings-to/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mind-the-shift-using-delta-ssl-embeddings-to/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-attention-sinks-and-massive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-attention-sinks-and-massive/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Mixture To Beamformed Mixture: Leveraging Beamformed Mixture As Weak-Supervision for Speech Enhancement and Noise-Robust ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixture-to-beamformed-mixture-leveraging/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixture-to-beamformed-mixture-leveraging/</guid>
      <description>语音增强 | 8.0/10</description>
    </item>
    <item>
      <title>Mixtures of Lightweight Articulatory Experts for Multilingual ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixtures-of-lightweight-articulatory-experts-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mixtures-of-lightweight-articulatory-experts-for/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mnv-17-a-high-quality-performative-mandarin/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mnv-17-a-high-quality-performative-mandarin/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Multilingual Supervised Pretraining with LM-Assisted Decoding for Visual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multilingual-supervised-pretraining-with-lm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multilingual-supervised-pretraining-with-lm/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>nGPT as a Scalable Architecture for Speech Recognition and Translation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ngpt-as-a-scalable-architecture-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ngpt-as-a-scalable-architecture-for-speech/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Noise-Robust AV-ASR Using Visual Features both in the Whisper Encoder and Decoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-robust-av-asr-using-visual-features-both-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-noise-robust-av-asr-using-visual-features-both-in/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>OMNI-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-omni-avsr-towards-unified-multimodal-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-omni-avsr-towards-unified-multimodal-speech/</guid>
      <description>语音识别 | 8.5/10</description>
    </item>
    <item>
      <title>Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-online-register-for-dual-mode-self-supervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-online-register-for-dual-mode-self-supervised/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>PAC: Pronunciation-Aware Contextualized Large Language Model-Based Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pac-pronunciation-aware-contextualized-large/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-pac-pronunciation-aware-contextualized-large/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Peeking Into the Future for Contextual Biasing</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-peeking-into-the-future-for-contextual-biasing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-peeking-into-the-future-for-contextual-biasing/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>PhoenixDSR: Phoneme-Guided and LLM-Enhanced Dysarthric Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-phoenixdsr-phoneme-guided-and-llm-enhanced/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-phoenixdsr-phoneme-guided-and-llm-enhanced/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Polynomial Mixing for Efficient Self-Supervised Speech Encoders</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-polynomial-mixing-for-efficient-self-supervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-polynomial-mixing-for-efficient-self-supervised/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>Position-Invariant Fine-Tuning Of Speech Enhancement Models With Self-Supervised Speech Representations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-position-invariant-fine-tuning-of-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-position-invariant-fine-tuning-of-speech/</guid>
      <description>语音增强 | 6.5/10</description>
    </item>
    <item>
      <title>Production-Scale Dynamic Vocabulary ASR Biasing with Word-Level FST and Robust Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-production-scale-dynamic-vocabulary-asr-biasing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-production-scale-dynamic-vocabulary-asr-biasing/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-proficiency-aware-adaptation-and-data/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-proficiency-aware-adaptation-and-data/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-purification-before-fusion-toward-mask-free/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-purification-before-fusion-toward-mask-free/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>RAS: a Reliability Oriented Metric for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ras-a-reliability-oriented-metric-for-automatic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ras-a-reliability-oriented-metric-for-automatic/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Reducing Prompt Sensitivity in LLM-Based Speech Recognition Through Learnable Projection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reducing-prompt-sensitivity-in-llm-based-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reducing-prompt-sensitivity-in-llm-based-speech/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Reference Microphone Selection for Guided Source Separation Based on The Normalized L-P Norm</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reference-microphone-selection-for-guided-source/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reference-microphone-selection-for-guided-source/</guid>
      <description>语音增强 | 7.0/10</description>
    </item>
    <item>
      <title>Relative Time Intervals Representation For Word-Level Timestamping With Masked Training</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-relative-time-intervals-representation-for-word/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-relative-time-intervals-representation-for-word/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>RLBR: Reinforcement Learning with Biasing Rewards for Contextual Speech Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rlbr-reinforcement-learning-with-biasing-rewards/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rlbr-reinforcement-learning-with-biasing-rewards/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>Robust Accent Identification via Voice Conversion and Non-Timbral Embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-robust-accent-identification-via-voice-conversion/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-robust-accent-identification-via-voice-conversion/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scaling-multi-talker-asr-with-speaker-agnostic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-scaling-multi-talker-asr-with-speaker-agnostic/</guid>
      <description>语音识别 | 8.5/10</description>
    </item>
    <item>
      <title>SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-se-dicow-self-enrolled-diarization-conditioned/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-se-dicow-self-enrolled-diarization-conditioned/</guid>
      <description>语音识别 | 8.5/10</description>
    </item>
    <item>
      <title>SED: Structural Entropy Based Speech Discretization for Discrete Token-Based ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sed-structural-entropy-based-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sed-structural-entropy-based-speech/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sequence-level-unsupervised-training-in-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sequence-level-unsupervised-training-in-speech/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-slm-tta-a-framework-for-test-time-adaptation-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-slm-tta-a-framework-for-test-time-adaptation-of/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ssvd-o-parameter-efficient-fine-tuning-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ssvd-o-parameter-efficient-fine-tuning-with/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stacodec-semantic-token-assignment-for-balancing/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streaming-speech-recognition-with-decoder-only/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-streaming-speech-recognition-with-decoder-only/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Synthesized Data Selection via Score Distribution Matching for Te Reo Māori Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synthesized-data-selection-via-score-distribution/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synthesized-data-selection-via-score-distribution/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>Synthetic Data Domain Adaptation for ASR via LLM-Based Text and Phonetic Respelling Augmentation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synthetic-data-domain-adaptation-for-asr-via-llm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-synthetic-data-domain-adaptation-for-asr-via-llm/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>TAGARELA - A Portuguese Speech Dataset from Podcasts</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tagarela-a-portuguese-speech-dataset-from-podcasts/</guid>
      <description>语音识别 语音合成 | 7.0/10</description>
    </item>
    <item>
      <title>Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-target-speaker-llm-asr-with-speaker-aware-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-target-speaker-llm-asr-with-speaker-aware-speech/</guid>
      <description>语音识别 | 8.8/10</description>
    </item>
    <item>
      <title>TASU: Text-only Alignment for Speech Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tasu-text-only-alignment-for-speech-understanding/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tasu-text-only-alignment-for-speech-understanding/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Teaching the Teachers: Boosting Unsupervised Domain Adaptation In Speech Recognition By Ensemble Update</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teaching-the-teachers-boosting-unsupervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-teaching-the-teachers-boosting-unsupervised/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Three Seconds is Sufficient: A Multi-Pronged Framework for Model-Based Speaker Adaptation in ASR Under Data-Scarce Conditions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-three-seconds-is-sufficient-a-multi-pronged/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-three-seconds-is-sufficient-a-multi-pronged/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>TICL: Text-Embedding KNN for Speech in-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ticl-text-embedding-knn-for-speech-in-context/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ticl-text-embedding-knn-for-speech-in-context/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Tokenchain: A Discrete Speech Chain via Semantic Token Modeling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tokenchain-a-discrete-speech-chain-via-semantic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tokenchain-a-discrete-speech-chain-via-semantic/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-building-speech-large-language-models-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-building-speech-large-language-models-for/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Towards Fair ASR for Second Language Speakers using Fairness Prompted Finetuning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-fair-asr-for-second-language-speakers/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-fair-asr-for-second-language-speakers/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-orthographically-informed-evaluation-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-orthographically-informed-evaluation-of/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-robust-dysarthric-speech-recognition-llm/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-robust-dysarthric-speech-recognition-llm/</guid>
      <description>语音识别 | 9.0/10</description>
    </item>
    <item>
      <title>Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-train-short-infer-long-speech-llm-enables-zero/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-train-short-infer-long-speech-llm-enables-zero/</guid>
      <description>说话人分离 | 9.0/10</description>
    </item>
    <item>
      <title>TTA: Transcribe, Translate and Alignment for Cross-Lingual Speech Representation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tta-transcribe-translate-and-alignment-for-cross/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-tta-transcribe-translate-and-alignment-for-cross/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>UMA-SPLIT: Unimodal Aggregation for Both English and Mandarin Non-Autoregressive Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uma-split-unimodal-aggregation-for-both-english/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uma-split-unimodal-aggregation-for-both-english/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-variational-low-rank-adaptation-for-personalized/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-variational-low-rank-adaptation-for-personalized/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Voting-Based Pitch Estimation with Temporal and Frequential Alignment and Correlation Aware Selection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voting-based-pitch-estimation-with-temporal-and/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-voting-based-pitch-estimation-with-temporal-and/</guid>
      <description>语音识别 | 8.0/10</description>
    </item>
    <item>
      <title>WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wav2lev-predicting-levenshtein-edit-operation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-wav2lev-predicting-levenshtein-edit-operation/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Whisper-FEST: Single-Channel Far-Field Enhanced Speech-to-text without Parallel Data</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisper-fest-single-channel-far-field-enhanced/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisper-fest-single-channel-far-field-enhanced/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Whisper-MLA: Reducing GPU Memory Consumption of ASR Models Based on MHA2MLA Conversion</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisper-mla-reducing-gpu-memory-consumption-of/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisper-mla-reducing-gpu-memory-consumption-of/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>Whisper: Courtside Edition - Enhancing ASR Performance through LLM-Driven Context Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisper-courtside-edition-enhancing-asr/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisper-courtside-edition-enhancing-asr/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-whisperpipe-a-resource-efficient-streaming/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-Resource Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-windowed-summarymixing-an-efficient-fine-tuning/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-windowed-summarymixing-an-efficient-fine-tuning/</guid>
      <description>语音识别 | 6.5/10</description>
    </item>
    <item>
      <title>Z-Scores: A Metric for Linguistically Assessing Disfluency Removal</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-z-scores-a-metric-for-linguistically-assessing/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-z-scores-a-metric-for-linguistically-assessing/</guid>
      <description>模型评估 | 6.5/10</description>
    </item>
    <item>
      <title>RAS: a Reliability Oriented Metric for Automatic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-ras-a-reliability-oriented-metric-for-automatic/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-ras-a-reliability-oriented-metric-for-automatic/</guid>
      <description>语音识别 | 7.5/10</description>
    </item>
    <item>
      <title>Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-advancing-automatic-speech-recognition-using/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-advancing-automatic-speech-recognition-using/</guid>
      <description>语音识别 | 7.0/10</description>
    </item>
    <item>
      <title>DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-dm-asr-diarization-aware-multi-speaker-asr-with/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-dm-asr-diarization-aware-multi-speaker-asr-with/</guid>
      <description>Speaker Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-identifying-and-typifying-demographic-unfairness/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-27-identifying-and-typifying-demographic-unfairness/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>&#34;This Wasn&#39;t Made for Me&#34;: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-this-wasnt-made-for-me-recentering-user/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-this-wasnt-made-for-me-recentering-user/</guid>
      <description>Speech Recognition | 7.0/10</description>
    </item>
    <item>
      <title>Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-do-llm-decoders-listen-fairly-benchmarking-how/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-do-llm-decoders-listen-fairly-benchmarking-how/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Evaluation of Automatic Speech Recognition Using Generative Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-evaluation-of-automatic-speech-recognition-using/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-evaluation-of-automatic-speech-recognition-using/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-aligning-stuttered-speech-research-with-end-user/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-aligning-stuttered-speech-research-with-end-user/</guid>
      <description>1. **Problem**: Current research on stuttered-speech technology is systematically disconnected from the actual needs of people who stutter (PWS) and speech-language pathologists (SLPs); research priorities, task definitions, and evaluation methods are not sufficiently user-centered. 2. **Core method**: a two-part analysis: 1) a scoping review of 228 relevant papers, proposing a taxonomy of research tasks and analyzing the state of the field; 2) a survey of 70 stakeholders…</description>
    </item>
    <item>
      <title>Enhancing ASR Performance in the Medical Domain for Dravidian Languages</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-enhancing-asr-performance-in-the-medical-domain/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-enhancing-asr-performance-in-the-medical-domain/</guid>
      <description>This paper addresses the two main challenges facing medical-domain automatic speech recognition (ASR) for Dravidian languages (Telugu and Kannada): scarce labeled data and complex morphology. Its core method is a "confidence-aware training framework" with a hybrid confidence-scoring mechanism (combining static perceptual and acoustic-similarity measures and WER scores with dynamic model entropy), applied to training data that mixes real and synthetic speech…</description>
    </item>
    <item>
      <title>Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-reducing-the-offline-streaming-gap-for-unified/</guid>
      <description>1. **Problem**: Training a unified ASR model that delivers both high-accuracy offline transcription and low-latency streaming recognition is highly challenging; conventional approaches degrade sharply at low latency. 2. **Core method**: a unified Transducer framework combining chunked attention (with right context) and dynamic chunk convolution (DCConv) to support both modes. The key innovation is a mode-consistency regularization loss…</description>
    </item>
    <item>
      <title>Tadabur: A Large-Scale Quran Audio Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-tadabur-a-large-scale-quran-audio-dataset/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-tadabur-a-large-scale-quran-audio-dataset/</guid>
      <description>1. **Problem**: Existing Quran audio datasets fall seriously short in scale, reciter diversity, audio quality, and annotation depth, limiting research progress on Quranic ASR, reciter identification, and related tasks. 2. **Core method**: the Tadabur dataset and its construction pipeline. At the pipeline's core is the "Ayah Alignment Module" (AAM), which uses WhisperX for initial transcription and then…</description>
    </item>
    <item>
      <title>Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-utterance-level-methods-for-identifying-reliable/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-utterance-level-methods-for-identifying-reliable/</guid>
      <description>1. **Problem**: Automatic speech recognition (ASR) for child speech has high error rates, hurting applications such as language learning and reading assistance. Traditional confidence-estimation methods can fail on noisy, highly variable child speech. A post-transcription (utterance-level) method is needed to automatically identify which ASR outputs are reliable, reducing the manual-review burden. 2. **Core method**: two approaches based on…</description>
    </item>
    <item>
      <title>APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-aprvos-1st-place-winner-of-5th-pvuw-mevis-audio/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-aprvos-1st-place-winner-of-5th-pvuw-mevis-audio/</guid>
      <description>This paper reports APRVOS, the winning system designed for the MEVIS_Audio task (audio-conditioned referring video object segmentation). **Problem**: conventional text-referring segmentation models cannot directly handle speech input that is noisy, incomplete, and may describe objects that do not appear in the video. **Method**: a four-stage pipeline that first uses VibeVoice-ASR to convert the speech…</description>
    </item>
    <item>
      <title>Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-detecting-hallucinations-in-speechllms-at/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-detecting-hallucinations-in-speechllms-at/</guid>
      <description>This paper tackles the "hallucination" problem of speech large language models (SpeechLLMs) at inference time, i.e. generating fluent text that does not match the input audio. Existing methods rely on expensive gold-standard outputs, while approaches for text LLMs cannot capture audio-specific signals. The authors therefore propose four lightweight attention-map-based metrics (AudioRatio, AudioConsistency, AudioEnt…</description>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-qwen35-omni-technical-report/</guid>
      <description>This technical report gives a comprehensive introduction to Qwen3.5-Omni, an omni-modal large language model that unifies understanding and generation of text, images, audio, and audiovisual content. **Problem**: existing models' limitations in real-time interaction, cross-modal reasoning, and autonomous agent behavior. **Method**: a "Thinker-Talker" architecture with several key innovations: 1) both the Thinker and the Talker use hybrid attention…</description>
    </item>
    <item>
      <title>Tadabur: A Large-Scale Quran Audio Dataset</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-tadabur-a-large-scale-quran-audio-dataset/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-tadabur-a-large-scale-quran-audio-dataset/</guid>
      <description>This paper addresses the lack of large-scale, diverse, finely annotated datasets in Quranic speech research. The authors propose the **Tadabur** dataset and its automated construction pipeline. The pipeline first collects audio from public platforms and uses a large language model (Gemini) to extract standardized metadata (e.g., surah, reciter) from unstructured text. The core step is the **Ayah Alignment …</description>
    </item>
    <item>
      <title>Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-voice-of-india-a-large-scale-benchmark-for-real/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-voice-of-india-a-large-scale-benchmark-for-real/</guid>
      <description>This paper tackles the core problem that existing Indic ASR benchmarks neither reflect real-world scenarios nor evaluate models fairly. The authors build the large-scale "Voice of India" benchmark from unscripted telephone conversations by 36,000 speakers, covering 15 major Indian languages and 139 regional clusters, 536 hours in total. The key innovation is a spelling-variant-aware…</description>
    </item>
    <item>
      <title>ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-claricodec-optimising-neural-speech-codes-for/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-claricodec-optimising-neural-speech-codes-for/</guid>
      <description>This paper proposes ClariCodec for ultra-low-bitrate (200 bps) scenarios such as satellite and underwater communication, where conventional neural speech codecs sacrifice intelligibility in favor of reconstruction quality. The core method recasts the encoder's quantization as a stochastic policy and fine-tunes the encoder with reinforcement learning (RL), using word error rate (WER) as the reward signal while freezing the decoder and the rest of the acoustic reconstruction pipeline. Experiments show…</description>
    </item>
    <item>
      <title>Where Do Self-Supervised Speech Models Become Unfair?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-where-do-self-supervised-speech-models-become/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-where-do-self-supervised-speech-models-become/</guid>
      <description>This paper investigates at which layer unfairness arises in self-supervised speech models (S3Ms). The team uses lightweight linear probes on the per-layer embeddings of several S3Ms (e.g., WavLM, Wav2Vec2, BEST-RQ, Whisper), jointly evaluating overall performance on speaker identification (SID) and automatic speech recognition (ASR) as well as performance across different speaker groups…</description>
    </item>
    <item>
      <title>HARNESS: Lightweight Distilled Arabic Speech Foundation Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-harness-lightweight-distilled-arabic-speech/</guid>
      <description>Addressing the weak performance of general multilingual/English models on Arabic speech recognition, dialect identification, and emotion recognition, as well as the difficulty of deploying large models, this paper proposes HArnESS, a family of Arabic-centric self-supervised speech models. The authors adopt a HuBERT-style iterative self-distillation framework, first training on large-scale Arabic-English bilingual data (about 23K hours) the 24-layer teacher model H…</description>
    </item>
    <item>
      <title>Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-interactive-asr-towards-human-like-interaction/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-interactive-asr-towards-human-like-interaction/</guid>
      <description>This paper proposes the Interactive ASR framework to address two blind spots of conventional ASR: the WER metric's insensitivity to semantic errors, and systems' inability to correct errors through natural interaction. First, the authors introduce S²ER (Sentence-level Semantic Error Rate), which uses LLM-as-a-Judge binary judgments of whether the recognized output and the reference text…</description>
    </item>
    <item>
      <title>MUSCAT: MUltilingual, SCientific ConversATion Benchmark</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-muscat-multilingual-scientific-conversation/</guid>
      <description>This paper presents MUSCAT, a new benchmark for evaluating automatic speech recognition (ASR) performance in multilingual scientific conversations. The dataset contains 6 bilingual conversation recordings (about 65 minutes, 9,066 words in total), pairing English with German, Turkish, Chinese, and Vietnamese; each conversation was recorded with a Meeting Owl 3, a ReSpeaker USB microphone array, and a Me…</description>
    </item>
    <item>
      <title>ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-claricodec-optimising-neural-speech-codes-for/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-claricodec-optimising-neural-speech-codes-for/</guid>
      <description>This paper addresses the severe loss of speech intelligibility in extremely bandwidth-constrained scenarios (e.g., 200 bps) such as satellite and underwater communication. Conventional codecs target waveform reconstruction and, at ultra-low bitrates, spend precious bits on unnecessary acoustic detail rather than core semantic information. To this end,…</description>
    </item>
    <item>
      <title>Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-contextual-biasing-for-asr-in-speech-llm-with/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-contextual-biasing-for-asr-in-speech-llm-with/</guid>
      <description>This paper addresses the poor performance of speech large language models (SLLMs) when recognizing "bias words" that are rare or unseen in the training data. Conventional methods rely on supplying exact phoneme sequences for the bias words (generated by a G2P system), which demands expertise from users and suffers from poor tool compatibility. To this end,…</description>
    </item>
    <item>
      <title>Diffusion Language Models for Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-diffusion-language-models-for-speech-recognition/</link>
      <pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-19-diffusion-language-models-for-speech-recognition/</guid>
      <description>This paper explores a new approach that applies diffusion language models (DLMs) to automatic speech recognition (ASR). Its core goal is to exploit diffusion models' bidirectional attention and parallel generation to improve the accuracy of ASR candidate hypotheses produced by conventional encoders (e.g., CTC). The paper mainly…</description>
    </item>
  </channel>
</rss>
