Posts

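ICASSP 2026 - Speech Representation Learning Paper List

ICASSP 2026 - Speech Representation Learning: 1 paper

Rank | Paper | Score | Tier
🥇 | Phonological Tokenizer: Prosody-Aware Phonetic Token Via Mul | 8.0 | Top 25%

📋 Paper details

🥇 Phonological Tokenizer: Prosody-Aware Phonetic Token Via Multi-Objective Fine-Tuning with Differentiable K-Means
🔥 8.0/10 | Top 25% | #speech-representation-learning | #discrete-tokens | #multi-task-learning #self-supervised-learning

👥 Authors and affiliations
First author: Kentaro Onda (The University of Tokyo; Sony Group)
Corresponding author: not stated
Author list: Kentaro Onda (The University of Tokyo; Sony Group), Hayato Futami (Sony Group), Yosuke Kashiwagi (Sony Group), Emiru Tsunoo (Sony Group), Shinji Watanabe (Carnegie Mellon University)

💡 Blunt take
The paper's strength is its clever use of multi-objective optimization and differentiable k-means to find a practical, high-performing balance between theoretically "pure" phonetic tokens and "rich" acoustic tokens, with notable gains on prosody-sensitive tasks such as emotion recognition and voice conversion. Its weakness is that it says little about the optimization challenges (such as gradient-estimate variance) that the discrete nature of its core tool, differentiable k-means, may introduce in end-to-end training. In addition, although the vocoder is conditioned on a pretrained speaker encoder to "strip out" speaker information, the paper does not analyze in depth whether that removal is complete or how it might affect downstream tasks.

🔗 Open-source details
Code: The paper does not mention a code repository link. The method is implemented with the ESPnet toolkit.
Model weights: The paper does not say whether the fine-tuned model weights are released.
Datasets: Public datasets including VCTK, LibriSpeech, RAVDESS, VoxCeleb, LJSpeech, TIMIT, Expresso, and LibriLight; see each dataset's official site for access.
Demo: An online demo site is provided: https://ondatk68.github.io/onda-demo/projects/phonological-tokenizer
Reproduction materials: Some training details are given (two-stage training, learning rate, number of epochs, α value), but no complete configuration files, checkpoints, or detailed hyperparameter list.
Open-source projects cited in the paper: ESPnet, HiFi-GAN (ParallelWaveGAN), ECAPA-TDNN (SpeechBrain), WavLM, Qwen2.5, Llama-3.2, among others.

📌 Core summary ...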

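ICASSP 2026 - Speech Decoding Paper List

ICASSP 2026 - Speech Decoding: 1 paper

Rank | Paper | Score | Tier
🥇 | A Robust Multi-Scale Framework with Test-Time Adaptation for | 7.5 | Top 25%

📋 Paper details

🥇 A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding
✅ 7.5/10 | Top 25% | #speech-decoding | #domain-adaptation | #brain-computer-interface #multi-scale-feature-learning

👥 Authors and affiliations
First author: Yang-yang Li (School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Data Science and School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen)
Corresponding author: Siqi Cai (School of Intelligence Science and Engineering and School of Artificial Intelligence, Harbin Institute of Technology, Shenzhen)
Author list:
Yang-yang Li (School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Data Science and School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen)
Suli Wang (Department of Computer Science, Technical University of Darmstadt; School of Data Science and School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen)
Siqi Cai (School of Intelligence Science and Engineering and School of Artificial Intelligence, Harbin Institute of Technology, Shenzhen)
Haizhou Li (School of Data Science and School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen)

💡 Blunt take
The paper's strength is that it confronts the core pain point of sEEG signal decoding, namely domain shift caused by non-stationarity, and proposes a logically clear two-stage solution with effective components ("strengthen the representation first, then adapt online") that does deliver a significant performance gain on a public dataset. Its weakness is that the experiments are validated on only one dataset (DU-IN), and the model size (5.964M) may be on the large side for implantable BCI applications. The paper gives too little attention to model lightweighting and real-time inference, so its case for clinical translation is somewhat thin.

🔗 Open-source details
Code: The paper provides a repository link: https://github.com/lyyi599/MDM-Tent. It does not say whether the code has been released or whether the page is only a placeholder.
Model weights: The paper does not mention whether pretrained model weights are provided.
Datasets: The experiments use the public DU-IN dataset. The paper does not give a specific way to obtain it, but implies that readers can refer to the original study.
Demo: The paper does not mention an online demo.
Reproduction materials: Some training details (optimizer, learning rate, batch size) are not stated. The full ablation results are available at the GitHub link provided.
Open-source projects cited in the paper: Open-source implementations or related work for several baselines, such as DU-IN, EEGNet, and Tent.

📌 Core summary ...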

ICASSP 2026 - Speech Assessment Paper List

ICASSP 2026 - Speech Assessment: 5 papers

Rank | Paper | Score | Tier
🥇 | Mispronunciation Detection and Diagnosis Without Model Train | 8.0 | Top 25%
🥈 | Matrix-Structured Hierarchical Convolutional Modeling for Pr | 8.0 | Top 25%
🥉 | Reference-Aware SFM Layers for Intrusive Intelligibility Pre | 7.5 | Top 10%
4 | Session-Level Spoken Language Assessment with A Multimodal F | 7.5 | Top 25%
5 | Fine-Tuning Large Multimodal Models for Automatic Pronunciat | 7.0 | Top 50%

📋 Paper details

🥇 Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach
🔥 8.0/10 | Top 25% | #speech-assessment | #retrieval-augmented | #pretraining #zero-shot ...

ICASSP 2026 - Speech Recognition #Speech Synthesis Paper List

ICASSP 2026 - Speech Recognition #Speech Synthesis: 1 paper

Rank | Paper | Score | Tier
🥇 | TAGARELA - A Portuguese Speech Dataset from Podcasts | 7.0 | Top 25%

📋 Paper details

🥇 TAGARELA - A Portuguese Speech Dataset from Podcasts
✅ 7.0/10 | Top 25% | #speech-recognition #speech-synthesis | #pretraining | #speech-recognition #speech-synthesis

👥 Authors and affiliations
First author: Frederico Santos de Oliveira (Federal University of Mato Grosso (UFMT))
Corresponding author: not stated
Author list: Frederico Santos de Oliveira (UFMT), Lucas Rafael Stefanel Gris (UFG), Alef Iury Siqueira Ferreira (UFG), Augusto Seben da Rosa (UNESP), Alexandre Costa Ferro Filho (UFG), Edresson Casanova (NVIDIA), Christopher Dane Shulby (Elsa Speak), Rafael Teixeira Sousa (UFMT), Diogo Fernandes Costa Silva (UFG), Anderson da Silva Soares (UFG), Arlindo Rodrigues Galvão Filho (UFG)

💡 Blunt take ...

ICASSP 2026 - Speech Recognition #Speech Translation Paper List

ICASSP 2026 - Speech Recognition #Speech Translation: 3 papers

Rank | Paper | Score | Tier
🥇 | LESS: Large Language Model Enhanced Semi-Supervised Learning | 7.5 | Top 25%
🥈 | Equipping Large Language Model with Directional Speech Under | 7.0 | Top 50%
🥉 | Joint Autoregressive Modeling of Multi-Talker Overlapped Spe | 7.0 | Top 25%

📋 Paper details

🥇 LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data
✅ 7.5/10 | Top 25% | #speech-recognition #speech-translation | #semi-supervised-learning #large-language-models | #speech-recognition #speech-translation ...

ICASSP 2026 - Speech Recognition Paper List

ICASSP 2026 - Speech Recognition: 102 papers

Rank | Paper | Score | Tier
🥇 | Towards Robust Dysarthric Speech Recognition: LLM-Agent Post | 9.0 | Top 25%
🥈 | Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder | 8.8 | Top 10%
🥉 | SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper | 8.5 | Top 25%
4 | Scaling Multi-Talker ASR with Speaker-Agnostic Activity Stre | 8.5 | Top 25%
5 | Improving Contextual Asr Via Multi-Grained Fusion With Large | 8.5 | Top 25%
6 | OMNI-AVSR: Towards Unified Multimodal Speech Recognition Wit | 8.5 | Top 10%
7 | AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Sp | 8.3 | Top 25%
8 | Polynomial Mixing for Efficient Self-Supervised Speech Encod | 8.0 | Top 25%
9 | GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialecta | 8.0 | Top 25%
10 | Voting-Based Pitch Estimation with Temporal and Frequential | 8.0 | Top 25%
11 | Identifying the Minimal and Maximal Phonetic Subspace of Spe | 8.0 | Top 25%
12 | Lattice-Guided Consistency Regularization of Dual-Mode Trans | 8.0 | Top 25%
13 | BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Su | 8.0 | Top 25%
14 | Synthetic Data Domain Adaptation for ASR via LLM-Based Text | 8.0 | Top 25%
15 | STACodec: Semantic Token Assignment for Balancing Acoustic F | 8.0 | Top 25%
16 | Language-Infused Retrieval-Augmented CTC with Adaptive Soft- | 8.0 | Top 25%
17 | Relative Time Intervals Representation For Word-Level Timest | 8.0 | Top 25%
18 | RLBR: Reinforcement Learning with Biasing Rewards for Contex | 8.0 | Top 25%
19 | Grey-Box Prompt Tuning With Graph Alignment for Speech-Langu | 8.0 | Top 25%
20 | Frontend Token Enhancement for Token-Based Speech Recognitio | 8.0 | Top 25%
21 | Noise-Robust AV-ASR Using Visual Features both in the Whispe | 8.0 | Top 25%
22 | Synthesized Data Selection via Score Distribution Matching f | 8.0 | Top 25%
23 | Bayesian Low-Rank Factorization for Robust Model Adaptation | 8.0 | Top 25%
24 | nGPT as a Scalable Architecture for Speech Recognition and T | 7.5 | Top 25%
25 | Input-Adaptive Differentiable Filterbanks via Hypernetworks | 7.5 | Top 25%
26 | A Study of Data Selection Strategies for Pre-Training Self-S | 7.5 | Top 25%
27 | K-Function: Joint Pronunciation Transcription and Feedback f | 7.5 | Top 25%
28 | Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning f | 7.5 | Top 25%
29 | Adversarial Fine-Tuning on Speech Foundation Model with Vuln | 7.5 | Top 25%
30 | WAV2LEV: Predicting Levenshtein Edit Operation Sequences For | 7.5 | Top 25%
31 | LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Convers | 7.5 | Top 25%
32 | Whisper-FEST: Single-Channel Far-Field Enhanced Speech-to-te | 7.5 | Top 50%
33 | Production-Scale Dynamic Vocabulary ASR Biasing with Word-Le | 7.5 | Top 25%
34 | Do we really need self-attention for streaming automatic spe | 7.5 | Top 25%
35 | Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recog | 7.5 | Top 25%
36 | Adapting Diarization-Conditioned Whisper for End-to-End Mult | 7.5 | Top 25%
37 | CALM: Joint Contextual Acoustic-Linguistic Modeling for Pers | 7.5 | Top 25%
38 | TTA: Transcribe, Translate and Alignment for Cross-Lingual S | 7.5 | Top 25%
39 | Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annot | 7.5 | Top 25%
40 | LLM-Based Post-ASR Error Correction for Disordered Speech | 7.5 | Top 50%
41 | Content-Preserving Speech Representation Learning Via Adapti | 7.5 | Top 25%
42 | Exploring SSL Discrete Tokens for Multilingual Automatic Spe | 7.5 | Top 25%
43 | TICL: Text-Embedding KNN for Speech in-Context Learning Unlo | 7.5 | Top 25%
44 | Purification Before Fusion: Toward Mask-Free Speech Enhancem | 7.5 | Top 25%
45 | Cross-Modal Bottleneck Fusion for Noise Robust Audio-Visual | 7.5 | Top 25%
46 | Inverse-Hessian Regularization for Continual Learning in ASR | 7.5 | Top 25%
47 | BEST-RQ-based Self-Supervised Learning for Whisper Domain Ad | 7.5 | Top 25%
48 | CCST: Cross-Modal and Consistency-Aware Self-Training for So | 7.5 | Top 25%
49 | Chunk-Wise Attention Transducers for Fast and Accurate Strea | 7.5 | Top 25%
50 | Chunkwise Aligners for Streaming Speech Recognition | 7.5 | Top 25%
51 | FinHuBERT: Hierarchical Feature Imitating Networks for Low-R | 7.5 | Top 25%
52 | UMA-SPLIT: Unimodal Aggregation for Both English and Mandari | 7.5 | Top 25%
53 | MNV-17: A High-Quality Performative Mandarin Dataset for Non | 7.5 | Top 25%
54 | Listen, But Don’t Leak: Sensitive Data Protection for Privac | 7.5 | Top 25%
55 | Confidence-Guided Error Correction for Disordered Speech Rec | 7.5 | Top 25%
56 | Advancing Semi-Supervised Child Speech Recognition with Omni | 7.5 | Top 25%
57 | Variational Low-Rank Adaptation for Personalized Impaired Sp | 7.5 | Top 50%
58 | Decoder-Only Conformer with Modality-Aware Sparse Mixtures o | 7.5 | Top 25%
59 | Cross-Cultural Bias in Mel-Scale Representations: Evidence a | 7.0 | Top 25%
60 | Bridging the Front-End and Back-End for Robust ASR via Cross | 7.0 | Top 25%
61 | TASU: Text-only Alignment for Speech Understanding | 7.0 | Top 25%
62 | Streaming Speech Recognition with Decoder-Only Large Languag | 7.0 | Top 25%
63 | Reducing Prompt Sensitivity in LLM-Based Speech Recognition | 7.0 | Top 25%
64 | PAC: Pronunciation-Aware Contextualized Large Language Model | 7.0 | Top 25%
65 | Investigating The Effect Of Sentence-Level Syntactic Structu | 7.0 | Top 50%
66 | SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD | 7.0 | Top 25%
67 | Three Seconds is Sufficient: A Multi-Pronged Framework for M | 7.0 | Top 50%
68 | In-Sync: Adaptation of Speech Aware Large Language Models fo | 7.0 | Top 50%
69 | AccLID: Accent-aware Language Identification for Robust Mult | 7.0 | Top 25%
70 | BBPE16: UTF-16-Based Byte-Level Byte-Pair Encoding for Impro | 7.0 | Top 50%
71 | Mixtures of Lightweight Articulatory Experts for Multilingua | 7.0 | Top 25%
72 | Towards Orthographically-Informed Evaluation of Speech Recog | 7.0 | Top 25%
73 | Contextual Biasing for ASR in Speech LLM with Common Word Cu | 7.0 | Top 25%
74 | Peeking Into the Future for Contextual Biasing | 7.0 | Top 50%
75 | SLM-TTA: A Framework for Test-Time Adaptation of Generative | 7.0 | Top 50%
76 | Tokenchain: A Discrete Speech Chain via Semantic Token Model | 7.0 | Top 25%
77 | Advanced modeling of interlanguage speech intelligibility be | 7.0 | Top 25%
78 | Leveraging Segment-Level Speech Representations for LLM-Base | 7.0 | Top 50%
79 | Mitigating Attention Sinks and Massive Activations in Audio- | 7.0 | Top 25%
80 | Teaching the Teachers: Boosting Unsupervised Domain Adaptati | 7.0 | Top 25%
81 | Attention2Probability: Attention-Driven Terminology Probabil | 7.0 | Top 25%
82 | Whisper-MLA: Reducing GPU Memory Consumption of ASR Models B | 7.0 | Top 25%
83 | Mind the Shift: Using Delta SSL Embeddings to Enhance Child | 7.0 | Top 25%
84 | PhoenixDSR: Phoneme-Guided and LLM-Enhanced Dysarthric Speec | 7.0 | Top 50%
85 | Audio-Conditioned Diffusion LLMs for ASR and Deliberation Pr | 7.0 | Top 50%
86 | Sequence-Level Unsupervised Training in Speech Recognition: | 6.5 | Top 50%
87 | Ara-BEST-RQ: Multi Dialectal Arabic SSL | 6.5 | Top 50%
88 | Medical ASR Enhancement by Domain-Specific Reinforcement Fin | 6.5 | Top 25%
89 | CTC-DID: CTC-Based Arabic Dialect Identification for Streami | 6.5 | Top 50%
90 | Towards Fair ASR for Second Language Speakers using Fairness | 6.5 | Top 50%
91 | Towards Building Speech Large Language Models for Multitask | 6.5 | Top 25%
92 | Whisper: Courtside Edition - Enhancing ASR Performance throu | 6.5 | Top 50%
93 | SED: Structural Entropy Based Speech Discretization for Disc | 6.5 | Top 50%
94 | Multilingual Supervised Pretraining with Lm-Assisted Decodin | 6.5 | Top 50%
95 | Improving Automatic Speech Recognition by Mitigating Distort | 6.5 | Top 25%
96 | Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Sup | 6.5 | Top 50%
97 | Proficiency-Aware Adaptation and Data Augmentation for Robus | 6.5 | Top 25%
98 | Domain-Aware Scheduling for ASR Fine-Tuning | 6.5 | Top 50%
99 | Online Register For Dual-Mode Self-Supervised Speech Models: | 6.5 | Top 50%
100 | Learning to Align with Unbalanced Optimal Transport in Lingu | 6.5 | Top 50%
101 | How Far Do SSL Speech Models Listen for Tone? Temporal Focus | 6.5 | Top 50%
102 | Leveraging Audio-Visual Data to Reduce the Multilingual Gap | 6.0 | Top 50%

📋 Paper details

🥇 Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
🔥 9.0/10 | Top 25% | #speech-recognition | #large-language-models | #robustness #datasets ...

ICASSP 2026 - Speech Quality Assessment Paper List

ICASSP 2026 - Speech Quality Assessment: 8 papers

Rank | Paper | Score | Tier
🥇 | Bridging the Semantic Gap: Cross-Attentive Fusion for Joint | 8.5 | Top 25%
🥈 | Unseen but Not Unknown: Using Dataset Concealment to Robustl | 8.3 | Top 25%
🥉 | Time vs. Layer: Locating Predictive Cues for Dysarthric Spee | 7.5 | Top 50%
4 | Multi-Task Learning For Speech Quality Assessment Using ASR- | 7.5 | Top 25%
5 | Quality Assessment of Noisy and Enhanced Speech with Limited | 7.0 | Top 25%
6 | SA-SSL-MOS: Self-Supervised Learning MOS Prediction with Spe | 7.0 | Top 50%
7 | Speech Quality-Based Localization of Low-Quality Speech and | 7.0 | Top 25%
8 | A Generalization Strategy for Speech Quality Prediction: Fro | 6.5 | Top 25%

📋 Paper details

🥇 Bridging the Semantic Gap: Cross-Attentive Fusion for Joint Acoustic-Semantic Speech Quality Assessment
🔥 8.5/10 | Top 25% | #speech-quality-assessment | #contrastive-learning | #pretraining #cross-attention ...

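ICASSP 2026 - Voice Conversion #Speech Enhancement Paper List

ICASSP 2026 - Voice Conversion #Speech Enhancement: 1 paper

Rank | Paper | Score | Tier
🥇 | VChangeCodec: An Ultra Low-Complexity Neural Speech Codec wi | 8.0 | Top 25%

📋 Paper details

🥇 VChangeCodec: An Ultra Low-Complexity Neural Speech Codec with Built-In Voice Changer for Customized Real-Time Communication
🔥 8.0/10 | Top 25% | #voice-conversion #speech-enhancement | #end-to-end | #voice-conversion #speech-enhancement

👥 Authors and affiliations
First author: Xusheng Yang (Peking University Shenzhen Graduate School, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; ADSPLAB, School of Electronic and Computer Engineering)
Corresponding author: Yuexian Zou (Peking University Shenzhen Graduate School, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; ADSPLAB, School of Electronic and Computer Engineering)
Author list:
Xusheng Yang (Peking University Shenzhen Graduate School, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; ADSPLAB, School of Electronic and Computer Engineering)
Wei Xiao (Tencent Ethereal Audio Lab)
Bang Yang (Pengcheng Laboratory)
Shidong Shang (Tencent Ethereal Audio Lab)
Yuexian Zou (Peking University Shenzhen Graduate School, Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; ADSPLAB, School of Electronic and Computer Engineering)

💡 Blunt take ...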

ICASSP 2026 - Voice Conversion Paper List

ICASSP 2026 - Voice Conversion: 9 papers

Rank | Paper | Score | Tier
🥇 | FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversio | 8.0 | Top 25%
🥈 | Conditional Diffusion Models for Mental Health-Preserving Vo | 8.0 | Top 25%
🥉 | CosyAccent: Duration-Controllable Accent Normalization using | 7.8 | Top 25%
4 | QE-XVC: Zero-Shot Cross-Lingual Voice Conversion via Query-E | 7.5 | Top 25%
5 | MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion | 7.5 | Top 25%
6 | Expressive Voice Conversion with Controllable Emotional Inte | 7.5 | Top 25%
7 | Lightweight and Perceptually-Guided Voice Conversion for Ele | 7.5 | Top 25%
8 | MeanVoiceFlow: One-Step Nonparallel Voice Conversion with Me | 7.0 | Top 25%
9 | MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice | 6.5 | Top 50%

📋 Paper details

🥇 FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec
🔥 8.0/10 | Top 25% | #voice-conversion | #diffusion-models | #zero-shot #speech-codec ...

ICASSP 2026 - Spoken Question Answering Paper List

ICASSP 2026 - Spoken Question Answering: 3 papers

Rank | Paper | Score | Tier
🥇 | TextlessRAG: End-to-End Visual Document RAG by Speech withou | 8.5 | Top 25%
🥈 | Understanding Textual Capability Degradation in Speech LLMS | 7.5 | Top 25%
🥉 | Advancing Speech Understanding in Speech-Aware Language Mode | 7.0 | Top 25%

📋 Paper details

🥇 TextlessRAG: End-to-End Visual Document RAG by Speech without Text
🔥 8.5/10 | Top 25% | #spoken-question-answering | #end-to-end | #benchmark #cross-modal ...