2026  1287

April  1287

3D Mesh Grid Room Impulse Responses Measured with A Linear Microphone Array And Suppression of Frame Reflections

2026-04-29

A Bayesian Approach to Singing Skill Evaluation Using Semitone Pitch Histogram and MCMC-Based Generated Quantities

2026-04-29

A Bimodal Approach for Detecting Fatigue Using Speech and Personal Assessments in College Students

2026-04-29

A Consistent Learning Depression Detection Framework Integrating Multi-View Attention

2026-04-29

A Data-Driven Framework for Personal Sound Zone Control Addressing Loudspeaker Nonlinearities

2026-04-29

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

2026-04-29

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

2026-04-29

A Dynamic Gated Cross-Attention Framework for Audio-Text Apparent Personality Analysis

2026-04-29

A Feature-Optimized Audio Watermarking Algorithm with Adaptive Embedding Strength

2026-04-29

A Framework for Controlled Multi-Speaker Audio Synthesis for Robustness Evaluation of Speaker Diarisation Systems

2026-04-29

A Generalization Strategy for Speech Quality Prediction: From Domain-Specific to Unified Datasets

2026-04-29

A Generative-First Neural Audio Autoencoder

2026-04-29

A Hybrid Convolution-Mamba Network with Tone-Octave Contrastive Learning for Stratified Semi-Supervised Singing Melody Extraction

2026-04-29

A Learning-Based Automotive Sound Field Reproduction Method Using Plane-Wave Decomposition and Multi-Position Constraint

2026-04-29

A Lightweight Fourier-Based Network for Binaural Speech Enhancement with Spatial Cue Preservation

2026-04-29

An LLM-Driven Acoustic Semantic Enriched Framework for Underwater Acoustic Target Recognition

2026-04-29

A Metric Learning Approach to Heart Murmur Detection from Phonocardiogram Recordings

2026-04-29

A New Method and Dataset for Classroom Teaching Stage Segmentation

2026-04-29

A Noniterative Phase Retrieval Considering the Zeros of STFT Magnitude

2026-04-29

A Novel Monte Carlo Gradient Method Based on Meta-Learning for Effective Step-Size Selection in Active Noise Control

2026-04-29

A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection

2026-04-29

A Personalized Real-Time Proactive Voice Memory Assistant

2026-04-29

A Robust KNN Approach for Multi-Class Laryngeal Disease Detection using MFCC Features

2026-04-29

A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding

2026-04-29

A Speech-Driven Paradigm for Physics-Informed Modeling of Coupled Micro-Speakers

2026-04-29

A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering

2026-04-29

A State-Dependent Markov Diffusion Process for Generative Speech Enhancement

2026-04-29

A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models

2026-04-29

A Superb-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

2026-04-29

A Task-Aware Dual-Level Self-Supervised Learning Method for Effective Sound Event Detection

2026-04-29

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

2026-04-29

A Unified SVD-Modal Solution for Sparse Sound Field Reconstruction with Hybrid Spherical-Linear Microphone Arrays

2026-04-29

An Unsupervised Domain Adaptation Framework For Semi-Supervised Melody Extraction Using Confidence Matrix Replace and Nearest Neighbour Supervision

2026-04-29

ACAVCaps: Enabling Large-Scale Training for Fine-Grained and Diverse Audio Understanding

2026-04-29

Accelerating Regularized Attention Kernel Regression for Spectrum Cartography

2026-04-29

AccLID: Accent-aware Language Identification for Robust Multilingual Speech Recognition

2026-04-29

ACIR-MACL: Effective Multimodal Sentiment Analysis via Attention-Based Causal Intervention Regularization and Multi-Aspect Contrastive Learning

2026-04-29

Acoustic and Facial Markers of Perceived Conversational Success in Spontaneous Speech

2026-04-29

Acoustic Feedback Cancellation in Hearing Aids Exploiting an Inertial Sensor

2026-04-29

Acoustic Non-Stationarity Objective Assessment with Hard Label Criteria for Supervised Learning Models

2026-04-29

Acoustic Teleportation Via Disentangled Neural Audio Codec Representations

2026-04-29

Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition

2026-04-29

Adaptive Deterministic Flow Matching for Target Speaker Extraction

2026-04-29

Adaptive Embedding Fusion with Contrastive Learning for Robust Fully Few-Shot Class-Incremental Audio Classification

2026-04-29

Adaptive Per-Channel Energy Normalization Front-End for Robust Audio Signal Processing

2026-04-29

Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios

2026-04-29

Adaptive Spectral Weighting in Sagittal-Plane Sound Localization: A Reliability-Driven Approach

2026-04-29

Adaptive Task-Incremental Learning For Underwater Acoustic Recognition Based on Mixture-of-Experts Adapter

2026-04-29

Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

2026-04-29

ADH-VA: Adaptive Directed-Hypergraph Convolution with VA Contrastive Learning for Multimodal Conversational Emotion Recognition

2026-04-29

Advanced modeling of interlanguage speech intelligibility benefit with L1-L2 multi-task learning using differentiable K-means for accent-robust discrete token-based ASR

2026-04-29

Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training

2026-04-29

Advancing Semi-Supervised Child Speech Recognition with Omni-Temporal Classification under Label Noise

2026-04-29

Advancing Speech Summarization in Multi-Modal LLMs with Reinforcement Learning

2026-04-29

Advancing Speech Understanding in Speech-Aware Language Models with GRPO

2026-04-29

Adversarial Defense via Generative Speech Enhancement Module

2026-04-29

Adversarial Fine-Tuning on Speech Foundation Model with Vulnerable Attention Consistency Regularization for Robust Speech Recognition

2026-04-29

Adversarial Rivalry Learning for Music Classification

2026-04-29

Affect-Jigsaw: Integrating Core and Peripheral Emotions for Harmonious Fine-Grained Multimodal Emotion Recognition

2026-04-29

AFT: An Exemplar-Free Class Incremental Learning Method for Environmental Sound Classification

2026-04-29

AI-Generated Music Detection in Broadcast Monitoring

2026-04-29

Ailive Mixer: A Deep Learning Based Zero Latency Automatic Music Mixer for Live Music Performances

2026-04-29

AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines

2026-04-29

Aligning Generative Speech Enhancement with Perceptual Feedback

2026-04-29

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

2026-04-29

ALMA-Chor: Leveraging Audio-Lyric Alignment with Mamba for Chorus Detection

2026-04-29

AMBER2: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text

2026-04-29

AmbiDrop: Array-Agnostic Speech Enhancement Using Ambisonics Encoding and Dropout-Based Learning

2026-04-29

AMBISONIC-DML: A Benchmark Dataset for Dynamic Higher-Order Ambisonics Music with Motion-Aligned Stems

2026-04-29

An Anomaly-Aware and Audio-Enhanced Dual-Pathway Framework for Alzheimer’s Disease Progression Classification

2026-04-29

An Audio-Visual Speech Separation Network with Joint Cross-Attention and Iterative Modeling

2026-04-29

An Efficient Neural Network for Modeling Human Auditory Neurograms for Speech

2026-04-29

An End-to-End Multimodal System for Subtitle Recognition and Chinese-Japanese Translation in Short Dramas

2026-04-29

An Envelope Separation Aided Multi-Task Learning Model for Blind Source Counting and Localization

2026-04-29

An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization

2026-04-29

An Unsupervised Alignment Feature Fusion System for Spoken Language-Based Dementia Detection

2026-04-29

Aneural Forward Filtering for Speaker-Image Separation

2026-04-29

AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

2026-04-29

AnyAccomp: Generalizable Accompaniment Generation Via Quantized Melodic Bottleneck

2026-04-29

AnyRIR: Robust Non-Intrusive Room Impulse Response Estimation in the Wild

2026-04-29

APKD: Aligned And Paced Knowledge Distillation Towards Lightweight Heterogeneous Multimodal Emotion Recognition

2026-04-29

AQUA-Bench: Beyond finding answers to knowing when there are None in Audio Question Answering

2026-04-29

AR-BSNet: Towards Ultra-Low Complexity Autoregressive Target Speaker Extraction With Band-Split Modeling

2026-04-29

AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

2026-04-29

Ara-BEST-RQ: Multi Dialectal Arabic SSL

2026-04-29

Arbitrarily Settable Frame Rate Neural Speech Codec with Content Adaptive Variable Length Segmentation

2026-04-29

ARCHI-TTS: A Flow-Matching-Based Text-to-Speech Model with Self-Supervised Semantic Aligner and Accelerated Inference

2026-04-29

Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

2026-04-29

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

2026-04-29

Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

2026-04-29

Assessing the Impact of Speaker Identity in Speech Spoofing Detection

2026-04-29

Assessing The Perceptual Impact of Low-Altitude Aircraft Noise in Cities: An Auralization Framework Using Gaussian Beam Tracing

2026-04-29

Asynchrony-Aware Decoupled Multimodal Control for Cued Speech Video Generation

2026-04-29

ATOM: Adaptive Token-Level Optimal Transport Mixup for Speech Translation

2026-04-29

Atomic Norm Minimization Revisited: Progressive Atom Identification And Refinement

2026-04-29

Attention-Based Encoder-Decoder Target-Speaker Voice Activity Detection for Robust Speaker Diarization

2026-04-29

Attention-Weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied To Speech Emotion Recognition

2026-04-29

Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-text System

2026-04-29

Attentive AV-Fusionnet: Audio-Visual Quality Prediction with Hybrid Attention

2026-04-29

Attentive Masked Self-Distillation for Respiratory Sound Classification

2026-04-29

Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

2026-04-29

Audience-Aware Co-speech Gesture Generation in Public Speaking via Anticipation Tokens

2026-04-29

Audio Classification Models are Vulnerable to Filter Perturbations

2026-04-29

Audio Deepfake Detection at the First Greeting: “Hi!”

2026-04-29

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-04-29

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

2026-04-29

Audio-Guided Multimodal Approach for Fine-Grained Alignment and Boundary Modeling in Active Speaker Detection

2026-04-29

Audio-Text Jailbreak Attack on Large Audio-Language Models: Towards Generality and Stealthiness

2026-04-29

Audio-to-Score Jazz Solo Transcription with the Rhythm Perceiver

2026-04-29

Audio-Visual Deepfake Generation and Detection: An Exploratory Survey

2026-04-29

Audio-Visual Feature Fusion for Calibrating Relevance Scores of Video Moment Retrieval

2026-04-29

AUDIOCARDS: Structured Metadata Improves Audio Language Models for Sound Design

2026-04-29

AudioFuse: Unified Spectral-Temporal Learning Via A Hybrid ViT-1D CNN Architecture for Phonocardiogram Classification

2026-04-29

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

2026-04-29

AUDIOGENIE-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning

2026-04-29

Auditory Illusion Benchmark for Large Audio Language Models

2026-04-29

Auditory-Inspired Transformer for Binaural Speech Enhancement and Spatial Cue Preservation

2026-04-29

AURA: A Stegaformer-Based Scalable Deep Audio Watermark with Extreme Robustness

2026-04-29

Auto-MatchCut: An Audio-Visual Retrieval Framework for Seamless Match Cutting

2026-04-29

Automated Dysphagia Screening Using Noninvasive Neck Acoustic Sensing

2026-04-29

Automatic Estimation of Speaker Diarization Error Rate Based on Features of Audio Quality and Speaker Discriminability

2026-04-29

Automatic Music Mixing Using a Generative Model of Effect Embeddings

2026-04-29

Automatic Music Sample Identification with Multi-Track Contrastive Learning

2026-04-29

AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

2026-04-29

Auxiliary Multi-Label Training For Improving the Robustness of Audio Deepfake Detection on AI-Processed Data

2026-04-29

AVATAR: Audio-Visual Adaptive Fusion via Trained Agent Reinforcement for Multimodal Deepfake Detection

2026-04-29

AVO-65: A Large-Scale Hierarchical Audio-Visual Object Dataset

2026-04-29

B-GRPO: Unsupervised Speech Emotion Recognition Based on Batched-Group Relative Policy Optimization

2026-04-29

BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on POP and Classical Music

2026-04-29

Bayesian Low-Rank Factorization for Robust Model Adaptation

2026-04-29

Bayesian Signal Separation Via Plug-and-Play Diffusion-Within-Gibbs Sampling

2026-04-29

BBPE16: UTF-16-Based Byte-Level Byte-Pair Encoding for Improved Multilingual Speech Recognition

2026-04-29

Beamforming Using Virtual Microphones for Hearing Aid Applications

2026-04-29

Beat and Downbeat Detection: A Reformulated Approach

2026-04-29

BeatMamba: Bidirectional Selective State-Space Modeling for Efficient Beat Tracking

2026-04-29

Behind the Scenes: Mechanistic Interpretability of LoRA-Adapted Whisper for Speech Emotion Recognition

2026-04-29

Benchmarking Humans And Machines On Complex Multilingual Speech Understanding Tasks

2026-04-29

Benchmarking Music Autotagging with MGPHot Expert Annotations vs. Generic Tag Datasets

2026-04-29

BEST-RQ-based Self-Supervised Learning for Whisper Domain Adaptation

2026-04-29

BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

2026-04-29

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

2026-04-29

Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

2026-04-29

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

2026-04-29

Beyond Mapping: Domain-Invariant Representations via Spectral Embedding of Optimal Transport Plans

2026-04-29

Bimodal Fusion Framework for Dynamic Facial Expression Recognition In-The-Wild

2026-04-29

BioSEN: A Bio-Acoustic Signal Enhancement Network for Animal Vocalizations

2026-04-29

BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition

2026-04-29

Bleed No More: Generative Interference Reduction for Musical Recordings

2026-04-29

Bloodroot: When Watermarking Turns Poisonous for Stealthy Backdoor

2026-04-29

Bone-Conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models

2026-04-29

Brainprint-Modulated Target Speaker Extraction

2026-04-29

Break-the-Beat! Controllable MIDI-to-Drum audio synthesis

2026-04-29

BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

2026-04-29

Bridging the Front-End and Back-End for Robust ASR via Cross-Attention-Based U-Net

2026-04-29

Bridging the Measurement–Simulation Gap in Room Acoustics with Real2sim Diffusion

2026-04-29

Bridging the Semantic Gap: Cross-Attentive Fusion for Joint Acoustic-Semantic Speech Quality Assessment

2026-04-29

BSMP-SENet: Band-Split Magnitude-Phase Network for Speech Enhancement

2026-04-29

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

2026-04-29

CaMoD: Causal-Aware Modality Denoising for Multimodal Dialogue Intent Recognition

2026-04-29

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

2026-04-29

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs

2026-04-29

Caption and Audio-Guided Video Representation Learning with Gated Attention for Partially Relevant Video Retrieval

2026-04-29

Cardiobridge-DM: Bridging Cross-Cohort Heart Sound Synthesis via Rhythm-Aware Semi-Supervised Diffusion

2026-04-29

CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

2026-04-29

CCST: Cross-Modal and Consistency-Aware Self-Training for Source-Free Unsupervised Domain Adaptation in Speech Recognition

2026-04-29

Chunk-Wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

2026-04-29

Chunkwise Aligners for Streaming Speech Recognition

2026-04-29

Class-Aware Permutation-Invariant Signal-to-Distortion Ratio for Semantic Segmentation of Sound Scene with Same-Class Sources

2026-04-29

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

2026-04-29

Clue2Emo: A Brain-Inspired Framework for Open-Vocabulary Multimodal Emotion Recognition

2026-04-29

CMSA-Mamba: Hierarchical State Space Modeling for Audio-Based Depression Detection

2026-04-29

Co-Initialization of Control Filter and Secondary Path via Meta-Learning for Active Noise Control

2026-04-29

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

2026-04-29

CodeSep: Low-Bitrate Codec-Driven Speech Separation with Base-Token Disentanglement and Auxiliary-Token Serial Prediction

2026-04-29

Combining Multi-Order Attention and Multi-Resolution Discriminator for High-Fidelity Neural Vocoder

2026-04-29

Combining SSL Speech Features, Contextual Transformers and Mamba Models for Realistic Audio Spoofing Detection

2026-04-29

Compression meets Sampling: LZ78-SPA for Efficient Symbolic Music Generation

2026-04-29

CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-Spoofing Countermeasures

2026-04-29

Condition-Invariant fMRI decoding of speech intelligibility with deep state space model

2026-04-29

Conditional Diffusion Models for Mental Health-Preserving Voice Conversion

2026-04-29

Confidence-Based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens

2026-04-29

Confidence-Guided Error Correction for Disordered Speech Recognition

2026-04-29

Connecting Layer-Wise Representation of WavLM with Spectro-Temporal Modulation on Speaker Verification

2026-04-29

Constraint Optimized Multichannel Mixer-Limiter Design

2026-04-29

Constructing Composite Features for Interpretable Music-Tagging

2026-04-29

Content Anonymization for Privacy in Long-Form Audio

2026-04-29

Content Leakage in LibriSpeech and its Impact on the Privacy Evaluation of Speaker Anonymization

2026-04-29

Content-Preserving Speech Representation Learning Via Adaptive Segment-Level Alignment

2026-04-29

Context-Aware Dynamic Graph Learning for Multimodal Emotion Recognition with Missing Modalities

2026-04-29

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

2026-04-29

Continuation Method for Feedback Delay Network Modal Decomposition

2026-04-29

Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

2026-04-29

Contrastive Timbre Representations for Musical Instrument And Synthesizer Retrieval

2026-04-29

Controllable Embedding Transformation for Mood-Guided Music Retrieval

2026-04-29

Cooperative Multi-Agent Reinforcement Learning for Adaptive Aggregation in Semi-Supervised Federated Learning with non-IID Data

2026-04-29

CosyAccent: Duration-Controllable Accent Normalization using Source-Synthesis Training Data

2026-04-29

Coupling Acoustic Geometry and Visual Semantics for Robust Depth Estimation

2026-04-29

CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content

2026-04-29

Cross-Architecture Knowledge Distillation of WavLM for Lightweight Speaker Verification

2026-04-29

Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

2026-04-29

Cross-Domain Contrastive Learning with Dynamic Threshold Calibration for Source Speaker Tracing

2026-04-29

Cross-Lingual Alzheimer’s Disease Detection with Multimodal LLMs via Speech Cue-Augmented Prompting and Instruction Tuning

2026-04-29

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

2026-04-29

Cross-Lingual Interleaving for Speech Language Models

2026-04-29

Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh

2026-04-29

Cross-Modal Bottleneck Fusion for Noise Robust Audio-Visual Speech Recognition

2026-04-29

Cross-Modal Knowledge Distillation for Speech Large Language Models

2026-04-29

CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications

2026-04-29

Curriculum Learning with Contrastive Loss for Lightweight Speaker Verification

2026-04-29

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

2026-04-29

D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation from Lead Sheet

2026-04-29

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

2026-04-29

DAMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

2026-04-29

DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network

2026-04-29

DBFT-SD: Weakly Supervised Multimodal Detection of Sensitive Audio-Visual Content

2026-04-29

DDSC: Dynamic Dual-Signal Curriculum for Data-Efficient Acoustic Scene Classification Under Domain Shift

2026-04-29

DDSR-Net: Robust Multimodal Sentiment Analysis via Dynamic Modality Reliability Assessment

2026-04-29

DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG

2026-04-29

Decoder-Only Conformer with Modality-Aware Sparse Mixtures of Experts for ASR

2026-04-29

Decorrelation-Enhanced Multiband Subband Adaptive Filtering for RIR Tracking in Sound Field Control

2026-04-29

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS

2026-04-29

Deep Learning-Based Joint Optimization of Adaptive Feedback Cancellation and Residual Feedback Suppression for Hearing Aids

2026-04-29

Deep Spatial Clue Informed Ambisonic Encoding for Irregular Microphone Arrays

2026-04-29

Deepaq: A Perceptual Audio Quality Metric Based on Foundational Models and Weakly Supervised Learning

2026-04-29

Denoising Of Stochastic Ray Tracing Room Impulse Responses

2026-04-29

DepthTalk: Few-Shot Talking Head Generation with Depth-Aware 3D Gaussian Field Motion

2026-04-29

Detecting and Attributing Synthetic Spanish Speech: The HISPASpoof Dataset

2026-04-29

DGSDNet: Dual-Graph Spectral Diffusion Network for Incomplete Multimodal Emotion Recognition in Conversations

2026-04-29

Diff-vs: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

2026-04-29

Diffemotalk: Audio-Driven Facial Animation with Fine-Grained Emotion Control via Diffusion Models

2026-04-29

Differentiable Grouped Feedback Delay Networks for Learning Direction and Position-Dependent Late Reverberation

2026-04-29

Differentiable Pulsetable Synthesis for Wind Instrument Modeling

2026-04-29

Diffusion Timbre Transfer via Mutual Information Guided Inpainting

2026-04-29

Direct Preference Optimization For Speech Autoregressive Diffusion Models

2026-04-29

Direct Simultaneous Translation Activation for Large Audio-Language Models

2026-04-29

Direct Transfer of Prosody in Speech-to-speech Translation using Disentangled Speech Tokens

2026-04-29

Directly Trained Spiking Neural Networks with Adaptive Phase Coding

2026-04-29

DisContSE: Single-Step Diffusion Speech Enhancement based on Joint Discrete and Continuous Embeddings

2026-04-29

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens

2026-04-29

Discrete-Continuous Fusion With Adaptive Hierarchical Features For Audio Deepfake Detection

2026-04-29

Disentangled Authenticity Representation for Partially Deepfake Audio Localization

2026-04-29

Disentangling Physiology from Fidelity: Latent-Guided Diffusion Models for Cross-Modal Cardiac Synthesis

2026-04-29

Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch

2026-04-29

DISSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration

2026-04-29

Distilling Attention Knowledge for Speaker Verification

2026-04-29

Distributed Multichannel Active Noise Control with Asynchronous Communication

2026-04-29

DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

2026-04-29

DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

2026-04-29

Diverse and Few-Step Audio Captioning via Flow Matching

2026-04-29

DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance

2026-04-29

Do Bias Benchmarks Generalise? Evidence from Voice-Based Evaluation of Gender Bias in SpeechLLMs

2026-04-29

Do Foundational Audio Encoders Understand Music Structure?

2026-04-29

Do Speech LLMs Learn Crossmodal Embedding Spaces?

2026-04-29

Do We Need EMA for Diffusion-Based Speech Enhancement? Toward A Magnitude-Preserving Network Architecture

2026-04-29

Do we really need self-attention for streaming automatic speech recognition?

2026-04-29

Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-to-Speech Systems

2026-04-29

Does the Pre-Training of an Embedding Influence its Encoding of Age?

2026-04-29

DOMA: Leveraging Diffusion Language Models with Adaptive Prior for Intent Classification and Slot Filling

2026-04-29

Domain Partitioning Meets Parameter-Efficient Fine-Tuning: A Novel Method for Improved Language-Queried Audio Source Separation

2026-04-29

Domain-Aware Scheduling for ASR Fine-Tuning

2026-04-29

Domain-Invariant Representation Learning of Bird Sounds

2026-04-29

DPO-Regularized Regression for Age Prediction

2026-04-29

DPT-Net: Dual-Path Transformer Network with Hierarchical Fusion for EEG-based Envelope Reconstruction

2026-04-29

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

2026-04-29

DSRMS-TransUnet: A Decentralized Non-Shifted TransUnet for Shallow Water Acoustic Source Range Estimation

2026-04-29

DSSR: Decoupling Salient and Subtle Representations Under Missing Modalities for Multimodal Emotion Recognition

2026-04-29

Dual Contrastive Learning for Semi-Supervised Domain Adaptation in Bi-Modal Depression Recognition

2026-04-29

Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

2026-04-29

Dual-Perspective Multimodal Sentiment Analysis with MoE Fusion: Representation Learning via Semantic Resonance and Divergence

2026-04-29

Dual-Strategy-Enhanced Conbimamba for Neural Speaker Diarization

2026-04-29

Dynamic Balanced Cross-Modal Attention with Gated Sequence Restoration: Towards Robust Multimodal Sentiment Analysis

2026-04-29

Dynamic Noise-Aware Multi LoRA Framework Towards Real-World Audio Deepfake Detection

2026-04-29

Dynamic Spectrogram Analysis with Local-Aware Graph Networks for Audio Anti-Spoofing

2026-04-29

Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training

2026-04-29

E2E-AEC: Implementing An End-To-End Neural Network Learning Approach for Acoustic Echo Cancellation

2026-04-29

Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

2026-04-29

ECHO: Frequency-Aware Hierarchical Encoding for Variable-Length Signals

2026-04-29

EchoFake: A Replay-Aware Dataset For Practical Speech Deepfake Detection

2026-04-29

EchoRAG: A Two-Stage Framework for Audio-Text Retrieval and Temporal Grounding

2026-04-29

ECSA: Dual-Branch Emotion Compensation for Emotion-Consistent Speaker Anonymization

2026-04-29

EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting

2026-04-29

EEG and Eye-Tracking Driven Dynamic Target Speaker Extraction with Spontaneous Attention Switching

2026-04-29

EEND-SAA: Enrollment-Less Main Speaker Voice Activity Detection Using Self-Attention Attractors

2026-04-29

Efficient Audio-Visual Inference Via Token Clustering And Modality Fusion

2026-04-29

Efficient Depression Detection from Speech via Language-Independent Prompt-Driven Reprogramming

2026-04-29

Efficient Solutions for Mitigating Initialization Bias in Unsupervised Self-Adaptive Auditory Attention Decoding

2026-04-29

EMG-to-Speech with Fewer Channels

2026-04-29

Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling

2026-04-29

Emo-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition

2026-04-29

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

2026-04-29

EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

2026-04-29

Emotion-Aligned Generation in Diffusion Text to Speech Models Via Preference-Guided Optimization

2026-04-29

Emotional Damage: Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

2026-04-29

Emotional Dimension Control in Language Model-Based Text-To-Speech: Spanning a Broad Spectrum of Human Emotions

2026-04-29

EmoTri-RL: Emotion- and Cause-Aware Reinforcement Learning for Multi-Modal Empathetic Dialogue

2026-04-29

Empowering Multimodal Respiratory Sound Classification with Counterfactual Adversarial Debiasing for Out-of-Distribution Robustness

2026-04-29

Enabling Multi-Species Bird Classification on Low-Power Bioacoustic Loggers

2026-04-29

Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

2026-04-29

Enhanced Generative Machine Listener

2026-04-29

Enhancing Audio Question-Answering Performance Through Log-Likelihood Guided Reward Functions

2026-04-29

Enhancing Automatic Drum Transcription with Online Dynamic Few-Shot Learning

2026-04-29

Enhancing Dialogue-Related Speech Tasks with Generated Spoken Dialogues

2026-04-29

Enhancing Noise Robustness for Neural Speech Codecs Through Resource-Efficient Progressive Quantization Perturbation Simulation

2026-04-29

Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation Guided Structured Pruning

2026-04-29

Enhancing Speech Intelligibility Prediction for Hearing Aids with Complementary Speech Foundation Model Representations

2026-04-29

Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec

2026-04-29

Equipping Large Language Model with Directional Speech Understanding Capabilities

2026-04-29

Erasing Your Voice Before it’s Heard: Training-Free Speaker Unlearning for Zero-Shot Text-to-Speech

2026-04-29

Estimating Hand-Related Features from Speech Using Machine Learning

2026-04-29

Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening

2026-04-29

Etude: Piano Cover Generation with a Three-Stage Approach — Extract, Structuralize, and Decode

2026-04-29

EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

2026-04-29

Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

2026-04-29

Evaluating Compositional Structure in Audio Representations

2026-04-29

Evaluating Disentangled Representations for Controllable Music Generation

2026-04-29

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

2026-04-29

Evaluating High-Resolution Piano Sustain Pedal Depth Estimation with Musically Informed Metrics

2026-04-29

Evaluating Pretrained Speech Embedding Systems for Dysarthria Detection Across Heterogeneous Datasets

2026-04-29

Event Classification by Physics-Informed Inpainting for Distributed Multichannel Acoustic Sensor with Partially Degraded Channels

2026-04-29

Exploring Fine-Tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data

2026-04-29

Exploring How Audio Effects Alter Emotion with Foundation Models

2026-04-29

Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement

2026-04-29

Exploring SSL Discrete Tokens for Multilingual Automatic Speech Recognition

2026-04-29

Expressive Voice Conversion with Controllable Emotional Intensity

2026-04-29

Exterior Sound Field Estimation Based on Physics-Constrained Kernel

2026-04-29

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

2026-04-29

Face-Voice Association with Inductive Bias for Maximum Class Separation

2026-04-29

Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

2026-04-29

Fast-ULCNet: A Fast and Ultra Low Complexity Network for Single-Channel Speech Enhancement

2026-04-29

FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference

2026-04-29

FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement

2026-04-29

FD-ARL: Feature Disentanglement with Adversarial-Reconstruction Learning for Cross-Subject Auditory Attention Decoding

2026-04-29

FDCNet: Frequency Domain Channel Attention and Convolution for Lipreading

2026-04-29

FED-PISA: Federated Voice Cloning Via Personalized Identity-Style Adaptation

2026-04-29

Feedback-Driven Retrieval-Augmented Audio Generation with Large Audio Language Models

2026-04-29

Few-Shot Recognition of Audio Deepfake Generators using Graph-Based Prototype Adaptation

2026-04-29

FIDIC: Fine-Grained Conversational Emotion Recognition via Individual Differences in Inertia and Contagion

2026-04-29

Fine-Grained Frame Modeling in Multi-Head Self-Attention for Speech Deepfake Detection

2026-04-29

Fine-Tuning BigVGAN-v2 for Robust Musical Tuning Preservation

2026-04-29

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

2026-04-29

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

2026-04-29

FinHuBERT: Hierarchical Feature Imitating Networks for Low-Resource Speech Recognition

2026-04-29

FlashFoley: Fast Interactive Sketch2audio Generation

2026-04-29

Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks

2026-04-29

Flexio: Flexible Single- and Multi-Channel Speech Separation and Enhancement

2026-04-29

FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning

2026-04-29

FOCA: Multimodal Malware Classification via Hyperbolic Cross-Attention

2026-04-29

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

2026-04-29

FODGE: High-Fidelity Dance Generation via Full-Body Optimization

2026-04-29

FoleyBench: A Benchmark for Video-to-Audio Models

2026-04-29

Forward Convolutive Prediction for Frame Online Monaural Speech Dereverberation based on Kronecker Product Decomposition

2026-04-29

Frame-Stacked Local Transformers for Efficient Multi-Codebook Speech Generation

2026-04-29

Frequency-Independent Ambisonics Upscaling Using Deep Learning

2026-04-29

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-Modal Understanding in Multimodal LLMs

2026-04-29

From Diet to Free Lunch: Estimating Auxiliary Signal Properties Using Dynamic Pruning Masks in Speech Enhancement Networks

2026-04-29

From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

2026-04-29

From Human Speech to Ocean Signals: Transferring Speech Large Models for Underwater Acoustic Target Recognition

2026-04-29

Frontend Token Enhancement for Token-Based Speech Recognition

2026-04-29

Full Band Denoising of Room Impulse Response in the Wavelet Domain with Dictionary Learning

2026-04-29

FUN-SSL: Full-Band Layer Followed by U-Net With Narrow-Band Layers for Multiple Moving Sound Source Localization

2026-04-29

FUSEMOS: Perceptual Evaluation of Text-to-Music Generation with Dual-Encoder Fusion and Ranking-Aware Composite Loss

2026-04-29

Fusion of Multimodal Estimations by Extended State Hidden Markov Model: Application to Fetal Heart Rate Monitoring

2026-04-29

FxSearcher: Gradient-Free Text-Driven Audio Transformation

2026-04-29

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

2026-04-29

Gdiffuse: Diffusion-Based Speech Enhancement with Noise Model Guidance

2026-04-29

Gelina: Unified Speech and Gesture Synthesis Via Interleaved Token Prediction

2026-04-29

Gen-SER: When the Generative Model Meets Speech Emotion Recognition

2026-04-29

Generalizability of Predictive and Generative Speech Enhancement Models to Pathological Speakers

2026-04-29

Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker

2026-04-29

Generating Moving 3D Soundscapes with Latent Diffusion Models

2026-04-29

Generative Audio Extension and Morphing

2026-04-29

Generative UI as an Accessibility Bridge: Lessons from C2C E-Commerce

2026-04-29

GLA-GRAD++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis

2026-04-29

GLAP: General Contrastive Audio-Text Pretraining Across Domains and Languages

2026-04-29

GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

2026-04-29

GLUE: Gradient-free Learning to Unify Experts

2026-04-29

GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

2026-04-29

Graph-Based Emotion Consensus Perception Learning for Multimodal Emotion Recognition in Conversation

2026-04-29

Graph-based Modality Alignment for Robustness in Conversational Emotion Recognition

2026-04-29

Graph-Biased EEG Transformers for Silent Speech Decoding

2026-04-29

Grey-Box Prompt Tuning With Graph Alignment for Speech-Language Models

2026-04-29

GRNet: Graph Reconstruction Network for Robust Multimodal Sentiment Analysis

2026-04-29

Group Relative Policy Optimization for Text-to-Speech with Large Language Models

2026-04-29

Group-Sparse Gaussian Process Regression for Inhomogeneous Sound Field Estimation

2026-04-29

H-nnPBFDAF: Hierarchical Neural Network Partitioned Block Frequency Domain Adaptive Filter with Novel Block Activation Probability

2026-04-29

Hair Noise Analysis and Mitigation for Smart Glasses Audio Captures

2026-04-29

Hanui: Harnessing Distributional Discrepancies for Singing Voice Deepfake Detection

2026-04-29

HarmoNet: Music Grounding by Short Video via Harmonic Resample and Dynamic Sparse Alignment

2026-04-29

Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models

2026-04-29

HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection with Multichannel Audio and Multiscale Visual Cues

2026-04-29

HCGAN: Harmonic-Coupled Generative Adversarial Network for Speech Super-Resolution in Low-Bandwidth Scenarios

2026-04-29

HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-Based TTS

2026-04-29

HergNet: A Fast Neural Surrogate Model for Sound Field Predictions Via Superposition of Plane Waves

2026-04-29

HFSQVAE: Hierarchical Vector Quantization with Residuals for Frequency-Specific Embedding

2026-04-29

Hierarchical Activity Recognition and Captioning from Long-Form Audio

2026-04-29

Hierarchical Discrete Flow Matching For Multi-Codebook Codec-Based Text-To-Speech

2026-04-29

Hierarchical Tokenization of Multimodal Music Data for Generative Music Retrieval

2026-04-29

HiFi-HARP: A High-Fidelity 7th-Order Ambisonic Room Impulse Response Dataset

2026-04-29

High-Fidelity Speech Enhancement Via Discrete Audio Tokens

2026-04-29

How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-Resource Transfer

2026-04-29

How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

2026-04-29

Huí Sù: Co-constructing a Dual Feedback Apparatus

2026-04-29

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

2026-04-29

HVAC-EAR: Eavesdropping Human Speech Using HVAC Systems

2026-04-29

Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

2026-04-29

HyFlowSE: Hybrid End-To-End Flow-Matching Speech Enhancement via Generative-Discriminative Learning

2026-04-29

I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-Based Single-Channel Speech Enhancement

2026-04-29

IBPCodec: A Low-Bitrate Lightweight Speech Codec With Inter-Band Prediction

2026-04-29

ICASSP 2026 - Active Noise Control Paper List

2026-04-29

ICASSP 2026 - Active Noise Cancellation Paper List

2026-04-29

ICASSP 2026 - Topic Modeling Paper List

2026-04-29

ICASSP 2026 - Signal Processing Paper List

2026-04-29

ICASSP 2026 - Keyword Spotting Paper List

2026-04-29

ICASSP 2026 - Medical AI Paper List

2026-04-29

ICASSP 2026 - Auditory Attention Decoding Paper List

2026-04-29

ICASSP 2026 - Auditory Attention Decoding Paper List

2026-04-29

ICASSP 2026 - Noise Control Paper List

2026-04-29

ICASSP 2026 - Echo Cancellation Paper List

2026-04-29

ICASSP 2026 - Benchmarking Paper List

2026-04-29

ICASSP 2026 - Fundamental Frequency Estimation Paper List

2026-04-29

ICASSP 2026 - Sound Field Estimation Paper List

2026-04-29

ICASSP 2026 - Acoustic Modeling Paper List

2026-04-29

ICASSP 2026 - Sound Source Localization Paper List

2026-04-29

ICASSP 2026 - Multimodal Learning Paper List

2026-04-29

ICASSP 2026 - Multimodal Dialogue Intent Recognition Paper List

2026-04-29

ICASSP 2026 - Multimodal Sentiment Analysis Paper List

2026-04-29

ICASSP 2026 - Multimodal Emotion Recognition Paper List

2026-04-29

ICASSP 2026 - Multimodal Models Paper List

2026-04-29

ICASSP 2026 - Multichannel Paper List

2026-04-29

ICASSP 2026 - Multi-Pitch Estimation #Note Tracking Paper List

2026-04-29

ICASSP 2026 - Entity Disambiguation Paper List

2026-04-29

ICASSP 2026 - Real-Time Processing Paper List

2026-04-29

ICASSP 2026 - Adversarial Examples Paper List

2026-04-29

ICASSP 2026 - Anomalous Sound Detection Paper List

2026-04-29

ICASSP 2026 - Sentiment Analysis Paper List

2026-04-29

ICASSP 2026 - Emotion Recognition Paper List

2026-04-29

ICASSP 2026 - Room Impulse Response Paper List

2026-04-29

ICASSP 2026 - Room Impulse Response Denoising Paper List

2026-04-29

ICASSP 2026 - Datasets Paper List

2026-04-29

ICASSP 2026 - Dataset Alignment Paper List

2026-04-29

ICASSP 2026 - Slot Filling Paper List

2026-04-29

ICASSP 2026 - Model Evaluation Paper List

2026-04-29

ICASSP 2026 - Singing Melody Extraction Paper List

2026-04-29

ICASSP 2026 - Singing Voice Synthesis Paper List

2026-04-29

ICASSP 2026 - Singing Voice Transcription Paper List

2026-04-29

ICASSP 2026 - Singing Voice Conversion Paper List

2026-04-29

ICASSP 2026 - Underwater Acoustic Target Recognition Paper List

2026-04-29

ICASSP 2026 - Bioacoustics Paper List

2026-04-29

ICASSP 2026 - Target Speaker Extraction Paper List

2026-04-29

ICASSP 2026 - Neural Decoding Paper List

2026-04-29

ICASSP 2026 - Spatial Audio Paper List

2026-04-29

ICASSP 2026 - Federated Learning Paper List

2026-04-29

ICASSP 2026 - Brain Signal Encoding Paper List

2026-04-29

ICASSP 2026 - Brain-Computer Interface Paper List

2026-04-29

ICASSP 2026 - Dance Generation Paper List

2026-04-29

ICASSP 2026 - Visual Speech Recognition Paper List

2026-04-29

ICASSP 2026 - Video-to-Audio Generation Paper List

2026-04-29

ICASSP 2026 - Video Retrieval Paper List

2026-04-29

ICASSP 2026 - Video Moment Retrieval Paper List

2026-04-29

ICASSP 2026 - Video Understanding Paper List

2026-04-29

ICASSP 2026 - Video Generation Paper List

2026-04-29

ICASSP 2026 - Video Device Identification Paper List

2026-04-29

ICASSP 2026 - Video Question Answering Paper List

2026-04-29

ICASSP 2026 - Video Highlight Detection Paper List

2026-04-29

ICASSP 2026 - Speech Forgery Detection Paper List

2026-04-29

ICASSP 2026 - Voice Cloning Paper List

2026-04-29

ICASSP 2026 - Speech Separation Paper List

2026-04-29

ICASSP 2026 - Voice Anonymization Paper List

2026-04-29

ICASSP 2026 - Speech Discovery Paper List

2026-04-29

ICASSP 2026 - Speech Synthesis Paper List

2026-04-29

ICASSP 2026 - Speech Enhancement #Adversarial Defense Paper List

2026-04-29

ICASSP 2026 - Speech Enhancement Paper List

2026-04-29

ICASSP 2026 - Large Speech Models Paper List

2026-04-29

ICASSP 2026 - Spoken Dialogue Systems Paper List

2026-04-29

ICASSP 2026 - Speech Emotion Recognition Paper List

2026-04-29

ICASSP 2026 - Speech Summarization Paper List

2026-04-29

ICASSP 2026 - Voice Activity Detection Paper List

2026-04-29

ICASSP 2026 - Speech Understanding Paper List

2026-04-29

ICASSP 2026 - Speech Generation Paper List

2026-04-29

ICASSP 2026 - Speech Biomarkers Paper List

2026-04-29

ICASSP 2026 - Speech Coding Paper List

2026-04-29

ICASSP 2026 - Speech Encoders Paper List

2026-04-29

ICASSP 2026 - Speech Translation Paper List

2026-04-29

ICASSP 2026 - Speech Representation Learning Paper List

2026-04-29

ICASSP 2026 - Speech Decoding Paper List

2026-04-29

ICASSP 2026 - Speech Assessment Paper List

2026-04-29

ICASSP 2026 - Speech Recognition #Speech Synthesis Paper List

2026-04-29

ICASSP 2026 - Speech Recognition #Speech Translation Paper List

2026-04-29

ICASSP 2026 - Speech Recognition Paper List

2026-04-29

ICASSP 2026 - Speech Quality Assessment Paper List

2026-04-29

ICASSP 2026 - Voice Conversion #Speech Enhancement Paper List

2026-04-29

ICASSP 2026 - Voice Conversion Paper List

2026-04-29

ICASSP 2026 - Spoken Question Answering Paper List

2026-04-29

ICASSP 2026 - Speech-Driven Motion Generation Paper List

2026-04-29

ICASSP 2026 - Speaker Separation Paper List

2026-04-29

ICASSP 2026 - Speaker Synthesis Paper List

2026-04-29

ICASSP 2026 - Speaker Diarization #Speech Separation Paper List

2026-04-29

ICASSP 2026 - Speaker Diarization Paper List

2026-04-29

ICASSP 2026 - Speaker Detection Paper List

2026-04-29

ICASSP 2026 - Speaker Generation Paper List

2026-04-29

ICASSP 2026 - Talking Face Generation Paper List

2026-04-29

ICASSP 2026 - Speaker Recognition Paper List

2026-04-29

ICASSP 2026 - Speaker Verification Paper List

2026-04-29

ICASSP 2026 - Classroom Stage Segmentation Paper List

2026-04-29

ICASSP 2026 - Cross-Modal Paper List

2026-04-29

ICASSP 2026 - Cross-Modal Retrieval Paper List

2026-04-29

ICASSP 2026 - Mild Cognitive Impairment Detection Paper List

2026-04-29

ICASSP 2026 - Transfer Learning Paper List

2026-04-29

ICASSP 2026 - Zero-Shot Keyword Spotting Paper List

2026-04-29

ICASSP 2026 - Music Information Retrieval Paper List

2026-04-29

ICASSP 2026 - Music Separation Paper List

2026-04-29

ICASSP 2026 - Music Classification Paper List

2026-04-29

ICASSP 2026 - Music Recommendation Paper List

2026-04-29

ICASSP 2026 - Music Retrieval Paper List

2026-04-29

ICASSP 2026 - Music Mixing Paper List

2026-04-29

ICASSP 2026 - Music Source Separation Paper List

2026-04-29

ICASSP 2026 - Music Source Extraction Paper List

2026-04-29

ICASSP 2026 - Music Understanding Paper List

2026-04-29

ICASSP 2026 - Music Generation Paper List

2026-04-29

ICASSP 2026 - Music Transcription Paper List

2026-04-29

ICASSP 2026 - Audio-Visual Paper List

2026-04-29

ICASSP 2026 - Audio-Visual Instance Segmentation Paper List

2026-04-29

ICASSP 2026 - Audio Event Detection Paper List

2026-04-29

ICASSP 2026 - Audio Signal Processing Paper List

2026-04-29

ICASSP 2026 - Audio Separation Paper List

2026-04-29

ICASSP 2026 - Audio Classification #Zero-Shot Learning Paper List

2026-04-29

ICASSP 2026 - Audio Classification Paper List

2026-04-29

ICASSP 2026 - Audio Compression Paper List

2026-04-29

ICASSP 2026 - Audio Scene Classification Paper List

2026-04-29

ICASSP 2026 - Audio Scene Understanding Paper List

2026-04-29

ICASSP 2026 - Audio Enhancement Paper List

2026-04-29

ICASSP 2026 - Large Audio Models Paper List

2026-04-29

ICASSP 2026 - Audio Captioning Paper List

2026-04-29

ICASSP 2026 - Audio Security Paper List

2026-04-29

ICASSP 2026 - Audio Description Paper List

2026-04-29

ICASSP 2026 - Audio Effect Estimation Paper List

2026-04-29

ICASSP 2026 - Lossless Audio Coding Paper List

2026-04-29

ICASSP 2026 - Audio Retrieval #Audio Classification Paper List

2026-04-29

ICASSP 2026 - Audio Retrieval Paper List

2026-04-29

ICASSP 2026 - Audio Watermarking Paper List

2026-04-29

ICASSP 2026 - Audio Deepfake Detection Paper List

2026-04-29

ICASSP 2026 - Audio Generation Paper List

2026-04-29

ICASSP 2026 - Audio Editing Paper List

2026-04-29

ICASSP 2026 - Audio Quality Assessment Paper List

2026-04-29

ICASSP 2026 - Audio Super-Resolution Paper List

2026-04-29

ICASSP 2026 - Audio Question Answering Paper List

2026-04-29

ICASSP 2026 - Pretraining Paper List

2026-04-29

ICASSP 2026 - Domain Adaptation Paper List

2026-04-29

ICASSP 2026 Speech/Audio Papers: Detailed Analysis

2026-04-29

Identifying Birdsong Syllables without Labelled Data

2026-04-29

Identifying the Minimal and Maximal Phonetic Subspace of Speech Representations

2026-04-29

Identity Leakage Through Accent Cues in Voice Anonymisation

2026-04-29

Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

2026-04-29

Improving Active Learning for Melody Estimation by Disentangling Uncertainties

2026-04-29

Improving Anomalous Sound Detection with Attribute-Aware Representation from Domain-Adaptive Pre-Training

2026-04-29

Improving Audio Event Recognition with Consistency Regularization

2026-04-29

Improving Audio Question Answering with Variational Inference

2026-04-29

Improving Automatic Speech Recognition by Mitigating Distortions Introduced by Speech Enhancement Under Drone Noise

2026-04-29

Improving Binaural Distance Estimation in Reverberant Rooms Through Contrastive And Multi-Task Learning

2026-04-29

Improving Contextual ASR Via Multi-Grained Fusion With Large Language Models

2026-04-29

Improving Interpretability in Generative Multitimbral DDSP Frameworks via Semantically-Disentangled Musical Attributes

2026-04-29

Improving Multimodal Brain Encoding Model with Dynamic Subject-Awareness Routing

2026-04-29

Improving the Speaker Anonymization Evaluation’s Robustness to Target Speakers with Adversarial Learning

2026-04-29

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word level timestamp predictions

2026-04-29

InconVAD: A Two-Stage Dual-Tower Framework for Multimodal Emotion Inconsistency Detection

2026-04-29

Incremental Learning for Audio Classification with Hebbian Deep Neural Networks

2026-04-29

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

2026-04-29

Individualize the HRTF Neural Field Using Anthropometric Parameters Weighted by Direction-Attention

2026-04-29

Influence of Clean Speech Characteristics on Speech Enhancement Performance

2026-04-29

Influence-Aware Curation and Active Selection for Industrial and Surveillance Sound Events

2026-04-29

Input-Adaptive Differentiable Filterbanks via Hypernetworks for Robust Speech Processing

2026-04-29

InstructAudio: Unified Speech and Music Generation with Natural Language Instruction

2026-04-29

Instrument Generation Through Distributional Flow Matching and Test-Time Search

2026-04-29

Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation

2026-04-29

Integrating Speaker Embeddings and LLM-Derived Semantic Representations for Streaming Speaker Diarization

2026-04-29

Inter-Dialog Contrastive Learning for Multimodal Emotion Recognition in Conversations

2026-04-29

Interpretable Music Harmonic Analysis Through Multilinear Mixture of Experts

2026-04-29

Interval-Aware Retrieval Framework For Speech-Based Automatic Alzheimer’s Detection

2026-04-29

Inverse-Hessian Regularization for Continual Learning in ASR

2026-04-29

Investigating Modality Contribution in Audio LLMs for Music

2026-04-29

Investigating The Effect Of Sentence-Level Syntactic Structure On Information Loss In The Human Auditory System

2026-04-29

Is Phase Really Needed for Weakly-Supervised Dereverberation?

2026-04-29

It Is Personal: The Importance of Personalization for Recognizing Self-Reported Emotion

2026-04-29

Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation

2026-04-29

Joint Deep Secondary Path Estimation and Adaptive Control for Active Noise Cancellation

2026-04-29

Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-Task Multi-Scale Network

2026-04-29

Joint Estimation of Primary and Secondary Paths for Personalized Hearable Applications

2026-04-29

Joint Multichannel Acoustic Feedback Cancellation and Speaker Extraction via Kalman Filter and Deep Non-Linear Spatial Filter

2026-04-29

K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

2026-04-29

KAN We Make Models Simpler for Audio Deepfake Detection with Kolmogorov–Arnold Networks?

2026-04-29

Keeping Models Listening: Segment- and time-aware attention rescaling at decoding time

2026-04-29

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

2026-04-29

KSDIFF: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

2026-04-29

LAFUFU: Latent Acoustic Features For Ultra-Fast Utterance Restoration

2026-04-29

LAMB: LLM-Based Audio Captioning with Modality Gap Bridging Via Cauchy-Schwarz Divergence

2026-04-29

Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating for Robust Code-Switching ASR

2026-04-29

Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition

2026-04-29

Learnable Mel-Frontend for Robust Underwater Acoustic Target Detection under Non-Target Interference

2026-04-29

Learning Domain-Robust Bioacoustic Representations for Mosquito Species Classification with Contrastive Learning and Distribution Alignment

2026-04-29

Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

2026-04-29

Learning Piezoelectric Hysteresis in In-Ear MEMS Loudspeakers from Acoustic Measurements

2026-04-29

Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR

2026-04-29

Learning Vocal-Tract Area And Radiation With A Physics-Informed Webster Model

2026-04-29

Learning What to Hear: Boosting Sound-Source Association for Robust Audiovisual Instance Segmentation

2026-04-29

LenslessMic: Audio Encryption and Authentication via Lensless Computational Imaging

2026-04-29

LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data

2026-04-29

LETPAV: Lexicon-Enhanced Text with Progressive Audio-Visual Fusion for Multimodal Sentiment Analysis

2026-04-29

Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models

2026-04-29

Leveraging Diffusion U-Net Features for Predominant Instrument Recognition

2026-04-29

Leveraging Large Multimodal Models for Audio-Video Deepfake Detection: A Pilot Study

2026-04-29

Leveraging Large Speech Language Models as Evaluators for Expressive Speech

2026-04-29

Leveraging Multiple Speech Enhancers for Non-Intrusive Intelligibility Prediction for Hearing-Impaired Listeners

2026-04-29

Leveraging prediction entropy for Automatic prompt weighting in Zero-Shot Audio-Language Classification

2026-04-29

Leveraging Segment-Level Speech Representations for LLM-Based Speech Recognition

2026-04-29

Leveraging Text-to-Speech and Voice Conversion as Data Augmentation for Alzheimer’s Disease Detection from Spontaneous Speech

2026-04-29

Leveraging Whisper Embeddings For Audio-Based Lyrics Matching

2026-04-29

Lightweight and Generalizable Acoustic Scene Representations Via Contrastive Fine-Tuning and Distillation

2026-04-29

Lightweight and Perceptually-Guided Voice Conversion for Electro-Laryngeal Speech

2026-04-29

Lightweight Implicit Neural Network for Binaural Audio Synthesis

2026-04-29

Lightweight Phoneme-Conditioned Bandwidth Extension for Body-Conducted Speech

2026-04-29

Lingometer: On-Device Personal Speech Word Counting System

2026-04-29

Linguard: Authenticating Speech Recordings Using Speech Recognition and Watermark

2026-04-29

LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-And-Play Dereverberation

2026-04-29

Lisa: Lightweight Yet Superb Neural Speech Coding

2026-04-29

Listen, But Don’t Leak: Sensitive Data Protection for Privacy Aware Automatic Speech Recognition with Acoustic Triggers

2026-04-29

LLAC: Learned Lossless Audio Codec

2026-04-29

LLM-Based Post-ASR Error Correction for Disordered Speech

2026-04-29

Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning

2026-04-29

LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech

2026-04-29

Look, Listen and Segment: Towards Weakly Supervised Audio-Visual Semantic Segmentation

2026-04-29

Loose Coupling of Spectral and Spatial Models for Multi-Channel Diarization and Enhancement of Meetings in Dynamic Environments

2026-04-29

LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Conversational ASR

2026-04-29

Low-Bandwidth High-Fidelity Speech Transmission with Generative Latent Joint Source-Channel Coding

2026-04-29

Low-Frequency Harmonic Control for Speech Intelligibility in Open-Ear Headphones

2026-04-29

Low-Latency Audio Front-End Region-of-Interest Beamforming for Smart Glasses

2026-04-29

Low-Resource Guidance for Controllable Latent Audio Diffusion

2026-04-29

Low-Resource Speech-Based Early Alzheimer’s Detection via Cross-Lingual and Few-Shot Transfer Learning

2026-04-29

LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling

2026-04-29

MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation Without Vector Quantization

2026-04-29

MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

2026-04-29

Malefa: Multi-Granularity Learning and Effective False Alarm Suppression for Zero-Shot Keyword Spotting

2026-04-29

Mambaformer: State-Space Augmented Self-Attention with Downup Sampling for Monaural Speech Enhancement

2026-04-29

Marco-Voice: A Unified Framework for Expressive Speech Synthesis with Voice Cloning

2026-04-29

MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion with Increased Controllability via Multiple Guidances

2026-04-29

Matching Reverberant Speech Through Learned Acoustic Embeddings

2026-04-29

Matrix-Structured Hierarchical Convolutional Modeling for Pronunciation Assessment and Mispronunciation Detection

2026-04-29

Maximum Likelihood Measurement Noise Estimation for Block-Time Domain Kalman Filters

2026-04-29

MC-MRX: Reference- and MIDI-Guided Music Source Extraction with Contrastive Learning

2026-04-29

MCF: Text LLMs for Multimodal Emotional Causality

2026-04-29

MCI-OTFusion: A Multimodal Model for MCI Detection and Cognitive Score Prediction

2026-04-29

Meanflow-Accelerated Multimodal Video-to-Audio Synthesis Via One-Step Generation

2026-04-29

MeanFlowSE: One-Step Generative Speech Enhancement via Conditional Mean Flow

2026-04-29

MeanSE: Efficient Generative Speech Enhancement with Mean Flows

2026-04-29

MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

2026-04-29

MeanVoiceFlow: One-Step Nonparallel Voice Conversion with Mean Flows

2026-04-29

Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

2026-04-29

MECap-R1: Emotion-Aware Policy with Reinforcement Learning for Multimodal Emotion Captioning

2026-04-29

Medical ASR Enhancement by Domain-Specific Reinforcement Fine-Tuning

2026-04-29

MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis

2026-04-29

Melos: Sentence-To-Section Training with Multi-Task Learning for LLM-Driven Song Generation

2026-04-29

Membership Inference Attack against Music Diffusion Models via Generative Manifold Perturbation

2026-04-29

MFF-RVRDI: Multimodal Fusion Framework for Robust Video Recording Device Identification

2026-04-29

MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large Audio-Language Model

2026-04-29

Microphone-Less Measurement of Three-Dimensional Radiating Impulse Response of Sound Source using Spherical Harmonic-Domain Acousto-Optic Tomography

2026-04-29

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

2026-04-29

Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

2026-04-29

Mind Your [m]S, Cross Your [t]S: a Large-Scale Phonetic Analysis of Speech Reproduction in Modern Speech Generators

2026-04-29

MirrorTalk: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control

2026-04-29

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

2026-04-29

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

2026-04-29

Mitigating Data Replication in Text-to-Audio Generative Diffusion Models Through Anti-Memorization Guidance

2026-04-29

Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation

2026-04-29

Mitigating Language Prior-Induced Hallucinations via Bi-Level Contrastive Decoding

2026-04-29

Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis

2026-04-29

Mix2Morph: Learning Sound Morphing from Noisy Mixes

2026-04-29

MixGAN-based Non-blind Bandwidth Extension for Audio Codec

2026-04-29

Mixture of Experts for Recognizing Depression from Interview and Reading Tasks

2026-04-29

Mixture To Beamformed Mixture: Leveraging Beamformed Mixture As Weak-Supervision for Speech Enhancement and Noise-Robust ASR

2026-04-29

Mixture-of-Experts Based Soft-Label Learning for Multi-Label Speech Emotion Recognition

2026-04-29

Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

2026-04-29

Mixtures of Lightweight Articulatory Experts for Multilingual ASR

2026-04-29

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

2026-04-29

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

2026-04-29

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

2026-04-29

MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech

2026-04-29

Modeling Both Intra- And Inter-Utterance Variability for Conversational Emotion Recognition

2026-04-29

Modeling Inter-Segment Relationships in Speech for Dementia Detection with Audio Spectrogram Transformers and Graph Attention Networks

2026-04-29

Modeling Strategies For Speech Enhancement in The Latent Space of a Neural Audio Codec

2026-04-29

Monitoring exposure-length variations in submarine power cables using distributed fiber-optic sensing

2026-04-29

More Than a Shortcut: A Hyperbolic Approach to Early-Exit Networks

2026-04-29

Motionbeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding

2026-04-29

MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

2026-04-29

MSANET: Multi-Scale Semantic Aggregation Network for Brain-Assisted Speech Enhancement in Multi-Speaker Conditions

2026-04-29

MSCT: Differential Cross-Modal Attention for Deepfake Detection

2026-04-29

MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition

2026-04-29

MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech

2026-04-29

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-Token Prediction

2026-04-29

Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition

2026-04-29

Multi-Layer Attentive Probing Improves Transfer of Audio Representations for Bioacoustics

2026-04-29

Multi-Scale Physiologically-Motivated Alignment for Auditory Attention Decoding

2026-04-29

Multi-Task Learning For Speech Quality Assessment Using ASR-Derived Entropy Features

2026-04-29

Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

2026-04-29

Multi-View Hierarchical Hypergraph Neural Network for Automatic Stuttering Detection

2026-04-29

Multilingual Supervised Pretraining with LM-Assisted Decoding for Visual Speech Recognition

2026-04-29

Multimodal Co-Training with Subtractive Unlabeled-Benefit Bounds

2026-04-29

Multimodal Fusion-Based IPCLIP Network for Mixed Reality Surgical Assistance

2026-04-29

Multimodal LLMs as Expert Speech Annotators: Acoustic Macro-Descriptors for Parkinson’s Detection

2026-04-29

Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

2026-04-29

Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

2026-04-29

Multimodal Transformer with Multiperspective Training for Predicting Self-Expression Skills from Video Interview

2026-04-29

Multimodal Variational Graph Network for Multimodal Sentiment Analysis

2026-04-29

MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding

2026-04-29

Musicdetr: A Position-Aware Spectral Note Detection Model for Singing Transcription

2026-04-29

MusiCRS: Benchmarking Audio-Centric Conversational Recommendation

2026-04-29

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

2026-04-29

Natural Language to Spatial Audio Parameters: Lightweight Deterministic Rendering for Creative Authoring

2026-04-29

NCF-TTS: Enhancing Flow Matching Based Text-To-Speech with Neighborhood Consistency Flow

2026-04-29

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

2026-04-29

Neural Network-Based Time-Frequency-Bin-Wise Linear Combination of Beamformers for Underdetermined Target Source Extraction

2026-04-29

Neuromamba: Adaptive Frequency Filtering with a Pyramid Mamba for sEEG-driven Speech Synthesis

2026-04-29

NeuroSIFT: A Biologically-Inspired Framework with Explicit Signal-Noise Separation for Robust Multimodal Emotion Recognition

2026-04-29

nGPT as a Scalable Architecture for Speech Recognition and Translation

2026-04-29

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

2026-04-29

Noise-Robust AV-ASR Using Visual Features both in the Whisper Encoder and Decoder

2026-04-29

Noise-Robust Contrastive Learning with an MFCC-Conformer for Coronary Artery Disease Detection

2026-04-29

Noise-to-Notes: Diffusion-Based Generation and Refinement for Automatic Drum Transcription

2026-04-29

Non-Line-of-Sight Vehicle Detection via Audio-Visual Fusion

2026-04-29

Obstructive Sleep Apnea Endotype Prediction During Wakefulness Using Voice Biomarkers

2026-04-29

Off-The-Grid Multi-Pitch Estimation Using Optimal Transport

2026-04-29

OMNI-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models

2026-04-29

On deepfake voice detection - It’s all in the presentation

2026-04-29

On The Design of Efficient Neural Methods for Geometry-Agnostic Multichannel Speech Enhancement

2026-04-29

On the Design of Higher-Order Time-Intensity Microphone Arrays for Panoramic Audio Recording and Reproduction

2026-04-29

One Model–Three Tasks: Discovering a Shared Winning Ticket for Low-Complexity Audio Intelligence

2026-04-29

Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context

2026-04-29

Optimizing Domain-Adaptive Self-Supervised Learning for Clinical Voice-Based Disease Classification

2026-04-29

Optimizing Speech Language Models for Acoustic Consistency

2026-04-29

OV-INSTRUCTTTS: Towards Open-Vocabulary Instruct Text-to-Speech

2026-04-29

PAC: Pronunciation-Aware Contextualized Large Language Model-Based Automatic Speech Recognition

2026-04-29

PADAM: Perceptual Audio Defect Assessment Model

2026-04-29

ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-Based Neural Speech Codec

2026-04-29

Parametric Neural Amp Modeling with Active Learning

2026-04-29

PC-MCL: Patient-Consistent Multi-Cycle Learning with Multi-Label Bias Correction for Respiratory Sound Classification

2026-04-29

Peeking Into the Future for Contextual Biasing

2026-04-29

Perceptual Loss Optimized HRTF Personalization in Spherical Harmonic Domain

2026-04-29

Perceptual Quality Assessment for Stylized Talking Heads

2026-04-29

PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

2026-04-29

Personal Sound Zones with Flexible Bright Zone Control

2026-04-29

PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

2026-04-29

PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion

2026-04-29

PG-SE: Predictive Acceleration and Correction for Generative Speech Enhancement

2026-04-29

Phase-Retrieval-Based Physics-Informed Neural Networks For Acoustic Magnitude Field Reconstruction

2026-04-29

Phase-Space Signal Processing of Acoustic Data for Advanced Manufacturing In-Situ Monitoring

2026-04-29

PhoenixDSR: Phoneme-Guided and LLM-Enhanced Dysarthric Speech Recognition

2026-04-29

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

2026-04-29

Phonological Tokenizer: Prosody-Aware Phonetic Token Via Multi-Objective Fine-Tuning with Differentiable K-Means

2026-04-29

Phrased: Phrase Dictionary Biasing for Speech Translation

2026-04-29

Physics-Informed Neural Networks for Ocean Acoustic Field Reconstruction and Source Localization

2026-04-29

Pianoroll-Event: A Novel Score Representation for Symbolic Music

2026-04-29

PICOAUDIO2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

2026-04-29

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

2026-04-29

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

2026-04-29

Polynomial Mixing for Efficient Self-Supervised Speech Encoders

2026-04-29

Position-Invariant Fine-Tuning Of Speech Enhancement Models With Self-Supervised Speech Representations

2026-04-29

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

2026-04-29

Principled Coarse-Grained Acceptance For Speculative Decoding In Speech

2026-04-29

PRoADS: Provably Secure And Robust Audio Diffusion Steganography With Latent Optimization And Backward Euler Inversion

2026-04-29

Probing the Hidden Talent of ASR foundation models for L2 English Oral Assessment

2026-04-29

Probing Whisper for Dysarthric Speech in Detection and Assessment

2026-04-29

Production-Scale Dynamic Vocabulary ASR Biasing with Word-Level FST and Robust Training

2026-04-29

Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR

2026-04-29

Prompt-Guided Mixture-of-Experts for Robust Multimodal Sentiment Analysis with Missing Modalities

2026-04-29

PromptSep: Generative Audio Separation Via Multimodal Prompting

2026-04-29

Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum

2026-04-29

PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

2026-04-29

Prototype-Guided Cross-Modal Contrastive Learning for Continual Audio-Visual Sound Separation

2026-04-29

PRSA: Preventing Malicious Speaker Recognition and Speech Synthesis Simultaneously with Adversarial Examples

2026-04-29

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

2026-04-29

PSTalker: Realistic 3D Talking Head Synthesis via a Semantic-Aware Audio-Driven Point-Based Shape

2026-04-29

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

2026-04-29

Qastanet: A DNN-Based Quality Metric for Spatial Audio

2026-04-29

QE-XVC: Zero-Shot Cross-Lingual Voice Conversion via Query-Enhancement and Conditional Flow Matching

2026-04-29

QFOCUS: Controllable Synthesis for Automated Speech Stress Editing to Deliver Human-Like Emphatic Intent

2026-04-29

Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

2026-04-29

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

2026-04-29

Random Matrix-Driven Graph Representation Learning For Bioacoustic Recognition

2026-04-29

Ranking The Impact of Contextual Specialization in Neural Speech Enhancement

2026-04-29

RAP: Real-Time Audio-Driven Portrait Animation with Video Diffusion Transformer

2026-04-29

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2026-04-29

RASD-SR: A Robust Anomalous Sound Detection Framework with Score Recalibration

2026-04-29

Rationale-Guided Learning for Multimodal Emotion Recognition

2026-04-29

RCAL: Reinforced Cross-Modal Alignment for Multimodal Sentiment Analysis with Sparse Visual Frames

2026-04-29

Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features

2026-04-29

Real-Time Streaming Mel Vocoding with Generative Flow Matching

2026-04-29

Reasoning Driven Captions to Assist Noise Robust Speech Emotion Recognition

2026-04-29

ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

2026-04-29

Reconstruction of Spherical Sound Source Radiation Characteristics with Graph Signal Processing

2026-04-29

Recovering Performance in Speech Emotion Recognition from Discrete Tokens Via Multi-Layer Fusion and Paralinguistic Feature Integration

2026-04-29

Reducing Prompt Sensitivity in LLM-Based Speech Recognition Through Learnable Projection

2026-04-29

Reference Microphone Selection for Guided Source Separation Based on The Normalized L-P Norm

2026-04-29

Reference-Aware SFM Layers for Intrusive Intelligibility Prediction

2026-04-29

Refgen: Reference-Guided Synthetic Data Generation for Anomalous Sound Detection

2026-04-29

Regularized Inverse Filter Design for Rigid Spherical Microphone Array Processing: Laplace- And Time-Domain Representations

2026-04-29

Relative Time Intervals Representation For Word-Level Timestamping With Masked Training

2026-04-29

Reliable AI via Age-Balanced Validation: Fair Model Selection for Parkinson’s Detection from Voice

2026-04-29

Representation-Based Data Quality Audits for Audio

2026-04-29

Representation-Diverse Self-Supervision for Cross-Domain Bioacoustic Learning in Low-Resource Settings

2026-04-29

Residual Tokens Enhance Masked Autoencoders for Speech Modeling

2026-04-29

Respire-Mamba C-UNet: Consistency-Trained Autoencoder for High-Fidelity Respiratory Sound Compression

2026-04-29

Rethinking Entity Disambiguation in Complex Modalities

2026-04-29

Rethinking Music Captioning with Music Metadata LLMs

2026-04-29

Retrieval-Based Speculative Decoding For Autoregressive Speech Synthesis

2026-04-29

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

2026-04-29

RFM-Editing: Rectified Flow Matching for Text-Guided Audio Editing

2026-04-29

RHO-PERFECT: Correlation Ceiling for Subjective Evaluation Datasets

2026-04-29

RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses

2026-04-29

RLBR: Reinforcement Learning with Biasing Rewards for Contextual Speech Large Language Models

2026-04-29

RMODGDF: A Robust STFT-Derived Feature for Musical Instrument Recognition

2026-04-29

Robust Accent Identification via Voice Conversion and Non-Timbral Embeddings

2026-04-29

Robust and Lightweight F0 Estimation Through Mid-Level Fusion of DSP-Informed Features

2026-04-29

Robust Deepfake Audio Detection via Multi-Level Intermediate Feature Fusion

2026-04-29

Robust Online Overdetermined Independent Vector Analysis Based on Bilinear Decomposition

2026-04-29

RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack

2026-04-29

RRPO: Robust Reward Policy Optimization for LLM-Based Emotional TTS

2026-04-29

S-PRESSO: Ultra Low Bitrate Sound Effect Compression with Diffusion Autoencoders and Offline Quantization

2026-04-29

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

2026-04-29

S2Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion

2026-04-29

SA-SSL-MOS: Self-Supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

2026-04-29

SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech

2026-04-29

SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution

2026-04-29

Salad-VAE: Semantic Audio Compression with Language-Audio Distillation

2026-04-29

Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems with Deep Kernel Learning

2026-04-29

SAUNA: Song-Level Audio & User-Listening Data Neural Alignment

2026-04-29

SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

2026-04-29

Scalable Evaluation for Audio Identification Via Synthetic Latent Fingerprint Generation

2026-04-29

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

2026-04-29

Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

2026-04-29

Scaling Spoken Language Models with Syllabic Speech Tokenization

2026-04-29

SceneRAG: Scene-Level Retrieval-Augmented Generation for Video Understanding

2026-04-29

SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

2026-04-29

Secondary Source Placement for Sound Field Control Based on Ising Model

2026-04-29

SED: Structural Entropy Based Speech Discretization for Discrete Token-Based ASR

2026-04-29

Segmentwise Pruning in Audio-Language Models

2026-04-29

SELD-MOHA: A Fine-Tuning Method with the Mixture of Heterogeneous Adapters for Sound Event Localization and Detection

2026-04-29

Selective Hub Fusion with Modality-Heterogeneous Experts for Multimodal Emotion Recognition

2026-04-29

Self-Supervised Note Tracking and Multi-Pitch Estimation Via Reconstruction-Based Learning

2026-04-29

Semantic Anchor Transfer from Short to Long Speech in a Distillation-Based Summarization Framework

2026-04-29

Semantic-Guided Pseudo-Feature Attention Network for Audio-Visual Zero-Shot Learning

2026-04-29

SEP-ST: Incorporating Speech Entity Prompt Into Large Language Models for Speech Translation

2026-04-29

Separate this, and all of these Things Around It: Music Source Separation Via Hyperellipsoidal Queries

2026-04-29

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

2026-04-29

Sequential and Simultaneous Optimization of Microphone Array Geometry and Region-of-Interest Beamforming

2026-04-29

Session-Level Spoken Language Assessment with A Multimodal Foundation Model Via Multi-Target Learning

2026-04-29

SFM-TTS: Lightweight and Rapid Speech Synthesis with Flexible Shortcut Flow Matching

2026-04-29

Shared Representation Learning for Reference-Guided Targeted Sound Detection

2026-04-29

Shortcut Flow Matching for Speech Enhancement: Step-Invariant Flows via Single Stage Training

2026-04-29

Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-Scale Dataset Cleansing

2026-04-29

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models

2026-04-29

Sing What You Fit: A Perception-Based Dataset and Benchmark for Vocal-Song Suitability Analysis

2026-04-29

Sing2Song: An Accompaniment Generation System Based on Solo Singing

2026-04-29

Single-Microphone Audio Point Source Discriminative Localization from Reverberation Late Tail Estimation

2026-04-29

Single-Step Controllable Music Bandwidth Extension with Flow Matching

2026-04-29

SingMOS-Pro: An Comprehensive Benchmark For Singing Quality Assessment

2026-04-29

SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

2026-04-29

SIRUP: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics

2026-04-29

SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

2026-04-29

SLM-SS: Speech Language Model for Generative Speech Separation

2026-04-29

SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models

2026-04-29

Slot Filling as a Reasoning Task for SpeechLLMs

2026-04-29

SmoothCLAP: Soft-Target Enhanced Contrastive Language-Audio Pretraining for Affective Computing

2026-04-29

Snore Sound Classification Based on Physiological Features and Adaptive Loss Function

2026-04-29

Solving the Helmholtz Equation Via Physics-Informed Neural Networks with an Adaptive Weighting Strategy

2026-04-29

SONAR: Self-Distilled Continual Pre-Training for Domain Adaptive Audio Representation

2026-04-29

SoundCompass: Navigating Target Sound Extraction with Effective Directional Clue Integration in Complex Acoustic Scenes

2026-04-29

Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

2026-04-29

Sounds that Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

2026-04-29

Source Separation For A Cappella Music

2026-04-29

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

2026-04-29

SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

2026-04-29

SPAM: Style Prompt Adherence Metric for Prompt-Based TTS

2026-04-29

Sparse Autoencoders Make Audio Foundation Models More Explainable

2026-04-29

Sparse-View Visual-Acoustic Latent Learning for Novel-View Audio Synthesis

2026-04-29

Spatial Covariance Matrix Reconstruction for Speech Enhancement in Reverberant Multi-Source Environments

2026-04-29

Spatial-CLAP: Learning Spatially-Aware Audio–Text Embeddings for Multi-Source Conditions

2026-04-29

Spatially Aware Self-Supervised Models for Multi-Channel Neural Speaker Diarization

2026-04-29

SpatialNet-Echo: Real-Time Acoustic Echo Cancellation via Integrated Narrow-Band and Cross-Band Processing

2026-04-29

Speaker Anonymisation for Speech-Based Suicide Risk Detection

2026-04-29

Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding

2026-04-29

Spectral or Spatial? Leveraging Both for Speaker Extraction in Challenging Data Conditions

2026-04-29

Spectrogram Event Based Feature Representation for Generalizable Automatic Music Transcription

2026-04-29

Speech Emotion Recognition based on Hierarchical Transformer with Shifted Windows

2026-04-29

Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts

2026-04-29

SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis

2026-04-29

SpeechMapper: Speech-To-Text Embedding Projector for LLMs

2026-04-29

Spike-Driven Low-Power Speech Bandwidth Extension

2026-04-29

Spiking Attention Network: A Hybrid Neuromorphic Approach to Underwater Acoustic Localization and Zero-Shot Adaptation

2026-04-29

Spiking Temporal-Enhanced Network for Zero-Shot Audio-Visual Learning

2026-04-29

Spring Reverb Emulation with Hybrid Gated Convolutional Networks and State Space Models

2026-04-29

SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition

2026-04-29

ST-HNTM: Joint Speech-Text Neural Topic Modeling on the Hypersphere

2026-04-29

STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

2026-04-29

Staged Diffusion with Hybrid Mixture-of-Experts (MoE) for Multimodal Sentiment Analysis

2026-04-29

Stemphonic: All-At-Once Flexible Multi-Stem Music Generation

2026-04-29

Step-Audio-R1.5 Technical Report

2026-04-29

StereoFoley: Object-Aware Stereo Audio Generation from Video

2026-04-29

Stereophonic Acoustic Echo Cancellation Using an Improved Affine Projection Algorithm with Adaptive Multiple Sub-Filters

2026-04-29

Still Thinking or Stopped Talking? Dialogue Silence Intention Classification Using Multimodal Large Language Model

2026-04-29

Str-DiffSep: Streamable Diffusion Model for Speech Separation

2026-04-29

Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization Via Neural Audio Codec and Language Models

2026-04-29

Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization

2026-04-29

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

2026-04-29

StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

2026-04-29

Stress Prediction from Temporal Emotion Trajectories in Clinical Patient-Physician Conversations

2026-04-29

Structure-Aware Diffusion Schrödinger Bridge

2026-04-29

StyHarmo: Efficient Style-Specific Video Generation with Music Synchronization

2026-04-29

Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent

2026-04-29

Style-Disentangled Diffusion for Controllable and Identity-Generalized Speech-Driven Body Motion Generation

2026-04-29

StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

2026-04-29

StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks

2026-04-29

Subgraph Localization in the Subbands for Partially Spoofed Speech Detection

2026-04-29

Subsequence SDTW: Differentiable Alignment with Flexible Boundary Conditions

2026-04-29

Subspace Hybrid Adaptive Filtering for Phonocardiogram Signal Denoising

2026-04-29

Sunac: Source-Aware Unified Neural Audio Codec

2026-04-29

SURE: Synergistic Uncertainty-Aware Reasoning for Multimodal Emotion Recognition in Conversations

2026-04-29

SwitchCodec: Adaptive Residual-Expert Sparse Quantization for High-Fidelity Neural Audio Coding

2026-04-29

Symphony Rendering: MIDI and Composer-Conditioned Auto Orchestration with Flow-Matching Transformers

2026-04-29

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

2026-04-29

SynaSpot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

2026-04-29

Synchronous Secondary Path Modeling and Kronecker-Factorized Adaptive Algorithm for Multichannel Active Noise Control

2026-04-29

Syncspeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer

2026-04-29

SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding

2026-04-29

Synthcloner: Synthesizer-Style Audio Transfer via Factorized Codec with ADSR Envelope Control

2026-04-29

Synthesized Data Selection via Score Distribution Matching for Te Reo Māori Automatic Speech Recognition

2026-04-29

Synthetic Data Domain Adaptation for ASR via LLM-Based Text and Phonetic Respelling Augmentation

2026-04-29

Synthetic yet Striking? Assessing Vocal Charisma in TTS via Perceptual and Algorithmic Measures

2026-04-29

T-Cache: Fast Inference For Masked Generative Transformer-Based TTS Via Prompt-Aware Feature Caching

2026-04-29

T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS

2026-04-29

TAG: Structured Temporal Audio Generation via LLM-Guided Manual Scription and Control

2026-04-29

TAGARELA - A Portuguese Speech Dataset from Podcasts

2026-04-29

Taming Audio VAEs via Target-KL Regularization

2026-04-29

Target Speaker Anonymization in Multi-Speaker Recordings

2026-04-29

Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder

2026-04-29

Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis

2026-04-29

Task-Oriented Sound Privacy Preservation for Sound Event Detection Via End-to-End Adversarial Multi-Task Learning

2026-04-29

TASU: Text-only Alignment for Speech Understanding

2026-04-29

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

2026-04-29

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

2026-04-29

Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-Wise Distillation

2026-04-29

Teaching the Teachers: Boosting Unsupervised Domain Adaptation In Speech Recognition By Ensemble Update

2026-04-29

Temporal Distillation for Music Representation Learning

2026-04-29

Temporal Graph Modeling for Speech Emotion Recognition Using LSTM-Aggregated Multigraph Networks

2026-04-29

Temporal-Spatial Decouple Before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

2026-04-29

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification

2026-04-29

Test Time Adaptation for Speech Emotion Recognition

2026-04-29

Test-Time Scaling for Auditory Cognition in Audio Language Models

2026-04-29

Testing The Efficient Coding Hypothesis Beyond Humans: The Auditory Kernels of Bat Vocalizations

2026-04-29

Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

2026-04-29

Text2Move: Text-To-Moving Sound Generation via Trajectory Prediction and Temporal Alignment

2026-04-29

TextlessRAG: End-to-End Visual Document RAG by Speech without Text

2026-04-29

The 3rd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing Aid Speech Intelligibility Prediction

2026-04-29

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-Based Language Encoders

2026-04-29

The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures

2026-04-29

The Muse Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

2026-04-29

The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

2026-04-29

The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion to Singing Style Conversion

2026-04-29

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

2026-04-29

The Synergistic Role of Audio and Large Video-Language Model in Source-Free Video Domain Adaptation

2026-04-29

Theory and Application of Circular Relative Harmonic Coefficients

2026-04-29

Thinking While Listening: Simple Test Time Scaling for Audio Classification

2026-04-29

Three Seconds is Sufficient: A Multi-Pronged Framework for Model-Based Speaker Adaptation in ASR Under Data-Scarce Conditions

2026-04-29

TICL: Text-Embedding KNN for Speech in-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

2026-04-29

Timbre-Aware Audio Difference Captioning for Anomalous Machine Sounds without Paired Training Data via Synthetic Perturbations

2026-04-29

Timbre-Based Pretraining with Pseudo-Labels for Multi-Instrument Automatic Music Transcription

2026-04-29

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0

2026-04-29

Time-Domain Synthesis of Virtual Sound Source Within Personalized Sound Zone using a Linear Loudspeaker Array

2026-04-29

Time-Shifted Token Scheduling for Symbolic Music Generation

2026-04-29

TinyMU: A Compact Audio-Language Model for Music Understanding

2026-04-29

Tldiffgan: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection

2026-04-29

TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Framework for Ü-Tsang, Amdo and Kham Speech Dataset Generation

2026-04-29

Tokenchain: A Discrete Speech Chain via Semantic Token Modeling

2026-04-29

Toward Faithful Explanations in Acoustic Anomaly Detection

2026-04-29

Toward Robust And Efficient Beat Tracking Via Beat-Aware Attention

2026-04-29

Towards Blind Data Cleaning: A Case Study in Music Source Separation

2026-04-29

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

2026-04-29

Towards Data Drift Monitoring for Speech Deepfake Detection in the Context of MLOps

2026-04-29

Towards Distance-Aware Synthetic Audio Mixtures for Universal Sound Separation

2026-04-29

Towards Effective Negation Modeling in Joint Audio-Text Models for Music

2026-04-29

Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances

2026-04-29

Towards Fair ASR for Second Language Speakers using Fairness Prompted Finetuning

2026-04-29

Towards Lightweight Adaptation of Speech Enhancement Models in Real-World Environments

2026-04-29

Towards Multi-View Hierarchical Video-to-Piano Generation with MIDI Guidance

2026-04-29

Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages

2026-04-29

Towards Real-Time Generative Speech Restoration with Flow-Matching

2026-04-29

Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER

2026-04-29

Tpeformer: Temporal Patch Embedding Transformer

2026-04-29

Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio

2026-04-29

Training Dynamics-Aware Multi-Factor Curriculum Learning for Target Speaker Extraction

2026-04-29

Training Flow Matching Models with Reliable Labels via Self-Purification

2026-04-29

Training-Free Inference-Time Scaling for Audio Source Separation

2026-04-29

Training-Free Multimodal Guidance for Video to Audio Generation

2026-04-29

Transfer Learning for Paediatric Sleep Apnoea Detection using Physiology-Guided Acoustic Models

2026-04-29

Transferable Audio Lottery Tickets: Gradient Accumulation for Extreme Sparsity

2026-04-29

Tri-Attention Fusion: Joint Temporal-Spectral and Bidirectional Modeling for Speech Spoofing Detection

2026-04-29

Triad: Tri-Head with Auxiliary Duplicating Permutation Invariant Training for Multi-Task Sound Event Localization and Detection

2026-04-29

Triage Knowledge Distillation for Speaker Verification

2026-04-29

TTA: Transcribe, Translate and Alignment for Cross-Lingual Speech Representation

2026-04-29

TVP-UNet: Threshold Variance Penalty U-Net for Voice Activity Detection in Dysarthric Speech

2026-04-29

Two-Stage Language Model Framework for Acoustic Echo Cancellation

2026-04-29

UJCodec: An End-to-end Unet-Style Codec for Joint Speech Compression and Enhancement

2026-04-29

UMA-SPLIT: Unimodal Aggregation for Both English and Mandarin Non-Autoregressive Speech Recognition

2026-04-29

UMV: A Mixture-Of-Experts Vision Transformer with Multi-Spectrogram Fusion for Underwater Ship Noise Classification

2026-04-29

Uncertainty-Aware 3D Emotional Talking Face Synthesis with Emotion Prior Distillation

2026-04-29

Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis

2026-04-29

Understanding the Strengths and Weaknesses of SSL Models for Audio Deepfake Model Attribution

2026-04-29

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

2026-04-29

Universr: Unified and Versatile Audio Super-Resolution Via Vocoder-Free Flow Matching

2026-04-29

UNMIXX: Untangling Highly Correlated Singing Voices Mixtures

2026-04-29

Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research

2026-04-29

Unseen but Not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models

2026-04-29

Unsupervised Discovery and Analysis of the Vocal Repertoires and Patterns of Select Corvid Species

2026-04-29

Unsupervised Lexicon Learning from Speech is Limited by Representations Rather than Clustering

2026-04-29

USVexplorer: Robust Detection of Ultrasonic Vocalizations with Cross Species Generalization

2026-04-29

UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model

2026-04-29

Utilizing Information Theoretic Approach to Study Cochlear Neural Degeneration

2026-04-29

UVT-LM: Unifying Visual and Tactile Perception with Language Model

2026-04-29

V2A-DPO: Omni-Preference Optimization for Video-To-Audio Generation

2026-04-29

Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition

2026-04-29

VBx for End-to-End Neural and Clustering-Based Diarization

2026-04-29

VChangeCodec: An Ultra Low-Complexity Neural Speech Codec with Built-In Voice Changer for Customized Real-Time Communication

2026-04-29

Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation

2026-04-29

Vib2Sound: Separation Of Multimodal Sound Sources

2026-04-29

Vioptt: Violin Technique-Aware Transcription from Synthetic Data Augmentation

2026-04-29

Virtual Consistency for Audio Editing

2026-04-29

Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation

2026-04-29

ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models

2026-04-29

VividTalker: A Modular Framework for Expressive 3D Talking Avatars with Controllable Gaze and Blink

2026-04-29

VM-UNSSOR: Unsupervised Neural Speech Separation Enhanced by Higher-SNR Virtual Microphone Arrays

2026-04-29

VMSP: Video-to-Music Generation with Two-Stage Alignment and Synthesis

2026-04-29

Vocalnet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

2026-04-29

Voting-Based Pitch Estimation with Temporal and Frequential Alignment and Correlation Aware Selection

2026-04-29

VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings

2026-04-29

VoXtream: Full-Stream Text-To-Speech With Extremely Low Latency

2026-04-29

VT-Heads: Voice Cloning and Talking Head Generation from Text Based on V-DiT

2026-04-29

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

2026-04-29

WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error

2026-04-29

Wave-Trainer-Fit: Neural Vocoder With Trainable Prior And Fixed-Point Iteration Towards High-Quality Speech Generation From SSL Features

2026-04-29

Wavenext 2: Convnext-Based Fast Neural Vocoders with Residual Denoising and Sub-Modeling for Gan And Diffusion Models

2026-04-29

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

2026-04-29

WaveSpikeNet: A Wavelet-Spiking Fusion Architecture for Audio Classification on Edge Devices

2026-04-29

WavLink: Compact Audio–Text Embeddings with a Global Whisper Token

2026-04-29

What the student learns in knowledge distillation: A subspace view and evidence on Convolutional Recurrent Network

2026-04-29

When Audio Matters: A Lightweight, Hierarchical Fusion Model for Speech and Non-Verbal Emotion Recognition

2026-04-29

When Children Talk and Machines Listen: Toward an Interpretable Speech-Based Screener for Dutch Developmental Language Disorder

2026-04-29

When Noise Lowers the Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models

2026-04-29

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

2026-04-29

When Voice Matters: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making

2026-04-29

Whisper-FEST: Single-Channel Far-Field Enhanced Speech-to-text without Parallel Data

2026-04-29

Whisper-MLA: Reducing GPU Memory Consumption of ASR Models Based on MHA2MLA Conversion

2026-04-29

Whisper-QF: Leveraging Dual Cross-Attention Q-Former for Speech Emotion Recognition With Multi-Task Learning

2026-04-29

Whisper: Courtside Edition - Enhancing ASR Performance through LLM-Driven Context Generation

2026-04-29

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

2026-04-29

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

2026-04-29

Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-Resource Speech Recognition

2026-04-29

Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

2026-04-29

ZK-VSA: Zero-Knowledge Verifiable Speaker Anonymization Leveraging Phase Vocoder with Time-Scale Modification

2026-04-29

ZSV2C-MLLM: Zero-Shot Visual Voice Cloning Via Multimodal Large Language Models

2026-04-29

β-AVSDNET: A Novel End-To-End Neural Network Architecture For Audio-Visual Speaker Diarization

2026-04-29

Speech/Audio Paper Digest 2026-04-29

2026-04-29

A Functorial Formulation of Neighborhood Aggregating Deep Learning

2026-04-28

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

2026-04-28

An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

2026-04-28

CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

2026-04-28

Come Together: Analyzing Popular Songs Through Statistical Embeddings

2026-04-28

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

2026-04-28

Explainable AI in Speaker Recognition – Making Latent Representations Understandable

2026-04-28

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

2026-04-28

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

2026-04-28

Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions

2026-04-28

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-04-28

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-28

Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification

2026-04-28

Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments

2026-04-28

Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network

2026-04-28

Psychologically-Grounded Graph Modeling for Interpretable Depression Detection

2026-04-28

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2026-04-28

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

2026-04-28

RTCFake: Speech Deepfake Detection in Real-Time Communication

2026-04-28

Scaling Properties of Continuous Diffusion Spoken Language Models

2026-04-28

Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

2026-04-28

Speech Enhancement Based on Drifting Models

2026-04-28

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

2026-04-28

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-04-28

Speech/Audio Paper Digest 2026-04-28

2026-04-28

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

2026-04-27

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-04-27

Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues

2026-04-27

Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

2026-04-27

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

2026-04-27

Earable Platform with Integrated Simultaneous EEG Sensing and Auditory Stimulation

2026-04-27

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-04-27

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2026-04-27

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-04-27

Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven’s Piano and Cello Sonatas, 1930–2012

2026-04-27

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

2026-04-27

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-04-27

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2026-04-27

Speech/Audio Paper Digest 2026-04-27

2026-04-27

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-25

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

2026-04-25

Speech/Audio Paper Digest 2026-04-25

2026-04-25

“This Wasn’t Made for Me”: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

2026-04-24

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

2026-04-24

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

2026-04-24

Beyond Rules: Towards Basso Continuo Personal Style Identification

2026-04-24

DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline

2026-04-24

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

2026-04-24

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

2026-04-24

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

2026-04-24

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-04-24

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

2026-04-24

Low-Rank Adaptation Redux for Large Models

2026-04-24

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-24

Materialistic RIR: Material Conditioned Realistic RIR Generation

2026-04-24

MER 2026: From Discriminative Emotion Recognition to Generative Emotion Understanding

2026-04-24

Misinformation Span Detection in Videos via Audio Transcripts

2026-04-24

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

2026-04-24

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

2026-04-24

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

2026-04-24

Sema: Semantic Transport for Real-Time Multimodal Agents

2026-04-24

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0

2026-04-24

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

2026-04-24

Speech/Audio Paper Digest 2026-04-24

2026-04-24

Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

2026-04-23

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

2026-04-23

Before the Mic: Physical-Layer Voiceprint Anonymization with Acoustic Metamaterials

2026-04-23

Centering Ecological Goals in Automated Identification of Individual Animals

2026-04-23

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

2026-04-23

Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis

2026-04-23

Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations

2026-04-23

Enhancing ASR Performance in the Medical Domain for Dravidian Languages

2026-04-23

Enhancing Speaker Verification with Whispered Speech via Post-Processing

2026-04-23

Environmental Sound Deepfake Detection Using Deep-Learning Framework

2026-04-23

Explicit Dropout: Deterministic Regularization for Transformer Architectures

2026-04-23

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

2026-04-23

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

2026-04-23

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

2026-04-23

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

2026-04-23

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-23

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

2026-04-23

Qwen3.5-Omni Technical Report

2026-04-23

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-23

SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment

2026-04-23

Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones

2026-04-23

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

2026-04-23

Tadabur: A Large-Scale Quran Audio Dataset

2026-04-23

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026-04-23

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

2026-04-23

Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

2026-04-23

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

2026-04-23

Speech/Audio Paper Digest 2026-04-23

2026-04-23

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

2026-04-22

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

2026-04-22

Audio Spoof Detection with GaborNet

2026-04-22

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

2026-04-22

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

2026-04-22

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

2026-04-22

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

2026-04-22

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

2026-04-22

Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification

2026-04-22

Environmental Sound Deepfake Detection Using Deep-Learning Framework

2026-04-22

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

2026-04-22

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-22

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

2026-04-22

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

2026-04-22

Qwen3.5-Omni Technical Report

2026-04-22

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-22

Tadabur: A Large-Scale Quran Audio Dataset

2026-04-22

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026-04-22

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

2026-04-22

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

2026-04-22

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

2026-04-22

Speech/Audio Paper Digest 2026-04-22

2026-04-22

A novel LSTM music generator based on the fractional time-frequency feature extraction

2026-04-21

A state-space representation of the boundary integral equation for room acoustic modelling

2026-04-21

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

2026-04-21

Anonymization, Not Elimination: Utility-Preserved Speech Anonymization

2026-04-21

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

2026-04-21

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

2026-04-21

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

2026-04-21

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

2026-04-21

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

2026-04-21

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

2026-04-21

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

2026-04-21

Coexisting Tempo Traditions in Beethoven’s Piano and Cello Sonatas: A K-means Clustering Analysis of Recorded Performances, 1930-2012

2026-04-21

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

2026-04-21

FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs

2026-04-21

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

2026-04-21

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

2026-04-21

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

2026-04-21

ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection

2026-04-21

Incremental learning for audio classification with Hebbian Deep Neural Networks

2026-04-21

Latent Fourier Transform

2026-04-21

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

2026-04-21

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

2026-04-21

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

2026-04-21

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-21

Neural Encoding Detection is Not All You Need for Synthetic Speech Detection

2026-04-21

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

2026-04-21

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

2026-04-21

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

2026-04-21

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

2026-04-21

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

2026-04-21

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

2026-04-21

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

2026-04-21

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

2026-04-21

Where Do Self-Supervised Speech Models Become Unfair?

2026-04-21

Speech/Audio Paper Digest 2026-04-21

2026-04-21

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

2026-04-20

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

2026-04-20

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

2026-04-20

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

2026-04-20

BlasBench: An Open Benchmark for Irish Speech Recognition

2026-04-20

Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

2026-04-20

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

2026-04-20

Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

2026-04-20

Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

2026-04-20

HARNESS: Lightweight Distilled Arabic Speech Foundation Models

2026-04-20

Hierarchical Codec Diffusion for Video-to-Speech Generation

2026-04-20

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

2026-04-20

Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

2026-04-20

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-04-20

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

2026-04-20

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

2026-04-20

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

2026-04-20

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

2026-04-20

Qwen3.5-Omni Technical Report

2026-04-20

Spatial-Aware Conditioned Fusion for Audio-Visual Navigation

2026-04-20

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

2026-04-20

The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

2026-04-20

TinyMU: A Compact Audio-Language Model for Music Understanding

2026-04-20

VoxMind: An End-to-End Agentic Spoken Dialogue System

2026-04-20

Speech/Audio Paper Digest 2026-04-20

2026-04-20

A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven’s Piano and Cello Sonatas

2026-04-19

Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

2026-04-19

An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

2026-04-19

Audio Source Separation in Reverberant Environments using β-divergence based Nonnegative Factorization

2026-04-19

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

2026-04-19

AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

2026-04-19

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

2026-04-19

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

2026-04-19

Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset

2026-04-19

Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals

2026-04-19

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

2026-04-19

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

2026-04-19

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

2026-04-19

Diffusion Language Models for Speech Recognition

2026-04-19

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

2026-04-19

Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning

2026-04-19

Enhancing time-frequency resolution with optimal transport and barycentric fusion of multiple spectrogram

2026-04-19

Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models

2026-04-19

Four Decades of Digital Waveguides

2026-04-19

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

2026-04-19

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

2026-04-19

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

2026-04-19

Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

2026-04-19

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

2026-04-19

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-04-19

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

2026-04-19

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

2026-04-19

Room compensation for loudspeaker reproduction using a supporting source

2026-04-19

Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

2026-04-19

SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion

2026-04-19

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

2026-04-19

StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

2026-04-19

TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants

2026-04-19

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

2026-04-19

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

2026-04-19

Transformer Based Machine Fault Detection From Audio Input

2026-04-19

UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations

2026-04-19

VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark

2026-04-19

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

2026-04-19

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

2026-04-19

Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

2026-04-19

Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

2026-04-19

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

2026-04-19

Speech/Audio Paper Digest 2026-04-19

2026-04-19

Speech/Audio Paper Digest 2026-04-18

2026-04-18