2026  1965

May  653

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

2026-05-19 · updated 2026-05-19 · 3 min · 508 words

A Fast Robust Adaptive filter using Improved Data-Reuse Method

2026-05-19 · updated 2026-05-19 · 2 min · 401 words

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

2026-05-19 · updated 2026-05-19 · 3 min · 431 words

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

2026-05-19 · updated 2026-05-19 · 3 min · 615 words

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

2026-05-19 · updated 2026-05-19 · 3 min · 634 words

Audio-Image Cross-Modal Retrieval with Onomatopoeic Images

2026-05-19 · updated 2026-05-19 · 3 min · 508 words

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

2026-05-19 · updated 2026-05-19 · 3 min · 573 words

Bridging the Gap: Converting Read Text to Conversational Dialogue

2026-05-19 · updated 2026-05-19 · 2 min · 277 words

Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities

2026-05-19 · updated 2026-05-19 · 4 min · 645 words

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

2026-05-19 · updated 2026-05-19 · 3 min · 456 words

Contextual Biasing for Streaming ASR via CTC-based Word Spotting

2026-05-19 · updated 2026-05-19 · 2 min · 371 words

EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge

2026-05-19 · updated 2026-05-19 · 2 min · 401 words

Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters

2026-05-19 · updated 2026-05-19 · 3 min · 547 words

Fractional-Order Subband p-Norm Adaptive Filter via Transformation Nearest Kronecker Product Decomposition for Active Noise Control

2026-05-19 · updated 2026-05-19 · 2 min · 277 words

MedASR: An Open-Source Model for High-Accuracy Medical Dictation

2026-05-19 · updated 2026-05-19 · 3 min · 431 words

MusicDET: Zero-Shot AI-Generated Music Detection

2026-05-19 · updated 2026-05-19 · 3 min · 556 words

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

2026-05-19 · updated 2026-05-19 · 4 min · 673 words

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

2026-05-19 · updated 2026-05-19 · 3 min · 639 words

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

2026-05-19 · updated 2026-05-19 · 2 min · 411 words

Robust Audio Tagging under Class-wise Supervision Unreliability

2026-05-19 · updated 2026-05-19 · 3 min · 434 words

Robust Soft-Constrained Spatially Selective Active Noise Control for Hearables Under Secondary Path Variations

2026-05-19 · updated 2026-05-19 · 2 min · 364 words

S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation

2026-05-19 · updated 2026-05-19 · 3 min · 552 words

SAME: A Semantically-Aligned Music Autoencoder

2026-05-19 · updated 2026-05-19 · 3 min · 607 words

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

2026-05-19 · updated 2026-05-19 · 3 min · 550 words

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

2026-05-19 · updated 2026-05-19 · 3 min · 515 words

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

2026-05-19 · updated 2026-05-19 · 3 min · 482 words

Sonalyzer-Moz: A Framework for Analyzing the Structure of Mozart’s Sonata Form

2026-05-19 · updated 2026-05-19 · 2 min · 401 words

Speaker-Disentangled Remote Speech Detection of Asthma and COPD Exacerbations

2026-05-19 · updated 2026-05-19 · 3 min · 445 words

Stable Audio 3

2026-05-19 · updated 2026-05-19 · 3 min · 621 words

Taming Audio VAEs via Target-KL Regularization

2026-05-19 · updated 2026-05-19 · 3 min · 434 words

UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

2026-05-19 · updated 2026-05-19 · 2 min · 386 words

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

2026-05-19 · updated 2026-05-19 · 2 min · 313 words

Voice “Cloning” is Style Transfer

2026-05-19 · updated 2026-05-19 · 2 min · 323 words

WavFlow: Audio Generation in Waveform Space

2026-05-19 · updated 2026-05-19 · 3 min · 524 words

Speech/Audio Paper Digest 2026-05-19

2026-05-19 · updated 2026-05-19 · 23 min · 4805 words

ARIA: A Diagnostic Framework for Music Training Data Attribution

2026-05-18 · updated 2026-05-19 · 4 min · 833 words

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

2026-05-18 · updated 2026-05-19 · 3 min · 606 words

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

2026-05-18 · updated 2026-05-19 · 2 min · 330 words

Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

2026-05-18 · updated 2026-05-19 · 2 min · 382 words

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

2026-05-18 · updated 2026-05-19 · 8 min · 1698 words

Improving Automatic Speech Recognition for Speakers Treated for Oral Cancer using Data Augmentation and LLM Error Correction

2026-05-18 · updated 2026-05-19 · 2 min · 426 words

Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization

2026-05-18 · updated 2026-05-19 · 4 min · 792 words

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

2026-05-18 · updated 2026-05-19 · 4 min · 654 words

Perforated Neural Networks for Keyword Spotting

2026-05-18 · updated 2026-05-19 · 2 min · 379 words

Real-time Speech Restoration using Data Prediction Mean Flows

2026-05-18 · updated 2026-05-19 · 3 min · 466 words

Scalable neuromorphic computing from autonomous spiking dynamics in a clockless reconfigurable chip

2026-05-18 · updated 2026-05-19 · 3 min · 458 words

Sound Sparks Motion: Audio and Text Tuning for Video Editing

2026-05-18 · updated 2026-05-19 · 1 min · 211 words

Toward World Modeling of Physiological Signals with Chaos-Theoretic Balancing and Latent Dynamics

2026-05-18 · updated 2026-05-19 · 3 min · 455 words

Speech/Audio Paper Digest 2026-05-18

2026-05-18 · updated 2026-05-19 · 11 min · 2305 words

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

2026-05-17 · updated 2026-05-19 · 4 min · 681 words

ViMU: Benchmarking Video Metaphorical Understanding

2026-05-17 · updated 2026-05-19 · 3 min · 558 words

Speech/Audio Paper Digest 2026-05-17

2026-05-17 · updated 2026-05-19 · 3 min · 515 words

A Benchmark for Early-stage Parkinson’s Disease Detection from Speech

2026-05-15 · updated 2026-05-19 · 3 min · 531 words

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

2026-05-15 · updated 2026-05-19 · 4 min · 673 words

AudioMosaic: Contrastive Masked Audio Representation Learning

2026-05-15 · updated 2026-05-19 · 3 min · 635 words

Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

2026-05-15 · updated 2026-05-19 · 3 min · 517 words

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

2026-05-15 · updated 2026-05-19 · 3 min · 543 words

FSD50K-Solo: Automated Curation of Single-Source Sound Events

2026-05-15 · updated 2026-05-19 · 2 min · 354 words

FutureSim: Replaying World Events to Evaluate Adaptive Agents

2026-05-15 · updated 2026-05-19 · 3 min · 570 words

IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

2026-05-15 · updated 2026-05-19 · 3 min · 459 words

Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

2026-05-15 · updated 2026-05-19 · 3 min · 444 words

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

2026-05-15 · updated 2026-05-19 · 2 min · 303 words

Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

2026-05-15 · updated 2026-05-19 · 3 min · 509 words

Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

2026-05-15 · updated 2026-05-19 · 2 min · 290 words

Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection

2026-05-15 · updated 2026-05-19 · 2 min · 407 words

PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

2026-05-15 · updated 2026-05-19 · 3 min · 439 words

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

2026-05-15 · updated 2026-05-19 · 3 min · 453 words

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

2026-05-15 · updated 2026-05-19 · 3 min · 621 words

Streaming Speech-to-Text Translation with a SpeechLLM

2026-05-15 · updated 2026-05-19 · 2 min · 341 words

Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report

2026-05-15 · updated 2026-05-19 · 3 min · 516 words

Transmit Beamforming for High-Rate Underwater Acoustic Communications

2026-05-15 · updated 2026-05-19 · 2 min · 352 words

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

2026-05-15 · updated 2026-05-19 · 3 min · 590 words

Speech/Audio Paper Digest 2026-05-15

2026-05-15 · updated 2026-05-19 · 15 min · 3187 words

Bypassing Direct Reconstruction: Speech Detection from MEG via Large-Scale Audio Retrieval

2026-05-14 · updated 2026-05-19 · 2 min · 252 words

Decoupled Azimuth Elevation AoA Estimation Exploiting Kronecker Separable Steering Matrices

2026-05-14 · updated 2026-05-19 · 2 min · 331 words

Does language matter for spoken word classification? A multilingual generative meta-learning approach

2026-05-14 · updated 2026-05-19 · 2 min · 326 words

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

2026-05-14 · updated 2026-05-19 · 3 min · 545 words

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

2026-05-14 · updated 2026-05-19 · 3 min · 444 words

GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

2026-05-14 · updated 2026-05-19 · 2 min · 357 words

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

2026-05-14 · updated 2026-05-19 · 3 min · 510 words

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

2026-05-14 · updated 2026-05-19 · 2 min · 381 words

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

2026-05-14 · updated 2026-05-19 · 2 min · 362 words

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

2026-05-14 · updated 2026-05-19 · 3 min · 434 words

Scaling few-shot spoken word classification with generative meta-continual learning

2026-05-14 · updated 2026-05-19 · 2 min · 336 words

Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

2026-05-14 · updated 2026-05-19 · 4 min · 709 words

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

2026-05-14 · updated 2026-05-19 · 4 min · 720 words

Text2Score: Generating Sheet Music From Textual Prompts

2026-05-14 · updated 2026-05-19 · 3 min · 459 words

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

2026-05-14 · updated 2026-05-19 · 3 min · 453 words

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

2026-05-14 · updated 2026-05-19 · 3 min · 467 words

Speech/Audio Paper Digest 2026-05-14

2026-05-14 · updated 2026-05-19 · 11 min · 2240 words

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

2026-05-13 · updated 2026-05-19 · 3 min · 570 words

Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming

2026-05-13 · updated 2026-05-19 · 2 min · 365 words

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

2026-05-13 · updated 2026-05-19 · 3 min · 578 words

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

2026-05-13 · updated 2026-05-19 · 3 min · 487 words

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

2026-05-13 · updated 2026-05-19 · 3 min · 568 words

Chunkwise Aligners for Streaming Speech Recognition

2026-05-13 · updated 2026-05-19 · 3 min · 605 words

Exploring Token-Space Manipulation in Latent Audio Tokenizers

2026-05-13 · updated 2026-05-19 · 5 min · 900 words

jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition

2026-05-13 · updated 2026-05-19 · 3 min · 447 words

Mechanistic Interpretability of ASR models using Sparse Autoencoders

2026-05-13 · updated 2026-05-19 · 3 min · 429 words

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

2026-05-13 · updated 2026-05-19 · 1 min · 197 words

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

2026-05-13 · updated 2026-05-19 · 3 min · 556 words

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

2026-05-13 · updated 2026-05-19 · 4 min · 728 words

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

2026-05-13 · updated 2026-05-19 · 4 min · 688 words

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

2026-05-13 · updated 2026-05-19 · 4 min · 674 words

Spatial Power Estimation via Riemannian Covariance Matching

2026-05-13 · updated 2026-05-19 · 2 min · 295 words

STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

2026-05-13 · updated 2026-05-19 · 3 min · 435 words

The Deepfakes We Missed: We Built Detectors for a Threat That Didn’t Arrive

2026-05-13 · updated 2026-05-19 · 2 min · 324 words

The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking

2026-05-13 · updated 2026-05-19 · 2 min · 343 words

Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement

2026-05-13 · updated 2026-05-19 · 4 min · 644 words

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

2026-05-13 · updated 2026-05-19 · 5 min · 943 words

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

2026-05-13 · updated 2026-05-19 · 2 min · 399 words

What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty

2026-05-13 · updated 2026-05-19 · 3 min · 429 words

Speech/Audio Paper Digest 2026-05-13

2026-05-13 · updated 2026-05-19 · 14 min · 2798 words

A Cold Diffusion Approach for Percussive Dereverberation

2026-05-12 · updated 2026-05-19 · 4 min · 708 words

AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

2026-05-12 · updated 2026-05-19 · 2 min · 418 words

APEX: Audio Prototype EXplanations for Classification Tasks

2026-05-12 · updated 2026-05-19 · 4 min · 823 words

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

2026-05-12 · updated 2026-05-19 · 3 min · 505 words

ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

2026-05-12 · updated 2026-05-19 · 2 min · 367 words

CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings

2026-05-12 · updated 2026-05-19 · 4 min · 652 words

DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

2026-05-12 · updated 2026-05-19 · 3 min · 612 words

Dolphin-CN-Dialect: Where Chinese Dialects Matter

2026-05-12 · updated 2026-05-19 · 4 min · 696 words

Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

2026-05-12 · updated 2026-05-19 · 4 min · 663 words

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

2026-05-12 · updated 2026-05-19 · 3 min · 507 words

Encoding and Decoding Temporal Signals with Spiking Bandpass Wavelets

2026-05-12 · updated 2026-05-19 · 2 min · 405 words

Evaluating the Expressive Appropriateness of Speech in Rich Contexts

2026-05-12 · updated 2026-05-19 · 3 min · 633 words

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

2026-05-12 · updated 2026-05-19 · 4 min · 708 words

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

2026-05-12 · updated 2026-05-19 · 4 min · 839 words

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

2026-05-12 · updated 2026-05-19 · 4 min · 716 words

Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces

2026-05-12 · updated 2026-05-19 · 3 min · 446 words

Low-Cost Detection of Degraded Voice Clones via Source-Output Acoustic Consistency

2026-05-12 · updated 2026-05-19 · 3 min · 444 words

Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

2026-05-12 · updated 2026-05-19 · 3 min · 499 words

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

2026-05-12 · updated 2026-05-19 · 3 min · 433 words

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

2026-05-12 · updated 2026-05-19 · 3 min · 438 words

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

2026-05-12 · updated 2026-05-19 · 3 min · 558 words

Online Segmented Beamforming via Dynamic Programming

2026-05-12 · updated 2026-05-19 · 3 min · 448 words

PoDAR: Power-Disentangled Audio Representation for Generative Modeling

2026-05-12 · updated 2026-05-19 · 3 min · 618 words

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

2026-05-12 · updated 2026-05-19 · 3 min · 547 words

Probing Cross-modal Information Hubs in Audio-Visual LLMs

2026-05-12 · updated 2026-05-19 · 4 min · 724 words

RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

2026-05-12 · updated 2026-05-19 · 3 min · 429 words

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

2026-05-12 · updated 2026-05-19 · 4 min · 753 words

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

2026-05-12 · updated 2026-05-19 · 3 min · 529 words

Responsible Benchmarking of Fairness for Automatic Speech Recognition

2026-05-12 · updated 2026-05-19 · 2 min · 293 words

Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

2026-05-12 · updated 2026-05-19 · 3 min · 521 words

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

2026-05-12 · updated 2026-05-19 · 4 min · 660 words

SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements

2026-05-12 · updated 2026-05-19 · 3 min · 447 words

ShipEcho – An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels

2026-05-12 · updated 2026-05-19 · 2 min · 295 words

Single-Microphone Audio Point Source Discriminative Localization From Reverberation Late Tail Estimation

2026-05-12 · updated 2026-05-19 · 2 min · 339 words

Speech-based Psychological Crisis Assessment using LLMs

2026-05-12 · updated 2026-05-19 · 3 min · 451 words

Sub-JEPA: Subspace Gaussian Regularization for Stable End-to-End World Models

2026-05-12 · updated 2026-05-19 · 2 min · 229 words

Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

2026-05-12 · updated 2026-05-19 · 4 min · 773 words

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

2026-05-12 · updated 2026-05-19 · 3 min · 588 words

Voice Biomarkers for Depression and Anxiety

2026-05-12 · updated 2026-05-19 · 1 min · 166 words

Speech/Audio Paper Digest 2026-05-12

2026-05-12 · updated 2026-05-19 · 28 min · 5761 words

A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

2026-05-11 · updated 2026-05-19 · 3 min · 432 words

Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers

2026-05-11 · updated 2026-05-19 · 2 min · 398 words

Anisotropic Modality Align

2026-05-11 · updated 2026-05-19 · 3 min · 585 words

Asymmetric Phase Coding Audio Watermarking

2026-05-11 · updated 2026-05-19 · 3 min · 429 words

BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

2026-05-11 · updated 2026-05-19 · 2 min · 380 words

Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

2026-05-11 · updated 2026-05-19 · 2 min · 305 words

Do Joint Audio-Video Generation Models Understand Physics?

2026-05-11 · updated 2026-05-19 · 3 min · 589 words

Evaluating voice anonymisation using similarity rank disclosure

2026-05-11 · updated 2026-05-19 · 3 min · 435 words

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

2026-05-11 · updated 2026-05-19 · 2 min · 363 words

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

2026-05-11 · updated 2026-05-19 · 4 min · 710 words

TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

2026-05-11 · updated 2026-05-19 · 2 min · 410 words

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

2026-05-11 · updated 2026-05-19 · 2 min · 264 words

Speech/Audio Paper Digest 2026-05-11

2026-05-11 · updated 2026-05-19 · 9 min · 1723 words

Audio-Visual Intelligence in Large Foundation Models

2026-05-09 · updated 2026-05-19 · 1 min · 190 words

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

2026-05-09 · updated 2026-05-19 · 3 min · 520 words

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

2026-05-09 · updated 2026-05-19 · 2 min · 254 words

Speech/Audio Paper Digest 2026-05-09

2026-05-09 · updated 2026-05-19 · 3 min · 427 words

Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings

2026-05-08 · updated 2026-05-19 · 3 min · 543 words

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

2026-05-08 · updated 2026-05-19 · 2 min · 393 words

Do Melody and Rhythm Coevolve?

2026-05-08 · updated 2026-05-19 · 3 min · 633 words

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

2026-05-08 · updated 2026-05-19 · 3 min · 449 words

Linear Semantic Segmentation for Low-Resource Spoken Dialects

2026-05-08 · updated 2026-05-19 · 4 min · 738 words

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

2026-05-08 · updated 2026-05-19 · 5 min · 945 words

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

2026-05-08 · updated 2026-05-19 · 7 min · 1464 words

Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

2026-05-08 · updated 2026-05-19 · 3 min · 519 words

More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation

2026-05-08 · updated 2026-05-19 · 1 min · 172 words

MultiLinguahah : A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

2026-05-08 · updated 2026-05-19 · 4 min · 774 words

NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction

2026-05-08 · updated 2026-05-19 · 2 min · 414 words

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

2026-05-08 · updated 2026-05-19 · 6 min · 1097 words

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

2026-05-08 · updated 2026-05-19 · 3 min · 566 words

PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

2026-05-08 · updated 2026-05-19 · 3 min · 607 words

PianoCoRe: Combined and Refined Piano MIDI Dataset

2026-05-08 · updated 2026-05-19 · 4 min · 813 words

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

2026-05-08 · updated 2026-05-19 · 7 min · 1301 words

Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction

2026-05-08 · updated 2026-05-19 · 3 min · 432 words

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

2026-05-08 · updated 2026-05-19 · 1 min · 196 words

Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

2026-05-08 · updated 2026-05-19 · 2 min · 399 words

Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models

2026-05-08 · updated 2026-05-19 · 4 min · 751 words

Topological Signatures of Grokking

2026-05-08 · updated 2026-05-19 · 3 min · 480 words

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

2026-05-08 · updated 2026-05-19 · 4 min · 761 words

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

2026-05-08 · updated 2026-05-19 · 3 min · 593 words

Speech/Audio Paper Digest 2026-05-08

2026-05-08 · updated 2026-05-19 · 17 min · 3434 words

Adaptive Diagonal Loading for Norm Constrained Beamforming

2026-05-07 · updated 2026-05-19 · 1 min · 183 words

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

2026-05-07 · updated 2026-05-19 · 3 min · 485 words

AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

2026-05-07 · updated 2026-05-19 · 3 min · 444 words

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

2026-05-07 · updated 2026-05-19 · 1 min · 116 words

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

2026-05-07 · updated 2026-05-19 · 2 min · 364 words

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

2026-05-07 · updated 2026-05-19 · 2 min · 282 words

Hearing the Ocean: Bio-inspired Gammatone-CNN framework for Robust Underwater Acoustic Target Classification

2026-05-07 · updated 2026-05-19 · 2 min · 341 words

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

2026-05-07 · updated 2026-05-19 · 2 min · 418 words

Library learning with e-graphs on jazz harmony

2026-05-07 · updated 2026-05-19 · 2 min · 304 words

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

2026-05-07 · updated 2026-05-19 · 3 min · 523 words

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

2026-05-07 · updated 2026-05-19 · 1 min · 208 words

PHALAR: Phasors for Learned Musical Audio Representations

2026-05-07 · updated 2026-05-19 · 3 min · 468 words

RenCon 2025: Revival of the Expressive Performance Rendering Competition

2026-05-07 · updated 2026-05-19 · 2 min · 336 words

SEI-SHIELD: Robust Specific Emitter Identification Under Label Noise Via Self-Supervised Filtering and Iterative Rescue

2026-05-07 · updated 2026-05-19 · 3 min · 492 words

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

2026-05-07 · updated 2026-05-19 · 2 min · 417 words

Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement

2026-05-07 · updated 2026-05-19 · 4 min · 797 words

Stage Light is Sequence^2: Multi-Light Control via Imitation Learning

2026-05-07 · updated 2026-05-19 · 3 min · 501 words

Stage-adaptive audio diffusion modeling

2026-05-07 · updated 2026-05-19 · 2 min · 353 words

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

2026-05-07 · updated 2026-05-19 · 3 min · 457 words

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

2026-05-07 · updated 2026-05-19 · 3 min · 540 words

Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity

2026-05-07 · updated 2026-05-19 · 3 min · 570 words

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

2026-05-07 · updated 2026-05-19 · 4 min · 643 words

Speech/Audio Paper Digest 2026-05-07

2026-05-07 · updated 2026-05-19 · 14 min · 2879 words

A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition applied on French Language

2026-05-06 · updated 2026-05-19 · 2 min · 411 words

A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition

2026-05-06 · updated 2026-05-19 · 1 min · 112 words

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

2026-05-06 · updated 2026-05-19 · 3 min · 439 words

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

2026-05-06 · updated 2026-05-19 · 2 min · 357 words

Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs

2026-05-06 · updated 2026-05-19 · 2 min · 306 words

AsymK-Talker: Real-Time and Long-Horizon Talking Head Generation via Asymmetric Kernel Distillation

2026-05-06 · updated 2026-05-19 · 2 min · 418 words

Contrastive Regularization for Accent-Robust ASR

2026-05-06 · updated 2026-05-19 · 2 min · 359 words

Cosmodoit: A Python Package for Adaptive, Efficient Pipelining of Feature Extraction from Performed Music

2026-05-06 · updated 2026-05-19 · 1 min · 207 words

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition

2026-05-06 · updated 2026-05-19 · 3 min · 485 words

Deepfake Audio Detection Using Self-supervised Fusion Representations

2026-05-06 · updated 2026-05-19 · 2 min · 265 words

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

2026-05-06 · updated 2026-05-19 · 2 min · 312 words

Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

2026-05-06 · updated 2026-05-19 · 3 min · 428 words

Learning Generalizable Action Representations via Pre-training AEMG

2026-05-06 · updated 2026-05-19 · 2 min · 338 words

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

2026-05-06 · updated 2026-05-19 · 5 min · 929 words

Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection

2026-05-06 · updated 2026-05-19 · 3 min · 456 words

PHALAR: Phasors for Learned Musical Audio Representations

2026-05-06 · updated 2026-05-19 · 3 min · 491 words

Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings

2026-05-06 · updated 2026-05-19 · 2 min · 357 words

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

2026-05-06 · updated 2026-05-19 · 3 min · 429 words

Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

2026-05-06 · updated 2026-05-19 · 1 min · 123 words

Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

2026-05-06 · updated 2026-05-19 · 3 min · 497 words

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

2026-05-06 · updated 2026-05-19 · 3 min · 464 words

Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts

2026-05-06 · updated 2026-05-19 · 2 min · 325 words

Towards Open World Sound Event Detection

2026-05-06 · updated 2026-05-19 · 3 min · 475 words

Speech/Audio Paper Digest 2026-05-06

2026-05-06 · updated 2026-05-19 · 15 min · 3158 words

Artificial intelligence language technologies in multilingual healthcare: Grand challenges ahead

2026-05-05 · updated 2026-05-19 · 1 min · 129 words

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

2026-05-05 · updated 2026-05-19 · 2 min · 295 words

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

2026-05-05 · updated 2026-05-19 · 3 min · 461 words

Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

2026-05-05 · 更新于 2026-05-19 · 3 min · 458 words

Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks

2026-05-05 · 更新于 2026-05-19 · 2 min · 413 words

HARMES: A Multi-Modal Dataset for Wearable Human Activity Recognition with Motion, Environmental Sensing and Sound

2026-05-05 · 更新于 2026-05-19 · 2 min · 286 words

Integrating acoustic tapping with a UAV platform for tile condition classification

2026-05-05 · 更新于 2026-05-19 · 3 min · 472 words

Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

2026-05-05 · 更新于 2026-05-19 · 2 min · 403 words

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

2026-05-05 · 更新于 2026-05-19 · 1 min · 119 words

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

2026-05-05 · 更新于 2026-05-19 · 3 min · 495 words

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

2026-05-05 · 更新于 2026-05-19 · 2 min · 312 words

MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention

2026-05-05 · 更新于 2026-05-19 · 2 min · 331 words

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

2026-05-05 · 更新于 2026-05-19 · 2 min · 389 words

Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

2026-05-05 · 更新于 2026-05-19 · 2 min · 405 words

Multimodal Confidence Modeling in Audio-Visual Quality Assessment

2026-05-05 · 更新于 2026-05-19 · 3 min · 433 words

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

2026-05-05 · 更新于 2026-05-19 · 2 min · 386 words

Neck-Learn: Attention-Based Multiple Instance Learning and Ensemble Framework for Ecological Momentary Assessment

2026-05-05 · 更新于 2026-05-19 · 2 min · 362 words

NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty

2026-05-05 · 更新于 2026-05-19 · 2 min · 396 words

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

2026-05-05 · 更新于 2026-05-19 · 2 min · 302 words

PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention

2026-05-05 · 更新于 2026-05-19 · 3 min · 464 words

Period-conscious Time-series Reconstruction under Local Differential Privacy

2026-05-05 · 更新于 2026-05-19 · 2 min · 255 words

Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

2026-05-05 · 更新于 2026-05-19 · 2 min · 350 words

RenCon 2025: Revival of the Expressive Performance Rendering Competition

2026-05-05 · 更新于 2026-05-19 · 2 min · 277 words

Spoken Language Identification with Pre-trained Models and Margin Loss

2026-05-05 · 更新于 2026-05-19 · 1 min · 194 words

The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge

2026-05-05 · 更新于 2026-05-19 · 2 min · 261 words

The AECM Algorithm for Deterministic Maximum Likelihood Direction Finding in the Presence of Gaussian Mixture Noise

2026-05-05 · 更新于 2026-05-19 · 1 min · 188 words

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

2026-05-05 · 更新于 2026-05-19 · 1 min · 202 words

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

2026-05-05 · 更新于 2026-05-19 · 2 min · 420 words

Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI

2026-05-05 · 更新于 2026-05-19 · 1 min · 109 words

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

2026-05-05 · 更新于 2026-05-19 · 1 min · 213 words

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

2026-05-05 · 更新于 2026-05-19 · 2 min · 237 words

When Attention Collapses: Residual Evidence Modeling for Compositional Inference

2026-05-05 · 更新于 2026-05-19 · 2 min · 323 words

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

2026-05-05 · 更新于 2026-05-19 · 1 min · 164 words

Speech/Audio Paper Digest 2026-05-05

2026-05-05 · 更新于 2026-05-19 · 19 min · 3988 words

A Brain-Inspired Gating Mechanism Unlocks Robust Computation in Spiking Neural Networks

2026-05-04 · 更新于 2026-05-19 · 2 min · 288 words

A cross-species neural foundation model for end-to-end speech decoding

2026-05-04 · 更新于 2026-05-19 · 2 min · 349 words

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

2026-05-04 · 更新于 2026-05-19 · 2 min · 378 words

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

2026-05-04 · 更新于 2026-05-19 · 2 min · 250 words

Alethia: A Foundational Encoder for Voice Deepfakes

2026-05-04 · 更新于 2026-05-19 · 1 min · 204 words

AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching

2026-05-04 · 更新于 2026-05-19 · 2 min · 299 words

Are Deep Speech Denoising Models Robust to Adversarial Noise?

2026-05-04 · 更新于 2026-05-19 · 2 min · 291 words

AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models

2026-05-04 · 更新于 2026-05-19 · 3 min · 440 words

AudioX: A Unified Framework for Anything-to-Audio Generation

2026-05-04 · 更新于 2026-05-19 · 4 min · 756 words

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

2026-05-04 · 更新于 2026-05-19 · 2 min · 328 words

Aurelius: Relation Aware Text-to-Audio Generation At Scale

2026-05-04 · 更新于 2026-05-19 · 2 min · 390 words

Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

2026-05-04 · 更新于 2026-05-19 · 3 min · 450 words

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

2026-05-04 · 更新于 2026-05-19 · 3 min · 477 words

AVEX: What Matters for Animal Vocalization Encoding

2026-05-04 · 更新于 2026-05-19 · 3 min · 432 words

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

2026-05-04 · 更新于 2026-05-19 · 3 min · 467 words

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 425 words

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

2026-05-04 · 更新于 2026-05-19 · 2 min · 258 words

Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval

2026-05-04 · 更新于 2026-05-19 · 2 min · 411 words

Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

2026-05-04 · 更新于 2026-05-19 · 3 min · 577 words

Can Speech LLMs Think while Listening?

2026-05-04 · 更新于 2026-05-19 · 2 min · 347 words

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

2026-05-04 · 更新于 2026-05-19 · 2 min · 261 words

Closing the Gap Between Text and Speech Understanding in LLMs

2026-05-04 · 更新于 2026-05-19 · 2 min · 323 words

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

2026-05-04 · 更新于 2026-05-19 · 2 min · 301 words

Confident and Adaptive Generative Speech Recognition via Risk Control

2026-05-04 · 更新于 2026-05-19 · 2 min · 351 words

Continuous Audio Language Models

2026-05-04 · 更新于 2026-05-19 · 3 min · 525 words

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

2026-05-04 · 更新于 2026-05-19 · 2 min · 345 words

CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval

2026-05-04 · 更新于 2026-05-19 · 2 min · 296 words

Data-Centric Lessons To Improve Speech-Language Pretraining

2026-05-04 · 更新于 2026-05-19 · 2 min · 277 words

Deep Learning with Learnable Product-Structured Activations

2026-05-04 · 更新于 2026-05-19 · 2 min · 298 words

DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

2026-05-04 · 更新于 2026-05-19 · 3 min · 589 words

Discovering and Steering Interpretable Concepts in Large Generative Music Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 224 words

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

2026-05-04 · 更新于 2026-05-19 · 2 min · 392 words

DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

2026-05-04 · 更新于 2026-05-19 · 2 min · 381 words

Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

2026-05-04 · 更新于 2026-05-19 · 2 min · 226 words

EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 261 words

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

2026-05-04 · 更新于 2026-05-19 · 2 min · 251 words

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

2026-05-04 · 更新于 2026-05-19 · 2 min · 229 words

End-to-end Listen, Look, Speak and Act

2026-05-04 · 更新于 2026-05-19 · 2 min · 277 words

Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression

2026-05-04 · 更新于 2026-05-19 · 2 min · 393 words

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

2026-05-04 · 更新于 2026-05-19 · 4 min · 669 words

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

2026-05-04 · 更新于 2026-05-19 · 2 min · 348 words

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

2026-05-04 · 更新于 2026-05-19 · 2 min · 373 words

Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation

2026-05-04 · 更新于 2026-05-19 · 3 min · 487 words

FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

2026-05-04 · 更新于 2026-05-19 · 3 min · 577 words

From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings

2026-05-04 · 更新于 2026-05-19 · 2 min · 345 words

From Natural Alignment to Conditional Controllability in Multimodal Dialogue

2026-05-04 · 更新于 2026-05-19 · 2 min · 286 words

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

2026-05-04 · 更新于 2026-05-19 · 2 min · 367 words

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

2026-05-04 · 更新于 2026-05-19 · 1 min · 162 words

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

2026-05-04 · 更新于 2026-05-19 · 2 min · 342 words

Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation

2026-05-04 · 更新于 2026-05-19 · 3 min · 461 words

Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

2026-05-04 · 更新于 2026-05-19 · 2 min · 367 words

Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis

2026-05-04 · 更新于 2026-05-19 · 4 min · 776 words

Human Behavior Atlas: Benchmarking Unified Psychological And Social Behavior Understanding

2026-05-04 · 更新于 2026-05-19 · 2 min · 384 words

Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

2026-05-04 · 更新于 2026-05-19 · 2 min · 233 words

ICLR 2026 - Motion Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 115 words

ICLR 2026 - Image Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 100 words

ICLR 2026 - Benchmark #Dataset Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 136 words

ICLR 2026 - Benchmark Paper List

2026-05-04 · 更新于 2026-05-19 · 6 min · 1203 words

ICLR 2026 - Sound Source Localization Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 113 words

ICLR 2026 - Multimodal Reasoning Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 102 words

ICLR 2026 - Multimodal Models Paper List

2026-05-04 · 更新于 2026-05-19 · 4 min · 671 words

ICLR 2026 - Sequential Disentanglement Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 193 words

ICLR 2026 - Dataset Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 144 words

ICLR 2026 - Robot Manipulation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 122 words

ICLR 2026 - Model Interpretability Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 149 words

ICLR 2026 - Model Comparison Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 121 words

ICLR 2026 - Model Evaluation Paper List

2026-05-04 · 更新于 2026-05-19 · 2 min · 281 words

ICLR 2026 - Ecological Computing Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 130 words

ICLR 2026 - Generative Models Paper List

2026-05-04 · 更新于 2026-05-19 · 2 min · 272 words

ICLR 2026 - Bioacoustics Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 193 words

ICLR 2026 - Neural Network Architecture Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 97 words

ICLR 2026 - Spatial Audio Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 105 words

ICLR 2026 - Brain Encoding Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 97 words

ICLR 2026 - Video Captioning Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 187 words

ICLR 2026 - Video Summarization Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 103 words

ICLR 2026 - Video Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 171 words

ICLR 2026 - Speech Separation Paper List

2026-05-04 · 更新于 2026-05-19 · 4 min · 708 words

ICLR 2026 - Speech Synthesis Paper List

2026-05-04 · 更新于 2026-05-19 · 8 min · 1679 words

ICLR 2026 - Speech Synthesis Evaluation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 198 words

ICLR 2026 - Speech Enhancement #Adversarial Examples Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 131 words

ICLR 2026 - Speech Enhancement Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 105 words

ICLR 2026 - Large Speech Models Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 128 words

ICLR 2026 - Spoken Dialogue Systems Paper List

2026-05-04 · 更新于 2026-05-19 · 4 min · 817 words

ICLR 2026 - Speech Emotion Recognition Paper List

2026-05-04 · 更新于 2026-05-19 · 3 min · 637 words

ICLR 2026 - Speech Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 126 words

ICLR 2026 - Speech Translation Paper List

2026-05-04 · 更新于 2026-05-19 · 2 min · 214 words

ICLR 2026 - Speech Recognition #Speech Synthesis Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 197 words

ICLR 2026 - Speech Recognition Paper List

2026-05-04 · 更新于 2026-05-19 · 6 min · 1099 words

ICLR 2026 - Voice Conversion #Voice Anonymization Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 168 words

ICLR 2026 - Spoken Question Answering Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 145 words

ICLR 2026 - Cross-Modal Retrieval Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 91 words

ICLR 2026 - Cross-Modal Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 108 words

ICLR 2026 - Music Information Retrieval Paper List

2026-05-04 · 更新于 2026-05-19 · 2 min · 262 words

ICLR 2026 - Music Understanding Paper List

2026-05-04 · 更新于 2026-05-19 · 2 min · 224 words

ICLR 2026 - Music Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 7 min · 1298 words

ICLR 2026 - Audio-Visual Paper List

2026-05-04 · 更新于 2026-05-19 · 2 min · 400 words

ICLR 2026 - Audio-Visual Event Detection Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 128 words

ICLR 2026 - Audio-Visual Deepfake Detection Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 109 words

ICLR 2026 - Joint Audio-Visual Reasoning Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 91 words

ICLR 2026 - Audio Separation Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 119 words

ICLR 2026 - Audio Classification Paper List

2026-05-04 · 更新于 2026-05-19 · 4 min · 839 words

ICLR 2026 - Audio Scene Understanding Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 114 words

ICLR 2026 - Audio Security Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 127 words

ICLR 2026 - Audio Retrieval Paper List

2026-05-04 · 更新于 2026-05-19 · 3 min · 500 words

ICLR 2026 - Audio Generation Paper List

2026-05-04 · 更新于 2026-05-19 · 9 min · 1782 words

ICLR 2026 - Audio Editing Paper List

2026-05-04 · 更新于 2026-05-19 · 1 min · 130 words

ICLR 2026 - Audio Question Answering Paper List

2026-05-04 · 更新于 2026-05-19 · 3 min · 541 words

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

2026-05-04 · 更新于 2026-05-19 · 2 min · 261 words

Instilling an Active Mind in Avatars via Cognitive Simulation

2026-05-04 · 更新于 2026-05-19 · 2 min · 285 words

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

2026-05-04 · 更新于 2026-05-19 · 2 min · 376 words

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 283 words

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

2026-05-04 · 更新于 2026-05-19 · 2 min · 370 words

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

2026-05-04 · 更新于 2026-05-19 · 2 min · 327 words

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

2026-05-04 · 更新于 2026-05-19 · 2 min · 358 words

Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks

2026-05-04 · 更新于 2026-05-19 · 3 min · 439 words

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

2026-05-04 · 更新于 2026-05-19 · 2 min · 331 words

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

2026-05-04 · 更新于 2026-05-19 · 2 min · 397 words

Latent Fourier Transform

2026-05-04 · 更新于 2026-05-19 · 2 min · 294 words

Latent Speech-Text Transformer

2026-05-04 · 更新于 2026-05-19 · 3 min · 485 words

LayerSync: Self-aligning Intermediate Layers

2026-05-04 · 更新于 2026-05-19 · 2 min · 311 words

Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition

2026-05-04 · 更新于 2026-05-19 · 2 min · 402 words

Learning multimodal dictionary decompositions with group-sparse autoencoders

2026-05-04 · 更新于 2026-05-19 · 2 min · 290 words

LLM2Fx-Tools: Tool Calling for Music Post-Production

2026-05-04 · 更新于 2026-05-19 · 2 min · 385 words

MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control

2026-05-04 · 更新于 2026-05-19 · 2 min · 252 words

MAPSS: Manifold-based Assessment of Perceptual Source Separation

2026-05-04 · 更新于 2026-05-19 · 2 min · 237 words

MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

2026-05-04 · 更新于 2026-05-19 · 5 min · 908 words

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

2026-05-04 · 更新于 2026-05-19 · 2 min · 289 words

Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 243 words

MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications

2026-05-04 · 更新于 2026-05-19 · 2 min · 421 words

MindMix: A Multimodal Foundation Model for Auditory Perception Decoding via Deep Neural-Acoustic Alignment

2026-05-04 · 更新于 2026-05-19 · 3 min · 444 words

MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

2026-05-04 · 更新于 2026-05-19 · 2 min · 373 words

MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

2026-05-04 · 更新于 2026-05-19 · 2 min · 382 words

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

2026-05-04 · 更新于 2026-05-19 · 1 min · 176 words

Music Flamingo: Scaling Music Understanding in Audio Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 392 words

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

2026-05-04 · 更新于 2026-05-19 · 2 min · 316 words

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

2026-05-04 · 更新于 2026-05-19 · 2 min · 364 words

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

2026-05-04 · 更新于 2026-05-19 · 2 min · 367 words

OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text

2026-05-04 · 更新于 2026-05-19 · 2 min · 247 words

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

2026-05-04 · 更新于 2026-05-19 · 2 min · 292 words

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

2026-05-04 · 更新于 2026-05-19 · 2 min · 406 words

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

2026-05-04 · 更新于 2026-05-19 · 3 min · 464 words

OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 326 words

PACE: Pretrained Audio Continual Learning

2026-05-04 · 更新于 2026-05-19 · 2 min · 376 words

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

2026-05-04 · 更新于 2026-05-19 · 2 min · 272 words

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

2026-05-04 · 更新于 2026-05-19 · 2 min · 324 words

Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization

2026-05-04 · 更新于 2026-05-19 · 2 min · 275 words

PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation

2026-05-04 · 更新于 2026-05-19 · 2 min · 316 words

Query-Guided Spatial–Temporal–Frequency Interaction for Music Audio–Visual Question Answering

2026-05-04 · 更新于 2026-05-19 · 2 min · 244 words

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

2026-05-04 · 更新于 2026-05-19 · 3 min · 545 words

RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System

2026-05-04 · 更新于 2026-05-19 · 2 min · 285 words

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

2026-05-04 · 更新于 2026-05-19 · 2 min · 340 words

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

2026-05-04 · 更新于 2026-05-19 · 2 min · 329 words

Scaling Speech Tokenizers with Diffusion Autoencoders

2026-05-04 · 更新于 2026-05-19 · 2 min · 342 words

SCRAPL: Scattering Transform with Random Paths for Machine Learning

2026-05-04 · 更新于 2026-05-19 · 3 min · 516 words

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

2026-05-04 · 更新于 2026-05-19 · 2 min · 290 words

SmartDJ: Declarative Audio Editing with Audio Language Model

2026-05-04 · 更新于 2026-05-19 · 2 min · 330 words

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

2026-05-04 · 更新于 2026-05-19 · 3 min · 578 words

SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

2026-05-04 · 更新于 2026-05-19 · 2 min · 326 words

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

2026-05-04 · 更新于 2026-05-19 · 2 min · 383 words

Speech World Model: Causal State–Action Planning with Explicit Reasoning for Speech

2026-05-04 · 更新于 2026-05-19 · 3 min · 499 words

Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

2026-05-04 · 更新于 2026-05-19 · 2 min · 288 words

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

2026-05-04 · 更新于 2026-05-19 · 3 min · 619 words

SpeechOp: Inference-Time Task Composition for Generative Speech Processing

2026-05-04 · 更新于 2026-05-19 · 2 min · 344 words

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

2026-05-04 · 更新于 2026-05-19 · 2 min · 280 words

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

2026-05-04 · 更新于 2026-05-19 · 1 min · 207 words

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

2026-05-04 · 更新于 2026-05-19 · 2 min · 257 words

Steering Autoregressive Music Generation with Recursive Feature Machines

2026-05-04 · 更新于 2026-05-19 · 2 min · 422 words

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 241 words

SumRA: Parameter Efficient Fine-tuning with Singular Value Decomposition and Summed Orthogonal Basis

2026-05-04 · 更新于 2026-05-19 · 2 min · 420 words

SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

2026-05-04 · 更新于 2026-05-19 · 2 min · 376 words

Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

2026-05-04 · 更新于 2026-05-19 · 2 min · 358 words

SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

2026-05-04 · 更新于 2026-05-19 · 2 min · 345 words

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

2026-05-04 · 更新于 2026-05-19 · 5 min · 1000 words

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

2026-05-04 · 更新于 2026-05-19 · 2 min · 379 words

Tell me Habibi, is it Real or Fake?

2026-05-04 · 更新于 2026-05-19 · 2 min · 276 words

The Deleuzian Representation Hypothesis

2026-05-04 · 更新于 2026-05-19 · 2 min · 285 words

Timing is Everything: Temporal Scaffolding of Semantic Surprise in Humor

2026-05-04 · 更新于 2026-05-19 · 2 min · 349 words

TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO-DESIGN APPROACH FOR EFFICIENT MULTIMODAL INFERENCE ON BATTERY-POWERED SMALL DEVICES

2026-05-04 · 更新于 2026-05-19 · 2 min · 227 words

Token-Based Audio Inpainting via Discrete Diffusion

2026-05-04 · 更新于 2026-05-19 · 3 min · 508 words

Toward Complex-Valued Neural Networks for Waveform Generation

2026-05-04 · 更新于 2026-05-19 · 2 min · 308 words

Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

2026-05-04 · 更新于 2026-05-19 · 2 min · 226 words

Towards True Speech-to-Speech Models Without Text Guidance

2026-05-04 · 更新于 2026-05-19 · 2 min · 393 words

Transformer-based End-to-End Control Filter Generation for Active Noise Control

2026-05-04 · 更新于 2026-05-19 · 2 min · 316 words

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

2026-05-04 · 更新于 2026-05-19 · 2 min · 348 words

TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

2026-05-04 · 更新于 2026-05-19 · 2 min · 332 words

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

2026-05-04 · 更新于 2026-05-19 · 2 min · 365 words

TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

2026-05-04 · 更新于 2026-05-19 · 2 min · 327 words

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

2026-05-04 · 更新于 2026-05-19 · 2 min · 386 words

Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow

2026-05-04 · 更新于 2026-05-19 · 2 min · 340 words

UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

2026-05-04 · 更新于 2026-05-19 · 2 min · 306 words

Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

2026-05-04 · 更新于 2026-05-19 · 2 min · 300 words

VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

2026-05-04 · 更新于 2026-05-19 · 2 min · 323 words

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video

2026-05-04 · 更新于 2026-05-19 · 2 min · 220 words

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

2026-05-04 · 更新于 2026-05-19 · 2 min · 335 words

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 292 words

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

2026-05-04 · 更新于 2026-05-19 · 3 min · 552 words

WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

2026-05-04 · 更新于 2026-05-19 · 2 min · 327 words

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

2026-05-04 · 更新于 2026-05-19 · 2 min · 240 words

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

2026-05-04 · 更新于 2026-05-19 · 2 min · 269 words

YuE: Scaling Open Foundation Models for Long-Form Music Generation

2026-05-04 · 更新于 2026-05-19 · 2 min · 424 words

Speech/Audio Paper Digest 2026-05-04

2026-05-04 · 更新于 2026-05-19 · 9 min · 1720 words

Speech/Audio Paper Digest 2026-05-03

2026-05-03 · 更新于 2026-05-19 · 8 min · 1688 words

A Brain-Inspired Gating Mechanism Unlocks Robust Computation in Spiking Neural Networks

2026-05-02 · 更新于 2026-05-19 · 3 min · 552 words

A cross-species neural foundation model for end-to-end speech decoding

2026-05-02 · 更新于 2026-05-19 · 2 min · 412 words

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

2026-05-02 · 更新于 2026-05-19 · 2 min · 395 words

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

2026-05-02 · 更新于 2026-05-19 · 2 min · 382 words

AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching

2026-05-02 · 更新于 2026-05-19 · 3 min · 441 words

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

2026-05-02 · 更新于 2026-05-19 · 3 min · 485 words

Are Deep Speech Denoising Models Robust to Adversarial Noise?

2026-05-02 · 更新于 2026-05-19 · 1 min · 203 words

AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models

2026-05-02 · 更新于 2026-05-19 · 3 min · 476 words

AudioX: A Unified Framework for Anything-to-Audio Generation

2026-05-02 · 更新于 2026-05-19 · 3 min · 442 words

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

2026-05-02 · 更新于 2026-05-19 · 2 min · 423 words

Aurelius: Relation Aware Text-to-Audio Generation At Scale

2026-05-02 · 更新于 2026-05-19 · 2 min · 386 words

Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

2026-05-02 · 更新于 2026-05-19 · 3 min · 454 words

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

2026-05-02 · 更新于 2026-05-19 · 2 min · 293 words

AVEX: What Matters for Animal Vocalization Encoding

2026-05-02 · 更新于 2026-05-19 · 2 min · 318 words

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

2026-05-02 · 更新于 2026-05-19 · 2 min · 346 words

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

2026-05-02 · 更新于 2026-05-19 · 2 min · 406 words

Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval

2026-05-02 · 更新于 2026-05-19 · 2 min · 343 words

Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

2026-05-02 · 更新于 2026-05-19 · 2 min · 417 words

Can Speech LLMs Think while Listening?

2026-05-02 · 更新于 2026-05-19 · 2 min · 298 words

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

2026-05-02 · 更新于 2026-05-19 · 2 min · 254 words

Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings

2026-05-02 · 更新于 2026-05-19 · 2 min · 395 words

Closing the Gap Between Text and Speech Understanding in LLMs

2026-05-02 · 更新于 2026-05-19 · 3 min · 579 words

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

2026-05-02 · 更新于 2026-05-19 · 2 min · 355 words

Confident and Adaptive Generative Speech Recognition via Risk Control

2026-05-02 · 更新于 2026-05-19 · 2 min · 229 words

Continuous Audio Language Models

2026-05-02 · 更新于 2026-05-19 · 3 min · 587 words

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

2026-05-02 · 更新于 2026-05-19 · 2 min · 374 words

Data-Centric Lessons To Improve Speech-Language Pretraining

2026-05-02 · 更新于 2026-05-19 · 2 min · 265 words

Deep Learning with Learnable Product-Structured Activations

2026-05-02 · 更新于 2026-05-19 · 2 min · 326 words

DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

2026-05-02 · 更新于 2026-05-19 · 2 min · 365 words

Discovering and Steering Interpretable Concepts in Large Generative Music Models

2026-05-02 · 更新于 2026-05-19 · 2 min · 297 words

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

2026-05-02 · 更新于 2026-05-19 · 3 min · 445 words

DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

2026-05-02 · 更新于 2026-05-19 · 3 min · 496 words

Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

2026-05-02 · 更新于 2026-05-19 · 2 min · 225 words

EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

2026-05-02 · 更新于 2026-05-19 · 2 min · 287 words

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

2026-05-02 · 更新于 2026-05-19 · 2 min · 358 words

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

2026-05-02 · 更新于 2026-05-19 · 2 min · 251 words

End-to-end Listen, Look, Speak and Act

2026-05-02 · 更新于 2026-05-19 · 3 min · 444 words

Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression

2026-05-02 · 更新于 2026-05-19 · 2 min · 316 words

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

2026-05-02 · 更新于 2026-05-19 · 3 min · 544 words

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

2026-05-02 · updated 2026-05-19 · 2 min · 332 words

Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation

2026-05-02 · updated 2026-05-19 · 2 min · 353 words

FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

2026-05-02 · updated 2026-05-19 · 3 min · 431 words

From Natural Alignment to Conditional Controllability in Multimodal Dialogue

2026-05-02 · updated 2026-05-19 · 2 min · 326 words

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

2026-05-02 · updated 2026-05-19 · 2 min · 400 words

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

2026-05-02 · updated 2026-05-19 · 2 min · 295 words

Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation

2026-05-02 · updated 2026-05-19 · 2 min · 372 words

Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis

2026-05-02 · updated 2026-05-19 · 3 min · 457 words

Human Behavior Atlas: Benchmarking Unified Psychological And Social Behavior Understanding

2026-05-02 · updated 2026-05-19 · 2 min · 424 words

Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

2026-05-02 · updated 2026-05-19 · 1 min · 191 words

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

2026-05-02 · updated 2026-05-19 · 2 min · 289 words

Instilling an Active Mind in Avatars via Cognitive Simulation

2026-05-02 · updated 2026-05-19 · 2 min · 263 words

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

2026-05-02 · updated 2026-05-19 · 2 min · 350 words

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

2026-05-02 · updated 2026-05-19 · 3 min · 452 words

JaiTTS: A Thai Voice Cloning Model

2026-05-02 · updated 2026-05-19 · 2 min · 425 words

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

2026-05-02 · updated 2026-05-19 · 3 min · 631 words

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

2026-05-02 · updated 2026-05-19 · 3 min · 566 words

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

2026-05-02 · updated 2026-05-19 · 3 min · 567 words

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

2026-05-02 · updated 2026-05-19 · 2 min · 306 words

Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks

2026-05-02 · updated 2026-05-19 · 2 min · 372 words

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

2026-05-02 · updated 2026-05-19 · 3 min · 469 words

Latent Fourier Transform

2026-05-02 · updated 2026-05-19 · 2 min · 322 words

Latent Speech-Text Transformer

2026-05-02 · updated 2026-05-19 · 3 min · 535 words

LayerSync: Self-aligning Intermediate Layers

2026-05-02 · updated 2026-05-19 · 2 min · 346 words

Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition

2026-05-02 · updated 2026-05-19 · 2 min · 329 words

Learning multimodal dictionary decompositions with group-sparse autoencoders

2026-05-02 · updated 2026-05-19 · 2 min · 317 words

LLM2Fx-Tools: Tool Calling for Music Post-Production

2026-05-02 · updated 2026-05-19 · 3 min · 439 words

MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control

2026-05-02 · updated 2026-05-19 · 3 min · 453 words

MAPSS: Manifold-based Assessment of Perceptual Source Separation

2026-05-02 · updated 2026-05-19 · 2 min · 404 words

MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

2026-05-02 · updated 2026-05-19 · 2 min · 385 words

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

2026-05-02 · updated 2026-05-19 · 2 min · 349 words

Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

2026-05-02 · updated 2026-05-19 · 2 min · 284 words

MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications

2026-05-02 · updated 2026-05-19 · 2 min · 275 words

MindMix: A Multimodal Foundation Model for Auditory Perception Decoding via Deep Neural-Acoustic Alignment

2026-05-02 · updated 2026-05-19 · 3 min · 459 words

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

2026-05-02 · updated 2026-05-19 · 2 min · 406 words

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

2026-05-02 · updated 2026-05-19 · 2 min · 229 words

Music Flamingo: Scaling Music Understanding in Audio Language Models

2026-05-02 · updated 2026-05-19 · 3 min · 495 words

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

2026-05-02 · updated 2026-05-19 · 2 min · 248 words

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

2026-05-02 · updated 2026-05-19 · 2 min · 291 words

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

2026-05-02 · updated 2026-05-19 · 2 min · 243 words

OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text

2026-05-02 · updated 2026-05-19 · 2 min · 300 words

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

2026-05-02 · updated 2026-05-19 · 2 min · 292 words

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

2026-05-02 · updated 2026-05-19 · 2 min · 388 words

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

2026-05-02 · updated 2026-05-19 · 3 min · 581 words

OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

2026-05-02 · updated 2026-05-19 · 2 min · 406 words

PACE: Pretrained Audio Continual Learning

2026-05-02 · updated 2026-05-19 · 2 min · 384 words

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

2026-05-02 · updated 2026-05-19 · 2 min · 361 words

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

2026-05-02 · updated 2026-05-19 · 2 min · 371 words

Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization

2026-05-02 · updated 2026-05-19 · 2 min · 277 words

PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation

2026-05-02 · updated 2026-05-19 · 2 min · 397 words

Query-Guided Spatial–Temporal–Frequency Interaction for Music Audio–Visual Question Answering

2026-05-02 · updated 2026-05-19 · 2 min · 286 words

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

2026-05-02 · updated 2026-05-19 · 2 min · 346 words

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

2026-05-02 · updated 2026-05-19 · 2 min · 246 words

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

2026-05-02 · updated 2026-05-19 · 3 min · 599 words

Scaling Speech Tokenizers with Diffusion Autoencoders

2026-05-02 · updated 2026-05-19 · 2 min · 282 words

SCRAPL: Scattering Transform with Random Paths for Machine Learning

2026-05-02 · updated 2026-05-19 · 3 min · 487 words

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

2026-05-02 · updated 2026-05-19 · 2 min · 347 words

SmartDJ: Declarative Audio Editing with Audio Language Model

2026-05-02 · updated 2026-05-19 · 2 min · 328 words

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

2026-05-02 · updated 2026-05-19 · 3 min · 494 words

SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

2026-05-02 · updated 2026-05-19 · 3 min · 518 words

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

2026-05-02 · updated 2026-05-19 · 2 min · 387 words

Speech World Model: Causal State–Action Planning with Explicit Reasoning for Speech

2026-05-02 · updated 2026-05-19 · 2 min · 351 words

Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

2026-05-02 · updated 2026-05-19 · 2 min · 334 words

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

2026-05-02 · updated 2026-05-19 · 2 min · 349 words

SpeechOp: Inference-Time Task Composition for Generative Speech Processing

2026-05-02 · updated 2026-05-19 · 2 min · 340 words

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

2026-05-02 · updated 2026-05-19 · 2 min · 382 words

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

2026-05-02 · updated 2026-05-19 · 3 min · 506 words

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

2026-05-02 · updated 2026-05-19 · 2 min · 329 words

Steering Autoregressive Music Generation with Recursive Feature Machines

2026-05-02 · updated 2026-05-19 · 2 min · 318 words

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

2026-05-02 · updated 2026-05-19 · 2 min · 319 words

SumRA: Parameter Efficient Fine-tuning with Singular Value Decomposition and Summed Orthogonal Basis

2026-05-02 · updated 2026-05-19 · 2 min · 334 words

SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

2026-05-02 · updated 2026-05-19 · 2 min · 422 words

Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

2026-05-02 · updated 2026-05-19 · 3 min · 512 words

SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

2026-05-02 · updated 2026-05-19 · 3 min · 497 words

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

2026-05-02 · updated 2026-05-19 · 2 min · 295 words

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

2026-05-02 · updated 2026-05-19 · 2 min · 318 words

Tell me Habibi, is it Real or Fake?

2026-05-02 · updated 2026-05-19 · 2 min · 305 words

The Deleuzian Representation Hypothesis

2026-05-02 · updated 2026-05-19 · 2 min · 262 words

TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO-DESIGN APPROACH FOR EFFICIENT MULTIMODAL INFERENCE ON BATTERY-POWERED SMALL DEVICES

2026-05-02 · updated 2026-05-19 · 2 min · 284 words

Token-Based Audio Inpainting via Discrete Diffusion

2026-05-02 · updated 2026-05-19 · 3 min · 519 words

Toward Complex-Valued Neural Networks for Waveform Generation

2026-05-02 · updated 2026-05-19 · 3 min · 446 words

Towards True Speech-to-Speech Models Without Text Guidance

2026-05-02 · updated 2026-05-19 · 2 min · 368 words

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

2026-05-02 · updated 2026-05-19 · 2 min · 341 words

TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

2026-05-02 · updated 2026-05-19 · 2 min · 236 words

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

2026-05-02 · updated 2026-05-19 · 2 min · 294 words

TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

2026-05-02 · updated 2026-05-19 · 2 min · 396 words

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

2026-05-02 · updated 2026-05-19 · 2 min · 336 words

Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow

2026-05-02 · updated 2026-05-19 · 2 min · 357 words

UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

2026-05-02 · updated 2026-05-19 · 2 min · 338 words

Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

2026-05-02 · updated 2026-05-19 · 2 min · 323 words

VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

2026-05-02 · updated 2026-05-19 · 3 min · 432 words

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video

2026-05-02 · updated 2026-05-19 · 2 min · 300 words

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

2026-05-02 · updated 2026-05-19 · 3 min · 457 words

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

2026-05-02 · updated 2026-05-19 · 2 min · 361 words

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

2026-05-02 · updated 2026-05-19 · 2 min · 391 words

WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

2026-05-02 · updated 2026-05-19 · 2 min · 422 words

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

2026-05-02 · updated 2026-05-19 · 2 min · 353 words

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

2026-05-02 · updated 2026-05-19 · 2 min · 312 words

YuE: Scaling Open Foundation Models for Long-Form Music Generation

2026-05-02 · updated 2026-05-19 · 2 min · 354 words

Speech/Audio Paper Digest 2026-05-02

2026-05-02 · updated 2026-05-19 · 4 min · 724 words

ICASSP 2026 Speech/Audio Papers: Detailed Analysis

2026-05-01 · updated 2026-05-19 · 430 min · 91382 words

ICLR 2026 Speech/Audio Papers: Detailed Analysis

2026-05-01 · updated 2026-05-19 · 72 min · 15177 words

A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

2026-05-01 · updated 2026-05-19 · 2 min · 336 words

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

2026-05-01 · updated 2026-05-19 · 1 min · 148 words

Accent Conversion: A Problem-Driven Survey of Sociolinguistic and Technical Constraints

2026-05-01 · updated 2026-05-19 · 1 min · 181 words

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

2026-05-01 · updated 2026-05-19 · 2 min · 344 words

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

2026-05-01 · updated 2026-05-19 · 2 min · 357 words

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-05-01 · updated 2026-05-19 · 2 min · 267 words

Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues

2026-05-01 · updated 2026-05-19 · 1 min · 160 words

Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

2026-05-01 · updated 2026-05-19 · 3 min · 593 words

Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification

2026-05-01 · updated 2026-05-19 · 2 min · 378 words

BUT System Description for CHiME-9 MCoRec Challenge

2026-05-01 · updated 2026-05-19 · 2 min · 334 words

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

2026-05-01 · updated 2026-05-19 · 2 min · 396 words

Do Sparse Autoencoders Capture Concept Manifolds?

2026-05-01 · updated 2026-05-19 · 2 min · 283 words

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

2026-05-01 · updated 2026-05-19 · 3 min · 452 words

Earable Platform with Integrated Simultaneous EEG Sensing and Auditory Stimulation

2026-05-01 · updated 2026-05-19 · 2 min · 271 words

EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures

2026-05-01 · updated 2026-05-19 · 3 min · 568 words

Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing

2026-05-01 · updated 2026-05-19 · 2 min · 311 words

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-05-01 · updated 2026-05-19 · 2 min · 319 words

HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

2026-05-01 · updated 2026-05-19 · 2 min · 314 words

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2026-05-01 · updated 2026-05-19 · 2 min · 261 words

JaiTTS: A Thai Voice Cloning Model

2026-05-01 · updated 2026-05-19 · 2 min · 264 words

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-05-01 · updated 2026-05-19 · 2 min · 378 words

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

2026-05-01 · updated 2026-05-19 · 2 min · 228 words

Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI

2026-05-01 · updated 2026-05-19 · 1 min · 153 words

MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents

2026-05-01 · updated 2026-05-19 · 3 min · 434 words

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

2026-05-01 · updated 2026-05-19 · 3 min · 461 words

Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People

2026-05-01 · updated 2026-05-19 · 1 min · 125 words

Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device

2026-05-01 · updated 2026-05-19 · 3 min · 434 words

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

2026-05-01 · updated 2026-05-19 · 1 min · 139 words

Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping

2026-05-01 · updated 2026-05-19 · 1 min · 174 words

Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven’s Piano and Cello Sonatas, 1930–2012

2026-05-01 · updated 2026-05-19 · 2 min · 237 words

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

2026-05-01 · updated 2026-05-19 · 1 min · 133 words

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

2026-05-01 · updated 2026-05-19 · 2 min · 274 words

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-05-01 · updated 2026-05-19 · 2 min · 327 words

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2026-05-01 · updated 2026-05-19 · 4 min · 708 words

Speech/Audio Paper Digest 2026-05-01

2026-05-01 · updated 2026-05-19 · 12 min · 2481 words

April  1312

A New Location Estimator for Mixed LOS & NLOS scenarios

2026-04-30 · updated 2026-05-19 · 2 min · 319 words

A Toolkit for Detecting Spurious Correlations in Speech Datasets

2026-04-30 · updated 2026-05-19 · 2 min · 345 words

DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

2026-04-30 · updated 2026-05-19 · 2 min · 404 words

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

2026-04-30 · updated 2026-05-19 · 2 min · 318 words

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

2026-04-30 · updated 2026-05-19 · 2 min · 422 words

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

2026-04-30 · updated 2026-05-19 · 2 min · 411 words

Fitting Large Nonlinear Mixed Effects Models Using Variational Expectation Maximization

2026-04-30 · updated 2026-05-19 · 1 min · 103 words

Full band denoising of room impulse response in the wavelet domain with dictionary learning

2026-04-30 · updated 2026-05-19 · 2 min · 270 words

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

2026-04-30 · updated 2026-05-19 · 2 min · 344 words

Hankel and Toeplitz Rank-1 Decomposition of Arbitrary Matrices with Applications to Signal Direction-of-Arrival Estimation

2026-04-30 · updated 2026-05-19 · 2 min · 255 words

Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

2026-04-30 · updated 2026-05-19 · 2 min · 405 words

Multiple Additive Neural Networks for Structured and Unstructured Data

2026-04-30 · updated 2026-05-19 · 2 min · 297 words

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

2026-04-30 · updated 2026-05-19 · 2 min · 365 words

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

2026-04-30 · updated 2026-05-19 · 2 min · 411 words

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

2026-04-30 · updated 2026-05-19 · 3 min · 444 words

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

2026-04-30 · updated 2026-05-19 · 2 min · 410 words

Random Cloud: Finding Minimal Neural Architectures Without Training

2026-04-30 · updated 2026-05-19 · 2 min · 286 words

Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

2026-04-30 · updated 2026-05-19 · 1 min · 207 words

Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

2026-04-30 · updated 2026-05-19 · 3 min · 493 words

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

2026-04-30 · updated 2026-05-19 · 2 min · 223 words

StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario

2026-04-30 · updated 2026-05-19 · 2 min · 230 words

Step-Audio-R1.5 Technical Report

2026-04-30 · updated 2026-05-19 · 2 min · 266 words

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

2026-04-30 · updated 2026-05-19 · 2 min · 374 words

Text-Utilization for Encoder-dominated Speech Recognition Models

2026-04-30 · updated 2026-05-19 · 1 min · 135 words

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

2026-04-30 · updated 2026-05-19 · 2 min · 414 words

Speech/Audio Paper Digest 2026-04-30

2026-04-30 · updated 2026-05-19 · 16 min · 3385 words

3D Mesh Grid Room Impulse Responses Measured with A Linear Microphone Array And Suppression of Frame Reflections

2026-04-29 · updated 2026-05-19 · 1 min · 202 words

A Bayesian Approach to Singing Skill Evaluation Using Semitone Pitch Histogram and MCMC-Based Generated Quantities

2026-04-29 · updated 2026-05-19 · 2 min · 271 words

A Bimodal Approach for Detecting Fatigue Using Speech and Personal Assessments in College Students

2026-04-29 · updated 2026-05-19 · 1 min · 194 words

A Consistent Learning Depression Detection Framework Integrating Multi-View Attention

2026-04-29 · updated 2026-05-19 · 2 min · 298 words

A Data-Driven Framework for Personal Sound Zone Control Addressing Loudspeaker Nonlinearities

2026-04-29 · updated 2026-05-19 · 2 min · 342 words

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

2026-04-29 · updated 2026-05-19 · 2 min · 238 words

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

2026-04-29 · updated 2026-05-19 · 2 min · 279 words

A Dynamic Gated Cross-Attention Framework for Audio-Text Apparent Personality Analysis

2026-04-29 · updated 2026-05-19 · 2 min · 285 words

A Feature-Optimized Audio Watermarking Algorithm with Adaptive Embedding Strength

2026-04-29 · updated 2026-05-19 · 2 min · 375 words

A Framework for Controlled Multi-Speaker Audio Synthesis for Robustness Evaluation of Speaker Diarisation Systems

2026-04-29 · updated 2026-05-19 · 2 min · 342 words

A Generalization Strategy for Speech Quality Prediction: From Domain-Specific to Unified Datasets

2026-04-29 · updated 2026-05-19 · 2 min · 274 words

A Generative-First Neural Audio Autoencoder

2026-04-29 · updated 2026-05-19 · 2 min · 296 words

A Hybrid Convolution-Mamba Network with Tone-Octave Contrastive Learning for Stratified Semi-Supervised Singing Melody Extraction

2026-04-29 · updated 2026-05-19 · 2 min · 391 words

A Learning-Based Automotive Sound Field Reproduction Method Using Plane-Wave Decomposition and Multi-Position Constraint

2026-04-29 · updated 2026-05-19 · 2 min · 243 words

A Lightweight Fourier-Based Network for Binaural Speech Enhancement with Spatial Cue Preservation

2026-04-29 · updated 2026-05-19 · 2 min · 395 words

A LLM-Driven Acoustic Semantic Enriched Framework for Underwater Acoustic Target Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 379 words

A Metric Learning Approach to Heart Murmur Detection from Phonocardiogram Recordings

2026-04-29 · updated 2026-05-19 · 2 min · 389 words

A New Method and Dataset for Classroom Teaching Stage Segmentation

2026-04-29 · updated 2026-05-19 · 2 min · 372 words

A Noniterative Phase Retrieval Considering the Zeros of STFT Magnitude

2026-04-29 · updated 2026-05-19 · 2 min · 214 words

A Noval Monte Carlo Gradient Method Based on Meta-Learning for Effective Step-Size Selection in Active Noise Control

2026-04-29 · updated 2026-05-19 · 2 min · 242 words

A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection

2026-04-29 · updated 2026-05-19 · 2 min · 314 words

A Personalized Real-Time Proactive Voice Memory Assistant

2026-04-29 · updated 2026-05-19 · 2 min · 298 words

A Robust KNN Approach for Multi-Class Laryngeal Disease Detection using MFCC Features

2026-04-29 · updated 2026-05-19 · 2 min · 219 words

A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding

2026-04-29 · updated 2026-05-19 · 1 min · 194 words

A Speech-Driven Paradigm for Physics-Informed Modeling of Coupled Micro-Speakers

2026-04-29 · updated 2026-05-19 · 2 min · 280 words

A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering

2026-04-29 · updated 2026-05-19 · 2 min · 357 words

A State-Dependent Markov Diffusion Process for Generative Speech Enhancement

2026-04-29 · updated 2026-05-19 · 3 min · 463 words

A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models

2026-04-29 · updated 2026-05-19 · 2 min · 293 words

A Superb-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

2026-04-29 · updated 2026-05-19 · 3 min · 507 words

A Task-Aware Dual-Level Self-Supervised Learning Method for Effective Sound Event Detection

2026-04-29 · updated 2026-05-19 · 2 min · 308 words

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

2026-04-29 · updated 2026-05-19 · 2 min · 387 words

A Unified SVD-Modal Solution for Sparse Sound Field Reconstruction with Hybrid Spherical-Linear Microphone Arrays

2026-04-29 · updated 2026-05-19 · 2 min · 264 words

A Unsupervised Domain Adaptation Framework For Semi-Supervised Melody Extraction Using Confidence Matrix Replace and Nearest Neighbour Supervision

2026-04-29 · updated 2026-05-19 · 2 min · 307 words

ACAVCaps: Enabling Large-Scale Training for Fine-Grained and Diverse Audio Understanding

2026-04-29 · updated 2026-05-19 · 2 min · 268 words

Accelerating Regularized Attention Kernel Regression for Spectrum Cartography

2026-04-29 · updated 2026-05-19 · 2 min · 312 words

AccLID: Accent-aware Language Identification for Robust Multilingual Speech Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 417 words

ACIR-MACL: Effective Multimodal Sentiment Analysis via Attention-Based Causal Intervention Regularization and Multi-Aspect Contrastive Learning

2026-04-29 · updated 2026-05-19 · 2 min · 399 words

Acoustic and Facial Markers of Perceived Conversational Success in Spontaneous Speech

2026-04-29 · updated 2026-05-19 · 2 min · 253 words

Acoustic Feedback Cancellation in Hearing Aids Exploiting an Inertial Sensor

2026-04-29 · updated 2026-05-19 · 2 min · 296 words

Acoustic Non-Stationarity Objective Assessment with Hard Label Criteria for Supervised Learning Models

2026-04-29 · updated 2026-05-19 · 2 min · 253 words

Acoustic Teleportation Via Disentangled Neural Audio Codec Representations

2026-04-29 · updated 2026-05-19 · 2 min · 313 words

Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 330 words

Adaptive Deterministic Flow Matching for Target Speaker Extraction

2026-04-29 · updated 2026-05-19 · 2 min · 383 words

Adaptive Embedding Fusion with Contrastive Learning for Robust Fully Few-Shot Class-Incremental Audio Classification

2026-04-29 · updated 2026-05-19 · 2 min · 378 words

Adaptive Per-Channel Energy Normalization Front-End for Robust Audio Signal Processing

2026-04-29 · updated 2026-05-19 · 2 min · 266 words

Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios

2026-04-29 · updated 2026-05-19 · 2 min · 303 words

Adaptive Spectral Weighting in Sagittal-Plane Sound Localization: A Reliability-Driven Approach

2026-04-29 · updated 2026-05-19 · 1 min · 193 words

Adaptive Task-Incremental Learning For Underwater Acoustic Recognition Based on Mixture-of-Experts Adapter

2026-04-29 · updated 2026-05-19 · 2 min · 318 words

Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

2026-04-29 · updated 2026-05-19 · 2 min · 261 words

ADH-VA: Adaptive Directed-Hypergraph Convolution with VA Contrastive Learning for Multimodal Conversational Emotion Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 401 words

Advanced modeling of interlanguage speech intelligibility benefit with L1-L2 multi-task learning using differentiable K-means for accent-robust discrete token-based ASR

2026-04-29 · updated 2026-05-19 · 2 min · 367 words

Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training

2026-04-29 · updated 2026-05-19 · 2 min · 274 words

Advancing Semi-Supervised Child Speech Recognition with Omni-Temporal Classification under Label Noise

2026-04-29 · updated 2026-05-19 · 2 min · 397 words

Advancing Speech Summarization in Multi-Modal LLMs with Reinforcement Learning

2026-04-29 · updated 2026-05-19 · 2 min · 278 words

Advancing Speech Understanding in Speech-Aware Language Models with GRPO

2026-04-29 · updated 2026-05-19 · 2 min · 359 words

Adversarial Defense via Generative Speech Enhancement Module

2026-04-29 · updated 2026-05-19 · 2 min · 311 words

Adversarial Fine-Tuning on Speech Foundation Model with Vulnerable Attention Consistency Regularization for Robust Speech Recognition

2026-04-29 · updated 2026-05-19 · 3 min · 457 words

Adversarial Rivalry Learning for Music Classification

2026-04-29 · updated 2026-05-19 · 3 min · 476 words

Affect-Jigsaw: Integrating Core and Peripheral Emotions for Harmonious Fine-Grained Multimodal Emotion Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 325 words

AFT: An Exemplar-Free Class Incremental Learning Method for Environmental Sound Classification

2026-04-29 · updated 2026-05-19 · 2 min · 344 words

AI-Generated Music Detection in Broadcast Monitoring

2026-04-29 · updated 2026-05-19 · 2 min · 235 words

Ailive Mixer: A Deep Learning Based Zero Latency Automatic Music Mixer for Live Music Performances

2026-04-29 · updated 2026-05-19 · 1 min · 197 words

AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines

2026-04-29 · updated 2026-05-19 · 2 min · 381 words

Aligning Generative Speech Enhancement with Perceptual Feedback

2026-04-29 · updated 2026-05-19 · 3 min · 481 words

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

2026-04-29 · updated 2026-05-19 · 2 min · 296 words

ALMA-Chor: Leveraging Audio-Lyric Alignment with Mamba for Chorus Detection

2026-04-29 · updated 2026-05-19 · 2 min · 298 words

AMBER2: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text

2026-04-29 · updated 2026-05-19 · 3 min · 533 words

AmbiDrop: Array-Agnostic Speech Enhancement Using Ambisonics Encoding and Dropout-Based Learning

2026-04-29 · updated 2026-05-19 · 1 min · 108 words

AMBISONIC-DML: A Benchmark Dataset for Dynamic Higher-Order Ambisonics Music with Motion-Aligned Stems

2026-04-29 · updated 2026-05-19 · 2 min · 322 words

An Anomaly-Aware and Audio-Enhanced Dual-Pathway Framework for Alzheimer’s Disease Progression Classification

2026-04-29 · updated 2026-05-19 · 2 min · 336 words

An Audio-Visual Speech Separation Network with Joint Cross-Attention and Iterative Modeling

2026-04-29 · updated 2026-05-19 · 2 min · 358 words

An Efficient Neural Network for Modeling Human Auditory Neurograms for Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 300 words

An End-to-End Multimodal System for Subtitle Recognition and Chinese-Japanese Translation in Short Dramas

2026-04-29 · Updated 2026-05-19 · 2 min · 269 words

An Envelope Separation Aided Multi-Task Learning Model for Blind Source Counting and Localization

2026-04-29 · Updated 2026-05-19 · 2 min · 262 words

An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization

2026-04-29 · Updated 2026-05-19 · 2 min · 263 words

An Unsupervised Alignment Feature Fusion System for Spoken Language-Based Dementia Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 316 words

Aneural Forward Filtering for Speaker-Image Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 251 words

AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

AnyAccomp: Generalizable Accompaniment Generation Via Quantized Melodic Bottleneck

2026-04-29 · Updated 2026-05-19 · 2 min · 370 words

AnyRIR: Robust Non-Intrusive Room Impulse Response Estimation in the Wild

2026-04-29 · Updated 2026-05-19 · 2 min · 296 words

APKD: Aligned And Paced Knowledge Distillation Towards Lightweight Heterogeneous Multimodal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 265 words

AQUA-Bench: Beyond finding answers to knowing when there are None in Audio Question Answering

2026-04-29 · Updated 2026-05-19 · 2 min · 356 words

AR-BSNet: Towards Ultra-Low Complexity Autoregressive Target Speaker Extraction With Band-Split Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 364 words

AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

2026-04-29 · Updated 2026-05-19 · 2 min · 323 words

Ara-BEST-RQ: Multi Dialectal Arabic SSL

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

Arbitrarily Settable Frame Rate Neural Speech Codec with Content Adaptive Variable Length Segmentation

2026-04-29 · Updated 2026-05-19 · 2 min · 320 words

ARCHI-TTS: A Flow-Matching-Based Text-to-Speech Model with Self-Supervised Semantic Aligner and Accelerated Inference

2026-04-29 · Updated 2026-05-19 · 3 min · 528 words

Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

2026-04-29 · Updated 2026-05-19 · 2 min · 369 words

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

2026-04-29 · Updated 2026-05-19 · 2 min · 286 words

Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

2026-04-29 · Updated 2026-05-19 · 3 min · 520 words

Assessing the Impact of Speaker Identity in Speech Spoofing Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 260 words

Assessing The Perceptual Impact of Low-Altitude Aircraft Noise in Cities: An Auralization Framework Using Gaussian Beam Tracing

2026-04-29 · Updated 2026-05-19 · 2 min · 222 words

Asynchrony-Aware Decoupled Multimodal Control for Cued Speech Video Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 286 words

ATOM: Adaptive Token-Level Optimal Transport Mixup for Speech Translation

2026-04-29 · Updated 2026-05-19 · 2 min · 301 words

Atomic Norm Minimization Revisited: Progressive Atom Identification And Refinement

2026-04-29 · Updated 2026-05-19 · 2 min · 258 words

Attention-Based Encoder-Decoder Target-Speaker Voice Activity Detection for Robust Speaker Diarization

2026-04-29 · Updated 2026-05-19 · 3 min · 509 words

Attention-Weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied To Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 478 words

Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-text System

2026-04-29 · Updated 2026-05-19 · 2 min · 412 words

Attentive AV-Fusionnet: Audio-Visual Quality Prediction with Hybrid Attention

2026-04-29 · Updated 2026-05-19 · 2 min · 334 words

Attentive Masked Self-Distillation for Respiratory Sound Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

2026-04-29 · Updated 2026-05-19 · 3 min · 450 words

Audience-Aware Co-speech Gesture Generation in Public Speaking via Anticipation Tokens

2026-04-29 · Updated 2026-05-19 · 2 min · 274 words

Audio Classification Models are Vulnerable to Filter Perturbations

2026-04-29 · Updated 2026-05-19 · 1 min · 199 words

Audio Deepfake Detection at the First Greeting: “Hi!”

2026-04-29 · Updated 2026-05-19 · 2 min · 315 words

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

2026-04-29 · Updated 2026-05-19 · 2 min · 298 words

Audio-Guided Multimodal Approach for Fine-Grained Alignment and Boundary Modeling in Active Speaker Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 270 words

Audio-Text Jailbreak Attack on Large Audio-Language Models: Towards Generality and Stealthiness

2026-04-29 · Updated 2026-05-19 · 2 min · 264 words

Audio-to-Score Jazz Solo Transcription with the Rhythm Perceiver

2026-04-29 · Updated 2026-05-19 · 2 min · 282 words

Audio-Visual Deepfake Generation and Detection: An Exploratory Survey

2026-04-29 · Updated 2026-05-19 · 1 min · 176 words

Audio-Visual Feature Fusion for Calibrating Relevance Scores of Video Moment Retrieval

2026-04-29 · Updated 2026-05-19 · 2 min · 346 words

AUDIOCARDS: Structured Metadata Improves Audio Language Models for Sound Design

2026-04-29 · Updated 2026-05-19 · 2 min · 257 words

AudioFuse: Unified Spectral-Temporal Learning Via A Hybrid ViT-1D CNN Architecture for Phonocardiogram Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 293 words

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 412 words

AUDIOGENIE-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning

2026-04-29 · Updated 2026-05-19 · 3 min · 468 words

Auditory Illusion Benchmark for Large Audio Language Models

2026-04-29 · Updated 2026-05-19 · 1 min · 196 words

Auditory-Inspired Transformer for Binaural Speech Enhancement and Spatial Cue Preservation

2026-04-29 · Updated 2026-05-19 · 2 min · 271 words

AURA: A Stegaformer-Based Scalable Deep Audio Watermark with Extreme Robustness

2026-04-29 · Updated 2026-05-19 · 2 min · 344 words

Auto-MatchCut: An Audio-Visual Retrieval Framework for Seamless Match Cutting

2026-04-29 · Updated 2026-05-19 · 2 min · 361 words

Automated Dysphagia Screening Using Noninvasive Neck Acoustic Sensing

2026-04-29 · Updated 2026-05-19 · 2 min · 376 words

Automatic Estimation of Speaker Diarization Error Rate Based on Features of Audio Quality and Speaker Discriminability

2026-04-29 · Updated 2026-05-19 · 2 min · 270 words

Automatic Music Mixing Using a Generative Model of Effect Embeddings

2026-04-29 · Updated 2026-05-19 · 2 min · 352 words

Automatic Music Sample Identification with Multi-Track Contrastive Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 412 words

AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

2026-04-29 · Updated 2026-05-19 · 2 min · 374 words

Auxiliary Multi-Label Training For Improving the Robustness of Audio Deepfake Detection on AI-Processed Data

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

AVATAR: Audio-Visual Adaptive Fusion via Trained Agent Reinforcement for Multimodal Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

AVO-65: A Large-Scale Hierarchical Audio-Visual Object Dataset

2026-04-29 · Updated 2026-05-19 · 2 min · 318 words

B-GRPO: Unsupervised Speech Emotion Recognition Based on Batched-Group Relative Policy Optimization

2026-04-29 · Updated 2026-05-19 · 2 min · 393 words

BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on POP and Classical Music

2026-04-29 · Updated 2026-05-19 · 2 min · 318 words

Bayesian Low-Rank Factorization for Robust Model Adaptation

2026-04-29 · Updated 2026-05-19 · 2 min · 260 words

Bayesian Signal Separation Via Plug-and-Play Diffusion-Within-Gibbs Sampling

2026-04-29 · Updated 2026-05-19 · 2 min · 303 words

BBPE16: UTF-16-Based Byte-Level Byte-Pair Encoding for Improved Multilingual Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 310 words

Beamforming Using Virtual Microphones for Hearing Aid Applications

2026-04-29 · Updated 2026-05-19 · 1 min · 210 words

Beat and Downbeat Detection: A Reformulated Approach

2026-04-29 · Updated 2026-05-19 · 2 min · 306 words

BeatMamba: Bidirectional Selective State-Space Modeling for Efficient Beat Tracking

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

Behind the Scenes: Mechanistic Interpretability of LoRA-Adapted Whisper for Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 233 words

Benchmarking Humans And Machines On Complex Multilingual Speech Understanding Tasks

2026-04-29 · Updated 2026-05-19 · 2 min · 262 words

Benchmarking Music Autotagging with MGPHot Expert Annotations vs. Generic Tag Datasets

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

BEST-RQ-based Self-Supervised Learning for Whisper Domain Adaptation

2026-04-29 · Updated 2026-05-19 · 2 min · 320 words

BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

2026-04-29 · Updated 2026-05-19 · 4 min · 650 words

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 389 words

Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

2026-04-29 · Updated 2026-05-19 · 2 min · 333 words

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 325 words

Beyond Mapping: Domain-Invariant Representations via Spectral Embedding of Optimal Transport Plans

2026-04-29 · Updated 2026-05-19 · 3 min · 446 words

Bimodal Fusion Framework for Dynamic Facial Expression Recognition In-The-Wild

2026-04-29 · Updated 2026-05-19 · 2 min · 329 words

BioSEN: A Bio-Acoustic Signal Enhancement Network for Animal Vocalizations

2026-04-29 · Updated 2026-05-19 · 2 min · 395 words

BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 415 words

Bleed No More: Generative Interference Reduction for Musical Recordings

2026-04-29 · Updated 2026-05-19 · 3 min · 600 words

Bloodroot: When Watermarking Turns Poisonous for Stealthy Backdoor

2026-04-29 · Updated 2026-05-19 · 2 min · 230 words

Bone-Conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models

2026-04-29 · Updated 2026-05-19 · 3 min · 448 words

Brainprint-Modulated Target Speaker Extraction

2026-04-29 · Updated 2026-05-19 · 2 min · 320 words

Break-the-Beat! Controllable MIDI-to-Drum audio synthesis

2026-04-29 · Updated 2026-05-19 · 3 min · 440 words

BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 344 words

Bridging the Front-End and Back-End for Robust ASR via Cross-Attention-Based U-Net

2026-04-29 · Updated 2026-05-19 · 2 min · 255 words

Bridging the Measurement–Simulation Gap in Room Acoustics with Real2sim Diffusion

2026-04-29 · Updated 2026-05-19 · 2 min · 276 words

Bridging the Semantic Gap: Cross-Attentive Fusion for Joint Acoustic-Semantic Speech Quality Assessment

2026-04-29 · Updated 2026-05-19 · 2 min · 404 words

BSMP-SENet: Band-Split Magnitude-Phase Network for Speech Enhancement

2026-04-29 · Updated 2026-05-19 · 2 min · 301 words

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

2026-04-29 · Updated 2026-05-19 · 3 min · 520 words

CaMoD: Causal-Aware Modality Denoising for Multimodal Dialogue Intent Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 238 words

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

2026-04-29 · Updated 2026-05-19 · 2 min · 309 words

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs

2026-04-29 · Updated 2026-05-19 · 2 min · 333 words

Caption and Audio-Guided Video Representation Learning with Gated Attention for Partially Relevant Video Retrieval

2026-04-29 · Updated 2026-05-19 · 2 min · 344 words

Cardiobridge-DM: Bridging Cross-Cohort Heart Sound Synthesis via Rhythm-Aware Semi-Supervised Diffusion

2026-04-29 · Updated 2026-05-19 · 2 min · 309 words

CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

2026-04-29 · Updated 2026-05-19 · 2 min · 216 words

CCST: Cross-Modal and Consistency-Aware Self-Training for Source-Free Unsupervised Domain Adaptation in Speech Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 486 words

Chunk-Wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

2026-04-29 · Updated 2026-05-19 · 2 min · 303 words

Chunkwise Aligners for Streaming Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 329 words

Class-Aware Permutation-Invariant Signal-to-Distortion Ratio for Semantic Segmentation of Sound Scene with Same-Class Sources

2026-04-29 · Updated 2026-05-19 · 2 min · 252 words

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

2026-04-29 · Updated 2026-05-19 · 3 min · 596 words

Clue2Emo: A Brain-Inspired Framework for Open-Vocabulary Multimodal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 441 words

CMSA-Mamba: Hierarchical State Space Modeling for Audio-Based Depression Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 288 words

Co-Initialization of Control Filter and Secondary Path via Meta-Learning for Active Noise Control

2026-04-29 · Updated 2026-05-19 · 2 min · 290 words

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

2026-04-29 · Updated 2026-05-19 · 2 min · 251 words

CodeSep: Low-Bitrate Codec-Driven Speech Separation with Base-Token Disentanglement and Auxiliary-Token Serial Prediction

2026-04-29 · Updated 2026-05-19 · 2 min · 351 words

Combining Multi-Order Attention and Multi-Resolution Discriminator for High-Fidelity Neural Vocoder

2026-04-29 · Updated 2026-05-19 · 3 min · 487 words

Combining SSL Speech Features, Contextual Transformers and Mamba Models for Realistic Audio Spoofing Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 352 words

Compression meets Sampling: LZ78-SPA for Efficient Symbolic Music Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 396 words

CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-Spoofing Countermeasures

2026-04-29 · Updated 2026-05-19 · 2 min · 411 words

Condition-Invariant fMRI decoding of speech intelligibility with deep state space model

2026-04-29 · Updated 2026-05-19 · 3 min · 448 words

Conditional Diffusion Models for Mental Health-Preserving Voice Conversion

2026-04-29 · Updated 2026-05-19 · 2 min · 246 words

Confidence-Based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

Confidence-Guided Error Correction for Disordered Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 425 words

Connecting Layer-Wise Representation of WavLM with Spectro-Temporal Modulation on Speaker Verification

2026-04-29 · Updated 2026-05-19 · 2 min · 214 words

Constraint Optimized Multichannel Mixer-Limiter Design

2026-04-29 · Updated 2026-05-19 · 2 min · 370 words

Constructing Composite Features for Interpretable Music-Tagging

2026-04-29 · Updated 2026-05-19 · 2 min · 306 words

Content Anonymization for Privacy in Long-Form Audio

2026-04-29 · Updated 2026-05-19 · 2 min · 237 words

Content Leakage in LibriSpeech and its Impact on the Privacy Evaluation of Speaker Anonymization

2026-04-29 · Updated 2026-05-19 · 1 min · 192 words

Content-Preserving Speech Representation Learning Via Adaptive Segment-Level Alignment

2026-04-29 · Updated 2026-05-19 · 3 min · 434 words

Context-Aware Dynamic Graph Learning for Multimodal Emotion Recognition with Missing Modalities

2026-04-29 · Updated 2026-05-19 · 2 min · 367 words

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

2026-04-29 · Updated 2026-05-19 · 3 min · 492 words

Continuation Method for Feedback Delay Network Modal Decomposition

2026-04-29 · Updated 2026-05-19 · 1 min · 184 words

Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

2026-04-29 · Updated 2026-05-19 · 3 min · 454 words

Contrastive Timbre Representations for Musical Instrument And Synthesizer Retrieval

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

Controllable Embedding Transformation for Mood-Guided Music Retrieval

2026-04-29 · Updated 2026-05-19 · 2 min · 347 words

Cooperative Multi-Agent Reinforcement Learning for Adaptive Aggregation in Semi-Supervised Federated Learning with non-IID Data

2026-04-29 · Updated 2026-05-19 · 2 min · 275 words

CosyAccent: Duration-Controllable Accent Normalization using Source-Synthesis Training Data

2026-04-29 · Updated 2026-05-19 · 2 min · 246 words

Coupling Acoustic Geometry and Visual Semantics for Robust Depth Estimation

2026-04-29 · Updated 2026-05-19 · 4 min · 742 words

CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content

2026-04-29 · Updated 2026-05-19 · 2 min · 345 words

Cross-Architecture Knowledge Distillation of WavLM for Lightweight Speaker Verification

2026-04-29 · Updated 2026-05-19 · 2 min · 376 words

Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

2026-04-29 · Updated 2026-05-19 · 2 min · 256 words

Cross-Domain Contrastive Learning with Dynamic Threshold Calibration for Source Speaker Tracing

2026-04-29 · Updated 2026-05-19 · 2 min · 298 words

Cross-Lingual Alzheimer’s Disease Detection with Multimodal LLMs via Speech Cue-Augmented Prompting and Instruction Tuning

2026-04-29 · Updated 2026-05-19 · 3 min · 479 words

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 3 min · 428 words

Cross-Lingual Interleaving for Speech Language Models

2026-04-29 · Updated 2026-05-19 · 3 min · 507 words

Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh

2026-04-29 · Updated 2026-05-19 · 1 min · 22 words

Cross-Modal Bottleneck Fusion for Noise Robust Audio-Visual Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 289 words

Cross-Modal Knowledge Distillation for Speech Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 371 words

CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications

2026-04-29 · Updated 2026-05-19 · 2 min · 237 words

Curriculum Learning with Contrastive Loss for Lightweight Speaker Verification

2026-04-29 · Updated 2026-05-19 · 3 min · 428 words

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

2026-04-29 · Updated 2026-05-19 · 3 min · 458 words

D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation from Lead Sheet

2026-04-29 · Updated 2026-05-19 · 2 min · 305 words

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 408 words

DAMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

2026-04-29 · Updated 2026-05-19 · 3 min · 446 words

DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network

2026-04-29 · Updated 2026-05-19 · 2 min · 381 words

DBFT-SD: Weakly Supervised Multimodal Detection of Sensitive Audio-Visual Content

2026-04-29 · Updated 2026-05-19 · 2 min · 215 words

DDSC: Dynamic Dual-Signal Curriculum for Data-Efficient Acoustic Scene Classification Under Domain Shift

2026-04-29 · Updated 2026-05-19 · 2 min · 355 words

DDSR-Net: Robust Multimodal Sentiment Analysis via Dynamic Modality Reliability Assessment

2026-04-29 · Updated 2026-05-19 · 5 min · 864 words

DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG

2026-04-29 · Updated 2026-05-19 · 2 min · 221 words

Decoder-Only Conformer with Modality-Aware Sparse Mixtures of Experts for ASR

2026-04-29 · Updated 2026-05-19 · 2 min · 379 words

Decorrelation-Enhanced Multiband Subband Adaptive Filtering for RIR Tracking in Sound Field Control

2026-04-29 · Updated 2026-05-19 · 2 min · 299 words

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS

2026-04-29 · Updated 2026-05-19 · 2 min · 265 words

Deep Learning-Based Joint Optimization of Adaptive Feedback Cancellation and Residual Feedback Suppression for Hearing Aids

2026-04-29 · Updated 2026-05-19 · 2 min · 366 words

Deep Spatial Clue Informed Ambisonic Encoding for Irregular Microphone Arrays

2026-04-29 · Updated 2026-05-19 · 3 min · 478 words

Deepaq: A Perceptual Audio Quality Metric Based on Foundational Models and Weakly Supervised Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 400 words

Denoising Of Stochastic Ray Tracing Room Impulse Responses

2026-04-29 · Updated 2026-05-19 · 2 min · 360 words

DepthTalk: Few-Shot Talking Head Generation with Depth-Aware 3D Gaussian Field Motion

2026-04-29 · Updated 2026-05-19 · 2 min · 238 words

Detecting and Attributing Synthetic Spanish Speech: The HISPASpoof Dataset

2026-04-29 · Updated 2026-05-19 · 2 min · 325 words

DGSDNet: Dual-Graph Spectral Diffusion Network for Incomplete Multimodal Emotion Recognition in Conversations

2026-04-29 · Updated 2026-05-19 · 3 min · 438 words

Diff-vs: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

Diffemotalk: Audio-Driven Facial Animation with Fine-Grained Emotion Control via Diffusion Models

2026-04-29 · Updated 2026-05-19 · 2 min · 317 words

Differentiable Grouped Feedback Delay Networks for Learning Direction and Position-Dependent Late Reverberation

2026-04-29 · Updated 2026-05-19 · 2 min · 340 words

Differentiable Pulsetable Synthesis for Wind Instrument Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 297 words

Diffusion Timbre Transfer via Mutual Information Guided Inpainting

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

Direct Preference Optimization For Speech Autoregressive Diffusion Models

2026-04-29 · Updated 2026-05-19 · 2 min · 347 words

Direct Simultaneous Translation Activation for Large Audio-Language Models

2026-04-29 · Updated 2026-05-19 · 3 min · 465 words

Direct Transfer of Prosody in Speech-to-speech Translation using Disentangled Speech Tokens

2026-04-29 · Updated 2026-05-19 · 3 min · 523 words

Directly Trained Spiking Neural Networks with Adaptive Phase Coding

2026-04-29 · Updated 2026-05-19 · 1 min · 206 words

DisContSE: Single-Step Diffusion Speech Enhancement based on Joint Discrete and Continuous Embeddings

2026-04-29 · Updated 2026-05-19 · 3 min · 431 words

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens

2026-04-29 · Updated 2026-05-19 · 2 min · 392 words

Discrete-Continuous Fusion With Adaptive Hierarchical Features For Audio Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 304 words

Disentangled Authenticity Representation for Partially Deepfake Audio Localization

2026-04-29 · Updated 2026-05-19 · 2 min · 316 words

Disentangling Physiology from Fidelity: Latent-Guided Diffusion Models for Cross-Modal Cardiac Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 313 words

Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

DISSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration

2026-04-29 · Updated 2026-05-19 · 3 min · 431 words

Distilling Attention Knowledge for Speaker Verification

2026-04-29 · Updated 2026-05-19 · 3 min · 462 words

Distributed Multichannel Active Noise Control with Asynchronous Communication

2026-04-29 · Updated 2026-05-19 · 2 min · 216 words

DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

2026-04-29 · Updated 2026-05-19 · 3 min · 513 words

DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

2026-04-29 · Updated 2026-05-19 · 2 min · 334 words

Diverse and Few-Step Audio Captioning via Flow Matching

2026-04-29 · Updated 2026-05-19 · 2 min · 361 words

DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance

2026-04-29 · Updated 2026-05-19 · 2 min · 399 words

Do Bias Benchmarks Generalise? Evidence from Voice-Based Evaluation of Gender Bias in SpeechLLMs

2026-04-29 · Updated 2026-05-19 · 2 min · 306 words

Do Foundational Audio Encoders Understand Music Structure?

2026-04-29 · Updated 2026-05-19 · 2 min · 251 words

Do Speech LLMs Learn Crossmodal Embedding Spaces?

2026-04-29 · Updated 2026-05-19 · 1 min · 213 words

Do We Need EMA for Diffusion-Based Speech Enhancement? Toward A Magnitude-Preserving Network Architecture

2026-04-29 · Updated 2026-05-19 · 3 min · 476 words

Do we really need self-attention for streaming automatic speech recognition?

2026-04-29 · Updated 2026-05-19 · 2 min · 341 words

Do You Hear What I Mean? Quantifying the Instruction-Perception GAP in Instruction-Guided Expressive Text-to-Speech Systems

2026-04-29 · Updated 2026-05-19 · 2 min · 224 words

Does the Pre-Training of an Embedding Influence its Encoding of Age?

2026-04-29 · Updated 2026-05-19 · 1 min · 169 words

DOMA: Leveraging Diffusion Language Models with Adaptive Prior for Intent Classification and Slot Filling

2026-04-29 · Updated 2026-05-19 · 3 min · 427 words

Domain Partitioning Meets Parameter-Efficient Fine-Tuning: A Novel Method for Improved Language-Queried Audio Source Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 376 words

Domain-Aware Scheduling for ASR Fine-Tuning

2026-04-29 · Updated 2026-05-19 · 2 min · 269 words

Domain-Invariant Representation Learning of Bird Sounds

2026-04-29 · Updated 2026-05-19 · 2 min · 412 words

DPO-Regularized Regression for Age Prediction

2026-04-29 · Updated 2026-05-19 · 2 min · 236 words

DPT-Net: Dual-Path Transformer Network with Hierarchical Fusion for EEG-based Envelope Reconstruction

2026-04-29 · Updated 2026-05-19 · 2 min · 363 words

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

DSRMS-TransUnet: A Decentralized Non-Shifted Transunet for Shallow Water Acoustic Source Range Estimation

2026-04-29 · Updated 2026-05-19 · 2 min · 294 words

DSSR: Decoupling Salient and Subtle Representations Under Missing Modalities for Multimodal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 363 words

Dual Contrastive Learning for Semi-Supervised Domain Adaptation in Bi-Modal Depression Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 332 words

Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

2026-04-29 · Updated 2026-05-19 · 2 min · 405 words

Dual-Perspective Multimodal Sentiment Analysis with MoE Fusion: Representation Learning via Semantic Resonance and Divergence

2026-04-29 · Updated 2026-05-19 · 3 min · 434 words

Dual-Strategy-Enhanced Conbimamba for Neural Speaker Diarization

2026-04-29 · Updated 2026-05-19 · 2 min · 367 words

Dynamic Balanced Cross-Modal Attention with Gated Sequence Restoration: Towards Robust Multimodal Sentiment Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 233 words

Dynamic Noise-Aware Multi LoRA Framework Towards Real-World Audio Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 294 words

Dynamic Spectrogram Analysis with Local-Aware Graph Networks for Audio Anti-Spoofing

2026-04-29 · Updated 2026-05-19 · 2 min · 333 words

Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training

2026-04-29 · Updated 2026-05-19 · 2 min · 244 words

E2E-AEC: Implementing An End-To-End Neural Network Learning Approach for Acoustic Echo Cancellation

2026-04-29 · Updated 2026-05-19 · 2 min · 368 words

Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

2026-04-29 · Updated 2026-05-19 · 2 min · 324 words

ECHO: Frequency-Aware Hierarchical Encoding for Variable-Length Signals

2026-04-29 · Updated 2026-05-19 · 2 min · 340 words

EchoFake: A Replay-Aware Dataset For Practical Speech Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 393 words

EchoRAG: A Two-Stage Framework for Audio-Text Retrieval and Temporal Grounding

2026-04-29 · Updated 2026-05-19 · 2 min · 308 words

ECSA: Dual-Branch Emotion Compensation for Emotion-Consistent Speaker Anonymization

2026-04-29 · Updated 2026-05-19 · 2 min · 404 words

EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting

2026-04-29 · Updated 2026-05-19 · 2 min · 277 words

EEG and Eye-Tracking Driven Dynamic Target Speaker Extraction with Spontaneous Attention Switching

2026-04-29 · Updated 2026-05-19 · 2 min · 295 words

EEND-SAA: Enrollment-Less Main Speaker Voice Activity Detection Using Self-Attention Attractors

2026-04-29 · Updated 2026-05-19 · 2 min · 396 words

Efficient Audio-Visual Inference Via Token Clustering And Modality Fusion

2026-04-29 · Updated 2026-05-19 · 2 min · 306 words

Efficient Depression Detection from Speech via Language-Independent Prompt-Driven Reprogramming

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

Efficient Solutions for Mitigating Initialization Bias in Unsupervised Self-Adaptive Auditory Attention Decoding

2026-04-29 · Updated 2026-05-19 · 2 min · 261 words

EMG-to-Speech with Fewer Channels

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 391 words

Emo-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 486 words

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

2026-04-29 · Updated 2026-05-19 · 2 min · 274 words

EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 296 words

Emotion-Aligned Generation in Diffusion Text to Speech Models Via Preference-Guided Optimization

2026-04-29 · Updated 2026-05-19 · 2 min · 402 words

Emotional Damage: Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

2026-04-29 · Updated 2026-05-19 · 2 min · 230 words

Emotional Dimension Control in Language Model-Based Text-To-Speech: Spanning a Broad Spectrum of Human Emotions

2026-04-29 · Updated 2026-05-19 · 1 min · 186 words

EmoTri-RL: Emotion- and Cause-Aware Reinforcement Learning for Multi-Modal Empathetic Dialogue

2026-04-29 · Updated 2026-05-19 · 2 min · 332 words

Empowering Multimodal Respiratory Sound Classification with Counterfactual Adversarial Debiasing for Out-of-Distribution Robustness

2026-04-29 · Updated 2026-05-19 · 2 min · 408 words

Enabling Multi-Species Bird Classification on Low-Power Bioacoustic Loggers

2026-04-29 · Updated 2026-05-19 · 2 min · 294 words

Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

2026-04-29 · Updated 2026-05-19 · 2 min · 363 words

Enhanced Generative Machine Listener

2026-04-29 · Updated 2026-05-19 · 2 min · 256 words

Enhancing Audio Question-Answering Performance Through Log-Likelihood Guided Reward Functions

2026-04-29 · Updated 2026-05-19 · 2 min · 367 words

Enhancing Automatic Drum Transcription with Online Dynamic Few-Shot Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 245 words

Enhancing Dialogue-Related Speech Tasks with Generated Spoken Dialogues

2026-04-29 · Updated 2026-05-19 · 2 min · 291 words

Enhancing Noise Robustness for Neural Speech Codecs Through Resource-Efficient Progressive Quantization Perturbation Simulation

2026-04-29 · Updated 2026-05-19 · 1 min · 178 words

Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation Guided Structured Pruning

2026-04-29 · Updated 2026-05-19 · 3 min · 443 words

Enhancing Speech Intelligibility Prediction for Hearing Aids with Complementary Speech Foundation Model Representations

2026-04-29 · updated 2026-05-19 · 2 min · 303 words

Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec

2026-04-29 · updated 2026-05-19 · 1 min · 179 words

Equipping Large Language Model with Directional Speech Understanding Capabilities

2026-04-29 · updated 2026-05-19 · 2 min · 249 words

Erasing Your Voice Before it’s Heard: Training-Free Speaker Unlearning for Zero-Shot Text-to-Speech

2026-04-29 · updated 2026-05-19 · 2 min · 384 words

Estimating Hand-Related Features from Speech Using Machine Learning

2026-04-29 · updated 2026-05-19 · 2 min · 226 words

Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening

2026-04-29 · updated 2026-05-19 · 2 min · 223 words

Etude: Piano Cover Generation with a Three-Stage Approach — Extract, Structuralize, and Decode

2026-04-29 · updated 2026-05-19 · 2 min · 421 words

EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

2026-04-29 · updated 2026-05-19 · 3 min · 437 words

Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

2026-04-29 · updated 2026-05-19 · 2 min · 313 words

Evaluating Compositional Structure in Audio Representations

2026-04-29 · updated 2026-05-19 · 2 min · 324 words

Evaluating Disentangled Representations for Controllable Music Generation

2026-04-29 · updated 2026-05-19 · 2 min · 289 words

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

2026-04-29 · updated 2026-05-19 · 2 min · 240 words

Evaluating High-Resolution Piano Sustain Pedal Depth Estimation with Musically Informed Metrics

2026-04-29 · updated 2026-05-19 · 2 min · 351 words

Evaluating Pretrained Speech Embedding Systems for Dysarthria Detection Across Heterogenous Datasets

2026-04-29 · updated 2026-05-19 · 2 min · 249 words

Event Classification by Physics-Informed Inpainting for Distributed Multichannel Acoustic Sensor with Partially Degraded Channels

2026-04-29 · updated 2026-05-19 · 2 min · 230 words

Exploring Fine-Tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data

2026-04-29 · updated 2026-05-19 · 2 min · 375 words

Exploring How Audio Effects Alter Emotion with Foundation Models

2026-04-29 · updated 2026-05-19 · 2 min · 220 words

Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement

2026-04-29 · updated 2026-05-19 · 3 min · 572 words

Exploring SSL Discrete Tokens for Multilingual Automatic Speech Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 341 words

Expressive Voice Conversion with Controllable Emotional Intensity

2026-04-29 · updated 2026-05-19 · 2 min · 387 words

Exterior Sound Field Estimation Based on Physics-Constrained Kernel

2026-04-29 · updated 2026-05-19 · 1 min · 199 words

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

2026-04-29 · updated 2026-05-19 · 2 min · 297 words

Face-Voice Association with Inductive Bias for Maximum Class Separation

2026-04-29 · updated 2026-05-19 · 2 min · 382 words

Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

2026-04-29 · updated 2026-05-19 · 2 min · 418 words

Fast-ULCNet: A Fast and Ultra Low Complexity Network for Single-Channel Speech Enhancement

2026-04-29 · updated 2026-05-19 · 2 min · 265 words

FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference

2026-04-29 · updated 2026-05-19 · 2 min · 297 words

FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement

2026-04-29 · updated 2026-05-19 · 2 min · 421 words

FD-ARL: Feature Disentanglement with Adversarial-Reconstruction Learning for Cross-Subject Auditory Attention Decoding

2026-04-29 · updated 2026-05-19 · 2 min · 338 words

FDCNet: Frequency Domain Channel Attention and Convolution for Lipreading

2026-04-29 · updated 2026-05-19 · 2 min · 265 words

FED-PISA: Federated Voice Cloning Via Personalized Identity-Style Adaptation

2026-04-29 · updated 2026-05-19 · 3 min · 442 words

Feedback-Driven Retrieval-Augmented Audio Generation with Large Audio Language Models

2026-04-29 · updated 2026-05-19 · 3 min · 431 words

Few-Shot Recognition of Audio Deepfake Generators using Graph-Based Prototype Adaptation

2026-04-29 · updated 2026-05-19 · 2 min · 307 words

FIDIC: Fine-Grained Conversational Emotion Recognition via Individual Differences in Inertia and Contagion

2026-04-29 · updated 2026-05-19 · 2 min · 234 words

Fine-Grained Frame Modeling in Multi-Head Self-Attention for Speech Deepfake Detection

2026-04-29 · updated 2026-05-19 · 2 min · 299 words

Fine-Tuning BigVGAN-V2 for Robust Musical Tuning Preservation

2026-04-29 · updated 2026-05-19 · 2 min · 252 words

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

2026-04-29 · updated 2026-05-19 · 4 min · 698 words

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

2026-04-29 · updated 2026-05-19 · 3 min · 568 words

FinHuBERT: Hierarchical Feature Imitating Networks for Low-Resource Speech Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 322 words

FlashFoley: Fast Interactive Sketch2audio Generation

2026-04-29 · updated 2026-05-19 · 2 min · 329 words

Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks

2026-04-29 · updated 2026-05-19 · 2 min · 303 words

Flexio: Flexible Single- and Multi-Channel Speech Separation and Enhancement

2026-04-29 · updated 2026-05-19 · 2 min · 381 words

FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning

2026-04-29 · updated 2026-05-19 · 2 min · 338 words

FOCA: Multimodal Malware Classification via Hyperbolic Cross-Attention

2026-04-29 · updated 2026-05-19 · 2 min · 373 words

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

2026-04-29 · updated 2026-05-19 · 3 min · 626 words

FODGE: High-Fidelity Dance Generation via Full-Body Optimization

2026-04-29 · updated 2026-05-19 · 2 min · 307 words

FoleyBench: A Benchmark for Video-to-Audio Models

2026-04-29 · updated 2026-05-19 · 2 min · 297 words

Forward Convolutive Prediction for Frame Online Monaural Speech Dereverberation based on Kronecker Product Decomposition

2026-04-29 · updated 2026-05-19 · 2 min · 338 words

Frame-Stacked Local Transformers for Efficient Multi-Codebook Speech Generation

2026-04-29 · updated 2026-05-19 · 2 min · 421 words

Frequency-Independent Ambisonics Upscaling Using Deep Learning

2026-04-29 · updated 2026-05-19 · 2 min · 243 words

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-Modal Understanding in Multimodal LLMs

2026-04-29 · updated 2026-05-19 · 2 min · 370 words

From Diet to Free Lunch: Estimating Auxiliary Signal Properties Using Dynamic Pruning Masks in Speech Enhancement Networks

2026-04-29 · updated 2026-05-19 · 2 min · 403 words

From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

2026-04-29 · updated 2026-05-19 · 2 min · 285 words

From Human Speech to Ocean Signals: Transferring Speech Large Models for Underwater Acoustic Target Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 285 words

Frontend Token Enhancement for Token-Based Speech Recognition

2026-04-29 · updated 2026-05-19 · 3 min · 460 words

Full Band Denoising of Room Impulse Response in the Wavelet Domain with Dictionary Learning

2026-04-29 · updated 2026-05-19 · 2 min · 227 words

FUN-SSL: Full-Band Layer Followed by U-Net With Narrow-Band Layers for Multiple Moving Sound Source Localization

2026-04-29 · updated 2026-05-19 · 2 min · 271 words

FUSEMOS: Perceptual Evaluation of Text-to-Music Generation with Dual-Encoder Fusion and Ranking-Aware Composite Loss

2026-04-29 · updated 2026-05-19 · 3 min · 506 words

Fusion of Multimodal Estimations by Extended State Hidden Markov Model: Application to Fetal Heart Rate Monitoring

2026-04-29 · updated 2026-05-19 · 2 min · 286 words

FxSearcher: Gradient-Free Text-Driven Audio Transformation

2026-04-29 · updated 2026-05-19 · 2 min · 359 words

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

2026-04-29 · updated 2026-05-19 · 2 min · 245 words

Gdiffuse: Diffusion-Based Speech Enhancement with Noise Model Guidance

2026-04-29 · updated 2026-05-19 · 3 min · 498 words

Gelina: Unified Speech and Gesture Synthesis Via Interleaved Token Prediction

2026-04-29 · updated 2026-05-19 · 3 min · 433 words

Gen-SER: When the Generative Model Meets Speech Emotion Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 255 words

Generalizability of Predictive and Generative Speech Enhancement Models to Pathological Speakers

2026-04-29 · updated 2026-05-19 · 3 min · 434 words

Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker

2026-04-29 · updated 2026-05-19 · 1 min · 202 words

Generating Moving 3D Soundscapes with Latent Diffusion Models

2026-04-29 · updated 2026-05-19 · 2 min · 257 words

Generative Audio Extension and Morphing

2026-04-29 · updated 2026-05-19 · 2 min · 318 words

Generative UI as an Accessibility Bridge: Lessons from C2C E-Commerce

2026-04-29 · updated 2026-05-19 · 2 min · 225 words

GLA-GRAD++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis

2026-04-29 · updated 2026-05-19 · 2 min · 333 words

GLAP: General Contrastive Audio-Text Pretraining Across Domains and Languages

2026-04-29 · updated 2026-05-19 · 3 min · 434 words

GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

2026-04-29 · updated 2026-05-19 · 3 min · 455 words

GLUE: Gradient-free Learning to Unify Experts

2026-04-29 · updated 2026-05-19 · 2 min · 315 words

GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

2026-04-29 · updated 2026-05-19 · 2 min · 354 words

Graph-Based Emotion Consensus Perception Learning for Multimodal Emotion Recognition in Conversation

2026-04-29 · updated 2026-05-19 · 2 min · 342 words

Graph-based Modality Alignment for Robustness in Conversational Emotion Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 363 words

Graph-Biased EEG Transformers for Silent Speech Decoding

2026-04-29 · updated 2026-05-19 · 2 min · 351 words

Grey-Box Prompt Tuning With Graph Alignment for Speech-Language Models

2026-04-29 · updated 2026-05-19 · 2 min · 357 words

GRNet: Graph Reconstruction Network for Robust Multimodal Sentiment Analysis

2026-04-29 · updated 2026-05-19 · 2 min · 323 words

Group Relative Policy Optimization for Text-to-Speech with Large Language Models

2026-04-29 · updated 2026-05-19 · 2 min · 347 words

Group-Sparse Gaussian Process Regression for Inhomogeneous Sound Field Estimation

2026-04-29 · updated 2026-05-19 · 2 min · 241 words

H-nnPBFDAF: Hierarchical Neural Network Partitioned Block Frequency Domain Adaptive Filter with Novel Block Activation Probability

2026-04-29 · updated 2026-05-19 · 2 min · 405 words

Hair Noise Analysis and Mitigation for Smart Glasses Audio Captures

2026-04-29 · updated 2026-05-19 · 2 min · 288 words

Hanui: Harnessing Distributional Discrepancies for Singing Voice Deepfake Detection

2026-04-29 · updated 2026-05-19 · 2 min · 264 words

HarmoNet: Music Grounding by Short Video via Harmonic Resample and Dynamic Sparse Alignment

2026-04-29 · updated 2026-05-19 · 2 min · 373 words

Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models

2026-04-29 · updated 2026-05-19 · 2 min · 268 words

HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection with Multichannel Audio and Multiscale Visual Cues

2026-04-29 · updated 2026-05-19 · 2 min · 415 words

HCGAN: Harmonic-Coupled Generative Adversarial Network for Speech Super-Resolution in Low-Bandwidth Scenarios

2026-04-29 · updated 2026-05-19 · 2 min · 301 words

HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-Based TTS

2026-04-29 · updated 2026-05-19 · 2 min · 312 words

HergNet: A Fast Neural Surrogate Model for Sound Field Predictions Via Superposition of Plane Waves

2026-04-29 · updated 2026-05-19 · 2 min · 259 words

HFSQVAE: Hierarchical Vector Quantization with Residuals for Frequency-Specific Embedding

2026-04-29 · updated 2026-05-19 · 2 min · 312 words

Hierarchical Activity Recognition and Captioning from Long-Form Audio

2026-04-29 · updated 2026-05-19 · 2 min · 410 words

Hierarchical Discrete Flow Matching For Multi-Codebook Codec-Based Text-To-Speech

2026-04-29 · updated 2026-05-19 · 2 min · 366 words

Hierarchical Tokenization of Multimodal Music Data for Generative Music Retrieval

2026-04-29 · updated 2026-05-19 · 2 min · 337 words

HiFi-HARP: A High-Fidelity 7th-Order Ambisonic Room Impulse Response Dataset

2026-04-29 · updated 2026-05-19 · 2 min · 297 words

High-Fidelity Speech Enhancement Via Discrete Audio Tokens

2026-04-29 · updated 2026-05-19 · 2 min · 322 words

How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-Resource Transfer

2026-04-29 · updated 2026-05-19 · 1 min · 162 words

How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

2026-04-29 · updated 2026-05-19 · 2 min · 243 words

Huí Sù: Co-constructing a Dual Feedback Apparatus

2026-04-29 · updated 2026-05-19 · 1 min · 149 words

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

2026-04-29 · updated 2026-05-19 · 2 min · 315 words

HVAC-EAR: Eavesdropping Human Speech Using HVAC Systems

2026-04-29 · updated 2026-05-19 · 2 min · 423 words

Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

2026-04-29 · updated 2026-05-19 · 2 min · 395 words

HyFlowSE: Hybrid End-To-End Flow-Matching Speech Enhancement via Generative-Discriminative Learning

2026-04-29 · updated 2026-05-19 · 2 min · 355 words

I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-Based Single-Channel Speech Enhancement

2026-04-29 · updated 2026-05-19 · 2 min · 370 words

IBPCodec: A Low-Bitrate Lightweight Speech Codec With Inter-Band Prediction

2026-04-29 · updated 2026-05-19 · 2 min · 357 words

ICASSP 2026 - Active Noise Control Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 145 words

ICASSP 2026 - Active Noise Cancellation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 132 words

ICASSP 2026 - Topic Modeling Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 86 words

ICASSP 2026 - Signal Processing Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 162 words

ICASSP 2026 - Keyword Spotting Paper List

2026-04-29 · updated 2026-05-19 · 4 min · 682 words

ICASSP 2026 - Medical AI Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 96 words

ICASSP 2026 - Auditory Attention Decoding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 208 words

ICASSP 2026 - Auditory Attention Decoding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 117 words

ICASSP 2026 - Noise Control Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 103 words

ICASSP 2026 - Echo Cancellation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 127 words

ICASSP 2026 - Benchmarking Paper List

2026-04-29 · updated 2026-05-19 · 4 min · 748 words

ICASSP 2026 - Fundamental Frequency Estimation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 106 words

ICASSP 2026 - Sound Field Estimation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 91 words

ICASSP 2026 - Acoustic Modeling Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 80 words

ICASSP 2026 - Sound Source Localization Paper List

2026-04-29 · updated 2026-05-19 · 7 min · 1446 words

ICASSP 2026 - Multimodal Learning Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 85 words

ICASSP 2026 - Multimodal Dialogue Intent Recognition Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 90 words

ICASSP 2026 - Multimodal Sentiment Analysis Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 151 words

ICASSP 2026 - Multimodal Emotion Recognition Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 231 words

ICASSP 2026 - Multimodal Models Paper List

2026-04-29 · updated 2026-05-19 · 4 min · 672 words

ICASSP 2026 - Multichannel Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 80 words

ICASSP 2026 - Multi-Pitch Estimation #Note Tracking Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 383 words

ICASSP 2026 - Entity Disambiguation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 91 words

ICASSP 2026 - Real-Time Processing Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 155 words

ICASSP 2026 - Adversarial Examples Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 128 words

ICASSP 2026 - Anomalous Sound Detection Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 154 words

ICASSP 2026 - Sentiment Analysis Paper List

2026-04-29 · updated 2026-05-19 · 4 min · 748 words

ICASSP 2026 - Emotion Recognition Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 154 words

ICASSP 2026 - Room Impulse Response Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 131 words

ICASSP 2026 - Room Impulse Response Denoising Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 102 words

ICASSP 2026 - Datasets Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 380 words

ICASSP 2026 - Dataset Alignment Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 122 words

ICASSP 2026 - Slot Filling Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 95 words

ICASSP 2026 - Model Evaluation Paper List

2026-04-29 · updated 2026-05-19 · 11 min · 2176 words

ICASSP 2026 - Singing Melody Extraction Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 102 words

ICASSP 2026 - Singing Voice Synthesis Paper List

2026-04-29 · updated 2026-05-19 · 3 min · 601 words

ICASSP 2026 - Singing Voice Transcription Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 134 words

ICASSP 2026 - Singing Voice Conversion Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 403 words

ICASSP 2026 - Underwater Acoustic Target Recognition Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 146 words

ICASSP 2026 - Bioacoustics Paper List

2026-04-29 · updated 2026-05-19 · 7 min · 1362 words

ICASSP 2026 - Target Speaker Extraction Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 81 words

ICASSP 2026 - Neural Decoding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 130 words

ICASSP 2026 - Spatial Audio Paper List

2026-04-29 · updated 2026-05-19 · 18 min · 3752 words

ICASSP 2026 - Federated Learning Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 195 words

ICASSP 2026 - Brain Signal Encoding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 167 words

ICASSP 2026 - Brain-Computer Interface Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 155 words

ICASSP 2026 - Dance Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 95 words

ICASSP 2026 - Visual Speech Recognition Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 365 words

ICASSP 2026 - Video-to-Audio Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 135 words

ICASSP 2026 - Video Retrieval Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 96 words

ICASSP 2026 - Video Moment Retrieval Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 110 words

ICASSP 2026 - Video Understanding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 97 words

ICASSP 2026 - Video Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 165 words

ICASSP 2026 - Video Device Identification Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 87 words

ICASSP 2026 - Video Question Answering Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 183 words

ICASSP 2026 - Video Highlight Detection Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 108 words

ICASSP 2026 - Speech Forgery Detection Paper List

2026-04-29 · updated 2026-05-19 · 5 min · 938 words

ICASSP 2026 - Voice Cloning Paper List

2026-04-29 · updated 2026-05-19 · 3 min · 470 words

ICASSP 2026 - Speech Separation Paper List

2026-04-29 · updated 2026-05-19 · 13 min · 2634 words

ICASSP 2026 - Speech Anonymization Paper List

2026-04-29 · updated 2026-05-19 · 6 min · 1240 words

ICASSP 2026 - Speech Discovery Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 145 words

ICASSP 2026 - Speech Synthesis Paper List

2026-04-29 · updated 2026-05-19 · 37 min · 7808 words

ICASSP 2026 - Speech Enhancement #Adversarial Defense Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 80 words

ICASSP 2026 - Speech Enhancement Paper List

2026-04-29 · updated 2026-05-19 · 40 min · 8423 words

ICASSP 2026 - Large Speech Models Paper List

2026-04-29 · updated 2026-05-19 · 3 min · 457 words

ICASSP 2026 - Spoken Dialogue Systems Paper List

2026-04-29 · updated 2026-05-19 · 7 min · 1302 words

ICASSP 2026 - Speech Emotion Recognition Paper List

2026-04-29 · updated 2026-05-19 · 26 min · 5504 words

ICASSP 2026 - Speech Summarization Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 204 words

ICASSP 2026 - Voice Activity Detection Paper List

2026-04-29 · updated 2026-05-19 · 5 min · 863 words

ICASSP 2026 - Speech Understanding Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 362 words

ICASSP 2026 - Speech Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 128 words

ICASSP 2026 - Speech Biomarkers Paper List

2026-04-29 · updated 2026-05-19 · 13 min · 2674 words

ICASSP 2026 - Speech Coding Paper List

2026-04-29 · updated 2026-05-19 · 3 min · 515 words

ICASSP 2026 - Speech Encoder Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 130 words

ICASSP 2026 - Speech Translation Paper List

2026-04-29 · updated 2026-05-19 · 6 min · 1095 words

ICASSP 2026 - Speech Representation Learning Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 170 words

ICASSP 2026 - Speech Decoding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 116 words

ICASSP 2026 - Speech Assessment Paper List

2026-04-29 · updated 2026-05-19 · 3 min · 531 words

ICASSP 2026 - Speech Recognition #Speech Synthesis Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 149 words

ICASSP 2026 - Speech Recognition #Speech Translation Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 389 words

ICASSP 2026 - Speech Recognition Paper List

2026-04-29 · updated 2026-05-19 · 55 min · 11705 words

ICASSP 2026 - Speech Quality Assessment Paper List

2026-04-29 · updated 2026-05-19 · 6 min · 1238 words

ICASSP 2026 - Voice Conversion #Speech Enhancement Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 144 words

ICASSP 2026 - Voice Conversion Paper List

2026-04-29 · updated 2026-05-19 · 5 min · 962 words

ICASSP 2026 - Spoken Question Answering Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 311 words

ICASSP 2026 - Speech-Driven Motion Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 130 words

ICASSP 2026 - Speaker Separation Paper List

2026-04-29 · updated 2026-05-19 · 6 min · 1217 words

ICASSP 2026 - Speaker Synthesis Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 96 words

ICASSP 2026 - Speaker Diarization #Speech Separation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 202 words

ICASSP 2026 - Speaker Diarization Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 278 words

ICASSP 2026 - Speaker Detection Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 86 words

ICASSP 2026 - Speaker Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 91 words

ICASSP 2026 - Talking Face Generation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 163 words

ICASSP 2026 - Speaker Recognition Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 103 words

ICASSP 2026 - Speaker Verification Paper List

2026-04-29 · updated 2026-05-19 · 6 min · 1183 words

ICASSP 2026 - Classroom Phase Segmentation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 83 words

ICASSP 2026 - Cross-Modal Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 213 words

ICASSP 2026 - Cross-Modal Retrieval Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 215 words

ICASSP 2026 - Mild Cognitive Impairment Detection Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 241 words

ICASSP 2026 - Transfer Learning Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 96 words

ICASSP 2026 - Zero-Shot Keyword Spotting Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 105 words

ICASSP 2026 - Music Information Retrieval Paper List

2026-04-29 · updated 2026-05-19 · 17 min · 3478 words

ICASSP 2026 - Music Separation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 99 words

ICASSP 2026 - Music Classification Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 181 words

ICASSP 2026 - Music Recommendation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 167 words

ICASSP 2026 - Music Retrieval Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 355 words

ICASSP 2026 - Music Mixing Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 100 words

ICASSP 2026 - Music Source Separation Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 242 words

ICASSP 2026 - Music Source Extraction Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 134 words

ICASSP 2026 - Music Understanding Paper List

2026-04-29 · updated 2026-05-19 · 7 min · 1392 words

ICASSP 2026 - Music Generation Paper List

2026-04-29 · updated 2026-05-19 · 18 min · 3742 words

ICASSP 2026 - Music Transcription Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 85 words

ICASSP 2026 - Audio-Visual Paper List

2026-04-29 · updated 2026-05-19 · 5 min · 1042 words

ICASSP 2026 - Audio-Visual Instance Segmentation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 130 words

ICASSP 2026 - Audio Event Detection Paper List

2026-04-29 · updated 2026-05-19 · 12 min · 2538 words

ICASSP 2026 - Audio Signal Processing Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 83 words

ICASSP 2026 - Audio Separation Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 236 words

ICASSP 2026 - Audio Classification #Zero-Shot Learning Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 81 words

ICASSP 2026 - Audio Classification Paper List

2026-04-29 · updated 2026-05-19 · 22 min · 4671 words

ICASSP 2026 - Audio Compression Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 319 words

ICASSP 2026 - Audio Scene Classification Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 178 words

ICASSP 2026 - Audio Scene Understanding Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 355 words

ICASSP 2026 - Audio Enhancement Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 331 words

ICASSP 2026 - Large Audio Models Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 101 words

ICASSP 2026 - Audio Captioning Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 102 words

ICASSP 2026 - Audio Security Paper List

2026-04-29 · updated 2026-05-19 · 8 min · 1559 words

ICASSP 2026 - Audio Description Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 105 words

ICASSP 2026 - Audio Effect Estimation Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 140 words

ICASSP 2026 - Lossless Audio Coding Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 108 words

ICASSP 2026 - Audio Retrieval #Audio Classification Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 174 words

ICASSP 2026 - Audio Retrieval Paper List

2026-04-29 · updated 2026-05-19 · 8 min · 1662 words

ICASSP 2026 - Audio Watermarking Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 148 words

ICASSP 2026 - Audio Deepfake Detection Paper List

2026-04-29 · updated 2026-05-19 · 17 min · 3544 words

ICASSP 2026 - Audio Generation Paper List

2026-04-29 · updated 2026-05-19 · 22 min · 4597 words

ICASSP 2026 - Audio Editing Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 96 words

ICASSP 2026 - Audio Quality Assessment Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 209 words

ICASSP 2026 - Audio Super-Resolution Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 111 words

ICASSP 2026 - Audio Question Answering Paper List

2026-04-29 · updated 2026-05-19 · 9 min · 1795 words

ICASSP 2026 - Pre-Training Paper List

2026-04-29 · updated 2026-05-19 · 1 min · 159 words

ICASSP 2026 - Domain Adaptation Paper List

2026-04-29 · updated 2026-05-19 · 2 min · 298 words

Identifying Birdsong Syllables without Labelled Data

2026-04-29 · updated 2026-05-19 · 2 min · 292 words

Identifying the Minimal and Maximal Phonetic Subspace of Speech Representations

2026-04-29 · updated 2026-05-19 · 2 min · 221 words

Identity Leakage Through Accent Cues in Voice Anonymisation

2026-04-29 · updated 2026-05-19 · 2 min · 382 words

Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

2026-04-29 · updated 2026-05-19 · 2 min · 252 words

Improving Active Learning for Melody Estimation by Disentangling Uncertainties

2026-04-29 · updated 2026-05-19 · 3 min · 462 words

Improving Anomalous Sound Detection with Attribute-Aware Representation from Domain-Adaptive Pre-Training

2026-04-29 · updated 2026-05-19 · 2 min · 288 words

Improving Audio Event Recognition with Consistency Regularization

2026-04-29 · updated 2026-05-19 · 2 min · 289 words

Improving Audio Question Answering with Variational Inference

2026-04-29 · updated 2026-05-19 · 2 min · 377 words

Improving Automatic Speech Recognition by Mitigating Distortions Introduced by Speech Enhancement Under Drone Noise

2026-04-29 · updated 2026-05-19 · 3 min · 630 words

Improving Binaural Distance Estimation in Reverberant Rooms Through Contrastive And Multi-Task Learning

2026-04-29 · updated 2026-05-19 · 2 min · 267 words

Improving Contextual ASR Via Multi-Grained Fusion With Large Language Models

2026-04-29 · updated 2026-05-19 · 2 min · 317 words

Improving Interpretability in Generative Multitimbral DDSP Frameworks via Semantically-Disentangled Musical Attributes

2026-04-29 · updated 2026-05-19 · 2 min · 404 words

Improving Multimodal Brain Encoding Model with Dynamic Subject-Awareness Routing

2026-04-29 · updated 2026-05-19 · 3 min · 476 words

Improving the Speaker Anonymization Evaluation’s Robustness to Target Speakers with Adversarial Learning

2026-04-29 · updated 2026-05-19 · 2 min · 304 words

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word level timestamp predictions

2026-04-29 · updated 2026-05-19 · 2 min · 361 words

InconVAD: A Two-Stage Dual-Tower Framework for Multimodal Emotion Inconsistency Detection

2026-04-29 · updated 2026-05-19 · 2 min · 352 words

Incremental Learning for Audio Classification with Hebbian Deep Neural Networks

2026-04-29 · updated 2026-05-19 · 2 min · 342 words

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

2026-04-29 · updated 2026-05-19 · 2 min · 264 words

Individualize the HRTF Neural Field Using Anthropometric Parameters Weighted by Direction-Attention

2026-04-29 · updated 2026-05-19 · 2 min · 312 words

Influence of Clean Speech Characteristics on Speech Enhancement Performance

2026-04-29 · updated 2026-05-19 · 2 min · 297 words

Influence-Aware Curation and Active Selection for Industrial and Surveillance Sound Events

2026-04-29 · updated 2026-05-19 · 3 min · 547 words

Input-Adaptive Differentiable Filterbanks via Hypernetworks for Robust Speech Processing

2026-04-29 · updated 2026-05-19 · 2 min · 418 words

InstructAudio: Unified Speech and Music Generation with Natural Language Instruction

2026-04-29 · updated 2026-05-19 · 4 min · 791 words

Instrument Generation Through Distributional Flow Matching and Test-Time Search

2026-04-29 · updated 2026-05-19 · 2 min · 270 words

Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation

2026-04-29 · updated 2026-05-19 · 3 min · 487 words

Integrating Speaker Embeddings and LLM-Derived Semantic Representations for Streaming Speaker Diarization

2026-04-29 · updated 2026-05-19 · 2 min · 408 words

Inter-Dialog Contrastive Learning for Multimodal Emotion Recognition in Conversations

2026-04-29 · updated 2026-05-19 · 3 min · 436 words

Interpretable Music Harmonic Analysis Through Multilinear Mixture of Experts

2026-04-29 · updated 2026-05-19 · 2 min · 225 words

Interval-Aware Retrieval Framework For Speech-Based Automatic Alzheimer’s Detection

2026-04-29 · updated 2026-05-19 · 2 min · 317 words

Inverse-Hessian Regularization for Continual Learning in ASR

2026-04-29 · updated 2026-05-19 · 2 min · 219 words

Investigating Modality Contribution in Audio LLMs for Music

2026-04-29 · updated 2026-05-19 · 1 min · 151 words

Investigating The Effect Of Sentence-Level Syntactic Structure On Information Loss In The Human Auditory System

2026-04-29 · updated 2026-05-19 · 2 min · 309 words

Is Phase Really Needed for Weakly-Supervised Dereverberation?

2026-04-29 · updated 2026-05-19 · 2 min · 224 words

It Is Personal: The Importance of Personalization for Recognizing Self-Reported Emotion

2026-04-29 · updated 2026-05-19 · 2 min · 368 words

Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation

2026-04-29 · updated 2026-05-19 · 2 min · 394 words

Joint Deep Secondary Path Estimation and Adaptive Control for Active Noise Cancellation

2026-04-29 · updated 2026-05-19 · 2 min · 368 words

Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-Task Multi-Scale Network

2026-04-29 · updated 2026-05-19 · 3 min · 531 words

Joint Estimation of Primary and Secondary Paths for Personalized Hearable Applications

2026-04-29 · updated 2026-05-19 · 2 min · 275 words

Joint Multichannel Acoustic Feedback Cancellation and Speaker Extraction via Kalman Filter and Deep Non-Linear Spatial Filter

2026-04-29 · updated 2026-05-19 · 2 min · 247 words

K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

2026-04-29 · updated 2026-05-19 · 2 min · 247 words

KAN We Make Models Simpler for Audio Deepfake Detection with Kolmogorov–Arnold Networks?

2026-04-29 · updated 2026-05-19 · 2 min · 309 words

Keeping Models Listening: Segment- and time-aware attention rescaling at decoding time

2026-04-29 · updated 2026-05-19 · 2 min · 319 words

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

2026-04-29 · updated 2026-05-19 · 1 min · 135 words

KSDIFF: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

2026-04-29 · updated 2026-05-19 · 3 min · 457 words

LAFUFU: Latent Acoustic Features For Ultra-Fast Utterance Restoration

2026-04-29 · updated 2026-05-19 · 3 min · 480 words

LAMB: LLM-Based Audio Captioning with Modality Gap Bridging Via Cauchy-Schwarz Divergence

2026-04-29 · updated 2026-05-19 · 2 min · 243 words

Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating for Robust Code-Switching ASR

2026-04-29 · updated 2026-05-19 · 1 min · 209 words

Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition

2026-04-29 · updated 2026-05-19 · 2 min · 396 words

Learnable Mel-Frontend for Robust Underwater Acoustic Target Detection under Non-Target Interference

2026-04-29 · updated 2026-05-19 · 2 min · 397 words

Learning Domain-Robust Bioacoustic Representations for Mosquito Species Classification with Contrastive Learning and Distribution Alignment

2026-04-29 · updated 2026-05-19 · 3 min · 462 words

Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

2026-04-29 · updated 2026-05-19 · 2 min · 295 words

Learning Piezoelectric Hysteresis in In-Ear MEMS Loudspeakers from Acoustic Measurements

2026-04-29 · updated 2026-05-19 · 2 min · 325 words

Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR

2026-04-29 · updated 2026-05-19 · 2 min · 277 words

Learning Vocal-Tract Area And Radiation With A Physics-Informed Webster Model

2026-04-29 · updated 2026-05-19 · 2 min · 415 words

Learning What to Hear: Boosting Sound-Source Association for Robust Audiovisual Instance Segmentation

2026-04-29 · Updated 2026-05-19 · 2 min · 377 words

LenslessMic: Audio Encryption and Authentication via Lensless Computational Imaging

2026-04-29 · Updated 2026-05-19 · 3 min · 574 words

LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data

2026-04-29 · Updated 2026-05-19 · 2 min · 364 words

LETPAV: Lexicon-Enhanced Text with Progressive Audio-Visual Fusion for Multimodal Sentiment Analysis

2026-04-29 · Updated 2026-05-19 · 3 min · 480 words

Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models

2026-04-29 · Updated 2026-05-19 · 2 min · 342 words

Leveraging Diffusion U-Net Features for Predominant Instrument Recognition

2026-04-29 · Updated 2026-05-19 · 1 min · 175 words

Leveraging Large Multimodal Models for Audio-Video Deepfake Detection: A Pilot Study

2026-04-29 · Updated 2026-05-19 · 2 min · 385 words

Leveraging Large Speech Language Models as Evaluators for Expressive Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 225 words

Leveraging Multiple Speech Enhancers for Non-Intrusive Intelligibility Prediction for Hearing-Impaired Listeners

2026-04-29 · Updated 2026-05-19 · 2 min · 340 words

Leveraging prediction entropy for Automatic prompt weighting in Zero-Shot Audio-Language Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 290 words

Leveraging Segment-Level Speech Representations for LLM-Based Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 363 words

Leveraging Text-to-Speech and Voice Conversion as Data Augmentation for Alzheimer’s Disease Detection from Spontaneous Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

Leveraging Whisper Embeddings For Audio-Based Lyrics Matching

2026-04-29 · Updated 2026-05-19 · 3 min · 442 words

Lightweight and Generalizable Acoustic Scene Representations Via Contrastive Fine-Tuning and Distillation

2026-04-29 · Updated 2026-05-19 · 2 min · 350 words

Lightweight and Perceptually-Guided Voice Conversion for Electro-Laryngeal Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 388 words

Lightweight Implicit Neural Network for Binaural Audio Synthesis

2026-04-29 · Updated 2026-05-19 · 3 min · 443 words

Lightweight Phoneme-Conditioned Bandwidth Extension for Body-Conducted Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 279 words

Lingometer: On-Device Personal Speech Word Counting System

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

Linguard: Authenticating Speech Recordings Using Speech Recognition and Watermark

2026-04-29 · Updated 2026-05-19 · 2 min · 335 words

LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-And-Play Dereverberation

2026-04-29 · Updated 2026-05-19 · 2 min · 297 words

Lisa: Lightweight Yet Superb Neural Speech Coding

2026-04-29 · Updated 2026-05-19 · 2 min · 371 words

Listen, But Don’t Leak: Sensitive Data Protection for Privacy Aware Automatic Speech Recognition with Acoustic Triggers

2026-04-29 · Updated 2026-05-19 · 1 min · 190 words

LLAC: Learned Lossless Audio Codec

2026-04-29 · Updated 2026-05-19 · 2 min · 333 words

LLM-Based Post-ASR Error Correction for Disordered Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 219 words

Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 361 words

LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 250 words

Look, Listen and Segment: Towards Weakly Supervised Audio-Visual Semantic Segmentation

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

Loose Coupling of Spectral and Spatial Models for Multi-Channel Diarization and Enhancement of Meetings in Dynamic Environments

2026-04-29 · Updated 2026-05-19 · 2 min · 383 words

LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Conversational ASR

2026-04-29 · Updated 2026-05-19 · 2 min · 220 words

Low-Bandwidth High-Fidelity Speech Transmission with Generative Latent Joint Source-Channel Coding

2026-04-29 · Updated 2026-05-19 · 2 min · 262 words

Low-Frequency Harmonic Control for Speech Intelligibility in Open-Ear Headphones

2026-04-29 · Updated 2026-05-19 · 2 min · 234 words

Low-Latency Audio Front-End Region-of-Interest Beamforming for Smart Glasses

2026-04-29 · Updated 2026-05-19 · 2 min · 236 words

Low-Resource Guidance for Controllable Latent Audio Diffusion

2026-04-29 · Updated 2026-05-19 · 3 min · 563 words

Low-Resource Speech-Based Early Alzheimers Detection via Cross-Lingual and Few-Shot Transfer Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 254 words

LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 313 words

MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation Without Vector Quantization

2026-04-29 · Updated 2026-05-19 · 2 min · 225 words

MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

2026-04-29 · Updated 2026-05-19 · 3 min · 542 words

Malefa: Multi-Granularity Learning and Effective False Alarm Suppression for Zero-Shot Keyword Spotting

2026-04-29 · Updated 2026-05-19 · 2 min · 332 words

Mambaformer: State-Space Augmented Self-Attention with Downup Sampling for Monaural Speech Enhancement

2026-04-29 · Updated 2026-05-19 · 2 min · 382 words

Marco-Voice: A Unified Framework for Expressive Speech Synthesis with Voice Cloning

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion with Increased Controllability via Multiple Guidances

2026-04-29 · Updated 2026-05-19 · 3 min · 477 words

Matching Reverberant Speech Through Learned Acoustic Embeddings

2026-04-29 · Updated 2026-05-19 · 2 min · 227 words

Matrix-Structured Hierarchical Convolutional Modeling for Pronunciation Assessment and Mispronunciation Detection

2026-04-29 · Updated 2026-05-19 · 3 min · 429 words

Maximum Likelihood Measurement Noise Estimation for Block-Time Domain Kalman Filters

2026-04-29 · Updated 2026-05-19 · 2 min · 233 words

MC-MRX: Reference- and MIDI-Guided Music Source Extraction with Contrastive Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 388 words

MCF: Text LLMs for Multimodal Emotional Causality

2026-04-29 · Updated 2026-05-19 · 2 min · 334 words

MCI-OTFusion: A Multimodal Model for MCI Detection and Cognitive Score Prediction

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

Meanflow-Accelerated Multimodal Video-to-Audio Synthesis Via One-Step Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 357 words

MeanFlowSE: One-Step Generative Speech Enhancement via Conditional Mean Flow

2026-04-29 · Updated 2026-05-19 · 2 min · 393 words

MeanSE: Efficient Generative Speech Enhancement with Mean Flows

2026-04-29 · Updated 2026-05-19 · 2 min · 350 words

MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

2026-04-29 · Updated 2026-05-19 · 3 min · 451 words

MeanVoiceFlow: One-Step Nonparallel Voice Conversion with Mean Flows

2026-04-29 · Updated 2026-05-19 · 2 min · 389 words

Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

2026-04-29 · Updated 2026-05-19 · 2 min · 293 words

MECap-R1: Emotion-Aware Policy with Reinforcement Learning for Multimodal Emotion Captioning

2026-04-29 · Updated 2026-05-19 · 2 min · 375 words

Medical ASR Enhancement by Domain-Specific Reinforcement Fine-Tuning

2026-04-29 · Updated 2026-05-19 · 2 min · 265 words

MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 426 words

Melos: Sentence-To-Section Training with Multi-Task Learning for LLM-Driven Song Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 417 words

Membership Inference Attack against Music Diffusion Models via Generative Manifold Perturbation

2026-04-29 · Updated 2026-05-19 · 2 min · 235 words

MFF-RVRDI: Multimodal Fusion Framework for Robust Video Recording Device Identification

2026-04-29 · Updated 2026-05-19 · 2 min · 251 words

MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large Audio-Language Model

2026-04-29 · Updated 2026-05-19 · 2 min · 353 words

Microphone-Less Measurement of Three-Dimensional Radiating Impulse Response of Sound Source using Spherical Harmonic-Domain Acousto-Optic Tomography

2026-04-29 · Updated 2026-05-19 · 1 min · 161 words

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 245 words

Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

2026-04-29 · Updated 2026-05-19 · 2 min · 270 words

Mind Your [m]s, Cross Your [t]s: A Large-Scale Phonetic Analysis of Speech Reproduction in Modern Speech Generators

2026-04-29 · Updated 2026-05-19 · 1 min · 196 words

MirrorTalk: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control

2026-04-29 · Updated 2026-05-19 · 2 min · 355 words

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

2026-04-29 · Updated 2026-05-19 · 2 min · 273 words

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

2026-04-29 · Updated 2026-05-19 · 2 min · 229 words

Mitigating Data Replication in Text-to-Audio Generative Diffusion Models Through Anti-Memorization Guidance

2026-04-29 · Updated 2026-05-19 · 2 min · 405 words

Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation

2026-04-29 · Updated 2026-05-19 · 1 min · 195 words

Mitigating Language Prior-Induced Hallucinations via Bi-Level Contrastive Decoding

2026-04-29 · Updated 2026-05-19 · 2 min · 388 words

Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 335 words

Mix2Morph: Learning Sound Morphing from Noisy Mixes

2026-04-29 · Updated 2026-05-19 · 2 min · 322 words

MixGAN-based Non-blind Bandwidth Extension for Audio Codec

2026-04-29 · Updated 2026-05-19 · 2 min · 311 words

Mixture of Experts for Recognizing Depression from Interview and Reading Tasks

2026-04-29 · Updated 2026-05-19 · 2 min · 342 words

Mixture To Beamformed Mixture: Leveraging Beamformed Mixture As Weak-Supervision for Speech Enhancement and Noise-Robust ASR

2026-04-29 · Updated 2026-05-19 · 2 min · 310 words

Mixture-of-Experts Based Soft-Label Learning for Multi-Label Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

2026-04-29 · Updated 2026-05-19 · 2 min · 244 words

Mixtures of Lightweight Articulatory Experts for Multilingual ASR

2026-04-29 · Updated 2026-05-19 · 2 min · 378 words

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

2026-04-29 · Updated 2026-05-19 · 2 min · 283 words

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 222 words

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

2026-04-29 · Updated 2026-05-19 · 2 min · 385 words

MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech

2026-04-29 · Updated 2026-05-19 · 1 min · 176 words

Modeling Both Intra- And Inter-Utterance Variability for Conversational Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

Modeling Inter-Segment Relationships in Speech for Dementia Detection with Audio Spectrogram Transformers and Graph Attention Networks

2026-04-29 · Updated 2026-05-19 · 2 min · 346 words

Modeling Strategies For Speech Enhancement in The Latent Space of a Neural Audio Codec

2026-04-29 · Updated 2026-05-19 · 3 min · 460 words

Monitoring exposure-length variations in submarine power cables using distributed fiber-optic sensing

2026-04-29 · Updated 2026-05-19 · 1 min · 146 words

More Than a Shortcut: A Hyperbolic Approach to Early-Exit Networks

2026-04-29 · Updated 2026-05-19 · 2 min · 368 words

Motionbeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding

2026-04-29 · Updated 2026-05-19 · 2 min · 263 words

MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 425 words

MSANET: Multi-Scale Semantic Aggregation Network for Brain-Assisted Speech Enhancement in Multi-Speaker Conditions

2026-04-29 · Updated 2026-05-19 · 2 min · 420 words

MSCT: Differential Cross-Modal Attention for Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 220 words

MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 405 words

MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech

2026-04-29 · Updated 2026-05-19 · 6 min · 1085 words

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-Token Prediction

2026-04-29 · Updated 2026-05-19 · 2 min · 332 words

Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 377 words

Multi-Layer Attentive Probing Improves Transfer of Audio Representations for Bioacoustics

2026-04-29 · Updated 2026-05-19 · 2 min · 254 words

Multi-Scale Physiologically-Motivated Alignment for Auditory Attention Decoding

2026-04-29 · Updated 2026-05-19 · 2 min · 253 words

Multi-Task Learning For Speech Quality Assessment Using ASR-Derived Entropy Features

2026-04-29 · Updated 2026-05-19 · 3 min · 488 words

Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 316 words

Multi-View Hierarchical Hypergraph Neural Network for Automatic Stuttering Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 392 words

Multilingual Supervised Pretraining with LM-Assisted Decoding for Visual Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 290 words

Multimodal Co-Training with Subtractive Unlabeled-Benefit Bounds

2026-04-29 · Updated 2026-05-19 · 1 min · 159 words

Multimodal Fusion-Based IPCLIP Network for Mixed Reality Surgical Assistance

2026-04-29 · Updated 2026-05-19 · 2 min · 250 words

Multimodal LLMs as Expert Speech Annotators: Acoustic Macro-Descriptors for Parkinson’s Detection

2026-04-29 · Updated 2026-05-19 · 1 min · 208 words

Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 295 words

Multimodal Transformer with Multiperspective Training for Predicting Self-Expression Skills from Video Interview

2026-04-29 · Updated 2026-05-19 · 2 min · 312 words

Multimodal Variational Graph Network for Multimodal Sentiment Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 410 words

MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

Musicdetr: A Position-Aware Spectral Note Detection Model for Singing Transcription

2026-04-29 · Updated 2026-05-19 · 2 min · 315 words

MusiCRS: Benchmarking Audio-Centric Conversational Recommendation

2026-04-29 · Updated 2026-05-19 · 2 min · 253 words

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 403 words

Natural Language to Spatial Audio Parameters: Lightweight Deterministic Rendering for Creative Authoring

2026-04-29 · Updated 2026-05-19 · 2 min · 422 words

NCF-TTS: Enhancing Flow Matching Based Text-To-Speech with Neighborhood Consistency Flow

2026-04-29 · Updated 2026-05-19 · 2 min · 333 words

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

2026-04-29 · Updated 2026-05-19 · 4 min · 852 words

Neural Network-Based Time-Frequency-Bin-Wise Linear Combination of Beamformers for Underdetermined Target Source Extraction

2026-04-29 · Updated 2026-05-19 · 2 min · 312 words

Neuromamba: Adaptive Frequency Filtering with a Pyramid Mamba for sEEG-driven Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 327 words

NeuroSIFT: A Biologically-Inspired Framework with Explicit Signal-Noise Separation for Robust Multimodal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 277 words

nGPT as a Scalable Architecture for Speech Recognition and Translation

2026-04-29 · Updated 2026-05-19 · 2 min · 328 words

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

Noise-Robust AV-ASR Using Visual Features both in the Whisper Encoder and Decoder

2026-04-29 · Updated 2026-05-19 · 3 min · 435 words

Noise-Robust Contrastive Learning with an MFCC-Conformer for Coronary Artery Disease Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 290 words

Noise-to-Notes: Diffusion-Based Generation and Refinement for Automatic Drum Transcription

2026-04-29 · Updated 2026-05-19 · 2 min · 366 words

Non-Line-of-Sight Vehicle Detection via Audio-Visual Fusion

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

Obstructive Sleep Apnea Endotype Prediction During Wakefulness Using Voice Biomarkers

2026-04-29 · Updated 2026-05-19 · 1 min · 171 words

Off-The-Grid Multi-Pitch Estimation Using Optimal Transport

2026-04-29 · Updated 2026-05-19 · 2 min · 224 words

OMNI-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 395 words

On deepfake voice detection - It’s all in the presentation

2026-04-29 · Updated 2026-05-19 · 2 min · 251 words

On The Design of Efficient Neural Methods for Geometry-Agnostic Multichannel Speech Enhancement

2026-04-29 · Updated 2026-05-19 · 2 min · 344 words

On the Design of Higher-Order Time-Intensity Microphone Arrays for Panoramic Audio Recording and Reproduction

2026-04-29 · Updated 2026-05-19 · 2 min · 369 words

One Model–Three Tasks: Discovering a Shared Winning Ticket for Low-Complexity Audio Intelligence

2026-04-29 · Updated 2026-05-19 · 2 min · 258 words

Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context

2026-04-29 · Updated 2026-05-19 · 2 min · 369 words

Optimizing Domain-Adaptive Self-Supervised Learning for Clinical Voice-Based Disease Classification

2026-04-29 · Updated 2026-05-19 · 3 min · 470 words

Optimizing Speech Language Models for Acoustic Consistency

2026-04-29 · Updated 2026-05-19 · 2 min · 335 words

OV-INSTRUCTTTS: Towards Open-Vocabulary Instruct Text-to-Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

PAC: Pronunciation-Aware Contextualized Large Language Model-Based Automatic Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 384 words

PADAM: Perceptual Audio Defect Assessment Model

2026-04-29 · Updated 2026-05-19 · 2 min · 369 words

ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-Based Neural Speech Codec

2026-04-29 · Updated 2026-05-19 · 2 min · 415 words

Parametric Neural Amp Modeling with Active Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 214 words

PC-MCL: Patient-Consistent Multi-Cycle Learning with Multi-Label Bias Correction for Respiratory Sound Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 381 words

Peeking Into the Future for Contextual Biasing

2026-04-29 · Updated 2026-05-19 · 2 min · 327 words

Perceptual Loss Optimized HRTF Personalization in Spherical Harmonic Domain

2026-04-29 · Updated 2026-05-19 · 2 min · 330 words

Perceptual Quality Assessment for Stylized Talking Heads

2026-04-29 · Updated 2026-05-19 · 2 min · 303 words

PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

2026-04-29 · Updated 2026-05-19 · 1 min · 104 words

Personal Sound Zones with Flexible Bright Zone Control

2026-04-29 · Updated 2026-05-19 · 2 min · 295 words

PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

2026-04-29 · Updated 2026-05-19 · 2 min · 401 words

PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion

2026-04-29 · Updated 2026-05-19 · 2 min · 411 words

PG-SE: Predictive Acceleration and Correction for Generative Speech Enhancement

2026-04-29 · Updated 2026-05-19 · 2 min · 407 words

Phase-Retrieval-Based Physics-Informed Neural Networks For Acoustic Magnitude Field Reconstruction

2026-04-29 · Updated 2026-05-19 · 2 min · 251 words

Phase-Space Signal Processing of Acoustic Data for Advanced Manufacturing In-Situ Monitoring

2026-04-29 · Updated 2026-05-19 · 1 min · 157 words

PhoenixDSR: Phoneme-Guided and LLM-Enhanced Dysarthric Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 363 words

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

2026-04-29 · Updated 2026-05-19 · 2 min · 343 words

Phonological Tokenizer: Prosody-Aware Phonetic Token Via Multi-Objective Fine-Tuning with Differentiable K-Means

2026-04-29 · Updated 2026-05-19 · 3 min · 510 words

Phrased: Phrase Dictionary Biasing for Speech Translation

2026-04-29 · Updated 2026-05-19 · 2 min · 266 words

Physics-Informed Neural Networks for Ocean Acoustic Field Reconstruction and Source Localization

2026-04-29 · Updated 2026-05-19 · 2 min · 235 words

Pianoroll-Event: A Novel Score Representation for Symbolic Music

2026-04-29 · Updated 2026-05-19 · 2 min · 340 words

PICOAUDIO2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

2026-04-29 · Updated 2026-05-19 · 2 min · 238 words

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 360 words

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 316 words

Polynomial Mixing for Efficient Self-Supervised Speech Encoders

2026-04-29 · Updated 2026-05-19 · 2 min · 379 words

Position-Invariant Fine-Tuning Of Speech Enhancement Models With Self-Supervised Speech Representations

2026-04-29 · Updated 2026-05-19 · 2 min · 318 words

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

2026-04-29 · Updated 2026-05-19 · 2 min · 400 words

Principled Coarse-Grained Acceptance For Speculative Decoding In Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 279 words

PRoADS: Provably Secure And Robust Audio Diffusion Steganography With Latent Optimization And Backward Euler Inversion

2026-04-29 · Updated 2026-05-19 · 2 min · 239 words

Probing the Hidden Talent of ASR foundation models for L2 English Oral Assessment

2026-04-29 · Updated 2026-05-19 · 2 min · 304 words

Probing Whisper for Dysarthric Speech in Detection and Assessment

2026-04-29 · Updated 2026-05-19 · 1 min · 174 words

Production-Scale Dynamic Vocabulary ASR Biasing with Word-Level FST and Robust Training

2026-04-29 · Updated 2026-05-19 · 2 min · 248 words

Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR

2026-04-29 · Updated 2026-05-19 · 1 min · 186 words

Prompt-Guided Mixture-of-Experts for Robust Multimodal Sentiment Analysis with Missing Modalities

2026-04-29 · Updated 2026-05-19 · 3 min · 597 words

PromptSep: Generative Audio Separation Via Multimodal Prompting

2026-04-29 · Updated 2026-05-19 · 2 min · 381 words

Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum

2026-04-29 · Updated 2026-05-19 · 2 min · 247 words

PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

2026-04-29 · Updated 2026-05-19 · 2 min · 305 words

Prototype-Guided Cross-Modal Contrastive Learning for Continual Audio-Visual Sound Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 292 words

PRSA: Preventing Malicious Speaker Recognition and Speech Synthesis Simultaneously with Adversarial Examples

2026-04-29 · Updated 2026-05-19 · 2 min · 312 words

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 342 words

PSTalker: Realistic 3D Talking Head Synthesis via a Semantic-Aware Audio-Driven Point-Based Shape

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 362 words

Qastanet: A DNN-Based Quality Metric for Spatial Audio

2026-04-29 · Updated 2026-05-19 · 2 min · 282 words

QE-XVC: Zero-Shot Cross-Lingual Voice Conversion via Query-Enhancement and Conditional Flow Matching

2026-04-29 · Updated 2026-05-19 · 2 min · 320 words

QFOCUS: Controllable Synthesis for Automated Speech Stress Editing to Deliver Human-Like Emphatic Intent

2026-04-29 · Updated 2026-05-19 · 1 min · 160 words

Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

2026-04-29 · Updated 2026-05-19 · 2 min · 386 words

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 281 words

Random Matrix-Driven Graph Representation Learning For Bioacoustic Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 272 words

Ranking The Impact of Contextual Specialization in Neural Speech Enhancement

2026-04-29 · Updated 2026-05-19 · 3 min · 489 words

RAP: Real-Time Audio-Driven Portrait Animation with Video Diffusion Transformer

2026-04-29 · Updated 2026-05-19 · 3 min · 454 words

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 226 words

RASD-SR: A Robust Anomalous Sound Detection Framework with Score Recalibration

2026-04-29 · Updated 2026-05-19 · 2 min · 293 words

Rationale-Guided Learning for Multimodal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 402 words

RCAL: Reinforced Cross-Modal Alignment for Multimodal Sentiment Analysis with Sparse Visual Frames

2026-04-29 · Updated 2026-05-19 · 2 min · 409 words

Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features

2026-04-29 · Updated 2026-05-19 · 3 min · 431 words

Real-Time Streaming Mel Vocoding with Generative Flow Matching

2026-04-29 · Updated 2026-05-19 · 2 min · 366 words

Reasoning Driven Captions to Assist Noise Robust Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 306 words

ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

2026-04-29 · Updated 2026-05-19 · 2 min · 362 words

Reconstruction of Spherical Sound Source Radiation Characteristics with Graph Signal Processing

2026-04-29 · Updated 2026-05-19 · 2 min · 244 words

Recovering Performance in Speech Emotion Recognition from Discrete Tokens Via Multi-Layer Fusion and Paralinguistic Feature Integration

2026-04-29 · Updated 2026-05-19 · 2 min · 416 words

Reducing Prompt Sensitivity in LLM-Based Speech Recognition Through Learnable Projection

2026-04-29 · Updated 2026-05-19 · 2 min · 310 words

Reference Microphone Selection for Guided Source Separation Based on The Normalized L-P Norm

2026-04-29 · Updated 2026-05-19 · 2 min · 296 words

Reference-Aware SFM Layers for Intrusive Intelligibility Prediction

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

Refgen: Reference-Guided Synthetic Data Generation for Anomalous Sound Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 264 words

Regularized Inverse Filter Design for Rigid Spherical Microphone Array Processing: Laplace- And Time-Domain Representations

2026-04-29 · Updated 2026-05-19 · 2 min · 231 words

Relative Time Intervals Representation For Word-Level Timestamping With Masked Training

2026-04-29 · Updated 2026-05-19 · 3 min · 482 words

Reliable AI via Age-Balanced Validation: Fair Model Selection for Parkinson’s Detection from Voice

2026-04-29 · Updated 2026-05-19 · 2 min · 361 words

Representation-Based Data Quality Audits for Audio

2026-04-29 · Updated 2026-05-19 · 3 min · 433 words

Representation-Diverse Self-Supervision for Cross-Domain Bioacoustic Learning in Low-Resource Settings

2026-04-29 · Updated 2026-05-19 · 2 min · 253 words

Residual Tokens Enhance Masked Autoencoders for Speech Modeling

2026-04-29 · Updated 2026-05-19 · 2 min · 425 words

Respire-Mamba C-UNet: Consistency-Trained Autoencoder for High-Fidelity Respiratory Sound Compression

2026-04-29 · Updated 2026-05-19 · 2 min · 361 words

Rethinking Entity Disambiguation in Complex Modalities

2026-04-29 · Updated 2026-05-19 · 3 min · 471 words

Rethinking Music Captioning with Music Metadata LLMs

2026-04-29 · Updated 2026-05-19 · 3 min · 470 words

Retrieval-Based Speculative Decoding For Autoregressive Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 1 min · 203 words

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

2026-04-29 · Updated 2026-05-19 · 2 min · 296 words

RFM-Editing: Rectified Flow Matching for Text-Guided Audio Editing

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

RHO-PERFECT: Correlation Ceiling for Subjective Evaluation Datasets

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses

2026-04-29 · Updated 2026-05-19 · 2 min · 272 words

RLBR: Reinforcement Learning with Biasing Rewards for Contextual Speech Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

RMODGDF: A Robust STFT-Derived Feature for Musical Instrument Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 412 words

Robust Accent Identification via Voice Conversion and Non-Timbral Embeddings

2026-04-29 · Updated 2026-05-19 · 1 min · 159 words

Robust and Lightweight F0 Estimation Through Mid-Level Fusion of DSP-Informed Features

2026-04-29 · Updated 2026-05-19 · 2 min · 332 words

Robust Deepfake Audio Detection via Multi-Level Intermediate Feature Fusion

2026-04-29 · Updated 2026-05-19 · 2 min · 295 words

Robust Online Overdetermined Independent Vector Analysis Based on Bilinear Decomposition

2026-04-29 · Updated 2026-05-19 · 1 min · 203 words

RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack

2026-04-29 · Updated 2026-05-19 · 3 min · 522 words

RRPO: Robust Reward Policy Optimization for LLM-Based Emotional TTS

2026-04-29 · Updated 2026-05-19 · 2 min · 244 words

S-PRESSO: Ultra Low Bitrate Sound Effect Compression with Diffusion Autoencoders and Offline Quantization

2026-04-29 · Updated 2026-05-19 · 2 min · 410 words

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

2026-04-29 · Updated 2026-05-19 · 3 min · 483 words

S2Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion

2026-04-29 · Updated 2026-05-19 · 3 min · 492 words

SA-SSL-MOS: Self-Supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

2026-04-29 · Updated 2026-05-19 · 3 min · 526 words

SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 354 words

SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution

2026-04-29 · Updated 2026-05-19 · 2 min · 339 words

Salad-VAE: Semantic Audio Compression with Language-Audio Distillation

2026-04-29 · Updated 2026-05-19 · 2 min · 323 words

Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems with Deep Kernel Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 320 words

SAUNA: Song-Level Audio & User-Listening Data Neural Alignment

2026-04-29 · Updated 2026-05-19 · 2 min · 216 words

Savgbench: Benchmarking Spatially Aligned Audio-Video Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 216 words

Scalable Evaluation for Audio Identification Via Synthetic Latent Fingerprint Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 323 words

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 314 words

Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

2026-04-29 · Updated 2026-05-19 · 2 min · 257 words

Scaling Spoken Language Models with Syllabic Speech Tokenization

2026-04-29 · Updated 2026-05-19 · 2 min · 272 words

SceneRAG: Scene-Level Retrieval-Augmented Generation for Video Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 296 words

SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

2026-04-29 · Updated 2026-05-19 · 2 min · 369 words

Secondary Source Placement for Sound Field Control Based on Ising Model

2026-04-29 · Updated 2026-05-19 · 2 min · 218 words

SED: Structural Entropy Based Speech Discretization for Discrete Token-Based ASR

2026-04-29 · Updated 2026-05-19 · 2 min · 377 words

Segmentwise Pruning in Audio-Language Models

2026-04-29 · Updated 2026-05-19 · 3 min · 488 words

SELD-MOHA: A Fine-Tuning Method with the Mixture of Heterogeneous Adapters for Sound Event Localization and Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 400 words

Selective Hub Fusion with Modality-Heterogeneous Experts for Multimodal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 460 words

Self-Supervised Note Tracking and Multi-Pitch Estimation Via Reconstruction-Based Learning

2026-04-29 · Updated 2026-05-19 · 3 min · 628 words

Semantic Anchor Transfer from Short to Long Speech in a Distillation-Based Summarization Framework

2026-04-29 · Updated 2026-05-19 · 2 min · 418 words

Semantic-Guided Pseudo-Feature Attention Network for Audio-Visual Zero-Shot Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 402 words

SEP-ST: Incorporating Speech Entity Prompt Into Large Language Models for Speech Translation

2026-04-29 · Updated 2026-05-19 · 2 min · 325 words

Separate this, and all of these Things Around It: Music Source Separation Via Hyperellipsoidal Queries

2026-04-29 · Updated 2026-05-19 · 2 min · 339 words

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

2026-04-29 · Updated 2026-05-19 · 2 min · 222 words

Sequential and Simultaneous Optimization of Microphone Array Geometry and Region-of-Interest Beamforming

2026-04-29 · Updated 2026-05-19 · 1 min · 209 words

Session-Level Spoken Language Assessment with A Multimodal Foundation Model Via Multi-Target Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 296 words

SFM-TTS: Lightweight and Rapid Speech Synthesis with Flexible Shortcut Flow Matching

2026-04-29 · Updated 2026-05-19 · 2 min · 409 words

Shared Representation Learning for Reference-Guided Targeted Sound Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

Shortcut Flow Matching for Speech Enhancement: Step-Invariant Flows via Single Stage Training

2026-04-29 · Updated 2026-05-19 · 2 min · 363 words

Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-Scale Dataset Cleansing

2026-04-29 · Updated 2026-05-19 · 2 min · 302 words

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 357 words

Sing What You Fit: A Perception-Based Dataset and Benchmark for Vocal-Song Suitability Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 226 words

Sing2Song: An Accompaniment Generation System Based on Solo Singing

2026-04-29 · Updated 2026-05-19 · 2 min · 393 words

Single-Microphone Audio Point Source Discriminative Localization from Reverberation Late Tail Estimation

2026-04-29 · Updated 2026-05-19 · 2 min · 259 words

Single-Step Controllable Music Bandwidth Extension with Flow Matching

2026-04-29 · Updated 2026-05-19 · 3 min · 433 words

SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment

2026-04-29 · Updated 2026-05-19 · 2 min · 246 words

SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

2026-04-29 · Updated 2026-05-19 · 3 min · 489 words

SIRUP: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics

2026-04-29 · Updated 2026-05-19 · 2 min · 342 words

SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

2026-04-29 · Updated 2026-05-19 · 2 min · 315 words

SLM-SS: Speech Language Model for Generative Speech Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 325 words

SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 368 words

Slot Filling as a Reasoning Task for SpeechLLMs

2026-04-29 · Updated 2026-05-19 · 2 min · 260 words

SmoothCLAP: Soft-Target Enhanced Contrastive Language-Audio Pretraining for Affective Computing

2026-04-29 · Updated 2026-05-19 · 2 min · 353 words

Snore Sound Classification Based on Physiological Features and Adaptive Loss Function

2026-04-29 · Updated 2026-05-19 · 2 min · 324 words

Solving the Helmholtz Equation Via Physics-Informed Neural Networks with an Adaptive Weighting Strategy

2026-04-29 · Updated 2026-05-19 · 2 min · 225 words

SONAR: Self-Distilled Continual Pre-Training for Domain Adaptive Audio Representation

2026-04-29 · Updated 2026-05-19 · 2 min · 276 words

SoundCompass: Navigating Target Sound Extraction with Effective Directional Clue Integration in Complex Acoustic Scenes

2026-04-29 · Updated 2026-05-19 · 2 min · 247 words

Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

2026-04-29 · Updated 2026-05-19 · 3 min · 496 words

Sounds that Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

2026-04-29 · Updated 2026-05-19 · 2 min · 288 words

Source Separation For A Cappella Music

2026-04-29 · Updated 2026-05-19 · 2 min · 310 words

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

2026-04-29 · Updated 2026-05-19 · 3 min · 470 words

SPAM: Style Prompt Adherence Metric for Prompt-Based TTS

2026-04-29 · Updated 2026-05-19 · 2 min · 304 words

Sparse Autoencoders Make Audio Foundation Models More Explainable

2026-04-29 · Updated 2026-05-19 · 2 min · 364 words

Sparse-View Visual-Acoustic Latent Learning for Novel-View Audio Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 424 words

Spatial Covariance Matrix Reconstruction for Speech Enhancement in Reverberant Multi-Source Environments

2026-04-29 · Updated 2026-05-19 · 2 min · 401 words

Spatial-CLAP: Learning Spatially-Aware Audio–Text Embeddings for Multi-Source Conditions

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

Spatially Aware Self-Supervised Models for Multi-Channel Neural Speaker Diarization

2026-04-29 · Updated 2026-05-19 · 2 min · 288 words

SpatialNet-Echo: Real-Time Acoustic Echo Cancellation via Integrated Narrow-Band and Cross-Band Processing

2026-04-29 · Updated 2026-05-19 · 2 min · 323 words

Speaker Anonymisation for Speech-Based Suicide Risk Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 259 words

Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding

2026-04-29 · Updated 2026-05-19 · 2 min · 397 words

Spectral or Spatial? Leveraging Both for Speaker Extraction in Challenging Data Conditions

2026-04-29 · Updated 2026-05-19 · 2 min · 261 words

Spectrogram Event Based Feature Representation for Generalizable Automatic Music Transcription

2026-04-29 · Updated 2026-05-19 · 3 min · 430 words

Speech Emotion Recognition based on Hierarchical Transformer with Shifted Windows

2026-04-29 · Updated 2026-05-19 · 2 min · 286 words

Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts

2026-04-29 · Updated 2026-05-19 · 2 min · 359 words

SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

SpeechMapper: Speech-To-Text Embedding Projector for LLMs

2026-04-29 · Updated 2026-05-19 · 3 min · 482 words

Spike-Driven Low-Power Speech Bandwidth Extension

2026-04-29 · Updated 2026-05-19 · 2 min · 398 words

Spiking Attention Network: A Hybrid Neuromorphic Approach to Underwater Acoustic Localization and Zero-Shot Adaptation

2026-04-29 · Updated 2026-05-19 · 2 min · 308 words

Spiking Temporal-Enhanced Network for Zero-Shot Audio-Visual Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 332 words

Spring Reverb Emulation with Hybrid Gated Convolutional Networks and State Space Models

2026-04-29 · Updated 2026-05-19 · 3 min · 442 words

SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 396 words

ST-HNTM: Joint Speech-Text Neural Topic Modeling on the Hypersphere

2026-04-29 · Updated 2026-05-19 · 3 min · 539 words

STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

2026-04-29 · Updated 2026-05-19 · 2 min · 356 words

Staged Diffusion with Hybrid Mixture-of-Experts (MoE) for Multimodal Sentiment Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 313 words

Stemphonic: All-At-Once Flexible Multi-Stem Music Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 423 words

Step-Audio-R1.5 Technical Report

2026-04-29 · Updated 2026-05-19 · 2 min · 260 words

StereoFoley: Object-Aware Stereo Audio Generation from Video

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

Stereophonic Acoustic Echo Cancellation Using an Improved Affine Projection Algorithm with Adaptive Multiple Sub-Filters

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

Still Thinking or Stopped Talking? Dialogue Silence Intention Classification Using Multimodal Large Language Model

2026-04-29 · Updated 2026-05-19 · 2 min · 318 words

Str-DiffSep: Streamable Diffusion Model for Speech Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 343 words

Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization Via Neural Audio Codec and Language Models

2026-04-29 · Updated 2026-05-19 · 3 min · 456 words

Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization

2026-04-29 · Updated 2026-05-19 · 2 min · 362 words

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 262 words

StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 265 words

Stress Prediction from Temporal Emotion Trajectories in Clinical Patient-Physician Conversations

2026-04-29 · Updated 2026-05-19 · 3 min · 430 words

Structure-Aware Diffusion Schrödinger Bridge

2026-04-29 · Updated 2026-05-19 · 1 min · 209 words

StyHarmo: Efficient Style-Specific Video Generation with Music Synchronization

2026-04-29 · Updated 2026-05-19 · 2 min · 266 words

Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent

2026-04-29 · Updated 2026-05-19 · 3 min · 512 words

Style-Disentangled Diffusion for Controllable and Identity-Generalized Speech-Driven Body Motion Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 245 words

StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

2026-04-29 · Updated 2026-05-19 · 3 min · 463 words

StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks

2026-04-29 · Updated 2026-05-19 · 2 min · 355 words

Subgraph Localization in the Subbands for Partially Spoofed Speech Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 297 words

Subsequence SDTW: Differentiable Alignment with Flexible Boundary Conditions

2026-04-29 · Updated 2026-05-19 · 2 min · 316 words

Subspace Hybrid Adaptive Filtering for Phonocardiogram Signal Denoising

2026-04-29 · Updated 2026-05-19 · 2 min · 297 words

SUNAC: Source-Aware Unified Neural Audio Codec

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

SURE: Synergistic Uncertainty-Aware Reasoning for Multimodal Emotion Recognition in Conversations

2026-04-29 · Updated 2026-05-19 · 2 min · 285 words

SwitchCodec: Adaptive Residual-Expert Sparse Quantization for High-Fidelity Neural Audio Coding

2026-04-29 · Updated 2026-05-19 · 2 min · 366 words

Symphony Rendering: MIDI and Composer-Conditioned Auto Orchestration with Flow-Matching Transformers

2026-04-29 · Updated 2026-05-19 · 3 min · 482 words

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

2026-04-29 · Updated 2026-05-19 · 2 min · 355 words

SynaSpot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

2026-04-29 · Updated 2026-05-19 · 2 min · 330 words

Synchronous Secondary Path Modeling and Kronecker-Factorized Adaptive Algorithm for Multichannel Active Noise Control

2026-04-29 · Updated 2026-05-19 · 2 min · 329 words

SyncSpeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer

2026-04-29 · Updated 2026-05-19 · 2 min · 344 words

SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding

2026-04-29 · Updated 2026-05-19 · 3 min · 456 words

SynthCloner: Synthesizer-Style Audio Transfer via Factorized Codec with ADSR Envelope Control

2026-04-29 · Updated 2026-05-19 · 2 min · 324 words

Synthesized Data Selection via Score Distribution Matching for Te Reo Māori Automatic Speech Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 262 words

Synthetic Data Domain Adaptation for ASR via LLM-Based Text and Phonetic Respelling Augmentation

2026-04-29 · Updated 2026-05-19 · 3 min · 473 words

Synthetic yet Striking? Assessing Vocal Charisma in TTS via Perceptual and Algorithmic Measures

2026-04-29 · Updated 2026-05-19 · 2 min · 227 words

T-Cache: Fast Inference For Masked Generative Transformer-Based TTS Via Prompt-Aware Feature Caching

2026-04-29 · Updated 2026-05-19 · 2 min · 357 words

T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS

2026-04-29 · Updated 2026-05-19 · 2 min · 292 words

TAG: Structured Temporal Audio Generation via LLM-Guided Manual Scription and Control

2026-04-29 · Updated 2026-05-19 · 2 min · 343 words

TAGARELA - A Portuguese Speech Dataset from Podcasts

2026-04-29 · Updated 2026-05-19 · 2 min · 284 words

Taming Audio VAEs via Target-KL Regularization

2026-04-29 · Updated 2026-05-19 · 2 min · 352 words

Target Speaker Anonymization in Multi-Speaker Recordings

2026-04-29 · Updated 2026-05-19 · 2 min · 280 words

Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder

2026-04-29 · Updated 2026-05-19 · 2 min · 344 words

Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 260 words

Task-Oriented Sound Privacy Preservation for Sound Event Detection Via End-to-End Adversarial Multi-Task Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 387 words

TASU: Text-only Alignment for Speech Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 366 words

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

2026-04-29 · Updated 2026-05-19 · 2 min · 335 words

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

2026-04-29 · Updated 2026-05-19 · 3 min · 504 words

Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-Wise Distillation

2026-04-29 · Updated 2026-05-19 · 2 min · 278 words

Teaching the Teachers: Boosting Unsupervised Domain Adaptation In Speech Recognition By Ensemble Update

2026-04-29 · Updated 2026-05-19 · 2 min · 400 words

Temporal Distillation for Music Representation Learning

2026-04-29 · Updated 2026-05-19 · 3 min · 433 words

Temporal Graph Modeling for Speech Emotion Recognition Using LSTM-Aggregated Multigraph Networks

2026-04-29 · Updated 2026-05-19 · 2 min · 229 words

Temporal-Spatial Decouple Before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

2026-04-29 · Updated 2026-05-19 · 4 min · 737 words

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 278 words

Test Time Adaptation for Speech Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 241 words

Test-Time Scaling for Auditory Cognition in Audio Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 292 words

Testing The Efficient Coding Hypothesis Beyond Humans: The Auditory Kernels of Bat Vocalizations

2026-04-29 · Updated 2026-05-19 · 2 min · 236 words

Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

2026-04-29 · Updated 2026-05-19 · 2 min · 324 words

Text2Move: Text-To-Moving Sound Generation via Trajectory Prediction and Temporal Alignment

2026-04-29 · Updated 2026-05-19 · 2 min · 243 words

TextlessRAG: End-to-End Visual Document RAG by Speech without Text

2026-04-29 · Updated 2026-05-19 · 2 min · 375 words

The 3rd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing Aid Speech Intelligibility Prediction

2026-04-29 · Updated 2026-05-19 · 1 min · 190 words

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-Based Language Encoders

2026-04-29 · Updated 2026-05-19 · 2 min · 277 words

The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures

2026-04-29 · Updated 2026-05-19 · 2 min · 390 words

The Muse Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

2026-04-29 · Updated 2026-05-19 · 2 min · 307 words

The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

2026-04-29 · Updated 2026-05-19 · 2 min · 255 words

The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion to Singing Style Conversion

2026-04-29 · Updated 2026-05-19 · 1 min · 208 words

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 276 words

The Synergistic Role of Audio and Large Video-Language Model in Source-Free Video Domain Adaptation

2026-04-29 · Updated 2026-05-19 · 2 min · 360 words

Theory and Application of Circular Relative Harmonic Coefficients

2026-04-29 · Updated 2026-05-19 · 2 min · 334 words

Thinking While Listening: Simple Test Time Scaling for Audio Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 252 words

Three Seconds is Sufficient: A Multi-Pronged Framework for Model-Based Speaker Adaptation in ASR Under Data-Scarce Conditions

2026-04-29 · Updated 2026-05-19 · 3 min · 493 words

TICL: Text-Embedding KNN for Speech in-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

Timbre-Aware Audio Difference Captioning for Anomalous Machine Sounds without Paired Training Data via Synthetic Perturbations

2026-04-29 · Updated 2026-05-19 · 2 min · 352 words

Timbre-Based Pretraining with Pseudo-Labels for Multi-Instrument Automatic Music Transcription

2026-04-29 · Updated 2026-05-19 · 3 min · 628 words

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0

2026-04-29 · Updated 2026-05-19 · 2 min · 341 words

Time-Domain Synthesis of Virtual Sound Source Within Personalized Sound Zone using a Linear Loudspeaker Array

2026-04-29 · Updated 2026-05-19 · 2 min · 221 words

Time-Shifted Token Scheduling for Symbolic Music Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 214 words

TinyMU: A Compact Audio-Language Model for Music Understanding

2026-04-29 · Updated 2026-05-19 · 2 min · 304 words

TLDiffGAN: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 350 words

TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Framework for Ü-Tsang, Amdo and Kham Speech Dataset Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 323 words

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

2026-04-29 · Updated 2026-05-19 · 3 min · 529 words

Toward Faithful Explanations in Acoustic Anomaly Detection

2026-04-29 · Updated 2026-05-19 · 1 min · 207 words

Toward Robust And Efficient Beat Tracking Via Beat-Aware Attention

2026-04-29 · Updated 2026-05-19 · 2 min · 384 words

Towards Blind Data Cleaning: A Case Study in Music Source Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 305 words

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

2026-04-29 · Updated 2026-05-19 · 2 min · 384 words

Towards Data Drift Monitoring for Speech Deepfake Detection in the Context of MLOps

2026-04-29 · Updated 2026-05-19 · 2 min · 248 words

Towards Distance-Aware Synthetic Audio Mixtures for Universal Sound Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 272 words

Towards Effective Negation Modeling in Joint Audio-Text Models for Music

2026-04-29 · Updated 2026-05-19 · 2 min · 248 words

Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances

2026-04-29 · Updated 2026-05-19 · 2 min · 339 words

Towards Fair ASR for Second Language Speakers using Fairness Prompted Finetuning

2026-04-29 · Updated 2026-05-19 · 2 min · 273 words

Towards Lightweight Adaptation of Speech Enhancement Models in Real-World Environments

2026-04-29 · Updated 2026-05-19 · 3 min · 442 words

Towards Multi-View Hierarchical Video-to-Piano Generation with MIDI Guidance

2026-04-29 · Updated 2026-05-19 · 2 min · 346 words

Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages

2026-04-29 · Updated 2026-05-19 · 2 min · 399 words

Towards Real-Time Generative Speech Restoration with Flow-Matching

2026-04-29 · Updated 2026-05-19 · 2 min · 280 words

Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER

2026-04-29 · Updated 2026-05-19 · 2 min · 343 words

Tpeformer: Temporal Patch Embedding Transformer

2026-04-29 · Updated 2026-05-19 · 2 min · 290 words

Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio

2026-04-29 · Updated 2026-05-19 · 3 min · 454 words

Training Dynamics-Aware Multi-Factor Curriculum Learning for Target Speaker Extraction

2026-04-29 · Updated 2026-05-19 · 2 min · 294 words

Training Flow Matching Models with Reliable Labels via Self-Purification

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

Training-Free Inference-Time Scaling for Audio Source Separation

2026-04-29 · Updated 2026-05-19 · 2 min · 281 words

Training-Free Multimodal Guidance for Video to Audio Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 321 words

Transfer Learning for Paediatric Sleep Apnoea Detection using Physiology-Guided Acoustic Models

2026-04-29 · Updated 2026-05-19 · 2 min · 285 words

Transferable Audio Lottery Tickets: Gradient Accumulation for Extreme Sparsity

2026-04-29 · Updated 2026-05-19 · 2 min · 265 words

Tri-Attention Fusion: Joint Temporal-Spectral and Bidirectional Modeling for Speech Spoofing Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 336 words

Triad: Tri-Head with Auxiliary Duplicating Permutation Invariant Training for Multi-Task Sound Event Localization and Detection

2026-04-29 · Updated 2026-05-19 · 2 min · 238 words

Triage Knowledge Distillation for Speaker Verification

2026-04-29 · Updated 2026-05-19 · 2 min · 329 words

TTA: Transcribe, Translate and Alignment for Cross-Lingual Speech Representation

2026-04-29 · Updated 2026-05-19 · 2 min · 389 words

TVP-UNet: Threshold Variance Penalty U-Net for Voice Activity Detection in Dysarthric Speech

2026-04-29 · Updated 2026-05-19 · 2 min · 263 words

Two-Stage Language Model Framework for Acoustic Echo Cancellation

2026-04-29 · Updated 2026-05-19 · 2 min · 359 words

UJCodec: An End-to-End UNet-Style Codec for Joint Speech Compression and Enhancement

2026-04-29 · Updated 2026-05-19 · 2 min · 341 words

UMA-SPLIT: Unimodal Aggregation for Both English and Mandarin Non-Autoregressive Speech Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 463 words

UMV: A Mixture-Of-Experts Vision Transformer with Multi-Spectrogram Fusion for Underwater Ship Noise Classification

2026-04-29 · Updated 2026-05-19 · 2 min · 253 words

Uncertainty-Aware 3D Emotional Talking Face Synthesis with Emotion Prior Distillation

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis

2026-04-29 · Updated 2026-05-19 · 2 min · 365 words

Understanding the Strengths and Weaknesses of SSL Models for Audio Deepfake Model Attribution

2026-04-29 · Updated 2026-05-19 · 2 min · 304 words

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 348 words

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

2026-04-29 · Updated 2026-05-19 · 3 min · 445 words

UNMIXX: Untangling Highly Correlated Singing Voices Mixtures

2026-04-29 · Updated 2026-05-19 · 2 min · 373 words

Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research

2026-04-29 · Updated 2026-05-19 · 2 min · 225 words

Unseen but Not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models

2026-04-29 · Updated 2026-05-19 · 2 min · 279 words

Unsupervised Discovery and Analysis of the Vocal Repertoires and Patterns of Select Corvid Species

2026-04-29 · Updated 2026-05-19 · 2 min · 316 words

Unsupervised Lexicon Learning from Speech is Limited by Representations Rather than Clustering

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

USVexplorer: Robust Detection of Ultrasonic Vocalizations with Cross Species Generalization

2026-04-29 · Updated 2026-05-19 · 2 min · 268 words

UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model

2026-04-29 · Updated 2026-05-19 · 2 min · 383 words

Utilizing Information Theoretic Approach to Study Cochlear Neural Degeneration

2026-04-29 · Updated 2026-05-19 · 2 min · 241 words

UVT-LM: Unifying Visual and Tactile Perception with Language Model

2026-04-29 · Updated 2026-05-19 · 2 min · 411 words

V2A-DPO: Omni-Preference Optimization for Video-To-Audio Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 368 words

Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 575 words

VBx for End-to-End Neural and Clustering-Based Diarization

2026-04-29 · Updated 2026-05-19 · 2 min · 341 words

VChangeCodec: An Ultra Low-Complexity Neural Speech Codec with Built-In Voice Changer for Customized Real-Time Communication

2026-04-29 · Updated 2026-05-19 · 3 min · 460 words

Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation

2026-04-29 · Updated 2026-05-19 · 2 min · 282 words

Vib2Sound: Separation Of Multimodal Sound Sources

2026-04-29 · Updated 2026-05-19 · 2 min · 361 words

VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation

2026-04-29 · Updated 2026-05-19 · 2 min · 395 words

Virtual Consistency for Audio Editing

2026-04-29 · Updated 2026-05-19 · 3 min · 453 words

Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation

2026-04-29 · Updated 2026-05-19 · 2 min · 238 words

ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models

2026-04-29 · Updated 2026-05-19 · 2 min · 223 words

VividTalker: A Modular Framework for Expressive 3D Talking Avatars with Controllable Gaze and Blink

2026-04-29 · Updated 2026-05-19 · 2 min · 408 words

VM-UNSSOR: Unsupervised Neural Speech Separation Enhanced by Higher-SNR Virtual Microphone Arrays

2026-04-29 · Updated 2026-05-19 · 3 min · 603 words

VMSP: Video-to-Music Generation with Two-Stage Alignment and Synthesis

2026-04-29 · Updated 2026-05-19 · 2 min · 260 words

VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

2026-04-29 · Updated 2026-05-19 · 2 min · 319 words

Voting-Based Pitch Estimation with Temporal and Frequential Alignment and Correlation Aware Selection

2026-04-29 · Updated 2026-05-19 · 3 min · 449 words

VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings

2026-04-29 · Updated 2026-05-19 · 2 min · 399 words

VoXtream: Full-Stream Text-To-Speech With Extremely Low Latency

2026-04-29 · Updated 2026-05-19 · 3 min · 482 words

VT-Heads: Voice Cloning and Talking Head Generation from Text Based on V-DiT

2026-04-29 · Updated 2026-05-19 · 2 min · 341 words

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 250 words

WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error

2026-04-29 · Updated 2026-05-19 · 1 min · 199 words

Wave-Trainer-Fit: Neural Vocoder With Trainable Prior And Fixed-Point Iteration Towards High-Quality Speech Generation From SSL Features

2026-04-29 · Updated 2026-05-19 · 2 min · 338 words

WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders with Residual Denoising and Sub-Modeling for GAN and Diffusion Models

2026-04-29 · Updated 2026-05-19 · 3 min · 553 words

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

2026-04-29 · Updated 2026-05-19 · 3 min · 612 words

WaveSpikeNet: A Wavelet-Spiking Fusion Architecture for Audio Classification on Edge Devices

2026-04-29 · Updated 2026-05-19 · 3 min · 498 words

WavLink: Compact Audio–Text Embeddings with a Global Whisper Token

2026-04-29 · Updated 2026-05-19 · 2 min · 333 words

What the student learns in knowledge distillation: A subspace view and evidence on Convolutional Recurrent Network

2026-04-29 · Updated 2026-05-19 · 2 min · 298 words

When Audio Matters: A Lightweight, Hierarchical Fusion Model for Speech and Non-Verbal Emotion Recognition

2026-04-29 · Updated 2026-05-19 · 2 min · 380 words

When Children Talk and Machines Listen: Toward an Interpretable Speech-Based Screener for Dutch Developmental Language Disorder

2026-04-29 · Updated 2026-05-19 · 2 min · 374 words

When Noise Lowers the Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 306 words

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 311 words

When Voice Matters: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making

2026-04-29 · Updated 2026-05-19 · 2 min · 381 words

Whisper-FEST: Single-Channel Far-Field Enhanced Speech-to-text without Parallel Data

2026-04-29 · Updated 2026-05-19 · 2 min · 425 words

Whisper-MLA: Reducing GPU Memory Consumption of ASR Models Based on MHA2MLA Conversion

2026-04-29 · Updated 2026-05-19 · 2 min · 312 words

Whisper-QF: Leveraging Dual Cross-Attention Q-Former for Speech Emotion Recognition With Multi-Task Learning

2026-04-29 · Updated 2026-05-19 · 2 min · 329 words

Whisper: Courtside Edition - Enhancing ASR Performance through LLM-Driven Context Generation

2026-04-29 · Updated 2026-05-19 · 1 min · 195 words

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

2026-04-29 · Updated 2026-05-19 · 1 min · 178 words

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

2026-04-29 · Updated 2026-05-19 · 2 min · 258 words

Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-Resource Speech Recognition

2026-04-29 · Updated 2026-05-19 · 3 min · 434 words

Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

2026-04-29 · Updated 2026-05-19 · 2 min · 248 words

ZK-VSA: Zero-Knowledge Verifiable Speaker Anonymization Leveraging Phase Vocoder with Time-Scale Modification

2026-04-29 · Updated 2026-05-19 · 2 min · 340 words

ZSV2C-MLLM: Zero-Shot Visual Voice Cloning Via Multimodal Large Language Models

2026-04-29 · Updated 2026-05-19 · 2 min · 334 words

β-AVSDNET: A Novel End-To-End Neural Network Architecture For Audio-Visual Speaker Diarization

2026-04-29 · Updated 2026-05-19 · 3 min · 487 words

Speech/Audio Paper Digest 2026-04-29

2026-04-29 · Updated 2026-05-19 · 19 min · 3856 words

A Functorial Formulation of Neighborhood Aggregating Deep Learning

2026-04-28 · Updated 2026-05-19 · 1 min · 148 words

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

2026-04-28 · Updated 2026-05-19 · 2 min · 368 words

An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

2026-04-28 · Updated 2026-05-19 · 2 min · 276 words

CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

2026-04-28 · updated 2026-05-19 · 2 min · 265 words

Come Together: Analyzing Popular Songs Through Statistical Embeddings

2026-04-28 · updated 2026-05-19 · 2 min · 243 words

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

2026-04-28 · updated 2026-05-19 · 1 min · 180 words

Explainable AI in Speaker Recognition – Making Latent Representations Understandable

2026-04-28 · updated 2026-05-19 · 2 min · 232 words

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

2026-04-28 · updated 2026-05-19 · 3 min · 491 words

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

2026-04-28 · updated 2026-05-19 · 2 min · 366 words

Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions

2026-04-28 · updated 2026-05-19 · 2 min · 344 words

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-04-28 · updated 2026-05-19 · 2 min · 368 words

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-28 · updated 2026-05-19 · 2 min · 411 words

Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification

2026-04-28 · updated 2026-05-19 · 2 min · 362 words

Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments

2026-04-28 · updated 2026-05-19 · 1 min · 194 words

Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network

2026-04-28 · updated 2026-05-19 · 1 min · 206 words

Psychologically-Grounded Graph Modeling for Interpretable Depression Detection

2026-04-28 · updated 2026-05-19 · 3 min · 503 words

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2026-04-28 · updated 2026-05-19 · 2 min · 287 words

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

2026-04-28 · updated 2026-05-19 · 3 min · 431 words

RTCFake: Speech Deepfake Detection in Real-Time Communication

2026-04-28 · updated 2026-05-19 · 2 min · 337 words

Scaling Properties of Continuous Diffusion Spoken Language Models

2026-04-28 · updated 2026-05-19 · 2 min · 415 words

Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

2026-04-28 · updated 2026-05-19 · 1 min · 208 words

Speech Enhancement Based on Drifting Models

2026-04-28 · updated 2026-05-19 · 2 min · 361 words

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

2026-04-28 · updated 2026-05-19 · 3 min · 612 words

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-04-28 · updated 2026-05-19 · 2 min · 409 words

Speech/Audio Paper Digest 2026-04-28

2026-04-28 · updated 2026-05-19 · 12 min · 2428 words

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

2026-04-27 · updated 2026-05-19 · 2 min · 343 words

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-04-27 · updated 2026-05-19 · 2 min · 266 words

Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues

2026-04-27 · updated 2026-05-19 · 1 min · 159 words

Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

2026-04-27 · updated 2026-05-19 · 3 min · 592 words

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

2026-04-27 · updated 2026-05-19 · 2 min · 395 words

Earable Platform with Integrated Simultaneous EEG Sensing and Auditory Stimulation

2026-04-27 · updated 2026-05-19 · 2 min · 270 words

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-04-27 · updated 2026-05-19 · 2 min · 318 words

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2026-04-27 · updated 2026-05-19 · 2 min · 260 words

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-04-27 · updated 2026-05-19 · 2 min · 377 words

Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven’s Piano and Cello Sonatas, 1930–2012

2026-04-27 · updated 2026-05-19 · 2 min · 236 words

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

2026-04-27 · updated 2026-05-19 · 2 min · 273 words

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-04-27 · updated 2026-05-19 · 2 min · 326 words

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2026-04-27 · updated 2026-05-19 · 4 min · 707 words

Speech/Audio Paper Digest 2026-04-27

2026-04-27 · updated 2026-05-19 · 8 min · 1673 words

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-25 · updated 2026-05-19 · 2 min · 320 words

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

2026-04-25 · updated 2026-05-19 · 1 min · 176 words

Speech/Audio Paper Digest 2026-04-25

2026-04-25 · updated 2026-05-19 · 2 min · 225 words

“This Wasn’t Made for Me”: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

2026-04-24 · updated 2026-05-19 · 1 min · 113 words

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

2026-04-24 · updated 2026-05-19 · 3 min · 428 words

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

2026-04-24 · updated 2026-05-19 · 1 min · 132 words

Beyond Rules: Towards Basso Continuo Personal Style Identification

2026-04-24 · updated 2026-05-19 · 1 min · 133 words

DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline

2026-04-24 · updated 2026-05-19 · 2 min · 255 words

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

2026-04-24 · updated 2026-05-19 · 1 min · 117 words

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

2026-04-24 · updated 2026-05-19 · 2 min · 333 words

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

2026-04-24 · updated 2026-05-19 · 1 min · 153 words

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-04-24 · updated 2026-05-19 · 1 min · 204 words

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

2026-04-24 · updated 2026-05-19 · 1 min · 178 words

Low-Rank Adaptation Redux for Large Models

2026-04-24 · updated 2026-05-19 · 1 min · 103 words

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-24 · updated 2026-05-19 · 3 min · 439 words

Materialistic RIR: Material Conditioned Realistic RIR Generation

2026-04-24 · updated 2026-05-19 · 2 min · 400 words

MER 2026: From Discriminative Emotion Recognition to Generative Emotion Understanding

2026-04-24 · updated 2026-05-19 · 2 min · 296 words

Misinformation Span Detection in Videos via Audio Transcripts

2026-04-24 · updated 2026-05-19 · 2 min · 285 words

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

2026-04-24 · updated 2026-05-19 · 1 min · 18 words

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

2026-04-24 · updated 2026-05-19 · 2 min · 280 words

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

2026-04-24 · updated 2026-05-19 · 3 min · 487 words

Sema: Semantic Transport for Real-Time Multimodal Agents

2026-04-24 · updated 2026-05-19 · 2 min · 266 words

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0

2026-04-24 · updated 2026-05-19 · 2 min · 402 words

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

2026-04-24 · updated 2026-05-19 · 3 min · 483 words

Speech/Audio Paper Digest 2026-04-24

2026-04-24 · updated 2026-05-19 · 11 min · 2180 words

Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

2026-04-23 · updated 2026-05-19 · 1 min · 165 words

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

2026-04-23 · updated 2026-05-19 · 1 min · 170 words

Before the Mic: Physical-Layer Voiceprint Anonymization with Acoustic Metamaterials

2026-04-23 · updated 2026-05-19 · 2 min · 236 words

Centering Ecological Goals in Automated Identification of Individual Animals

2026-04-23 · updated 2026-05-19 · 2 min · 233 words

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

2026-04-23 · updated 2026-05-19 · 2 min · 276 words

Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis

2026-04-23 · updated 2026-05-19 · 2 min · 311 words

Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations

2026-04-23 · updated 2026-05-19 · 2 min · 221 words

Enhancing ASR Performance in the Medical Domain for Dravidian Languages

2026-04-23 · updated 2026-05-19 · 2 min · 293 words

Enhancing Speaker Verification with Whispered Speech via Post-Processing

2026-04-23 · updated 2026-05-19 · 2 min · 259 words

Environmental Sound Deepfake Detection Using Deep-Learning Framework

2026-04-23 · updated 2026-05-19 · 2 min · 267 words

Explicit Dropout: Deterministic Regularization for Transformer Architectures

2026-04-23 · updated 2026-05-19 · 1 min · 111 words

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

2026-04-23 · updated 2026-05-19 · 2 min · 302 words

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

2026-04-23 · updated 2026-05-19 · 2 min · 266 words

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

2026-04-23 · updated 2026-05-19 · 2 min · 386 words

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

2026-04-23 · updated 2026-05-19 · 1 min · 201 words

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-23 · updated 2026-05-19 · 2 min · 215 words

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

2026-04-23 · updated 2026-05-19 · 1 min · 207 words

Qwen3.5-Omni Technical Report

2026-04-23 · updated 2026-05-19 · 2 min · 251 words

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-23 · updated 2026-05-19 · 2 min · 231 words

SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment

2026-04-23 · updated 2026-05-19 · 1 min · 182 words

Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones

2026-04-23 · updated 2026-05-19 · 2 min · 237 words

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

2026-04-23 · updated 2026-05-19 · 1 min · 200 words

Tadabur: A Large-Scale Quran Audio Dataset

2026-04-23 · updated 2026-05-19 · 1 min · 191 words

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026-04-23 · updated 2026-05-19 · 2 min · 266 words

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

2026-04-23 · updated 2026-05-19 · 2 min · 316 words

Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

2026-04-23 · updated 2026-05-19 · 2 min · 223 words

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

2026-04-23 · updated 2026-05-19 · 2 min · 307 words

Speech/Audio Paper Digest 2026-04-23

2026-04-23 · updated 2026-05-19 · 13 min · 2679 words

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

2026-04-22 · updated 2026-05-19 · 3 min · 428 words

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

2026-04-22 · updated 2026-05-19 · 3 min · 465 words

Audio Spoof Detection with GaborNet

2026-04-22 · updated 2026-05-19 · 4 min · 689 words

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

2026-04-22 · updated 2026-05-19 · 2 min · 335 words

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

2026-04-22 · updated 2026-05-19 · 2 min · 277 words

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

2026-04-22 · updated 2026-05-19 · 2 min · 221 words

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

2026-04-22 · updated 2026-05-19 · 3 min · 465 words

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

2026-04-22 · updated 2026-05-19 · 2 min · 290 words

Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification

2026-04-22 · updated 2026-05-19 · 2 min · 419 words

Environmental Sound Deepfake Detection Using Deep-Learning Framework

2026-04-22 · updated 2026-05-19 · 2 min · 276 words

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

2026-04-22 · updated 2026-05-19 · 2 min · 305 words

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-22 · updated 2026-05-19 · 1 min · 24 words

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

2026-04-22 · updated 2026-05-19 · 2 min · 237 words

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

2026-04-22 · updated 2026-05-19 · 2 min · 269 words

Qwen3.5-Omni Technical Report

2026-04-22 · updated 2026-05-19 · 2 min · 392 words

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-22 · updated 2026-05-19 · 2 min · 405 words

Tadabur: A Large-Scale Quran Audio Dataset

2026-04-22 · updated 2026-05-19 · 2 min · 327 words

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026-04-22 · updated 2026-05-19 · 2 min · 397 words

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

2026-04-22 · updated 2026-05-19 · 2 min · 338 words

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

2026-04-22 · updated 2026-05-19 · 3 min · 435 words

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

2026-04-22 · updated 2026-05-19 · 2 min · 385 words

Speech/Audio Paper Digest 2026-04-22

2026-04-22 · updated 2026-05-19 · 8 min · 1620 words

A novel LSTM music generator based on the fractional time-frequency feature extraction

2026-04-21 · updated 2026-05-19 · 1 min · 209 words

A state-space representation of the boundary integral equation for room acoustic modelling

2026-04-21 · updated 2026-05-19 · 2 min · 251 words

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

2026-04-21 · updated 2026-05-19 · 2 min · 390 words

Anonymization, Not Elimination: Utility-Preserved Speech Anonymization

2026-04-21 · updated 2026-05-19 · 3 min · 568 words

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

2026-04-21 · updated 2026-05-19 · 2 min · 311 words

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

2026-04-21 · updated 2026-05-19 · 2 min · 278 words

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

2026-04-21 · updated 2026-05-19 · 3 min · 497 words

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

2026-04-21 · updated 2026-05-19 · 2 min · 384 words

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

2026-04-21 · updated 2026-05-19 · 2 min · 230 words

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

2026-04-21 · updated 2026-05-19 · 1 min · 140 words

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

2026-04-21 · updated 2026-05-19 · 1 min · 213 words

Coexisting Tempo Traditions in Beethoven’s Piano and Cello Sonatas: A K-means Clustering Analysis of Recorded Performances, 1930-2012

2026-04-21 · updated 2026-05-19 · 2 min · 246 words

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

2026-04-21 · updated 2026-05-19 · 3 min · 447 words

FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs

2026-04-21 · updated 2026-05-19 · 2 min · 367 words

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

2026-04-21 · updated 2026-05-19 · 2 min · 223 words

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

2026-04-21 · updated 2026-05-19 · 2 min · 348 words

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

2026-04-21 · updated 2026-05-19 · 3 min · 483 words

ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection

2026-04-21 · updated 2026-05-19 · 2 min · 385 words

Incremental learning for audio classification with Hebbian Deep Neural Networks

2026-04-21 · updated 2026-05-19 · 2 min · 280 words

Latent Fourier Transform

2026-04-21 · updated 2026-05-19 · 2 min · 342 words

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

2026-04-21 · updated 2026-05-19 · 2 min · 391 words

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

2026-04-21 · updated 2026-05-19 · 3 min · 472 words

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

2026-04-21 · updated 2026-05-19 · 2 min · 284 words

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-21 · updated 2026-05-19 · 2 min · 303 words

Neural Encoding Detection is Not All You Need for Synthetic Speech Detection

2026-04-21 · updated 2026-05-19 · 2 min · 263 words

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

2026-04-21 · updated 2026-05-19 · 2 min · 257 words

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

2026-04-21 · updated 2026-05-19 · 2 min · 271 words

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

2026-04-21 · updated 2026-05-19 · 3 min · 617 words

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

2026-04-21 · updated 2026-05-19 · 2 min · 370 words

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

2026-04-21 · updated 2026-05-19 · 1 min · 187 words

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

2026-04-21 · updated 2026-05-19 · 2 min · 276 words

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

2026-04-21 · updated 2026-05-19 · 2 min · 421 words

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

2026-04-21 · updated 2026-05-19 · 2 min · 321 words

Where Do Self-Supervised Speech Models Become Unfair?

2026-04-21 · updated 2026-05-19 · 1 min · 166 words

Speech/Audio Paper Digest 2026-04-21

2026-04-21 · updated 2026-05-19 · 13 min · 2659 words

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

2026-04-20 · updated 2026-05-19 · 2 min · 386 words

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

2026-04-20 · updated 2026-05-19 · 2 min · 225 words

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

2026-04-20 · updated 2026-05-19 · 3 min · 447 words

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

2026-04-20 · updated 2026-05-19 · 3 min · 528 words

BlasBench: An Open Benchmark for Irish Speech Recognition

2026-04-20 · updated 2026-05-19 · 3 min · 435 words

Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

2026-04-20 · updated 2026-05-19 · 2 min · 388 words

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

2026-04-20 · updated 2026-05-19 · 3 min · 439 words

Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

2026-04-20 · updated 2026-05-19 · 2 min · 372 words

Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

2026-04-20 · updated 2026-05-19 · 3 min · 526 words

HARNESS: Lightweight Distilled Arabic Speech Foundation Models

2026-04-20 · updated 2026-05-19 · 4 min · 779 words

Hierarchical Codec Diffusion for Video-to-Speech Generation

2026-04-20 · updated 2026-05-19 · 6 min · 1219 words

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

2026-04-20 · updated 2026-05-19 · 3 min · 588 words

Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

2026-04-20 · updated 2026-05-19 · 2 min · 374 words

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-04-20 · updated 2026-05-19 · 2 min · 388 words

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

2026-04-20 · updated 2026-05-19 · 6 min · 1114 words

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

2026-04-20 · updated 2026-05-19 · 2 min · 377 words

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

2026-04-20 · updated 2026-05-19 · 2 min · 238 words

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

2026-04-20 · updated 2026-05-19 · 1 min · 163 words

Qwen3.5-Omni Technical Report

2026-04-20 · updated 2026-05-19 · 2 min · 424 words

Spatial-Aware Conditioned Fusion for Audio-Visual Navigation

2026-04-20 · updated 2026-05-19 · 4 min · 761 words

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

2026-04-20 · updated 2026-05-19 · 5 min · 999 words

The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

2026-04-20 · updated 2026-05-19 · 2 min · 402 words

TinyMU: A Compact Audio-Language Model for Music Understanding

2026-04-20 · updated 2026-05-19 · 3 min · 611 words

VoxMind: An End-to-End Agentic Spoken Dialogue System

2026-04-20 · updated 2026-05-19 · 5 min · 909 words

Speech/Audio Paper Digest 2026-04-20

2026-04-20 · updated 2026-05-19 · 10 min · 2068 words

A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven’s Piano and Cello Sonatas

2026-04-19 · updated 2026-05-19 · 2 min · 253 words

Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

2026-04-19 · updated 2026-05-19 · 2 min · 423 words

An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

2026-04-19 · updated 2026-05-19 · 2 min · 249 words

Audio Source Separation in Reverberant Environments using β-divergence based Nonnegative Factorization

2026-04-19 · updated 2026-05-19 · 1 min · 123 words

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

2026-04-19 · updated 2026-05-19 · 2 min · 314 words

AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

2026-04-19 · updated 2026-05-19 · 2 min · 300 words

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

2026-04-19 · updated 2026-05-19 · 2 min · 237 words

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

2026-04-19 · updated 2026-05-19 · 2 min · 325 words

Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset

2026-04-19 · updated 2026-05-19 · 2 min · 294 words

Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals

2026-04-19 · updated 2026-05-19 · 1 min · 189 words

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

2026-04-19 · updated 2026-05-19 · 3 min · 517 words

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

2026-04-19 · updated 2026-05-19 · 2 min · 370 words

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

2026-04-19 · updated 2026-05-19 · 3 min · 482 words

Diffusion Language Models for Speech Recognition

2026-04-19 · updated 2026-05-19 · 2 min · 253 words

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

2026-04-19 · updated 2026-05-19 · 2 min · 273 words

Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning

2026-04-19 · updated 2026-05-19 · 2 min · 385 words

Enhancing time-frequency resolution with optimal transport and barycentric fusion of multiple spectrogram

2026-04-19 · updated 2026-05-19 · 3 min · 508 words

Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models

2026-04-19 · updated 2026-05-19 · 2 min · 234 words

Four Decades of Digital Waveguides

2026-04-19 · updated 2026-05-19 · 1 min · 190 words

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

2026-04-19 · updated 2026-05-19 · 2 min · 289 words

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

2026-04-19 · updated 2026-05-19 · 3 min · 525 words

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

2026-04-19 · updated 2026-05-19 · 3 min · 430 words

Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

2026-04-19 · updated 2026-05-19 · 2 min · 388 words

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

2026-04-19 · updated 2026-05-19 · 2 min · 258 words

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-04-19 · updated 2026-05-19 · 2 min · 339 words

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

2026-04-19 · updated 2026-05-19 · 2 min · 366 words

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

2026-04-19 · updated 2026-05-19 · 2 min · 351 words

Room compensation for loudspeaker reproduction using a supporting source

2026-04-19 · updated 2026-05-19 · 2 min · 225 words

Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

2026-04-19 · updated 2026-05-19 · 2 min · 304 words

SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion

2026-04-19 · updated 2026-05-19 · 2 min · 401 words

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

2026-04-19 · updated 2026-05-19 · 2 min · 341 words

StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

2026-04-19 · updated 2026-05-19 · 2 min · 297 words

TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants

2026-04-19 · updated 2026-05-19 · 1 min · 128 words

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

2026-04-19 · updated 2026-05-19 · 3 min · 531 words

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

2026-04-19 · updated 2026-05-19 · 2 min · 387 words

Transformer Based Machine Fault Detection From Audio Input

2026-04-19 · updated 2026-05-19 · 1 min · 100 words

UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations

2026-04-19 · updated 2026-05-19 · 3 min · 580 words

VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark

2026-04-19 · updated 2026-05-19 · 3 min · 444 words

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

2026-04-19 · updated 2026-05-19 · 1 min · 177 words

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

2026-04-19 · updated 2026-05-19 · 2 min · 284 words

Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

2026-04-19 · updated 2026-05-19 · 2 min · 376 words

Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

2026-04-19 · updated 2026-05-19 · 3 min · 503 words

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

2026-04-19 · updated 2026-05-19 · 2 min · 371 words

Speech/Audio Paper Digest 2026-04-19

2026-04-19 · updated 2026-05-19 · 15 min · 3104 words

Speech/Audio Paper Digest 2026-04-18

2026-04-18 · updated 2026-05-19 · 43 min · 9080 words