Archive | Speech/Music/Audio Paper Digest

2026 (3731)

July (611)

An Evaluation Framework for Structured Audio Captions Validated by Controlled Perturbations

2026-07-24 · updated 2026-07-24 · 2 min · 325 words

Designed Vocalizations Dataset: Sound-Designed Human and Animal Voices for Non-human Voice Conversion

2026-07-24 · updated 2026-07-24 · 2 min · 294 words

DONDO: Open w2v-BERT Speech-Recognition Base Models for African Languages

2026-07-24 · updated 2026-07-24 · 3 min · 528 words

Faster IndexTTS-2: Accelerating and Streaming Autoregressive Zero-Shot Text-to-Speech Synthesis on GPUs

2026-07-24 · updated 2026-07-24 · 3 min · 526 words

From Read Speech to Spoken Digits: A Task-Specific Evaluation of Speech Privacy With Informed Attackers

2026-07-24 · updated 2026-07-24 · 2 min · 373 words

Improving the performance of an ASV system using hybrid speech features

2026-07-24 · updated 2026-07-24 · 3 min · 479 words

Instruct-FD: Can Your Full-Duplex Speech System Follow Turn-Taking Instructions?

2026-07-24 · updated 2026-07-24 · 2 min · 367 words

Investigating Codec-Internal Latent Audio Watermarking for Neural Codec Robustness

2026-07-24 · updated 2026-07-24 · 3 min · 440 words

OPOD: On-Policy Omni Distillation

2026-07-24 · updated 2026-07-24 · 3 min · 622 words

Phonetic forced alignment for low-resource language varieties: Model training and evaluation on Chengdu Mandarin

2026-07-24 · updated 2026-07-24 · 3 min · 564 words

Safeguards for Speech2Speech LLM-Assistants: A Case Study in Automotive Applications

2026-07-24 · updated 2026-07-24 · 2 min · 394 words

SCoPE: Shift-Aware Speaker-Conditioned Priors for Emotion Recognition in Conversations

2026-07-24 · updated 2026-07-24 · 2 min · 356 words

TF-MossFormer: Integrating Convolution Gated Local-Global Attentions for Enhanced Time-Frequency Domain Monaural Speech Separation

2026-07-24 · updated 2026-07-24 · 2 min · 421 words

Toward Generalizable Cognitive Impairment Detection with Speech-Based Multimodal Large Language Models

2026-07-24 · updated 2026-07-24 · 3 min · 586 words

Toward Interpretable Speech Deepfake Detection using Artifact-Specific Experts and Calibrated Detection Scores

2026-07-24 · updated 2026-07-24 · 4 min · 666 words

VibeVoice-ASR-BitNet Technical Report

2026-07-24 · updated 2026-07-24 · 3 min · 464 words

Word meaning co-determines vowel-inherent spectral change. A corpus-based investigation of conversational Mandarin

2026-07-24 · updated 2026-07-24 · 2 min · 389 words

X³-OPD: Distilling Reasoning into Large Audio-Language Models via On-Policy Alignment

2026-07-24 · updated 2026-07-24 · 4 min · 781 words

Speech/Music/Audio Paper Digest 2026-07-24

2026-07-24 · updated 2026-07-24 · 14 min · 2973 words

A Diagnostic Evaluation Framework for AI-Generated Cover Songs Using Music-Theoretic and Acoustic Features

2026-07-23 · updated 2026-07-24 · 2 min · 317 words

Audio-Zero: Label-Free Self-Evolution for Fine-Grained Audio Reasoning

2026-07-23 · updated 2026-07-24 · 3 min · 566 words

Black-Box Optimization for Identifying and Inverting Audio Dynamic Range Control Effects

2026-07-23 · updated 2026-07-24 · 2 min · 415 words

CAPS: A Cascaded Reconstruction Model to Power Saving in Hearables Using Sub-Nyquist Sampling with Bandwidth Extension

2026-07-23 · updated 2026-07-24 · 4 min · 824 words

Cross-Subject Semantic Decoding with Shared-Space Alignment for Generalized Neural Representation Learning

2026-07-23 · updated 2026-07-24 · 2 min · 371 words

Cumsum-Composable Phase Transport for Low-Cost Streaming Keyword Spotting

2026-07-23 · updated 2026-07-24 · 3 min · 449 words

Efficient Chain-of-Modality Reasoning via Progressive Compression for Spoken Language Models

2026-07-23 · updated 2026-07-24 · 6 min · 1117 words

Improved Monitoring of Honey bee Colony Strength via Audio IoT Sensors, Modulation Tensorgrams and Recurrent Neural Networks

2026-07-23 · updated 2026-07-24 · 4 min · 648 words

Layer-Wise Decision Fusion for Fake Audio Detection Using XLS-R

2026-07-23 · updated 2026-07-24 · 3 min · 482 words

Learning the Arabic Dialect Continuum as a Continuous Space: A Regression Approach to Speaker Origin Prediction

2026-07-23 · updated 2026-07-24 · 3 min · 603 words

Multimodal Speaker Verification as a Threat to Speaker Anonymization

2026-07-23 · updated 2026-07-24 · 3 min · 582 words

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

2026-07-23 · updated 2026-07-24 · 3 min · 543 words

Pushing the Frontier of Full-Song Generation: Hierarchical Autoregressive Planning Meets Flow-Matching Rendering

2026-07-23 · updated 2026-07-24 · 4 min · 747 words

RIME: Enabling Large-Scale Agentic Post-Production

2026-07-23 · updated 2026-07-24 · 3 min · 527 words

RPPNet: Perceptually-Grouped Rhythm-Pitch Primitives for Long-Term Structure Melody Generation via Boundary-Aware Modeling

2026-07-23 · updated 2026-07-24 · 2 min · 315 words

Scalable Keyword Spotting via Modular Network Expansion

2026-07-23 · updated 2026-07-24 · 3 min · 599 words

SimulS2ST-Omni: Data-Efficient Streaming Speech-to-Speech Translation via Explicit Trajectory Supervision

2026-07-23 · updated 2026-07-24 · 3 min · 576 words

StellarTTS: Sparse Temporal Embedding for Low-Latency and Robust Speech Synthesis

2026-07-23 · updated 2026-07-24 · 3 min · 544 words

The Giant Hippocampus: From Structural Monoculture to a System of Systems

2026-07-23 · updated 2026-07-24 · 1 min · 185 words

Ultra-Compact CNN Architectures for Tropical Bird Audio Detection on Microcontrollers

2026-07-23 · updated 2026-07-24 · 2 min · 385 words

Validating the Single Item Kawaii Measure

2026-07-23 · updated 2026-07-24 · 3 min · 512 words

Speech/Music/Audio Paper Digest 2026-07-23

2026-07-23 · updated 2026-07-24 · 18 min · 3726 words

A Situational Speech Synthesizer for Yoruba: System Design, Phonological Rule Architecture, and Orthographic Extensions for Contour

2026-07-22 · updated 2026-07-24 · 2 min · 283 words

Addressing Limited Data in Auditory Attention Decoding with Diffusion Generative Models

2026-07-22 · updated 2026-07-24 · 2 min · 417 words

Benchmarking Human and Automatic Speech Recognition of Diverse Speech: Initial Results

2026-07-22 · updated 2026-07-24 · 2 min · 385 words

Comparing Spectrogram Front-Ends for Abnormal Heart-Sound Detection with a Convolutional Neural Network

2026-07-22 · updated 2026-07-24 · 2 min · 321 words

Constrained CTC Decoding for Efficient Diacritic Restoration

2026-07-22 · updated 2026-07-24 · 2 min · 423 words

Content is What Remains: Invariant Speech Tokenization from Parallel Utterances

2026-07-22 · updated 2026-07-24 · 4 min · 808 words

CS-ETS: Chaos-Inspired Samba-Based EMG-To-Speech Synthesis with Nonlinear Chaotic Losses

2026-07-22 · updated 2026-07-24 · 3 min · 626 words

EmoEUS: Uncertainty Supervision for Multimodal Emotion Recognition in Conversation

2026-07-22 · updated 2026-07-24 · 3 min · 561 words

End-to-End Markov State Sequence Learning for Auditory Attention Decoding

2026-07-22 · updated 2026-07-24 · 3 min · 528 words

Fretiq: Browser-Native Electric Guitar String Classification via Engineered Spectral Features and Held-Out Free-Play Evaluation

2026-07-22 · updated 2026-07-24 · 2 min · 384 words

From a Multilingual Streaming ASR Backbone to Kenyan-Language Systems: Data-Centric Adaptation of Nemotron 3.5 for Kikuyu, Dholuo, and Kalenjin

2026-07-22 · updated 2026-07-24 · 2 min · 385 words

Fusion Embedding: A Unified Embedding Space for Text, Image, Video, and Audio

2026-07-22 · updated 2026-07-24 · 3 min · 559 words

MeetingToM: Evaluating Multimodal LLMs on Theory-of-Mind Reasoning in Multi-Party Meetings

2026-07-22 · updated 2026-07-24 · 4 min · 713 words

Staged Depth-Pruning Distillation of a Flow-Matching Text-to-Speech Teacher: A Compact Hindi Speech Synthesizer

2026-07-22 · updated 2026-07-24 · 3 min · 563 words

Summary of DCASE 2026 Task 5: Audio-Dependent Question Answering

2026-07-22 · updated 2026-07-24 · 3 min · 618 words

Teleportation Game: Quantum Teleportation in Multi-Agent Systems for Interactive Music

2026-07-22 · updated 2026-07-24 · 2 min · 353 words

Towards a reproducible cross-venue method for quantifying crowd noise in stadiums

2026-07-22 · updated 2026-07-24 · 3 min · 444 words

Towards Array-Invariant Speech Enhancement via Geometry-Aware Dynamic Convolution

2026-07-22 · updated 2026-07-24 · 4 min · 643 words

Transcription Policy as a Latent Variable: Activating Controllable Verbatim ASR with Word-Level Timing

2026-07-22 · updated 2026-07-24 · 3 min · 604 words

What the Waveform Knows: Transparent-first Speech and Audio Intelligence with Caption Studio

2026-07-22 · updated 2026-07-24 · 2 min · 237 words

Speech/Music/Audio Paper Digest 2026-07-22

2026-07-22 · updated 2026-07-24 · 17 min · 3461 words

Adaptive Momentum Enhanced Distributed Multichannel Active Noise Control for Faster Convergence under Communication Delays

2026-07-21 · updated 2026-07-24 · 2 min · 327 words

AI_LectureNote: A Retrospective Pilot Study of a Post-ASR Workflow for English-Script Rendering and Semantic Drift in Korean-English Medical Lectures

2026-07-21 · updated 2026-07-24 · 3 min · 550 words

AMECxSV: Adaptive Metadata-Driven Embedding-Fusion Calibration for X-Lingual Speaker Verification

2026-07-21 · updated 2026-07-24 · 4 min · 819 words

An Audio Language Model-Based Voice Concept Bottleneck Framework for Interpretable Health Assessment

2026-07-21 · updated 2026-07-24 · 2 min · 421 words

Audio Cross Verification Using Dual Alignment Likelihood Ratio Test

2026-07-21 · updated 2026-07-24 · 2 min · 372 words

Component-Level Ensemble Fusion for Speech and Environmental Sound Deepfake Detection

2026-07-21 · updated 2026-07-24 · 2 min · 418 words

Dense-Sparse Dynamic Time Warping for Customizing Piano Concerto Accompaniments

2026-07-21 · updated 2026-07-24 · 2 min · 389 words

Do Speech Tokens Leak Voiceprints? Speaker Inversion Attacks Against End-to-End Speech Language Models

2026-07-21 · updated 2026-07-24 · 3 min · 590 words

Efficient Audio-Visual Event Recognition via Knowledge Distillation and Dynamic INT8 Quantization of a Hybrid Cross-Attention Network

2026-07-21 · updated 2026-07-24 · 2 min · 353 words

EII-SCL: Harnessing Emotional Inertia for Multimodal Emotion Recognition in Conversation

2026-07-21 · updated 2026-07-24 · 3 min · 600 words

ESCUCHA: A Spanish Speech Benchmark for Heterogeneous Acoustic Conditions

2026-07-21 · updated 2026-07-24 · 2 min · 323 words

Explainable Lightweight Compact Deep Models for Speech Emotion Recognition

2026-07-21 · updated 2026-07-24 · 3 min · 533 words

FillGauss: Fine-Grained Filling-Aware Impact Sound Generation for 3D Gaussian Splatting

2026-07-21 · updated 2026-07-24 · 3 min · 539 words

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

2026-07-21 · updated 2026-07-24 · 3 min · 524 words

FlowSonic: Stable Zero-Shot Music Editing via High-Order Trajectory Integration

2026-07-21 · updated 2026-07-24 · 3 min · 492 words

Harness TTS: Towards Context-Aware Expressive Speech Synthesis with Harness Layer

2026-07-21 · updated 2026-07-24 · 3 min · 562 words

HARP: Harmonic-Aware Residual Partitioning for Neural Audio Codecs

2026-07-21 · updated 2026-07-24 · 3 min · 597 words

How Reliable Are Multimodal Signals of Conversational State? Evidence from Remote Dyadic Collaborative Tasks

2026-07-21 · updated 2026-07-24 · 4 min · 664 words

Is One Score Enough? Assessing Singing Quality of Songs with Temporal Score Curves

2026-07-21 · updated 2026-07-24 · 2 min · 420 words

Modeling turn-taking with distant viewing: investigating silence thresholds in human and AI-generated discourse

2026-07-21 · updated 2026-07-24 · 3 min · 503 words

Multi-Level Privacy-Preserving Dementia Detection from Speech via Targeted Adversarial Obfuscation and Representation Learning

2026-07-21 · updated 2026-07-24 · 3 min · 573 words

NABEATs: Noise-Aware Audio Representation Learning

2026-07-21 · updated 2026-07-24 · 4 min · 642 words

Pseudo-label distillation for discriminative anomalous sound detection

2026-07-21 · updated 2026-07-24 · 2 min · 411 words

Re-Sonance: A Dysarthric Asynchronous Real-Time Speech Conversion System Based on a Three-Stage Cascaded ASR-LLM-TTS Architecture

2026-07-21 · updated 2026-07-24 · 3 min · 462 words

RealDESED: A Real-World Domestic Sound Event Detection Benchmark

2026-07-21 · updated 2026-07-24 · 3 min · 618 words

Robust Summarization of Doctor-Patient Conversations: TalTech Systems for the Beyond Transcription Challenge

2026-07-21 · updated 2026-07-24 · 2 min · 264 words

SALMONN-2: Advancing General-Purpose Hearing Abilities with Self-Supervised Representations

2026-07-21 · updated 2026-07-24 · 4 min · 774 words

Should Missing Modalities Always Be Necessary to Repair for Multi-modal Sentiment Analysis?

2026-07-21 · updated 2026-07-24 · 3 min · 551 words

SSTMark: Robust Training-Free Semantic-Level Speech Watermarking

2026-07-21 · updated 2026-07-24 · 8 min · 1502 words

Team RAS in 11th ABAW Competition: Multimodal Ambivalence Recognition Approach

2026-07-21 · updated 2026-07-24 · 3 min · 466 words

The tttAI System for the TSA-ASR Task of the SmartGlasses Challenge 2026

2026-07-21 · updated 2026-07-24 · 2 min · 399 words

Time-Frequency Consistency Learning for Robust Speech Deepfake Detection

2026-07-21 · updated 2026-07-24 · 2 min · 376 words

When to Use Extra Context: Evidence-Grounded Terminology Adaptation for Simultaneous Speech Translation

2026-07-21 · updated 2026-07-24 · 2 min · 321 words

X-Translator: A Real-Time Multilingual Speaker-Aware Speech-to-Speech Translation System

2026-07-21 · updated 2026-07-24 · 5 min · 870 words

Speech/Music/Audio Paper Digest 2026-07-21

2026-07-21 · updated 2026-07-24 · 29 min · 6176 words

A Geometry-Limited Identification Floor and Its Consequences for Voice-Clone Attribution in Professional Voice Actors

2026-07-20 · updated 2026-07-24 · 3 min · 432 words

A Study of Parallelizable Alternatives to Dynamic Time Warping for Aligning Long Sequences

2026-07-20 · updated 2026-07-24 · 2 min · 424 words

AnovaX: A Local, Multi-Agent Voice Assistant with LLM Planning, Typed Executors, and Adaptive Recovery

2026-07-20 · updated 2026-07-24 · 3 min · 445 words

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

2026-07-20 · updated 2026-07-24 · 4 min · 700 words

AuEmoChat: Authentic Emotion Understanding and Rendering for Conversational Speech Synthesis

2026-07-20 · updated 2026-07-24 · 4 min · 668 words

AV-JEPA: Extending LeJEPA to Audio-Visual Self-Supervised Learning

2026-07-20 · updated 2026-07-24 · 3 min · 576 words

Constrained Hebbian Learning Supports Efficient Representational Allocation under Structural Constraints

2026-07-20 · updated 2026-07-24 · 4 min · 732 words

Controlling Implicit Shortcut Reliance in L2 Spoken English Auto-markers

2026-07-20 · updated 2026-07-24 · 3 min · 522 words

Data-driven Video Codec with Implicit Neural Representations

2026-07-20 · updated 2026-07-24 · 3 min · 502 words

Estimating the Reliability of Dynamic Time Warping Alignments Using Circumstantial Evidence

2026-07-20 · updated 2026-07-24 · 2 min · 359 words

Natural Backdoor Attacks on Speech Recognition Models

2026-07-20 · updated 2026-07-24 · 3 min · 564 words

Proof-Carrying Multimodal Timelines: Finite-Trace Modal Certificates for Video-Audio Consistency

2026-07-20 · updated 2026-07-24 · 4 min · 659 words

Segmental DTW: A Parallelizable Alternative to Dynamic Time Warping

2026-07-20 · updated 2026-07-24 · 2 min · 318 words

SpeechGuard: Online Defense against Backdoor Attacks on Speech Recognition Models

2026-07-20 · updated 2026-07-24 · 3 min · 567 words

StemFX: Learning Mixing Style Representations via Autoregressive FX Chain Prediction on Source-Separated Stems

2026-07-20 · updated 2026-07-24 · 2 min · 373 words

Speech/Music/Audio Paper Digest 2026-07-20

2026-07-20 · updated 2026-07-24 · 13 min · 2761 words

AlphaWiSE: Adaptive Weight Interpolation for Continual Multimodal Representation Learning

2026-07-17 · updated 2026-07-24 · 4 min · 682 words

Can Tokens Compete? Token Representations against Supervised CNN Backbones for BirdCLEF+ 2026

2026-07-17 · updated 2026-07-24 · 4 min · 682 words

Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants

2026-07-17 · updated 2026-07-24 · 2 min · 412 words

InCarEmo: A Multimodal Dataset for In-Cabin Emotion Recognition and Driver State Monitoring

2026-07-17 · updated 2026-07-24 · 3 min · 553 words

ITGPT: A Transformer Based Architecture for the Generation of Dance Dance Revolution and In the Groove Charts

2026-07-17 · updated 2026-07-24 · 2 min · 364 words

Large Audio Language Models for Spoofing-Aware Speaker Verification

2026-07-17 · updated 2026-07-24 · 3 min · 506 words

MIDI-RAE-JEPA: Hierarchical Representation Learning and Generation for Symbolic Music

2026-07-17 · updated 2026-07-24 · 3 min · 497 words

MultiRef-Compass: Towards Comprehensive Evaluation of Multi-Reference-to-Audio-Video Generation

2026-07-17 · updated 2026-07-24 · 2 min · 423 words

RW-Voice-EQ Bench: A Real World Benchmark for Evaluating Voice AI Systems

2026-07-17 · updated 2026-07-24 · 3 min · 474 words

SceneBind: Binding What and Where Across Vision, Audio and Language

2026-07-17 · updated 2026-07-24 · 2 min · 421 words

SLT 2026 REAL-TSE Challenge: Real-world Target Speaker Extraction from Conversational Recordings

2026-07-17 · updated 2026-07-24 · 3 min · 455 words

Stop Thinking, Start Looking: Efficient Post-Training for Multimodal Document Question Answering via Reasoning-Free Alignment

2026-07-17 · updated 2026-07-24 · 3 min · 442 words

Video = World + Event Stream

2026-07-17 · updated 2026-07-24 · 3 min · 438 words

WanSong v1.0 Technical Report

2026-07-17 · updated 2026-07-24 · 3 min · 529 words

What does the model actually see? Evaluation protocols and input availability in data-driven prediction of room acoustic parameters

2026-07-17 · updated 2026-07-24 · 3 min · 504 words

Speech/Music/Audio Paper Digest 2026-07-17

2026-07-17 · updated 2026-07-24 · 13 min · 2598 words

A Hybrid Mamba for Audio-Visual Navigation

2026-07-16 · updated 2026-07-24 · 3 min · 565 words

Adapting a Diffusion-Based Music Synthesis Model to Human Voice Conversion

2026-07-16 · updated 2026-07-24 · 3 min · 575 words

Auditing Protocol-Level Shortcuts in Large Audio Language Model Judges for Speech Evaluation

2026-07-16 · updated 2026-07-24 · 4 min · 680 words

AVSCap: Orchestrating Audio-Visual Synergy for Omni-modal Video Captioning

2026-07-16 · updated 2026-07-24 · 4 min · 803 words

Bring Music The Horizon: Music-Driven 360° Video Generation

2026-07-16 · updated 2026-07-24 · 2 min · 275 words

Cover First, Disagree Softly: Rethinking Mismatch-First Active Learning for Frame-Level Audio Classification

2026-07-16 · updated 2026-07-24 · 2 min · 360 words

Do LLMs Need Architectural Changes for Simultaneous Speech Translation? A Prefix-to-Prefix Data Driven Approach

2026-07-16 · updated 2026-07-24 · 3 min · 614 words

Efficient Text-to-Audio Generation via Pruning

2026-07-16 · updated 2026-07-24 · 2 min · 304 words

From Continuous Deployment to Queryable Dataset: Terabyte-Scale AIS-Aligned Passive Acoustic Labelling

2026-07-16 · updated 2026-07-24 · 2 min · 252 words

From Prediction to Collaboration: Interactive Symbolic Music Analysis

2026-07-16 · updated 2026-07-24 · 3 min · 523 words

Genre Bias or Aesthetic Perception? Identifying and Mitigating Shortcut Learning in Music Evaluation

2026-07-16 · updated 2026-07-24 · 3 min · 515 words

Greedy Volume Maximization of Gradient Embeddings for Long-Tailed Frame-Level Bioacoustic Active Learning

2026-07-16 · updated 2026-07-24 · 3 min · 429 words

Improving Text-to-Audio Instruction Following via Fine-Grained Feedback from Audio-Aware Large Language Models

2026-07-16 · updated 2026-07-24 · 5 min · 943 words

Live Gurbani Tracking: A Benchmark and Reference System for Captioning Sikh Kirtan

2026-07-16 · updated 2026-07-24 · 2 min · 390 words

MetaPerch: Learning from metadata for bioacoustics foundation models

2026-07-16 · updated 2026-07-24 · 3 min · 588 words

Music-to-Dance Generation via Atomic Movements

2026-07-16 · updated 2026-07-24 · 3 min · 453 words

Rethinking Speech Foundation Model Fine-tuning: Better SFT or Better Match?

2026-07-16 · updated 2026-07-24 · 5 min · 925 words

Self-supervised Speech Comparison for L2 Phone, Rhythm, and Intonation Scoring

2026-07-16 · updated 2026-07-24 · 3 min · 547 words

Task-Oriented Sensing and Covert Transmissions for Collaborative Multi-AUV Systems

2026-07-16 · updated 2026-07-24 · 2 min · 246 words

VIP-MINGLE: A Corpus for Videoconference and In-Person Multimodal Interaction in Group Language Engagement

2026-07-16 · updated 2026-07-24 · 2 min · 422 words

Speech/Music/Audio Paper Digest 2026-07-16

2026-07-16 · updated 2026-07-24 · 15 min · 3017 words

Audio Diarization: A New Paradigm for Exploring Audio Recordings with Unknown Event Classes

2026-07-15 · updated 2026-07-24 · 2 min · 386 words

Audio-Native Speech Recognition with a Frozen Discrete-Diffusion Language Model

2026-07-15 · updated 2026-07-24 · 3 min · 517 words

Automated Synthesis of Facial Mechanisms for Conversational Animatronic Robots

2026-07-15 · updated 2026-07-24 · 2 min · 402 words

AutoSIFT: Automatic Style Sifting for Controllable Speech Generation with Arbitrary Style Infilling

2026-07-15 · updated 2026-07-24 · 4 min · 773 words

ChartGenEval: Corruption-Tested Multi-Dimensional Feedback for Rhythm-Game Chart Generation

2026-07-15 · updated 2026-07-24 · 2 min · 381 words

Contrasting statistical patterns in melodic and molecular evolution reveal distinctive constraints in a culturally evolving system

2026-07-15 · updated 2026-07-24 · 2 min · 320 words

Do We Really Need Multimodal Emotion Language Models Larger Than 1B Parameters?

2026-07-15 · updated 2026-07-24 · 3 min · 576 words

DOA Estimation from One-Bit Magnitude-Only Measurements via Sign-Consistency Optimization

2026-07-15 · updated 2026-07-24 · 4 min · 824 words

Explainable-by-Design Audio Deepfake Detection via Wiener-Hopf Linear Prediction

2026-07-15 · updated 2026-07-24 · 5 min · 890 words

HSEmotion Team at the 11th ABAW Challenge: Multi-Task Learning and Ambivalence/Hesitancy Video Recognition

2026-07-15 · updated 2026-07-24 · 4 min · 669 words

Hybrid Continual Learning for Low-Resource Australian Aboriginal Language Identification

2026-07-15 · updated 2026-07-24 · 3 min · 555 words

Investigating the Integration of Spatial Information in Foundation-Model-Based Speaker Diarization

2026-07-15 · updated 2026-07-24 · 2 min · 387 words

Listen first: Output-based multi-microphone speech enhancement

2026-07-15 · updated 2026-07-24 · 3 min · 562 words

Low-Latency Neural Models for Real-Time Music Enhancement

2026-07-15 · updated 2026-07-24 · 3 min · 431 words

Neural Morphing: Sequence-Optimized Token-Level Morphing in Neural Audio Codecs

2026-07-15 · updated 2026-07-24 · 3 min · 585 words

Open-Source Intelligence and Music Information Retrieval for Geographic Attribution of Musical Affect and the Ecological Limits of Population Inference

2026-07-15 · updated 2026-07-24 · 3 min · 496 words

PolarBM: Complex-valued Boltzmann Machine for Modeling Audio Signals in Polar and Log-polar Coordinates

2026-07-15 · updated 2026-07-24 · 3 min · 502 words

Real-time Generation of Listener Nodding via Prediction of Kinematic Parameters for Avatar Dialogue Systems

2026-07-15 · updated 2026-07-24 · 6 min · 1210 words

Segregate, Refine, Integrate: Decomposing Multimodal Fusion for Sentiment Analysis

2026-07-15 · updated 2026-07-24 · 4 min · 822 words

Spatial-Frequency Cued Generative Fixed-Filter Active Noise Control Based on Deep Learning in Reverberant Environments

2026-07-15 · updated 2026-07-24 · 3 min · 462 words

The Sound of Absence: Audio-Language Embedding Models Struggle with Negation

2026-07-15 · updated 2026-07-24 · 3 min · 448 words

Traceback Translators Against Forgetting in Continual Fake Speech Detection

2026-07-15 · updated 2026-07-24 · 3 min · 456 words

UD-ASD: A Unified Diffusion Model for Anomalous Sound Detection

2026-07-15 · updated 2026-07-24 · 3 min · 428 words

What is a Musical Scale? Regularity and Convention in the Organization of Pitch

2026-07-15 · updated 2026-07-24 · 2 min · 250 words

ZipL-Dialog: Memory-Efficient Long-Form Spoken Dialog Synthesis via Latent Flow Matching

2026-07-15 · updated 2026-07-24 · 3 min · 500 words

Speech/Music/Audio Paper Digest 2026-07-15

2026-07-15 · updated 2026-07-24 · 23 min · 4806 words

A Closed-Form Noise-Sensitivity Asymmetry for Causal Branch Selection in Minimal-Array TDoA Localization

2026-07-14 · updated 2026-07-24 · 2 min · 411 words

A Production-Oriented Framework for Evaluation of SFX Generation

2026-07-14 · updated 2026-07-24 · 2 min · 426 words

An Objective Intelligibility Metric Evaluation on Spanish Speech

2026-07-14 · updated 2026-07-24 · 3 min · 501 words

Anamnesis: An Open-Source Platform for Large-Scale Backstory-Conditioned Survey Simulation

2026-07-14 · updated 2026-07-24 · 3 min · 538 words

Anysynth: Zero-Shot Instrument Cloning via In-Context Learning and Asymmetric Hierarchical Guidance

2026-07-14 · updated 2026-07-24 · 3 min · 627 words

ARIMA: Reconstruction-Grounded Predictive Representation Learning for Symbolic Music

2026-07-14 · updated 2026-07-24 · 2 min · 399 words

BackgroundMellow: A Multi-Modal Cohesive Framework for Narrative-Driven Rich Cinematic Soundscape Generation

2026-07-14 · updated 2026-07-24 · 2 min · 410 words

BeatEdit: Symbolic Music Generation as Explicit Editing

2026-07-14 · updated 2026-07-24 · 5 min · 893 words

Breaking the Quality–Intelligibility Trade-off in Streaming Target Speaker Extraction via Deep-Feature-Anchored Preference Optimization

2026-07-14 · updated 2026-07-24 · 4 min · 823 words

Casting Everything to Online API Services? A Survey of Integrating Localized Speech Recognition Models in Robotic Systems

2026-07-14 · updated 2026-07-24 · 2 min · 388 words

CHARM: Charge Calibration and Acoustic Rescue for LLM-based Multimodal Sarcasm Detection

2026-07-14 · updated 2026-07-24 · 5 min · 1065 words

CoFi-Lite: Pushing the Limits of Ultra-Lightweight Speech Enhancement

2026-07-14 · updated 2026-07-24 · 3 min · 607 words

Dance to Music Generation leveraging Pre-training with Unpaired data and Contrastive Alignment

2026-07-14 · updated 2026-07-24 · 3 min · 456 words

Data Augmentation for L2 English Speaking Assessment using TTS

2026-07-14 · updated 2026-07-24 · 3 min · 486 words

Difference-Driven Gating: Adaptive Feature Fusion for U-Net Decoder

2026-07-14 · updated 2026-07-24 · 6 min · 1124 words

ECHOv2: Two-Level Band-Splitting Representation Learning for Anomalous Sound Detection

2026-07-14 · updated 2026-07-24 · 5 min · 968 words

Efficiently Adapting Spoken Language Models for the Singaporean Context

2026-07-14 · updated 2026-07-24 · 3 min · 567 words

Encoder-Side Neuron Identification and Amplification for Acoustic Perception in Large Audio-Language Models

2026-07-14 · updated 2026-07-24 · 4 min · 643 words

Evaluating SSL and ViViT Architectures for Cross-Corpus Audio MOS Prediction via LODO Validation

2026-07-14 · updated 2026-07-24 · 2 min · 406 words

Evidence Subspace Projection: Measuring How Much Evidence Explains Deepfake Detection in Self-Supervised Speech Models

2026-07-14 · updated 2026-07-24 · 2 min · 358 words

FdAudio: MeanFlow-Anchored Fréchet-Distance Post-Training for One-Step Text-to-Audio Generation

2026-07-14 · updated 2026-07-24 · 4 min · 848 words

GigaAM Multilingual: Foundation Model for Underrepresented Languages

2026-07-14 · updated 2026-07-24 · 3 min · 533 words

GigaChat Audio: Time-aware Large Audio Language Model

2026-07-14 · updated 2026-07-24 · 2 min · 376 words

Graph Representation of RaagBase: A Unique Dataset for Hindustani Music

2026-07-14 · updated 2026-07-24 · 3 min · 481 words

Hearing Like Humans? Sound Symbolism and Perceptual Alignment in Speech Language Models

2026-07-14 · updated 2026-07-24 · 5 min · 949 words

Learn2Chat: Rethinking Dyadic Talking Heads via Interaction-Modulated Monologic Priors

2026-07-14 · updated 2026-07-24 · 2 min · 390 words

LightMem-Ego: Your AI Memory for Everyday Life

2026-07-14 · updated 2026-07-24 · 2 min · 361 words

Listen to the Features: Voice Anonymization Driven by Content Embedding Matching over Signal Reconstruction

2026-07-14 · updated 2026-07-24 · 3 min · 568 words

Local Multimodal Music Alignment from Global Supervision

2026-07-14 · updated 2026-07-24 · 3 min · 618 words

LOGOS: A Living Logic for AI Agent Teams That Evolve With Humans

2026-07-14 · updated 2026-07-24 · 6 min · 1240 words

MeloBottleneck: Self-Supervised Melody Skeleton Extraction with a Latent Subsequence Bottleneck

2026-07-14 · updated 2026-07-24 · 3 min · 606 words

MRUF: Multi-granularity Routing with Uncertainty-Aware Fusion for Robust Multimodal Sentiment Analysis

2026-07-14 · updated 2026-07-24 · 4 min · 750 words

MusicMark: A Robust Generative Watermarking Framework for Music Generation

2026-07-14 · updated 2026-07-24 · 8 min · 1524 words

Omni-Decision: A Progressive Evidence-State Agent System for Omni-Modal QA

2026-07-14 · updated 2026-07-24 · 4 min · 705 words

PC-Mix: Partial-Component Audio Spoofing Detection under Mixed Speech and Environmental Sound Conditions

2026-07-14 · updated 2026-07-24 · 2 min · 422 words

Perceived Annoyance in Multi-source Electric Vehicle AVAS Environments

2026-07-14 · updated 2026-07-24 · 3 min · 527 words

Qwen-Audio-VAE Technical Report

2026-07-14 · updated 2026-07-24 · 3 min · 619 words

Qwen-Music Technical Report

2026-07-14 · updated 2026-07-24 · 4 min · 657 words

Semantic Sampling via Learnable Observation Front Ends

2026-07-14 · updated 2026-07-24 · 4 min · 643 words

Simple Features and Honest Calibration for Ambivalence and Hesitancy Recognition in Video

2026-07-14 · updated 2026-07-24 · 3 min · 460 words

Synchronized Three-Dimensional Vocal-Tract Motion for Speech Synchronization via Joint-Embedding Predictive Architecture Alignment

2026-07-14 · updated 2026-07-24 · 2 min · 374 words

TabPFN beyond Tabular Data: Calibration and Accuracy on Multimodal Embeddings

2026-07-14 · updated 2026-07-24 · 6 min · 1216 words

Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation

2026-07-14 · updated 2026-07-24 · 5 min · 911 words

The SonicAGI System for the REAL-TSE Challenge

2026-07-14 · updated 2026-07-24 · 3 min · 592 words

Tight-Frame Reconstruction for Acoustic Intensity Estimation Using Cardioid Microphone Pairs

2026-07-14 · updated 2026-07-24 · 3 min · 524 words

Transcript-Free Lightweight Detection of Alzheimer’s Disease from Spontaneous Speech Using Handcrafted MFCC-Dominant Acoustic Biomarkers

2026-07-14 · updated 2026-07-24 · 3 min · 485 words

Unified Gradient Projection: Language-Balanced Continual Learning for Multilingual Low-Resource ASR

2026-07-14 · updated 2026-07-24 · 3 min · 563 words

Verifier-Guided Twelve-Tone Composition: A Generate-Verify-Repair Harness for Symbolic Music Generation

2026-07-14 · updated 2026-07-24 · 2 min · 405 words

VoxENES 2026: Benchmarking Generalization of Speech Spoofing Detectors Against LLM-Era TTS and Voice Conversion

2026-07-14 · updated 2026-07-24 · 3 min · 607 words

WaveNet-Style Guitar Amplifier Model Pruning for Real-Time iOS Deployment

2026-07-14 · updated 2026-07-24 · 2 min · 379 words

What You Train Is What You Get: Gender Bias, Training Composition, and Post-Hoc Mitigation in Audio Deepfake Detection

2026-07-14 · updated 2026-07-24 · 3 min · 577 words

Where Speech Enhancement Hurts Recognition: An Inference Time Polar Projection Diagnosis

2026-07-14 · updated 2026-07-24 · 3 min · 628 words

Which Languages Transfer Best to Warlpiri? A Similarity-Based Study for Low-Resource ASR

2026-07-14 · updated 2026-07-24 · 2 min · 406 words

Speech/Music/Audio Paper Digest 2026-07-14

2026-07-14 · updated 2026-07-24 · 44 min · 9279 words

Beyond Time Shifts: Adapting Omni-LLM as a Reference-Free Evaluator for Generative Audio-Visual Models

2026-07-13 · updated 2026-07-24 · 4 min · 807 words

Clean2FX: Label-conditioned modeling for clean-to-effect guitar audio transformations

2026-07-13 · updated 2026-07-24 · 3 min · 464 words

Dual-BEATs: Unlocking Zero-Shot Stereo Audio Perception in Audio Large Language Models via Dithering

2026-07-13 · updated 2026-07-24 · 2 min · 350 words

Event-Based Token Sequences for Audio-Conditioned Music-Game Level Modeling

2026-07-13 · updated 2026-07-24 · 2 min · 397 words

FreyaTTS Technical Report

2026-07-13 · updated 2026-07-24 · 3 min · 529 words

Immersive Social Interaction with VR and LLM-Assisted Humanoids

2026-07-13 · updated 2026-07-24 · 2 min · 272 words

Optimal Transport-based Semantic Alignment for LLM-based Audio-Visual Speech Recognition

2026-07-13 · 更新于 2026-07-24 · 3 min · 469 words

Phone Segmentation and Recognition through Phonological Activation Mapping

2026-07-13 · 更新于 2026-07-24 · 4 min · 677 words

ReGen: Hierarchical Multi-Prompt Representation Generation for Efficient Waveform Diffusion Models

2026-07-13 · 更新于 2026-07-24 · 7 min · 1333 words

SVF-CR: Synchronized Visual-Facial Cross-Refinement for Multimodal Ambivalence and Hesitancy Recognition

2026-07-13 · 更新于 2026-07-24 · 3 min · 459 words

Technical Report for MERL’s Real-TSE Challenge Submission

2026-07-13 · 更新于 2026-07-24 · 3 min · 449 words

Tokenizer Transplantation: Mitigating Autoregressive Collapse in Edge-Efficient Bengali ASR

2026-07-13 · 更新于 2026-07-24 · 3 min · 545 words

Tonnetz-Driven Graph Wedgelet for Harmonic Complexity Reduction in Music Scores

2026-07-13 · 更新于 2026-07-24 · 2 min · 298 words

Wan-Dancer: A Hierarchical Framework for Minute-scale Coherent Music-to-Dance Generation

2026-07-13 · 更新于 2026-07-24 · 3 min · 457 words

Speech/Music/Audio Paper Digest 2026-07-13

2026-07-13 · 更新于 2026-07-24 · 12 min · 2359 words

HeadRoom: Lightweight, Edge-deployable Pipeline for Adaptive Notification Routing

2026-07-11 · 更新于 2026-07-24 · 2 min · 377 words

Speech/Music/Audio Paper Digest 2026-07-11

2026-07-11 · 更新于 2026-07-24 · 1 min · 209 words

A Quantized Native Runtime for On-Device Semantic Audio Generation

2026-07-10 · 更新于 2026-07-24 · 5 min · 914 words

A Quantized Native Runtime for On-Device Semantic Audio Generation

2026-07-10 · 更新于 2026-07-24 · 5 min · 906 words

A Reliability Assessment of LALM Audio Judges for Full-Duplex Voice Agents

2026-07-10 · 更新于 2026-07-24 · 2 min · 420 words

A Reliability Assessment of LALM Audio Judges for Full-Duplex Voice Agents

2026-07-10 · 更新于 2026-07-24 · 3 min · 559 words

A Self-Supervised Approach for Minimal-Annotation Hydroacoustic Data Exploration

2026-07-10 · 更新于 2026-07-24 · 3 min · 592 words

A Self-Supervised Approach for Minimal-Annotation Hydroacoustic Data Exploration

2026-07-10 · 更新于 2026-07-24 · 2 min · 381 words

Best-of-N TTS Evaluation is Confounded by ASR Family Alignment

2026-07-10 · 更新于 2026-07-24 · 3 min · 618 words

Best-of-N TTS Evaluation is Confounded by ASR Family Alignment

2026-07-10 · 更新于 2026-07-24 · 3 min · 579 words

COALA: Robust Contextualized Speech-augmented Language Modeling for ASR via Contrastive Regularizer and Biasing Score Estimation

2026-07-10 · 更新于 2026-07-24 · 2 min · 360 words

COALA: Robust Contextualized Speech-augmented Language Modeling for ASR via Contrastive Regularizer and Biasing Score Estimation

2026-07-10 · 更新于 2026-07-24 · 3 min · 469 words

Diarization-Guided Qwen-ASR Adaptation for Multilingual Two-Speaker Conversational Speech

2026-07-10 · 更新于 2026-07-24 · 2 min · 421 words

Diarization-Guided Qwen-ASR Adaptation for Multilingual Two-Speaker Conversational Speech

2026-07-10 · 更新于 2026-07-24 · 3 min · 577 words

Inverse-designed meta processing units for multi-task near-field photonic computing

2026-07-10 · 更新于 2026-07-24 · 2 min · 324 words

Inverse-designed meta processing units for multi-task near-field photonic computing

2026-07-10 · 更新于 2026-07-24 · 2 min · 317 words

It Takes Few to TANGO: A Quantized Distributed Model for Binaural Speech Enhancement

2026-07-10 · 更新于 2026-07-24 · 7 min · 1468 words

It Takes Few to TANGO: A Quantized Distributed Model for Binaural Speech Enhancement

2026-07-10 · 更新于 2026-07-24 · 4 min · 718 words

Multimodal Digital Biomarker for Asthma: Complementary Roles of Vocal, Clinical and Demographic Factors

2026-07-10 · 更新于 2026-07-24 · 4 min · 844 words

Multimodal Digital Biomarker for Asthma: Complementary Roles of Vocal, Clinical and Demographic Factors

2026-07-10 · 更新于 2026-07-24 · 4 min · 695 words

Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks

2026-07-10 · 更新于 2026-07-24 · 5 min · 954 words

Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks

2026-07-10 · 更新于 2026-07-24 · 3 min · 605 words

MulTTiPop: A Multitrack Transcription Dataset for Pop Music

2026-07-10 · 更新于 2026-07-24 · 2 min · 313 words

MulTTiPop: A Multitrack Transcription Dataset for Pop Music

2026-07-10 · 更新于 2026-07-24 · 2 min · 301 words

MuScriptor: An Open Model for Multi-Instrument Music Transcription

2026-07-10 · 更新于 2026-07-24 · 3 min · 551 words

MuScriptor: An Open Model for Multi-Instrument Music Transcription

2026-07-10 · 更新于 2026-07-24 · 3 min · 498 words

On the Role of Conversational Timing in Synthetic Training Data for ASR

2026-07-10 · 更新于 2026-07-24 · 3 min · 441 words

On the Role of Conversational Timing in Synthetic Training Data for ASR

2026-07-10 · 更新于 2026-07-24 · 3 min · 527 words

PS4: Proxy-Supervised Joint Training for Real Target Speaker Extraction

2026-07-10 · 更新于 2026-07-24 · 2 min · 375 words

PS4: Proxy-Supervised Joint Training for Real Target Speaker Extraction

2026-07-10 · 更新于 2026-07-24 · 2 min · 314 words

SHAP-Weighted Cross-Modal Expert Fusion for Emotion and Sentiment Recognition: Evidence and Limits

2026-07-10 · 更新于 2026-07-24 · 3 min · 498 words

SHAP-Weighted Cross-Modal Expert Fusion for Emotion and Sentiment Recognition: Evidence and Limits

2026-07-10 · 更新于 2026-07-24 · 4 min · 731 words

Structural Bottlenecks on Frequency Representation in End-to-End Audio Models

2026-07-10 · 更新于 2026-07-24 · 3 min · 504 words

Structural Bottlenecks on Frequency Representation in End-to-End Audio Models

2026-07-10 · 更新于 2026-07-24 · 3 min · 607 words

Vidu S1: A Real-Time Interactive Video Generation Model

2026-07-10 · 更新于 2026-07-24 · 1 min · 203 words

Vidu S1: A Real-Time Interactive Video Generation Model

2026-07-10 · 更新于 2026-07-24 · 3 min · 578 words

When Synthetic Speech Is All You Have: Better Call GRPO

2026-07-10 · 更新于 2026-07-24 · 5 min · 1061 words

When Synthetic Speech Is All You Have: Better Call GRPO

2026-07-10 · 更新于 2026-07-24 · 7 min · 1292 words

Why Do You Say It Like That? A Phoneme-Level Framework for Explainable Speech Deepfake Detection

2026-07-10 · 更新于 2026-07-24 · 2 min · 354 words

Why Do You Say It Like That? A Phoneme-Level Framework for Explainable Speech Deepfake Detection

2026-07-10 · 更新于 2026-07-24 · 3 min · 427 words

Speech/Music/Audio Paper Digest 2026-07-10

2026-07-10 · 更新于 2026-07-24 · 17 min · 3559 words

Audio Sentiment Analysis via Distillation and Cross-Modal Integration of Generated Multilingual Transcripts

2026-07-09 · 更新于 2026-07-24 · 3 min · 439 words

Compress the Cache, Not the Speech Embedding: KV Compression for Efficient Speech LLMs

2026-07-09 · 更新于 2026-07-24 · 3 min · 441 words

Decoupling Conversational Dynamics in Full-Duplex Spoken Models through Reinforcement Learning

2026-07-09 · 更新于 2026-07-24 · 4 min · 834 words

EscFOA: Enhancing Spatial Learning for Visually Impaired Learners via Generative Spatial Audio in 360-Degree Educational Environments

2026-07-09 · 更新于 2026-07-24 · 2 min · 229 words

Extending Xenakis: From Architectural Geometry to Sonification of the Philips Pavilion

2026-07-09 · 更新于 2026-07-24 · 1 min · 189 words

Gradient-Based Speech-to-Text Alignment for Any ASR Model: From CTC to Speech LLMs

2026-07-09 · 更新于 2026-07-24 · 4 min · 710 words

MADB: A Large-Scale Music Aesthetics Dataset with Professional and Multi-Dimensional Annotations

2026-07-09 · 更新于 2026-07-24 · 6 min · 1075 words

MMGenre: Benchmarking Singing Voice Synthesis across Multiple Musical Genres

2026-07-09 · 更新于 2026-07-24 · 4 min · 699 words

Multimodal Voice Activity Projection for Turn-Taking in Social Robots with Voice-Activity-Related Pretrained Encoders

2026-07-09 · 更新于 2026-07-24 · 6 min · 1259 words

Rag Classification of Tagore Songs using Symbolic Music Notation and Novel Weighted Distance Measures

2026-07-09 · 更新于 2026-07-24 · 3 min · 524 words

Text-Independent Speaker Verification Using Discrete Audio Tokens

2026-07-09 · 更新于 2026-07-24 · 2 min · 415 words

Transformer-based segmentation of prosodic boundaries in Brazilian Portuguese

2026-07-09 · 更新于 2026-07-24 · 2 min · 410 words

UBG-Net: An Uncertainty-aware Bayesian Gating Network for Robust Audio-Visual Speech Recognition

2026-07-09 · 更新于 2026-07-24 · 3 min · 596 words

Speech/Music/Audio Paper Digest 2026-07-09

2026-07-09 · 更新于 2026-07-24 · 13 min · 2708 words

BlueMagpie-TTS: A Token-Efficient Tokenizer, Language Model, and TTS for Taiwanese-Accent Code-Switching Speech

2026-07-08 · 更新于 2026-07-24 · 4 min · 798 words

Designing Maintainable Hybrid Generative Systems: A Quantum-Inspired Approach to Automated Music Harmony Generation

2026-07-08 · 更新于 2026-07-24 · 3 min · 474 words

Determinantal point process sampling for bioacoustic active learning

2026-07-08 · 更新于 2026-07-24 · 2 min · 372 words

Distributed Multichannel Wiener Filtering for Topology-Unconstrained Wireless Acoustic Sensor Networks

2026-07-08 · 更新于 2026-07-24 · 2 min · 414 words

Escaping the Procrustean Bed: Groupwise Orthogonal Connectors for Audio-Language Models

2026-07-08 · 更新于 2026-07-24 · 2 min · 255 words

Few-Shot Class-Incremental Audio Classification Using Pseudo-Incrementally Trained Embedding Learner and Continually Updated Stochastic Classifier

2026-07-08 · 更新于 2026-07-24 · 7 min · 1399 words

Flow Matching-Based Speech Source Separation with Best-of-N Biometric Sampling

2026-07-08 · 更新于 2026-07-24 · 3 min · 458 words

ForestIR: Physics-Informed Forest Sound Simulation for Array-Based Bioacoustic Remote Sensing

2026-07-08 · 更新于 2026-07-24 · 2 min · 422 words

Fréchet Distance Loss on Speech Representations for Text-to-Speech Synthesis

2026-07-08 · 更新于 2026-07-24 · 2 min · 273 words

From Sinhala to Dhivehi: Cross-Lingual Transfer Learning for Low-Resource Speech Recognition

2026-07-08 · 更新于 2026-07-24 · 3 min · 479 words

From Textural Counterpoint to Feature Encoding: A Multi-Dimensional Machine Representation Study of Haydn’s “The Lark” Integrating Electroacoustic Analysis

2026-07-08 · 更新于 2026-07-24 · 2 min · 362 words

Gemma 4 Technical Report

2026-07-08 · 更新于 2026-07-24 · 7 min · 1286 words

Goodbye Equal Error Rate, Hello Local Information Disclosure: Evaluating Voice Anonymisation against 1-to-N Linkage Threats

2026-07-08 · 更新于 2026-07-24 · 3 min · 459 words

Hierarchical Acoustic-Semantic Modeling: Modality Separation and Semantic Coherence for Full-Duplex SLMs

2026-07-08 · 更新于 2026-07-24 · 4 min · 731 words

InsideSSL: Understanding Self-Supervised Speech Representations using a Model-Centric Perspective

2026-07-08 · 更新于 2026-07-24 · 4 min · 731 words

Learning-based Physics-Constrained Neural Kernel for Sound Field Estimation With Source-Position-Dependent Directional Weighting

2026-07-08 · 更新于 2026-07-24 · 2 min · 221 words

Multimodal Video-to-Music Recommendation via Semantic Retrieval and Temporal Reranking

2026-07-08 · 更新于 2026-07-24 · 4 min · 644 words

Music I Care About: Automated Multimodal Benchmarking of LLM Music Perception Skills on (Almost) Any Music

2026-07-08 · 更新于 2026-07-24 · 2 min · 346 words

NAVER LABS System Re-implementation for the IWSLT 2026 Instruction-Following Task

2026-07-08 · 更新于 2026-07-24 · 4 min · 679 words

Precise Video-to-Audio Generation with Cross-Modal Alignment in Latent Space

2026-07-08 · 更新于 2026-07-24 · 3 min · 512 words

Propose and Attend: Training-free MLLM Grounding Confidence via Multi-Token Localized Attention

2026-07-08 · 更新于 2026-07-24 · 4 min · 677 words

Revisiting the Relation Between Language Model Perplexity and ASR Word Error Rate for Modern End-to-End Speech Recognition

2026-07-08 · 更新于 2026-07-24 · 3 min · 497 words

TriA Pipeline: A Large-Scale Automatic Audio Annotation Pipeline For Audio Classification In Specific Scenarios

2026-07-08 · 更新于 2026-07-24 · 2 min · 277 words

Umm… With Transformers? Insights from Filled Pause Use across Four Slavic Parliaments

2026-07-08 · 更新于 2026-07-24 · 2 min · 273 words

Uncovering Latent Depression Severity for Binary Depression Detection via Advantage-weighting Ranking

2026-07-08 · 更新于 2026-07-24 · 5 min · 884 words

WordVoice: Explicit and Decoupled Multi-Dimensional Word-Level Control for LLM-Based TTS

2026-07-08 · 更新于 2026-07-24 · 2 min · 394 words

Speech/Music/Audio Paper Digest 2026-07-08

2026-07-08 · 更新于 2026-07-24 · 20 min · 4152 words

Adaptive Diversity-Uncertainty Active Learning with Redundancy Control for Bioacoustic Event Classification

2026-07-07 · 更新于 2026-07-24 · 2 min · 361 words

Adaptive Loss Balancing for Multi-Task Bioacoustic Classification of Bird Species and Call Types

2026-07-07 · 更新于 2026-07-24 · 3 min · 616 words

An Intervention-Based Framework for Shortcut Diagnosis in Spoofing Countermeasures

2026-07-07 · 更新于 2026-07-24 · 4 min · 655 words

ASD: Multi-Level Consistency-Driven Representation Learning

2026-07-07 · 更新于 2026-07-24 · 2 min · 426 words

Auto-AEG: Scalable Data Construction for Open-Vocabulary Audio Event Grounding

2026-07-07 · 更新于 2026-07-24 · 3 min · 501 words

CARD: Cross-component Audio Representation Distillation for Encoder-Free Audio Captioning

2026-07-07 · 更新于 2026-07-24 · 3 min · 469 words

CaReCoS: A Spectrogram based Visual Benchmark for Cardiac, Respiratory and Cough Sounds

2026-07-07 · 更新于 2026-07-24 · 2 min · 374 words

CHILDES-Aligned: A Curated Children's Speech Dataset via Multi-Model Timestamp Ensembling

2026-07-07 · 更新于 2026-07-24 · 3 min · 599 words

Context-Aware ASR for Mandarin Technical Lectures

2026-07-07 · 更新于 2026-07-24 · 2 min · 339 words

DELTA-TTS: Adapting Autoregressive Model into Diffusion Language Model for Text-to-Speech

2026-07-07 · 更新于 2026-07-24 · 2 min · 379 words

Deriving Benchmarking Datasets from Long-Form Recordings: Challenges and Opportunities

2026-07-07 · 更新于 2026-07-24 · 2 min · 318 words

DETECT-3B-Omni is Agnostic of Content and Demographics

2026-07-07 · 更新于 2026-07-24 · 3 min · 493 words

Doppelganger: Sound Effects and Their Synthetic Twins

2026-07-07 · 更新于 2026-07-24 · 3 min · 433 words

DuplexChat: Constructing Speaker-Separated Full-Duplex Dialogue Speech at Scale for Spoken Dialogue Language Modeling

2026-07-07 · 更新于 2026-07-24 · 2 min · 291 words

Evaluating the Effect of Linguistic Relatedness on Cross-Lingual Transfer in Large Multilingual Automatic Speech Recognition

2026-07-07 · 更新于 2026-07-24 · 2 min · 353 words

Information-Geometric Superposed Vowel Evaluation: Part 1. Moraic Syllabary (Japanese)

2026-07-07 · 更新于 2026-07-24 · 2 min · 256 words

Jointly Improving Dialect Identification and ASR in Indian Languages using Multimodal Feature Fusion

2026-07-07 · 更新于 2026-07-24 · 2 min · 424 words

Layer-wise Cross-Lingual Depression Detection from Speech: Analysis with Contrastive Alignment

2026-07-07 · 更新于 2026-07-24 · 2 min · 424 words

Lights, Camera, Carbon: Architectural Scaling Laws for Video Generation Energy Consumption

2026-07-07 · 更新于 2026-07-24 · 3 min · 493 words

Listen, Think, Transcribe: Continuous Latent Test-Time Scaling for ASR

2026-07-07 · 更新于 2026-07-24 · 4 min · 756 words

Metronome: Bound the Cache, Keep the Beat for Real-Time Interaction Model Serving

2026-07-07 · 更新于 2026-07-24 · 2 min · 326 words

Mixture-Constrained Max Pooling Improves Separation-Based Bird Species Classification

2026-07-07 · 更新于 2026-07-24 · 3 min · 429 words

MOSAIC: Interpretable Multi-Token Cross-Attention of Biophonetic and Self-Supervised Representations for Unified Voice Anti-Spoofing

2026-07-07 · 更新于 2026-07-24 · 2 min · 415 words

Noisy Environment Adaptation of Neural Speech Codec via Focal Mask and Noise Feature Separation

2026-07-07 · 更新于 2026-07-24 · 3 min · 515 words

NouveauVoice: Generating Novel Pseudo Speakers for Voice Anonymization

2026-07-07 · 更新于 2026-07-24 · 1 min · 126 words

OmniFocus: Query-Guided Modality-Balanced Token Compression for Omni-Modal Large Language Models

2026-07-07 · 更新于 2026-07-24 · 4 min · 718 words

Open-Set Source Tracing as Compositional Factors via Structured Prototypes

2026-07-07 · 更新于 2026-07-24 · 5 min · 933 words

Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning

2026-07-07 · 更新于 2026-07-24 · 1 min · 201 words

Physics-Informed Direction-of-Arrival Estimation Over Distributed Edge Devices

2026-07-07 · 更新于 2026-07-24 · 2 min · 375 words

Physiological Noise Augmentation Improves Non-Invasive Brain-to-Speech

2026-07-07 · 更新于 2026-07-24 · 2 min · 414 words

Probing Low-Level Acoustic Attribute Encoding in CLAP Audio Embeddings

2026-07-07 · 更新于 2026-07-24 · 3 min · 515 words

Progressive Refinement: An Iterative Pseudo-Labeling Approach for Mandarin-English Code-Switching ASR

2026-07-07 · 更新于 2026-07-24 · 2 min · 385 words

ProPS: Prompted Profile Synthesis for Natural Language-Conditioned Speaker Embedding Distributions

2026-07-07 · 更新于 2026-07-24 · 2 min · 342 words

Quantum-Inspired Harmonic Decision Models: A Computational Framework for Music Generation

2026-07-07 · 更新于 2026-07-24 · 2 min · 406 words

QuaSR: Quality-Aware Sample Reweighting for Pacific Indigenous Speech Recognition

2026-07-07 · 更新于 2026-07-24 · 5 min · 864 words

RABBiT: Rapidly adaptive BOLD foundation model via brain-tuning for accurate zero-shot and few-shot prediction of speech-elicited responses in the brain

2026-07-07 · 更新于 2026-07-24 · 2 min · 345 words

Ranking the Impact of Contextual Specialization in Neural Speech Enhancement

2026-07-07 · 更新于 2026-07-24 · 3 min · 471 words

REDDIT: Correcting Model-Generated Timestamp Drift in ASR without Forgetting via Replay-Based Distribution Editing

2026-07-07 · 更新于 2026-07-24 · 2 min · 319 words

Reinforcement Learning for Data-Efficient Code-Switched ASR

2026-07-07 · 更新于 2026-07-24 · 3 min · 634 words

S-DiverSe: Spanish Diverse Speech

2026-07-07 · 更新于 2026-07-24 · 3 min · 470 words

Sampling Bias Compensation for Robust Evaluation of Audio Classification Systems with Partially Labeled Evaluation Datasets

2026-07-07 · 更新于 2026-07-24 · 3 min · 465 words

Speaker-Aware Temporal Aggregation Strategies on Segment Representations for Depression Detection in Dyadic Interaction: A Benchmark Study

2026-07-07 · 更新于 2026-07-24 · 3 min · 562 words

Speaker-Disentangled Chunk-Wise Regression for Syllabic Tokenization

2026-07-07 · 更新于 2026-07-24 · 4 min · 673 words

SPEARBench: A Benchmark for Naturalness Evaluation in Streaming Speech-to-Speech Language Models

2026-07-07 · 更新于 2026-07-24 · 4 min · 822 words

Streaming Neural Speech Codecs through Time-Invariant Representations

2026-07-07 · 更新于 2026-07-24 · 5 min · 926 words

SynSFX: Multi-Model Sound Effects Synthesis Dataset for Deepfake Detection and Evaluation

2026-07-07 · 更新于 2026-07-24 · 3 min · 457 words

Taste-aware music retrieval from audio embeddings

2026-07-07 · 更新于 2026-07-24 · 4 min · 770 words

TokAN: Accent Normalization Using Self-Supervised Speech Tokens

2026-07-07 · 更新于 2026-07-24 · 2 min · 350 words

Towards Digital Preservation of Efik: TTS for a Low-Resource African Language

2026-07-07 · 更新于 2026-07-24 · 2 min · 290 words

Towards Language-Agnostic Speech Inversion

2026-07-07 · 更新于 2026-07-24 · 2 min · 377 words

Towards Robust Uncertainty-Aware Speaker Modeling

2026-07-07 · 更新于 2026-07-24 · 4 min · 687 words

TRACE-EVC: Text-Guided Relative Affective Control for Zero-Shot Emotional Voice Conversion

2026-07-07 · 更新于 2026-07-24 · 3 min · 567 words

Training-Free Model Selection and Domain-Aware Score Calibration for First-Shot Anomalous Sound Detection

2026-07-07 · 更新于 2026-07-24 · 3 min · 563 words

Trajectory Variance: An Unsupervised Measure of Developmental Vocal Plasticity in Birdsong

2026-07-07 · 更新于 2026-07-24 · 2 min · 286 words

Unified Audio Intelligence Without Regressing on Text Intelligence

2026-07-07 · 更新于 2026-07-24 · 2 min · 323 words

UniSkip-Mamba: A Frequency-Aware State Space Model for Audio-Visual Temporal Forgery Localization

2026-07-07 · 更新于 2026-07-24 · 3 min · 560 words

Wan-Streamer v0.2: Higher Resolution, Same Latency

2026-07-07 · 更新于 2026-07-24 · 2 min · 381 words

Weakly Guided and Autoregressive Beamformer Parameterization for Generalizable Moving Speaker Extraction in Higher-Order Ambisonics

2026-07-07 · 更新于 2026-07-24 · 2 min · 310 words

Speech/Music/Audio Paper Digest 2026-07-07

2026-07-07 · 更新于 2026-07-24 · 47 min · 9986 words

VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval

2026-07-06 · 更新于 2026-07-24 · 3 min · 447 words

Speech/Music/Audio Paper Digest 2026-07-06

2026-07-06 · 更新于 2026-07-24 · 1 min · 157 words

ICML 2026 Speech/Audio Papers: Detailed Analysis

2026-07-04 · 更新于 2026-07-24 · 108 min · 22980 words

-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

2026-07-04 · 更新于 2026-07-24 · 3 min · 576 words

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

2026-07-04 · 更新于 2026-07-24 · 4 min · 735 words

Abstraction Induces the Brain Alignment of Language and Speech Models

2026-07-04 · 更新于 2026-07-24 · 4 min · 658 words

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

2026-07-04 · 更新于 2026-07-24 · 3 min · 557 words

ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools — From Consensus Learning to Ambiguity-Driven Emotion Reasoning

2026-07-04 · 更新于 2026-07-24 · 3 min · 602 words

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

2026-07-04 · 更新于 2026-07-24 · 3 min · 447 words

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

2026-07-04 · 更新于 2026-07-24 · 4 min · 642 words

Alethia: a Foundational Encoder for Voice Deepfakes

2026-07-04 · 更新于 2026-07-24 · 2 min · 322 words

An Exterior Method for Nonnegative Matrix Factorization

2026-07-04 · 更新于 2026-07-24 · 2 min · 403 words

Ariadne’s Thread of LipSync: Unraveling Forgeries via Inconsistency between Lip Motions and Head Poses

2026-07-04 · 更新于 2026-07-24 · 4 min · 698 words

Attend to Anything: Foundation Model for Unified Human Attention Modeling

2026-07-04 · 更新于 2026-07-24 · 2 min · 414 words

AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

2026-07-04 · 更新于 2026-07-24 · 3 min · 550 words

AudioMosaic: Contrastive Masked Audio Representation Learning

2026-07-04 · 更新于 2026-07-24 · 4 min · 794 words

AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

2026-07-04 · 更新于 2026-07-24 · 2 min · 342 words

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

2026-07-04 · 更新于 2026-07-24 · 4 min · 644 words

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

2026-07-04 · 更新于 2026-07-24 · 2 min · 354 words

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

2026-07-04 · 更新于 2026-07-24 · 2 min · 328 words

BAT: Better Audio Transformer Guided by Convex Gated Probing

2026-07-04 · 更新于 2026-07-24 · 8 min · 1519 words

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

2026-07-04 · 更新于 2026-07-24 · 4 min · 700 words

BFCL Audio: An Audio Function Calling Evaluation for Large Language Models

2026-07-04 · 更新于 2026-07-24 · 3 min · 623 words

Bioacoustic Geolocation: Species Sounds as Geographic Signals

2026-07-04 · 更新于 2026-07-24 · 4 min · 721 words

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

2026-07-04 · 更新于 2026-07-24 · 4 min · 641 words

Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

2026-07-04 · 更新于 2026-07-24 · 3 min · 590 words

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

2026-07-04 · 更新于 2026-07-24 · 2 min · 373 words

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

2026-07-04 · 更新于 2026-07-24 · 2 min · 365 words

CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

2026-07-04 · 更新于 2026-07-24 · 6 min · 1131 words

ConsMSA: Semantic Distribution Consistency Learning for Multimodal Sentiment Analysis

2026-07-04 · 更新于 2026-07-24 · 2 min · 266 words

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

2026-07-04 · 更新于 2026-07-24 · 4 min · 820 words

Decoupling The “What” and “Where” With Polar Coordinate Positional Embedding

2026-07-04 · 更新于 2026-07-24 · 3 min · 624 words

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

2026-07-04 · 更新于 2026-07-24 · 2 min · 411 words

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

2026-07-04 · 更新于 2026-07-24 · 4 min · 832 words

DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

2026-07-04 · 更新于 2026-07-24 · 3 min · 603 words

Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

2026-07-04 · 更新于 2026-07-24 · 2 min · 336 words

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

2026-07-04 · 更新于 2026-07-24 · 3 min · 587 words

EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs

2026-07-04 · 更新于 2026-07-24 · 5 min · 883 words

Efficient Distributed MLLM Training with Cornstarch

2026-07-04 · 更新于 2026-07-24 · 3 min · 560 words

Efficient Multi-modal Dataset Distillation via Analytic Parameter Matching

2026-07-04 · 更新于 2026-07-24 · 4 min · 671 words

Efficient, Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion

2026-07-04 · 更新于 2026-07-24 · 3 min · 554 words

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

2026-07-04 · 更新于 2026-07-24 · 5 min · 959 words

FakeWorld 1.0: An Omni-modal Benchmark for Fake Media and Content

2026-07-04 · 更新于 2026-07-24 · 2 min · 230 words

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

2026-07-04 · 更新于 2026-07-24 · 3 min · 544 words

From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

2026-07-04 · 更新于 2026-07-24 · 4 min · 746 words

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

2026-07-04 · 更新于 2026-07-24 · 2 min · 404 words

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

2026-07-04 · 更新于 2026-07-24 · 3 min · 632 words

Group Cognition Learning: Making Everything Better Through Controlled Two-Stage Agents Collaboration

2026-07-04 · 更新于 2026-07-24 · 3 min · 626 words

Hearing Without Noticing? Attention-Aware Stealthy Black-Box Adversarial Audio Attacks

2026-07-04 · 更新于 2026-07-24 · 2 min · 367 words

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

2026-07-04 · 更新于 2026-07-24 · 6 min · 1087 words

HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

2026-07-04 · 更新于 2026-07-24 · 2 min · 336 words

INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments

2026-07-04 · 更新于 2026-07-24 · 1 min · 161 words

IVQ: Structured and Lightweight Vector Quantization via Binary Hierarchical Composition Inspired by

2026-07-04 · 更新于 2026-07-24 · 4 min · 823 words

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

2026-07-04 · 更新于 2026-07-24 · 4 min · 742 words

Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits

2026-07-04 · 更新于 2026-07-24 · 3 min · 444 words

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

2026-07-04 · 更新于 2026-07-24 · 4 min · 753 words

Language Model Augmented Semi-Supervised Statistical Inference

2026-07-04 · 更新于 2026-07-24 · 2 min · 328 words

Learning Tight Rejection Boundaries without Negatives for Strict One-Class Audio Deepfake Detection

2026-07-04 · 更新于 2026-07-24 · 3 min · 594 words

LightAVSeg: Lightweight Audio-Visual Segmentation

2026-07-04 · 更新于 2026-07-24 · 3 min · 624 words

Listening Through the Noise: Cauchy-Driven Diffusion Bridges for Robust Gastrointestinal Auscultation and Clinical Benchmarking

2026-07-04 · 更新于 2026-07-24 · 5 min · 1053 words

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

2026-07-04 · 更新于 2026-07-24 · 2 min · 346 words

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

2026-07-04 · 更新于 2026-07-24 · 2 min · 306 words

MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization

2026-07-04 · 更新于 2026-07-24 · 3 min · 463 words

MetaPerch: Learning from metadata for bioacoustics foundation models

2026-07-04 · 更新于 2026-07-24 · 2 min · 398 words

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-07-04 · 更新于 2026-07-24 · 3 min · 538 words

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

2026-07-04 · 更新于 2026-07-24 · 3 min · 439 words

Multimodal Fact-Level Attribution for Verifiable Reasoning

2026-07-04 · 更新于 2026-07-24 · 3 min · 558 words

Multimodal Fusion via Self-Consistent Task-Gradient Fields

2026-07-04 · 更新于 2026-07-24 · 3 min · 562 words

Multimodal Latent Language Modeling with Next-Token Diffusion

2026-07-04 · 更新于 2026-07-24 · 2 min · 384 words

Multimodal Meta-Verifier with Explicit Structured Recalibration

2026-07-04 · 更新于 2026-07-24 · 3 min · 441 words

Multiple Choice Learning of Low-Rank Adapters for Language Modeling

2026-07-04 · 更新于 2026-07-24 · 4 min · 682 words

MusicDET: Zero-Shot AI-Generated Music Detection

2026-07-04 · 更新于 2026-07-24 · 4 min · 655 words

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

2026-07-04 · 更新于 2026-07-24 · 2 min · 377 words

Native Active Perception as Reasoning for Omni-Modal Understanding

2026-07-04 · 更新于 2026-07-24 · 2 min · 367 words

Neural-Inspired Modeling of Auditory Selection and Compensation for Audio-Visual Speech Separation

2026-07-04 · 更新于 2026-07-24 · 4 min · 662 words

NeuroCLUS: A Foundation Model with Functional Clustering for Intracranial Neural Decoding

2026-07-04 · 更新于 2026-07-24 · 4 min · 788 words

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

2026-07-04 · 更新于 2026-07-24 · 4 min · 720 words

Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

2026-07-04 · 更新于 2026-07-24 · 3 min · 469 words

OmniFit: Bridging Modalities via Layer-Adaptive Token Compression for Omnimodal Large Language Models

2026-07-04 · 更新于 2026-07-24 · 3 min · 466 words

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

2026-07-04 · 更新于 2026-07-24 · 5 min · 1042 words

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

2026-07-04 · 更新于 2026-07-24 · 7 min · 1345 words

OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

2026-07-04 · 更新于 2026-07-24 · 3 min · 439 words

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

2026-07-04 · 更新于 2026-07-24 · 4 min · 783 words

PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation

2026-07-04 · 更新于 2026-07-24 · 3 min · 451 words

PCRNet: Phase-aware Complex Refinement Network for EEG-based Auditory Attention Decoding

2026-07-04 · 更新于 2026-07-24 · 2 min · 385 words

PHALAR: Phasors for Learned Musical Audio Representations

2026-07-04 · 更新于 2026-07-24 · 5 min · 886 words

PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs

2026-07-04 · 更新于 2026-07-24 · 3 min · 512 words

PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

2026-07-04 · 更新于 2026-07-24 · 4 min · 836 words

Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

2026-07-04 · 更新于 2026-07-24 · 1 min · 161 words

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

2026-07-04 · 更新于 2026-07-24 · 3 min · 475 words

PRIM: Cooperative Dynamic Token Compression for Efficient Large Multimodal Models

2026-07-04 · 更新于 2026-07-24 · 2 min · 293 words

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

2026-07-04 · 更新于 2026-07-24 · 2 min · 344 words

Probing Cross-modal Information Hubs in Audio-Visual LLMs

2026-07-04 · 更新于 2026-07-24 · 1 min · 202 words

Quaternion Self-Attention with Shared Scores

2026-07-04 · 更新于 2026-07-24 · 5 min · 882 words

Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration

2026-07-04 · 更新于 2026-07-24 · 5 min · 856 words

Real-World Unsupervised Models Generalize to Predict Brain Responses to Out-of-Distribution Stimuli

2026-07-04 · 更新于 2026-07-24 · 2 min · 423 words

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

2026-07-04 · 更新于 2026-07-24 · 2 min · 409 words

ReGen: Hierarchical Multi-Prompt Representation Generation for Efficient Waveform Diffusion Models

2026-07-04 · 更新于 2026-07-24 · 3 min · 429 words

REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

2026-07-04 · 更新于 2026-07-24 · 4 min · 683 words

Rethinking Attention in Spiking Transformers: Overcoming Density Bias with Set Similarity

2026-07-04 · 更新于 2026-07-24 · 3 min · 471 words

Robust Signal Enhancement via Fractional Detail Views and Knowledge Guided Multi-view Fusion

2026-07-04 · 更新于 2026-07-24 · 3 min · 446 words

SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

2026-07-04 · 更新于 2026-07-24 · 4 min · 771 words

SAM Audio: Segment Anything in Audio

2026-07-04 · 更新于 2026-07-24 · 4 min · 651 words

SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering

2026-07-04 · 更新于 2026-07-24 · 3 min · 483 words

Scaling Behavior in Model Fine-tuning for Audio DeepFake Detection

2026-07-04 · 更新于 2026-07-24 · 2 min · 279 words

Scaling Transformers for End-to-End Discrete Audio Tokenization

2026-07-04 · 更新于 2026-07-24 · 3 min · 638 words

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

2026-07-04 · 更新于 2026-07-24 · 2 min · 303 words

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

2026-07-04 · 更新于 2026-07-24 · 3 min · 596 words

Simultaneous Speech-to-Speech Translation Without Aligned Data

2026-07-04 · 更新于 2026-07-24 · 2 min · 326 words

SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

2026-07-04 · 更新于 2026-07-24 · 2 min · 356 words

SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

2026-07-04 · 更新于 2026-07-24 · 3 min · 556 words

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

2026-07-04 · 更新于 2026-07-24 · 3 min · 440 words

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

2026-07-04 · 更新于 2026-07-24 · 2 min · 323 words

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

2026-07-04 · 更新于 2026-07-24 · 5 min · 1028 words

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Defense with SALMONN-Guard

2026-07-04 · 更新于 2026-07-24 · 4 min · 738 words

Spherical Procrustes Alignment for Reliable Medical Audio Diagnosis

2026-07-04 · 更新于 2026-07-24 · 5 min · 901 words

Stable Spectral Copula Alignment for Robust Multimodal Learning

2026-07-04 · 更新于 2026-07-24 · 2 min · 356 words

STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation

2026-07-04 · 更新于 2026-07-24 · 7 min · 1335 words

STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

2026-07-04 · 更新于 2026-07-24 · 4 min · 761 words

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

2026-07-04 · 更新于 2026-07-24 · 3 min · 570 words

SURF: Separation via Unsupervised Remixing Flow

2026-07-04 · 更新于 2026-07-24 · 5 min · 1043 words

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

2026-07-04 · 更新于 2026-07-24 · 4 min · 663 words

TextME: Bridging Unseen Modalities Through Text Descriptions

2026-07-04 · 更新于 2026-07-24 · 3 min · 615 words

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

2026-07-04 · 更新于 2026-07-24 · 4 min · 782 words

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

2026-07-04 · 更新于 2026-07-24 · 3 min · 503 words

TMD-Bench: A Multi-Level Evaluation Paradigm for Music–Dance Co-Generation

2026-07-04 · 更新于 2026-07-24 · 3 min · 636 words

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

2026-07-04 · 更新于 2026-07-24 · 2 min · 399 words

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

2026-07-04 · 更新于 2026-07-24 · 2 min · 389 words

Two-dimensional quantization for geometry-aware audio coding

2026-07-04 · 更新于 2026-07-24 · 4 min · 740 words

UltraLIF: Fully Differentiable Spiking Neural Networks via Ultradiscretization and Max-Plus Algebra

2026-07-04 · 更新于 2026-07-24 · 3 min · 544 words

UniFLoW: Universal Multi-Modal Federated LoRA Fine-Tuning Framework with Analytical Aggregation

2026-07-04 · 更新于 2026-07-24 · 2 min · 411 words

Universal Algorithm-Implicit Learning

2026-07-04 · 更新于 2026-07-24 · 4 min · 749 words

Unlocking Cross-Modal Biosignal Synthesis: A Temporally-Aware VAE-Diffusion Model

2026-07-04 · 更新于 2026-07-24 · 4 min · 700 words

Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

2026-07-04 · 更新于 2026-07-24 · 3 min · 547 words

V-LynX: Token Interface Alignment for Video+X LLMs

2026-07-04 · 更新于 2026-07-24 · 2 min · 419 words

VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion

2026-07-04 · 更新于 2026-07-24 · 4 min · 839 words

video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

2026-07-04 · 更新于 2026-07-24 · 3 min · 523 words

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

2026-07-04 · 更新于 2026-07-24 · 4 min · 657 words

WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention

2026-07-04 · 更新于 2026-07-24 · 3 min · 494 words

Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language

2026-07-04 · 更新于 2026-07-24 · 3 min · 436 words

A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

2026-07-03 · 更新于 2026-07-24 · 2 min · 320 words

A Multi-Branch Hierarchy-Aware Framework for Heterogeneous Audio Classification

2026-07-03 · 更新于 2026-07-24 · 3 min · 461 words

An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation

2026-07-03 · 更新于 2026-07-24 · 3 min · 626 words

Audio-Based Understanding of Audiobook Narration Appeal

2026-07-03 · 更新于 2026-07-24 · 2 min · 281 words

Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR

2026-07-03 · 更新于 2026-07-24 · 3 min · 441 words

CNN Models for Microphone Array Covariance Matrix Upsampling and Acoustic Imaging

2026-07-03 · 更新于 2026-07-24 · 2 min · 276 words

Cross Domain Few-Shot Class-Incremental Audio Classification Via Adversarial Contrastive Learning

2026-07-03 · 更新于 2026-07-24 · 2 min · 332 words

Decomposer: Learning to Decompile Symbolic Music to Programs

2026-07-03 · 更新于 2026-07-24 · 2 min · 323 words

DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning

2026-07-03 · 更新于 2026-07-24 · 5 min · 857 words

Enhancing Acoustic-to-Articulatory Inversion with Multi-Target Pretraining for Low-Resource Settings

2026-07-03 · 更新于 2026-07-24 · 6 min · 1175 words

Few-Shot Open-Set Audio Classification Using Attention Information-Fused Prototypes

2026-07-03 · 更新于 2026-07-24 · 2 min · 298 words

From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages

2026-07-03 · 更新于 2026-07-24 · 3 min · 599 words

H-SAGE: Holistic Speaker-Aware Guided Experts for MoE-based Multi-Talker ASR

2026-07-03 · 更新于 2026-07-24 · 2 min · 374 words

LMPAN: A Lightweight Multi-Path Alignment Network for Joint Full-Duplex Acoustic Echo Cancellation and Noise Suppression

2026-07-03 · 更新于 2026-07-24 · 4 min · 807 words

NAVER LABS Europe Submission to the Instruction-following 2026 Short Track

2026-07-03 · 更新于 2026-07-24 · 3 min · 464 words

Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers

2026-07-03 · 更新于 2026-07-24 · 3 min · 529 words

Pmeta-TLA: Backdoor Attacks for Speech Classification Models via Meta-Learning with Timbre Leakage Attack

2026-07-03 · 更新于 2026-07-24 · 3 min · 499 words

Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score

2026-07-03 · 更新于 2026-07-24 · 3 min · 480 words

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

2026-07-03 · 更新于 2026-07-24 · 3 min · 598 words

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

2026-07-03 · 更新于 2026-07-24 · 3 min · 530 words

RT-Tango: Real-Time Distributed Binaural Speech Enhancement for Low-Power Hearing Aid Devices

2026-07-03 · 更新于 2026-07-24 · 2 min · 417 words

SelectTSL: Prompt-Guided Selective Target Sound Localization in Complex Scenarios

2026-07-03 · 更新于 2026-07-24 · 5 min · 936 words

Self-Supervised Test-Time Tuning for Packet Loss Concealment

2026-07-03 · 更新于 2026-07-24 · 4 min · 686 words

SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings

2026-07-03 · 更新于 2026-07-24 · 2 min · 247 words

Spatial Speech Perception Systems: A Survey of Sound Source Localization, Directional Enhancement, and Speech Recognition

2026-07-03 · 更新于 2026-07-24 · 4 min · 737 words

Speaker head orientation estimation with a single microphone array using phase spectrogram features

2026-07-03 · 更新于 2026-07-24 · 2 min · 286 words

Towards a Phonology-Informed Evaluation of Multilingual TTS

2026-07-03 · 更新于 2026-07-24 · 2 min · 324 words

TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue

2026-07-03 · 更新于 2026-07-24 · 2 min · 284 words

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

2026-07-03 · 更新于 2026-07-24 · 2 min · 377 words

Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

2026-07-03 · 更新于 2026-07-24 · 2 min · 273 words

UT-AISTimprt submission for ICME 2026 Grand Challenge on Academic Text-to-Music Generation

2026-07-03 · 更新于 2026-07-24 · 2 min · 304 words

语音/音乐/音频论文速递 2026-07-03

2026-07-03 · 更新于 2026-07-24 · 25 min · 5320 words

A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models

2026-07-02 · 更新于 2026-07-24 · 3 min · 596 words

A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

2026-07-02 · 更新于 2026-07-24 · 2 min · 257 words

Adaptive Perturbation Selection for Contrastive Audio Decoding

2026-07-02 · 更新于 2026-07-24 · 2 min · 367 words

AmbiDrop: Ambisonics-Based Array-Agnostic Neural Speech Enhancement

2026-07-02 · 更新于 2026-07-24 · 5 min · 928 words

Automatic Detection of Stress from Speech in the Trier Social Stress Test

2026-07-02 · 更新于 2026-07-24 · 4 min · 695 words

AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization

2026-07-02 · 更新于 2026-07-24 · 4 min · 645 words

Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages

2026-07-02 · 更新于 2026-07-24 · 3 min · 536 words

Do Multimodal Large Language Models Need Reasoning to Classify Dementia from Speech?

2026-07-02 · 更新于 2026-07-24 · 2 min · 386 words

Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis

2026-07-02 · 更新于 2026-07-24 · 2 min · 357 words

Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition

2026-07-02 · 更新于 2026-07-24 · 3 min · 490 words

From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning

2026-07-02 · 更新于 2026-07-24 · 1 min · 194 words

MG-RWKV: Multi-Grained Context-Aware RWKV for Temporal Forgery Localization

2026-07-02 · 更新于 2026-07-24 · 4 min · 710 words

NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs

2026-07-02 · 更新于 2026-07-24 · 3 min · 435 words

ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

2026-07-02 · 更新于 2026-07-24 · 3 min · 468 words

Positive-Incentive Noise Predictor for Adversarial Purification in Speaker Verification

2026-07-02 · 更新于 2026-07-24 · 3 min · 509 words

Speech Playground: An Interactive Tool for Speech Analysis and Comparison

2026-07-02 · 更新于 2026-07-24 · 2 min · 252 words

语音/音乐/音频论文速递 2026-07-02

2026-07-02 · 更新于 2026-07-24 · 13 min · 2691 words

A Fair and Transparent Framework for Speech-Based Depression Detection: Balancing Interpretability and Performance

2026-07-01 · 更新于 2026-07-24 · 3 min · 537 words

A First Exploration of Neuromorphic OT-CFM for Multi-Speaker VSR

2026-07-01 · 更新于 2026-07-24 · 4 min · 685 words

Adapting Foundation ASR Models to Dysarthric Speech: A Case Study

2026-07-01 · 更新于 2026-07-24 · 1 min · 209 words

ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models

2026-07-01 · 更新于 2026-07-24 · 2 min · 405 words

Amplifying Membership Signal Through Chained Regeneration

2026-07-01 · 更新于 2026-07-24 · 4 min · 659 words

ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection

2026-07-01 · 更新于 2026-07-24 · 3 min · 456 words

Attacking UTMOS: Probing the Robustness of a Speech Quality Assessment Model

2026-07-01 · 更新于 2026-07-24 · 2 min · 342 words

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

2026-07-01 · 更新于 2026-07-24 · 2 min · 380 words

BEST-RQ-2: Contextualize-Then-Predict, a Two-Step Approach for Self-Supervised Audio Representations

2026-07-01 · 更新于 2026-07-24 · 2 min · 258 words

Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models

2026-07-01 · 更新于 2026-07-24 · 2 min · 243 words

Beyond Cross-Reconstruction: Probing-Based Disentanglement Evaluation for Acoustic Teleportation Codecs

2026-07-01 · 更新于 2026-07-24 · 2 min · 293 words

Building a Multimodal Dataset of Academic Paper for Keyword Extraction

2026-07-01 · 更新于 2026-07-24 · 2 min · 344 words

Building an ASR Solution for Training and Assessing Children's Reading

2026-07-01 · 更新于 2026-07-24 · 2 min · 243 words

Detecting Audio Deepfakes on the Edge: Lightweight SSL-Based Detection in a Browser Plugin

2026-07-01 · 更新于 2026-07-24 · 3 min · 503 words

Dilemmadata: On the Interoperability of Heterogeneous Roman Numeral Datasets

2026-07-01 · 更新于 2026-07-24 · 1 min · 154 words

Enhancing BEST-RQ Pseudo-Label Quality through Online Refinement for Automatic Speech Recognition

2026-07-01 · 更新于 2026-07-24 · 2 min · 368 words

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

2026-07-01 · 更新于 2026-07-24 · 2 min · 324 words

Gated Multi-Graph Fusion via Graph Attention Networks for Alzheimer's Disease Detection

2026-07-01 · 更新于 2026-07-24 · 2 min · 374 words

How Bilingual Are SSL Speech Models? Cross-Lingual Probing of Articulatory Encoding with Finnish and Russian EMA

2026-07-01 · 更新于 2026-07-24 · 2 min · 305 words

Improving multichannel speech enhancement through accurate room-acoustic simulations

2026-07-01 · 更新于 2026-07-24 · 2 min · 320 words

Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation

2026-07-01 · 更新于 2026-07-24 · 2 min · 327 words

Linguistic Bias Mitigation for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

2026-07-01 · 更新于 2026-07-24 · 2 min · 410 words

Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection

2026-07-01 · 更新于 2026-07-24 · 2 min · 402 words

LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment

2026-07-01 · 更新于 2026-07-24 · 3 min · 573 words

LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish

2026-07-01 · 更新于 2026-07-24 · 2 min · 376 words

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

2026-07-01 · 更新于 2026-07-24 · 2 min · 342 words

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

2026-07-01 · 更新于 2026-07-24 · 1 min · 202 words

Probing-Guided Layer Selection from Self-Supervised Speech Models for Generalizable Audio Deepfake Detection

2026-07-01 · 更新于 2026-07-24 · 3 min · 594 words

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

2026-07-01 · 更新于 2026-07-24 · 2 min · 228 words

SwiftAudio: Data-Efficient Caption-Only Distillation for One-Step Text-to-Audio Diffusion-based Generation

2026-07-01 · 更新于 2026-07-24 · 4 min · 644 words

SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

2026-07-01 · 更新于 2026-07-24 · 2 min · 420 words

Tone-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition

2026-07-01 · 更新于 2026-07-24 · 3 min · 598 words

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

2026-07-01 · 更新于 2026-07-24 · 1 min · 143 words

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

2026-07-01 · 更新于 2026-07-24 · 4 min · 682 words

ZEBRA: Zero-Shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization in Audio-Language Models

2026-07-01 · 更新于 2026-07-24 · 3 min · 470 words

语音/音乐/音频论文速递 2026-07-01

2026-07-01 · 更新于 2026-07-24 · 20 min · 4207 words

June ⁸⁰⁵

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

2026-06-30 · 更新于 2026-07-24 · 2 min · 302 words

AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification

2026-06-30 · 更新于 2026-07-24 · 2 min · 315 words

Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models

2026-06-30 · 更新于 2026-07-24 · 2 min · 384 words

Clustering Unsupervised Representations as Defense against Poisoning Attacks on Speech Commands Classification System

2026-06-30 · 更新于 2026-07-24 · 2 min · 345 words

Comparing Human and Automatic Recognition of Dutch Dysarthric Continuous Speech: A Case Study

2026-06-30 · 更新于 2026-07-24 · 2 min · 294 words

CTC-Seeded Token Edit Refinement for Non-Autoregressive Speech Recognition

2026-06-30 · 更新于 2026-07-24 · 3 min · 479 words

DialogPII: A multilingual dataset of synthetic dialog transcripts to detect personal information

2026-06-30 · 更新于 2026-07-24 · 7 min · 1334 words

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

2026-06-30 · 更新于 2026-07-24 · 7 min · 1345 words

EchoHawk: A Reproducible Acoustic Pipeline for Drone Detection, Classification, and Direction-Finding, with a Cautionary Study of Session-Level Data Leakage

2026-06-30 · 更新于 2026-07-24 · 1 min · 127 words

Effective Depth in Joint Source-Channel Coding: An Implicit Equilibrium Analysis

2026-06-30 · 更新于 2026-07-24 · 2 min · 221 words

Evaluation of Head-Related Transfer Functions Across Five Levels of Individualisation in Virtual Reality

2026-06-30 · 更新于 2026-07-24 · 2 min · 298 words

FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars

2026-06-30 · 更新于 2026-07-24 · 3 min · 524 words

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

2026-06-30 · 更新于 2026-07-24 · 4 min · 723 words

How to Leverage Synthetic Speech for LLM-Based ASR Systems?

2026-06-30 · 更新于 2026-07-24 · 2 min · 294 words

Improving Large-Scale Weakly Supervised ASR by Filtering and Selection

2026-06-30 · 更新于 2026-07-24 · 3 min · 482 words

LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

2026-06-30 · 更新于 2026-07-24 · 1 min · 100 words

LoRA-Tuned Large Language Models for Dementia Detection via Multi-View Speech-Derived Features

2026-06-30 · 更新于 2026-07-24 · 2 min · 251 words

MeloDISinger: Melody-Aware & Duration-Preserving Singing Voice Editing with Audio Infilling

2026-06-30 · 更新于 2026-07-24 · 3 min · 493 words

OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

2026-06-30 · 更新于 2026-07-24 · 5 min · 996 words

Position-Aware Target Speaker Extraction for Long-Form Multi-Party Conversations: A Diarization-Free Framework for ASR

2026-06-30 · 更新于 2026-07-24 · 2 min · 252 words

Predicting Timbre Traits for Interpretable Assessment of Musical Sound Synthesizers

2026-06-30 · 更新于 2026-07-24 · 2 min · 254 words

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

2026-06-30 · 更新于 2026-07-24 · 2 min · 351 words

Proteus: Automated Adversarial Robustness Testing for Audio Deepfake Detectors

2026-06-30 · 更新于 2026-07-24 · 2 min · 315 words

Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering

2026-06-30 · 更新于 2026-07-24 · 2 min · 402 words

Semi-Supervised Sound Event Detection with Conditional Mixup and Embedding-Level Contrastive Loss

2026-06-30 · 更新于 2026-07-24 · 2 min · 348 words

SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset

2026-06-30 · 更新于 2026-07-24 · 2 min · 363 words

SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition

2026-06-30 · 更新于 2026-07-24 · 3 min · 438 words

SIMAX: A Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation

2026-06-30 · 更新于 2026-07-24 · 3 min · 501 words

TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation

2026-06-30 · 更新于 2026-07-24 · 3 min · 448 words

TRACE: Temporal Relationship-Aware Conversational Entrainment Detection in Dyadic Speech

2026-06-30 · 更新于 2026-07-24 · 2 min · 348 words

Two kinds of robustness are not the same: disentangling fault tolerance and low-SNR robustness in multi-domain event detection on real data

2026-06-30 · 更新于 2026-07-24 · 2 min · 420 words

Underwater Source Detection and Classification for Signal-based Surveillance: Audio Dataset Curation and Cross-Domain Evaluation

2026-06-30 · 更新于 2026-07-24 · 2 min · 306 words

VeRe-Flow: Guiding Flow Matching toward Clean Speech via Velocity Contrastive Regularization and Representation Alignment for Noise-Robust Bandwidth Expansion

2026-06-30 · 更新于 2026-07-24 · 2 min · 408 words

VIB-AVSR: Variational Information Bottleneck for Noise-Robust LLM-Based Audio-Visual Speech Recognition

2026-06-30 · 更新于 2026-07-24 · 2 min · 387 words

wav2VOT: Automatic estimation of voice onset time, closure duration, and burst realisation with wav2vec2

2026-06-30 · 更新于 2026-07-24 · 2 min · 239 words

语音/音乐/音频论文速递 2026-06-30

2026-06-30 · 更新于 2026-07-24 · 22 min · 4475 words

A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset

2026-06-29 · 更新于 2026-07-24 · 2 min · 312 words

A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges

2026-06-29 · 更新于 2026-07-24 · 3 min · 495 words

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

2026-06-29 · 更新于 2026-07-24 · 2 min · 276 words

DG^VoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions

2026-06-29 · 更新于 2026-07-24 · 2 min · 219 words

Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

2026-06-29 · 更新于 2026-07-24 · 2 min · 374 words

Do Speech Emphasis Models Generalize across Languages and Emotions?

2026-06-29 · 更新于 2026-07-24 · 2 min · 246 words

From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection

2026-06-29 · 更新于 2026-07-24 · 2 min · 228 words

From General-Purpose Audio Tagging to Spatially Grounded Sound Event Localization and Detection

2026-06-29 · 更新于 2026-07-24 · 2 min · 366 words

Grammar-Guided Hierarchical Parsing for Long-form Audio Activity Recognition

2026-06-29 · 更新于 2026-07-24 · 2 min · 403 words

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

2026-06-29 · 更新于 2026-07-24 · 4 min · 781 words

HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models

2026-06-29 · 更新于 2026-07-24 · 2 min · 381 words

Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition

2026-06-29 · 更新于 2026-07-24 · 2 min · 406 words

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

2026-06-29 · 更新于 2026-07-24 · 3 min · 470 words

Room for Error: Large-Scale Simulation of Over-the-Air Acoustic Attacks

2026-06-29 · 更新于 2026-07-24 · 2 min · 282 words

Screening Matters: A Comparative Study of Conventional and Crowdsourced Listening Tests

2026-06-29 · 更新于 2026-07-24 · 4 min · 689 words

What Was That Again? Certified Robustness for Automatic Speech Recognition

2026-06-29 · 更新于 2026-07-24 · 5 min · 898 words

语音/音乐/音频论文速递 2026-06-29

2026-06-29 · 更新于 2026-07-24 · 9 min · 1914 words

A Large-Scale Database and Predictive Model of Listener-Rated Ease of Speech Understanding in Commercial Hearing Aids

2026-06-26 · 更新于 2026-07-24 · 2 min · 266 words

Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

2026-06-26 · 更新于 2026-07-24 · 2 min · 351 words

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

2026-06-26 · 更新于 2026-07-24 · 3 min · 477 words

DNSMOS-C: Improving End-to-end Speech Quality Models via Contrastive Learning

2026-06-26 · 更新于 2026-07-24 · 2 min · 406 words

Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding

2026-06-26 · 更新于 2026-07-24 · 3 min · 457 words

FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following

2026-06-26 · 更新于 2026-07-24 · 2 min · 335 words

Generative AI and Copyright Infringement: A Legal-Technical Analysis of AI Music Generation Systems Under 17 U.S.C. Title 17

2026-06-26 · 更新于 2026-07-24 · 1 min · 211 words

Listening Like a Judge: A Music-Aware Framework for Automatic Singing Performance Evaluation

2026-06-26 · 更新于 2026-07-24 · 4 min · 716 words

Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

2026-06-26 · 更新于 2026-07-24 · 3 min · 551 words

Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech

2026-06-26 · 更新于 2026-07-24 · 2 min · 422 words

Phonetic and semantic analyses of spoken corpora of Beijing and Taiwan Mandarin indicate that the neutral tone is a lexical tone

2026-06-26 · 更新于 2026-07-24 · 1 min · 41 words

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

2026-06-26 · 更新于 2026-07-24 · 2 min · 240 words

SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages

2026-06-26 · 更新于 2026-07-24 · 2 min · 362 words

Soroll-IA: A Weakly Labeled Audio Dataset for Real-World Industrial Port Monitoring

2026-06-26 · 更新于 2026-07-24 · 2 min · 330 words

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

2026-06-26 · 更新于 2026-07-24 · 2 min · 405 words

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

2026-06-26 · 更新于 2026-07-24 · 3 min · 508 words

VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation

2026-06-26 · 更新于 2026-07-24 · 2 min · 368 words

voxmap-studio: An open-source speaker diarization annotation tool with built-in cost instrumentation

2026-06-26 · 更新于 2026-07-24 · 2 min · 296 words

wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval

2026-06-26 · 更新于 2026-07-24 · 3 min · 548 words

What We are Missing in Multimodal LLM Evaluation?

2026-06-26 · 更新于 2026-07-24 · 1 min · 41 words

When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence

2026-06-26 · 更新于 2026-07-24 · 4 min · 814 words

WQ-Fusion: Dynamic Gated Attention for Cross-Domain Audio Representation

2026-06-26 · 更新于 2026-07-24 · 2 min · 383 words

语音/音乐/音频论文速递 2026-06-26

2026-06-26 · 更新于 2026-07-24 · 12 min · 2421 words

Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

2026-06-25 · 更新于 2026-07-24 · 3 min · 637 words

Attractive and Repulsive Pattern Control in Sequence Generation

2026-06-25 · 更新于 2026-07-24 · 2 min · 399 words

BCoughBench: Benchmarking Respiratory Acoustic Foundation Models Under Body-Coupled Wearable Sensor Conditions

2026-06-25 · 更新于 2026-07-24 · 2 min · 377 words

CrossAccent-TTS: Cross-Lingual Accent-Intensity Controllable Text-to-Speech via Disentangled Speaker and Accent Representations

2026-06-25 · 更新于 2026-07-24 · 2 min · 344 words

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

2026-06-25 · 更新于 2026-07-24 · 2 min · 400 words

EmotionAI: A Privacy-Preserving Computational Intelligence Pipeline for Speech-Emotion-Grounded Conversational Analysis

2026-06-25 · 更新于 2026-07-24 · 2 min · 351 words

End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users

2026-06-25 · 更新于 2026-07-24 · 2 min · 364 words

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

2026-06-25 · 更新于 2026-07-24 · 2 min · 221 words

Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models

2026-06-25 · 更新于 2026-07-24 · 2 min · 368 words

FoleySet: A Multi-Level Human-Annotated Foley Sound Dataset

2026-06-25 · 更新于 2026-07-24 · 2 min · 341 words

Frequency-Aware Self-Supervised Music Representation Learning

2026-06-25 · 更新于 2026-07-24 · 3 min · 556 words

From Sounds to Scenes: A Benchmark for Evaluating Context-Aware Auditory Scene Understanding in Large Audio Language Models

2026-06-25 · 更新于 2026-07-24 · 3 min · 572 words

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

2026-06-25 · 更新于 2026-07-24 · 2 min · 405 words

Graph-Based Phonetic Error Correction of Noisy ASR

2026-06-25 · 更新于 2026-07-24 · 2 min · 339 words

Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS

2026-06-25 · 更新于 2026-07-24 · 3 min · 458 words

MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

2026-06-25 · 更新于 2026-07-24 · 3 min · 509 words

One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications

2026-06-25 · 更新于 2026-07-24 · 3 min · 558 words

Phoneme-Level Mispronunciation Screening in Polish-Speaking Children with an Explainable Assistant

2026-06-25 · 更新于 2026-07-24 · 4 min · 790 words

Real-Time Voice AI Hears but Does Not Listen

2026-06-25 · 更新于 2026-07-24 · 2 min · 241 words

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

2026-06-25 · 更新于 2026-07-24 · 1 min · 122 words

SE-AGCNet: An End-to-End Framework for Joint Speech Enhancement and Loudness Control in Meeting Scenarios

2026-06-25 · 更新于 2026-07-24 · 3 min · 616 words

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

2026-06-25 · updated 2026-07-24 · 2 min · 307 words

STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity

2026-06-25 · updated 2026-07-24 · 3 min · 567 words

Supervised Post-training of Speech Foundation Models for Robust Adaptation in Speech Deepfake Detection

2026-06-25 · updated 2026-07-24 · 3 min · 567 words

Velocity Prediction in Automatic Guitar Transcription

2026-06-25 · updated 2026-07-24 · 4 min · 735 words

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

2026-06-25 · updated 2026-07-24 · 1 min · 94 words

What Does a Pathological Speech Assessment Model Know about Acoustic Features? A Case Study on Oral and Oropharyngeal Cancer Patients

2026-06-25 · updated 2026-07-24 · 1 min · 205 words

Speech/Music/Audio Paper Digest 2026-06-25

2026-06-25 · updated 2026-07-24 · 16 min · 3249 words

A Fusion-Aware Two-Stage Framework for Mispronunciation Detection and Diagnosis in Low-Resource Modern Standard Arabic

2026-06-24 · updated 2026-07-24 · 2 min · 222 words

A Methodology for Characterizing Underwater Radiated Noise from Submerged Electric Vehicles in a Coastal Environment: An AUV Test Case

2026-06-24 · updated 2026-07-24 · 3 min · 440 words

A Multi-Stage Separation-and-Classification Framework Guided by Complementary Acoustic-to-Semantic Clues

2026-06-24 · updated 2026-07-24 · 2 min · 339 words

A Variational-Flow Analysis of StoRM under Noise-Power Mismatch

2026-06-24 · updated 2026-07-24 · 2 min · 344 words

Aligning MusicLLM with Emotion using Instruction Tuning and Feedback-Driven Alignment

2026-06-24 · updated 2026-07-24 · 4 min · 715 words

Audio–Image Alignment as a Continued-Pretraining Stage Improves Low-Resource ASR

2026-06-24 · updated 2026-07-24 · 3 min · 524 words

Audio-visual Contrastive Alignment for Diffusion-based Visual-conditioned Speech Enhancement

2026-06-24 · updated 2026-07-24 · 2 min · 335 words

Autoencoder based optimized SSL representations: Complexity Minimization and improved Dysarthric ASR

2026-06-24 · updated 2026-07-24 · 2 min · 408 words

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

2026-06-24 · updated 2026-07-24 · 3 min · 495 words

BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset

2026-06-24 · updated 2026-07-24 · 2 min · 228 words

Beyond U-Net: A Latent-Representation-Aligned Skip-Free Backbone for Flow-Matching Speech Enhancement

2026-06-24 · updated 2026-07-24 · 3 min · 493 words

Breaking Shortcut Learning for Cross-Trial EEG-Guided Target Speech Extraction via Two-Stage Training

2026-06-24 · updated 2026-07-24 · 3 min · 438 words

CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

2026-06-24 · updated 2026-07-24 · 2 min · 399 words

Comparative Reasoning: Making an Audio Language Model Better at Comparing Emotions

2026-06-24 · updated 2026-07-24 · 2 min · 361 words

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

2026-06-24 · updated 2026-07-24 · 3 min · 560 words

Digital Revival: Acoustic Documentation and Digital Reactivation of Historical Woodwind Instruments

2026-06-24 · updated 2026-07-24 · 1 min · 163 words

DTT-BSR+: A Generative-Regression Cascade for Music Source Restoration

2026-06-24 · updated 2026-07-24 · 2 min · 379 words

Evaluation of Headrest-Integrated Loudspeakers for Enhanced Spatial Audio Immersion in Automotive Cabins

2026-06-24 · updated 2026-07-24 · 2 min · 369 words

Heterogeneous 2D/1D Signal Representation Fusion for Underwater Acoustic Modulation Recognition Under Distribution Shift

2026-06-24 · updated 2026-07-24 · 2 min · 314 words

It's Complicated: On the Design and Evaluation of AI-Powered AAC Interfaces

2026-06-24 · updated 2026-07-24 · 1 min · 159 words

Joint Learning of Covariance Estimation and White Noise Gain for Robust MVDR Beamforming

2026-06-24 · updated 2026-07-24 · 2 min · 214 words

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

2026-06-24 · updated 2026-07-24 · 2 min · 269 words

Measuring User's Mental Models of Speech Translation in Human-AI Collaboration

2026-06-24 · updated 2026-07-24 · 2 min · 260 words

Neuromorphic Speech Enhancement with Dual-Branch Spiking Neural Networks

2026-06-24 · updated 2026-07-24 · 2 min · 277 words

NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction

2026-06-24 · updated 2026-07-24 · 3 min · 534 words

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

2026-06-24 · updated 2026-07-24 · 3 min · 428 words

Perceptual Evaluation of Higher-Order Ambisonic Codecs on Both Synthetic Mixing and Native Recordings

2026-06-24 · updated 2026-07-24 · 2 min · 281 words

Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

2026-06-24 · updated 2026-07-24 · 3 min · 492 words

Progressive Alignment Objectives for Aligner-Encoder based ASR

2026-06-24 · updated 2026-07-24 · 1 min · 118 words

Real-Time Interactive Music Generation via Data-Free Streaming Consistency Distillation

2026-06-24 · updated 2026-07-24 · 3 min · 438 words

Selective Capability Unlearning in End-to-End Spoken Language Understanding

2026-06-24 · updated 2026-07-24 · 1 min · 127 words

Sonus Health: Calibrated Heart-Murmur Detection from Smartphone-Based Veterinary Auscultation

2026-06-24 · updated 2026-07-24 · 2 min · 226 words

SphereVBx: Spherical Variational Bayes Clustering for Simplified EEND-VC Diarization

2026-06-24 · updated 2026-07-24 · 3 min · 501 words

Statistical validation and full-sphere extension of a Bayesian model for human static sound localisation

2026-06-24 · updated 2026-07-24 · 2 min · 258 words

Suppressing spectral edge effects in Schroeder Harmonic Complex

2026-06-24 · updated 2026-07-24 · 1 min · 211 words

The effect of micro-changes in the pluck trajectory on the sound of an acoustic guitar

2026-06-24 · updated 2026-07-24 · 2 min · 343 words

video-SALMONN-R: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

2026-06-24 · updated 2026-07-24 · 3 min · 575 words

VieSpeaker: A Large-Scale Vietnamese Speaker Recognition Dataset Beyond Visual Dependency

2026-06-24 · updated 2026-07-24 · 2 min · 278 words

ZONOS2 Technical Report

2026-06-24 · updated 2026-07-24 · 7 min · 1346 words

Speech/Music/Audio Paper Digest 2026-06-24

2026-06-24 · updated 2026-07-24 · 21 min · 4472 words

A DDSP Framework for Adaptive Room Equalization

2026-06-23 · updated 2026-07-24 · 3 min · 607 words

A Generalized Formalism of Auto-Regressive Decoding for Speech Processing

2026-06-23 · updated 2026-07-24 · 2 min · 262 words

Acoustic Landmark Detector based on Conformer and HuBERT

2026-06-23 · updated 2026-07-24 · 3 min · 616 words

Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR

2026-06-23 · updated 2026-07-24 · 2 min · 345 words

An Acoustic Landmark Database of the English Lexicon via Articulatory Synthesis

2026-06-23 · updated 2026-07-24 · 2 min · 360 words

An Analysis of Untrained Deep Reservoir Networks for Audio Surveillance

2026-06-23 · updated 2026-07-24 · 2 min · 336 words

An Evaluation Framework for Text-to-Speech Voice Reconstruction

2026-06-23 · updated 2026-07-24 · 2 min · 284 words

An implicitization-based solution to the minimal 4s/6r ToA problem using Cayley–Menger determinants

2026-06-23 · updated 2026-07-24 · 2 min · 272 words

AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?

2026-06-23 · updated 2026-07-24 · 2 min · 270 words

ATCCaps: A Call-Sign-Aware Speech Dataset for Air Traffic Control Recognition

2026-06-23 · updated 2026-07-24 · 2 min · 340 words

Audio Editing in the Era of Foundation Models: A Survey

2026-06-23 · updated 2026-07-24 · 1 min · 201 words

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

2026-06-23 · updated 2026-07-24 · 3 min · 436 words

AugCodec: A Low-Bitrate Disentangled Neural Speech Codec via Data Augmentation

2026-06-23 · updated 2026-07-24 · 2 min · 358 words

Backdoor Attacks on Speech Emotion Recognition via TTS-Generated Poisoning

2026-06-23 · updated 2026-07-24 · 2 min · 282 words

Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption

2026-06-23 · updated 2026-07-24 · 3 min · 450 words

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

2026-06-23 · updated 2026-07-24 · 3 min · 467 words

Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study

2026-06-23 · updated 2026-07-24 · 3 min · 481 words

Beyond ROC-AUC: Operating-Point Performance Reporting for Biometric Verification

2026-06-23 · updated 2026-07-24 · 2 min · 311 words

Bridging Self-Supervised Learning and Speech Enhancement: A Wav2Vec2-Conditioned Framework

2026-06-23 · updated 2026-07-24 · 2 min · 422 words

Bridging the Age Gap: Towards Detecting Neural Audio Codec Synthesized Elderly Speech Deepfake

2026-06-23 · updated 2026-07-24 · 2 min · 421 words

CAAD: Contrastive Audio-Aware Distillation for Efficient Speech Language Models

2026-06-23 · updated 2026-07-24 · 2 min · 356 words

CapRiCorn-1K: A Comprehensive Benchmark for Video Captioning and Subject Referential Consistency Across Temporal Scales

2026-06-23 · updated 2026-07-24 · 4 min · 793 words

Catching Lies Without Sending the Video: Privacy-Preserving Multimodal Deception Detection

2026-06-23 · updated 2026-07-24 · 2 min · 263 words

Compiling Differentiable Audio Graphs to Real-Time DSP

2026-06-23 · updated 2026-07-24 · 1 min · 185 words

CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents

2026-06-23 · updated 2026-07-24 · 3 min · 487 words

CoughPhase-CLR: Designing an acoustics-informed foundation model for coughing sound classification

2026-06-23 · updated 2026-07-24 · 2 min · 407 words

Cross-lingual Retrieval-Augmented Classification for Dysarthria Severity Assessment

2026-06-23 · updated 2026-07-24 · 4 min · 672 words

Direct Raw Audio Signal Processing via Reservoir Computing: An Investigation into 'Feature-Free' Architectures

2026-06-23 · updated 2026-07-24 · 1 min · 206 words

DisSpeech: Low-Resource Controllable Mandarin Stuttered Speech Synthesis for ASR Augmentation

2026-06-23 · updated 2026-07-24 · 2 min · 373 words

Domain-incremental audio classification using domain-specific experts and prototype classifier

2026-06-23 · updated 2026-07-24 · 2 min · 276 words

Don't Listen to Me: A Lightweight, Low-Latency Model for Own-Voice Cancellation in Far-Field Speech Enhancement

2026-06-23 · updated 2026-07-24 · 3 min · 437 words

DSSCNet: A Transfer Learning Framework for Cross-Corpus Dysarthric Speech Severity Classification

2026-06-23 · updated 2026-07-24 · 2 min · 298 words

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

2026-06-23 · updated 2026-07-24 · 4 min · 642 words

ESPnet3: Infrastructure for Scalable Speech and Audio Research in the Foundation Model Era

2026-06-23 · updated 2026-07-24 · 4 min · 698 words

Explainable AI in Speaker Recognition – Attention Map Visualisation and Evaluation

2026-06-23 · updated 2026-07-24 · 2 min · 302 words

Exploiting Neural Audio Codec Latents for Adversarial Audio Attacks

2026-06-23 · updated 2026-07-24 · 3 min · 435 words

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

2026-06-23 · updated 2026-07-24 · 7 min · 1476 words

From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection

2026-06-23 · updated 2026-07-24 · 2 min · 426 words

Gradient-Based Learning of Parametric Engine Sound Representations for Real-Time Resynthesis and Tuning on Embedded Systems

2026-06-23 · updated 2026-07-24 · 1 min · 158 words

HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

2026-06-23 · updated 2026-07-24 · 2 min · 425 words

How Well Do Self-Supervised Speech Models Encode Age and Gender in Children's Speech? A Layer-Wise Analysis Across Multiple Architectures

2026-06-23 · updated 2026-07-24 · 2 min · 420 words

Imitation Learning for Elder-Facing Speech Synthesis

2026-06-23 · updated 2026-07-24 · 2 min · 417 words

Improving Engine Sound Analysis in Hot-Test Environments via a RAB-U-Net (Residual Attention Block U-Net) Noise Removal Method

2026-06-23 · updated 2026-07-24 · 2 min · 302 words

Improving Text-to-Music Generation with Human Preference Rewards

2026-06-23 · updated 2026-07-24 · 2 min · 399 words

InstructFX2FX: A Multi-turn Text-to-Preset Demo for Iterative Audio Effect Refinement

2026-06-23 · updated 2026-07-24 · 1 min · 187 words

Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

2026-06-23 · updated 2026-07-24 · 1 min · 124 words

Interleaved Speech Language Models Latently Work In Text

2026-06-23 · updated 2026-07-24 · 2 min · 383 words

ISCSLP 2026 CoT-TTS Challenge: Chain-of-Thought Reasoning for Context-Aware Text-to-Speech

2026-06-23 · updated 2026-07-24 · 2 min · 239 words

Kiwano: A Cutting-Edge Open-Source Toolkit for Speaker Verification

2026-06-23 · updated 2026-07-24 · 3 min · 561 words

LambdaMark: Semantic Audio Watermarking for Robustness and Radioactivity

2026-06-23 · updated 2026-07-24 · 5 min · 922 words

Learning from Audio-Dependency Errors: Data Curation Strategies Based on Model Confusion Patterns in Audio Question Answering

2026-06-23 · updated 2026-07-24 · 3 min · 446 words

Learning to Evade: Adaptive Attacks on Audio Watermarking

2026-06-23 · updated 2026-07-24 · 7 min · 1403 words

Libretto: Giving LLM Agents a Sense of Musical Structure

2026-06-23 · updated 2026-07-24 · 4 min · 696 words

LISE : Listenable Interpretable Speaker Embeddings

2026-06-23 · updated 2026-07-24 · 3 min · 515 words

LK Jam: System Architecture and Implementation of a Real-Time Human-AI Interactive Music Generation System using Role-Aware GRU

2026-06-23 · updated 2026-07-24 · 2 min · 246 words

MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data

2026-06-23 · updated 2026-07-24 · 1 min · 163 words

MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios

2026-06-23 · updated 2026-07-24 · 2 min · 371 words

Noise-Driven Instrument Based on Coherent Quantum and Stochastic Oscillator Models

2026-06-23 · updated 2026-07-24 · 1 min · 38 words

On the Effect of Segmentation Width and Cluster Size on Speech Resynthesis and Continuation in Generative Spoken Language Models

2026-06-23 · updated 2026-07-24 · 3 min · 608 words

Online Predictive Coding for Dual-Mode Self-Supervised Speech Model

2026-06-23 · updated 2026-07-24 · 4 min · 678 words

OpenWER: Improving Cross-Lingual ASR Evaluation and Enabling Token-Based Accuracy Metrics

2026-06-23 · updated 2026-07-24 · 2 min · 263 words

PHAST-Net: Attention-Guided, Physics-Informed Network for Unified Estimation of Ideal Time-Frequency Representations

2026-06-23 · updated 2026-07-24 · 2 min · 316 words

Physics-Informed Neural Operator for Speech Production Analysis

2026-06-23 · updated 2026-07-24 · 2 min · 358 words

PIVOTSBench: Evaluating Fine-Grained Interpersonal Relationship Reasoning in Multimodal Large Language Models

2026-06-23 · updated 2026-07-24 · 3 min · 451 words

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

2026-06-23 · updated 2026-07-24 · 3 min · 578 words

Scaling Audio Models Efficiently: A Joint Study of Compute Constraints and Optimization Behavior

2026-06-23 · updated 2026-07-24 · 2 min · 352 words

SDP-Codec: A Speaker-Decoupled Speech Codec with Pitch Injection for Low-Bitrate Coding and Zero-Shot Voice Conversion

2026-06-23 · updated 2026-07-24 · 2 min · 290 words

Sea-Scan: High-Accuracy, ML-based Dark Vessel Detection and Localisation via Weakly Supervised DAS Monitoring

2026-06-23 · updated 2026-07-24 · 3 min · 573 words

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

2026-06-23 · updated 2026-07-24 · 2 min · 390 words

STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation

2026-06-23 · updated 2026-07-24 · 5 min · 1004 words

Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

2026-06-23 · updated 2026-07-24 · 3 min · 514 words

Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS

2026-06-23 · updated 2026-07-24 · 2 min · 282 words

The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

2026-06-23 · updated 2026-07-24 · 3 min · 586 words

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

2026-06-23 · updated 2026-07-24 · 1 min · 199 words

Time-Frequency Weighted Losses for Phoneme Reconstruction in DNN-Based Speech Enhancement

2026-06-23 · updated 2026-07-24 · 2 min · 408 words

Toward Open-Set Speaker Attribute Prediction with Keyword-Appended LLM Embeddings

2026-06-23 · updated 2026-07-24 · 2 min · 388 words

Towards Detecting Neural Audio Codec Synthesized Heart Sounds

2026-06-23 · updated 2026-07-24 · 3 min · 523 words

Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

2026-06-23 · updated 2026-07-24 · 2 min · 357 words

Using Phonological-Level Wav2Vec2 for Mandarin Automatic Mispronunciation Detection and Diagnosis

2026-06-23 · updated 2026-07-24 · 3 min · 447 words

Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark Dataset for Hindi

2026-06-23 · updated 2026-07-24 · 3 min · 427 words

What Do Neural Networks Learn for TDOA Estimation? A Cross-Architecture Probing Study

2026-06-23 · updated 2026-07-24 · 3 min · 445 words

When EER Hides Deployment Failure: Auditing Threshold Transfer and Unlabeled Score Calibration for Speech Deepfake Detectors

2026-06-23 · updated 2026-07-24 · 2 min · 363 words

Word Lengthening as a Function of Utterance Position: A Multi-Corpus Study

2026-06-23 · updated 2026-07-24 · 2 min · 264 words

Speech/Music/Audio Paper Digest 2026-06-23

2026-06-23 · updated 2026-07-24 · 48 min · 10123 words

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances

2026-06-22 · updated 2026-07-24 · 2 min · 387 words

Speech/Music/Audio Paper Digest 2026-06-22

2026-06-22 · updated 2026-07-24 · 1 min · 118 words

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

2026-06-19 · updated 2026-07-24 · 2 min · 317 words

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

2026-06-19 · updated 2026-07-24 · 3 min · 517 words

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages

2026-06-19 · updated 2026-07-24 · 2 min · 397 words

Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian

2026-06-19 · updated 2026-07-24 · 2 min · 306 words

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

2026-06-19 · updated 2026-07-24 · 3 min · 461 words

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

2026-06-19 · updated 2026-07-24 · 2 min · 380 words

Exploring Pre-training Benefits on Phoneme Addition through Fine-tuning in Speech Synthesis

2026-06-19 · updated 2026-07-24 · 2 min · 333 words

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

2026-06-19 · updated 2026-07-24 · 2 min · 423 words

FlowFake: Liquid Networks for Audio Deepfake Detection

2026-06-19 · updated 2026-07-24 · 2 min · 411 words

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

2026-06-19 · updated 2026-07-24 · 2 min · 391 words

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

2026-06-19 · updated 2026-07-24 · 4 min · 658 words

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

2026-06-19 · updated 2026-07-24 · 3 min · 441 words

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

2026-06-19 · updated 2026-07-24 · 2 min · 379 words

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

2026-06-19 · updated 2026-07-24 · 3 min · 528 words

Interpreting Content and Speaker Characteristics in Factorised Self-Supervised Subspaces

2026-06-19 · updated 2026-07-24 · 2 min · 282 words

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

2026-06-19 · updated 2026-07-24 · 3 min · 467 words

Latency-Configurable Streaming Speech Enhancement via Asymmetric Temporal Padding

2026-06-19 · updated 2026-07-24 · 2 min · 316 words

Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

2026-06-19 · updated 2026-07-24 · 1 min · 169 words

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

2026-06-19 · updated 2026-07-24 · 2 min · 368 words

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

2026-06-19 · updated 2026-07-24 · 2 min · 419 words

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

2026-06-19 · updated 2026-07-24 · 1 min · 137 words

MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining

2026-06-19 · updated 2026-07-24 · 2 min · 386 words

NEST: Narrative Event Structures in Time for Long Video Understanding

2026-06-19 · updated 2026-07-24 · 2 min · 340 words

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

2026-06-19 · updated 2026-07-24 · 1 min · 197 words

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

2026-06-19 · updated 2026-07-24 · 3 min · 571 words

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

2026-06-19 · updated 2026-07-24 · 3 min · 473 words

Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores

2026-06-19 · updated 2026-07-24 · 3 min · 558 words

PolSeT: Polish Semantics of Timbre Dataset

2026-06-19 · updated 2026-07-24 · 1 min · 120 words

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

2026-06-19 · updated 2026-07-24 · 3 min · 469 words

Prismriver: Formalization of Music Theory and Algorithmic Composition in Lean 4

2026-06-19 · updated 2026-07-24 · 2 min · 367 words

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

2026-06-19 · updated 2026-07-24 · 2 min · 246 words

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

2026-06-19 · updated 2026-07-24 · 2 min · 381 words

RIVET: Robust Idempotent Voice Attribute Editing

2026-06-19 · updated 2026-07-24 · 2 min · 292 words

S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning

2026-06-19 · updated 2026-07-24 · 3 min · 492 words

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

2026-06-19 · updated 2026-07-24 · 2 min · 416 words

Stuttering Classification and Segmentation with Attention-Based Multiple Instance Learning

2026-06-19 · updated 2026-07-24 · 2 min · 387 words

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

2026-06-19 · updated 2026-07-24 · 2 min · 253 words

Time-Unconditional Generative Speech Enhancement via Autonomous Rectified Flow

2026-06-19 · updated 2026-07-24 · 3 min · 535 words

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

2026-06-19 · updated 2026-07-24 · 2 min · 363 words

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

2026-06-19 · updated 2026-07-24 · 2 min · 292 words

Speech/Music/Audio Paper Digest 2026-06-19

2026-06-19 · updated 2026-07-24 · 23 min · 4844 words

A Survey of Methods for the Discretization of Phonograph Record Playback Filters

2026-06-18 · updated 2026-07-24 · 2 min · 376 words

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

2026-06-18 · updated 2026-07-24 · 2 min · 387 words

Audio-to-Audio via Diffusion Warm Initialization

2026-06-18 · updated 2026-07-24 · 2 min · 360 words

Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

2026-06-18 · updated 2026-07-24 · 2 min · 378 words

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

2026-06-18 · updated 2026-07-24 · 2 min · 262 words

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

2026-06-18 · updated 2026-07-24 · 2 min · 270 words

Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models

2026-06-18 · updated 2026-07-24 · 4 min · 800 words

Continuous Audio Thinking for Large Audio Language Models

2026-06-18 · updated 2026-07-24 · 4 min · 798 words

Continuous-Speech Parkinson's Disease Detection Using Acoustic and Inharmonicity Features

2026-06-18 · updated 2026-07-24 · 3 min · 501 words

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

2026-06-18 · updated 2026-07-24 · 3 min · 574 words

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

2026-06-18 · updated 2026-07-24 · 2 min · 347 words

Evaluating Dynamic Range Compressor Models Using Control-Voltage Measurements: an Approach and Dataset

2026-06-18 · updated 2026-07-24 · 2 min · 234 words

Fair Cognitive Impairment Detection Through Unlearning

2026-06-18 · updated 2026-07-24 · 3 min · 600 words

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

2026-06-18 · updated 2026-07-24 · 4 min · 751 words

Generalised Transcoding Framework for Arbitrary Spatial Audio Capture and Playback Formats

2026-06-18 · updated 2026-07-24 · 2 min · 240 words

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

2026-06-18 · updated 2026-07-24 · 3 min · 446 words

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

2026-06-18 · updated 2026-07-24 · 2 min · 292 words

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

2026-06-18 · updated 2026-07-24 · 3 min · 450 words

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

2026-06-18 · updated 2026-07-24 · 3 min · 433 words

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

2026-06-18 · updated 2026-07-24 · 2 min · 375 words

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

2026-06-18 · updated 2026-07-24 · 3 min · 429 words

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

2026-06-18 · updated 2026-07-24 · 2 min · 222 words

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

2026-06-18 · updated 2026-07-24 · 4 min · 763 words

Native Active Perception as Reasoning for Omni-Modal Understanding

2026-06-18 · updated 2026-07-24 · 3 min · 428 words

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

2026-06-18 · updated 2026-07-24 · 2 min · 338 words

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

2026-06-18 · updated 2026-07-24 · 3 min · 562 words

Reference-Based Recursive Least-Squares Mitigation of Real Interference in Stereo Audio Recordings

2026-06-18 · updated 2026-07-24 · 2 min · 309 words

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

2026-06-18 · updated 2026-07-24 · 2 min · 412 words

Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

2026-06-18 · updated 2026-07-24 · 2 min · 382 words

Responsible ASR: Overcoming Challenges of Foundational Models in Narrow-Band and Low-Resource Settings

2026-06-18 · updated 2026-07-24 · 3 min · 439 words

Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information

2026-06-18 · updated 2026-07-24 · 1 min · 204 words

Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift

2026-06-18 · updated 2026-07-24 · 3 min · 559 words

SingFox: A Multi-Lingual Singfake Detection Corpus

2026-06-18 · updated 2026-07-24 · 2 min · 297 words

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

2026-06-18 · updated 2026-07-24 · 5 min · 999 words

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

2026-06-18 · updated 2026-07-24 · 2 min · 335 words

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

2026-06-18 · updated 2026-07-24 · 2 min · 285 words

Speech/Music/Audio Paper Digest 2026-06-18

2026-06-18 · updated 2026-07-24 · 21 min · 4449 words

A 399uW 114.3 dB DR Companding Readout ASIC for MEMS Microphones Employing a Multirate Time-Domain ADC

2026-06-17 · updated 2026-07-24 · 2 min · 294 words

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models

2026-06-17 · updated 2026-07-24 · 3 min · 524 words

A Neuromorphic Trigger for Efficient Audio Event Detection

2026-06-17 · updated 2026-07-24 · 4 min · 698 words

AI-based Cognitive-linguistic Features for Dementia Assessment in Picture Description

2026-06-17 · updated 2026-07-24 · 2 min · 298 words

An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages

2026-06-17 · updated 2026-07-24 · 3 min · 587 words

Are you speaking my languages? On spoken language adherence in multimodal LLMs

2026-06-17 · updated 2026-07-24 · 2 min · 401 words

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

2026-06-17 · updated 2026-07-24 · 4 min · 837 words

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

2026-06-17 · updated 2026-07-24 · 2 min · 217 words

DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention

2026-06-17 · updated 2026-07-24 · 2 min · 411 words

Direction of arrival estimation from distant microphone data using single frequency filtering

2026-06-17 · updated 2026-07-24 · 3 min · 540 words

ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

2026-06-17 · updated 2026-07-24 · 3 min · 638 words

Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines

2026-06-17 · updated 2026-07-24 · 3 min · 548 words

From Signals to Patterns: Non-Invasive Tuberculosis Detection from Cough Audio using Bandit Weighted Hyperbolic Prototypes

2026-06-17 · updated 2026-07-24 · 3 min · 637 words

Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning

2026-06-17 · updated 2026-07-24 · 3 min · 432 words

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

2026-06-17 · updated 2026-07-24 · 2 min · 335 words

Intelligibility of Speech in Noise: Investigating Contribution of Magnitude and Phase Spectra

2026-06-17 · updated 2026-07-24 · 2 min · 419 words

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

2026-06-17 · updated 2026-07-24 · 2 min · 249 words

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

2026-06-17 · updated 2026-07-24 · 3 min · 444 words

Learning task-specific subspaces via interventional post-training of speech foundation models

2026-06-17 · updated 2026-07-24 · 4 min · 708 words

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

2026-06-17 · updated 2026-07-24 · 5 min · 917 words

MVEB: Massive Video Embedding Benchmark

2026-06-17 · updated 2026-07-24 · 3 min · 449 words

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction

2026-06-17 · updated 2026-07-24 · 3 min · 585 words

Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition

2026-06-17 · updated 2026-07-24 · 5 min · 914 words

OlfactProfile: Profile-Conditioned Odor Prediction from Audiovisual Content

2026-06-17 · updated 2026-07-24 · 2 min · 345 words

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

2026-06-17 · updated 2026-07-24 · 3 min · 500 words

Perceptual compensation for tonal context in self-supervised speech models

2026-06-17 · updated 2026-07-24 · 1 min · 203 words

PhASE-Flow: Phonetic-Conditioned Acoustic Flow Matching in SSL Representation Domain for Speech Enhancement

2026-06-17 · updated 2026-07-24 · 3 min · 580 words

Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews

2026-06-17 · updated 2026-07-24 · 5 min · 922 words

Single frequency filtering based multi-speaker direction of arrival estimation from stereo recordings

2026-06-17 · updated 2026-07-24 · 2 min · 262 words

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

2026-06-17 · updated 2026-07-24 · 2 min · 243 words

Synergizing Zero-Shot Cross-Lingual Alzheimer Detection with Language-Invariant Multimodal Bi-Geometric Adversarial Learning

2026-06-17 · updated 2026-07-24 · 3 min · 521 words

Transductive Zero-Shot Audio Classification with Audio-Language Models

2026-06-17 · updated 2026-07-24 · 2 min · 355 words

Turning music identification into a neural forward pass

2026-06-17 · updated 2026-07-24 · 4 min · 643 words

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

2026-06-17 · updated 2026-07-24 · 2 min · 377 words

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

2026-06-17 · updated 2026-07-24 · 2 min · 398 words

Speech/Music/Audio Paper Digest 2026-06-17

2026-06-17 · updated 2026-07-24 · 21 min · 4445 words

Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models

2026-06-16 · updated 2026-07-24 · 2 min · 252 words

Acoustic, VOC, and Multimodal Stress Source Localization in the Internet of Plants

2026-06-16 · updated 2026-07-24 · 2 min · 361 words

AdaTT: Text-Guided Instrument Timbre Transfer with Target-Adaptive Structural Control

2026-06-16 · updated 2026-07-24 · 2 min · 358 words

An Asymmetric Formula for Interval Consonance and its Relation to Harmonic Coincidence

2026-06-16 · updated 2026-07-24 · 4 min · 643 words

An auscultation location specific study on the relationship between expiratory-to-inspiratory acoustic patterns and spirometric airflow limitation across age and gender in asthmatic patients

2026-06-16 · updated 2026-07-24 · 2 min · 367 words

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

2026-06-16 · updated 2026-07-24 · 2 min · 298 words

AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Optimization for Pathological Speech Reconstruction

2026-06-16 · updated 2026-07-24 · 4 min · 729 words

ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion

2026-06-16 · updated 2026-07-24 · 2 min · 365 words

ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition

2026-06-16 · updated 2026-07-24 · 2 min · 251 words

AUDEDIT: Inversion-Free Text-Guided Editing with Pretrained Audio Flow Models

2026-06-16 · updated 2026-07-24 · 3 min · 528 words

Beyond Artifacts: Towards Generalizable Synthetic Song Detection via Music-Intrinsic Features

2026-06-16 · updated 2026-07-24 · 2 min · 292 words

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

2026-06-16 · updated 2026-07-24 · 2 min · 408 words

Bridging the SEA Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages

2026-06-16 · updated 2026-07-24 · 4 min · 849 words

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

2026-06-16 · updated 2026-07-24 · 1 min · 129 words

Closed-Loop Triplet Synergistic Generation for Long-Form Video

2026-06-16 · updated 2026-07-24 · 2 min · 288 words

Confidence Score Guided Incremental and Speaker Adaptive Pseudo-Labeling for Semi-Supervised Elderly Speech Recognition

2026-06-16 · updated 2026-07-24 · 4 min · 815 words

Connecting Speech to Words through Images

2026-06-16 · updated 2026-07-24 · 2 min · 306 words

CraBERT: Efficient Phoneme Encoder Pre-Training via Cascade Fusion of Subword Representations for Text-to-Speech

2026-06-16 · updated 2026-07-24 · 2 min · 351 words

Data-Driven Decoding of Russell's Circumplex Model of Affect

2026-06-16 · updated 2026-07-24 · 2 min · 233 words

DDPO-VC: Speaker De-Identification via Diffusion Denoising Policy Optimization

2026-06-16 · updated 2026-07-24 · 4 min · 782 words

Decoding while Adapting: Zero-Shot Online Speaker Adaptation via Audio-Textual Prompts for Elderly Speech Recognition

2026-06-16 · updated 2026-07-24 · 5 min · 876 words

Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection

2026-06-16 · updated 2026-07-24 · 5 min · 967 words

DuraMark: Duration-Embedded Watermarking in LLM-based TTS

2026-06-16 · updated 2026-07-24 · 3 min · 517 words

Dynamic Prosody Prediction in LLM-based TTS for Improving Speaker Similarity

2026-06-16 · updated 2026-07-24 · 3 min · 535 words

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

2026-06-16 · updated 2026-07-24 · 3 min · 616 words

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

2026-06-16 · updated 2026-07-24 · 2 min · 380 words

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

2026-06-16 · updated 2026-07-24 · 3 min · 528 words

From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

2026-06-16 · updated 2026-07-24 · 1 min · 168 words

From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation

2026-06-16 · updated 2026-07-24 · 2 min · 303 words

Geometrically Constrained Decentralized Independent Vector Analysis for Distributed Microphone Arrays

2026-06-16 · updated 2026-07-24 · 3 min · 474 words

Interpretable and Frugal Learning Systems Employing Multiresolution Pyramids and Volterra Kernels

2026-06-16 · updated 2026-07-24 · 2 min · 250 words

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

2026-06-16 · updated 2026-07-24 · 3 min · 458 words

Learning Input-Channel Permutation Equivariance for Multi-Channel Source Separation: Reducing Bleeding in Small Music Ensembles

2026-06-16 · updated 2026-07-24 · 2 min · 419 words

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

2026-06-16 · updated 2026-07-24 · 2 min · 300 words

MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

2026-06-16 · updated 2026-07-24 · 4 min · 732 words

MambAdapter: Lightweight Mamba-Based Adapters for Parameter-Efficient Transfer Learning in Speech and Audio

2026-06-16 · updated 2026-07-24 · 3 min · 435 words

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

2026-06-16 · updated 2026-07-24 · 3 min · 582 words

MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

2026-06-16 · updated 2026-07-24 · 3 min · 438 words

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

2026-06-16 · updated 2026-07-24 · 2 min · 318 words

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

2026-06-16 · updated 2026-07-24 · 2 min · 278 words

Phonetically Explainable Speech Deepfake Detection

2026-06-16 · updated 2026-07-24 · 4 min · 771 words

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

2026-06-16 · updated 2026-07-24 · 2 min · 411 words

Probing Low Frame Rate Degradation in Neural Audio Codecs

2026-06-16 · updated 2026-07-24 · 3 min · 634 words

Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

2026-06-16 · updated 2026-07-24 · 2 min · 357 words

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

2026-06-16 · updated 2026-07-24 · 3 min · 547 words

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

2026-06-16 · updated 2026-07-24 · 2 min · 284 words

Scaling Human and G2P Supervision for Robust Phonetic Transcription

2026-06-16 · updated 2026-07-24 · 2 min · 315 words

SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity

2026-06-16 · updated 2026-07-24 · 2 min · 346 words

Semi-Supervised Speech Confidence Detection using Pseudo-Labelling and Whisper Embeddings

2026-06-16 · updated 2026-07-24 · 1 min · 206 words

Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

2026-06-16 · updated 2026-07-24 · 2 min · 349 words

SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

2026-06-16 · updated 2026-07-24 · 2 min · 341 words

Stabilizing Short Duration Speaker Verification through Neural Re-scoring with Hybrid Enrollment

2026-06-16 · updated 2026-07-24 · 2 min · 332 words

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

2026-06-16 · updated 2026-07-24 · 2 min · 272 words

TMASC: Transmasculine Attitude and Speech Corpus

2026-06-16 · updated 2026-07-24 · 2 min · 248 words

Towards Robust Generative Speech Enhancement Using Vector Quantisation-Based Neural Audio Codec

2026-06-16 · updated 2026-07-24 · 4 min · 743 words

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

2026-06-16 · updated 2026-07-24 · 3 min · 429 words

Unified Audio Generation and Editing via Joint Condition Modeling and Progressive Training

2026-06-16 · updated 2026-07-24 · 1 min · 192 words

Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening

2026-06-16 · updated 2026-07-24 · 4 min · 711 words

Universal adaptive beamforming: A Bayesian approach

2026-06-16 · updated 2026-07-24 · 2 min · 283 words

VoxWatermark: A Large-Scale Benchmark for Audio Watermark Detection under Perturbations

2026-06-16 · updated 2026-07-24 · 2 min · 387 words

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

2026-06-16 · updated 2026-07-24 · 3 min · 579 words

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

2026-06-16 · updated 2026-07-24 · 2 min · 257 words

Speech/Music/Audio Paper Digest 2026-06-16

2026-06-16 · updated 2026-07-24 · 36 min · 7668 words

A Deep Zero-Inflated Model of North Atlantic Right Whale Presence To Support Blue Economy Management in the U.S. East Coast

2026-06-15 · updated 2026-07-24 · 2 min · 422 words

A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

2026-06-15 · updated 2026-07-24 · 4 min · 691 words

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

2026-06-15 · updated 2026-07-24 · 2 min · 304 words

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

2026-06-15 · updated 2026-07-24 · 2 min · 316 words

Beyond task performance: Decoding bioacoustic embeddings with speech features

2026-06-15 · updated 2026-07-24 · 2 min · 321 words

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

2026-06-15 · updated 2026-07-24 · 2 min · 317 words

Explainable and Trustworthy Speech Emotion Recognition Using Confidence Score and Reinforcement Learning Rectified Speech Emotion Descriptors

2026-06-15 · updated 2026-07-24 · 2 min · 405 words

FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding

2026-06-15 · updated 2026-07-24 · 2 min · 335 words

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

2026-06-15 · updated 2026-07-24 · 2 min · 218 words

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

2026-06-15 · updated 2026-07-24 · 2 min · 218 words

HIDVAS: A Hearing Instrument Dataset in Various Acoustical Scenarios for Algorithm Evaluation and Training

2026-06-15 · updated 2026-07-24 · 2 min · 289 words

Instantaneous Pitch Estimation via Wave-U-Net-Based Fundamental Waveform Enhancement

2026-06-15 · updated 2026-07-24 · 2 min · 305 words

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

2026-06-15 · updated 2026-07-24 · 3 min · 500 words

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

2026-06-15 · updated 2026-07-24 · 1 min · 119 words

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

2026-06-15 · updated 2026-07-24 · 4 min · 842 words

MaskedFOP: Polyglot Speaker Identification under Missing Visual Modality via Cascaded Graph Label Propagation

2026-06-15 · updated 2026-07-24 · 2 min · 301 words

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

2026-06-15 · updated 2026-07-24 · 2 min · 363 words

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

2026-06-15 · updated 2026-07-24 · 3 min · 500 words

Multimodal Speaker Identification in Classroom Environments

2026-06-15 · updated 2026-07-24 · 2 min · 247 words

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

2026-06-15 · updated 2026-07-24 · 2 min · 359 words

Orchestra-o1: Omnimodal Agent Orchestration

2026-06-15 · updated 2026-07-24 · 2 min · 366 words

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

2026-06-15 · updated 2026-07-24 · 3 min · 514 words

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

2026-06-15 · updated 2026-07-24 · 2 min · 232 words

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions

2026-06-15 · updated 2026-07-24 · 2 min · 294 words

Unsupervised Approaches for Global Prosodic Embedding Extraction

2026-06-15 · updated 2026-07-24 · 2 min · 287 words

Who Spoke When in Multi-Conversation: Target Speaker Tagging Task and Benchmark

2026-06-15 · updated 2026-07-24 · 2 min · 321 words

Speech/Music/Audio Paper Digest 2026-06-15

2026-06-15 · updated 2026-07-24 · 15 min · 3122 words

A Dual-Mode Faust-to-CLAP Compilation System

2026-06-12 · updated 2026-07-24 · 2 min · 275 words

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

2026-06-12 · updated 2026-07-24 · 2 min · 349 words

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

2026-06-12 · updated 2026-07-24 · 4 min · 720 words

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

2026-06-12 · updated 2026-07-24 · 4 min · 693 words

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

2026-06-12 · updated 2026-07-24 · 3 min · 480 words

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

2026-06-12 · updated 2026-07-24 · 2 min · 318 words

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

2026-06-12 · updated 2026-07-24 · 2 min · 314 words

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

2026-06-12 · updated 2026-07-24 · 2 min · 391 words

Endpoint Anticipation for Low-Latency Spoken Dialogue

2026-06-12 · updated 2026-07-24 · 2 min · 340 words

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

2026-06-12 · updated 2026-07-24 · 3 min · 448 words

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

2026-06-12 · updated 2026-07-24 · 2 min · 266 words

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

2026-06-12 · updated 2026-07-24 · 1 min · 205 words

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

2026-06-12 · updated 2026-07-24 · 2 min · 356 words

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

2026-06-12 · updated 2026-07-24 · 2 min · 238 words

M*: A Modular, Extensible, Serving System for Multimodal Models

2026-06-12 · updated 2026-07-24 · 2 min · 366 words

MiniMax Sparse Attention

2026-06-12 · updated 2026-07-24 · 5 min · 1003 words

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

2026-06-12 · updated 2026-07-24 · 2 min · 304 words

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

2026-06-12 · updated 2026-07-24 · 2 min · 274 words

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

2026-06-12 · updated 2026-07-24 · 2 min · 407 words

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

2026-06-12 · updated 2026-07-24 · 3 min · 629 words

Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

2026-06-12 · updated 2026-07-24 · 2 min · 385 words

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

2026-06-12 · updated 2026-07-24 · 2 min · 306 words

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

2026-06-12 · updated 2026-07-24 · 3 min · 506 words

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

2026-06-12 · updated 2026-07-24 · 3 min · 545 words

The Moving Drone: Negotiating Agency Between the Voice and the Virtual

2026-06-12 · updated 2026-07-24 · 2 min · 318 words

Towards Personalized Federated Learning for Dysarthric Speech Recognition

2026-06-12 · updated 2026-07-24 · 2 min · 417 words

Vocal Identity Under Siege by AI Voice Cloning Technologies

2026-06-12 · updated 2026-07-24 · 1 min · 157 words

Speech/Music/Audio Paper Digest 2026-06-12

2026-06-12 · updated 2026-07-24 · 16 min · 3281 words

Additive Noise, Shift Recovery, and Signed Signals in the Cumulative Distribution Transform

2026-06-11 · updated 2026-07-24 · 2 min · 350 words

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

2026-06-11 · updated 2026-07-24 · 3 min · 603 words

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

2026-06-11 · updated 2026-07-24 · 2 min · 229 words

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

2026-06-11 · updated 2026-07-24 · 2 min · 236 words

Context-Aware Multimodal Claim Verification in Spoken Dialogues

2026-06-11 · updated 2026-07-24 · 3 min · 433 words

CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

2026-06-11 · updated 2026-07-24 · 2 min · 322 words

Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

2026-06-11 · updated 2026-07-24 · 2 min · 329 words

Fast Speech Foundation Model Distillation Using Interleaved Stacking

2026-06-11 · updated 2026-07-24 · 2 min · 365 words

Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments

2026-06-11 · updated 2026-07-24 · 2 min · 344 words

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

2026-06-11 · updated 2026-07-24 · 2 min · 308 words

Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews

2026-06-11 · updated 2026-07-24 · 2 min · 352 words

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

2026-06-11 · updated 2026-07-24 · 2 min · 327 words

HALO: Half-Frame-Rate Adaptive Learnable Operator for Lightweight STFT-Based Speech Enhancement

2026-06-11 · updated 2026-07-24 · 3 min · 579 words

I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System

2026-06-11 · updated 2026-07-24 · 3 min · 449 words

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

2026-06-11 · updated 2026-07-24 · 2 min · 334 words

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

2026-06-11 · updated 2026-07-24 · 3 min · 437 words

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

2026-06-11 · updated 2026-07-24 · 3 min · 485 words

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

2026-06-11 · updated 2026-07-24 · 2 min · 290 words

Massive Open-Vocabulary Keyword Spotting

2026-06-11 · updated 2026-07-24 · 2 min · 347 words

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

2026-06-11 · updated 2026-07-24 · 2 min · 292 words

PianoKontext: Expressive Performance Rendering from Deadpan Context

2026-06-11 · updated 2026-07-24 · 2 min · 252 words

Pretrained self-supervised speech models can recognize unseen consonants

2026-06-11 · updated 2026-07-24 · 2 min · 362 words

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

2026-06-11 · updated 2026-07-24 · 4 min · 674 words

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

2026-06-11 · updated 2026-07-24 · 3 min · 551 words

Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation

2026-06-11 · updated 2026-07-24 · 4 min · 656 words

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

2026-06-11 · updated 2026-07-24 · 3 min · 429 words

Sensitivity Analysis of Generative Spatial Audio Metrics: A Study on Responsiveness, Smoothness, and Symmetry

2026-06-11 · updated 2026-07-24 · 2 min · 335 words

Snapping Matters: Context-Aware Onset Refinement for Automatic Music Transcription

2026-06-11 · updated 2026-07-24 · 4 min · 737 words

SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

2026-06-11 · updated 2026-07-24 · 3 min · 550 words

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

2026-06-11 · updated 2026-07-24 · 2 min · 362 words

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

2026-06-11 · updated 2026-07-24 · 4 min · 767 words

The Hidden Cost of Pairwise Verification in Synthetic Speech Source Tracing

2026-06-11 · updated 2026-07-24 · 2 min · 405 words

Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency

2026-06-11 · updated 2026-07-24 · 2 min · 264 words

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

2026-06-11 · updated 2026-07-24 · 3 min · 478 words

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

2026-06-11 · updated 2026-07-24 · 2 min · 355 words

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

2026-06-11 · updated 2026-07-24 · 3 min · 484 words

Speech/Music/Audio Paper Digest 2026-06-11

2026-06-11 · updated 2026-07-24 · 22 min · 4642 words

A Lightweight Dual-Factor Acoustic Authentication System via Cascaded GMM-DTW Architecture for Edge Computing

2026-06-10 · updated 2026-07-24 · 2 min · 286 words

ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling

2026-06-10 · updated 2026-07-24 · 2 min · 318 words

Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning

2026-06-10 · updated 2026-07-24 · 3 min · 431 words

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

2026-06-10 · updated 2026-07-24 · 2 min · 370 words

AuRA: Internalizing Audio Understanding into LLMs as LoRA

2026-06-10 · updated 2026-07-24 · 1 min · 184 words

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

2026-06-10 · updated 2026-07-24 · 2 min · 317 words

ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding

2026-06-10 · updated 2026-07-24 · 1 min · 159 words

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

2026-06-10 · updated 2026-07-24 · 5 min · 929 words

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

2026-06-10 · updated 2026-07-24 · 2 min · 323 words

Deploying Speech-Driven 3D Facial Animation in Unreal Engine for Production-Ready Digital Humans

2026-06-10 · updated 2026-07-24 · 3 min · 494 words

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

2026-06-10 · updated 2026-07-24 · 2 min · 310 words

Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing

2026-06-10 · updated 2026-07-24 · 2 min · 379 words

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

2026-06-10 · updated 2026-07-24 · 2 min · 354 words

Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR

2026-06-10 · updated 2026-07-24 · 2 min · 353 words

Ethical and Technical Limits of Deepfake Speech Datasets

2026-06-10 · updated 2026-07-24 · 2 min · 370 words

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

2026-06-10 · updated 2026-07-24 · 2 min · 394 words

GC-LoRA: Gated Convolutional LoRA for Parameter-Efficient Acoustic Adaptation

2026-06-10 · updated 2026-07-24 · 2 min · 364 words

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

2026-06-10 · updated 2026-07-24 · 2 min · 381 words

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

2026-06-10 · updated 2026-07-24 · 2 min · 387 words

KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

2026-06-10 · updated 2026-07-24 · 3 min · 429 words

Linguistically Augmented Audio Speech Data (LinguAS)

2026-06-10 · updated 2026-07-24 · 2 min · 259 words

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

2026-06-10 · updated 2026-07-24 · 3 min · 615 words

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

2026-06-10 · updated 2026-07-24 · 3 min · 582 words

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

2026-06-10 · updated 2026-07-24 · 3 min · 553 words

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

2026-06-10 · updated 2026-07-24 · 2 min · 389 words

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

2026-06-10 · updated 2026-07-24 · 2 min · 289 words

Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

2026-06-10 · updated 2026-07-24 · 15 min · 3178 words

Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge

2026-06-10 · updated 2026-07-24 · 5 min · 925 words

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

2026-06-10 · updated 2026-07-24 · 1 min · 208 words

Phoneme-First Prediction for LLM-Based Speech Recognition

2026-06-10 · updated 2026-07-24 · 3 min · 435 words

Profy: Interpretable Visualization of Expertise-Dependent Motor Skills Toward Supporting Piano Practice

2026-06-10 · updated 2026-07-24 · 3 min · 525 words

RAT: Reference-Augmented Training for ASV Anti-Spoofing

2026-06-10 · updated 2026-07-24 · 2 min · 356 words

Recovering the Zipfian Distribution in Unsupervised Term Discovery

2026-06-10 · updated 2026-07-24 · 3 min · 427 words

RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

2026-06-10 · updated 2026-07-24 · 3 min · 464 words

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

2026-06-10 · updated 2026-07-24 · 2 min · 275 words

Speaker Group Encoding in Self-supervised Speech Recognition Models

2026-06-10 · updated 2026-07-24 · 2 min · 234 words

Speech Encoder Fusion for LLM-based Automatic Speech Recognition

2026-06-10 · updated 2026-07-24 · 3 min · 521 words

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

2026-06-10 · updated 2026-07-24 · 3 min · 430 words

SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space

2026-06-10 · updated 2026-07-24 · 5 min · 972 words

Time-frequency localization of bird calls in dense soundscapes

2026-06-10 · updated 2026-07-24 · 2 min · 327 words

Towards Deep Contextual Reasoning from Broad Descriptions for ASR with Speech-LLM via Metadata-Driven Reasoning Chains

2026-06-10 · updated 2026-07-24 · 2 min · 252 words

Towards Robust Arabic Speech Emotion Recognition with Deep Learning

2026-06-10 · updated 2026-07-24 · 2 min · 361 words

TRADE: Transducer-Augmented Decoder for Speech LLM

2026-06-10 · updated 2026-07-24 · 2 min · 327 words

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

2026-06-10 · updated 2026-07-24 · 2 min · 414 words

What Do Deepfake Speech Detectors Actually Hear?

2026-06-10 · updated 2026-07-24 · 1 min · 58 words

Speech/Music/Audio Paper Digest 2026-06-10

2026-06-10 · updated 2026-07-24 · 26 min · 5465 words

A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification

2026-06-09 · updated 2026-07-24 · 2 min · 306 words

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

2026-06-09 · updated 2026-07-24 · 5 min · 1017 words

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

2026-06-09 · updated 2026-07-24 · 3 min · 443 words

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

2026-06-09 · updated 2026-07-24 · 2 min · 381 words

A study on the impact of region specific data on the performance of Indic ASR

2026-06-09 · updated 2026-07-24 · 2 min · 261 words

AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals

2026-06-09 · updated 2026-07-24 · 2 min · 241 words

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

2026-06-09 · updated 2026-07-24 · 3 min · 515 words

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

2026-06-09 · updated 2026-07-24 · 2 min · 325 words

BareWave: Waveform-Native Flow-Matching Text-to-Speech

2026-06-09 · updated 2026-07-24 · 3 min · 591 words

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

2026-06-09 · updated 2026-07-24 · 2 min · 329 words

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

2026-06-09 · updated 2026-07-24 · 2 min · 352 words

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

2026-06-09 · updated 2026-07-24 · 3 min · 449 words

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

2026-06-09 · updated 2026-07-24 · 3 min · 482 words

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

2026-06-09 · updated 2026-07-24 · 2 min · 281 words

End-to-End Training for Discrete Token LLM based TTS System

2026-06-09 · updated 2026-07-24 · 3 min · 526 words

Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

2026-06-09 · updated 2026-07-24 · 2 min · 394 words

Factors affecting ASR performance: A study using state of the art ASR models in Indic Languages

2026-06-09 · updated 2026-07-24 · 2 min · 297 words

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

2026-06-09 · updated 2026-07-24 · 2 min · 375 words

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

2026-06-09 · updated 2026-07-24 · 2 min · 257 words

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

2026-06-09 · updated 2026-07-24 · 2 min · 284 words

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

2026-06-09 · 更新于 2026-07-24 · 4 min · 704 words

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

2026-06-09 · 更新于 2026-07-24 · 1 min · 205 words

G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching

2026-06-09 · 更新于 2026-07-24 · 2 min · 329 words

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

2026-06-09 · 更新于 2026-07-24 · 3 min · 576 words

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

2026-06-09 · 更新于 2026-07-24 · 3 min · 437 words

Liberating LLM Capabilities in Full-Duplex Speech Models

2026-06-09 · 更新于 2026-07-24 · 3 min · 495 words

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

2026-06-09 · 更新于 2026-07-24 · 4 min · 702 words

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

2026-06-09 · 更新于 2026-07-24 · 4 min · 841 words

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

2026-06-09 · 更新于 2026-07-24 · 3 min · 500 words

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

2026-06-09 · 更新于 2026-07-24 · 3 min · 466 words

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

2026-06-09 · 更新于 2026-07-24 · 2 min · 393 words

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

2026-06-09 · 更新于 2026-07-24 · 3 min · 630 words

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

2026-06-09 · 更新于 2026-07-24 · 2 min · 360 words

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

2026-06-09 · 更新于 2026-07-24 · 2 min · 396 words

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

2026-06-09 · 更新于 2026-07-24 · 3 min · 447 words

Parameter-Efficient Continual Learning for Automatic Speech Recognition

2026-06-09 · 更新于 2026-07-24 · 3 min · 506 words

Predictive Fixed-Filter Active Noise Control (PFANC) Using Convolutional Recurrent Neural Networks for Dynamic Noises

2026-06-09 · 更新于 2026-07-24 · 2 min · 269 words

Probing Token Spaces under Generator Shift in AI-Generated Music Detection

2026-06-09 · 更新于 2026-07-24 · 3 min · 434 words

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

2026-06-09 · 更新于 2026-07-24 · 2 min · 410 words

Rethinking Depth: A study of the Recursive-Transformer for Speech Recognition

2026-06-09 · 更新于 2026-07-24 · 2 min · 415 words

SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation

2026-06-09 · 更新于 2026-07-24 · 3 min · 438 words

Sound Field Interpolation Using Physics-Informed Extreme Learning Machine with Pre-Training

2026-06-09 · 更新于 2026-07-24 · 2 min · 308 words

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

2026-06-09 · 更新于 2026-07-24 · 2 min · 291 words

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

2026-06-09 · 更新于 2026-07-24 · 3 min · 592 words

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

2026-06-09 · 更新于 2026-07-24 · 4 min · 653 words

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

2026-06-09 · 更新于 2026-07-24 · 2 min · 319 words

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

2026-06-09 · 更新于 2026-07-24 · 1 min · 190 words

Your U-Net Dereverberation Model is Secretly an RIR Encoder

2026-06-09 · 更新于 2026-07-24 · 2 min · 224 words

Speech/Music/Audio Paper Digest 2026-06-09

2026-06-09 · 更新于 2026-07-24 · 29 min · 6000 words

A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization

2026-06-08 · 更新于 2026-07-24 · 2 min · 228 words

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

2026-06-08 · 更新于 2026-07-24 · 2 min · 359 words

Assessing True Generalisability of Audio-Visual Speech Recognisers

2026-06-08 · 更新于 2026-07-24 · 3 min · 480 words

Audio Imitator: Controlling Timbre and Tempo in Video2Audio Synthesis with Audio Reference

2026-06-08 · 更新于 2026-07-24 · 3 min · 552 words

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

2026-06-08 · 更新于 2026-07-24 · 3 min · 509 words

Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models

2026-06-08 · 更新于 2026-07-24 · 4 min · 691 words

BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation

2026-06-08 · 更新于 2026-07-24 · 4 min · 741 words

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

2026-06-08 · 更新于 2026-07-24 · 2 min · 371 words

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

2026-06-08 · 更新于 2026-07-24 · 3 min · 530 words

dots.tts Technical Report

2026-06-08 · 更新于 2026-07-24 · 1 min · 188 words

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

2026-06-08 · 更新于 2026-07-24 · 2 min · 279 words

FIGMA: Towards FIne-Grained Music retrievAl

2026-06-08 · 更新于 2026-07-24 · 3 min · 566 words

FSC-Net: Integrating Fast Fourier Convolutions and Progressive Learning for Speech Bandwidth Extension

2026-06-08 · 更新于 2026-07-24 · 4 min · 791 words

Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition

2026-06-08 · 更新于 2026-07-24 · 2 min · 404 words

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

2026-06-08 · 更新于 2026-07-24 · 3 min · 440 words

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

2026-06-08 · 更新于 2026-07-24 · 2 min · 340 words

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

2026-06-08 · 更新于 2026-07-24 · 2 min · 276 words

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

2026-06-08 · 更新于 2026-07-24 · 2 min · 420 words

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

2026-06-08 · 更新于 2026-07-24 · 1 min · 151 words

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

2026-06-08 · 更新于 2026-07-24 · 2 min · 412 words

Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference

2026-06-08 · 更新于 2026-07-24 · 3 min · 601 words

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

2026-06-08 · 更新于 2026-07-24 · 3 min · 440 words

Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

2026-06-08 · 更新于 2026-07-24 · 3 min · 438 words

MMAE: A Massive Multitask Audio Editing Benchmark

2026-06-08 · 更新于 2026-07-24 · 1 min · 148 words

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

2026-06-08 · 更新于 2026-07-24 · 4 min · 669 words

MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds

2026-06-08 · 更新于 2026-07-24 · 2 min · 312 words

Phonetic Error Analysis of Raw Waveform Acoustic Models

2026-06-08 · 更新于 2026-07-24 · 2 min · 301 words

SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails

2026-06-08 · 更新于 2026-07-24 · 3 min · 436 words

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

2026-06-08 · 更新于 2026-07-24 · 2 min · 420 words

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

2026-06-08 · 更新于 2026-07-24 · 2 min · 337 words

TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

2026-06-08 · 更新于 2026-07-24 · 2 min · 319 words

Towards Event-Robust Acoustic Scene Classification

2026-06-08 · 更新于 2026-07-24 · 1 min · 212 words

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

2026-06-08 · 更新于 2026-07-24 · 2 min · 386 words

VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track

2026-06-08 · 更新于 2026-07-24 · 2 min · 415 words

VoxCPM2 Technical Report

2026-06-08 · 更新于 2026-07-24 · 5 min · 1038 words

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

2026-06-08 · 更新于 2026-07-24 · 2 min · 247 words

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

2026-06-08 · 更新于 2026-07-24 · 3 min · 625 words

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

2026-06-08 · 更新于 2026-07-24 · 3 min · 627 words

Speech/Music/Audio Paper Digest 2026-06-08

2026-06-08 · 更新于 2026-07-24 · 23 min · 4800 words

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

2026-06-05 · 更新于 2026-07-24 · 1 min · 204 words

Age-Aware Adapter Tuning for Children's Speech Recognition

2026-06-05 · 更新于 2026-07-24 · 2 min · 408 words

An ERP Study on Recursive Locative Processing in Mandarin-Speaking Children with Autism

2026-06-05 · 更新于 2026-07-24 · 3 min · 622 words

An Ultra-Low-Bitrate Neural Speech Codec with Plain-to-Pseudo Synergistic Vector Quantization

2026-06-05 · 更新于 2026-07-24 · 2 min · 302 words

Audio Interaction Model

2026-06-05 · 更新于 2026-07-24 · 4 min · 718 words

Automatic Labelling of Speech Translation Errors

2026-06-05 · 更新于 2026-07-24 · 2 min · 366 words

Beyond Generative Decoding: Discriminative Hidden-State Readout from a Native Omni-Modal LLM for Multimodal Sentiment Analysis

2026-06-05 · 更新于 2026-07-24 · 3 min · 634 words

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

2026-06-05 · 更新于 2026-07-24 · 2 min · 229 words

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

2026-06-05 · 更新于 2026-07-24 · 2 min · 408 words

Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes

2026-06-05 · 更新于 2026-07-24 · 2 min · 379 words

CoSTA: Cognitive-State-Conditioned TTS Data Augmentation Using ASR Transcripts for Alzheimer's Disease Detection

2026-06-05 · 更新于 2026-07-24 · 1 min · 160 words

DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement

2026-06-05 · 更新于 2026-07-24 · 2 min · 372 words

Do speech foundation models perceive speaker similarity as humans do?

2026-06-05 · 更新于 2026-07-24 · 2 min · 266 words

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

2026-06-05 · 更新于 2026-07-24 · 2 min · 340 words

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

2026-06-05 · 更新于 2026-07-24 · 3 min · 446 words

Enhancing Audio Captioning with Auxiliary AudioSet Semantics

2026-06-05 · 更新于 2026-07-24 · 4 min · 646 words

Exploring LLMs for South Asian Music Understanding and Generation

2026-06-05 · 更新于 2026-07-24 · 1 min · 187 words

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

2026-06-05 · 更新于 2026-07-24 · 2 min · 355 words

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

2026-06-05 · 更新于 2026-07-24 · 3 min · 514 words

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

2026-06-05 · 更新于 2026-07-24 · 5 min · 911 words

Forgive or forget: Understanding the context of hate in audio retrieval systems

2026-06-05 · 更新于 2026-07-24 · 3 min · 531 words

FORTE: FOL-guided Optimal Refinement for Text-audio rEtrieval

2026-06-05 · 更新于 2026-07-24 · 2 min · 381 words

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

2026-06-05 · 更新于 2026-07-24 · 3 min · 519 words

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

2026-06-05 · 更新于 2026-07-24 · 2 min · 376 words

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

2026-06-05 · 更新于 2026-07-24 · 2 min · 318 words

M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

2026-06-05 · 更新于 2026-07-24 · 1 min · 195 words

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

2026-06-05 · 更新于 2026-07-24 · 2 min · 393 words

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

2026-06-05 · 更新于 2026-07-24 · 2 min · 256 words

Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach

2026-06-05 · 更新于 2026-07-24 · 2 min · 260 words

nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

2026-06-05 · 更新于 2026-07-24 · 2 min · 258 words

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

2026-06-05 · 更新于 2026-07-24 · 2 min · 348 words

Probing Spatial Structure in Pretrained Audio Representations

2026-06-05 · 更新于 2026-07-24 · 1 min · 163 words

ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

2026-06-05 · 更新于 2026-07-24 · 3 min · 579 words

Revisiting Lexicon Evaluation in Unsupervised Word Discovery

2026-06-05 · 更新于 2026-07-24 · 2 min · 214 words

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

2026-06-05 · 更新于 2026-07-24 · 2 min · 341 words

SB-RF: Schrödinger Bridge Rectified Flow for One-Step Robust Speech Enhancement

2026-06-05 · 更新于 2026-07-24 · 3 min · 450 words

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

2026-06-05 · 更新于 2026-07-24 · 3 min · 486 words

Sound Effects Dataset Unification With the Universal Category System

2026-06-05 · 更新于 2026-07-24 · 2 min · 324 words

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

2026-06-05 · 更新于 2026-07-24 · 6 min · 1150 words

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

2026-06-05 · 更新于 2026-07-24 · 2 min · 383 words

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

2026-06-05 · 更新于 2026-07-24 · 3 min · 431 words

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

2026-06-05 · 更新于 2026-07-24 · 4 min · 782 words

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

2026-06-05 · 更新于 2026-07-24 · 3 min · 523 words

UniVoice: A Unified Model for Speech and Singing Voice Generation

2026-06-05 · 更新于 2026-07-24 · 2 min · 320 words

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

2026-06-05 · 更新于 2026-07-24 · 2 min · 399 words

VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization

2026-06-05 · 更新于 2026-07-24 · 3 min · 456 words

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

2026-06-05 · 更新于 2026-07-24 · 2 min · 250 words

Speech/Music/Audio Paper Digest 2026-06-05

2026-06-05 · 更新于 2026-07-24 · 28 min · 5851 words

A Second-Order Cepstral Signature of Contact-Vibration Sounds Reproduced by Laptop Loudspeakers: A Synthetic Case Study

2026-06-04 · 更新于 2026-07-24 · 2 min · 260 words

Channel-Oriented Design for EEG-to-Music Reconstruction

2026-06-04 · 更新于 2026-07-24 · 2 min · 382 words

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

2026-06-04 · 更新于 2026-07-24 · 4 min · 720 words

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

2026-06-04 · 更新于 2026-07-24 · 2 min · 257 words

Differentiable Articulatory Copy-Synthesis of Biphonic Singing

2026-06-04 · 更新于 2026-07-24 · 4 min · 689 words

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

2026-06-04 · 更新于 2026-07-24 · 3 min · 443 words

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

2026-06-04 · 更新于 2026-07-24 · 5 min · 1043 words

Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid

2026-06-04 · 更新于 2026-07-24 · 3 min · 445 words

Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching

2026-06-04 · 更新于 2026-07-24 · 2 min · 255 words

Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses

2026-06-04 · 更新于 2026-07-24 · 1 min · 170 words

Masked Wavelet Scattering Transform Neural Field for Sound Field Reconstruction

2026-06-04 · 更新于 2026-07-24 · 3 min · 455 words

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

2026-06-04 · 更新于 2026-07-24 · 3 min · 569 words

Neural Radiated-Noise Fields for Unmanned Underwater Vehicle Noise Spectrum Prediction in Three-Dimensional Scenes

2026-06-04 · 更新于 2026-07-24 · 2 min · 290 words

Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

2026-06-04 · 更新于 2026-07-24 · 3 min · 577 words

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

2026-06-04 · 更新于 2026-07-24 · 1 min · 121 words

Representation Matters in Randomized Smoothing for Audio Classification

2026-06-04 · 更新于 2026-07-24 · 2 min · 321 words

SHB-AE: Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array

2026-06-04 · 更新于 2026-07-24 · 2 min · 305 words

SURF: Separation via Unsupervised Remixing Flow

2026-06-04 · 更新于 2026-07-24 · 2 min · 282 words

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

2026-06-04 · 更新于 2026-07-24 · 2 min · 282 words

The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids

2026-06-04 · 更新于 2026-07-24 · 2 min · 313 words

UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning

2026-06-04 · 更新于 2026-07-24 · 3 min · 613 words

Video2LoRA: Parametric Video Internalization for Vision-Language Models

2026-06-04 · 更新于 2026-07-24 · 1 min · 139 words

Speech/Music/Audio Paper Digest 2026-06-04

2026-06-04 · 更新于 2026-07-24 · 14 min · 2920 words

A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination

2026-06-03 · 更新于 2026-07-24 · 4 min · 703 words

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

2026-06-03 · 更新于 2026-07-24 · 3 min · 572 words

A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5

2026-06-03 · 更新于 2026-07-24 · 3 min · 473 words

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

2026-06-03 · 更新于 2026-07-24 · 2 min · 366 words

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

2026-06-03 · 更新于 2026-07-24 · 3 min · 613 words

Audio Spotforming via Post-Filtering Using Cross-Array Non-target Estimates

2026-06-03 · 更新于 2026-07-24 · 4 min · 747 words

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

2026-06-03 · 更新于 2026-07-24 · 2 min · 254 words

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

2026-06-03 · 更新于 2026-07-24 · 2 min · 296 words

Benchmarking Speech-to-Speech Translation Models

2026-06-03 · 更新于 2026-07-24 · 2 min · 343 words

Breaking the Pair: Evaluating Dyadic Interaction via Speaker Switching

2026-06-03 · 更新于 2026-07-24 · 2 min · 337 words

C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification

2026-06-03 · 更新于 2026-07-24 · 2 min · 233 words

Cosmos 3: Omnimodal World Models for Physical AI

2026-06-03 · 更新于 2026-07-24 · 3 min · 629 words

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

2026-06-03 · 更新于 2026-07-24 · 3 min · 452 words

Diffusion-Based Heart Sound Generation: Evaluation with Physiological Signal Metrics, Classifiers, and Expert Listening

2026-06-03 · 更新于 2026-07-24 · 2 min · 330 words

Domain-Agnostic Incremental Learning for Sound Classification. A DCASE 2026 Challenge task

2026-06-03 · 更新于 2026-07-24 · 1 min · 146 words

Efficient ASR Training with Conversations that Never Happened

2026-06-03 · 更新于 2026-07-24 · 3 min · 509 words

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

2026-06-03 · 更新于 2026-07-24 · 2 min · 349 words

Exploiting Noise Inseparability for Weakly-Supervised Discriminative Speech Denoising Using Noisy Targets

2026-06-03 · 更新于 2026-07-24 · 2 min · 406 words

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

2026-06-03 · 更新于 2026-07-24 · 3 min · 476 words

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

2026-06-03 · 更新于 2026-07-24 · 2 min · 366 words

In-the-Loop Training of Deep Feedback Cancellation for Hearing Aids

2026-06-03 · 更新于 2026-07-24 · 2 min · 269 words

Inference-Time Scaling for Joint Audio-Video Generation

2026-06-03 · 更新于 2026-07-24 · 2 min · 344 words

LiveBand: Live Accompaniment Generation in the Audio Domain

2026-06-03 · 更新于 2026-07-24 · 3 min · 502 words

Localizing broadband noise sources using the Loève spectrum and a 2.5D approach

2026-06-03 · 更新于 2026-07-24 · 2 min · 324 words

Logit Distillation on Manifolds: Mapping by Learning

2026-06-03 · 更新于 2026-07-24 · 3 min · 509 words

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

2026-06-03 · 更新于 2026-07-24 · 2 min · 400 words

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

2026-06-03 · 更新于 2026-07-24 · 3 min · 438 words

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

2026-06-03 · 更新于 2026-07-24 · 4 min · 720 words

SegTune: Structured and Fine-Grained Control for Song Generation

2026-06-03 · 更新于 2026-07-24 · 3 min · 451 words

SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment

2026-06-03 · 更新于 2026-07-24 · 2 min · 328 words

SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

2026-06-03 · 更新于 2026-07-24 · 5 min · 933 words

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

2026-06-03 · 更新于 2026-07-24 · 3 min · 454 words

SpeakerCard-1M: An Evidence-Grounded Speaker Card Corpus for In-the-Wild Speaker Verification

2026-06-03 · 更新于 2026-07-24 · 3 min · 508 words

Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

2026-06-03 · 更新于 2026-07-24 · 3 min · 459 words

Stable Hybrid Cross-Attention Fusion for Audio-Visual Event Recognition

2026-06-03 · 更新于 2026-07-24 · 2 min · 399 words

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

2026-06-03 · 更新于 2026-07-24 · 2 min · 375 words

The DeepSpeak-Agentic Dataset

2026-06-03 · 更新于 2026-07-24 · 2 min · 333 words

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

2026-06-03 · 更新于 2026-07-24 · 2 min · 362 words

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

2026-06-03 · 更新于 2026-07-24 · 3 min · 437 words

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

2026-06-03 · 更新于 2026-07-24 · 3 min · 598 words

Speech/Music/Audio Paper Digest 2026-06-03

2026-06-03 · 更新于 2026-07-24 · 26 min · 5337 words

A 1000-hour EEG-EMG-audio dataset of Japanese speech production

2026-06-02 · 更新于 2026-07-24 · 4 min · 663 words

A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation

2026-06-02 · 更新于 2026-07-24 · 3 min · 611 words

Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning

2026-06-02 · 更新于 2026-07-24 · 3 min · 598 words

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

2026-06-02 · 更新于 2026-07-24 · 2 min · 279 words

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

2026-06-02 · 更新于 2026-07-24 · 3 min · 618 words

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

2026-06-02 · 更新于 2026-07-24 · 3 min · 448 words

Context-aware child-directed speech detection from long-form recordings

2026-06-02 · 更新于 2026-07-24 · 2 min · 318 words

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

2026-06-02 · 更新于 2026-07-24 · 1 min · 165 words

Description and Discussion on DCASE 2026 Challenge Task 2: Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

2026-06-02 · 更新于 2026-07-24 · 2 min · 331 words

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

2026-06-02 · 更新于 2026-07-24 · 2 min · 376 words

Dynamic Interaction-Aware and Causality-Disentangled Framework for Multimodal Sentiment Analysis

2026-06-02 · 更新于 2026-07-24 · 3 min · 496 words

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

2026-06-02 · 更新于 2026-07-24 · 4 min · 672 words

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

2026-06-02 · 更新于 2026-07-24 · 3 min · 502 words

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

2026-06-02 · 更新于 2026-07-24 · 2 min · 357 words

Kinship Verification Using Voice

2026-06-02 · 更新于 2026-07-24 · 2 min · 310 words

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

2026-06-02 · 更新于 2026-07-24 · 2 min · 322 words

MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

2026-06-02 · 更新于 2026-07-24 · 3 min · 500 words

MOSS-Audio Technical Report

2026-06-02 · 更新于 2026-07-24 · 3 min · 626 words

Multimodal Music Recommendation System using LLMs

2026-06-02 · 更新于 2026-07-24 · 2 min · 416 words

MURMUR: An Efficient Inference System for Long-Form ASR

2026-06-02 · 更新于 2026-07-24 · 1 min · 127 words

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

2026-06-02 · 更新于 2026-07-24 · 3 min · 567 words

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

2026-06-02 · 更新于 2026-07-24 · 2 min · 244 words

Privacy-preserving Prosody Representation Learning

2026-06-02 · 更新于 2026-07-24 · 2 min · 301 words

Project SPARROW and the Future of Conservation Technology

2026-06-02 · 更新于 2026-07-24 · 2 min · 356 words

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

2026-06-02 · 更新于 2026-07-24 · 1 min · 210 words

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

2026-06-02 · 更新于 2026-07-24 · 5 min · 854 words

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

2026-06-02 · 更新于 2026-07-24 · 1 min · 143 words

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

2026-06-02 · 更新于 2026-07-24 · 3 min · 488 words

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

2026-06-02 · 更新于 2026-07-24 · 4 min · 712 words

Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition

2026-06-02 · 更新于 2026-07-24 · 2 min · 366 words

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

2026-06-02 · 更新于 2026-07-24 · 2 min · 401 words

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

2026-06-02 · 更新于 2026-07-24 · 2 min · 324 words

UniVocal: Unified Speech-Singing Code-Switching Synthesis

2026-06-02 · 更新于 2026-07-24 · 1 min · 132 words

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

2026-06-02 · 更新于 2026-07-24 · 3 min · 561 words

When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes

2026-06-02 · 更新于 2026-07-24 · 2 min · 341 words

Speech/Music/Audio Paper Digest 2026-06-02

2026-06-02 · 更新于 2026-07-24 · 21 min · 4469 words

3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark

2026-06-01 · 更新于 2026-07-24 · 3 min · 464 words

A Unified and Reproducible Experimentation Framework for Speech Understanding

2026-06-01 · 更新于 2026-07-24 · 3 min · 535 words

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

2026-06-01 · 更新于 2026-07-24 · 3 min · 529 words

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors

2026-06-01 · 更新于 2026-07-24 · 3 min · 522 words

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

2026-06-01 · 更新于 2026-07-24 · 1 min · 190 words

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

2026-06-01 · 更新于 2026-07-24 · 3 min · 570 words

Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection

2026-06-01 · 更新于 2026-07-24 · 3 min · 592 words

Extracting accent features in spoken Brazilian Portuguese without sociolinguistic labels

2026-06-01 · 更新于 2026-07-24 · 3 min · 441 words

FiPA-SR – FiLM-Conditioned Perceptually Informed Audio Super-Resolution

2026-06-01 · 更新于 2026-07-24 · 2 min · 293 words

GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement

2026-06-01 · 更新于 2026-07-24 · 2 min · 350 words

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

2026-06-01 · 更新于 2026-07-24 · 2 min · 419 words

Improving acoustic drone detection generalization through pretraining and data augmentation

2026-06-01 · 更新于 2026-07-24 · 2 min · 301 words

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

2026-06-01 · 更新于 2026-07-24 · 2 min · 367 words

Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation

2026-06-01 · 更新于 2026-07-24 · 2 min · 272 words

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

2026-06-01 · 更新于 2026-07-24 · 2 min · 401 words

On the Use of Dereverberation for Acoustic Feedback Cancellation

2026-06-01 · 更新于 2026-07-24 · 2 min · 226 words

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

2026-06-01 · 更新于 2026-07-24 · 4 min · 731 words

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

2026-06-01 · 更新于 2026-07-24 · 3 min · 448 words

Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation

2026-06-01 · 更新于 2026-07-24 · 2 min · 299 words

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

2026-06-01 · 更新于 2026-07-24 · 3 min · 453 words

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

2026-06-01 · 更新于 2026-07-24 · 2 min · 426 words

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

2026-06-01 · 更新于 2026-07-24 · 3 min · 485 words

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

2026-06-01 · updated 2026-07-24 · 4 min · 838 words

Speech/Music/Audio Paper Digest 2026-06-01

2026-06-01 · updated 2026-07-24 · 12 min · 2552 words

May ¹⁰⁰³

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

2026-05-30 · updated 2026-07-24 · 3 min · 569 words

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

2026-05-30 · updated 2026-07-24 · 2 min · 274 words

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

2026-05-30 · updated 2026-07-24 · 3 min · 510 words

MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding

2026-05-30 · updated 2026-07-24 · 2 min · 365 words

PiAnnotate: A Web Annotation Tool for Piano Fingering, with a Diagnostic Probe

2026-05-30 · updated 2026-07-24 · 2 min · 305 words

Raon-Speech Technical Report

2026-05-30 · updated 2026-07-24 · 4 min · 730 words

Speech/Music/Audio Paper Digest 2026-05-30

2026-05-30 · updated 2026-07-24 · 3 min · 583 words

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

2026-05-29 · updated 2026-07-24 · 2 min · 298 words

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

2026-05-29 · updated 2026-07-24 · 2 min · 344 words

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

2026-05-29 · updated 2026-07-24 · 2 min · 323 words

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

2026-05-29 · updated 2026-07-24 · 2 min · 239 words

Benchmarking Single-Factor Physical Video-to-Audio Generation

2026-05-29 · updated 2026-07-24 · 3 min · 504 words

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

2026-05-29 · updated 2026-07-24 · 2 min · 264 words

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

2026-05-29 · updated 2026-07-24 · 4 min · 650 words

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

2026-05-29 · updated 2026-07-24 · 2 min · 234 words

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

2026-05-29 · updated 2026-07-24 · 2 min · 359 words

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

2026-05-29 · updated 2026-07-24 · 2 min · 358 words

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

2026-05-29 · updated 2026-07-24 · 1 min · 209 words

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

2026-05-29 · updated 2026-07-24 · 4 min · 673 words

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

2026-05-29 · updated 2026-07-24 · 1 min · 115 words

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

2026-05-29 · updated 2026-07-24 · 3 min · 481 words

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

2026-05-29 · updated 2026-07-24 · 3 min · 447 words

Native Audio-Visual Alignment for Generation

2026-05-29 · updated 2026-07-24 · 2 min · 386 words

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

2026-05-29 · updated 2026-07-24 · 2 min · 416 words

State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

2026-05-29 · updated 2026-07-24 · 5 min · 986 words

The WER Trap: Shattering the Illusion of Unified Tokens in Speech Language Models

2026-05-29 · updated 2026-07-24 · 3 min · 500 words

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

2026-05-29 · updated 2026-07-24 · 2 min · 425 words

Speech/Music/Audio Paper Digest 2026-05-29

2026-05-29 · updated 2026-07-24 · 10 min · 2103 words

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

2026-05-28 · updated 2026-07-24 · 3 min · 586 words

Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

2026-05-28 · updated 2026-07-24 · 3 min · 434 words

AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

2026-05-28 · updated 2026-07-24 · 2 min · 373 words

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

2026-05-28 · updated 2026-07-24 · 2 min · 350 words

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

2026-05-28 · updated 2026-07-24 · 2 min · 257 words

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

2026-05-28 · updated 2026-07-24 · 2 min · 239 words

Breaking the Script Barrier: Enabling Automatic Alignment for PoS-based ASR Error Analysis in Non-Latin Scripts

2026-05-28 · updated 2026-07-24 · 3 min · 446 words

Building Community-Centred NLP Resources for Puno Quechua

2026-05-28 · updated 2026-07-24 · 2 min · 385 words

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

2026-05-28 · updated 2026-07-24 · 3 min · 608 words

Cross-modal characterization of infant cry: validation of a chest-surface accelerometer in extracting acoustic vocal function measures

2026-05-28 · updated 2026-07-24 · 2 min · 354 words

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

2026-05-28 · updated 2026-07-24 · 3 min · 581 words

DEMON: Diffusion Engine for Musical Orchestrated Noise

2026-05-28 · updated 2026-07-24 · 2 min · 259 words

Diffusion Large Language Models for Visual Speech Recognition

2026-05-28 · updated 2026-07-24 · 2 min · 256 words

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

2026-05-28 · updated 2026-07-24 · 3 min · 554 words

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

2026-05-28 · updated 2026-07-24 · 2 min · 403 words

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

2026-05-28 · updated 2026-07-24 · 2 min · 384 words

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

2026-05-28 · updated 2026-07-24 · 3 min · 634 words

HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities

2026-05-28 · updated 2026-07-24 · 2 min · 334 words

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

2026-05-28 · updated 2026-07-24 · 2 min · 405 words

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

2026-05-28 · updated 2026-07-24 · 2 min · 422 words

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

2026-05-28 · updated 2026-07-24 · 3 min · 486 words

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

2026-05-28 · updated 2026-07-24 · 2 min · 296 words

Robust Quantum-MUSIC for DoA Estimation Using Rydberg Atomic Receiver Arrays

2026-05-28 · updated 2026-07-24 · 2 min · 380 words

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

2026-05-28 · updated 2026-07-24 · 2 min · 359 words

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

2026-05-28 · updated 2026-07-24 · 3 min · 555 words

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

2026-05-28 · updated 2026-07-24 · 3 min · 506 words

Utilizing Missed Detections in Directional Sensitivity-Based DOA Estimation

2026-05-28 · updated 2026-07-24 · 2 min · 360 words

VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

2026-05-28 · updated 2026-07-24 · 2 min · 389 words

When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

2026-05-28 · updated 2026-07-24 · 2 min · 225 words

Why We Need Speech to Evaluate Speech Translation

2026-05-28 · updated 2026-07-24 · 4 min · 684 words

Speech/Music/Audio Paper Digest 2026-05-28

2026-05-28 · updated 2026-07-24 · 15 min · 3187 words

A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

2026-05-27 · updated 2026-07-24 · 4 min · 675 words

An investigation of AI integration in sound designer workflows and experiences

2026-05-27 · updated 2026-07-24 · 1 min · 171 words

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

2026-05-27 · updated 2026-07-24 · 2 min · 331 words

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

2026-05-27 · updated 2026-07-24 · 2 min · 262 words

Can We Hear from Events? Generating Speech from Event Camera

2026-05-27 · updated 2026-07-24 · 3 min · 449 words

CFMDCTCodec: A Low-Bitrate Neural Speech Codec with Noise-Prior-aware Conditional Flow Matching for MDCT-Spectral Enhancement

2026-05-27 · updated 2026-07-24 · 3 min · 480 words

Continual Speaker Identity Unlearning with Minimal Interference

2026-05-27 · updated 2026-07-24 · 1 min · 126 words

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

2026-05-27 · updated 2026-07-24 · 2 min · 425 words

cSTMM: A Unified Complex Spherical Student’s Mixture Model for Directional Statistics in Mask-Based Blind Speech Separation

2026-05-27 · updated 2026-07-24 · 4 min · 716 words

Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets

2026-05-27 · updated 2026-07-24 · 3 min · 516 words

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation

2026-05-27 · updated 2026-07-24 · 4 min · 708 words

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

2026-05-27 · updated 2026-07-24 · 2 min · 298 words

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

2026-05-27 · updated 2026-07-24 · 3 min · 564 words

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

2026-05-27 · updated 2026-07-24 · 2 min · 281 words

FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

2026-05-27 · updated 2026-07-24 · 3 min · 508 words

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models

2026-05-27 · updated 2026-07-24 · 2 min · 370 words

G-iMUSIC: Greedy Iterative MUSIC Algorithms for Multi-Target DoA Estimation

2026-05-27 · updated 2026-07-24 · 2 min · 230 words

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

2026-05-27 · updated 2026-07-24 · 2 min · 243 words

Learning When to Think While Listening in Large Audio-Language Models

2026-05-27 · updated 2026-07-24 · 1 min · 143 words

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

2026-05-27 · updated 2026-07-24 · 3 min · 530 words

LongCat-Video-Avatar 1.5 Technical Report

2026-05-27 · updated 2026-07-24 · 2 min · 279 words

MERIT: Learning Disentangled Music Representations for Audio Similarity

2026-05-27 · updated 2026-07-24 · 2 min · 410 words

Music Transcription with (Almost) No Supervision

2026-05-27 · updated 2026-07-24 · 3 min · 516 words

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

2026-05-27 · updated 2026-07-24 · 3 min · 456 words

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

2026-05-27 · updated 2026-07-24 · 3 min · 480 words

PitchBench: Measuring Pitch Hearing in Audio-Language Models

2026-05-27 · updated 2026-07-24 · 3 min · 467 words

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

2026-05-27 · updated 2026-07-24 · 2 min · 241 words

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

2026-05-27 · updated 2026-07-24 · 1 min · 197 words

Rubato: Transcribing Piano Music with Timestamps

2026-05-27 · updated 2026-07-24 · 3 min · 515 words

Score-Agnostic Structure Analysis in Large-Scale Performance Datasets

2026-05-27 · updated 2026-07-24 · 2 min · 217 words

Subspace Track-before-Detect for Passive Multi-Target Tracking with Unknown Emitted Signals

2026-05-27 · updated 2026-07-24 · 2 min · 368 words

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

2026-05-27 · updated 2026-07-24 · 4 min · 833 words

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

2026-05-27 · updated 2026-07-24 · 2 min · 307 words

Time Segmented Beamforming via Dynamic Programming: Theory and Implementation

2026-05-27 · updated 2026-07-24 · 2 min · 331 words

Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control

2026-05-27 · updated 2026-07-24 · 2 min · 291 words

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

2026-05-27 · updated 2026-07-24 · 3 min · 540 words

WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders With Residual Denoising and Sub-Modeling for GAN and Diffusion Models

2026-05-27 · updated 2026-07-24 · 3 min · 569 words

Why Can’t They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

2026-05-27 · updated 2026-07-24 · 1 min · 116 words

Zero-Shot Parkinson’s Disease Detection from Speech: Comparing Large Audio and Language Models

2026-05-27 · updated 2026-07-24 · 3 min · 500 words

Speech/Music/Audio Paper Digest 2026-05-27

2026-05-27 · updated 2026-07-24 · 19 min · 3918 words

A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

2026-05-26 · updated 2026-07-24 · 2 min · 365 words

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

2026-05-26 · updated 2026-07-24 · 2 min · 359 words

Continual Speaker Identity Unlearning with Minimal Interference

2026-05-26 · updated 2026-07-24 · 3 min · 455 words

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

2026-05-26 · updated 2026-07-24 · 2 min · 364 words

cSTMM: A Unified Complex Spherical Student’s Mixture Model for Directional Statistics in Mask-Based Blind Speech Separation

2026-05-26 · updated 2026-07-24 · 3 min · 595 words

Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets

2026-05-26 · updated 2026-07-24 · 3 min · 509 words

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

2026-05-26 · updated 2026-07-24 · 2 min · 356 words

FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations

2026-05-26 · updated 2026-07-24 · 3 min · 452 words

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

2026-05-26 · updated 2026-07-24 · 3 min · 504 words

Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

2026-05-26 · updated 2026-07-24 · 3 min · 524 words

Music Transcription with (Almost) No Supervision

2026-05-26 · updated 2026-07-24 · 3 min · 491 words

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

2026-05-26 · updated 2026-07-24 · 4 min · 677 words

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

2026-05-26 · updated 2026-07-24 · 1 min · 142 words

Rubato: Transcribing Piano Music with Timestamps

2026-05-26 · updated 2026-07-24 · 2 min · 408 words

Score-Agnostic Structure Analysis in Large-Scale Performance Datasets

2026-05-26 · updated 2026-07-24 · 2 min · 272 words

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

2026-05-26 · updated 2026-07-24 · 2 min · 315 words

StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery

2026-05-26 · updated 2026-07-24 · 1 min · 165 words

Subspace Track-before-Detect for Passive Multi-Target Tracking with Unknown Emitted Signals

2026-05-26 · updated 2026-07-24 · 2 min · 281 words

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

2026-05-26 · updated 2026-07-24 · 5 min · 880 words

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

2026-05-26 · updated 2026-07-24 · 2 min · 323 words

The Symmetric Location Problem: a Song of Efficiency and Robustness

2026-05-26 · updated 2026-07-24 · 2 min · 317 words

Time Segmented Beamforming via Dynamic Programming: Theory and Implementation

2026-05-26 · updated 2026-07-24 · 2 min · 270 words

Toward Native Multimodal Modeling: A Roadmap

2026-05-26 · updated 2026-07-24 · 4 min · 803 words

Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control

2026-05-26 · updated 2026-07-24 · 2 min · 287 words

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

2026-05-26 · updated 2026-07-24 · 4 min · 688 words

WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders With Residual Denoising and Sub-Modeling for GAN and Diffusion Models

2026-05-26 · updated 2026-07-24 · 3 min · 552 words

Zero-Shot Parkinson’s Disease Detection from Speech: Comparing Large Audio and Language Models

2026-05-26 · updated 2026-07-24 · 3 min · 475 words

Speech/Music/Audio Paper Digest 2026-05-26

2026-05-26 · updated 2026-07-24 · 13 min · 2671 words

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

2026-05-25 · updated 2026-07-24 · 1 min · 158 words

A study on weakly-supervised training approaches for phoneme-level pronunciation scoring

2026-05-25 · updated 2026-07-24 · 2 min · 355 words

AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ

2026-05-25 · updated 2026-07-24 · 5 min · 962 words

Articulatory strategy as a source of variation in acoustic vowel dynamics

2026-05-25 · updated 2026-07-24 · 2 min · 296 words

Broad learning system with robust adaptive kernel

2026-05-25 · updated 2026-07-24 · 3 min · 610 words

Comprehensive Dataset and Signal Processing Framework for Phonocardiogram-Based Heart Rate and Blood Pressure Estimation

2026-05-25 · updated 2026-07-24 · 3 min · 469 words

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

2026-05-25 · updated 2026-07-24 · 3 min · 452 words

Copula-Induced Correntropy for Robust Conjugate Gradient Learning

2026-05-25 · updated 2026-07-24 · 2 min · 249 words

Cost-Effective Model Evaluation with Meta-Learning

2026-05-25 · updated 2026-07-24 · 2 min · 289 words

Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models

2026-05-25 · updated 2026-07-24 · 2 min · 423 words

Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier

2026-05-25 · updated 2026-07-24 · 2 min · 365 words

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

2026-05-25 · updated 2026-07-24 · 3 min · 454 words

Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

2026-05-25 · updated 2026-07-24 · 2 min · 352 words

MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

2026-05-25 · updated 2026-07-24 · 2 min · 382 words

Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection – Submission for WildSpoof 2026 TTS Track

2026-05-25 · updated 2026-07-24 · 2 min · 320 words

Self-Calibration DOA Estimation for Movable Antenna Systems with Antenna Position Errors

2026-05-25 · updated 2026-07-24 · 2 min · 331 words

StepAudio 2.5 Technical Report

2026-05-25 · updated 2026-07-24 · 2 min · 376 words

UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment

2026-05-25 · updated 2026-07-24 · 4 min · 724 words

Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss

2026-05-25 · updated 2026-07-24 · 3 min · 511 words

Speech/Music/Audio Paper Digest 2026-05-25

2026-05-25 · updated 2026-07-24 · 9 min · 1773 words

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Abstraction Induces the Brain Alignment of Language and Speech Models

2026-05-23 · updated 2026-07-24 · 1 min · 37 words

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

2026-05-23 · updated 2026-07-24 · 1 min · 28 words

ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools — From Consensus Learning to Ambiguity-Driven Emotion Reasoning

2026-05-23 · updated 2026-07-24 · 1 min · 29 words

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Alethia: a Foundational Encoder for Voice Deepfakes

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

Any-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Ariadne’s Thread of LipSync: Unraveling Forgeries via Inconsistency between Lip Motions and Head Poses

2026-05-23 · updated 2026-07-24 · 1 min · 25 words

AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

AudioMosaic: Contrastive Masked Audio Representation Learning

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

AVTrack: Audio-Visual Speaker Tracking in Complex Scenes

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

BAT: Better Audio Transformer Guided by Convex Gated Probing

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Bioacoustic Geolocation: Species Sounds as Geographic Signals

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

2026-05-23 · updated 2026-07-24 · 1 min · 26 words

Bridging Your Imagination with Audio-Video Generation via a Unified Director

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

2026-05-23 · updated 2026-07-24 · 1 min · 33 words

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

2026-05-23 · updated 2026-07-24 · 1 min · 24 words

DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

FakeWorld 1.0: An Omni-modal Benchmark for Fake Media and Content

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

FoeGlass: When Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

2026-05-23 · updated 2026-07-24 · 1 min · 24 words

From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Group Cognition Learning: Making Everything Better Through Controlled Two-Stage Agents Collaboration

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Hearing Without Noticing? Attention-Aware Stealthy Black-box Adversarial Audio Attacks

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

IVQ: Structured and Lightweight Vector Quantization via Binary Hierarchical Composition Inspired by IChing

2026-05-23 · updated 2026-07-24 · 1 min · 24 words

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

Language Model Augmented Semi-Supervised Statistical Inference

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

Learning Tight Rejection Boundaries without Negatives for Strict One-Class Audio Deepfake Detection

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

LightAVSeg: Lightweight Audio-Visual Segmentation

2026-05-23 · updated 2026-07-24 · 1 min · 15 words

Listening Through the Noise: Cauchy-Driven Diffusion Bridges for Robust Gastrointestinal Auscultation and Clinical Benchmarking

2026-05-23 · updated 2026-07-24 · 1 min · 25 words

Long Grounded Thoughts: Synthesizing Grounded Visual Problems and Distilling Reasoning Chains at Scale

2026-05-23 · updated 2026-07-24 · 1 min · 24 words

LynX: Token Interface Alignment for Video+X LLMs

2026-05-23 · updated 2026-07-24 · 1 min · 34 words

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

MetaBio: Learning from metadata for bioacoustics foundation models

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

MFCL Audio: An Audio Function Calling Evaluation for Large Language Models

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

2026-05-23 · updated 2026-07-24 · 2 min · 345 words

Multimodal Fusion via Self-Consistent Task-Gradient Fields

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

Multimodal Latent Language Modeling with Next-Token Diffusion

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

Multiple Choice Learning of Low-Rank Adapters for Language Modeling

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

MusicDET: Zero-Shot AI-Generated Music Detection

2026-05-23 · updated 2026-07-24 · 1 min · 16 words

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

2026-05-23 · updated 2026-07-24 · 1 min · 25 words

Neural-Inspired Modeling of Auditory Selection and Compensation for Audio-Visual Speech Separation

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

OmniDenseCap: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

OmniShow: Orchestrating Multimodal Conditions for Human-Object Interaction Video Generation

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Optimality of FSQ tokens for continuous diffusion for categorical data with application to text-to-speech

2026-05-23 · updated 2026-07-24 · 1 min · 25 words

PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation

2026-05-23 · updated 2026-07-24 · 1 min · 25 words

PCRNet: Phase-aware Complex Refinement Network for EEG-based Auditory Attention Decoding

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

PHALAR: Phasors for Learned Musical Audio Representations

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Polyphonia: Training-Free Context-Aware Music Editing with Acoustic-Informed Attention Calibration

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

Position: Beyond Text: The Text-Centric Bias in Foundation Models Must Be Revisited for a Speech-First Future

2026-05-23 · updated 2026-07-24 · 1 min · 27 words

Position: Towards Responsible Evaluation for Text-to-Speech

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

PRIM: Cooperative Dynamic Token Compression for Efficient Large Multimodal Models

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

Probing Cross-modal Information Hubs in Audio-Visual LLMs

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

Quaternion Self-Attention with Shared Scores

2026-05-23 · updated 2026-07-24 · 1 min · 16 words

Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Real-World Unsupervised Models Generalize to Predict Brain Responses to Out-of-Distribution Stimuli

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

ReGen: Hierarchical Multi-Prompt Representation Generation for Efficient Waveform Diffusion Models

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

2026-05-23 · updated 2026-07-24 · 1 min · 26 words

Rethinking Attention in Spiking Transformers: Overcoming Density Bias with Set Similarity

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Robust Signal Enhancement via Fractional Detail Views and Knowledge Guided Multi-view Fusion

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

S3Audio: Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

SAM Audio: Segment Anything in Audio

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering

2026-05-23 · updated 2026-07-24 · 1 min · 21 words

Scaling Laws in Model Fine-tuning for Audio DeepFake Detection

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

Scaling Transformers for End-to-End Discrete Audio Tokenization

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

Simultaneous Speech-to-Speech Translation Without Aligned Data

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

SONAR: Spectral‑Contrastive Audio Residuals for Generalizable Deepfake Detection

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Defense with SALMONN-Guard

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Spherical Procrustes Alignment for Reliable Medical Audio Diagnosis

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation

2026-05-23 · updated 2026-07-24 · 1 min · 20 words

STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

tau-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

2026-05-23 · updated 2026-07-24 · 0 min · 0 words

TextME: Bridging Unseen Modalities Through Text Descriptions

2026-05-23 · updated 2026-07-24 · 1 min · 18 words

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

2026-05-23 · updated 2026-07-24 · 1 min · 25 words

TMD-Bench: A Multi-Level Evaluation Paradigm for Music–Dance Co-Generation

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

2026-05-23 · updated 2026-07-24 · 1 min · 23 words

Two-dimensional quantization for geometry-aware audio coding

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

Verifiable Multimodal Reasoning: Fact-level Attribution with Multimodal Sources

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

2026-05-23 · updated 2026-07-24 · 1 min · 17 words

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

2026-05-23 · updated 2026-07-24 · 1 min · 22 words

WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention

2026-05-23 · updated 2026-07-24 · 1 min · 19 words

Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language

2026-05-23 · updated 2026-07-24 · 1 min · 24 words

Speech/Music/Audio Paper Digest 2026-05-23

2026-05-23 · updated 2026-07-24 · 16 min · 3402 words

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

2026-05-22 · updated 2026-07-24 · 2 min · 372 words

Automatic Contextual Audio Denoising

2026-05-22 · updated 2026-07-24 · 2 min · 342 words

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

2026-05-22 · updated 2026-07-24 · 2 min · 396 words

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

2026-05-22 · updated 2026-07-24 · 2 min · 252 words

Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation

2026-05-22 · updated 2026-07-24 · 3 min · 473 words

From Volterra Series to Kunchenko Stochastic Polynomials: Half a Century of Non-Gaussian Estimation Methodology

2026-05-22 · updated 2026-07-24 · 2 min · 310 words

In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks

2026-05-22 · updated 2026-07-24 · 2 min · 260 words

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

2026-05-22 · updated 2026-07-24 · 3 min · 486 words

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

2026-05-22 · updated 2026-07-24 · 3 min · 541 words

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

2026-05-22 · updated 2026-07-24 · 2 min · 349 words

Neighbor-Consistent Neural Filters for Robust Personal Sound Zones Under Localization Uncertainty

2026-05-22 · updated 2026-07-24 · 3 min · 536 words

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

2026-05-22 · updated 2026-07-24 · 2 min · 405 words

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

2026-05-22 · updated 2026-07-24 · 4 min · 708 words

Real-time, EDM-inspired sonification of the activity of a supercomputer

2026-05-22 · updated 2026-07-24 · 2 min · 227 words

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

2026-05-22 · updated 2026-07-24 · 3 min · 435 words

Speech/Music/Audio Paper Digest 2026-05-22

2026-05-22 · updated 2026-07-24 · 8 min · 1596 words

A conceptual framework for learning to listen by reward: Curiosity-driven search for novel sources

2026-05-21 · updated 2026-07-24 · 2 min · 358 words

A strongly annotated passive acoustic dataset for tropical bird monitoring

2026-05-21 · updated 2026-07-24 · 3 min · 558 words

A Survey of Audio Reasoning in Multimodal Foundation Models

2026-05-21 · updated 2026-07-24 · 2 min · 320 words

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

2026-05-21 · updated 2026-07-24 · 3 min · 491 words

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

2026-05-21 · updated 2026-07-24 · 2 min · 406 words

Causal Spatio-Temporal Sound Field Reconstruction

2026-05-21 · updated 2026-07-24 · 2 min · 274 words

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

2026-05-21 · updated 2026-07-24 · 2 min · 323 words

Codec-Robust Attacks on Audio LLMs

2026-05-21 · updated 2026-07-24 · 3 min · 429 words

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

2026-05-21 · updated 2026-07-24 · 2 min · 401 words

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

2026-05-21 · updated 2026-07-24 · 3 min · 588 words

Cross-Talk Speech Reduction, by Separation, for Separation

2026-05-21 · updated 2026-07-24 · 5 min · 887 words

DASM: Domain-Aware Sharpness Minimization for Multi-Domain Voice Stream Steganalysis

2026-05-21 · updated 2026-07-24 · 3 min · 439 words

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

2026-05-21 · updated 2026-07-24 · 2 min · 416 words

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

2026-05-21 · updated 2026-07-24 · 2 min · 353 words

Executable Boundary Contracts for Sound Event Traces

2026-05-21 · updated 2026-07-24 · 3 min · 609 words

Fast Multichannel NMF with Block-Diagonal Spatial Covariance Matrices for Efficient Blind Source Separation Using Distributed Microphone Arrays

2026-05-21 · updated 2026-07-24 · 2 min · 362 words

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

2026-05-21 · updated 2026-07-24 · 3 min · 547 words

FormalASR: End-to-End Spoken Chinese to Formal Text

2026-05-21 · updated 2026-07-24 · 3 min · 473 words

From Numbers to Perception, Energy Decay Curves Prediction

2026-05-21 · updated 2026-07-24 · 2 min · 314 words

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

2026-05-21 · updated 2026-07-24 · 2 min · 418 words

Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches

2026-05-21 · updated 2026-07-24 · 2 min · 400 words

Linearly Constrained Deep Beamformer for Multi-Speaker Scenarios

2026-05-21 · updated 2026-07-24 · 2 min · 363 words

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

2026-05-21 · updated 2026-07-24 · 3 min · 481 words

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

2026-05-21 · updated 2026-07-24 · 2 min · 374 words

Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching

2026-05-21 · updated 2026-07-24 · 2 min · 264 words

Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

2026-05-21 · updated 2026-07-24 · 3 min · 589 words

Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

2026-05-21 · updated 2026-07-24 · 3 min · 559 words

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

2026-05-21 · updated 2026-07-24 · 4 min · 643 words

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

2026-05-21 · updated 2026-07-24 · 4 min · 644 words

PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

2026-05-21 · updated 2026-07-24 · 3 min · 511 words

Precise and Simple Audio-to-Score Alignment

2026-05-21 · updated 2026-07-24 · 2 min · 408 words

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

2026-05-21 · updated 2026-07-24 · 3 min · 542 words

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

2026-05-21 · updated 2026-07-24 · 3 min · 466 words

SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring

2026-05-21 · updated 2026-07-24 · 2 min · 358 words

Speech Quality Embeddings for Improved Detection and Classification of Degradations in Speech Signals

2026-05-21 · updated 2026-07-24 · 2 min · 400 words

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

2026-05-21 · updated 2026-07-24 · 3 min · 527 words

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

2026-05-21 · updated 2026-07-24 · 2 min · 285 words

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

2026-05-21 · updated 2026-07-24 · 3 min · 428 words

Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts

2026-05-21 · updated 2026-07-24 · 3 min · 498 words

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

2026-05-21 · updated 2026-07-24 · 2 min · 227 words

Speech/Music/Audio Paper Digest 2026-05-21

2026-05-21 · updated 2026-07-24 · 26 min · 5389 words

A conceptual framework for learning to listen by reward: Curiosity-driven search for novel sources

2026-05-20 · updated 2026-07-24 · 3 min · 449 words

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

2026-05-20 · updated 2026-07-24 · 2 min · 371 words

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

2026-05-20 · updated 2026-07-24 · 3 min · 500 words

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

2026-05-20 · updated 2026-07-24 · 3 min · 430 words

Cross-Talk Speech Reduction, by Separation, for Separation

2026-05-20 · updated 2026-07-24 · 2 min · 365 words

DASM: Domain-Aware Sharpness Minimization for Multi-Domain Voice Stream Steganalysis

2026-05-20 · updated 2026-07-24 · 3 min · 535 words

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

2026-05-20 · updated 2026-07-24 · 4 min · 775 words

Executable Boundary Contracts for Sound Event Traces

2026-05-20 · updated 2026-07-24 · 3 min · 617 words

Fast Multichannel NMF with Block-Diagonal Spatial Covariance Matrices for Efficient Blind Source Separation Using Distributed Microphone Arrays

2026-05-20 · updated 2026-07-24 · 2 min · 378 words

FormalASR: End-to-End Spoken Chinese to Formal Text

2026-05-20 · updated 2026-07-24 · 2 min · 303 words

GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

2026-05-20 · updated 2026-07-24 · 3 min · 548 words

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

2026-05-20 · updated 2026-07-24 · 3 min · 568 words

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

2026-05-20 · updated 2026-07-24 · 3 min · 517 words

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

2026-05-20 · updated 2026-07-24 · 4 min · 741 words

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

2026-05-20 · updated 2026-07-24 · 4 min · 647 words

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

2026-05-20 · updated 2026-07-24 · 4 min · 747 words

Precise and Simple Audio-to-Score Alignment

2026-05-20 · updated 2026-07-24 · 2 min · 358 words

Sparse Fluid Antenna Arrays: Continuous Position Design Beyond Classical DOF Limits

2026-05-20 · updated 2026-07-24 · 3 min · 460 words

Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

2026-05-20 · updated 2026-07-24 · 2 min · 335 words

When Vision Speaks for Sound

2026-05-20 · updated 2026-07-24 · 3 min · 567 words

Speech/Music/Audio Paper Digest 2026-05-20

2026-05-20 · updated 2026-07-24 · 15 min · 2985 words

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

2026-05-19 · updated 2026-07-24 · 3 min · 508 words

A Fast Robust Adaptive filter using Improved Data-Reuse Method

2026-05-19 · updated 2026-07-24 · 2 min · 401 words

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

2026-05-19 · updated 2026-07-24 · 3 min · 431 words

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

2026-05-19 · updated 2026-07-24 · 3 min · 615 words

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

2026-05-19 · updated 2026-07-24 · 3 min · 634 words

Audio-Image Cross-Modal Retrieval with Onomatopoeic Images

2026-05-19 · updated 2026-07-24 · 3 min · 508 words

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

2026-05-19 · updated 2026-07-24 · 3 min · 573 words

Bridging the Gap: Converting Read Text to Conversational Dialogue

2026-05-19 · updated 2026-07-24 · 2 min · 277 words

Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities

2026-05-19 · updated 2026-07-24 · 4 min · 645 words

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

2026-05-19 · updated 2026-07-24 · 3 min · 456 words

Contextual Biasing for Streaming ASR via CTC-based Word Spotting

2026-05-19 · updated 2026-07-24 · 2 min · 371 words

EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge

2026-05-19 · updated 2026-07-24 · 2 min · 401 words

Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters

2026-05-19 · updated 2026-07-24 · 3 min · 547 words

Fractional-Order Subband p-Norm Adaptive Filter via Transformation Nearest Kronecker Product Decomposition for Active Noise Control

2026-05-19 · updated 2026-07-24 · 2 min · 277 words

MedASR: An Open-Source Model for High-Accuracy Medical Dictation

2026-05-19 · updated 2026-07-24 · 3 min · 431 words

MusicDET: Zero-Shot AI-Generated Music Detection

2026-05-19 · updated 2026-07-24 · 3 min · 556 words

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

2026-05-19 · updated 2026-07-24 · 4 min · 673 words

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

2026-05-19 · updated 2026-07-24 · 3 min · 639 words

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

2026-05-19 · updated 2026-07-24 · 2 min · 411 words

Robust Audio Tagging under Class-wise Supervision Unreliability

2026-05-19 · updated 2026-07-24 · 3 min · 434 words

Robust Soft-Constrained Spatially Selective Active Noise Control for Hearables Under Secondary Path Variations

2026-05-19 · updated 2026-07-24 · 2 min · 364 words

S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation

2026-05-19 · updated 2026-07-24 · 3 min · 552 words

SAME: A Semantically-Aligned Music Autoencoder

2026-05-19 · updated 2026-07-24 · 3 min · 607 words

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

2026-05-19 · updated 2026-07-24 · 3 min · 550 words

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

2026-05-19 · updated 2026-07-24 · 3 min · 515 words

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

2026-05-19 · updated 2026-07-24 · 3 min · 482 words

Sonalyzer-Moz: A Framework for Analyzing the Structure of Mozart’s Sonata Form

2026-05-19 · updated 2026-07-24 · 2 min · 401 words

Speaker-Disentangled Remote Speech Detection of Asthma and COPD Exacerbations

2026-05-19 · updated 2026-07-24 · 3 min · 445 words

Stable Audio 3

2026-05-19 · updated 2026-07-24 · 3 min · 621 words

Taming Audio VAEs via Target-KL Regularization

2026-05-19 · updated 2026-07-24 · 3 min · 434 words

UrduSpeech: A 156-Hour Urdu Speech Corpus with 12-Dimension Paralinguistic Annotations

2026-05-19 · updated 2026-07-24 · 2 min · 386 words

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

2026-05-19 · updated 2026-07-24 · 2 min · 313 words

Voice “Cloning” is Style Transfer

2026-05-19 · updated 2026-07-24 · 2 min · 323 words

WavFlow: Audio Generation in Waveform Space

2026-05-19 · updated 2026-07-24 · 3 min · 524 words

Speech/Music/Audio Paper Digest 2026-05-19

2026-05-19 · updated 2026-07-24 · 23 min · 4805 words

ARIA: A Diagnostic Framework for Music Training Data Attribution

2026-05-18 · updated 2026-07-24 · 4 min · 833 words

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

2026-05-18 · updated 2026-07-24 · 3 min · 606 words

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

2026-05-18 · updated 2026-07-24 · 2 min · 330 words

Can We Trust AI-Inferred User States? A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

2026-05-18 · updated 2026-07-24 · 2 min · 382 words

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

2026-05-18 · updated 2026-07-24 · 8 min · 1698 words

Improving Automatic Speech Recognition for Speakers Treated for Oral Cancer using Data Augmentation and LLM Error Correction

2026-05-18 · updated 2026-07-24 · 2 min · 426 words

Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization

2026-05-18 · updated 2026-07-24 · 4 min · 792 words

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

2026-05-18 · updated 2026-07-24 · 4 min · 654 words

Perforated Neural Networks for Keyword Spotting

2026-05-18 · updated 2026-07-24 · 2 min · 379 words

Real-time Speech Restoration using Data Prediction Mean Flows

2026-05-18 · updated 2026-07-24 · 3 min · 466 words

Scalable neuromorphic computing from autonomous spiking dynamics in a clockless reconfigurable chip

2026-05-18 · updated 2026-07-24 · 3 min · 458 words

Sound Sparks Motion: Audio and Text Tuning for Video Editing

2026-05-18 · updated 2026-07-24 · 1 min · 211 words

Toward World Modeling of Physiological Signals with Chaos-Theoretic Balancing and Latent Dynamics

2026-05-18 · updated 2026-07-24 · 3 min · 455 words

Speech/Music/Audio Paper Digest 2026-05-18

2026-05-18 · updated 2026-07-24 · 11 min · 2305 words

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

2026-05-17 · updated 2026-07-24 · 4 min · 681 words

ViMU: Benchmarking Video Metaphorical Understanding

2026-05-17 · updated 2026-07-24 · 3 min · 558 words

Speech/Music/Audio Paper Digest 2026-05-17

2026-05-17 · updated 2026-07-24 · 3 min · 515 words

A Benchmark for Early-stage Parkinson’s Disease Detection from Speech

2026-05-15 · updated 2026-07-24 · 3 min · 531 words

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

2026-05-15 · updated 2026-07-24 · 4 min · 673 words

AudioMosaic: Contrastive Masked Audio Representation Learning

2026-05-15 · updated 2026-07-24 · 3 min · 635 words

Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

2026-05-15 · updated 2026-07-24 · 3 min · 517 words

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

2026-05-15 · updated 2026-07-24 · 3 min · 543 words

FSD50K-Solo: Automated Curation of Single-Source Sound Events

2026-05-15 · updated 2026-07-24 · 2 min · 354 words

FutureSim: Replaying World Events to Evaluate Adaptive Agents

2026-05-15 · updated 2026-07-24 · 3 min · 570 words

IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

2026-05-15 · updated 2026-07-24 · 3 min · 459 words

Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

2026-05-15 · updated 2026-07-24 · 3 min · 444 words

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

2026-05-15 · updated 2026-07-24 · 2 min · 303 words

Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

2026-05-15 · updated 2026-07-24 · 3 min · 509 words

Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

2026-05-15 · updated 2026-07-24 · 2 min · 290 words

Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection

2026-05-15 · updated 2026-07-24 · 2 min · 407 words

PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

2026-05-15 · updated 2026-07-24 · 3 min · 439 words

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

2026-05-15 · updated 2026-07-24 · 3 min · 453 words

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

2026-05-15 · updated 2026-07-24 · 3 min · 621 words

Streaming Speech-to-Text Translation with a SpeechLLM

2026-05-15 · updated 2026-07-24 · 2 min · 341 words

Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report

2026-05-15 · updated 2026-07-24 · 3 min · 516 words

Transmit Beamforming for High-Rate Underwater Acoustic Communications

2026-05-15 · updated 2026-07-24 · 2 min · 352 words

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

2026-05-15 · updated 2026-07-24 · 3 min · 590 words

Speech/Music/Audio Paper Digest 2026-05-15

2026-05-15 · updated 2026-07-24 · 15 min · 3187 words

Bypassing Direct Reconstruction: Speech Detection from MEG via Large-Scale Audio Retrieval

2026-05-14 · updated 2026-07-24 · 2 min · 252 words

Decoupled Azimuth Elevation AoA Estimation Exploiting Kronecker Separable Steering Matrices

2026-05-14 · updated 2026-07-24 · 2 min · 331 words

Does language matter for spoken word classification? A multilingual generative meta-learning approach

2026-05-14 · updated 2026-07-24 · 2 min · 326 words

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

2026-05-14 · updated 2026-07-24 · 3 min · 545 words

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

2026-05-14 · updated 2026-07-24 · 3 min · 444 words

GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

2026-05-14 · updated 2026-07-24 · 2 min · 357 words

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

2026-05-14 · updated 2026-07-24 · 3 min · 510 words

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

2026-05-14 · updated 2026-07-24 · 2 min · 381 words

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

2026-05-14 · updated 2026-07-24 · 2 min · 362 words

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

2026-05-14 · updated 2026-07-24 · 3 min · 434 words

Scaling few-shot spoken word classification with generative meta-continual learning

2026-05-14 · updated 2026-07-24 · 2 min · 336 words

Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

2026-05-14 · updated 2026-07-24 · 4 min · 709 words

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

2026-05-14 · updated 2026-07-24 · 4 min · 720 words

Text2Score: Generating Sheet Music From Textual Prompts

2026-05-14 · updated 2026-07-24 · 3 min · 459 words

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

2026-05-14 · updated 2026-07-24 · 3 min · 453 words

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

2026-05-14 · updated 2026-07-24 · 3 min · 467 words

Speech/Music/Audio Paper Digest 2026-05-14

2026-05-14 · updated 2026-07-24 · 11 min · 2240 words

A Semi-Supervised Framework for Speech Confidence Detection using Whisper

2026-05-13 · updated 2026-07-24 · 3 min · 570 words

Adaptive Diagonal Loading using Krylov Subspaces for Robust Beamforming

2026-05-13 · updated 2026-07-24 · 2 min · 365 words

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

2026-05-13 · updated 2026-07-24 · 3 min · 578 words

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

2026-05-13 · updated 2026-07-24 · 3 min · 487 words

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

2026-05-13 · updated 2026-07-24 · 3 min · 568 words

Chunkwise Aligners for Streaming Speech Recognition

2026-05-13 · updated 2026-07-24 · 3 min · 605 words

Exploring Token-Space Manipulation in Latent Audio Tokenizers

2026-05-13 · updated 2026-07-24 · 5 min · 900 words

jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition

2026-05-13 · updated 2026-07-24 · 3 min · 447 words

Mechanistic Interpretability of ASR models using Sparse Autoencoders

2026-05-13 · updated 2026-07-24 · 3 min · 429 words

Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs

2026-05-13 · updated 2026-07-24 · 1 min · 197 words

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

2026-05-13 · updated 2026-07-24 · 3 min · 556 words

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

2026-05-13 · updated 2026-07-24 · 4 min · 728 words

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

2026-05-13 · updated 2026-07-24 · 4 min · 688 words

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

2026-05-13 · updated 2026-07-24 · 4 min · 674 words

Spatial Power Estimation via Riemannian Covariance Matching

2026-05-13 · updated 2026-07-24 · 2 min · 295 words

STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

2026-05-13 · updated 2026-07-24 · 3 min · 435 words

The Deepfakes We Missed: We Built Detectors for a Threat That Didn’t Arrive

2026-05-13 · updated 2026-07-24 · 2 min · 324 words

The SMC Blind Spot: A Failure Mode Analysis of State-of-the-Art Beat Tracking

2026-05-13 · updated 2026-07-24 · 2 min · 343 words

Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement

2026-05-13 · updated 2026-07-24 · 4 min · 644 words

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

2026-05-13 · updated 2026-07-24 · 5 min · 943 words

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

2026-05-13 · updated 2026-07-24 · 2 min · 399 words

What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty

2026-05-13 · updated 2026-07-24 · 3 min · 429 words

Speech/Music/Audio Paper Digest 2026-05-13

2026-05-13 · updated 2026-07-24 · 14 min · 2798 words

A Cold Diffusion Approach for Percussive Dereverberation

2026-05-12 · updated 2026-07-24 · 4 min · 708 words

AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

2026-05-12 · updated 2026-07-24 · 2 min · 418 words

APEX: Audio Prototype EXplanations for Classification Tasks

2026-05-12 · updated 2026-07-24 · 4 min · 823 words

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

2026-05-12 · updated 2026-07-24 · 3 min · 505 words

ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

2026-05-12 · updated 2026-07-24 · 2 min · 367 words

CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings

2026-05-12 · updated 2026-07-24 · 4 min · 652 words

DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

2026-05-12 · updated 2026-07-24 · 3 min · 612 words

Dolphin-CN-Dialect: Where Chinese Dialects Matter

2026-05-12 · updated 2026-07-24 · 4 min · 696 words

Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

2026-05-12 · updated 2026-07-24 · 4 min · 663 words

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

2026-05-12 · updated 2026-07-24 · 3 min · 507 words

Encoding and Decoding Temporal Signals with Spiking Bandpass Wavelets

2026-05-12 · updated 2026-07-24 · 2 min · 405 words

Evaluating the Expressive Appropriateness of Speech in Rich Contexts

2026-05-12 · updated 2026-07-24 · 3 min · 633 words

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

2026-05-12 · updated 2026-07-24 · 4 min · 708 words

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

2026-05-12 · updated 2026-07-24 · 4 min · 839 words

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

2026-05-12 · updated 2026-07-24 · 4 min · 716 words

Latent Secret Spin: Keyed Orthogonal Rotations for Blind Speech Watermarking in Anisotropic Latent Spaces

2026-05-12 · updated 2026-07-24 · 3 min · 446 words

Low-Cost Detection of Degraded Voice Clones via Source-Output Acoustic Consistency

2026-05-12 · updated 2026-07-24 · 3 min · 444 words

Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

2026-05-12 · updated 2026-07-24 · 3 min · 499 words

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

2026-05-12 · updated 2026-07-24 · 3 min · 433 words

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

2026-05-12 · updated 2026-07-24 · 3 min · 438 words

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

2026-05-12 · updated 2026-07-24 · 3 min · 558 words

Online Segmented Beamforming via Dynamic Programming

2026-05-12 · updated 2026-07-24 · 3 min · 448 words

PoDAR: Power-Disentangled Audio Representation for Generative Modeling

2026-05-12 · updated 2026-07-24 · 3 min · 618 words

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

2026-05-12 · updated 2026-07-24 · 3 min · 547 words

Probing Cross-modal Information Hubs in Audio-Visual LLMs

2026-05-12 · updated 2026-07-24 · 4 min · 724 words

RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

2026-05-12 · updated 2026-07-24 · 3 min · 429 words

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

2026-05-12 · updated 2026-07-24 · 4 min · 753 words

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

2026-05-12 · updated 2026-07-24 · 3 min · 529 words

Responsible Benchmarking of Fairness for Automatic Speech Recognition

2026-05-12 · updated 2026-07-24 · 2 min · 293 words

Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

2026-05-12 · updated 2026-07-24 · 3 min · 521 words

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

2026-05-12 · updated 2026-07-24 · 4 min · 660 words

SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements

2026-05-12 · updated 2026-07-24 · 3 min · 447 words

ShipEcho – An Interactive Tool for Global Mapping of Underwater Radiated Noise from Vessels

2026-05-12 · updated 2026-07-24 · 2 min · 295 words

Single-Microphone Audio Point Source Discriminative Localization From Reverberation Late Tail Estimation

2026-05-12 · updated 2026-07-24 · 2 min · 339 words

Speech-based Psychological Crisis Assessment using LLMs

2026-05-12 · updated 2026-07-24 · 3 min · 451 words

Sub-JEPA: Subspace Gaussian Regularization for Stable End-to-End World Models

2026-05-12 · updated 2026-07-24 · 2 min · 229 words

Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

2026-05-12 · updated 2026-07-24 · 4 min · 773 words

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

2026-05-12 · updated 2026-07-24 · 3 min · 588 words

Voice Biomarkers for Depression and Anxiety

2026-05-12 · updated 2026-07-24 · 1 min · 166 words

Speech/Music/Audio Paper Digest 2026-05-12

2026-05-12 · updated 2026-07-24 · 28 min · 5761 words

A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

2026-05-11 · updated 2026-07-24 · 3 min · 432 words

Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers

2026-05-11 · updated 2026-07-24 · 2 min · 398 words

Anisotropic Modality Align

2026-05-11 · updated 2026-07-24 · 3 min · 585 words

Asymmetric Phase Coding Audio Watermarking

2026-05-11 · updated 2026-07-24 · 3 min · 429 words

BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing

2026-05-11 · updated 2026-07-24 · 2 min · 380 words

Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

2026-05-11 · updated 2026-07-24 · 2 min · 305 words

Do Joint Audio-Video Generation Models Understand Physics?

2026-05-11 · updated 2026-07-24 · 3 min · 589 words

Evaluating voice anonymisation using similarity rank disclosure

2026-05-11 · updated 2026-07-24 · 3 min · 435 words

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

2026-05-11 · updated 2026-07-24 · 2 min · 363 words

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

2026-05-11 · updated 2026-07-24 · 4 min · 710 words

TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

2026-05-11 · updated 2026-07-24 · 2 min · 410 words

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

2026-05-11 · updated 2026-07-24 · 2 min · 264 words

Speech/Music/Audio Paper Digest 2026-05-11

2026-05-11 · updated 2026-07-24 · 9 min · 1723 words

Audio-Visual Intelligence in Large Foundation Models

2026-05-09 · updated 2026-07-24 · 1 min · 190 words

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

2026-05-09 · updated 2026-07-24 · 3 min · 520 words

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

2026-05-09 · updated 2026-07-24 · 2 min · 254 words

Speech/Music/Audio Paper Digest 2026-05-09

2026-05-09 · updated 2026-07-24 · 3 min · 427 words

Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings

2026-05-08 · updated 2026-07-24 · 3 min · 543 words

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

2026-05-08 · updated 2026-07-24 · 2 min · 393 words

Do Melody and Rhythm Coevolve?

2026-05-08 · updated 2026-07-24 · 3 min · 633 words

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

2026-05-08 · updated 2026-07-24 · 3 min · 449 words

Linear Semantic Segmentation for Low-Resource Spoken Dialects

2026-05-08 · updated 2026-07-24 · 4 min · 738 words

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

2026-05-08 · updated 2026-07-24 · 5 min · 945 words

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

2026-05-08 · updated 2026-07-24 · 7 min · 1464 words

Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

2026-05-08 · updated 2026-07-24 · 3 min · 519 words

More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation

2026-05-08 · updated 2026-07-24 · 1 min · 172 words

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

2026-05-08 · updated 2026-07-24 · 4 min · 774 words

NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction

2026-05-08 · updated 2026-07-24 · 2 min · 414 words

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

2026-05-08 · updated 2026-07-24 · 6 min · 1097 words

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

2026-05-08 · updated 2026-07-24 · 3 min · 566 words

PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

2026-05-08 · updated 2026-07-24 · 3 min · 607 words

PianoCoRe: Combined and Refined Piano MIDI Dataset

2026-05-08 · updated 2026-07-24 · 4 min · 813 words

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

2026-05-08 · updated 2026-07-24 · 7 min · 1301 words

Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction

2026-05-08 · updated 2026-07-24 · 3 min · 432 words

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

2026-05-08 · updated 2026-07-24 · 1 min · 196 words

Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

2026-05-08 · updated 2026-07-24 · 2 min · 399 words

Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models

2026-05-08 · updated 2026-07-24 · 4 min · 751 words

Topological Signatures of Grokking

2026-05-08 · updated 2026-07-24 · 3 min · 480 words

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

2026-05-08 · updated 2026-07-24 · 4 min · 761 words

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

2026-05-08 · updated 2026-07-24 · 3 min · 593 words

Speech/Music/Audio Paper Digest 2026-05-08

2026-05-08 · updated 2026-07-24 · 17 min · 3434 words

Adaptive Diagonal Loading for Norm Constrained Beamforming

2026-05-07 · updated 2026-07-24 · 1 min · 183 words

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

2026-05-07 · updated 2026-07-24 · 3 min · 485 words

AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

2026-05-07 · updated 2026-07-24 · 3 min · 444 words

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

2026-05-07 · updated 2026-07-24 · 1 min · 116 words

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

2026-05-07 · updated 2026-07-24 · 2 min · 364 words

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

2026-05-07 · updated 2026-07-24 · 2 min · 282 words

Hearing the Ocean: Bio-inspired Gammatone-CNN framework for Robust Underwater Acoustic Target Classification

2026-05-07 · updated 2026-07-24 · 2 min · 341 words

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

2026-05-07 · updated 2026-07-24 · 2 min · 418 words

Library learning with e-graphs on jazz harmony

2026-05-07 · updated 2026-07-24 · 2 min · 304 words

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

2026-05-07 · updated 2026-07-24 · 3 min · 523 words

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

2026-05-07 · updated 2026-07-24 · 1 min · 208 words

PHALAR: Phasors for Learned Musical Audio Representations

2026-05-07 · updated 2026-07-24 · 3 min · 468 words

RenCon 2025: Revival of the Expressive Performance Rendering Competition

2026-05-07 · updated 2026-07-24 · 2 min · 336 words

SEI-SHIELD: Robust Specific Emitter Identification Under Label Noise Via Self-Supervised Filtering and Iterative Rescue

2026-05-07 · updated 2026-07-24 · 3 min · 492 words

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

2026-05-07 · updated 2026-07-24 · 2 min · 417 words

Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement

2026-05-07 · updated 2026-07-24 · 4 min · 797 words

Stage Light is Sequence^2: Multi-Light Control via Imitation Learning

2026-05-07 · updated 2026-07-24 · 3 min · 501 words

Stage-adaptive audio diffusion modeling

2026-05-07 · updated 2026-07-24 · 2 min · 353 words

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

2026-05-07 · updated 2026-07-24 · 3 min · 457 words

To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

2026-05-07 · updated 2026-07-24 · 3 min · 540 words

Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity

2026-05-07 · updated 2026-07-24 · 3 min · 570 words

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

2026-05-07 · updated 2026-07-24 · 4 min · 643 words

Speech/Music/Audio Paper Digest 2026-05-07

2026-05-07 · updated 2026-07-24 · 14 min · 2879 words

A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition applied on French Language

2026-05-06 · updated 2026-07-24 · 2 min · 411 words

A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition

2026-05-06 · updated 2026-07-24 · 1 min · 112 words

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

2026-05-06 · updated 2026-07-24 · 3 min · 439 words

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

2026-05-06 · updated 2026-07-24 · 2 min · 357 words

Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs

2026-05-06 · updated 2026-07-24 · 2 min · 306 words

AsymK-Talker: Real-Time and Long-Horizon Talking Head Generation via Asymmetric Kernel Distillation

2026-05-06 · updated 2026-07-24 · 2 min · 418 words

Contrastive Regularization for Accent-Robust ASR

2026-05-06 · updated 2026-07-24 · 2 min · 359 words

Cosmodoit: A Python Package for Adaptive, Efficient Pipelining of Feature Extraction from Performed Music

2026-05-06 · updated 2026-07-24 · 1 min · 207 words

DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition

2026-05-06 · updated 2026-07-24 · 3 min · 485 words

Deepfake Audio Detection Using Self-supervised Fusion Representations

2026-05-06 · updated 2026-07-24 · 2 min · 265 words

Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data

2026-05-06 · updated 2026-07-24 · 2 min · 312 words

Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

2026-05-06 · updated 2026-07-24 · 3 min · 428 words

Learning Generalizable Action Representations via Pre-training AEMG

2026-05-06 · updated 2026-07-24 · 2 min · 338 words

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

2026-05-06 · updated 2026-07-24 · 5 min · 929 words

Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection

2026-05-06 · updated 2026-07-24 · 3 min · 456 words

PHALAR: Phasors for Learned Musical Audio Representations

2026-05-06 · updated 2026-07-24 · 3 min · 491 words

Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings

2026-05-06 · updated 2026-07-24 · 2 min · 357 words

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

2026-05-06 · updated 2026-07-24 · 3 min · 429 words

Smart Passive Acoustic Monitoring: Embedding a Classifier on AudioMoth Microcontroller

2026-05-06 · updated 2026-07-24 · 1 min · 123 words

Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

2026-05-06 · updated 2026-07-24 · 3 min · 497 words

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

2026-05-06 · updated 2026-07-24 · 3 min · 464 words

Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts

2026-05-06 · updated 2026-07-24 · 2 min · 325 words

Towards Open World Sound Event Detection

2026-05-06 · updated 2026-07-24 · 3 min · 475 words

Speech/Music/Audio Paper Digest 2026-05-06

2026-05-06 · updated 2026-07-24 · 15 min · 3158 words

Artificial intelligence language technologies in multilingual healthcare: Grand challenges ahead

2026-05-05 · updated 2026-07-24 · 1 min · 129 words

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

2026-05-05 · updated 2026-07-24 · 2 min · 295 words

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

2026-05-05 · updated 2026-07-24 · 3 min · 461 words

Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

2026-05-05 · updated 2026-07-24 · 3 min · 458 words

Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks

2026-05-05 · updated 2026-07-24 · 2 min · 413 words

HARMES: A Multi-Modal Dataset for Wearable Human Activity Recognition with Motion, Environmental Sensing and Sound

2026-05-05 · updated 2026-07-24 · 2 min · 286 words

Integrating acoustic tapping with a UAV platform for tile condition classification

2026-05-05 · updated 2026-07-24 · 3 min · 472 words

Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation

2026-05-05 · updated 2026-07-24 · 2 min · 403 words

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

2026-05-05 · updated 2026-07-24 · 1 min · 119 words

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

2026-05-05 · updated 2026-07-24 · 3 min · 495 words

MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

2026-05-05 · updated 2026-07-24 · 2 min · 312 words

MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention

2026-05-05 · updated 2026-07-24 · 2 min · 331 words

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

2026-05-05 · updated 2026-07-24 · 2 min · 389 words

Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

2026-05-05 · updated 2026-07-24 · 2 min · 405 words

Multimodal Confidence Modeling in Audio-Visual Quality Assessment

2026-05-05 · updated 2026-07-24 · 3 min · 433 words

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

2026-05-05 · updated 2026-07-24 · 2 min · 386 words

Neck-Learn: Attention-Based Multiple Instance Learning and Ensemble Framework for Ecological Momentary Assessment

2026-05-05 · updated 2026-07-24 · 2 min · 362 words

NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty

2026-05-05 · updated 2026-07-24 · 2 min · 396 words

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

2026-05-05 · updated 2026-07-24 · 2 min · 302 words

PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention

2026-05-05 · updated 2026-07-24 · 3 min · 464 words

Period-conscious Time-series Reconstruction under Local Differential Privacy

2026-05-05 · updated 2026-07-24 · 2 min · 255 words

Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation

2026-05-05 · updated 2026-07-24 · 2 min · 350 words

RenCon 2025: Revival of the Expressive Performance Rendering Competition

2026-05-05 · updated 2026-07-24 · 2 min · 277 words

Spoken Language Identification with Pre-trained Models and Margin Loss

2026-05-05 · updated 2026-07-24 · 1 min · 194 words

The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge

2026-05-05 · updated 2026-07-24 · 2 min · 261 words

The AECM Algorithm for Deterministic Maximum Likelihood Direction Finding in the Presence of Gaussian Mixture Noise

2026-05-05 · updated 2026-07-24 · 1 min · 188 words

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

2026-05-05 · updated 2026-07-24 · 1 min · 202 words

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

2026-05-05 · updated 2026-07-24 · 2 min · 420 words

Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI

2026-05-05 · updated 2026-07-24 · 1 min · 109 words

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

2026-05-05 · updated 2026-07-24 · 1 min · 213 words

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

2026-05-05 · updated 2026-07-24 · 2 min · 237 words

When Attention Collapses: Residual Evidence Modeling for Compositional Inference

2026-05-05 · updated 2026-07-24 · 2 min · 323 words

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

2026-05-05 · updated 2026-07-24 · 1 min · 164 words

Speech/Music/Audio Paper Digest 2026-05-05

2026-05-05 · updated 2026-07-24 · 19 min · 3988 words

A Brain-Inspired Gating Mechanism Unlocks Robust Computation in Spiking Neural Networks

2026-05-04 · updated 2026-07-24 · 2 min · 288 words

A cross-species neural foundation model for end-to-end speech decoding

2026-05-04 · updated 2026-07-24 · 2 min · 349 words

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

2026-05-04 · updated 2026-07-24 · 2 min · 378 words

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

2026-05-04 · updated 2026-07-24 · 2 min · 250 words

Alethia: A Foundational Encoder for Voice Deepfakes

2026-05-04 · updated 2026-07-24 · 1 min · 204 words

AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching

2026-05-04 · updated 2026-07-24 · 2 min · 299 words

Are Deep Speech Denoising Models Robust to Adversarial Noise?

2026-05-04 · updated 2026-07-24 · 2 min · 291 words

AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models

2026-05-04 · updated 2026-07-24 · 3 min · 440 words

AudioX: A Unified Framework for Anything-to-Audio Generation

2026-05-04 · updated 2026-07-24 · 4 min · 756 words

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

2026-05-04 · updated 2026-07-24 · 2 min · 328 words

Aurelius: Relation Aware Text-to-Audio Generation At Scale

2026-05-04 · updated 2026-07-24 · 2 min · 390 words

Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

2026-05-04 · updated 2026-07-24 · 3 min · 450 words

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

2026-05-04 · updated 2026-07-24 · 3 min · 477 words

AVEX: What Matters for Animal Vocalization Encoding

2026-05-04 · updated 2026-07-24 · 3 min · 432 words

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

2026-05-04 · updated 2026-07-24 · 3 min · 467 words

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

2026-05-04 · updated 2026-07-24 · 2 min · 425 words

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

2026-05-04 · updated 2026-07-24 · 2 min · 258 words

Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval

2026-05-04 · updated 2026-07-24 · 2 min · 411 words

Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

2026-05-04 · updated 2026-07-24 · 3 min · 577 words

Can Speech LLMs Think while Listening?

2026-05-04 · updated 2026-07-24 · 2 min · 347 words

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

2026-05-04 · updated 2026-07-24 · 2 min · 261 words

Closing the Gap Between Text and Speech Understanding in LLMs

2026-05-04 · updated 2026-07-24 · 2 min · 323 words

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

2026-05-04 · updated 2026-07-24 · 2 min · 301 words

Confident and Adaptive Generative Speech Recognition via Risk Control

2026-05-04 · updated 2026-07-24 · 2 min · 351 words

Continuous Audio Language Models

2026-05-04 · updated 2026-07-24 · 3 min · 525 words

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

2026-05-04 · updated 2026-07-24 · 2 min · 345 words

CustomDancer: Customized Dance Recommendation by Text-Dance Retrieval

2026-05-04 · updated 2026-07-24 · 2 min · 296 words

Data-Centric Lessons To Improve Speech-Language Pretraining

2026-05-04 · updated 2026-07-24 · 2 min · 277 words

Deep Learning with Learnable Product-Structured Activations

2026-05-04 · updated 2026-07-24 · 2 min · 298 words

DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

2026-05-04 · updated 2026-07-24 · 3 min · 589 words

Discovering and Steering Interpretable Concepts in Large Generative Music Models

2026-05-04 · updated 2026-07-24 · 2 min · 224 words

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

2026-05-04 · updated 2026-07-24 · 2 min · 392 words

DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

2026-05-04 · updated 2026-07-24 · 2 min · 381 words

Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

2026-05-04 · updated 2026-07-24 · 2 min · 226 words

EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

2026-05-04 · updated 2026-07-24 · 2 min · 261 words

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

2026-05-04 · updated 2026-07-24 · 2 min · 251 words

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

2026-05-04 · updated 2026-07-24 · 2 min · 229 words

End-to-end Listen, Look, Speak and Act

2026-05-04 · updated 2026-07-24 · 2 min · 277 words

Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression

2026-05-04 · updated 2026-07-24 · 2 min · 393 words

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

2026-05-04 · updated 2026-07-24 · 4 min · 669 words

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

2026-05-04 · updated 2026-07-24 · 2 min · 348 words

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

2026-05-04 · updated 2026-07-24 · 2 min · 373 words

Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation

2026-05-04 · updated 2026-07-24 · 3 min · 487 words

FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

2026-05-04 · updated 2026-07-24 · 3 min · 577 words

From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings

2026-05-04 · updated 2026-07-24 · 2 min · 345 words

From Natural Alignment to Conditional Controllability in Multimodal Dialogue

2026-05-04 · updated 2026-07-24 · 2 min · 286 words

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

2026-05-04 · updated 2026-07-24 · 2 min · 367 words

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

2026-05-04 · updated 2026-07-24 · 1 min · 162 words

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

2026-05-04 · updated 2026-07-24 · 2 min · 342 words

Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation

2026-05-04 · updated 2026-07-24 · 3 min · 461 words

Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

2026-05-04 · updated 2026-07-24 · 2 min · 367 words

Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis

2026-05-04 · updated 2026-07-24 · 4 min · 776 words

Human Behavior Atlas: Benchmarking Unified Psychological And Social Behavior Understanding

2026-05-04 · updated 2026-07-24 · 2 min · 384 words

Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

2026-05-04 · updated 2026-07-24 · 2 min · 233 words

ICLR 2026 - Motion Generation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 115 words

ICLR 2026 - Image Generation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 100 words

ICLR 2026 - Benchmarks #Datasets Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 136 words

ICLR 2026 - Benchmarks Paper List

2026-05-04 · updated 2026-07-24 · 6 min · 1203 words

ICLR 2026 - Sound Source Localization Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 113 words

ICLR 2026 - Multimodal Reasoning Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 102 words

ICLR 2026 - Multimodal Models Paper List

2026-05-04 · updated 2026-07-24 · 4 min · 671 words

ICLR 2026 - Sequential Disentanglement Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 193 words

ICLR 2026 - Datasets Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 144 words

ICLR 2026 - Robot Manipulation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 122 words

ICLR 2026 - Model Interpretability Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 149 words

ICLR 2026 - Model Comparison Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 121 words

ICLR 2026 - Model Evaluation Paper List

2026-05-04 · updated 2026-07-24 · 2 min · 281 words

ICLR 2026 - Computational Ecology Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 130 words

ICLR 2026 - Generative Models Paper List

2026-05-04 · updated 2026-07-24 · 2 min · 272 words

ICLR 2026 - Bioacoustics Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 193 words

ICLR 2026 - Neural Network Architectures Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 97 words

ICLR 2026 - Spatial Audio Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 105 words

ICLR 2026 - Brain Encoding Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 97 words

ICLR 2026 - Video Captioning Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 187 words

ICLR 2026 - Video Summarization Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 103 words

ICLR 2026 - Video Generation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 171 words

ICLR 2026 - Speech Separation Paper List

2026-05-04 · updated 2026-07-24 · 4 min · 708 words

ICLR 2026 - Speech Synthesis Paper List

2026-05-04 · updated 2026-07-24 · 8 min · 1679 words

ICLR 2026 - Speech Synthesis Evaluation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 198 words

ICLR 2026 - Speech Enhancement #Adversarial Examples Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 131 words

ICLR 2026 - Speech Enhancement Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 105 words

ICLR 2026 - Large Speech Models Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 128 words

ICLR 2026 - Spoken Dialogue Systems Paper List

2026-05-04 · updated 2026-07-24 · 4 min · 817 words

ICLR 2026 - Speech Emotion Recognition Paper List

2026-05-04 · updated 2026-07-24 · 3 min · 637 words

ICLR 2026 - Speech Generation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 126 words

ICLR 2026 - Speech Translation Paper List

2026-05-04 · updated 2026-07-24 · 2 min · 214 words

ICLR 2026 - Speech Recognition #Speech Synthesis Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 197 words

ICLR 2026 - Speech Recognition Paper List

2026-05-04 · updated 2026-07-24 · 6 min · 1099 words

ICLR 2026 - Voice Conversion #Voice Anonymization Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 168 words

ICLR 2026 - Spoken Question Answering Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 145 words

ICLR 2026 - Cross-Modal Retrieval Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 91 words

ICLR 2026 - Cross-Modal Generation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 108 words

ICLR 2026 - Music Information Retrieval Paper List

2026-05-04 · updated 2026-07-24 · 2 min · 262 words

ICLR 2026 - Music Understanding Paper List

2026-05-04 · updated 2026-07-24 · 2 min · 224 words

ICLR 2026 - Music Generation Paper List

2026-05-04 · updated 2026-07-24 · 7 min · 1298 words

ICLR 2026 - Audio-Visual Paper List

2026-05-04 · updated 2026-07-24 · 2 min · 400 words

ICLR 2026 - Audio-Visual Event Detection Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 128 words

ICLR 2026 - Audio-Visual Deepfake Detection Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 109 words

ICLR 2026 - Joint Audio-Visual Reasoning Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 91 words

ICLR 2026 - Audio Separation Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 119 words

ICLR 2026 - Audio Classification Paper List

2026-05-04 · updated 2026-07-24 · 4 min · 839 words

ICLR 2026 - Audio Scene Understanding Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 114 words

ICLR 2026 - Audio Security Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 127 words

ICLR 2026 - Audio Retrieval Paper List

2026-05-04 · updated 2026-07-24 · 3 min · 500 words

ICLR 2026 - Audio Generation Paper List

2026-05-04 · updated 2026-07-24 · 9 min · 1782 words

ICLR 2026 - Audio Editing Paper List

2026-05-04 · updated 2026-07-24 · 1 min · 130 words

ICLR 2026 - Audio Question Answering Paper List

2026-05-04 · updated 2026-07-24 · 3 min · 541 words

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

2026-05-04 · updated 2026-07-24 · 2 min · 261 words

Instilling an Active Mind in Avatars via Cognitive Simulation

2026-05-04 · updated 2026-07-24 · 2 min · 285 words

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

2026-05-04 · updated 2026-07-24 · 2 min · 376 words

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

2026-05-04 · updated 2026-07-24 · 2 min · 283 words

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

2026-05-04 · updated 2026-07-24 · 2 min · 370 words

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

2026-05-04 · updated 2026-07-24 · 2 min · 327 words

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

2026-05-04 · updated 2026-07-24 · 2 min · 358 words

Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks

2026-05-04 · updated 2026-07-24 · 3 min · 439 words

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

2026-05-04 · updated 2026-07-24 · 2 min · 331 words

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

2026-05-04 · updated 2026-07-24 · 2 min · 397 words

Latent Fourier Transform

2026-05-04 · updated 2026-07-24 · 2 min · 294 words

Latent Speech-Text Transformer

2026-05-04 · updated 2026-07-24 · 3 min · 485 words

LayerSync: Self-aligning Intermediate Layers

2026-05-04 · updated 2026-07-24 · 2 min · 311 words

Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition

2026-05-04 · updated 2026-07-24 · 2 min · 402 words

Learning multimodal dictionary decompositions with group-sparse autoencoders

2026-05-04 · updated 2026-07-24 · 2 min · 290 words

LLM2Fx-Tools: Tool Calling for Music Post-Production

2026-05-04 · updated 2026-07-24 · 2 min · 385 words

MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control

2026-05-04 · updated 2026-07-24 · 2 min · 252 words

MAPSS: Manifold-based Assessment of Perceptual Source Separation

2026-05-04 · 更新于 2026-07-24 · 2 min · 237 words

MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

2026-05-04 · 更新于 2026-07-24 · 5 min · 908 words

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

2026-05-04 · 更新于 2026-07-24 · 2 min · 289 words

Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

2026-05-04 · 更新于 2026-07-24 · 2 min · 243 words

MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications

2026-05-04 · 更新于 2026-07-24 · 2 min · 421 words

MindMix: A Multimodal Foundation Model for Auditory Perception Decoding via Deep Neural-Acoustic Alignment

2026-05-04 · 更新于 2026-07-24 · 3 min · 444 words

MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

2026-05-04 · 更新于 2026-07-24 · 2 min · 373 words

MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

2026-05-04 · 更新于 2026-07-24 · 2 min · 382 words

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

2026-05-04 · 更新于 2026-07-24 · 1 min · 176 words

Music Flamingo: Scaling Music Understanding in Audio Language Models

2026-05-04 · 更新于 2026-07-24 · 2 min · 392 words

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

2026-05-04 · 更新于 2026-07-24 · 2 min · 316 words

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

2026-05-04 · 更新于 2026-07-24 · 2 min · 364 words

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

2026-05-04 · 更新于 2026-07-24 · 2 min · 367 words

OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text

2026-05-04 · 更新于 2026-07-24 · 2 min · 247 words

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

2026-05-04 · 更新于 2026-07-24 · 2 min · 292 words

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

2026-05-04 · 更新于 2026-07-24 · 2 min · 406 words

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

2026-05-04 · 更新于 2026-07-24 · 3 min · 464 words

OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

2026-05-04 · 更新于 2026-07-24 · 2 min · 326 words

PACE: Pretrained Audio Continual Learning

2026-05-04 · 更新于 2026-07-24 · 2 min · 376 words

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

2026-05-04 · 更新于 2026-07-24 · 2 min · 272 words

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

2026-05-04 · 更新于 2026-07-24 · 2 min · 324 words

Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization

2026-05-04 · 更新于 2026-07-24 · 2 min · 275 words

PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation

2026-05-04 · 更新于 2026-07-24 · 2 min · 316 words

Query-Guided Spatial–Temporal–Frequency Interaction for Music Audio–Visual Question Answering

2026-05-04 · 更新于 2026-07-24 · 2 min · 244 words

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

2026-05-04 · 更新于 2026-07-24 · 3 min · 545 words

RoboKA: KAN Informed Multimodal Learning for RoboCall Surveillance System

2026-05-04 · 更新于 2026-07-24 · 2 min · 285 words

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

2026-05-04 · 更新于 2026-07-24 · 2 min · 340 words

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

2026-05-04 · 更新于 2026-07-24 · 2 min · 329 words

Scaling Speech Tokenizers with Diffusion Autoencoders

2026-05-04 · 更新于 2026-07-24 · 2 min · 342 words

SCRAPL: Scattering Transform with Random Paths for Machine Learning

2026-05-04 · 更新于 2026-07-24 · 3 min · 516 words

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

2026-05-04 · 更新于 2026-07-24 · 2 min · 290 words

SmartDJ: Declarative Audio Editing with Audio Language Model

2026-05-04 · 更新于 2026-07-24 · 2 min · 330 words

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

2026-05-04 · 更新于 2026-07-24 · 3 min · 578 words

SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

2026-05-04 · 更新于 2026-07-24 · 2 min · 326 words

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

2026-05-04 · 更新于 2026-07-24 · 2 min · 383 words

Speech World Model: Causal State–Action Planning with Explicit Reasoning for Speech

2026-05-04 · 更新于 2026-07-24 · 3 min · 499 words

Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

2026-05-04 · 更新于 2026-07-24 · 2 min · 288 words

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

2026-05-04 · 更新于 2026-07-24 · 3 min · 619 words

SpeechOp: Inference-Time Task Composition for Generative Speech Processing

2026-05-04 · 更新于 2026-07-24 · 2 min · 344 words

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

2026-05-04 · 更新于 2026-07-24 · 2 min · 280 words

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

2026-05-04 · 更新于 2026-07-24 · 1 min · 207 words

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

2026-05-04 · 更新于 2026-07-24 · 2 min · 257 words

Steering Autoregressive Music Generation with Recursive Feature Machines

2026-05-04 · 更新于 2026-07-24 · 2 min · 422 words

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

2026-05-04 · 更新于 2026-07-24 · 2 min · 241 words

SumRA: Parameter Efficient Fine-tuning with Singular Value Decomposition and Summed Orthogonal Basis

2026-05-04 · 更新于 2026-07-24 · 2 min · 420 words

SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

2026-05-04 · 更新于 2026-07-24 · 2 min · 376 words

Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

2026-05-04 · 更新于 2026-07-24 · 2 min · 358 words

SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

2026-05-04 · 更新于 2026-07-24 · 2 min · 345 words

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

2026-05-04 · 更新于 2026-07-24 · 5 min · 1000 words

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

2026-05-04 · 更新于 2026-07-24 · 2 min · 379 words

Tell me Habibi, is it Real or Fake?

2026-05-04 · 更新于 2026-07-24 · 2 min · 276 words

The Deleuzian Representation Hypothesis

2026-05-04 · 更新于 2026-07-24 · 2 min · 285 words

Timing is Everything: Temporal Scaffolding of Semantic Surprise in Humor

2026-05-04 · 更新于 2026-07-24 · 2 min · 349 words

TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO-DESIGN APPROACH FOR EFFICIENT MULTIMODAL INFERENCE ON BATTERY-POWERED SMALL DEVICES

2026-05-04 · 更新于 2026-07-24 · 2 min · 227 words

Token-Based Audio Inpainting via Discrete Diffusion

2026-05-04 · 更新于 2026-07-24 · 3 min · 508 words

Toward Complex-Valued Neural Networks for Waveform Generation

2026-05-04 · 更新于 2026-07-24 · 2 min · 308 words

Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

2026-05-04 · 更新于 2026-07-24 · 2 min · 226 words

Towards True Speech-to-Speech Models Without Text Guidance

2026-05-04 · 更新于 2026-07-24 · 2 min · 393 words

Transformer-based End-to-End Control Filter Generation for Active Noise Control

2026-05-04 · 更新于 2026-07-24 · 2 min · 316 words

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

2026-05-04 · 更新于 2026-07-24 · 2 min · 348 words

TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

2026-05-04 · 更新于 2026-07-24 · 2 min · 332 words

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

2026-05-04 · 更新于 2026-07-24 · 2 min · 365 words

TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

2026-05-04 · 更新于 2026-07-24 · 2 min · 327 words

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

2026-05-04 · 更新于 2026-07-24 · 2 min · 386 words

Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow

2026-05-04 · 更新于 2026-07-24 · 2 min · 340 words

UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

2026-05-04 · 更新于 2026-07-24 · 2 min · 306 words

Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

2026-05-04 · 更新于 2026-07-24 · 2 min · 300 words

VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

2026-05-04 · 更新于 2026-07-24 · 2 min · 323 words

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video

2026-05-04 · 更新于 2026-07-24 · 2 min · 220 words

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

2026-05-04 · 更新于 2026-07-24 · 2 min · 335 words

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

2026-05-04 · 更新于 2026-07-24 · 2 min · 292 words

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

2026-05-04 · 更新于 2026-07-24 · 3 min · 552 words

WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

2026-05-04 · 更新于 2026-07-24 · 2 min · 327 words

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

2026-05-04 · 更新于 2026-07-24 · 2 min · 240 words

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

2026-05-04 · 更新于 2026-07-24 · 2 min · 269 words

YuE: Scaling Open Foundation Models for Long-Form Music Generation

2026-05-04 · 更新于 2026-07-24 · 2 min · 424 words

Speech/Music/Audio Paper Digest 2026-05-04

2026-05-04 · 更新于 2026-07-24 · 9 min · 1720 words

Speech/Music/Audio Paper Digest 2026-05-03

2026-05-03 · 更新于 2026-07-24 · 8 min · 1688 words

A Brain-Inspired Gating Mechanism Unlocks Robust Computation in Spiking Neural Networks

2026-05-02 · 更新于 2026-07-24 · 3 min · 552 words

A cross-species neural foundation model for end-to-end speech decoding

2026-05-02 · 更新于 2026-07-24 · 2 min · 412 words

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

2026-05-02 · 更新于 2026-07-24 · 2 min · 395 words

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

2026-05-02 · 更新于 2026-07-24 · 2 min · 382 words

AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching

2026-05-02 · 更新于 2026-07-24 · 3 min · 441 words

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

2026-05-02 · 更新于 2026-07-24 · 3 min · 485 words

Are Deep Speech Denoising Models Robust to Adversarial Noise?

2026-05-02 · 更新于 2026-07-24 · 1 min · 203 words

AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models

2026-05-02 · 更新于 2026-07-24 · 3 min · 476 words

AudioX: A Unified Framework for Anything-to-Audio Generation

2026-05-02 · 更新于 2026-07-24 · 3 min · 442 words

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

2026-05-02 · 更新于 2026-07-24 · 2 min · 423 words

Aurelius: Relation Aware Text-to-Audio Generation At Scale

2026-05-02 · 更新于 2026-07-24 · 2 min · 386 words

Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

2026-05-02 · 更新于 2026-07-24 · 3 min · 454 words

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

2026-05-02 · 更新于 2026-07-24 · 2 min · 293 words

AVEX: What Matters for Animal Vocalization Encoding

2026-05-02 · 更新于 2026-07-24 · 2 min · 318 words

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

2026-05-02 · 更新于 2026-07-24 · 2 min · 346 words

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 406 words

Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval

2026-05-02 · 更新于 2026-07-24 · 2 min · 343 words

Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

2026-05-02 · 更新于 2026-07-24 · 2 min · 417 words

Can Speech LLMs Think while Listening?

2026-05-02 · 更新于 2026-07-24 · 2 min · 298 words

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

2026-05-02 · 更新于 2026-07-24 · 2 min · 254 words

Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings

2026-05-02 · 更新于 2026-07-24 · 2 min · 395 words

Closing the Gap Between Text and Speech Understanding in LLMs

2026-05-02 · 更新于 2026-07-24 · 3 min · 579 words

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

2026-05-02 · 更新于 2026-07-24 · 2 min · 355 words

Confident and Adaptive Generative Speech Recognition via Risk Control

2026-05-02 · 更新于 2026-07-24 · 2 min · 229 words

Continuous Audio Language Models

2026-05-02 · 更新于 2026-07-24 · 3 min · 587 words

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

2026-05-02 · 更新于 2026-07-24 · 2 min · 374 words

Data-Centric Lessons To Improve Speech-Language Pretraining

2026-05-02 · 更新于 2026-07-24 · 2 min · 265 words

Deep Learning with Learnable Product-Structured Activations

2026-05-02 · 更新于 2026-07-24 · 2 min · 326 words

DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

2026-05-02 · 更新于 2026-07-24 · 2 min · 365 words

Discovering and Steering Interpretable Concepts in Large Generative Music Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 297 words

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

2026-05-02 · 更新于 2026-07-24 · 3 min · 445 words

DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

2026-05-02 · 更新于 2026-07-24 · 3 min · 496 words

Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

2026-05-02 · 更新于 2026-07-24 · 2 min · 225 words

EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 287 words

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

2026-05-02 · 更新于 2026-07-24 · 2 min · 358 words

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

2026-05-02 · 更新于 2026-07-24 · 2 min · 251 words

End-to-end Listen, Look, Speak and Act

2026-05-02 · 更新于 2026-07-24 · 3 min · 444 words

Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression

2026-05-02 · 更新于 2026-07-24 · 2 min · 316 words

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

2026-05-02 · 更新于 2026-07-24 · 3 min · 544 words

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

2026-05-02 · 更新于 2026-07-24 · 2 min · 332 words

Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation

2026-05-02 · 更新于 2026-07-24 · 2 min · 353 words

FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

2026-05-02 · 更新于 2026-07-24 · 3 min · 431 words

From Natural Alignment to Conditional Controllability in Multimodal Dialogue

2026-05-02 · 更新于 2026-07-24 · 2 min · 326 words

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

2026-05-02 · 更新于 2026-07-24 · 2 min · 400 words

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

2026-05-02 · 更新于 2026-07-24 · 2 min · 295 words

Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation

2026-05-02 · 更新于 2026-07-24 · 2 min · 372 words

Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis

2026-05-02 · 更新于 2026-07-24 · 3 min · 457 words

Human Behavior Atlas: Benchmarking Unified Psychological And Social Behavior Understanding

2026-05-02 · 更新于 2026-07-24 · 2 min · 424 words

Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

2026-05-02 · 更新于 2026-07-24 · 1 min · 191 words

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

2026-05-02 · 更新于 2026-07-24 · 2 min · 289 words

Instilling an Active Mind in Avatars via Cognitive Simulation

2026-05-02 · 更新于 2026-07-24 · 2 min · 263 words

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

2026-05-02 · 更新于 2026-07-24 · 2 min · 350 words

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

2026-05-02 · 更新于 2026-07-24 · 3 min · 452 words

JaiTTS: A Thai Voice Cloning Model

2026-05-02 · 更新于 2026-07-24 · 2 min · 425 words

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

2026-05-02 · 更新于 2026-07-24 · 3 min · 631 words

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

2026-05-02 · 更新于 2026-07-24 · 3 min · 566 words

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

2026-05-02 · 更新于 2026-07-24 · 3 min · 567 words

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

2026-05-02 · 更新于 2026-07-24 · 2 min · 306 words

Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks

2026-05-02 · 更新于 2026-07-24 · 2 min · 372 words

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

2026-05-02 · 更新于 2026-07-24 · 3 min · 469 words

Latent Fourier Transform

2026-05-02 · 更新于 2026-07-24 · 2 min · 322 words

Latent Speech-Text Transformer

2026-05-02 · 更新于 2026-07-24 · 3 min · 535 words

LayerSync: Self-aligning Intermediate Layers

2026-05-02 · 更新于 2026-07-24 · 2 min · 346 words

Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition

2026-05-02 · 更新于 2026-07-24 · 2 min · 329 words

Learning multimodal dictionary decompositions with group-sparse autoencoders

2026-05-02 · 更新于 2026-07-24 · 2 min · 317 words

LLM2Fx-Tools: Tool Calling for Music Post-Production

2026-05-02 · 更新于 2026-07-24 · 3 min · 439 words

MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control

2026-05-02 · 更新于 2026-07-24 · 3 min · 453 words

MAPSS: Manifold-based Assessment of Perceptual Source Separation

2026-05-02 · 更新于 2026-07-24 · 2 min · 404 words

MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

2026-05-02 · 更新于 2026-07-24 · 2 min · 385 words

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

2026-05-02 · 更新于 2026-07-24 · 2 min · 349 words

Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 284 words

MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications

2026-05-02 · 更新于 2026-07-24 · 2 min · 275 words

MindMix: A Multimodal Foundation Model for Auditory Perception Decoding via Deep Neural-Acoustic Alignment

2026-05-02 · 更新于 2026-07-24 · 3 min · 459 words

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

2026-05-02 · 更新于 2026-07-24 · 2 min · 406 words

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

2026-05-02 · 更新于 2026-07-24 · 2 min · 229 words

Music Flamingo: Scaling Music Understanding in Audio Language Models

2026-05-02 · 更新于 2026-07-24 · 3 min · 495 words

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

2026-05-02 · 更新于 2026-07-24 · 2 min · 248 words

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

2026-05-02 · 更新于 2026-07-24 · 2 min · 291 words

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

2026-05-02 · 更新于 2026-07-24 · 2 min · 243 words

OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text

2026-05-02 · 更新于 2026-07-24 · 2 min · 300 words

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

2026-05-02 · 更新于 2026-07-24 · 2 min · 292 words

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

2026-05-02 · 更新于 2026-07-24 · 2 min · 388 words

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

2026-05-02 · 更新于 2026-07-24 · 3 min · 581 words

OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 406 words

PACE: Pretrained Audio Continual Learning

2026-05-02 · 更新于 2026-07-24 · 2 min · 384 words

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction

2026-05-02 · 更新于 2026-07-24 · 2 min · 361 words

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

2026-05-02 · 更新于 2026-07-24 · 2 min · 371 words

Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization

2026-05-02 · 更新于 2026-07-24 · 2 min · 277 words

PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation

2026-05-02 · 更新于 2026-07-24 · 2 min · 397 words

Query-Guided Spatial–Temporal–Frequency Interaction for Music Audio–Visual Question Answering

2026-05-02 · 更新于 2026-07-24 · 2 min · 286 words

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

2026-05-02 · 更新于 2026-07-24 · 2 min · 346 words

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

2026-05-02 · 更新于 2026-07-24 · 2 min · 246 words

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

2026-05-02 · 更新于 2026-07-24 · 3 min · 599 words

Scaling Speech Tokenizers with Diffusion Autoencoders

2026-05-02 · 更新于 2026-07-24 · 2 min · 282 words

SCRAPL: Scattering Transform with Random Paths for Machine Learning

2026-05-02 · 更新于 2026-07-24 · 3 min · 487 words

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

2026-05-02 · 更新于 2026-07-24 · 2 min · 347 words

SmartDJ: Declarative Audio Editing with Audio Language Model

2026-05-02 · 更新于 2026-07-24 · 2 min · 328 words

SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

2026-05-02 · 更新于 2026-07-24 · 3 min · 494 words

SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

2026-05-02 · 更新于 2026-07-24 · 3 min · 518 words

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

2026-05-02 · 更新于 2026-07-24 · 2 min · 387 words

Speech World Model: Causal State–Action Planning with Explicit Reasoning for Speech

2026-05-02 · 更新于 2026-07-24 · 2 min · 351 words

Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

2026-05-02 · 更新于 2026-07-24 · 2 min · 334 words

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

2026-05-02 · 更新于 2026-07-24 · 2 min · 349 words

SpeechOp: Inference-Time Task Composition for Generative Speech Processing

2026-05-02 · 更新于 2026-07-24 · 2 min · 340 words

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

2026-05-02 · 更新于 2026-07-24 · 2 min · 382 words

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

2026-05-02 · 更新于 2026-07-24 · 3 min · 506 words

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

2026-05-02 · 更新于 2026-07-24 · 2 min · 329 words

Steering Autoregressive Music Generation with Recursive Feature Machines

2026-05-02 · 更新于 2026-07-24 · 2 min · 318 words

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 319 words

SumRA: Parameter Efficient Fine-tuning with Singular Value Decomposition and Summed Orthogonal Basis

2026-05-02 · 更新于 2026-07-24 · 2 min · 334 words

SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

2026-05-02 · 更新于 2026-07-24 · 2 min · 422 words

Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

2026-05-02 · 更新于 2026-07-24 · 3 min · 512 words

SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

2026-05-02 · 更新于 2026-07-24 · 3 min · 497 words

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

2026-05-02 · 更新于 2026-07-24 · 2 min · 295 words

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

2026-05-02 · 更新于 2026-07-24 · 2 min · 318 words

Tell me Habibi, is it Real or Fake?

2026-05-02 · 更新于 2026-07-24 · 2 min · 305 words

The Deleuzian Representation Hypothesis

2026-05-02 · 更新于 2026-07-24 · 2 min · 262 words

TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO-DESIGN APPROACH FOR EFFICIENT MULTIMODAL INFERENCE ON BATTERY-POWERED SMALL DEVICES

2026-05-02 · 更新于 2026-07-24 · 2 min · 284 words

Token-Based Audio Inpainting via Discrete Diffusion

2026-05-02 · 更新于 2026-07-24 · 3 min · 519 words

Toward Complex-Valued Neural Networks for Waveform Generation

2026-05-02 · 更新于 2026-07-24 · 3 min · 446 words

Towards True Speech-to-Speech Models Without Text Guidance

2026-05-02 · 更新于 2026-07-24 · 2 min · 368 words

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

2026-05-02 · 更新于 2026-07-24 · 2 min · 341 words

TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

2026-05-02 · 更新于 2026-07-24 · 2 min · 236 words

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

2026-05-02 · 更新于 2026-07-24 · 2 min · 294 words

TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

2026-05-02 · 更新于 2026-07-24 · 2 min · 396 words

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

2026-05-02 · 更新于 2026-07-24 · 2 min · 336 words

Unified Multi-Modal Interactive and Reactive 3D Motion Generation via Rectified Flow

2026-05-02 · 更新于 2026-07-24 · 2 min · 357 words

UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

2026-05-02 · 更新于 2026-07-24 · 2 min · 338 words

Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

2026-05-02 · 更新于 2026-07-24 · 2 min · 323 words

VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

2026-05-02 · 更新于 2026-07-24 · 3 min · 432 words

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video

2026-05-02 · 更新于 2026-07-24 · 2 min · 300 words

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

2026-05-02 · 更新于 2026-07-24 · 3 min · 457 words

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 361 words

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

2026-05-02 · 更新于 2026-07-24 · 2 min · 391 words

WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

2026-05-02 · 更新于 2026-07-24 · 2 min · 422 words

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

2026-05-02 · 更新于 2026-07-24 · 2 min · 353 words

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

2026-05-02 · 更新于 2026-07-24 · 2 min · 312 words

YuE: Scaling Open Foundation Models for Long-Form Music Generation

2026-05-02 · 更新于 2026-07-24 · 2 min · 354 words

Speech/Music/Audio Paper Digest 2026-05-02

2026-05-02 · 更新于 2026-07-24 · 4 min · 724 words

ICASSP 2026 Speech/Audio Papers: Detailed Analysis

2026-05-01 · 更新于 2026-07-24 · 430 min · 91382 words

ICLR 2026 Speech/Audio Papers: Detailed Analysis

2026-05-01 · 更新于 2026-07-24 · 72 min · 15177 words

A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

2026-05-01 · 更新于 2026-07-24 · 2 min · 336 words

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

2026-05-01 · 更新于 2026-07-24 · 1 min · 148 words

Accent Conversion: A Problem-Driven Survey of Sociolinguistic and Technical Constraints

2026-05-01 · 更新于 2026-07-24 · 1 min · 181 words

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

2026-05-01 · 更新于 2026-07-24 · 2 min · 344 words

AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

2026-05-01 · 更新于 2026-07-24 · 2 min · 357 words

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-05-01 · 更新于 2026-07-24 · 2 min · 267 words

Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues

2026-05-01 · 更新于 2026-07-24 · 1 min · 160 words

Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

2026-05-01 · 更新于 2026-07-24 · 3 min · 593 words

Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification

2026-05-01 · 更新于 2026-07-24 · 2 min · 378 words

BUT System Description for CHiME-9 MCoRec Challenge

2026-05-01 · 更新于 2026-07-24 · 2 min · 334 words

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

2026-05-01 · 更新于 2026-07-24 · 2 min · 396 words

Do Sparse Autoencoders Capture Concept Manifolds?

2026-05-01 · 更新于 2026-07-24 · 2 min · 283 words

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

2026-05-01 · 更新于 2026-07-24 · 3 min · 452 words

Earable Platform with Integrated Simultaneous EEG Sensing and Auditory Stimulation

2026-05-01 · 更新于 2026-07-24 · 2 min · 271 words

EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures

2026-05-01 · 更新于 2026-07-24 · 3 min · 568 words

Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing

2026-05-01 · 更新于 2026-07-24 · 2 min · 311 words

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-05-01 · 更新于 2026-07-24 · 2 min · 319 words

HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

2026-05-01 · 更新于 2026-07-24 · 2 min · 314 words

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2026-05-01 · 更新于 2026-07-24 · 2 min · 261 words

JaiTTS: A Thai Voice Cloning Model

2026-05-01 · 更新于 2026-07-24 · 2 min · 264 words

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-05-01 · 更新于 2026-07-24 · 2 min · 378 words

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

2026-05-01 · 更新于 2026-07-24 · 2 min · 228 words

Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI

2026-05-01 · 更新于 2026-07-24 · 1 min · 153 words

MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents

2026-05-01 · 更新于 2026-07-24 · 3 min · 434 words

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

2026-05-01 · 更新于 2026-07-24 · 3 min · 461 words

Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People

2026-05-01 · 更新于 2026-07-24 · 1 min · 125 words

Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device

2026-05-01 · 更新于 2026-07-24 · 3 min · 434 words

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

2026-05-01 · 更新于 2026-07-24 · 1 min · 139 words

Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping

2026-05-01 · 更新于 2026-07-24 · 1 min · 174 words

Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven’s Piano and Cello Sonatas, 1930–2012

2026-05-01 · 更新于 2026-07-24 · 2 min · 237 words

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

2026-05-01 · 更新于 2026-07-24 · 1 min · 133 words

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

2026-05-01 · 更新于 2026-07-24 · 2 min · 274 words

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-05-01 · 更新于 2026-07-24 · 2 min · 327 words

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2026-05-01 · 更新于 2026-07-24 · 4 min · 708 words

Speech/Music/Audio Paper Digest 2026-05-01

2026-05-01 · 更新于 2026-07-24 · 12 min · 2481 words

April ¹³¹²

A New Location Estimator for Mixed LOS & NLOS scenarios

2026-04-30 · 更新于 2026-07-24 · 2 min · 319 words

A Toolkit for Detecting Spurious Correlations in Speech Datasets

2026-04-30 · 更新于 2026-07-24 · 2 min · 345 words

DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

2026-04-30 · 更新于 2026-07-24 · 2 min · 404 words

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

2026-04-30 · 更新于 2026-07-24 · 2 min · 318 words

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

2026-04-30 · 更新于 2026-07-24 · 2 min · 422 words

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

2026-04-30 · 更新于 2026-07-24 · 2 min · 411 words

Fitting Large Nonlinear Mixed Effects Models Using Variational Expectation Maximization

2026-04-30 · 更新于 2026-07-24 · 1 min · 103 words

Full band denoising of room impulse response in the wavelet domain with dictionary learning

2026-04-30 · 更新于 2026-07-24 · 2 min · 270 words

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

2026-04-30 · 更新于 2026-07-24 · 2 min · 344 words

Hankel and Toeplitz Rank-1 Decomposition of Arbitrary Matrices with Applications to Signal Direction-of-Arrival Estimation

2026-04-30 · 更新于 2026-07-24 · 2 min · 255 words

Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

2026-04-30 · 更新于 2026-07-24 · 2 min · 405 words

Multiple Additive Neural Networks for Structured and Unstructured Data

2026-04-30 · 更新于 2026-07-24 · 2 min · 297 words

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

2026-04-30 · 更新于 2026-07-24 · 2 min · 365 words

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

2026-04-30 · 更新于 2026-07-24 · 2 min · 411 words

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

2026-04-30 · 更新于 2026-07-24 · 3 min · 444 words

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

2026-04-30 · 更新于 2026-07-24 · 2 min · 410 words

Random Cloud: Finding Minimal Neural Architectures Without Training

2026-04-30 · 更新于 2026-07-24 · 2 min · 286 words

Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

2026-04-30 · 更新于 2026-07-24 · 1 min · 207 words

Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

2026-04-30 · 更新于 2026-07-24 · 3 min · 493 words

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

2026-04-30 · 更新于 2026-07-24 · 2 min · 223 words

StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario

2026-04-30 · 更新于 2026-07-24 · 2 min · 230 words

Step-Audio-R1.5 Technical Report

2026-04-30 · 更新于 2026-07-24 · 2 min · 266 words

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

2026-04-30 · 更新于 2026-07-24 · 2 min · 374 words

Text-Utilization for Encoder-dominated Speech Recognition Models

2026-04-30 · 更新于 2026-07-24 · 1 min · 135 words

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

2026-04-30 · 更新于 2026-07-24 · 2 min · 414 words

Speech/Music/Audio Paper Digest 2026-04-30

2026-04-30 · 更新于 2026-07-24 · 16 min · 3385 words

3D Mesh Grid Room Impulse Responses Measured with A Linear Microphone Array And Suppression of Frame Reflections

2026-04-29 · 更新于 2026-07-24 · 1 min · 202 words

A Bayesian Approach to Singing Skill Evaluation Using Semitone Pitch Histogram and MCMC-Based Generated Quantities

2026-04-29 · 更新于 2026-07-24 · 2 min · 271 words

A Bimodal Approach for Detecting Fatigue Using Speech and Personal Assessments in College Students

2026-04-29 · 更新于 2026-07-24 · 1 min · 194 words

A Consistent Learning Depression Detection Framework Integrating Multi-View Attention

2026-04-29 · 更新于 2026-07-24 · 2 min · 298 words

A Data-Driven Framework for Personal Sound Zone Control Addressing Loudspeaker Nonlinearities

2026-04-29 · 更新于 2026-07-24 · 2 min · 342 words

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

2026-04-29 · 更新于 2026-07-24 · 2 min · 238 words

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

2026-04-29 · 更新于 2026-07-24 · 2 min · 279 words

A Dynamic Gated Cross-Attention Framework for Audio-Text Apparent Personality Analysis

2026-04-29 · 更新于 2026-07-24 · 2 min · 285 words

A Feature-Optimized Audio Watermarking Algorithm with Adaptive Embedding Strength

2026-04-29 · 更新于 2026-07-24 · 2 min · 375 words

A Framework for Controlled Multi-Speaker Audio Synthesis for Robustness Evaluation of Speaker Diarisation Systems

2026-04-29 · 更新于 2026-07-24 · 2 min · 342 words

A Generalization Strategy for Speech Quality Prediction: From Domain-Specific to Unified Datasets

2026-04-29 · 更新于 2026-07-24 · 2 min · 274 words

A Generative-First Neural Audio Autoencoder

2026-04-29 · 更新于 2026-07-24 · 2 min · 296 words

A Hybrid Convolution-Mamba Network with Tone-Octave Contrastive Learning for Stratified Semi-Supervised Singing Melody Extraction

2026-04-29 · 更新于 2026-07-24 · 2 min · 391 words

A Learning-Based Automotive Sound Field Reproduction Method Using Plane-Wave Decomposition and Multi-Position Constraint

2026-04-29 · 更新于 2026-07-24 · 2 min · 243 words

A Lightweight Fourier-Based Network for Binaural Speech Enhancement with Spatial Cue Preservation

2026-04-29 · 更新于 2026-07-24 · 2 min · 395 words

A LLM-Driven Acoustic Semantic Enriched Framework for Underwater Acoustic Target Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 379 words

A Metric Learning Approach to Heart Murmur Detection from Phonocardiogram Recordings

2026-04-29 · 更新于 2026-07-24 · 2 min · 389 words

A New Method and Dataset for Classroom Teaching Stage Segmentation

2026-04-29 · 更新于 2026-07-24 · 2 min · 372 words

A Noniterative Phase Retrieval Considering the Zeros of STFT Magnitude

2026-04-29 · 更新于 2026-07-24 · 2 min · 214 words

A Novel Monte Carlo Gradient Method Based on Meta-Learning for Effective Step-Size Selection in Active Noise Control

2026-04-29 · 更新于 2026-07-24 · 2 min · 242 words

A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 314 words

A Personalized Real-Time Proactive Voice Memory Assistant

2026-04-29 · 更新于 2026-07-24 · 2 min · 298 words

A Robust KNN Approach for Multi-Class Laryngeal Disease Detection using MFCC Features

2026-04-29 · 更新于 2026-07-24 · 2 min · 219 words

A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding

2026-04-29 · 更新于 2026-07-24 · 1 min · 194 words

A Speech-Driven Paradigm for Physics-Informed Modeling of Coupled Micro-Speakers

2026-04-29 · 更新于 2026-07-24 · 2 min · 280 words

A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering

2026-04-29 · 更新于 2026-07-24 · 2 min · 357 words

A State-Dependent Markov Diffusion Process for Generative Speech Enhancement

2026-04-29 · 更新于 2026-07-24 · 3 min · 463 words

A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models

2026-04-29 · 更新于 2026-07-24 · 2 min · 293 words

A Superb-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

2026-04-29 · 更新于 2026-07-24 · 3 min · 507 words

A Task-Aware Dual-Level Self-Supervised Learning Method for Effective Sound Event Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 308 words

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

2026-04-29 · 更新于 2026-07-24 · 2 min · 387 words

A Unified SVD-Modal Solution for Sparse Sound Field Reconstruction with Hybrid Spherical-Linear Microphone Arrays

2026-04-29 · 更新于 2026-07-24 · 2 min · 264 words

A Unsupervised Domain Adaptation Framework For Semi-Supervised Melody Extraction Using Confidence Matrix Replace and Nearest Neighbour Supervision

2026-04-29 · 更新于 2026-07-24 · 2 min · 307 words

ACAVCaps: Enabling Large-Scale Training for Fine-Grained and Diverse Audio Understanding

2026-04-29 · 更新于 2026-07-24 · 2 min · 268 words

Accelerating Regularized Attention Kernel Regression for Spectrum Cartography

2026-04-29 · 更新于 2026-07-24 · 2 min · 312 words

AccLID: Accent-aware Language Identification for Robust Multilingual Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 417 words

ACIR-MACL: Effective Multimodal Sentiment Analysis via Attention-Based Causal Intervention Regularization and Multi-Aspect Contrastive Learning

2026-04-29 · 更新于 2026-07-24 · 2 min · 399 words

Acoustic and Facial Markers of Perceived Conversational Success in Spontaneous Speech

2026-04-29 · 更新于 2026-07-24 · 2 min · 253 words

Acoustic Feedback Cancellation in Hearing Aids Exploiting an Inertial Sensor

2026-04-29 · 更新于 2026-07-24 · 2 min · 296 words

Acoustic Non-Stationarity Objective Assessment with Hard Label Criteria for Supervised Learning Models

2026-04-29 · 更新于 2026-07-24 · 2 min · 253 words

Acoustic Teleportation Via Disentangled Neural Audio Codec Representations

2026-04-29 · 更新于 2026-07-24 · 2 min · 313 words

Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 330 words

Adaptive Deterministic Flow Matching for Target Speaker Extraction

2026-04-29 · 更新于 2026-07-24 · 2 min · 383 words

Adaptive Embedding Fusion with Contrastive Learning for Robust Fully Few-Shot Class-Incremental Audio Classification

2026-04-29 · 更新于 2026-07-24 · 2 min · 378 words

Adaptive Per-Channel Energy Normalization Front-End for Robust Audio Signal Processing

2026-04-29 · 更新于 2026-07-24 · 2 min · 266 words

Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios

2026-04-29 · 更新于 2026-07-24 · 2 min · 303 words

Adaptive Spectral Weighting in Sagittal-Plane Sound Localization: A Reliability-Driven Approach

2026-04-29 · 更新于 2026-07-24 · 1 min · 193 words

Adaptive Task-Incremental Learning For Underwater Acoustic Recognition Based on Mixture-of-Experts Adapter

2026-04-29 · 更新于 2026-07-24 · 2 min · 318 words

Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 261 words

ADH-VA: Adaptive Directed-Hypergraph Convolution with VA Contrastive Learning for Multimodal Conversational Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 401 words

Advanced modeling of interlanguage speech intelligibility benefit with L1-L2 multi-task learning using differentiable K-means for accent-robust discrete token-based ASR

2026-04-29 · 更新于 2026-07-24 · 2 min · 367 words

Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training

2026-04-29 · 更新于 2026-07-24 · 2 min · 274 words

Advancing Semi-Supervised Child Speech Recognition with Omni-Temporal Classification under Label Noise

2026-04-29 · 更新于 2026-07-24 · 2 min · 397 words

Advancing Speech Summarization in Multi-Modal LLMs with Reinforcement Learning

2026-04-29 · 更新于 2026-07-24 · 2 min · 278 words

Advancing Speech Understanding in Speech-Aware Language Models with GRPO

2026-04-29 · 更新于 2026-07-24 · 2 min · 359 words

Adversarial Defense via Generative Speech Enhancement Module

2026-04-29 · 更新于 2026-07-24 · 2 min · 311 words

Adversarial Fine-Tuning on Speech Foundation Model with Vulnerable Attention Consistency Regularization for Robust Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 3 min · 457 words

Adversarial Rivalry Learning for Music Classification

2026-04-29 · 更新于 2026-07-24 · 3 min · 476 words

Affect-Jigsaw: Integrating Core and Peripheral Emotions for Harmonious Fine-Grained Multimodal Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 325 words

AFT: An Exemplar-Free Class Incremental Learning Method for Environmental Sound Classification

2026-04-29 · 更新于 2026-07-24 · 2 min · 344 words

AI-Generated Music Detection in Broadcast Monitoring

2026-04-29 · 更新于 2026-07-24 · 2 min · 235 words

Ailive Mixer: A Deep Learning Based Zero Latency Automatic Music Mixer for Live Music Performances

2026-04-29 · 更新于 2026-07-24 · 1 min · 197 words

AISHELL6-Whisper: A Chinese Mandarin Audio-Visual Whisper Speech Dataset with Speech Recognition Baselines

2026-04-29 · 更新于 2026-07-24 · 2 min · 381 words

Aligning Generative Speech Enhancement with Perceptual Feedback

2026-04-29 · 更新于 2026-07-24 · 3 min · 481 words

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

2026-04-29 · 更新于 2026-07-24 · 2 min · 296 words

ALMA-Chor: Leveraging Audio-Lyric Alignment with Mamba for Chorus Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 298 words

AMBER2: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text

2026-04-29 · 更新于 2026-07-24 · 3 min · 533 words

AmbiDrop: Array-Agnostic Speech Enhancement Using Ambisonics Encoding and Dropout-Based Learning

2026-04-29 · 更新于 2026-07-24 · 1 min · 108 words

AMBISONIC-DML: A Benchmark Dataset for Dynamic Higher-Order Ambisonics Music with Motion-Aligned Stems

2026-04-29 · 更新于 2026-07-24 · 2 min · 322 words

An Anomaly-Aware and Audio-Enhanced Dual-Pathway Framework for Alzheimer’s Disease Progression Classification

2026-04-29 · 更新于 2026-07-24 · 2 min · 336 words

An Audio-Visual Speech Separation Network with Joint Cross-Attention and Iterative Modeling

2026-04-29 · 更新于 2026-07-24 · 2 min · 358 words

An Efficient Neural Network for Modeling Human Auditory Neurograms for Speech

2026-04-29 · 更新于 2026-07-24 · 2 min · 300 words

An End-to-End Multimodal System for Subtitle Recognition and Chinese-Japanese Translation in Short Dramas

2026-04-29 · 更新于 2026-07-24 · 2 min · 269 words

An Envelope Separation Aided Multi-Task Learning Model for Blind Source Counting and Localization

2026-04-29 · 更新于 2026-07-24 · 2 min · 262 words

An Event-Based Sequence Modeling Approach to Recognizing Non-Triad Chords with Oversegmentation Minimization

2026-04-29 · 更新于 2026-07-24 · 2 min · 263 words

An Unsupervised Alignment Feature Fusion System for Spoken Language-Based Dementia Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 316 words

Aneural Forward Filtering for Speaker-Image Separation

2026-04-29 · 更新于 2026-07-24 · 2 min · 251 words

AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

2026-04-29 · 更新于 2026-07-24 · 2 min · 307 words

AnyAccomp: Generalizable Accompaniment Generation Via Quantized Melodic Bottleneck

2026-04-29 · 更新于 2026-07-24 · 2 min · 370 words

AnyRIR: Robust Non-Intrusive Room Impulse Response Estimation in the Wild

2026-04-29 · 更新于 2026-07-24 · 2 min · 296 words

APKD: Aligned And Paced Knowledge Distillation Towards Lightweight Heterogeneous Multimodal Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 265 words

AQUA-Bench: Beyond finding answers to knowing when there are None in Audio Question Answering

2026-04-29 · 更新于 2026-07-24 · 2 min · 356 words

AR-BSNet: Towards Ultra-Low Complexity Autoregressive Target Speaker Extraction With Band-Split Modeling

2026-04-29 · 更新于 2026-07-24 · 2 min · 364 words

AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

2026-04-29 · 更新于 2026-07-24 · 2 min · 323 words

Ara-BEST-RQ: Multi Dialectal Arabic SSL

2026-04-29 · 更新于 2026-07-24 · 2 min · 338 words

Arbitrarily Settable Frame Rate Neural Speech Codec with Content Adaptive Variable Length Segmentation

2026-04-29 · 更新于 2026-07-24 · 2 min · 320 words

ARCHI-TTS: A Flow-Matching-Based Text-to-Speech Model with Self-Supervised Semantic Aligner and Accelerated Inference

2026-04-29 · 更新于 2026-07-24 · 3 min · 528 words

Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

2026-04-29 · 更新于 2026-07-24 · 2 min · 369 words

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

2026-04-29 · 更新于 2026-07-24 · 2 min · 286 words

Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

2026-04-29 · 更新于 2026-07-24 · 3 min · 520 words

Assessing the Impact of Speaker Identity in Speech Spoofing Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 260 words

Assessing The Perceptual Impact of Low-Altitude Aircraft Noise in Cities: An Auralization Framework Using Gaussian Beam Tracing

2026-04-29 · 更新于 2026-07-24 · 2 min · 222 words

Asynchrony-Aware Decoupled Multimodal Control for Cued Speech Video Generation

2026-04-29 · 更新于 2026-07-24 · 2 min · 286 words

ATOM: Adaptive Token-Level Optimal Transport Mixup for Speech Translation

2026-04-29 · 更新于 2026-07-24 · 2 min · 301 words

Atomic Norm Minimization Revisited: Progressive Atom Identification And Refinement

2026-04-29 · 更新于 2026-07-24 · 2 min · 258 words

Attention-Based Encoder-Decoder Target-Speaker Voice Activity Detection for Robust Speaker Diarization

2026-04-29 · 更新于 2026-07-24 · 3 min · 509 words

Attention-Weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied To Speech Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 3 min · 478 words

Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-text System

2026-04-29 · 更新于 2026-07-24 · 2 min · 412 words

Attentive AV-Fusionnet: Audio-Visual Quality Prediction with Hybrid Attention

2026-04-29 · 更新于 2026-07-24 · 2 min · 334 words

Attentive Masked Self-Distillation for Respiratory Sound Classification

2026-04-29 · 更新于 2026-07-24 · 2 min · 338 words

Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

2026-04-29 · 更新于 2026-07-24 · 3 min · 450 words

Audience-Aware Co-speech Gesture Generation in Public Speaking via Anticipation Tokens

2026-04-29 · 更新于 2026-07-24 · 2 min · 274 words

Audio Classification Models are Vulnerable to Filter Perturbations

2026-04-29 · 更新于 2026-07-24 · 1 min · 199 words

Audio Deepfake Detection at the First Greeting: “Hi!”

2026-04-29 · 更新于 2026-07-24 · 2 min · 315 words

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-04-29 · 更新于 2026-07-24 · 2 min · 319 words

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

2026-04-29 · 更新于 2026-07-24 · 2 min · 298 words

Audio-Guided Multimodal Approach for Fine-Grained Alignment and Boundary Modeling in Active Speaker Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 270 words

Audio-Text Jailbreak Attack on Large Audio-Language Models: Towards Generality and Stealthiness

2026-04-29 · 更新于 2026-07-24 · 2 min · 264 words

Audio-to-Score Jazz Solo Transcription with the Rhythm Perceiver

2026-04-29 · 更新于 2026-07-24 · 2 min · 282 words

Audio-Visual Deepfake Generation and Detection: An Exploratory Survey

2026-04-29 · 更新于 2026-07-24 · 1 min · 176 words

Audio-Visual Feature Fusion for Calibrating Relevance Scores of Video Moment Retrieval

2026-04-29 · 更新于 2026-07-24 · 2 min · 346 words

AUDIOCARDS: Structured Metadata Improves Audio Language Models for Sound Design

2026-04-29 · 更新于 2026-07-24 · 2 min · 257 words

AudioFuse: Unified Spectral-Temporal Learning Via A Hybrid VIT-1D CNN Architecture for Phonocardiogram Classification

2026-04-29 · 更新于 2026-07-24 · 2 min · 293 words

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

2026-04-29 · 更新于 2026-07-24 · 2 min · 412 words

AUDIOGENIE-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning

2026-04-29 · 更新于 2026-07-24 · 3 min · 468 words

Auditory Illusion Benchmark for Large Audio Language Models

2026-04-29 · 更新于 2026-07-24 · 1 min · 196 words

Auditory-Inspired Transformer for Binaural Speech Enhancement and Spatial Cue Preservation

2026-04-29 · 更新于 2026-07-24 · 2 min · 271 words

AURA: A Stegaformer-Based Scalable Deep Audio Watermark with Extreme Robustness

2026-04-29 · 更新于 2026-07-24 · 2 min · 344 words

Auto-MatchCut: An Audio-Visual Retrieval Framework for Seamless Match Cutting

2026-04-29 · 更新于 2026-07-24 · 2 min · 361 words

Automated Dysphagia Screening Using Noninvasive Neck Acoustic Sensing

2026-04-29 · 更新于 2026-07-24 · 2 min · 376 words

Automatic Estimation of Speaker Diarization Error Rate Based on Features of Audio Quality and Speaker Discriminability

2026-04-29 · 更新于 2026-07-24 · 2 min · 270 words

Automatic Music Mixing Using a Generative Model of Effect Embeddings

2026-04-29 · 更新于 2026-07-24 · 2 min · 352 words

Automatic Music Sample Identification with Multi-Track Contrastive Learning

2026-04-29 · 更新于 2026-07-24 · 2 min · 412 words

AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

2026-04-29 · 更新于 2026-07-24 · 2 min · 374 words

Auxiliary Multi-Label Training For Improving the Robustness of Audio Deepfake Detection on AI-Processed Data

2026-04-29 · 更新于 2026-07-24 · 2 min · 284 words

AVATAR: Audio-Visual Adaptive Fusion via Trained Agent Reinforcement for Multimodal Deepfake Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 338 words

AVO-65: A Large-Scale Hierarchical Audio-Visual Object Dataset

2026-04-29 · 更新于 2026-07-24 · 2 min · 318 words

B-GRPO: Unsupervised Speech Emotion Recognition Based on Batched-Group Relative Policy Optimization

2026-04-29 · 更新于 2026-07-24 · 2 min · 393 words

BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on POP and Classical Music

2026-04-29 · 更新于 2026-07-24 · 2 min · 318 words

Bayesian Low-Rank Factorization for Robust Model Adaptation

2026-04-29 · 更新于 2026-07-24 · 2 min · 260 words

Bayesian Signal Separation Via Plug-and-Play Diffusion-Within-Gibbs Sampling

2026-04-29 · 更新于 2026-07-24 · 2 min · 303 words

BBPE16: UTF-16-Based Byte-Level Byte-Pair Encoding for Improved Multilingual Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 310 words

Beamforming Using Virtual Microphones for Hearing Aid Applications

2026-04-29 · 更新于 2026-07-24 · 1 min · 210 words

Beat and Downbeat Detection: A Reformulated Approach

2026-04-29 · 更新于 2026-07-24 · 2 min · 306 words

BeatMamba: Bidirectional Selective State-Space Modeling for Efficient Beat Tracking

2026-04-29 · 更新于 2026-07-24 · 2 min · 319 words

Behind the Scenes: Mechanistic Interpretability of Lora-Adapted Whisper for Speech Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 233 words

Benchmarking Humans And Machines On Complex Multilingual Speech Understanding Tasks

2026-04-29 · 更新于 2026-07-24 · 2 min · 262 words

Benchmarking Music Autotagging with MGPHot Expert Annotations vs. Generic Tag Datasets

2026-04-29 · 更新于 2026-07-24 · 2 min · 307 words

BEST-RQ-based Self-Supervised Learning for Whisper Domain Adaptation

2026-04-29 · 更新于 2026-07-24 · 2 min · 320 words

BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

2026-04-29 · 更新于 2026-07-24 · 4 min · 650 words

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 389 words

Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

2026-04-29 · 更新于 2026-07-24 · 2 min · 333 words

Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding

2026-04-29 · 更新于 2026-07-24 · 2 min · 325 words

Beyond Mapping: Domain-Invariant Representations via Spectral Embedding of Optimal Transport Plans

2026-04-29 · 更新于 2026-07-24 · 3 min · 446 words

Bimodal Fusion Framework for Dynamic Facial Expression Recognition In-The-Wild

2026-04-29 · 更新于 2026-07-24 · 2 min · 329 words

BioSEN: A Bio-Acoustic Signal Enhancement Network for Animal Vocalizations

2026-04-29 · 更新于 2026-07-24 · 2 min · 395 words

BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 415 words

Bleed No More: Generative Interference Reduction for Musical Recordings

2026-04-29 · 更新于 2026-07-24 · 3 min · 600 words

Bloodroot: When Watermarking Turns Poisonous for Stealthy Backdoor

2026-04-29 · 更新于 2026-07-24 · 2 min · 230 words

Bone-Conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models

2026-04-29 · 更新于 2026-07-24 · 3 min · 448 words

Brainprint-Modulated Target Speaker Extraction

2026-04-29 · 更新于 2026-07-24 · 2 min · 320 words

Break-the-Beat! Controllable MIDI-to-Drum audio synthesis

2026-04-29 · 更新于 2026-07-24 · 3 min · 440 words

BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

2026-04-29 · 更新于 2026-07-24 · 2 min · 344 words

Bridging the Front-End and Back-End for Robust ASR via Cross-Attention-Based U-Net

2026-04-29 · 更新于 2026-07-24 · 2 min · 255 words

Bridging the Measurement–Simulation Gap in Room Acoustics with Real2sim Diffusion

2026-04-29 · 更新于 2026-07-24 · 2 min · 276 words

Bridging the Semantic Gap: Cross-Attentive Fusion for Joint Acoustic-Semantic Speech Quality Assessment

2026-04-29 · 更新于 2026-07-24 · 2 min · 404 words

BSMP-SENet: Band-Split Magnitude-Phase Network for Speech Enhancement

2026-04-29 · 更新于 2026-07-24 · 2 min · 301 words

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

2026-04-29 · 更新于 2026-07-24 · 3 min · 520 words

CaMoD: Causal-Aware Modality Denoising for Multimodal Dialogue Intent Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 238 words

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

2026-04-29 · 更新于 2026-07-24 · 2 min · 309 words

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs

2026-04-29 · 更新于 2026-07-24 · 2 min · 333 words

Caption and Audio-Guided Video Representation Learning with Gated Attention for Partially Relevant Video Retrieval

2026-04-29 · 更新于 2026-07-24 · 2 min · 344 words

Cardiobridge-DM: Bridging Cross-Cohort Heart Sound Synthesis via Rhythm-Aware Semi-Supervised Diffusion

2026-04-29 · 更新于 2026-07-24 · 2 min · 309 words

CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

2026-04-29 · 更新于 2026-07-24 · 2 min · 216 words

CCST: Cross-Modal and Consistency-Aware Self-Training for Source-Free Unsupervised Domain Adaptation in Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 3 min · 486 words

Chunk-Wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

2026-04-29 · 更新于 2026-07-24 · 2 min · 303 words

Chunkwise Aligners for Streaming Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 329 words

Class-Aware Permutation-Invariant Signal-to-Distortion Ratio for Semantic Segmentation of Sound Scene with Same-Class Sources

2026-04-29 · 更新于 2026-07-24 · 2 min · 252 words

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

2026-04-29 · 更新于 2026-07-24 · 3 min · 596 words

Clue2Emo: A Brain-Inspired Framework for Open-Vocabulary Multimodal Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 3 min · 441 words

CMSA-Mamba: Hierarchical State Space Modeling for Audio-Based Depression Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 288 words

Co-Initialization of Control Filter and Secondary Path via Meta-Learning for Active Noise Control

2026-04-29 · 更新于 2026-07-24 · 2 min · 290 words

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

2026-04-29 · 更新于 2026-07-24 · 2 min · 251 words

CodeSep: Low-Bitrate Codec-Driven Speech Separation with Base-Token Disentanglement and Auxiliary-Token Serial Prediction

2026-04-29 · 更新于 2026-07-24 · 2 min · 351 words

Combining Multi-Order Attention and Multi-Resolution Discriminator for High-Fidelity Neural Vocoder

2026-04-29 · 更新于 2026-07-24 · 3 min · 487 words

Combining SSL Speech Features, Contextual Transformers and Mamba Models for Realistic Audio Spoofing Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 352 words

Compression meets Sampling: LZ78-SPA for Efficient Symbolic Music Generation

2026-04-29 · 更新于 2026-07-24 · 2 min · 396 words

CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-Spoofing Countermeasures

2026-04-29 · 更新于 2026-07-24 · 2 min · 411 words

Condition-Invariant fMRI decoding of speech intelligibility with deep state space model

2026-04-29 · 更新于 2026-07-24 · 3 min · 448 words

Conditional Diffusion Models for Mental Health-Preserving Voice Conversion

2026-04-29 · 更新于 2026-07-24 · 2 min · 246 words

Confidence-Based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens

2026-04-29 · 更新于 2026-07-24 · 2 min · 319 words

Confidence-Guided Error Correction for Disordered Speech Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 425 words

Connecting Layer-Wise Representation of Wavlm with Spectro-Temporal Modulation on Speaker Verification

2026-04-29 · 更新于 2026-07-24 · 2 min · 214 words

Constraint Optimized Multichannel Mixer-Limiter Design

2026-04-29 · 更新于 2026-07-24 · 2 min · 370 words

Constructing Composite Features for Interpretable Music-Tagging

2026-04-29 · 更新于 2026-07-24 · 2 min · 306 words

Content Anonymization for Privacy in Long-Form Audio

2026-04-29 · 更新于 2026-07-24 · 2 min · 237 words

Content Leakage in Librispeech and its Impact on the Privacy Evaluation of Speaker Anonymization

2026-04-29 · 更新于 2026-07-24 · 1 min · 192 words

Content-Preserving Speech Representation Learning Via Adaptive Segment-Level Alignment

2026-04-29 · 更新于 2026-07-24 · 3 min · 434 words

Context-Aware Dynamic Graph Learning for Multimodal Emotion Recognition with Missing Modalities

2026-04-29 · 更新于 2026-07-24 · 2 min · 367 words

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

2026-04-29 · 更新于 2026-07-24 · 3 min · 492 words

Continuation Method for Feedback Delay Network Modal Decomposition

2026-04-29 · 更新于 2026-07-24 · 1 min · 184 words

Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

2026-04-29 · 更新于 2026-07-24 · 3 min · 454 words

Contrastive Timbre Representations for Musical Instrument And Synthesizer Retrieval

2026-04-29 · 更新于 2026-07-24 · 2 min · 284 words

Controllable Embedding Transformation for Mood-Guided Music Retrieval

2026-04-29 · 更新于 2026-07-24 · 2 min · 347 words

Cooperative Multi-Agent Reinforcement Learning for Adaptive Aggregation in Semi-Supervised Federated Learning with non-IID Data

2026-04-29 · 更新于 2026-07-24 · 2 min · 275 words

CosyAccent: Duration-Controllable Accent Normalization using Source-Synthesis Training Data

2026-04-29 · 更新于 2026-07-24 · 2 min · 246 words

Coupling Acoustic Geometry and Visual Semantics for Robust Depth Estimation

2026-04-29 · updated 2026-07-24 · 4 min · 742 words

CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content

2026-04-29 · updated 2026-07-24 · 2 min · 345 words

Cross-Architecture Knowledge Distillation of WavLM for Lightweight Speaker Verification

2026-04-29 · updated 2026-07-24 · 2 min · 376 words

Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

2026-04-29 · updated 2026-07-24 · 2 min · 256 words

Cross-Domain Contrastive Learning with Dynamic Threshold Calibration for Source Speaker Tracing

2026-04-29 · updated 2026-07-24 · 2 min · 298 words

Cross-Lingual Alzheimer’s Disease Detection with Multimodal LLMs via Speech Cue-Augmented Prompting and Instruction Tuning

2026-04-29 · updated 2026-07-24 · 3 min · 479 words

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

2026-04-29 · updated 2026-07-24 · 3 min · 428 words

Cross-Lingual Interleaving for Speech Language Models

2026-04-29 · updated 2026-07-24 · 3 min · 507 words

Cross-Linguistic Rhythmic and Spectral Feature-Based Analysis of Nyishi and Adi: Two Under-Resourced Languages of Arunachal Pradesh

2026-04-29 · updated 2026-07-24 · 1 min · 22 words

Cross-Modal Bottleneck Fusion for Noise Robust Audio-Visual Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 289 words

Cross-Modal Knowledge Distillation for Speech Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 371 words

CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications

2026-04-29 · updated 2026-07-24 · 2 min · 237 words

Curriculum Learning with Contrastive Loss for Lightweight Speaker Verification

2026-04-29 · updated 2026-07-24 · 3 min · 428 words

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

2026-04-29 · updated 2026-07-24 · 3 min · 458 words

D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation from Lead Sheet

2026-04-29 · updated 2026-07-24 · 2 min · 305 words

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 408 words

DAMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

2026-04-29 · updated 2026-07-24 · 3 min · 446 words

DAT-CFTNet: Speech Enhancement for Cochlear Implant Recipients using Attention-based Dual-Path Recurrent Neural Network

2026-04-29 · updated 2026-07-24 · 2 min · 381 words

DBFT-SD: Weakly Supervised Multimodal Detection of Sensitive Audio-Visual Content

2026-04-29 · updated 2026-07-24 · 2 min · 215 words

DDSC: Dynamic Dual-Signal Curriculum for Data-Efficient Acoustic Scene Classification Under Domain Shift

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

DDSR-Net: Robust Multimodal Sentiment Analysis via Dynamic Modality Reliability Assessment

2026-04-29 · updated 2026-07-24 · 5 min · 864 words

DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG

2026-04-29 · updated 2026-07-24 · 2 min · 221 words

Decoder-Only Conformer with Modality-Aware Sparse Mixtures of Experts for ASR

2026-04-29 · updated 2026-07-24 · 2 min · 379 words

Decorrelation-Enhanced Multiband Subband Adaptive Filtering for RIR Tracking in Sound Field Control

2026-04-29 · updated 2026-07-24 · 2 min · 299 words

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS

2026-04-29 · updated 2026-07-24 · 2 min · 265 words

Deep Learning-Based Joint Optimization of Adaptive Feedback Cancellation and Residual Feedback Suppression for Hearing Aids

2026-04-29 · updated 2026-07-24 · 2 min · 366 words

Deep Spatial Clue Informed Ambisonic Encoding for Irregular Microphone Arrays

2026-04-29 · updated 2026-07-24 · 3 min · 478 words

Deepaq: A Perceptual Audio Quality Metric Based on Foundational Models and Weakly Supervised Learning

2026-04-29 · updated 2026-07-24 · 2 min · 400 words

Denoising Of Stochastic Ray Tracing Room Impulse Responses

2026-04-29 · updated 2026-07-24 · 2 min · 360 words

DepthTalk: Few-Shot Talking Head Generation with Depth-Aware 3D Gaussian Field Motion

2026-04-29 · updated 2026-07-24 · 2 min · 238 words

Detecting and Attributing Synthetic Spanish Speech: The HISPASpoof Dataset

2026-04-29 · updated 2026-07-24 · 2 min · 325 words

DGSDNet: Dual-Graph Spectral Diffusion Network for Incomplete Multimodal Emotion Recognition in Conversations

2026-04-29 · updated 2026-07-24 · 3 min · 438 words

Diff-vs: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

Diffemotalk: Audio-Driven Facial Animation with Fine-Grained Emotion Control via Diffusion Models

2026-04-29 · updated 2026-07-24 · 2 min · 317 words

Differentiable Grouped Feedback Delay Networks for Learning Direction and Position-Dependent Late Reverberation

2026-04-29 · updated 2026-07-24 · 2 min · 340 words

Differentiable Pulsetable Synthesis for Wind Instrument Modeling

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Diffusion Timbre Transfer via Mutual Information Guided Inpainting

2026-04-29 · updated 2026-07-24 · 2 min · 284 words

Direct Preference Optimization For Speech Autoregressive Diffusion Models

2026-04-29 · updated 2026-07-24 · 2 min · 347 words

Direct Simultaneous Translation Activation for Large Audio-Language Models

2026-04-29 · updated 2026-07-24 · 3 min · 465 words

Direct Transfer of Prosody in Speech-to-speech Translation using Disentangled Speech Tokens

2026-04-29 · updated 2026-07-24 · 3 min · 523 words

Directly Trained Spiking Neural Networks with Adaptive Phase Coding

2026-04-29 · updated 2026-07-24 · 1 min · 206 words

DisContSE: Single-Step Diffusion Speech Enhancement based on Joint Discrete and Continuous Embeddings

2026-04-29 · updated 2026-07-24 · 3 min · 431 words

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens

2026-04-29 · updated 2026-07-24 · 2 min · 392 words

Discrete-Continuous Fusion With Adaptive Hierarchical Features For Audio Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 304 words

Disentangled Authenticity Representation for Partially Deepfake Audio Localization

2026-04-29 · updated 2026-07-24 · 2 min · 316 words

Disentangling Physiology from Fidelity: Latent-Guided Diffusion Models for Cross-Modal Cardiac Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 313 words

Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

DISSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration

2026-04-29 · updated 2026-07-24 · 3 min · 431 words

Distilling Attention Knowledge for Speaker Verification

2026-04-29 · updated 2026-07-24 · 3 min · 462 words

Distributed Multichannel Active Noise Control with Asynchronous Communication

2026-04-29 · updated 2026-07-24 · 2 min · 216 words

DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

2026-04-29 · updated 2026-07-24 · 3 min · 513 words

DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

2026-04-29 · updated 2026-07-24 · 2 min · 334 words

Diverse and Few-Step Audio Captioning via Flow Matching

2026-04-29 · updated 2026-07-24 · 2 min · 361 words

DMP-TTS: Disentangled Multi-Modal Prompting for Controllable Text-to-Speech with Chained Guidance

2026-04-29 · updated 2026-07-24 · 2 min · 399 words

Do Bias Benchmarks Generalise? Evidence from Voice-Based Evaluation of Gender Bias in SpeechLLMs

2026-04-29 · updated 2026-07-24 · 2 min · 306 words

Do Foundational Audio Encoders Understand Music Structure?

2026-04-29 · updated 2026-07-24 · 2 min · 251 words

Do Speech LLMs Learn Crossmodal Embedding Spaces?

2026-04-29 · updated 2026-07-24 · 1 min · 213 words

Do We Need EMA for Diffusion-Based Speech Enhancement? Toward A Magnitude-Preserving Network Architecture

2026-04-29 · updated 2026-07-24 · 3 min · 476 words

Do we really need self-attention for streaming automatic speech recognition?

2026-04-29 · updated 2026-07-24 · 2 min · 341 words

Do You Hear What I Mean? Quantifying the Instruction-Perception GAP in Instruction-Guided Expressive Text-to-Speech Systems

2026-04-29 · updated 2026-07-24 · 2 min · 224 words

Does the Pre-Training of an Embedding Influence its Encoding of Age?

2026-04-29 · updated 2026-07-24 · 1 min · 169 words

DOMA: Leveraging Diffusion Language Models with Adaptive Prior for Intent Classification and Slot Filling

2026-04-29 · updated 2026-07-24 · 3 min · 427 words

Domain Partitioning Meets Parameter-Efficient Fine-Tuning: A Novel Method for Improved Language-Queried Audio Source Separation

2026-04-29 · updated 2026-07-24 · 2 min · 376 words

Domain-Aware Scheduling for ASR Fine-Tuning

2026-04-29 · updated 2026-07-24 · 2 min · 269 words

Domain-Invariant Representation Learning of Bird Sounds

2026-04-29 · updated 2026-07-24 · 2 min · 412 words

DPO-Regularized Regression for Age Prediction

2026-04-29 · updated 2026-07-24 · 2 min · 236 words

DPT-Net: Dual-Path Transformer Network with Hierarchical Fusion for EEG-based Envelope Reconstruction

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

DSRMS-TransUnet: A Decentralized Non-Shifted TransUnet for Shallow Water Acoustic Source Range Estimation

2026-04-29 · updated 2026-07-24 · 2 min · 294 words

DSSR: Decoupling Salient and Subtle Representations Under Missing Modalities for Multimodal Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

Dual Contrastive Learning for Semi-Supervised Domain Adaptation in Bi-Modal Depression Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 332 words

Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

2026-04-29 · updated 2026-07-24 · 2 min · 405 words

Dual-Perspective Multimodal Sentiment Analysis with MoE Fusion: Representation Learning via Semantic Resonance and Divergence

2026-04-29 · updated 2026-07-24 · 3 min · 434 words

Dual-Strategy-Enhanced Conbimamba for Neural Speaker Diarization

2026-04-29 · updated 2026-07-24 · 2 min · 367 words

Dynamic Balanced Cross-Modal Attention with Gated Sequence Restoration: Towards Robust Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 233 words

Dynamic Noise-Aware Multi LoRA Framework Towards Real-World Audio Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 294 words

Dynamic Spectrogram Analysis with Local-Aware Graph Networks for Audio Anti-Spoofing

2026-04-29 · updated 2026-07-24 · 2 min · 333 words

Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training

2026-04-29 · updated 2026-07-24 · 2 min · 244 words

E2E-AEC: Implementing An End-To-End Neural Network Learning Approach for Acoustic Echo Cancellation

2026-04-29 · updated 2026-07-24 · 2 min · 368 words

Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

2026-04-29 · updated 2026-07-24 · 2 min · 324 words

ECHO: Frequency-Aware Hierarchical Encoding for Variable-Length Signals

2026-04-29 · updated 2026-07-24 · 2 min · 340 words

EchoFake: A Replay-Aware Dataset For Practical Speech Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 393 words

EchoRAG: A Two-Stage Framework for Audio-Text Retrieval and Temporal Grounding

2026-04-29 · updated 2026-07-24 · 2 min · 308 words

ECSA: Dual-Branch Emotion Compensation for Emotion-Consistent Speaker Anonymization

2026-04-29 · updated 2026-07-24 · 2 min · 404 words

EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting

2026-04-29 · updated 2026-07-24 · 2 min · 277 words

EEG and Eye-Tracking Driven Dynamic Target Speaker Extraction with Spontaneous Attention Switching

2026-04-29 · updated 2026-07-24 · 2 min · 295 words

EEND-SAA: Enrollment-Less Main Speaker Voice Activity Detection Using Self-Attention Attractors

2026-04-29 · updated 2026-07-24 · 2 min · 396 words

Efficient Audio-Visual Inference Via Token Clustering And Modality Fusion

2026-04-29 · updated 2026-07-24 · 2 min · 306 words

Efficient Depression Detection from Speech via Language-Independent Prompt-Driven Reprogramming

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

Efficient Solutions for Mitigating Initialization Bias in Unsupervised Self-Adaptive Auditory Attention Decoding

2026-04-29 · updated 2026-07-24 · 2 min · 261 words

EMG-to-Speech with Fewer Channels

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling

2026-04-29 · updated 2026-07-24 · 2 min · 391 words

Emo-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 3 min · 486 words

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

2026-04-29 · updated 2026-07-24 · 2 min · 274 words

EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 296 words

Emotion-Aligned Generation in Diffusion Text to Speech Models Via Preference-Guided Optimization

2026-04-29 · updated 2026-07-24 · 2 min · 402 words

Emotional Damage: Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

2026-04-29 · updated 2026-07-24 · 2 min · 230 words

Emotional Dimension Control in Language Model-Based Text-To-Speech: Spanning a Broad Spectrum of Human Emotions

2026-04-29 · updated 2026-07-24 · 1 min · 186 words

EmoTri-RL: Emotion- and Cause-Aware Reinforcement Learning for Multi-Modal Empathetic Dialogue

2026-04-29 · updated 2026-07-24 · 2 min · 332 words

Empowering Multimodal Respiratory Sound Classification with Counterfactual Adversarial Debiasing for Out-of-Distribution Robustness

2026-04-29 · updated 2026-07-24 · 2 min · 408 words

Enabling Multi-Species Bird Classification on Low-Power Bioacoustic Loggers

2026-04-29 · updated 2026-07-24 · 2 min · 294 words

Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

Enhanced Generative Machine Listener

2026-04-29 · updated 2026-07-24 · 2 min · 256 words

Enhancing Audio Question-Answering Performance Through Log-Likelihood Guided Reward Functions

2026-04-29 · updated 2026-07-24 · 2 min · 367 words

Enhancing Automatic Drum Transcription with Online Dynamic Few-Shot Learning

2026-04-29 · updated 2026-07-24 · 2 min · 245 words

Enhancing Dialogue-Related Speech Tasks with Generated Spoken Dialogues

2026-04-29 · updated 2026-07-24 · 2 min · 291 words

Enhancing Noise Robustness for Neural Speech Codecs Through Resource-Efficient Progressive Quantization Perturbation Simulation

2026-04-29 · updated 2026-07-24 · 1 min · 178 words

Enhancing Speaker Verification with w2v-BERT 2.0 and Knowledge Distillation Guided Structured Pruning

2026-04-29 · updated 2026-07-24 · 3 min · 443 words

Enhancing Speech Intelligibility Prediction for Hearing Aids with Complementary Speech Foundation Model Representations

2026-04-29 · updated 2026-07-24 · 2 min · 303 words

Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec

2026-04-29 · updated 2026-07-24 · 1 min · 179 words

Equipping Large Language Model with Directional Speech Understanding Capabilities

2026-04-29 · updated 2026-07-24 · 2 min · 249 words

Erasing Your Voice Before it’s Heard: Training-Free Speaker Unlearning for Zero-Shot Text-to-Speech

2026-04-29 · updated 2026-07-24 · 2 min · 384 words

Estimating Hand-Related Features from Speech Using Machine Learning

2026-04-29 · updated 2026-07-24 · 2 min · 226 words

Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening

2026-04-29 · updated 2026-07-24 · 2 min · 223 words

Etude: Piano Cover Generation with a Three-Stage Approach — Extract, Structuralize, and Decode

2026-04-29 · updated 2026-07-24 · 2 min · 421 words

EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

2026-04-29 · updated 2026-07-24 · 3 min · 437 words

Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

2026-04-29 · updated 2026-07-24 · 2 min · 313 words

Evaluating Compositional Structure in Audio Representations

2026-04-29 · updated 2026-07-24 · 2 min · 324 words

Evaluating Disentangled Representations for Controllable Music Generation

2026-04-29 · updated 2026-07-24 · 2 min · 289 words

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

2026-04-29 · updated 2026-07-24 · 2 min · 240 words

Evaluating High-Resolution Piano Sustain Pedal Depth Estimation with Musically Informed Metrics

2026-04-29 · updated 2026-07-24 · 2 min · 351 words

Evaluating Pretrained Speech Embedding Systems for Dysarthria Detection Across Heterogenous Datasets

2026-04-29 · updated 2026-07-24 · 2 min · 249 words

Event Classification by Physics-Informed Inpainting for Distributed Multichannel Acoustic Sensor with Partially Degraded Channels

2026-04-29 · updated 2026-07-24 · 2 min · 230 words

Exploring Fine-Tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data

2026-04-29 · updated 2026-07-24 · 2 min · 375 words

Exploring How Audio Effects Alter Emotion with Foundation Models

2026-04-29 · updated 2026-07-24 · 2 min · 220 words

Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement

2026-04-29 · updated 2026-07-24 · 3 min · 572 words

Exploring SSL Discrete Tokens for Multilingual Automatic Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 341 words

Expressive Voice Conversion with Controllable Emotional Intensity

2026-04-29 · updated 2026-07-24 · 2 min · 387 words

Exterior Sound Field Estimation Based on Physics-Constrained Kernel

2026-04-29 · updated 2026-07-24 · 1 min · 199 words

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Face-Voice Association with Inductive Bias for Maximum Class Separation

2026-04-29 · updated 2026-07-24 · 2 min · 382 words

Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

2026-04-29 · updated 2026-07-24 · 2 min · 418 words

Fast-ULCNet: A Fast and Ultra Low Complexity Network for Single-Channel Speech Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 265 words

FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

FastEnhancer: Speed-Optimized Streaming Neural Speech Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 421 words

FD-ARL: Feature Disentanglement with Adversarial-Reconstruction Learning for Cross-Subject Auditory Attention Decoding

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

FDCNet: Frequency Domain Channel Attention and Convolution for Lipreading

2026-04-29 · updated 2026-07-24 · 2 min · 265 words

FED-PISA: Federated Voice Cloning Via Personalized Identity-Style Adaptation

2026-04-29 · updated 2026-07-24 · 3 min · 442 words

Feedback-Driven Retrieval-Augmented Audio Generation with Large Audio Language Models

2026-04-29 · updated 2026-07-24 · 3 min · 431 words

Few-Shot Recognition of Audio Deepfake Generators using Graph-Based Prototype Adaptation

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

FIDIC: Fine-Grained Conversational Emotion Recognition via Individual Differences in Inertia and Contagion

2026-04-29 · updated 2026-07-24 · 2 min · 234 words

Fine-Grained Frame Modeling in Multi-Head Self-Attention for Speech Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 299 words

Fine-Tuning BigVGAN-v2 for Robust Musical Tuning Preservation

2026-04-29 · updated 2026-07-24 · 2 min · 252 words

Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

2026-04-29 · updated 2026-07-24 · 4 min · 698 words

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment

2026-04-29 · updated 2026-07-24 · 3 min · 568 words

FinHuBERT: Hierarchical Feature Imitating Networks for Low-Resource Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 322 words

FlashFoley: Fast Interactive Sketch2audio Generation

2026-04-29 · updated 2026-07-24 · 2 min · 329 words

Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks

2026-04-29 · updated 2026-07-24 · 2 min · 303 words

Flexio: Flexible Single- and Multi-Channel Speech Separation and Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 381 words

FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

FOCA: Multimodal Malware Classification via Hyperbolic Cross-Attention

2026-04-29 · updated 2026-07-24 · 2 min · 373 words

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

2026-04-29 · updated 2026-07-24 · 3 min · 626 words

FODGE: High-Fidelity Dance Generation via Full-Body Optimization

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

FoleyBench: A Benchmark for Video-to-Audio Models

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Forward Convolutive Prediction for Frame Online Monaural Speech Dereverberation based on Kronecker Product Decomposition

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

Frame-Stacked Local Transformers for Efficient Multi-Codebook Speech Generation

2026-04-29 · updated 2026-07-24 · 2 min · 421 words

Frequency-Independent Ambisonics Upscaling Using Deep Learning

2026-04-29 · updated 2026-07-24 · 2 min · 243 words

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-Modal Understanding in Multimodal LLMs

2026-04-29 · updated 2026-07-24 · 2 min · 370 words

From Diet to Free Lunch: Estimating Auxiliary Signal Properties Using Dynamic Pruning Masks in Speech Enhancement Networks

2026-04-29 · updated 2026-07-24 · 2 min · 403 words

From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

2026-04-29 · updated 2026-07-24 · 2 min · 285 words

From Human Speech to Ocean Signals: Transferring Speech Large Models for Underwater Acoustic Target Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 285 words

Frontend Token Enhancement for Token-Based Speech Recognition

2026-04-29 · updated 2026-07-24 · 3 min · 460 words

Full Band Denoising of Room Impulse Response in the Wavelet Domain with Dictionary Learning

2026-04-29 · updated 2026-07-24 · 2 min · 227 words

FUN-SSL: Full-Band Layer Followed by U-Net With Narrow-Band Layers for Multiple Moving Sound Source Localization

2026-04-29 · updated 2026-07-24 · 2 min · 271 words

FUSEMOS: Perceptual Evaluation of Text-to-Music Generation with Dual-Encoder Fusion and Ranking-Aware Composite Loss

2026-04-29 · updated 2026-07-24 · 3 min · 506 words

Fusion of Multimodal Estimations by Extended State Hidden Markov Model: Application to Fetal Heart Rate Monitoring

2026-04-29 · updated 2026-07-24 · 2 min · 286 words

FxSearcher: Gradient-Free Text-Driven Audio Transformation

2026-04-29 · updated 2026-07-24 · 2 min · 359 words

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 245 words

Gdiffuse: Diffusion-Based Speech Enhancement with Noise Model Guidance

2026-04-29 · updated 2026-07-24 · 3 min · 498 words

Gelina: Unified Speech and Gesture Synthesis Via Interleaved Token Prediction

2026-04-29 · updated 2026-07-24 · 3 min · 433 words

Gen-SER: When the Generative Model Meets Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 255 words

Generalizability of Predictive and Generative Speech Enhancement Models to Pathological Speakers

2026-04-29 · updated 2026-07-24 · 3 min · 434 words

Generating Localized Audible Zones Using a Single-Channel Parametric Loudspeaker

2026-04-29 · updated 2026-07-24 · 1 min · 202 words

Generating Moving 3D Soundscapes with Latent Diffusion Models

2026-04-29 · updated 2026-07-24 · 2 min · 257 words

Generative Audio Extension and Morphing

2026-04-29 · updated 2026-07-24 · 2 min · 318 words

Generative UI as an Accessibility Bridge: Lessons from C2C E-Commerce

2026-04-29 · updated 2026-07-24 · 2 min · 225 words

GLA-GRAD++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 333 words

GLAP: General Contrastive Audio-Text Pretraining Across Domains and Languages

2026-04-29 · updated 2026-07-24 · 3 min · 434 words

GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

2026-04-29 · updated 2026-07-24 · 3 min · 455 words

GLUE: Gradient-free Learning to Unify Experts

2026-04-29 · updated 2026-07-24 · 2 min · 315 words

GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

2026-04-29 · updated 2026-07-24 · 2 min · 354 words

Graph-Based Emotion Consensus Perception Learning for Multimodal Emotion Recognition in Conversation

2026-04-29 · updated 2026-07-24 · 2 min · 342 words

Graph-based Modality Alignment for Robustness in Conversational Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

Graph-Biased EEG Transformers for Silent Speech Decoding

2026-04-29 · updated 2026-07-24 · 2 min · 351 words

Grey-Box Prompt Tuning With Graph Alignment for Speech-Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 357 words

GRNet: Graph Reconstruction Network for Robust Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 323 words

Group Relative Policy Optimization for Text-to-Speech with Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 347 words

Group-Sparse Gaussian Process Regression for Inhomogeneous Sound Field Estimation

2026-04-29 · updated 2026-07-24 · 2 min · 241 words

H-nnPBFDAF: Hierarchical Neural Network Partitioned Block Frequency Domain Adaptive Filter with Novel Block Activation Probability

2026-04-29 · updated 2026-07-24 · 2 min · 405 words

Hair Noise Analysis and Mitigation for Smart Glasses Audio Captures

2026-04-29 · updated 2026-07-24 · 2 min · 288 words

Hanui: Harnessing Distributional Discrepancies for Singing Voice Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 264 words

HarmoNet: Music Grounding by Short Video via Harmonic Resample and Dynamic Sparse Alignment

2026-04-29 · updated 2026-07-24 · 2 min · 373 words

Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models

2026-04-29 · updated 2026-07-24 · 2 min · 268 words

HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection with Multichannel Audio and Multiscale Visual Cues

2026-04-29 · updated 2026-07-24 · 2 min · 415 words

HCGAN: Harmonic-Coupled Generative Adversarial Network for Speech Super-Resolution in Low-Bandwidth Scenarios

2026-04-29 · updated 2026-07-24 · 2 min · 301 words

HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-Based TTS

2026-04-29 · updated 2026-07-24 · 2 min · 312 words

HergNet: A Fast Neural Surrogate Model for Sound Field Predictions Via Superposition of Plane Waves

2026-04-29 · updated 2026-07-24 · 2 min · 259 words

HFSQVAE: Hierarchical Vector Quantization with Residuals for Frequency-Specific Embedding

2026-04-29 · updated 2026-07-24 · 2 min · 312 words

Hierarchical Activity Recognition and Captioning from Long-Form Audio

2026-04-29 · updated 2026-07-24 · 2 min · 410 words

Hierarchical Discrete Flow Matching For Multi-Codebook Codec-Based Text-To-Speech

2026-04-29 · updated 2026-07-24 · 2 min · 366 words

Hierarchical Tokenization of Multimodal Music Data for Generative Music Retrieval

2026-04-29 · updated 2026-07-24 · 2 min · 337 words

HiFi-HARP: A High-Fidelity 7th-Order Ambisonic Room Impulse Response Dataset

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

High-Fidelity Speech Enhancement Via Discrete Audio Tokens

2026-04-29 · updated 2026-07-24 · 2 min · 322 words

How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-Resource Transfer

2026-04-29 · updated 2026-07-24 · 1 min · 162 words

How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 243 words

Huí Sù: Co-constructing a Dual Feedback Apparatus

2026-04-29 · updated 2026-07-24 · 1 min · 149 words

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

2026-04-29 · updated 2026-07-24 · 2 min · 315 words

HVAC-EAR: Eavesdropping Human Speech Using HVAC Systems

2026-04-29 · updated 2026-07-24 · 2 min · 423 words

Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing

2026-04-29 · updated 2026-07-24 · 2 min · 395 words

HyFlowSE: Hybrid End-To-End Flow-Matching Speech Enhancement via Generative-Discriminative Learning

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-Based Single-Channel Speech Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 370 words

IBPCodec: A Low-Bitrate Lightweight Speech Codec With Inter-Band Prediction

2026-04-29 · updated 2026-07-24 · 2 min · 357 words

ICASSP 2026 - Active Noise Control Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 145 words

ICASSP 2026 - Active Noise Reduction Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 132 words

ICASSP 2026 - Topic Modeling Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 86 words

ICASSP 2026 - Signal Processing Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 162 words

ICASSP 2026 - Keyword Spotting Paper List

2026-04-29 · updated 2026-07-24 · 4 min · 682 words

ICASSP 2026 - Medical AI Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 96 words

ICASSP 2026 - Auditory Attention Decoding (听觉注意力解码) Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 208 words

ICASSP 2026 - Auditory Attention Decoding (听觉注意解码) Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 117 words

ICASSP 2026 - Noise Control Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 103 words

ICASSP 2026 - Echo Cancellation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 127 words

ICASSP 2026 - Benchmarking Paper List

2026-04-29 · updated 2026-07-24 · 4 min · 748 words

ICASSP 2026 - Fundamental Frequency Estimation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 106 words

ICASSP 2026 - Sound Field Estimation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 91 words

ICASSP 2026 - Acoustic Modeling Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 80 words

ICASSP 2026 - Sound Source Localization Paper List

2026-04-29 · updated 2026-07-24 · 7 min · 1446 words

ICASSP 2026 - Multimodal Learning Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 85 words

ICASSP 2026 - Multimodal Dialogue Intent Recognition Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 90 words

ICASSP 2026 - Multimodal Sentiment Analysis Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 151 words

ICASSP 2026 - Multimodal Emotion Recognition Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 231 words

ICASSP 2026 - Multimodal Models Paper List

2026-04-29 · updated 2026-07-24 · 4 min · 672 words

ICASSP 2026 - Multichannel Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 80 words

ICASSP 2026 - Multi-Pitch Estimation #Note Tracking Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 383 words

ICASSP 2026 - Entity Disambiguation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 91 words

ICASSP 2026 - Real-Time Processing Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 155 words

ICASSP 2026 - Adversarial Examples Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 128 words

ICASSP 2026 - Anomalous Sound Detection Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 154 words

ICASSP 2026 - Sentiment Analysis Paper List

2026-04-29 · updated 2026-07-24 · 4 min · 748 words

ICASSP 2026 - Emotion Recognition Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 154 words

ICASSP 2026 - Room Impulse Response Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 131 words

ICASSP 2026 - Room Impulse Response Denoising Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 102 words

ICASSP 2026 - Datasets Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

ICASSP 2026 - Dataset Alignment Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 122 words

ICASSP 2026 - Slot Filling Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 95 words

ICASSP 2026 - Model Evaluation Paper List

2026-04-29 · updated 2026-07-24 · 11 min · 2176 words

ICASSP 2026 - Singing Melody Extraction Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 102 words

ICASSP 2026 - Singing Voice Synthesis Paper List

2026-04-29 · updated 2026-07-24 · 3 min · 601 words

ICASSP 2026 - Singing Voice Transcription Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 134 words

ICASSP 2026 - Singing Voice Conversion Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 403 words

ICASSP 2026 - Underwater Acoustic Target Recognition Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 146 words

ICASSP 2026 - Bioacoustics Paper List

2026-04-29 · updated 2026-07-24 · 7 min · 1362 words

ICASSP 2026 - Target Speaker Extraction Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 81 words

ICASSP 2026 - 神经解码论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 130 words

ICASSP 2026 - 空间音频论文列表

2026-04-29 · 更新于 2026-07-24 · 18 min · 3752 words

ICASSP 2026 - 联邦学习论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 195 words

ICASSP 2026 - 脑信号编码论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 167 words

ICASSP 2026 - 脑机接口论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 155 words

ICASSP 2026 - 舞蹈生成论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 95 words

ICASSP 2026 - 视觉语音识别论文列表

2026-04-29 · 更新于 2026-07-24 · 2 min · 365 words

ICASSP 2026 - 视频到音频生成论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 135 words

ICASSP 2026 - 视频检索论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 96 words

ICASSP 2026 - 视频片段检索论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 110 words

ICASSP 2026 - 视频理解论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 97 words

ICASSP 2026 - 视频生成论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 165 words

ICASSP 2026 - 视频设备识别论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 87 words

ICASSP 2026 - 视频问答论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 183 words

ICASSP 2026 - 视频高光检测论文列表

2026-04-29 · 更新于 2026-07-24 · 1 min · 108 words

ICASSP 2026 - Speech Spoofing Detection Paper List

2026-04-29 · updated 2026-07-24 · 5 min · 938 words

ICASSP 2026 - Voice Cloning Paper List

2026-04-29 · updated 2026-07-24 · 3 min · 470 words

ICASSP 2026 - Speech Separation Paper List

2026-04-29 · updated 2026-07-24 · 13 min · 2634 words

ICASSP 2026 - Speech Anonymization Paper List

2026-04-29 · updated 2026-07-24 · 6 min · 1240 words

ICASSP 2026 - Speech Discovery Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 145 words

ICASSP 2026 - Speech Synthesis Paper List

2026-04-29 · updated 2026-07-24 · 37 min · 7808 words

ICASSP 2026 - Speech Enhancement #Adversarial Defense Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 80 words

ICASSP 2026 - Speech Enhancement Paper List

2026-04-29 · updated 2026-07-24 · 40 min · 8423 words

ICASSP 2026 - Large Speech Models Paper List

2026-04-29 · updated 2026-07-24 · 3 min · 457 words

ICASSP 2026 - Spoken Dialogue Systems Paper List

2026-04-29 · updated 2026-07-24 · 7 min · 1302 words

ICASSP 2026 - Speech Emotion Recognition Paper List

2026-04-29 · updated 2026-07-24 · 26 min · 5504 words

ICASSP 2026 - Speech Summarization Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 204 words

ICASSP 2026 - Voice Activity Detection Paper List

2026-04-29 · updated 2026-07-24 · 5 min · 863 words

ICASSP 2026 - Speech Understanding Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 362 words

ICASSP 2026 - Speech Generation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 128 words

ICASSP 2026 - Speech Biomarkers Paper List

2026-04-29 · updated 2026-07-24 · 13 min · 2674 words

ICASSP 2026 - Speech Coding Paper List

2026-04-29 · updated 2026-07-24 · 3 min · 515 words

ICASSP 2026 - Speech Encoders Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 130 words

ICASSP 2026 - Speech Translation Paper List

2026-04-29 · updated 2026-07-24 · 6 min · 1095 words

ICASSP 2026 - Speech Representation Learning Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 170 words

ICASSP 2026 - Speech Decoding Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 116 words

ICASSP 2026 - Speech Assessment Paper List

2026-04-29 · updated 2026-07-24 · 3 min · 531 words

ICASSP 2026 - Speech Recognition #Speech Synthesis Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 149 words

ICASSP 2026 - Speech Recognition #Speech Translation Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 389 words

ICASSP 2026 - Speech Recognition Paper List

2026-04-29 · updated 2026-07-24 · 55 min · 11705 words

ICASSP 2026 - Speech Quality Assessment Paper List

2026-04-29 · updated 2026-07-24 · 6 min · 1238 words

ICASSP 2026 - Voice Conversion #Speech Enhancement Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 144 words

ICASSP 2026 - Voice Conversion Paper List

2026-04-29 · updated 2026-07-24 · 5 min · 962 words

ICASSP 2026 - Spoken Question Answering Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 311 words

ICASSP 2026 - Speech-Driven Motion Generation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 130 words

ICASSP 2026 - Speaker Separation Paper List

2026-04-29 · updated 2026-07-24 · 6 min · 1217 words

ICASSP 2026 - Speaker Synthesis Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 96 words

ICASSP 2026 - Speaker Diarization #Speech Separation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 202 words

ICASSP 2026 - Speaker Diarization Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 278 words

ICASSP 2026 - Speaker Detection Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 86 words

ICASSP 2026 - Speaker Generation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 91 words

ICASSP 2026 - Talking Face Generation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 163 words

ICASSP 2026 - Speaker Recognition Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 103 words

ICASSP 2026 - Speaker Verification Paper List

2026-04-29 · updated 2026-07-24 · 6 min · 1183 words

ICASSP 2026 - Classroom Phase Segmentation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 83 words

ICASSP 2026 - Cross-Modal Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 213 words

ICASSP 2026 - Cross-Modal Retrieval Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 215 words

ICASSP 2026 - Mild Cognitive Impairment Detection Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 241 words

ICASSP 2026 - Transfer Learning Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 96 words

ICASSP 2026 - Zero-Shot Keyword Spotting Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 105 words

ICASSP 2026 - Music Information Retrieval Paper List

2026-04-29 · updated 2026-07-24 · 17 min · 3478 words

ICASSP 2026 - Music Separation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 99 words

ICASSP 2026 - Music Classification Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 181 words

ICASSP 2026 - Music Recommendation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 167 words

ICASSP 2026 - Music Retrieval Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

ICASSP 2026 - Music Mixing Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 100 words

ICASSP 2026 - Music Source Separation Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 242 words

ICASSP 2026 - Music Source Extraction Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 134 words

ICASSP 2026 - Music Understanding Paper List

2026-04-29 · updated 2026-07-24 · 7 min · 1392 words

ICASSP 2026 - Music Generation Paper List

2026-04-29 · updated 2026-07-24 · 18 min · 3742 words

ICASSP 2026 - Music Transcription Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 85 words

ICASSP 2026 - Audio-Visual Paper List

2026-04-29 · updated 2026-07-24 · 5 min · 1042 words

ICASSP 2026 - Audio-Visual Instance Segmentation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 130 words

ICASSP 2026 - Audio Event Detection Paper List

2026-04-29 · updated 2026-07-24 · 12 min · 2538 words

ICASSP 2026 - Audio Signal Processing Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 83 words

ICASSP 2026 - Audio Separation Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 236 words

ICASSP 2026 - Audio Classification #Zero-Shot Learning Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 81 words

ICASSP 2026 - Audio Classification Paper List

2026-04-29 · updated 2026-07-24 · 22 min · 4671 words

ICASSP 2026 - Audio Compression Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 319 words

ICASSP 2026 - Acoustic Scene Classification Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 178 words

ICASSP 2026 - Audio Scene Understanding Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

ICASSP 2026 - Audio Enhancement Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 331 words

ICASSP 2026 - Large Audio Models Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 101 words

ICASSP 2026 - Audio Captioning Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 102 words

ICASSP 2026 - Audio Security Paper List

2026-04-29 · updated 2026-07-24 · 8 min · 1559 words

ICASSP 2026 - Audio Description Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 105 words

ICASSP 2026 - Audio Effect Estimation Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 140 words

ICASSP 2026 - Lossless Audio Coding Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 108 words

ICASSP 2026 - Audio Retrieval #Audio Classification Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 174 words

ICASSP 2026 - Audio Retrieval Paper List

2026-04-29 · updated 2026-07-24 · 8 min · 1662 words

ICASSP 2026 - Audio Watermarking Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 148 words

ICASSP 2026 - Audio Deepfake Detection Paper List

2026-04-29 · updated 2026-07-24 · 17 min · 3544 words

ICASSP 2026 - Audio Generation Paper List

2026-04-29 · updated 2026-07-24 · 22 min · 4597 words

ICASSP 2026 - Audio Editing Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 96 words

ICASSP 2026 - Audio Quality Assessment Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 209 words

ICASSP 2026 - Audio Super-Resolution Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 111 words

ICASSP 2026 - Audio Question Answering Paper List

2026-04-29 · updated 2026-07-24 · 9 min · 1795 words

ICASSP 2026 - Pre-Training Paper List

2026-04-29 · updated 2026-07-24 · 1 min · 159 words

ICASSP 2026 - Domain Adaptation Paper List

2026-04-29 · updated 2026-07-24 · 2 min · 298 words

Identifying Birdsong Syllables without Labelled Data

2026-04-29 · updated 2026-07-24 · 2 min · 292 words

Identifying the Minimal and Maximal Phonetic Subspace of Speech Representations

2026-04-29 · updated 2026-07-24 · 2 min · 221 words

Identity Leakage Through Accent Cues in Voice Anonymisation

2026-04-29 · updated 2026-07-24 · 2 min · 382 words

Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

2026-04-29 · updated 2026-07-24 · 2 min · 252 words

Improving Active Learning for Melody Estimation by Disentangling Uncertainties

2026-04-29 · updated 2026-07-24 · 3 min · 462 words

Improving Anomalous Sound Detection with Attribute-Aware Representation from Domain-Adaptive Pre-Training

2026-04-29 · updated 2026-07-24 · 2 min · 288 words

Improving Audio Event Recognition with Consistency Regularization

2026-04-29 · updated 2026-07-24 · 2 min · 289 words

Improving Audio Question Answering with Variational Inference

2026-04-29 · updated 2026-07-24 · 2 min · 377 words

Improving Automatic Speech Recognition by Mitigating Distortions Introduced by Speech Enhancement Under Drone Noise

2026-04-29 · updated 2026-07-24 · 3 min · 630 words

Improving Binaural Distance Estimation in Reverberant Rooms Through Contrastive And Multi-Task Learning

2026-04-29 · updated 2026-07-24 · 2 min · 267 words

Improving Contextual ASR Via Multi-Grained Fusion With Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 317 words

Improving Interpretability in Generative Multitimbral DDSP Frameworks via Semantically-Disentangled Musical Attributes

2026-04-29 · updated 2026-07-24 · 2 min · 404 words

Improving Multimodal Brain Encoding Model with Dynamic Subject-Awareness Routing

2026-04-29 · updated 2026-07-24 · 3 min · 476 words

Improving the Speaker Anonymization Evaluation’s Robustness to Target Speakers with Adversarial Learning

2026-04-29 · updated 2026-07-24 · 2 min · 304 words

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word level timestamp predictions

2026-04-29 · updated 2026-07-24 · 2 min · 361 words

InconVAD: A Two-Stage Dual-Tower Framework for Multimodal Emotion Inconsistency Detection

2026-04-29 · updated 2026-07-24 · 2 min · 352 words

Incremental Learning for Audio Classification with Hebbian Deep Neural Networks

2026-04-29 · updated 2026-07-24 · 2 min · 342 words

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

2026-04-29 · updated 2026-07-24 · 2 min · 264 words

Individualize the HRTF Neural Field Using Anthropometric Parameters Weighted by Direction-Attention

2026-04-29 · updated 2026-07-24 · 2 min · 312 words

Influence of Clean Speech Characteristics on Speech Enhancement Performance

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Influence-Aware Curation and Active Selection for Industrial and Surveillance Sound Events

2026-04-29 · updated 2026-07-24 · 3 min · 547 words

Input-Adaptive Differentiable Filterbanks via Hypernetworks for Robust Speech Processing

2026-04-29 · updated 2026-07-24 · 2 min · 418 words

InstructAudio: Unified Speech and Music Generation with Natural Language Instruction

2026-04-29 · updated 2026-07-24 · 4 min · 791 words

Instrument Generation Through Distributional Flow Matching and Test-Time Search

2026-04-29 · updated 2026-07-24 · 2 min · 270 words

Int-MeanFlow: Few-Step Speech Generation with Integral Velocity Distillation

2026-04-29 · updated 2026-07-24 · 3 min · 487 words

Integrating Speaker Embeddings and LLM-Derived Semantic Representations for Streaming Speaker Diarization

2026-04-29 · updated 2026-07-24 · 2 min · 408 words

Inter-Dialog Contrastive Learning for Multimodal Emotion Recognition in Conversations

2026-04-29 · updated 2026-07-24 · 3 min · 436 words

Interpretable Music Harmonic Analysis Through Multilinear Mixture of Experts

2026-04-29 · updated 2026-07-24 · 2 min · 225 words

Interval-Aware Retrieval Framework For Speech-Based Automatic Alzheimer’s Detection

2026-04-29 · updated 2026-07-24 · 2 min · 317 words

Inverse-Hessian Regularization for Continual Learning in ASR

2026-04-29 · updated 2026-07-24 · 2 min · 219 words

Investigating Modality Contribution in Audio LLMs for Music

2026-04-29 · updated 2026-07-24 · 1 min · 151 words

Investigating The Effect Of Sentence-Level Syntactic Structure On Information Loss In The Human Auditory System

2026-04-29 · updated 2026-07-24 · 2 min · 309 words

Is Phase Really Needed for Weakly-Supervised Dereverberation?

2026-04-29 · updated 2026-07-24 · 2 min · 224 words

It Is Personal: The Importance of Personalization for Recognizing Self-Reported Emotion

2026-04-29 · updated 2026-07-24 · 2 min · 368 words

Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation

2026-04-29 · updated 2026-07-24 · 2 min · 394 words

Joint Deep Secondary Path Estimation and Adaptive Control for Active Noise Cancellation

2026-04-29 · updated 2026-07-24 · 2 min · 368 words

Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-Task Multi-Scale Network

2026-04-29 · updated 2026-07-24 · 3 min · 531 words

Joint Estimation of Primary and Secondary Paths for Personalized Hearable Applications

2026-04-29 · updated 2026-07-24 · 2 min · 275 words

Joint Multichannel Acoustic Feedback Cancellation and Speaker Extraction via Kalman Filter and Deep Non-Linear Spatial Filter

2026-04-29 · updated 2026-07-24 · 2 min · 247 words

K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

2026-04-29 · updated 2026-07-24 · 2 min · 247 words

KAN We Make Models Simpler for Audio Deepfake Detection with Kolmogorov–Arnold Networks?

2026-04-29 · updated 2026-07-24 · 2 min · 309 words

Keeping Models Listening: Segment- and time-aware attention rescaling at decoding time

2026-04-29 · updated 2026-07-24 · 2 min · 319 words

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

2026-04-29 · updated 2026-07-24 · 1 min · 135 words

KSDIFF: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

2026-04-29 · updated 2026-07-24 · 3 min · 457 words

LAFUFU: Latent Acoustic Features For Ultra-Fast Utterance Restoration

2026-04-29 · updated 2026-07-24 · 3 min · 480 words

LAMB: LLM-Based Audio Captioning with Modality Gap Bridging Via Cauchy-Schwarz Divergence

2026-04-29 · updated 2026-07-24 · 2 min · 243 words

Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating for Robust Code-Switching ASR

2026-04-29 · updated 2026-07-24 · 1 min · 209 words

Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 396 words

Learnable Mel-Frontend for Robust Underwater Acoustic Target Detection under Non-Target Interference

2026-04-29 · updated 2026-07-24 · 2 min · 397 words

Learning Domain-Robust Bioacoustic Representations for Mosquito Species Classification with Contrastive Learning and Distribution Alignment

2026-04-29 · updated 2026-07-24 · 3 min · 462 words

Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

2026-04-29 · updated 2026-07-24 · 2 min · 295 words

Learning Piezoelectric Hysteresis in In-Ear MEMS Loudspeakers from Acoustic Measurements

2026-04-29 · updated 2026-07-24 · 2 min · 325 words

Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR

2026-04-29 · updated 2026-07-24 · 2 min · 277 words

Learning Vocal-Tract Area And Radiation With A Physics-Informed Webster Model

2026-04-29 · updated 2026-07-24 · 2 min · 415 words

Learning What to Hear: Boosting Sound-Source Association for Robust Audiovisual Instance Segmentation

2026-04-29 · updated 2026-07-24 · 2 min · 377 words

LenslessMic: Audio Encryption and Authentication via Lensless Computational Imaging

2026-04-29 · updated 2026-07-24 · 3 min · 574 words

LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data

2026-04-29 · updated 2026-07-24 · 2 min · 364 words

LETPAV: Lexicon-Enhanced Text with Progressive Audio-Visual Fusion for Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 3 min · 480 words

Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models

2026-04-29 · updated 2026-07-24 · 2 min · 342 words

Leveraging Diffusion U-Net Features for Predominant Instrument Recognition

2026-04-29 · updated 2026-07-24 · 1 min · 175 words

Leveraging Large Multimodal Models for Audio-Video Deepfake Detection: A Pilot Study

2026-04-29 · updated 2026-07-24 · 2 min · 385 words

Leveraging Large Speech Language Models as Evaluators for Expressive Speech

2026-04-29 · updated 2026-07-24 · 2 min · 225 words

Leveraging Multiple Speech Enhancers for Non-Intrusive Intelligibility Prediction for Hearing-Impaired Listeners

2026-04-29 · updated 2026-07-24 · 2 min · 340 words

Leveraging prediction entropy for Automatic prompt weighting in Zero-Shot Audio-Language Classification

2026-04-29 · updated 2026-07-24 · 2 min · 290 words

Leveraging Segment-Level Speech Representations for LLM-Based Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

Leveraging Text-to-Speech and Voice Conversion as Data Augmentation for Alzheimer’s Disease Detection from Spontaneous Speech

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

Leveraging Whisper Embeddings For Audio-Based Lyrics Matching

2026-04-29 · updated 2026-07-24 · 3 min · 442 words

Lightweight and Generalizable Acoustic Scene Representations Via Contrastive Fine-Tuning and Distillation

2026-04-29 · updated 2026-07-24 · 2 min · 350 words

Lightweight and Perceptually-Guided Voice Conversion for Electro-Laryngeal Speech

2026-04-29 · updated 2026-07-24 · 2 min · 388 words

Lightweight Implicit Neural Network for Binaural Audio Synthesis

2026-04-29 · updated 2026-07-24 · 3 min · 443 words

Lightweight Phoneme-Conditioned Bandwidth Extension for Body-Conducted Speech

2026-04-29 · updated 2026-07-24 · 2 min · 279 words

Lingometer: On-Device Personal Speech Word Counting System

2026-04-29 · updated 2026-07-24 · 2 min · 348 words

Linguard: Authenticating Speech Recordings Using Speech Recognition and Watermark

2026-04-29 · updated 2026-07-24 · 2 min · 335 words

LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-And-Play Dereverberation

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Lisa: Lightweight Yet Superb Neural Speech Coding

2026-04-29 · updated 2026-07-24 · 2 min · 371 words

Listen, But Don’t Leak: Sensitive Data Protection for Privacy Aware Automatic Speech Recognition with Acoustic Triggers

2026-04-29 · updated 2026-07-24 · 1 min · 190 words

LLAC: Learned Lossless Audio Codec

2026-04-29 · updated 2026-07-24 · 2 min · 333 words

LLM-Based Post-ASR Error Correction for Disordered Speech

2026-04-29 · updated 2026-07-24 · 2 min · 219 words

Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning

2026-04-29 · updated 2026-07-24 · 2 min · 361 words

LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech

2026-04-29 · updated 2026-07-24 · 2 min · 250 words

Look, Listen and Segment: Towards Weakly Supervised Audio-Visual Semantic Segmentation

2026-04-29 · updated 2026-07-24 · 2 min · 348 words

Loose Coupling of Spectral and Spatial Models for Multi-Channel Diarization and Enhancement of Meetings in Dynamic Environments

2026-04-29 · updated 2026-07-24 · 2 min · 383 words

LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Conversational ASR

2026-04-29 · updated 2026-07-24 · 2 min · 220 words

Low-Bandwidth High-Fidelity Speech Transmission with Generative Latent Joint Source-Channel Coding

2026-04-29 · updated 2026-07-24 · 2 min · 262 words

Low-Frequency Harmonic Control for Speech Intelligibility in Open-Ear Headphones

2026-04-29 · updated 2026-07-24 · 2 min · 234 words

Low-Latency Audio Front-End Region-of-Interest Beamforming for Smart Glasses

2026-04-29 · updated 2026-07-24 · 2 min · 236 words

Low-Resource Guidance for Controllable Latent Audio Diffusion

2026-04-29 · updated 2026-07-24 · 3 min · 563 words

Low-Resource Speech-Based Early Alzheimer’s Detection via Cross-Lingual and Few-Shot Transfer Learning

2026-04-29 · updated 2026-07-24 · 2 min · 254 words

LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling

2026-04-29 · updated 2026-07-24 · 2 min · 313 words

MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation Without Vector Quantization

2026-04-29 · updated 2026-07-24 · 2 min · 225 words

MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

2026-04-29 · updated 2026-07-24 · 3 min · 542 words

Malefa: Multi-Granularity Learning and Effective False Alarm Suppression for Zero-Shot Keyword Spotting

2026-04-29 · updated 2026-07-24 · 2 min · 332 words

Mambaformer: State-Space Augmented Self-Attention with Downup Sampling for Monaural Speech Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 382 words

Marco-Voice: A Unified Framework for Expressive Speech Synthesis with Voice Cloning

2026-04-29 · updated 2026-07-24 · 2 min · 348 words

MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion with Increased Controllability via Multiple Guidances

2026-04-29 · updated 2026-07-24 · 3 min · 477 words

Matching Reverberant Speech Through Learned Acoustic Embeddings

2026-04-29 · updated 2026-07-24 · 2 min · 227 words

Matrix-Structured Hierarchical Convolutional Modeling for Pronunciation Assessment and Mispronunciation Detection

2026-04-29 · updated 2026-07-24 · 3 min · 429 words

Maximum Likelihood Measurement Noise Estimation for Block-Time Domain Kalman Filters

2026-04-29 · updated 2026-07-24 · 2 min · 233 words

MC-MRX: Reference- and MIDI-Guided Music Source Extraction with Contrastive Learning

2026-04-29 · updated 2026-07-24 · 2 min · 388 words

MCF: Text LLMs for Multimodal Emotional Causality

2026-04-29 · updated 2026-07-24 · 2 min · 334 words

MCI-OTFusion: A Multimodal Model for MCI Detection and Cognitive Score Prediction

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

Meanflow-Accelerated Multimodal Video-to-Audio Synthesis Via One-Step Generation

2026-04-29 · updated 2026-07-24 · 2 min · 357 words

MeanFlowSE: One-Step Generative Speech Enhancement via Conditional Mean Flow

2026-04-29 · updated 2026-07-24 · 2 min · 393 words

MeanSE: Efficient Generative Speech Enhancement with Mean Flows

2026-04-29 · updated 2026-07-24 · 2 min · 350 words

MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

2026-04-29 · updated 2026-07-24 · 3 min · 451 words

MeanVoiceFlow: One-Step Nonparallel Voice Conversion with Mean Flows

2026-04-29 · updated 2026-07-24 · 2 min · 389 words

Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

2026-04-29 · updated 2026-07-24 · 2 min · 293 words

MECap-R1: Emotion-Aware Policy with Reinforcement Learning for Multimodal Emotion Captioning

2026-04-29 · updated 2026-07-24 · 2 min · 375 words

Medical ASR Enhancement by Domain-Specific Reinforcement Fine-Tuning

2026-04-29 · updated 2026-07-24 · 2 min · 265 words

MELA-TTS: Joint Transformer-Diffusion Model with Representation Alignment for Speech Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 426 words

Melos: Sentence-To-Section Training with Multi-Task Learning for LLM-Driven Song Generation

2026-04-29 · updated 2026-07-24 · 2 min · 417 words

Membership Inference Attack against Music Diffusion Models via Generative Manifold Perturbation

2026-04-29 · updated 2026-07-24 · 2 min · 235 words

MFF-RVRDI: Multimodal Fusion Framework for Robust Video Recording Device Identification

2026-04-29 · updated 2026-07-24 · 2 min · 251 words

MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large Audio-Language Model

2026-04-29 · updated 2026-07-24 · 2 min · 353 words

Microphone-Less Measurement of Three-Dimensional Radiating Impulse Response of Sound Source using Spherical Harmonic-Domain Acousto-Optic Tomography

2026-04-29 · updated 2026-07-24 · 1 min · 161 words

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

2026-04-29 · updated 2026-07-24 · 2 min · 245 words

Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

2026-04-29 · updated 2026-07-24 · 2 min · 270 words

Mind Your [m]S, Cross Your [t]S: a Large-Scale Phonetic Analysis of Speech Reproduction in Modern Speech Generators

2026-04-29 · updated 2026-07-24 · 1 min · 196 words

MirrorTalk: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

2026-04-29 · updated 2026-07-24 · 2 min · 273 words

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

2026-04-29 · updated 2026-07-24 · 2 min · 229 words

Mitigating Data Replication in Text-to-Audio Generative Diffusion Models Through Anti-Memorization Guidance

2026-04-29 · updated 2026-07-24 · 2 min · 405 words

Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation

2026-04-29 · updated 2026-07-24 · 1 min · 195 words

Mitigating Language Prior-Induced Hallucinations via Bi-Level Contrastive Decoding

2026-04-29 · updated 2026-07-24 · 2 min · 388 words

Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 335 words

Mix2Morph: Learning Sound Morphing from Noisy Mixes

2026-04-29 · updated 2026-07-24 · 2 min · 322 words

MixGAN-based Non-blind Bandwidth Extension for Audio Codec

2026-04-29 · updated 2026-07-24 · 2 min · 311 words

Mixture of Experts for Recognizing Depression from Interview and Reading Tasks

2026-04-29 · updated 2026-07-24 · 2 min · 342 words

Mixture To Beamformed Mixture: Leveraging Beamformed Mixture As Weak-Supervision for Speech Enhancement and Noise-Robust ASR

2026-04-29 · updated 2026-07-24 · 2 min · 310 words

Mixture-of-Experts Based Soft-Label Learning for Multi-Label Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

2026-04-29 · updated 2026-07-24 · 2 min · 244 words

Mixtures of Lightweight Articulatory Experts for Multilingual ASR

2026-04-29 · updated 2026-07-24 · 2 min · 378 words

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

2026-04-29 · updated 2026-07-24 · 2 min · 283 words

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

2026-04-29 · updated 2026-07-24 · 2 min · 222 words

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

2026-04-29 · updated 2026-07-24 · 2 min · 385 words

MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech

2026-04-29 · updated 2026-07-24 · 1 min · 176 words

Modeling Both Intra- And Inter-Utterance Variability for Conversational Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

Modeling Inter-Segment Relationships in Speech for Dementia Detection with Audio Spectrogram Transformers and Graph Attention Networks

2026-04-29 · updated 2026-07-24 · 2 min · 346 words

Modeling Strategies For Speech Enhancement in The Latent Space of a Neural Audio Codec

2026-04-29 · updated 2026-07-24 · 3 min · 460 words

Monitoring exposure-length variations in submarine power cables using distributed fiber-optic sensing

2026-04-29 · updated 2026-07-24 · 1 min · 146 words

More Than a Shortcut: A Hyperbolic Approach to Early-Exit Networks

2026-04-29 · updated 2026-07-24 · 2 min · 368 words

Motionbeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding

2026-04-29 · updated 2026-07-24 · 2 min · 263 words

MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

2026-04-29 · updated 2026-07-24 · 2 min · 425 words

MSANET: Multi-Scale Semantic Aggregation Network for Brain-Assisted Speech Enhancement in Multi-Speaker Conditions

2026-04-29 · updated 2026-07-24 · 2 min · 420 words

MSCT: Differential Cross-Modal Attention for Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 220 words

MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 405 words

MT-HuBERT: Self-Supervised Mix-Training for Few-Shot Keyword Spotting in Mixed Speech

2026-04-29 · updated 2026-07-24 · 6 min · 1085 words

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-Token Prediction

2026-04-29 · updated 2026-07-24 · 2 min · 332 words

Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 377 words

Multi-Layer Attentive Probing Improves Transfer of Audio Representations for Bioacoustics

2026-04-29 · updated 2026-07-24 · 2 min · 254 words

Multi-Scale Physiologically-Motivated Alignment for Auditory Attention Decoding

2026-04-29 · updated 2026-07-24 · 2 min · 253 words

Multi-Task Learning For Speech Quality Assessment Using ASR-Derived Entropy Features

2026-04-29 · updated 2026-07-24 · 3 min · 488 words

Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

2026-04-29 · updated 2026-07-24 · 2 min · 316 words

Multi-View Hierarchical Hypergraph Neural Network for Automatic Stuttering Detection

2026-04-29 · updated 2026-07-24 · 2 min · 392 words

Multilingual Supervised Pretraining with LM-Assisted Decoding for Visual Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 290 words

Multimodal Co-Training with Subtractive Unlabeled-Benefit Bounds

2026-04-29 · updated 2026-07-24 · 1 min · 159 words

Multimodal Fusion-Based IPCLIP Network for Mixed Reality Surgical Assistance

2026-04-29 · updated 2026-07-24 · 2 min · 250 words

Multimodal LLMs as Expert Speech Annotators: Acoustic Macro-Descriptors for Parkinson’s Detection

2026-04-29 · updated 2026-07-24 · 1 min · 208 words

Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 295 words

Multimodal Transformer with Multiperspective Training for Predicting Self-Expression Skills from Video Interview

2026-04-29 · updated 2026-07-24 · 2 min · 312 words

Multimodal Variational Graph Network for Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 410 words

MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding

2026-04-29 · updated 2026-07-24 · 2 min · 319 words

Musicdetr: A Position-Aware Spectral Note Detection Model for Singing Transcription

2026-04-29 · updated 2026-07-24 · 2 min · 315 words

MusiCRS: Benchmarking Audio-Centric Conversational Recommendation

2026-04-29 · updated 2026-07-24 · 2 min · 253 words

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

2026-04-29 · updated 2026-07-24 · 2 min · 403 words

Natural Language to Spatial Audio Parameters: Lightweight Deterministic Rendering for Creative Authoring

2026-04-29 · updated 2026-07-24 · 2 min · 422 words

NCF-TTS: Enhancing Flow Matching Based Text-To-Speech with Neighborhood Consistency Flow

2026-04-29 · updated 2026-07-24 · 2 min · 333 words

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

2026-04-29 · 更新于 2026-07-24 · 4 min · 852 words

Neural Network-Based Time-Frequency-Bin-Wise Linear Combination of Beamformers for Underdetermined Target Source Extraction

2026-04-29 · 更新于 2026-07-24 · 2 min · 312 words

Neuromamba: Adaptive Frequency Filtering with a Pyramid Mamba for sEEG-driven Speech Synthesis

2026-04-29 · 更新于 2026-07-24 · 2 min · 327 words

NeuroSIFT: A Biologically-Inspired Framework with Explicit Signal-Noise Separation for Robust Multimodal Emotion Recognition

2026-04-29 · 更新于 2026-07-24 · 2 min · 277 words

nGPT as a Scalable Architecture for Speech Recognition and Translation

2026-04-29 · 更新于 2026-07-24 · 2 min · 328 words

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

2026-04-29 · 更新于 2026-07-24 · 2 min · 348 words

Noise-Robust AV-ASR Using Visual Features both in the Whisper Encoder and Decoder

2026-04-29 · 更新于 2026-07-24 · 3 min · 435 words

Noise-Robust Contrastive Learning with an MFCC-Conformer for Coronary Artery Disease Detection

2026-04-29 · 更新于 2026-07-24 · 2 min · 290 words

Noise-to-Notes: Diffusion-Based Generation and Refinement for Automatic Drum Transcription

2026-04-29 · 更新于 2026-07-24 · 2 min · 366 words

Non-Line-of-Sight Vehicle Detection via Audio-Visual Fusion

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

Obstructive Sleep Apnea Endotype Prediction During Wakefulness Using Voice Biomarkers

2026-04-29 · updated 2026-07-24 · 1 min · 171 words

Off-The-Grid Multi-Pitch Estimation Using Optimal Transport

2026-04-29 · updated 2026-07-24 · 2 min · 224 words

OMNI-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 395 words

On deepfake voice detection - It’s all in the presentation

2026-04-29 · updated 2026-07-24 · 2 min · 251 words

On The Design of Efficient Neural Methods for Geometry-Agnostic Multichannel Speech Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 344 words

On the Design of Higher-Order Time-Intensity Microphone Arrays for Panoramic Audio Recording and Reproduction

2026-04-29 · updated 2026-07-24 · 2 min · 369 words

One Model–Three Tasks: Discovering a Shared Winning Ticket for Low-Complexity Audio Intelligence

2026-04-29 · updated 2026-07-24 · 2 min · 258 words

Online Register For Dual-Mode Self-Supervised Speech Models: Mitigating the Lack of Future Context

2026-04-29 · updated 2026-07-24 · 2 min · 369 words

Optimizing Domain-Adaptive Self-Supervised Learning for Clinical Voice-Based Disease Classification

2026-04-29 · updated 2026-07-24 · 3 min · 470 words

Optimizing Speech Language Models for Acoustic Consistency

2026-04-29 · updated 2026-07-24 · 2 min · 335 words

OV-INSTRUCTTTS: Towards Open-Vocabulary Instruct Text-to-Speech

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

PAC: Pronunciation-Aware Contextualized Large Language Model-Based Automatic Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 384 words

PADAM: Perceptual Audio Defect Assessment Model

2026-04-29 · updated 2026-07-24 · 2 min · 369 words

ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-Based Neural Speech Codec

2026-04-29 · updated 2026-07-24 · 2 min · 415 words

Parametric Neural Amp Modeling with Active Learning

2026-04-29 · updated 2026-07-24 · 2 min · 214 words

PC-MCL: Patient-Consistent Multi-Cycle Learning with Multi-Label Bias Correction for Respiratory Sound Classification

2026-04-29 · updated 2026-07-24 · 2 min · 381 words

Peeking Into the Future for Contextual Biasing

2026-04-29 · updated 2026-07-24 · 2 min · 327 words

Perceptual Loss Optimized HRTF Personalization in Spherical Harmonic Domain

2026-04-29 · updated 2026-07-24 · 2 min · 330 words

Perceptual Quality Assessment for Stylized Talking Heads

2026-04-29 · updated 2026-07-24 · 2 min · 303 words

PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

2026-04-29 · updated 2026-07-24 · 1 min · 104 words

Personal Sound Zones with Flexible Bright Zone Control

2026-04-29 · updated 2026-07-24 · 2 min · 295 words

PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

2026-04-29 · updated 2026-07-24 · 2 min · 401 words

PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion

2026-04-29 · updated 2026-07-24 · 2 min · 411 words

PG-SE: Predictive Acceleration and Correction for Generative Speech Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 407 words

Phase-Retrieval-Based Physics-Informed Neural Networks For Acoustic Magnitude Field Reconstruction

2026-04-29 · updated 2026-07-24 · 2 min · 251 words

Phase-Space Signal Processing of Acoustic Data for Advanced Manufacturing In-Situ Monitoring

2026-04-29 · updated 2026-07-24 · 1 min · 157 words

PhoenixDSR: Phoneme-Guided and LLM-Enhanced Dysarthric Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

2026-04-29 · updated 2026-07-24 · 2 min · 343 words

Phonological Tokenizer: Prosody-Aware Phonetic Token Via Multi-Objective Fine-Tuning with Differentiable K-Means

2026-04-29 · updated 2026-07-24 · 3 min · 510 words

Phrased: Phrase Dictionary Biasing for Speech Translation

2026-04-29 · updated 2026-07-24 · 2 min · 266 words

Physics-Informed Neural Networks for Ocean Acoustic Field Reconstruction and Source Localization

2026-04-29 · updated 2026-07-24 · 2 min · 235 words

Pianoroll-Event: A Novel Score Representation for Symbolic Music

2026-04-29 · updated 2026-07-24 · 2 min · 340 words

PICOAUDIO2: Temporal Controllable Text-to-Audio Generation with Natural Language Description

2026-04-29 · updated 2026-07-24 · 2 min · 238 words

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 360 words

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

2026-04-29 · updated 2026-07-24 · 2 min · 316 words

Polynomial Mixing for Efficient Self-Supervised Speech Encoders

2026-04-29 · updated 2026-07-24 · 2 min · 379 words

Position-Invariant Fine-Tuning Of Speech Enhancement Models With Self-Supervised Speech Representations

2026-04-29 · updated 2026-07-24 · 2 min · 318 words

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

2026-04-29 · updated 2026-07-24 · 2 min · 400 words

Principled Coarse-Grained Acceptance For Speculative Decoding In Speech

2026-04-29 · updated 2026-07-24 · 2 min · 279 words

PRoADS: Provably Secure And Robust Audio Diffusion Steganography With Latent Optimization And Backward Euler Inversion

2026-04-29 · updated 2026-07-24 · 2 min · 239 words

Probing the Hidden Talent of ASR foundation models for L2 English Oral Assessment

2026-04-29 · updated 2026-07-24 · 2 min · 304 words

Probing Whisper for Dysarthric Speech in Detection and Assessment

2026-04-29 · updated 2026-07-24 · 1 min · 174 words

Production-Scale Dynamic Vocabulary ASR Biasing with Word-Level FST and Robust Training

2026-04-29 · updated 2026-07-24 · 2 min · 248 words

Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR

2026-04-29 · updated 2026-07-24 · 1 min · 186 words

Prompt-Guided Mixture-of-Experts for Robust Multimodal Sentiment Analysis with Missing Modalities

2026-04-29 · updated 2026-07-24 · 3 min · 597 words

PromptSep: Generative Audio Separation Via Multimodal Prompting

2026-04-29 · updated 2026-07-24 · 2 min · 381 words

Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum

2026-04-29 · updated 2026-07-24 · 2 min · 247 words

PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

2026-04-29 · updated 2026-07-24 · 2 min · 305 words

Prototype-Guided Cross-Modal Contrastive Learning for Continual Audio-Visual Sound Separation

2026-04-29 · updated 2026-07-24 · 2 min · 292 words

PRSA: Preventing Malicious Speaker Recognition and Speech Synthesis Simultaneously with Adversarial Examples

2026-04-29 · updated 2026-07-24 · 2 min · 312 words

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

2026-04-29 · updated 2026-07-24 · 2 min · 342 words

PSTalker: Realistic 3D Talking Head Synthesis via a Semantic-Aware Audio-Driven Point-Based Shape

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 362 words

Qastanet: A DNN-Based Quality Metric for Spatial Audio

2026-04-29 · updated 2026-07-24 · 2 min · 282 words

QE-XVC: Zero-Shot Cross-Lingual Voice Conversion via Query-Enhancement and Conditional Flow Matching

2026-04-29 · updated 2026-07-24 · 2 min · 320 words

QFOCUS: Controllable Synthesis for Automated Speech Stress Editing to Deliver Human-Like Emphatic Intent

2026-04-29 · updated 2026-07-24 · 1 min · 160 words

Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

2026-04-29 · updated 2026-07-24 · 2 min · 386 words

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 281 words

Random Matrix-Driven Graph Representation Learning For Bioacoustic Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 272 words

Ranking The Impact of Contextual Specialization in Neural Speech Enhancement

2026-04-29 · updated 2026-07-24 · 3 min · 489 words

RAP: Real-Time Audio-Driven Portrait Animation with Video Diffusion Transformer

2026-04-29 · updated 2026-07-24 · 3 min · 454 words

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 226 words

RASD-SR: A Robust Anomalous Sound Detection Framework with Score Recalibration

2026-04-29 · updated 2026-07-24 · 2 min · 293 words

Rationale-Guided Learning for Multimodal Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 402 words

RCAL: Reinforced Cross-Modal Alignment for Multimodal Sentiment Analysis with Sparse Visual Frames

2026-04-29 · updated 2026-07-24 · 2 min · 409 words

Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features

2026-04-29 · updated 2026-07-24 · 3 min · 431 words

Real-Time Streaming Mel Vocoding with Generative Flow Matching

2026-04-29 · updated 2026-07-24 · 2 min · 366 words

Reasoning Driven Captions to Assist Noise Robust Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 306 words

ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

2026-04-29 · updated 2026-07-24 · 2 min · 362 words

Reconstruction of Spherical Sound Source Radiation Characteristics with Graph Signal Processing

2026-04-29 · updated 2026-07-24 · 2 min · 244 words

Recovering Performance in Speech Emotion Recognition from Discrete Tokens Via Multi-Layer Fusion and Paralinguistic Feature Integration

2026-04-29 · updated 2026-07-24 · 2 min · 416 words

Reducing Prompt Sensitivity in LLM-Based Speech Recognition Through Learnable Projection

2026-04-29 · updated 2026-07-24 · 2 min · 310 words

Reference Microphone Selection for Guided Source Separation Based on The Normalized L-P Norm

2026-04-29 · updated 2026-07-24 · 2 min · 296 words

Reference-Aware SFM Layers for Intrusive Intelligibility Prediction

2026-04-29 · updated 2026-07-24 · 2 min · 284 words

Refgen: Reference-Guided Synthetic Data Generation for Anomalous Sound Detection

2026-04-29 · updated 2026-07-24 · 2 min · 264 words

Regularized Inverse Filter Design for Rigid Spherical Microphone Array Processing: Laplace- And Time-Domain Representations

2026-04-29 · updated 2026-07-24 · 2 min · 231 words

Relative Time Intervals Representation For Word-Level Timestamping With Masked Training

2026-04-29 · updated 2026-07-24 · 3 min · 482 words

Reliable AI via Age-Balanced Validation: Fair Model Selection for Parkinson’s Detection from Voice

2026-04-29 · updated 2026-07-24 · 2 min · 361 words

Representation-Based Data Quality Audits for Audio

2026-04-29 · updated 2026-07-24 · 3 min · 433 words

Representation-Diverse Self-Supervision for Cross-Domain Bioacoustic Learning in Low-Resource Settings

2026-04-29 · updated 2026-07-24 · 2 min · 253 words

Residual Tokens Enhance Masked Autoencoders for Speech Modeling

2026-04-29 · updated 2026-07-24 · 2 min · 425 words

Respire-Mamba C-UNet: Consistency-Trained Autoencoder for High-Fidelity Respiratory Sound Compression

2026-04-29 · updated 2026-07-24 · 2 min · 361 words

Rethinking Entity Disambiguation in Complex Modalities

2026-04-29 · updated 2026-07-24 · 3 min · 471 words

Rethinking Music Captioning with Music Metadata LLMs

2026-04-29 · updated 2026-07-24 · 3 min · 470 words

Retrieval-Based Speculative Decoding For Autoregressive Speech Synthesis

2026-04-29 · updated 2026-07-24 · 1 min · 203 words

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

2026-04-29 · updated 2026-07-24 · 2 min · 296 words

RFM-Editing: Rectified Flow Matching for Text-Guided Audio Editing

2026-04-29 · updated 2026-07-24 · 2 min · 284 words

RHO-PERFECT: Correlation Ceiling for Subjective Evaluation Datasets

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses

2026-04-29 · updated 2026-07-24 · 2 min · 272 words

RLBR: Reinforcement Learning with Biasing Rewards for Contextual Speech Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

RMODGDF: A Robust STFT-Derived Feature for Musical Instrument Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 412 words

Robust Accent Identification via Voice Conversion and Non-Timbral Embeddings

2026-04-29 · updated 2026-07-24 · 1 min · 159 words

Robust and Lightweight F0 Estimation Through Mid-Level Fusion of DSP-Informed Features

2026-04-29 · updated 2026-07-24 · 2 min · 332 words

Robust Deepfake Audio Detection via Multi-Level Intermediate Feature Fusion

2026-04-29 · updated 2026-07-24 · 2 min · 295 words

Robust Online Overdetermined Independent Vector Analysis Based on Bilinear Decomposition

2026-04-29 · updated 2026-07-24 · 1 min · 203 words

RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack

2026-04-29 · updated 2026-07-24 · 3 min · 522 words

RRPO: Robust Reward Policy Optimization for LLM-Based Emotional TTS

2026-04-29 · updated 2026-07-24 · 2 min · 244 words

S-PRESSO: Ultra Low Bitrate Sound Effect Compression with Diffusion Autoencoders and Offline Quantization

2026-04-29 · updated 2026-07-24 · 2 min · 410 words

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

2026-04-29 · updated 2026-07-24 · 3 min · 483 words

S2Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion

2026-04-29 · updated 2026-07-24 · 3 min · 492 words

SA-SSL-MOS: Self-Supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

2026-04-29 · updated 2026-07-24 · 3 min · 526 words

SAASDNet: An EEG-Based Streaming Auditory Attention Switch Decoding Network for Self-Initiated Attention Switching in Mixed Speech

2026-04-29 · updated 2026-07-24 · 2 min · 354 words

SAGA-SR: Semantically and Acoustically Guided Audio Super-Resolution

2026-04-29 · updated 2026-07-24 · 2 min · 339 words

Salad-VAE: Semantic Audio Compression with Language-Audio Distillation

2026-04-29 · updated 2026-07-24 · 2 min · 323 words

Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems with Deep Kernel Learning

2026-04-29 · updated 2026-07-24 · 2 min · 320 words

SAUNA: Song-Level Audio & User-Listening Data Neural Alignment

2026-04-29 · updated 2026-07-24 · 2 min · 216 words

Savgbench: Benchmarking Spatially Aligned Audio-Video Generation

2026-04-29 · updated 2026-07-24 · 2 min · 216 words

Scalable Evaluation for Audio Identification Via Synthetic Latent Fingerprint Generation

2026-04-29 · updated 2026-07-24 · 2 min · 323 words

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 314 words

Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

2026-04-29 · updated 2026-07-24 · 2 min · 257 words

Scaling Spoken Language Models with Syllabic Speech Tokenization

2026-04-29 · updated 2026-07-24 · 2 min · 272 words

SceneRAG: Scene-Level Retrieval-Augmented Generation for Video Understanding

2026-04-29 · updated 2026-07-24 · 2 min · 296 words

SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

2026-04-29 · updated 2026-07-24 · 2 min · 369 words

Secondary Source Placement for Sound Field Control Based on Ising Model

2026-04-29 · updated 2026-07-24 · 2 min · 218 words

SED: Structural Entropy Based Speech Discretization for Discrete Token-Based ASR

2026-04-29 · updated 2026-07-24 · 2 min · 377 words

Segmentwise Pruning in Audio-Language Models

2026-04-29 · updated 2026-07-24 · 3 min · 488 words

SELD-MOHA: A Fine-Tuning Method with the Mixture of Heterogeneous Adapters for Sound Event Localization and Detection

2026-04-29 · updated 2026-07-24 · 2 min · 400 words

Selective Hub Fusion with Modality-Heterogeneous Experts for Multimodal Emotion Recognition

2026-04-29 · updated 2026-07-24 · 3 min · 460 words

Self-Supervised Note Tracking and Multi-Pitch Estimation Via Reconstruction-Based Learning

2026-04-29 · updated 2026-07-24 · 3 min · 628 words

Semantic Anchor Transfer from Short to Long Speech in a Distillation-Based Summarization Framework

2026-04-29 · updated 2026-07-24 · 2 min · 418 words

Semantic-Guided Pseudo-Feature Attention Network for Audio-Visual Zero-Shot Learning

2026-04-29 · updated 2026-07-24 · 2 min · 402 words

SEP-ST: Incorporating Speech Entity Prompt Into Large Language Models for Speech Translation

2026-04-29 · updated 2026-07-24 · 2 min · 325 words

Separate this, and all of these Things Around It: Music Source Separation Via Hyperellipsoidal Queries

2026-04-29 · updated 2026-07-24 · 2 min · 339 words

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

2026-04-29 · updated 2026-07-24 · 2 min · 222 words

Sequential and Simultaneous Optimization of Microphone Array Geometry and Region-of-Interest Beamforming

2026-04-29 · updated 2026-07-24 · 1 min · 209 words

Session-Level Spoken Language Assessment with A Multimodal Foundation Model Via Multi-Target Learning

2026-04-29 · updated 2026-07-24 · 2 min · 296 words

SFM-TTS: Lightweight and Rapid Speech Synthesis with Flexible Shortcut Flow Matching

2026-04-29 · updated 2026-07-24 · 2 min · 409 words

Shared Representation Learning for Reference-Guided Targeted Sound Detection

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

Shortcut Flow Matching for Speech Enhancement: Step-Invariant Flows via Single Stage Training

2026-04-29 · updated 2026-07-24 · 2 min · 363 words

Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-Scale Dataset Cleansing

2026-04-29 · updated 2026-07-24 · 2 min · 302 words

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 357 words

Sing What You Fit: A Perception-Based Dataset and Benchmark for Vocal-Song Suitability Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 226 words

Sing2Song: An Accompaniment Generation System Based on Solo Singing

2026-04-29 · updated 2026-07-24 · 2 min · 393 words

Single-Microphone Audio Point Source Discriminative Localization from Reverberation Late Tail Estimation

2026-04-29 · updated 2026-07-24 · 2 min · 259 words

Single-Step Controllable Music Bandwidth extension with Flow Matching

2026-04-29 · updated 2026-07-24 · 3 min · 433 words

SingMOS-Pro: An Comprehensive Benchmark For Singing Quality Assessment

2026-04-29 · updated 2026-07-24 · 2 min · 246 words

SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

2026-04-29 · updated 2026-07-24 · 3 min · 489 words

SIRUP: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics

2026-04-29 · updated 2026-07-24 · 2 min · 342 words

SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

2026-04-29 · updated 2026-07-24 · 2 min · 315 words

SLM-SS: Speech Language Model for Generative Speech Separation

2026-04-29 · updated 2026-07-24 · 2 min · 325 words

SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 368 words

Slot Filling as a Reasoning Task for SpeechLLMs

2026-04-29 · updated 2026-07-24 · 2 min · 260 words

SmoothCLAP: Soft-Target Enhanced Contrastive Language-Audio Pretraining for Affective Computing

2026-04-29 · updated 2026-07-24 · 2 min · 353 words

Snore Sound Classification Based on Physiological Features and Adaptive Loss Function

2026-04-29 · updated 2026-07-24 · 2 min · 324 words

Solving the Helmholtz Equation Via Physics-Informed Neural Networks with an Adaptive Weighting Strategy

2026-04-29 · updated 2026-07-24 · 2 min · 225 words

SONAR: Self-Distilled Continual Pre-Training for Domain Adaptive Audio Representation

2026-04-29 · updated 2026-07-24 · 2 min · 276 words

SoundCompass: Navigating Target Sound Extraction with Effective Directional Clue Integration in Complex Acoustic Scenes

2026-04-29 · updated 2026-07-24 · 2 min · 247 words

Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

2026-04-29 · updated 2026-07-24 · 3 min · 496 words

Sounds that Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

2026-04-29 · updated 2026-07-24 · 2 min · 288 words

Source Separation For A Cappella Music

2026-04-29 · updated 2026-07-24 · 2 min · 310 words

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

2026-04-29 · updated 2026-07-24 · 3 min · 470 words

SPAM: Style Prompt Adherence Metric for Prompt-Based TTS

2026-04-29 · updated 2026-07-24 · 2 min · 304 words

Sparse Autoencoders Make Audio Foundation Models More Explainable

2026-04-29 · updated 2026-07-24 · 2 min · 364 words

Sparse-View Visual-Acoustic Latent Learning for Novel-View Audio Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 424 words

Spatial Covariance Matrix Reconstruction for Speech Enhancement in Reverberant Multi-Source Environments

2026-04-29 · updated 2026-07-24 · 2 min · 401 words

Spatial-CLAP: Learning Spatially-Aware Audio–Text Embeddings for Multi-Source Conditions

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

Spatially Aware Self-Supervised Models for Multi-Channel Neural Speaker Diarization

2026-04-29 · updated 2026-07-24 · 2 min · 288 words

SpatialNet-Echo: Real-Time Acoustic Echo Cancellation via Integrated Narrow-Band and Cross-Band Processing

2026-04-29 · updated 2026-07-24 · 2 min · 323 words

Speaker Anonymisation for Speech-Based Suicide Risk Detection

2026-04-29 · updated 2026-07-24 · 2 min · 259 words

Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding

2026-04-29 · updated 2026-07-24 · 2 min · 397 words

Spectral or Spatial? Leveraging Both for Speaker Extraction in Challenging Data Conditions

2026-04-29 · updated 2026-07-24 · 2 min · 261 words

Spectrogram Event Based Feature Representation for Generalizable Automatic Music Transcription

2026-04-29 · updated 2026-07-24 · 3 min · 430 words

Speech Emotion Recognition based on Hierarchical Transformer with Shifted Windows

2026-04-29 · updated 2026-07-24 · 2 min · 286 words

Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts

2026-04-29 · updated 2026-07-24 · 2 min · 359 words

SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 319 words

SpeechMapper: Speech-To-Text Embedding Projector for LLMs

2026-04-29 · updated 2026-07-24 · 3 min · 482 words

Spike-Driven Low-Power Speech Bandwidth Extension

2026-04-29 · updated 2026-07-24 · 2 min · 398 words

Spiking Attention Network: A Hybrid Neuromorphic Approach to Underwater Acoustic Localization and Zero-Shot Adaptation

2026-04-29 · updated 2026-07-24 · 2 min · 308 words

Spiking Temporal-Enhanced Network for Zero-Shot Audio-Visual Learning

2026-04-29 · updated 2026-07-24 · 2 min · 332 words

Spring Reverb Emulation with Hybrid Gated Convolutional Networks and State Space Models

2026-04-29 · updated 2026-07-24 · 3 min · 442 words

SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 396 words

ST-HNTM: Joint Speech-Text Neural Topic Modeling on the Hypersphere

2026-04-29 · updated 2026-07-24 · 3 min · 539 words

STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

2026-04-29 · updated 2026-07-24 · 2 min · 356 words

Staged Diffusion with Hybrid Mixture-of-Experts (MoE) for Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 313 words

Stemphonic: All-At-Once Flexible Multi-Stem Music Generation

2026-04-29 · updated 2026-07-24 · 2 min · 423 words

Step-Audio-R1.5 Technical Report

2026-04-29 · updated 2026-07-24 · 2 min · 260 words

StereoFoley: Object-Aware Stereo Audio Generation from Video

2026-04-29 · updated 2026-07-24 · 2 min · 284 words

Stereophonic Acoustic Echo Cancellation Using an Improved Affine Projection Algorithm with Adaptive Multiple Sub-Filters

2026-04-29 · updated 2026-07-24 · 2 min · 319 words

Still Thinking or Stopped Talking? Dialogue Silence Intention Classification Using Multimodal Large Language Model

2026-04-29 · updated 2026-07-24 · 2 min · 318 words

Str-DiffSep: Streamable Diffusion Model for Speech Separation

2026-04-29 · updated 2026-07-24 · 2 min · 343 words

Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization Via Neural Audio Codec and Language Models

2026-04-29 · updated 2026-07-24 · 3 min · 456 words

Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization

2026-04-29 · updated 2026-07-24 · 2 min · 362 words

Streamingbench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

2026-04-29 · updated 2026-07-24 · 2 min · 262 words

StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

2026-04-29 · updated 2026-07-24 · 2 min · 265 words

Stress Prediction from Temporal Emotion Trajectories in Clinical Patient-Physician Conversations

2026-04-29 · updated 2026-07-24 · 3 min · 430 words

Structure-Aware Diffusion Schrödinger Bridge

2026-04-29 · updated 2026-07-24 · 1 min · 209 words

StyHarmo: Efficient Style-Specific Video Generation with Music Synchronization

2026-04-29 · updated 2026-07-24 · 2 min · 266 words

Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent

2026-04-29 · updated 2026-07-24 · 3 min · 512 words

Style-Disentangled Diffusion for Controllable and Identity-Generalized Speech-Driven Body Motion Generation

2026-04-29 · updated 2026-07-24 · 2 min · 245 words

StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

2026-04-29 · updated 2026-07-24 · 3 min · 463 words

StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

Subgraph Localization in the Subbands for Partially Spoofed Speech Detection

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Subsequence SDTW: Differentiable Alignment with Flexible Boundary Conditions

2026-04-29 · updated 2026-07-24 · 2 min · 316 words

Subspace Hybrid Adaptive Filtering for Phonocardiogram Signal Denoising

2026-04-29 · updated 2026-07-24 · 2 min · 297 words

Sunac: Source-Aware Unified Neural Audio Codec

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

SURE: Synergistic Uncertainty-Aware Reasoning for Multimodal Emotion Recognition in Conversations

2026-04-29 · updated 2026-07-24 · 2 min · 285 words

SwitchCodec: Adaptive Residual-Expert Sparse Quantization for High-Fidelity Neural Audio Coding

2026-04-29 · updated 2026-07-24 · 2 min · 366 words

Symphony Rendering: MIDI and Composer-Conditioned Auto Orchestration with Flow-Matching Transformers

2026-04-29 · updated 2026-07-24 · 3 min · 482 words

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

2026-04-29 · updated 2026-07-24 · 2 min · 355 words

SynaSpot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy

2026-04-29 · updated 2026-07-24 · 2 min · 330 words

Synchronous Secondary Path Modeling and Kronecker-Factorized Adaptive Algorithm for Multichannel Active Noise Control

2026-04-29 · updated 2026-07-24 · 2 min · 329 words

Syncspeech: Efficient and Low-Latency Text-to-Speech Based on Temporal Masked Transformer

2026-04-29 · updated 2026-07-24 · 2 min · 344 words

SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding

2026-04-29 · updated 2026-07-24 · 3 min · 456 words

Synthcloner: Synthesizer-Style Audio Transfer via Factorized Codec with ADSR Envelope Control

2026-04-29 · updated 2026-07-24 · 2 min · 324 words

Synthesized Data Selection via Score Distribution Matching for Te Reo Māori Automatic Speech Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 262 words

Synthetic Data Domain Adaptation for ASR via LLM-Based Text and Phonetic Respelling Augmentation

2026-04-29 · updated 2026-07-24 · 3 min · 473 words

Synthetic yet Striking? Assessing Vocal Charisma in TTS via Perceptual and Algorithmic Measures

2026-04-29 · updated 2026-07-24 · 2 min · 227 words

T-Cache: Fast Inference For Masked Generative Transformer-Based TTS Via Prompt-Aware Feature Caching

2026-04-29 · updated 2026-07-24 · 2 min · 357 words

T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS

2026-04-29 · updated 2026-07-24 · 2 min · 292 words

TAG: Structured Temporal Audio Generation via LLM-Guided Manual Scription and Control

2026-04-29 · updated 2026-07-24 · 2 min · 343 words

TAGARELA - A Portuguese Speech Dataset from Podcasts

2026-04-29 · updated 2026-07-24 · 2 min · 284 words

Taming Audio VAEs via Target-KL Regularization

2026-04-29 · updated 2026-07-24 · 2 min · 352 words

Target Speaker Anonymization in Multi-Speaker Recordings

2026-04-29 · updated 2026-07-24 · 2 min · 280 words

Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder

2026-04-29 · updated 2026-07-24 · 2 min · 344 words

Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 260 words

Task-Oriented Sound Privacy Preservation for Sound Event Detection Via End-to-End Adversarial Multi-Task Learning

2026-04-29 · updated 2026-07-24 · 2 min · 387 words

TASU: Text-only Alignment for Speech Understanding

2026-04-29 · updated 2026-07-24 · 2 min · 366 words

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

2026-04-29 · updated 2026-07-24 · 2 min · 335 words

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

2026-04-29 · updated 2026-07-24 · 3 min · 504 words

Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-Wise Distillation

2026-04-29 · updated 2026-07-24 · 2 min · 278 words

Teaching the Teachers: Boosting Unsupervised Domain Adaptation In Speech Recognition By Ensemble Update

2026-04-29 · updated 2026-07-24 · 2 min · 400 words

Temporal Distillation for Music Representation Learning

2026-04-29 · updated 2026-07-24 · 3 min · 433 words

Temporal Graph Modeling for Speech Emotion Recognition Using LSTM-Aggregated Multigraph Networks

2026-04-29 · updated 2026-07-24 · 2 min · 229 words

Temporal-Spatial Decouple Before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

2026-04-29 · updated 2026-07-24 · 4 min · 737 words

Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification

2026-04-29 · updated 2026-07-24 · 2 min · 278 words

Test Time Adaptation for Speech Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 241 words

Test-Time Scaling for Auditory Cognition in Audio Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 292 words

Testing The Efficient Coding Hypothesis Beyond Humans: The Auditory Kernels of Bat Vocalizations

2026-04-29 · updated 2026-07-24 · 2 min · 236 words

Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

2026-04-29 · updated 2026-07-24 · 2 min · 324 words

Text2Move: Text-To-Moving Sound Generation via Trajectory Prediction and Temporal Alignment

2026-04-29 · updated 2026-07-24 · 2 min · 243 words

TextlessRAG: End-to-End Visual Document RAG by Speech without Text

2026-04-29 · updated 2026-07-24 · 2 min · 375 words

The 3rd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing aid Speech Intelligibility Prediction

2026-04-29 · updated 2026-07-24 · 1 min · 190 words

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-Based Language Encoders

2026-04-29 · updated 2026-07-24 · 2 min · 277 words

The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures

2026-04-29 · updated 2026-07-24 · 2 min · 390 words

The Muse Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

2026-04-29 · updated 2026-07-24 · 2 min · 307 words

The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

2026-04-29 · updated 2026-07-24 · 2 min · 255 words

The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion to Singing Style Conversion

2026-04-29 · updated 2026-07-24 · 1 min · 208 words

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 276 words

The Synergistic Role of Audio and Large Video-Language Model in Source-Free Video Domain Adaptation

2026-04-29 · updated 2026-07-24 · 2 min · 360 words

Theory and Application of Circular Relative Harmonic Coefficients

2026-04-29 · updated 2026-07-24 · 2 min · 334 words

Thinking While Listening: Simple Test Time Scaling for Audio Classification

2026-04-29 · updated 2026-07-24 · 2 min · 252 words

Three Seconds is Sufficient: A Multi-Pronged Framework for Model-Based Speaker Adaptation in ASR Under Data-Scarce Conditions

2026-04-29 · updated 2026-07-24 · 3 min · 493 words

TICL: Text-Embedding KNN for Speech in-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

Timbre-Aware Audio Difference Captioning for Anomalous Machine Sounds without Paired Training Data via Synthetic Perturbations

2026-04-29 · updated 2026-07-24 · 2 min · 352 words

Timbre-Based Pretraining with Pseudo-Labels for Multi-Instrument Automatic Music Transcription

2026-04-29 · updated 2026-07-24 · 3 min · 628 words

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0

2026-04-29 · updated 2026-07-24 · 2 min · 341 words

Time-Domain Synthesis of Virtual Sound Source Within Personalized Sound Zone using a Linear Loudspeaker Array

2026-04-29 · updated 2026-07-24 · 2 min · 221 words

Time-Shifted Token Scheduling for Symbolic Music Generation

2026-04-29 · updated 2026-07-24 · 2 min · 214 words

TinyMU: A Compact Audio-Language Model for Music Understanding

2026-04-29 · updated 2026-07-24 · 2 min · 304 words

Tldiffgan: A Latent Diffusion-Gan Framework with Temporal Information Fusion for Anomalous Sound Detection

2026-04-29 · updated 2026-07-24 · 2 min · 350 words

TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Framework for Ü-Tsang, Amdo and Kham Speech Dataset Generation

2026-04-29 · updated 2026-07-24 · 2 min · 323 words

Tokenchain: A Discrete Speech Chain via Semantic Token Modeling

2026-04-29 · updated 2026-07-24 · 3 min · 529 words

Toward Faithful Explanations in Acoustic Anomaly Detection

2026-04-29 · updated 2026-07-24 · 1 min · 207 words

Toward Robust And Efficient Beat Tracking Via Beat-Aware Attention

2026-04-29 · updated 2026-07-24 · 2 min · 384 words

Towards Blind Data Cleaning: A Case Study in Music Source Separation

2026-04-29 · updated 2026-07-24 · 2 min · 305 words

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

2026-04-29 · updated 2026-07-24 · 2 min · 384 words

Towards Data Drift Monitoring for Speech Deepfake Detection in the Context of MLOps

2026-04-29 · updated 2026-07-24 · 2 min · 248 words

Towards Distance-Aware Synthetic Audio Mixtures for Universal Sound Separation

2026-04-29 · updated 2026-07-24 · 2 min · 272 words

Towards Effective Negation Modeling in Joint Audio-Text Models for Music

2026-04-29 · updated 2026-07-24 · 2 min · 248 words

Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances

2026-04-29 · updated 2026-07-24 · 2 min · 339 words

Towards Fair ASR for Second Language Speakers using Fairness Prompted Finetuning

2026-04-29 · updated 2026-07-24 · 2 min · 273 words

Towards Lightweight Adaptation of Speech Enhancement Models in Real-World Environments

2026-04-29 · updated 2026-07-24 · 3 min · 442 words

Towards Multi-View Hierarchical Video-to-Piano Generation with MIDI Guidance

2026-04-29 · updated 2026-07-24 · 2 min · 346 words

Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages

2026-04-29 · updated 2026-07-24 · 2 min · 399 words

Towards Real-Time Generative Speech Restoration with Flow-Matching

2026-04-29 · updated 2026-07-24 · 2 min · 280 words

Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER

2026-04-29 · updated 2026-07-24 · 2 min · 343 words

Tpeformer: Temporal Patch Embedding Transformer

2026-04-29 · updated 2026-07-24 · 2 min · 290 words

Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio

2026-04-29 · updated 2026-07-24 · 3 min · 454 words

Training Dynamics-Aware Multi-Factor Curriculum Learning for Target Speaker Extraction

2026-04-29 · updated 2026-07-24 · 2 min · 294 words

Training Flow Matching Models with Reliable Labels via Self-Purification

2026-04-29 · updated 2026-07-24 · 2 min · 348 words

Training-Free Inference-Time Scaling for Audio Source Separation

2026-04-29 · updated 2026-07-24 · 2 min · 281 words

Training-Free Multimodal Guidance for Video to Audio Generation

2026-04-29 · updated 2026-07-24 · 2 min · 321 words

Transfer Learning for Paediatric Sleep Apnoea Detection using Physiology-Guided Acoustic Models

2026-04-29 · updated 2026-07-24 · 2 min · 285 words

Transferable Audio Lottery Tickets: Gradient Accumulation for Extreme Sparsity

2026-04-29 · updated 2026-07-24 · 2 min · 265 words

Tri-Attention Fusion: Joint Temporal-Spectral and Bidirectional Modeling for Speech Spoofing Detection

2026-04-29 · updated 2026-07-24 · 2 min · 336 words

Triad: Tri-Head with Auxiliary Duplicating Permutation Invariant Training for Multi-Task Sound Event Localization and Detection

2026-04-29 · updated 2026-07-24 · 2 min · 238 words

Triage Knowledge Distillation for Speaker Verification

2026-04-29 · updated 2026-07-24 · 2 min · 329 words

TTA: Transcribe, Translate and Alignment for Cross-Lingual Speech Representation

2026-04-29 · updated 2026-07-24 · 2 min · 389 words

TVP-UNet: Threshold Variance Penalty U-Net for Voice Activity Detection in Dysarthric Speech

2026-04-29 · updated 2026-07-24 · 2 min · 263 words

Two-Stage Language Model Framework for Acoustic Echo Cancellation

2026-04-29 · updated 2026-07-24 · 2 min · 359 words

UJCodec: An End-to-end Unet-Style Codec for Joint Speech Compression and Enhancement

2026-04-29 · updated 2026-07-24 · 2 min · 341 words

UMA-SPLIT: Unimodal Aggregation for Both English and Mandarin Non-Autoregressive Speech Recognition

2026-04-29 · updated 2026-07-24 · 3 min · 463 words

UMV: A Mixture-Of-Experts Vision Transformer with Multi-Spectrogram Fusion for Underwater Ship Noise Classification

2026-04-29 · updated 2026-07-24 · 2 min · 253 words

Uncertainty-Aware 3D Emotional Talking Face Synthesis with Emotion Prior Distillation

2026-04-29 · updated 2026-07-24 · 2 min · 348 words

Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis

2026-04-29 · updated 2026-07-24 · 2 min · 365 words

Understanding the Strengths and Weaknesses of SSL Models for Audio Deepfake Model Attribution

2026-04-29 · updated 2026-07-24 · 2 min · 304 words

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 348 words

Universr: Unified and Versatile Audio Super-Resolution Via Vocoder-Free Flow Matching

2026-04-29 · updated 2026-07-24 · 3 min · 445 words

UNMIXX: Untangling Highly Correlated Singing Voices Mixtures

2026-04-29 · updated 2026-07-24 · 2 min · 373 words

Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research

2026-04-29 · updated 2026-07-24 · 2 min · 225 words

Unseen but Not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models

2026-04-29 · updated 2026-07-24 · 2 min · 279 words

Unsupervised Discovery and Analysis of the Vocal Repertoires and Patterns of Select Corvid Species

2026-04-29 · updated 2026-07-24 · 2 min · 316 words

Unsupervised Lexicon Learning from Speech is Limited by Representations Rather than Clustering

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

USVexplorer: Robust Detection of Ultrasonic Vocalizations with Cross Species Generalization

2026-04-29 · updated 2026-07-24 · 2 min · 268 words

UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model

2026-04-29 · updated 2026-07-24 · 2 min · 383 words

Utilizing Information Theoretic Approach to Study Cochlear Neural Degeneration

2026-04-29 · updated 2026-07-24 · 2 min · 241 words

UVT-LM: Unifying Visual and Tactile Perception with Language Model

2026-04-29 · updated 2026-07-24 · 2 min · 411 words

V2A-DPO: Omni-Preference Optimization for Video-To-Audio Generation

2026-04-29 · updated 2026-07-24 · 2 min · 368 words

Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition

2026-04-29 · updated 2026-07-24 · 3 min · 575 words

VBx for End-to-End Neural and Clustering-Based Diarization

2026-04-29 · updated 2026-07-24 · 2 min · 341 words

VChangeCodec: An Ultra Low-Complexity Neural Speech Codec with Built-In Voice Changer for Customized Real-Time Communication

2026-04-29 · updated 2026-07-24 · 3 min · 460 words

Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation

2026-04-29 · updated 2026-07-24 · 2 min · 282 words

Vib2Sound: Separation Of Multimodal Sound Sources

2026-04-29 · updated 2026-07-24 · 2 min · 361 words

Vioptt: Violin Technique-Aware Transcription from Synthetic Data Augmentation

2026-04-29 · updated 2026-07-24 · 2 min · 395 words

Virtual Consistency for Audio Editing

2026-04-29 · updated 2026-07-24 · 3 min · 453 words

Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation

2026-04-29 · updated 2026-07-24 · 2 min · 238 words

ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models

2026-04-29 · updated 2026-07-24 · 2 min · 223 words

VividTalker: A Modular Framework for Expressive 3D Talking Avatars with Controllable Gaze and Blink

2026-04-29 · updated 2026-07-24 · 2 min · 408 words

VM-UNSSOR: Unsupervised Neural Speech Separation Enhanced by Higher-SNR Virtual Microphone Arrays

2026-04-29 · updated 2026-07-24 · 3 min · 603 words

VMSP: Video-to-Music Generation with Two-Stage Alignment and Synthesis

2026-04-29 · updated 2026-07-24 · 2 min · 260 words

Vocalnet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

2026-04-29 · updated 2026-07-24 · 2 min · 319 words

Voting-Based Pitch Estimation with Temporal and Frequential Alignment and Correlation Aware Selection

2026-04-29 · updated 2026-07-24 · 3 min · 449 words

VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings

2026-04-29 · updated 2026-07-24 · 2 min · 399 words

VoXtream: Full-Stream Text-To-Speech With Extremely Low Latency

2026-04-29 · updated 2026-07-24 · 3 min · 482 words

VT-Heads: Voice Cloning and Talking Head Generation from Text Based on V-DiT

2026-04-29 · updated 2026-07-24 · 2 min · 341 words

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 250 words

WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error

2026-04-29 · updated 2026-07-24 · 1 min · 199 words

Wave-Trainer-Fit: Neural Vocoder With Trainable Prior And Fixed-Point Iteration Towards High-Quality Speech Generation From SSL Features

2026-04-29 · updated 2026-07-24 · 2 min · 338 words

WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders with Residual Denoising and Sub-Modeling for GAN and Diffusion Models

2026-04-29 · updated 2026-07-24 · 3 min · 553 words

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

2026-04-29 · updated 2026-07-24 · 3 min · 612 words

WaveSpikeNet: A Wavelet-Spiking Fusion Architecture for Audio Classification on Edge Devices

2026-04-29 · updated 2026-07-24 · 3 min · 498 words

WavLink: Compact Audio–Text Embeddings with a Global Whisper Token

2026-04-29 · updated 2026-07-24 · 2 min · 333 words

What the student learns in knowledge distillation: A subspace view and evidence on Convolutional Recurrent Network

2026-04-29 · updated 2026-07-24 · 2 min · 298 words

When Audio Matters: A Lightweight, Hierarchical Fusion Model for Speech and Non-Verbal Emotion Recognition

2026-04-29 · updated 2026-07-24 · 2 min · 380 words

When Children Talk and Machines Listen: Toward an Interpretable Speech-Based Screener for Dutch Developmental Language Disorder

2026-04-29 · updated 2026-07-24 · 2 min · 374 words

When Noise Lowers the Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 306 words

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 311 words

When Voice Matters: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making

2026-04-29 · updated 2026-07-24 · 2 min · 381 words

Whisper-FEST: Single-Channel Far-Field Enhanced Speech-to-text without Parallel Data

2026-04-29 · updated 2026-07-24 · 2 min · 425 words

Whisper-MLA: Reducing GPU Memory Consumption of ASR Models Based on MHA2MLA Conversion

2026-04-29 · updated 2026-07-24 · 2 min · 312 words

Whisper-QF: Leveraging Dual Cross-Attention Q-Former for Speech Emotion Recognition With Multi-Task Learning

2026-04-29 · updated 2026-07-24 · 2 min · 329 words

Whisper: Courtside Edition - Enhancing ASR Performance through LLM-Driven Context Generation

2026-04-29 · updated 2026-07-24 · 1 min · 195 words

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

2026-04-29 · updated 2026-07-24 · 1 min · 178 words

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

2026-04-29 · updated 2026-07-24 · 2 min · 258 words

Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-Resource Speech Recognition

2026-04-29 · updated 2026-07-24 · 3 min · 434 words

Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

2026-04-29 · updated 2026-07-24 · 2 min · 248 words

ZK-VSA: Zero-Knowledge Verifiable Speaker Anonymization Leveraging Phase Vocoder with Time-Scale Modification

2026-04-29 · updated 2026-07-24 · 2 min · 340 words

ZSV2C-MLLM: Zero-Shot Visual Voice Cloning Via Multimodal Large Language Models

2026-04-29 · updated 2026-07-24 · 2 min · 334 words

β-AVSDNET: A Novel End-To-End Neural Network Architecture For Audio-Visual Speaker Diarization

2026-04-29 · updated 2026-07-24 · 3 min · 487 words

Speech/Music/Audio Paper Digest 2026-04-29

2026-04-29 · updated 2026-07-24 · 19 min · 3856 words

A Functorial Formulation of Neighborhood Aggregating Deep Learning

2026-04-28 · updated 2026-07-24 · 1 min · 148 words

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

2026-04-28 · updated 2026-07-24 · 2 min · 368 words

An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

2026-04-28 · updated 2026-07-24 · 2 min · 276 words

CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

2026-04-28 · updated 2026-07-24 · 2 min · 265 words

Come Together: Analyzing Popular Songs Through Statistical Embeddings

2026-04-28 · updated 2026-07-24 · 2 min · 243 words

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

2026-04-28 · updated 2026-07-24 · 1 min · 180 words

Explainable AI in Speaker Recognition – Making Latent Representations Understandable

2026-04-28 · updated 2026-07-24 · 2 min · 232 words

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

2026-04-28 · updated 2026-07-24 · 3 min · 491 words

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

2026-04-28 · updated 2026-07-24 · 2 min · 366 words

Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions

2026-04-28 · updated 2026-07-24 · 2 min · 344 words

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-04-28 · updated 2026-07-24 · 2 min · 368 words

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-28 · updated 2026-07-24 · 2 min · 411 words

Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification

2026-04-28 · updated 2026-07-24 · 2 min · 362 words

Opening the Design Space: Two Years of Performance with Intelligent Musical Instruments

2026-04-28 · updated 2026-07-24 · 1 min · 194 words

Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network

2026-04-28 · updated 2026-07-24 · 1 min · 206 words

Psychologically-Grounded Graph Modeling for Interpretable Depression Detection

2026-04-28 · updated 2026-07-24 · 3 min · 503 words

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2026-04-28 · updated 2026-07-24 · 2 min · 287 words

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

2026-04-28 · updated 2026-07-24 · 3 min · 431 words

RTCFake: Speech Deepfake Detection in Real-Time Communication

2026-04-28 · updated 2026-07-24 · 2 min · 337 words

Scaling Properties of Continuous Diffusion Spoken Language Models

2026-04-28 · updated 2026-07-24 · 2 min · 415 words

Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

2026-04-28 · updated 2026-07-24 · 1 min · 208 words

Speech Enhancement Based on Drifting Models

2026-04-28 · updated 2026-07-24 · 2 min · 361 words

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

2026-04-28 · updated 2026-07-24 · 3 min · 612 words

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-04-28 · updated 2026-07-24 · 2 min · 409 words

Speech/Music/Audio Paper Digest 2026-04-28

2026-04-28 · updated 2026-07-24 · 12 min · 2428 words

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

2026-04-27 · updated 2026-07-24 · 2 min · 343 words

Audio Effect Estimation with DNN-Based Prediction and Search Algorithm

2026-04-27 · updated 2026-07-24 · 2 min · 266 words

Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues

2026-04-27 · updated 2026-07-24 · 1 min · 159 words

Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis

2026-04-27 · updated 2026-07-24 · 3 min · 592 words

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

2026-04-27 · updated 2026-07-24 · 2 min · 395 words

Earable Platform with Integrated Simultaneous EEG Sensing and Auditory Stimulation

2026-04-27 · updated 2026-07-24 · 2 min · 270 words

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-04-27 · updated 2026-07-24 · 2 min · 318 words

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2026-04-27 · updated 2026-07-24 · 2 min · 260 words

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

2026-04-27 · updated 2026-07-24 · 2 min · 377 words

Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven’s Piano and Cello Sonatas, 1930–2012

2026-04-27 · updated 2026-07-24 · 2 min · 236 words

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

2026-04-27 · updated 2026-07-24 · 2 min · 273 words

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2026-04-27 · updated 2026-07-24 · 2 min · 326 words

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2026-04-27 · updated 2026-07-24 · 4 min · 707 words

Speech/Music/Audio Paper Digest 2026-04-27

2026-04-27 · updated 2026-07-24 · 8 min · 1673 words

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-25 · updated 2026-07-24 · 2 min · 320 words

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

2026-04-25 · updated 2026-07-24 · 1 min · 176 words

Speech/Music/Audio Paper Digest 2026-04-25

2026-04-25 · updated 2026-07-24 · 2 min · 225 words

“This Wasn’t Made for Me”: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

2026-04-24 · updated 2026-07-24 · 1 min · 113 words

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

2026-04-24 · updated 2026-07-24 · 3 min · 428 words

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

2026-04-24 · updated 2026-07-24 · 1 min · 132 words

Beyond Rules: Towards Basso Continuo Personal Style Identification

2026-04-24 · updated 2026-07-24 · 1 min · 133 words

DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline

2026-04-24 · updated 2026-07-24 · 2 min · 255 words

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

2026-04-24 · updated 2026-07-24 · 1 min · 117 words

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

2026-04-24 · updated 2026-07-24 · 2 min · 333 words

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

2026-04-24 · updated 2026-07-24 · 1 min · 153 words

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

2026-04-24 · updated 2026-07-24 · 1 min · 204 words

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

2026-04-24 · updated 2026-07-24 · 1 min · 178 words

Low-Rank Adaptation Redux for Large Models

2026-04-24 · updated 2026-07-24 · 1 min · 103 words

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

2026-04-24 · updated 2026-07-24 · 3 min · 439 words

Materialistic RIR: Material Conditioned Realistic RIR Generation

2026-04-24 · updated 2026-07-24 · 2 min · 400 words

MER 2026: From Discriminative Emotion Recognition to Generative Emotion Understanding

2026-04-24 · updated 2026-07-24 · 2 min · 296 words

Misinformation Span Detection in Videos via Audio Transcripts

2026-04-24 · updated 2026-07-24 · 2 min · 285 words

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

2026-04-24 · updated 2026-07-24 · 1 min · 18 words

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

2026-04-24 · updated 2026-07-24 · 2 min · 280 words

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

2026-04-24 · updated 2026-07-24 · 3 min · 487 words

Sema: Semantic Transport for Real-Time Multimodal Agents

2026-04-24 · updated 2026-07-24 · 2 min · 266 words

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0

2026-04-24 · updated 2026-07-24 · 2 min · 402 words

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

2026-04-24 · updated 2026-07-24 · 3 min · 483 words

Speech/Music/Audio Paper Digest 2026-04-24

2026-04-24 · updated 2026-07-24 · 11 min · 2180 words

Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

2026-04-23 · updated 2026-07-24 · 1 min · 165 words

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

2026-04-23 · updated 2026-07-24 · 1 min · 170 words

Before the Mic: Physical-Layer Voiceprint Anonymization with Acoustic Metamaterials

2026-04-23 · updated 2026-07-24 · 2 min · 236 words

Centering Ecological Goals in Automated Identification of Individual Animals

2026-04-23 · updated 2026-07-24 · 2 min · 233 words

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

2026-04-23 · updated 2026-07-24 · 2 min · 276 words

Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis

2026-04-23 · updated 2026-07-24 · 2 min · 311 words

Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations

2026-04-23 · updated 2026-07-24 · 2 min · 221 words

Enhancing ASR Performance in the Medical Domain for Dravidian Languages

2026-04-23 · updated 2026-07-24 · 2 min · 293 words

Enhancing Speaker Verification with Whispered Speech via Post-Processing

2026-04-23 · updated 2026-07-24 · 2 min · 259 words

Environmental Sound Deepfake Detection Using Deep-Learning Framework

2026-04-23 · updated 2026-07-24 · 2 min · 267 words

Explicit Dropout: Deterministic Regularization for Transformer Architectures

2026-04-23 · updated 2026-07-24 · 1 min · 111 words

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

2026-04-23 · updated 2026-07-24 · 2 min · 302 words

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

2026-04-23 · updated 2026-07-24 · 2 min · 266 words

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

2026-04-23 · updated 2026-07-24 · 2 min · 386 words

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

2026-04-23 · updated 2026-07-24 · 1 min · 201 words

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-23 · updated 2026-07-24 · 2 min · 215 words

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

2026-04-23 · updated 2026-07-24 · 1 min · 207 words

Qwen3.5-Omni Technical Report

2026-04-23 · updated 2026-07-24 · 2 min · 251 words

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-23 · updated 2026-07-24 · 2 min · 231 words

SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment

2026-04-23 · updated 2026-07-24 · 1 min · 182 words

Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones

2026-04-23 · updated 2026-07-24 · 2 min · 237 words

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

2026-04-23 · updated 2026-07-24 · 1 min · 200 words

Tadabur: A Large-Scale Quran Audio Dataset

2026-04-23 · updated 2026-07-24 · 1 min · 191 words

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026-04-23 · updated 2026-07-24 · 2 min · 266 words

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

2026-04-23 · updated 2026-07-24 · 2 min · 316 words

Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

2026-04-23 · updated 2026-07-24 · 2 min · 223 words

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

2026-04-23 · updated 2026-07-24 · 2 min · 307 words

Speech/Music/Audio Paper Digest 2026-04-23

2026-04-23 · updated 2026-07-24 · 13 min · 2679 words

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

2026-04-22 · updated 2026-07-24 · 3 min · 428 words

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

2026-04-22 · updated 2026-07-24 · 3 min · 465 words

Audio Spoof Detection with GaborNet

2026-04-22 · updated 2026-07-24 · 4 min · 689 words

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

2026-04-22 · updated 2026-07-24 · 2 min · 335 words

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

2026-04-22 · updated 2026-07-24 · 2 min · 277 words

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

2026-04-22 · updated 2026-07-24 · 2 min · 221 words

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

2026-04-22 · updated 2026-07-24 · 3 min · 465 words

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

2026-04-22 · updated 2026-07-24 · 2 min · 290 words

Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification

2026-04-22 · updated 2026-07-24 · 2 min · 419 words

Environmental Sound Deepfake Detection Using Deep-Learning Framework

2026-04-22 · updated 2026-07-24 · 2 min · 276 words

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

2026-04-22 · updated 2026-07-24 · 2 min · 305 words

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-22 · updated 2026-07-24 · 1 min · 24 words

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

2026-04-22 · updated 2026-07-24 · 2 min · 237 words

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

2026-04-22 · updated 2026-07-24 · 2 min · 269 words

Qwen3.5-Omni Technical Report

2026-04-22 · updated 2026-07-24 · 2 min · 392 words

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-22 · updated 2026-07-24 · 2 min · 405 words

Tadabur: A Large-Scale Quran Audio Dataset

2026-04-22 · updated 2026-07-24 · 2 min · 327 words

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

2026-04-22 · updated 2026-07-24 · 2 min · 397 words

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

2026-04-22 · updated 2026-07-24 · 2 min · 338 words

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

2026-04-22 · updated 2026-07-24 · 3 min · 435 words

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

2026-04-22 · updated 2026-07-24 · 2 min · 385 words

Speech/Music/Audio Paper Digest 2026-04-22

2026-04-22 · updated 2026-07-24 · 8 min · 1620 words

A novel LSTM music generator based on the fractional time-frequency feature extraction

2026-04-21 · updated 2026-07-24 · 1 min · 209 words

A state-space representation of the boundary integral equation for room acoustic modelling

2026-04-21 · updated 2026-07-24 · 2 min · 251 words

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

2026-04-21 · updated 2026-07-24 · 2 min · 390 words

Anonymization, Not Elimination: Utility-Preserved Speech Anonymization

2026-04-21 · updated 2026-07-24 · 3 min · 568 words

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

2026-04-21 · updated 2026-07-24 · 2 min · 311 words

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

2026-04-21 · updated 2026-07-24 · 2 min · 278 words

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

2026-04-21 · updated 2026-07-24 · 3 min · 497 words

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

2026-04-21 · updated 2026-07-24 · 2 min · 384 words

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

2026-04-21 · updated 2026-07-24 · 2 min · 230 words

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

2026-04-21 · updated 2026-07-24 · 1 min · 140 words

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

2026-04-21 · updated 2026-07-24 · 1 min · 213 words

Coexisting Tempo Traditions in Beethoven’s Piano and Cello Sonatas: A K-means Clustering Analysis of Recorded Performances, 1930-2012

2026-04-21 · updated 2026-07-24 · 2 min · 246 words

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

2026-04-21 · updated 2026-07-24 · 3 min · 447 words

FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs

2026-04-21 · updated 2026-07-24 · 2 min · 367 words

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

2026-04-21 · updated 2026-07-24 · 2 min · 223 words

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

2026-04-21 · updated 2026-07-24 · 2 min · 348 words

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

2026-04-21 · updated 2026-07-24 · 3 min · 483 words

ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection

2026-04-21 · updated 2026-07-24 · 2 min · 385 words

Incremental learning for audio classification with Hebbian Deep Neural Networks

2026-04-21 · updated 2026-07-24 · 2 min · 280 words

Latent Fourier Transform

2026-04-21 · updated 2026-07-24 · 2 min · 342 words

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

2026-04-21 · updated 2026-07-24 · 2 min · 391 words

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

2026-04-21 · updated 2026-07-24 · 3 min · 472 words

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

2026-04-21 · updated 2026-07-24 · 2 min · 284 words

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

2026-04-21 · updated 2026-07-24 · 2 min · 303 words

Neural Encoding Detection is Not All You Need for Synthetic Speech Detection

2026-04-21 · updated 2026-07-24 · 2 min · 263 words

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

2026-04-21 · updated 2026-07-24 · 2 min · 257 words

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

2026-04-21 · updated 2026-07-24 · 2 min · 271 words

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

2026-04-21 · updated 2026-07-24 · 3 min · 617 words

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

2026-04-21 · updated 2026-07-24 · 2 min · 370 words

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

2026-04-21 · updated 2026-07-24 · 1 min · 187 words

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

2026-04-21 · updated 2026-07-24 · 2 min · 276 words

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

2026-04-21 · updated 2026-07-24 · 2 min · 421 words

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

2026-04-21 · updated 2026-07-24 · 2 min · 321 words

Where Do Self-Supervised Speech Models Become Unfair?

2026-04-21 · updated 2026-07-24 · 1 min · 166 words

Speech/Music/Audio Paper Digest 2026-04-21

2026-04-21 · updated 2026-07-24 · 13 min · 2659 words

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

2026-04-20 · updated 2026-07-24 · 2 min · 386 words

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

2026-04-20 · updated 2026-07-24 · 2 min · 225 words

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

2026-04-20 · updated 2026-07-24 · 3 min · 447 words

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

2026-04-20 · updated 2026-07-24 · 3 min · 528 words

BlasBench: An Open Benchmark for Irish Speech Recognition

2026-04-20 · updated 2026-07-24 · 3 min · 435 words

Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

2026-04-20 · updated 2026-07-24 · 2 min · 388 words

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

2026-04-20 · updated 2026-07-24 · 3 min · 439 words

Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

2026-04-20 · updated 2026-07-24 · 2 min · 372 words

Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

2026-04-20 · updated 2026-07-24 · 3 min · 526 words

HARNESS: Lightweight Distilled Arabic Speech Foundation Models

2026-04-20 · updated 2026-07-24 · 4 min · 779 words

Hierarchical Codec Diffusion for Video-to-Speech Generation

2026-04-20 · updated 2026-07-24 · 6 min · 1219 words

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

2026-04-20 · updated 2026-07-24 · 3 min · 588 words

Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

2026-04-20 · updated 2026-07-24 · 2 min · 374 words

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-04-20 · updated 2026-07-24 · 2 min · 388 words

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

2026-04-20 · updated 2026-07-24 · 6 min · 1114 words

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

2026-04-20 · updated 2026-07-24 · 2 min · 377 words

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

2026-04-20 · updated 2026-07-24 · 2 min · 238 words

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

2026-04-20 · updated 2026-07-24 · 1 min · 163 words

Qwen3.5-Omni Technical Report

2026-04-20 · updated 2026-07-24 · 2 min · 424 words

Spatial-Aware Conditioned Fusion for Audio-Visual Navigation

2026-04-20 · updated 2026-07-24 · 4 min · 761 words

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

2026-04-20 · updated 2026-07-24 · 5 min · 999 words

The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

2026-04-20 · updated 2026-07-24 · 2 min · 402 words

TinyMU: A Compact Audio-Language Model for Music Understanding

2026-04-20 · updated 2026-07-24 · 3 min · 611 words

VoxMind: An End-to-End Agentic Spoken Dialogue System

2026-04-20 · updated 2026-07-24 · 5 min · 909 words

Speech/Music/Audio Paper Digest 2026-04-20

2026-04-20 · updated 2026-07-24 · 10 min · 2068 words

A Manual Bar-by-Bar Tempo Measurement Protocol for Polyphonic Chamber Music Recordings: Design, Validation, and Application to Beethoven’s Piano and Cello Sonatas

2026-04-19 · updated 2026-07-24 · 2 min · 253 words

Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

2026-04-19 · updated 2026-07-24 · 2 min · 423 words

An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

2026-04-19 · updated 2026-07-24 · 2 min · 249 words

Audio Source Separation in Reverberant Environments using $β$-divergence based Nonnegative Factorization

2026-04-19 · updated 2026-07-24 · 1 min · 123 words

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

2026-04-19 · updated 2026-07-24 · 2 min · 314 words

AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

2026-04-19 · updated 2026-07-24 · 2 min · 300 words

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

2026-04-19 · updated 2026-07-24 · 2 min · 237 words

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

2026-04-19 · updated 2026-07-24 · 2 min · 325 words

Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset

2026-04-19 · updated 2026-07-24 · 2 min · 294 words

Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals

2026-04-19 · updated 2026-07-24 · 1 min · 189 words

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

2026-04-19 · updated 2026-07-24 · 3 min · 517 words

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

2026-04-19 · updated 2026-07-24 · 2 min · 370 words

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing

2026-04-19 · updated 2026-07-24 · 3 min · 482 words

Diffusion Language Models for Speech Recognition

2026-04-19 · updated 2026-07-24 · 2 min · 253 words

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

2026-04-19 · updated 2026-07-24 · 2 min · 273 words

Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning

2026-04-19 · updated 2026-07-24 · 2 min · 385 words

Enhancing time-frequency resolution with optimal transport and barycentric fusion of multiple spectrogram

2026-04-19 · updated 2026-07-24 · 3 min · 508 words

Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models

2026-04-19 · updated 2026-07-24 · 2 min · 234 words

Four Decades of Digital Waveguides

2026-04-19 · updated 2026-07-24 · 1 min · 190 words

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

2026-04-19 · updated 2026-07-24 · 2 min · 289 words

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

2026-04-19 · updated 2026-07-24 · 3 min · 525 words

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

2026-04-19 · updated 2026-07-24 · 3 min · 430 words

Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

2026-04-19 · updated 2026-07-24 · 2 min · 388 words

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

2026-04-19 · updated 2026-07-24 · 2 min · 258 words

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

2026-04-19 · updated 2026-07-24 · 2 min · 339 words

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

2026-04-19 · updated 2026-07-24 · 2 min · 366 words

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

2026-04-19 · updated 2026-07-24 · 2 min · 351 words

Room compensation for loudspeaker reproduction using a supporting source

2026-04-19 · updated 2026-07-24 · 2 min · 225 words

Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

2026-04-19 · updated 2026-07-24 · 2 min · 304 words

SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion

2026-04-19 · updated 2026-07-24 · 2 min · 401 words

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

2026-04-19 · updated 2026-07-24 · 2 min · 341 words

StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection

2026-04-19 · updated 2026-07-24 · 2 min · 297 words

TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants

2026-04-19 · updated 2026-07-24 · 1 min · 128 words

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

2026-04-19 · updated 2026-07-24 · 3 min · 531 words

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

2026-04-19 · updated 2026-07-24 · 2 min · 387 words

Transformer Based Machine Fault Detection From Audio Input

2026-04-19 · updated 2026-07-24 · 1 min · 100 words

UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations

2026-04-19 · updated 2026-07-24 · 3 min · 580 words

VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark

2026-04-19 · updated 2026-07-24 · 3 min · 444 words

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

2026-04-19 · updated 2026-07-24 · 1 min · 177 words

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

2026-04-19 · updated 2026-07-24 · 2 min · 284 words

Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

2026-04-19 · updated 2026-07-24 · 2 min · 376 words

Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

2026-04-19 · updated 2026-07-24 · 3 min · 503 words

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

2026-04-19 · updated 2026-07-24 · 2 min · 371 words

Speech/Music/Audio Paper Digest 2026-04-19

2026-04-19 · updated 2026-07-24 · 15 min · 3104 words

Speech/Music/Audio Paper Digest 2026-04-18

2026-04-18 · updated 2026-07-24 · 43 min · 9080 words
