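Speech/Music/Audio Paper Digest 2026-05-23
123 papers analyzed in total
⚡ Today's Overview
📥 Fetched 123 papers → 🔬 In-depth analysis complete
🏷️ Trending Topics
| Topic | Count | Distribution |
|---|---|---|
| #** | 4 papers | ████ |
📊 Paper Score Leaderboard (123 papers, sorted by score, descending)
📋 Paper List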
🥇 INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments
🔥 8.5/10 | top 25% | arXiv
🥈 VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
🔥 8.3/10 | top 25% | arXiv
🥉 CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
✅ 7.2/10 | top 50% | arXiv
4. Language Model Augmented Semi-Supervised Statistical Inference
🔥 8.2/10 | top 25% | arXiv
5. DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing
🔥 8.2/10 | top 25% | arXiv
6. Abstraction Induces the Brain Alignment of Language and Speech Models
🔥 8.0/10 | top 25% | arXiv
7. Alethia: a Foundational Encoder for Voice Deepfakes
✅ 7.5/10 | top 25% | arXiv
8. OmniDenseCap: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
🔥 8.0/10 | top 25% | arXiv
9. FoeGlass: When Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors
🔥 8.0/10 | top 25% | arXiv
10. E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
🔥 8.0/10 | top 25% | arXiv
11. BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
🔥 8.0/10 | top 25% | arXiv
12. Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
✅ 7.8/10 | top 25% | arXiv
13. DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
✅ 7.8/10 | top 25% | arXiv
14. Real-World Unsupervised Models Generalize to Predict Brain Responses to Out-of-Distribution Stimuli
✅ 7.8/10 | top 25% | arXiv
15. AudioMosaic: Contrastive Masked Audio Representation Learning
✅ 7.5/10 | top 25% | arXiv
16. Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment
✅ 7.5/10 | top 25% | arXiv
17. LynX: Token Interface Alignment for Video+X LLMs
✅ 7.5/10 | top 25% | arXiv
18. Spherical Procrustes Alignment for Reliable Medical Audio Diagnosis
✅ 7.5/10 | top 25% | arXiv
19. MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
✅ 7.5/10 | top 25% | arXiv
20. Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis
✅ 7.5/10 | top 25% | arXiv
21. LightAVSeg: Lightweight Audio-Visual Segmentation
✅ 7.5/10 | top 25% | arXiv
22. Robust Signal Enhancement via Fractional Detail Views and Knowledge Guided Multi-view Fusion
✅ 7.5/10 | top 25% | arXiv
23. EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs
✅ 7.5/10 | top 25% | arXiv
24. Long Grounded Thoughts: Synthesizing Grounded Visual Problems and Distilling Reasoning Chains at Scale
✅ 7.5/10 | top 25% | arXiv
25. OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
✅ 7.5/10 | top 25% | arXiv
26. Ariadne’s Thread of LipSync: Unraveling Forgeries via Inconsistency between Lip Motions and Head Poses
✅ 7.5/10 | top 25% | arXiv
27. AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs
✅ 7.5/10 | top 25% | arXiv
28. Simultaneous Speech-to-Speech Translation Without Aligned Data
🔥 8.2/10 | top 25% | arXiv
29. PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
✅ 7.5/10 | top 25% | arXiv
30. OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
✅ 7.5/10 | top 25% | arXiv
31. Speech-Audio Compositional Attacks on Multimodal LLMs and Their Defense with SALMONN-Guard
✅ 7.5/10 | top 25% | arXiv
32. Convex Low-resource Accent-Robust Language Detection in Speech Recognition
✅ 7.5/10 | top 25% | arXiv
33. PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
✅ 7.5/10 | top 25% | arXiv
34. Listening Through the Noise: Cauchy-Driven Diffusion Bridges for Robust Gastrointestinal Auscultation and Clinical Benchmarking
✅ 7.5/10 | top 25% | arXiv
35. Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy
✅ 7.5/10 | top 25% | arXiv
36. Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
✅ 7.5/10 | top 25% | arXiv
37. NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
✅ 7.5/10 | top 25% | arXiv
38. MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio
✅ 7.5/10 | top 25% | arXiv
39. Verifiable Multimodal Reasoning: Fact-level Attribution with Multimodal Sources
✅ 7.5/10 | top 25% | arXiv
40. MusicDET: Zero-Shot AI-Generated Music Detection
✅ 7.5/10 | top 25% | arXiv
41. PCRNet: Phase-aware Complex Refinement Network for EEG-based Auditory Attention Decoding
✅ 7.5/10 | top 25% | arXiv
42. SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering
✅ 7.5/10 | top 25% | arXiv
43. STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation
✅ 7.5/10 | top 25% | arXiv
44. Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
✅ 7.5/10 | top 25% | arXiv
45. AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
✅ 7.3/10 | top 50% | arXiv
46. Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models
✅ 7.3/10 | top 50% | arXiv
47. AVTrack: Audio-Visual Speaker Tracking in Complex Scenes
✅ 7.3/10 | top 50% | arXiv
48. Bioacoustic Geolocation: Species Sounds as Geographic Signals
✅ 7.2/10 | top 50% | arXiv
49. ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools — From Consensus Learning to Ambiguity-Driven Emotion Reasoning
✅ 7.2/10 | top 50% | arXiv
50. MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
✅ 7.2/10 | top 50% | arXiv
51. SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
✅ 7.2/10 | top 50% | arXiv
52. PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation
✅ 7.2/10 | top 50% | arXiv
53. Multimodal Latent Language Modeling with Next-Token Diffusion
✅ 7.2/10 | top 50% | arXiv
54. Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration
✅ 7.0/10 | top 50% | arXiv
55. AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech
✅ 7.0/10 | top 50% | arXiv
56. Optimality of FSQ tokens for continuous diffusion for categorical data with application to text-to-speech
✅ 7.0/10 | top 50% | arXiv
57. JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
✅ 7.0/10 | top 50% | arXiv
58. SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering
✅ 7.0/10 | top 50% | arXiv
59. VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion
✅ 7.0/10 | top 50% | arXiv
60. Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
✅ 7.0/10 | top 50% | arXiv
61. A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation
✅ 7.0/10 | top 50% | arXiv
62. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
🔥 8.5/10 | top 25% | arXiv
63. Learning Tight Rejection Boundaries without Negatives for Strict One-Class Audio Deepfake Detection
✅ 7.0/10 | top 50% | arXiv
64. Quaternion Self-Attention with Shared Scores
✅ 7.0/10 | top 50% | arXiv
65. Bridging Your Imagination with Audio-Video Generation via a Unified Director
✅ 7.0/10 | top 50% | arXiv
66. TextME: Bridging Unseen Modalities Through Text Descriptions
✅ 7.0/10 | top 50% | arXiv
67. ReGen: Hierarchical Multi-Prompt Representation Generation for Efficient Waveform Diffusion Models
✅ 7.0/10 | top 50% | arXiv
68. Polyphonia: Training-Free Context-Aware Music Editing with Acoustic-Informed Attention Calibration
✅ 7.0/10 | top 50% | arXiv
69. TMD-Bench: A Multi-Level Evaluation Paradigm for Music–Dance Co-Generation
✅ 7.0/10 | top 50% | arXiv
70. Omni-Perception Policy Optimization for Multimodal Emotion Reasoning
✅ 7.0/10 | top 50% | arXiv
71. Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models
✅ 7.0/10 | top 50% | arXiv
72. AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
✅ 7.0/10 | top 50% | arXiv
73. Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox
✅ 6.9/10 | top 50% | arXiv
74. From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection
✅ 6.8/10 | top 50% | arXiv
75. Multiple Choice Learning of Low-Rank Adapters for Language Modeling
✅ 6.8/10 | top 50% | arXiv
76. Multimodal Fusion via Self-Consistent Task-Gradient Fields
✅ 6.8/10 | top 50% | arXiv
77. Position: Beyond Text: The Text-Centric Bias in Foundation Models Must Be Revisited for a Speech-First Future
✅ 6.8/10 | top 50% | arXiv
78. MetaBio: Learning from metadata for bioacoustics foundation models
✅ 6.5/10 | top 50% | arXiv
79. Any-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
✅ 6.5/10 | top 50% | arXiv
80. SAM Audio: Segment Anything in Audio
✅ 6.5/10 | top 50% | arXiv
81. CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering
✅ 6.5/10 | top 50% | arXiv
82. HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection
✅ 6.5/10 | top 50% | arXiv
83. Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits
✅ 6.5/10 | top 50% | arXiv
84. Hearing Without Noticing? Attention-Aware Stealthy Black-box Adversarial Audio Attacks
✅ 6.5/10 | top 50% | arXiv
85. Two-dimensional quantization for geometry-aware audio coding
✅ 6.5/10 | top 50% | arXiv
86. SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
87. REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
✅ 6.5/10 | top 50% | arXiv
88. AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning
✅ 6.5/10 | top 50% | arXiv
89. Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling
✅ 6.5/10 | top 50% | arXiv
90. Group Cognition Learning: Making Everything Better Through Controlled Two-Stage Agents Collaboration
✅ 6.5/10 | top 50% | arXiv
91. Rethinking Attention in Spiking Transformers: Overcoming Density Bias with Set Similarity
✅ 6.5/10 | top 50% | arXiv
92. T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
✅ 6.5/10 | top 50% | arXiv
93. S3Audio: Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
✅ 6.5/10 | top 50% | arXiv
94. Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
✅ 6.5/10 | top 50% | arXiv
95. BAT: Better Audio Transformer Guided by Convex Gated Probing
✅ 6.5/10 | top 50% | arXiv
96. AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
✅ 6.5/10 | top 50% | arXiv
97. CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks
✅ 6.5/10 | top 50% | arXiv
98. Neural-Inspired Modeling of Auditory Selection and Compensation for Audio-Visual Speech Separation
✅ 6.5/10 | top 50% | arXiv
99. FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
✅ 6.5/10 | top 50% | arXiv
100. ProactiveLLM: Learning Active Interaction for Streaming Large Language Models
✅ 6.0/10 | top 50% | arXiv
101. video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM
✅ 6.0/10 | top 50% | arXiv
102. Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language
✅ 6.0/10 | top 50% | arXiv
103. Scaling Transformers for End-to-End Discrete Audio Tokenization
✅ 6.0/10 | top 50% | arXiv
104. Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability
✅ 6.5/10 | top 50% | arXiv
105. Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
📝 5.8/10 | top 50% | arXiv
106. Probing Cross-modal Information Hubs in Audio-Visual LLMs
📝 5.5/10 | top 50% | arXiv
107. OmniShow: Orchestrating Multimodal Conditions for Human-Object Interaction Video Generation
📝 5.5/10 | top 50% | arXiv
108. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
✅ 6.0/10 | top 50% | arXiv
109. PHALAR: Phasors for Learned Musical Audio Representations
📝 5.5/10 | top 50% | arXiv
110. Scaling Laws in Model Fine-tuning for Audio DeepFake Detection
📝 5.0/10 | bottom 50% | arXiv
111. PRIM: Cooperative Dynamic Token Compression for Efficient Large Multimodal Models
📝 4.8/10 | bottom 50% | arXiv
112. Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition
📝 4.5/10 | bottom 50% | arXiv
113. From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping
📝 4.3/10 | bottom 50% | arXiv
114. SONAR: Spectral‑Contrastive Audio Residuals for Generalizable Deepfake Detection
📝 4.0/10 | bottom 50% | arXiv
115. MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
📝 3.8/10 | bottom 50% | arXiv
116. STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits
📝 3.5/10 | bottom 50% | arXiv
117. WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention
📝 3.5/10 | bottom 50% | arXiv
118. τ-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains
📝 3.5/10 | bottom 50% | arXiv
119. FakeWorld 1.0: An Omni modal Benchmark for Fake Media and Content
📝 3.5/10 | bottom 50% | arXiv
120. LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues
📝 3.5/10 | bottom 50% | arXiv
121. IVQ: Structured and Lightweight Vector Quantization via Binary Hierarchical Composition Inspired by IChing
📝 3.2/10 | bottom 50% | arXiv
122. MFCL Audio: An Audio Function Calling Evaluation for Large Language Models
📝 3.0/10 | bottom 50% | arXiv
123. Position: Towards Responsible Evaluation for Text-to-Speech
📝 2.6/10 | bottom 50% | arXiv