Speech/Music/Audio Paper Digest 2026-05-23

123 papers analyzed in total


⚡ Today's Overview

📥 123 papers fetched → 🔬 In-depth analysis complete

🏷️ Trending Topics

Topic | Count | Distribution
#** | 4 papers | ████

📊 Paper Score Leaderboard (123 papers, sorted by score in descending order)

Rank | Paper | Score | Tier | Main task
🥇 | INFER: Learning Implicit Neural Frequency Response Fiel | 8.5 | Top 25% | -
🥈 | VocSim: A Training-free Benchmark for Zero-shot Content | 8.3 | Top 25% | -
🥉 | CMI-RewardBench: Evaluating Music Reward Models with Co | 8.2 | Top 25% | -
4 | Language Model Augmented Semi-Supervised Statistical In | 8.2 | Top 25% | -
5 | DiscoForcing: A Unified Framework for Real-Time Audio-D | 8.2 | Top 25% | -
6 | Abstraction Induces the Brain Alignment of Language and | 8.0 | Top 25% | #**
7 | Alethia: a Foundational Encoder for Voice Deepfakes | 8.0 | Top 25% | -
8 | OmniDenseCap: Scripting Multi-Scene Videos with Time-Aw | 8.0 | Top 25% | -
9 | FoeGlass: When Simple In-Context Learning Is Enough for | 8.0 | Top 25% | -
10 | E-VAds: An E-commerce Short Videos Understanding Benchm | 8.0 | Top 25% | -
11 | BEAT: Tokenizing and Generating Symbolic Music by Unifo | 8.0 | Top 25% | -
12 | Pianist Transformer: Towards Expressive Piano Performan | 7.8 | Top 25% | -
13 | DreamID-Omni: Unified Framework for Controllable Human- | 7.8 | Top 25% | -
14 | Real-World Unsupervised Models Generalize to Predict Br | 7.8 | Top 25% | -
15 | AudioMosaic: Contrastive Masked Audio Representation Le | 7.5 | Top 25% | -
16 | Self-Guidance: Enhancing Neural Codecs via Decoder Mani | 7.5 | Top 25% | -
17 | LynX: Token Interface Alignment for Video+X LLMs | 7.5 | Top 25% | #**
18 | Spherical Procrustes Alignment for Reliable Medical Aud | 7.5 | Top 25% | -
19 | MoST: Mixing Speech and Text with Modality-Aware Mixtur | 7.5 | Top 25% | -
20 | Self-Supervised Flow Matching for Scalable Multi-Modal | 7.5 | Top 25% | -
21 | LightAVSeg: Lightweight Audio-Visual Segmentation | 7.5 | Top 25% | -
22 | Robust Signal Enhancement via Fractional Detail Views a | 7.5 | Top 25% | -
23 | EchoingPixels: Aliasing-Resistant Joint Token Reduction | 7.5 | Top 25% | -
24 | Long Grounded Thoughts: Synthesizing Grounded Visual Pr | 7.5 | Top 25% | -
25 | OmniVideo-R1: Reinforcing Audio-visual Reasoning with Q | 7.5 | Top 25% | -
26 | Ariadne’s Thread of LipSync: Unraveling Forgeries via I | 7.5 | Top 25% | -
27 | AVI-Bench: Toward Human-like Audio-Visual Intelligence | 7.5 | Top 25% | -
28 | Simultaneous Speech-to-Speech Translation Without Align | 7.5 | Top 25% | -
29 | PhoStream: Benchmarking Real-World Streaming for Omnimo | 7.5 | Top 25% | -
30 | OmniSIFT: Modality-Asymmetric Token Compression for Eff | 7.5 | Top 25% | -
31 | Speech-Audio Compositional Attacks on Multimodal LLMs a | 7.5 | Top 25% | -
32 | Convex Low-resource Accent-Robust Language Detection in | 7.5 | Top 25% | #**
33 | PhaseCoder: Microphone Geometry-Agnostic Spatial Audio | 7.5 | Top 25% | -
34 | Listening Through the Noise: Cauchy-Driven Diffusion Br | 7.5 | Top 25% | -
35 | Dual-View Predictive Diffusion: Lightweight Speech Enha | 7.5 | Top 25% | -
36 | Stream RAG: Instant and Accurate Spoken Dialogue System | 7.5 | Top 25% | -
37 | NAACA: Training-Free NeuroAuditory Attentive Cognitive | 7.5 | Top 25% | -
38 | MedMosaic: A Challenging Large Scale Benchmark of Diver | 7.5 | Top 25% | -
39 | Verifiable Multimodal Reasoning: Fact-level Attribution | 7.5 | Top 25% | -
40 | MusicDET: Zero-Shot AI-Generated Music Detection | 7.5 | Top 25% | -
41 | PCRNet: Phase-aware Complex Refinement Network for EEG- | 7.5 | Top 25% | -
42 | SARSteer: Safeguarding Large Audio Language Models via | 7.5 | Top 25% | -
43 | STAR-VAE: Structured Topology-Aware Regularization for | 7.5 | Top 25% | -
44 | Hidden in Plain Tokens: Simply Robust, Gradient-Free Wa | 7.5 | Top 25% | -
45 | AVGen-Bench: A Task-Driven Benchmark for Multi-Granular | 7.3 | Top 50% | -
46 | Bridging the Stability-Expressivity Gap: Synthetic Data | 7.3 | Top 50% | -
47 | AVTrack: Audio-Visual Speaker Tracking in Complex Scene | 7.3 | Top 50% | -
48 | Bioacoustic Geolocation: Species Sounds as Geographic S | 7.2 | Top 50% | -
49 | ADEPT: RL-Aligned Agentic Decoding of Emotion via Evide | 7.2 | Top 50% | -
50 | MECAT: A Multi-Experts Constructed Benchmark for Fine-G | 7.2 | Top 50% | -
51 | SPEAR: A Unified SSL Framework for Learning Speech and | 7.2 | Top 50% | -
52 | PADS-TAL: Padding-Annealed Diffusion Sampling in Text-A | 7.2 | Top 50% | -
53 | Multimodal Latent Language Modeling with Next-Token Dif | 7.2 | Top 50% | -
54 | Query-Based Asymmetric Modeling with Decoupled Input–Ou | 7.0 | Top 50% | -
55 | AgentSteerTTS: A Multi-Agent Closed-Loop Framework for | 7.0 | Top 50% | -
56 | Optimality of FSQ tokens for continuous diffusion for c | 7.0 | Top 50% | -
57 | JAEGER: Joint 3D Audio-Visual Grounding and Reasoning i | 7.0 | Top 50% | -
58 | SonicMaster: Towards Controllable All-in-One Music Rest | 7.0 | Top 50% | -
59 | VIBE: Disentangling Social Dynamics via Kinematics-Info | 7.0 | Top 50% | -
60 | Reasoning LLM Improves Speaker Recognition in Long-form | 7.0 | Top 50% | -
61 | A Semantically Consistent Dataset for Data-Efficient Qu | 7.0 | Top 50% | -
62 | The Silent Thought: Modeling Internal Cognition in Full | 7.0 | Top 50% | -
63 | Learning Tight Rejection Boundaries without Negatives f | 7.0 | Top 50% | -
64 | Quaternion Self-Attention with Shared Scores | 7.0 | Top 50% | -
65 | Bridging Your Imagination with Audio-Video Generation v | 7.0 | Top 50% | -
66 | TextME: Bridging Unseen Modalities Through Text Descrip | 7.0 | Top 50% | -
67 | ReGen: Hierarchical Multi-Prompt Representation Generat | 7.0 | Top 50% | -
68 | Polyphonia: Training-Free Context-Aware Music Editing w | 7.0 | Top 50% | -
69 | TMD-Bench: A Multi-Level Evaluation Paradigm for Music– | 7.0 | Top 50% | -
70 | Omni-Perception Policy Optimization for Multimodal Emot | 7.0 | Top 50% | -
71 | Acoustic Interference: A New Paradigm Weaponizing Acous | 7.0 | Top 50% | -
72 | AudioChat: Unified Audio Storytelling, Editing, and Und | 7.0 | Top 50% | -
73 | Do Audio LLMs Listen or Read? Analyzing and Mitigating | 6.9 | Top 50% | -
74 | From Talking to Singing: A New Challenge for Audio-Visu | 6.8 | Top 50% | -
75 | Multiple Choice Learning of Low-Rank Adapters for Langu | 6.8 | Top 50% | -
76 | Multimodal Fusion via Self-Consistent Task-Gradient Fie | 6.8 | Top 50% | -
77 | Position: Beyond Text The Text-Centric Bias in Founda | 6.8 | Top 50% | -
78 | MetaBio: Learning from metadata for bioacoustics founda | 6.5 | Top 50% | -
79 | Any-Diffusion: Unified Multimodal Understanding and Gen | 6.5 | Top 50% | -
80 | SAM Audio: Segment Anything in Audio | 6.5 | Top 50% | #**
81 | CoCoEmo: Composable and Controllable Human-Like Emotion | 6.5 | Top 50% | -
82 | HyperPotter: Spell the Charm of High-Order Interactions | 6.5 | Top 50% | -
83 | Joint Enhancement and Classification using Coupled Diff | 6.5 | Top 50% | -
84 | Hearing Without Noticing? Attention-Aware Stealthy Blac | 6.5 | Top 50% | -
85 | Two-dimensional quantization for geometry-aware audio c | 6.5 | Top 50% | -
86 | SALSA-V: Shortcut-Augmented Long-form Synchronized Audi | 6.5 | Top 50% | -
87 | REST: Diffusion-based Real-time End-to-end Streaming Ta | 6.5 | Top 50% | -
88 | AuTAgent: A Reinforcement Learning Framework for Tool-A | 6.5 | Top 50% | -
89 | Characterizing the Predictive Impact of Modalities with | 6.5 | Top 50% | -
90 | Group Cognition Learning: Making Everything Better Thro | 6.5 | Top 50% | -
91 | Rethinking Attention in Spiking Transformers: Overcomin | 6.5 | Top 50% | -
92 | T2AV-Compass: Towards Unified Evaluation for Text-to-Au | 6.5 | Top 50% | -
93 | S3Audio: Towards Streaming Synchronized Spatial Audio G | 6.5 | Top 50% | -
94 | Sparse Autoencoders for Interpretable Emotion Control i | 6.5 | Top 50% | -
95 | BAT: Better Audio Transformer Guided by Convex Gated Pr | 6.5 | Top 50% | -
96 | AG-REPA: Causal Layer Selection for Representation Alig | 6.5 | Top 50% | -
97 | CoLA: Cross-Modal Low-rank Adaptation for Multimodal Do | 6.5 | Top 50% | -
98 | Neural-Inspired Modeling of Auditory Selection and Comp | 6.5 | Top 50% | -
99 | FutureOmni: Evaluating Future Forecasting from Omni-Mod | 6.5 | Top 50% | -
100 | ProactiveLLM: Learning Active Interaction for Streaming | 6.0 | Top 50% | -
101 | video-SALMONN S: Memory-Enhanced Streaming Audio-Visual | 6.0 | Top 50% | -
102 | Zero-Shot Rankability: Revealing Latent Ordinal Structu | 6.0 | Top 50% | -
103 | Scaling Transformers for End-to-End Discrete Audio Toke | 6.0 | Top 50% | -
104 | Evaluating and Rewarding LALMs for Expressive Role-Play | 6.0 | Top 50% | -
105 | Unlocking Speech–Text Compositional Powers: Instruction | 5.8 | Top 50% | -
106 | Probing Cross-modal Information Hubs in Audio-Visual LL | 5.5 | Top 50% | -
107 | OmniShow: Orchestrating Multimodal Conditions for Human | 5.5 | Top 50% | -
108 | Sparse Tokens Suffice: Jailbreaking Audio Language Mode | 5.5 | Top 50% | -
109 | PHALAR: Phasors for Learned Musical Audio Representatio | 5.5 | Top 50% | -
110 | Scaling Laws in Model Fine-tuning for Audio DeepFake De | 5.0 | Bottom 50% | -
111 | PRIM:Cooperative Dynamic Token Compression for Efficien | 4.8 | Bottom 50% | -
112 | Towards Understanding Modality Interaction in Multimoda | 4.5 | Bottom 50% | -
113 | From Inpainting to Editing: Unlocking Robust Mask-Free | 4.3 | Bottom 50% | -
114 | SONAR: Spectral-Contrastive Audio Residuals for General | 4.0 | Bottom 50% | -
115 | MoshiRAG: Asynchronous Knowledge Retrieval for Full-Dup | 3.8 | Bottom 50% | -
116 | STARCaster: Spatio-Temporal AutoRegressive Video Diffus | 3.5 | Bottom 50% | -
117 | WaveSSM: Multiscale State-Space Models for Non-stationa | 3.5 | Bottom 50% | -
118 | \(\tau\)-Voice: Benchmarking Full-Duplex Voice Agents on | 3.5 | Bottom 50% | -
119 | FakeWorld 1.0: An Omni modal Benchmark for Fake Media a | 3.5 | Bottom 50% | -
120 | LALM-as-a-Judge: Benchmarking Large Audio-Language Mode | 3.5 | Bottom 50% | -
121 | IVQ: Structured and Lightweight Vector Quantization via | 3.2 | Bottom 50% | -
122 | MFCL Audio: An Audio Function Calling Evaluation for La | 3.0 | Bottom 50% | -
123 | Position: Towards Responsible Evaluation for Text-to-Sp | 2.6 | Bottom 50% | -

📋 Paper List

🥇 INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments

🔥 8.5/10 | Top 25% | arxiv


🥈 VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

🔥 8.3/10 | Top 25% | arxiv


🥉 CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

🔥 8.2/10 | Top 25% | arxiv


4. Language Model Augmented Semi-Supervised Statistical Inference

🔥 8.2/10 | Top 25% | arxiv


5. DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

🔥 8.2/10 | Top 25% | arxiv


6. Abstraction Induces the Brain Alignment of Language and Speech Models

🔥 8.0/10 | Top 25% | arxiv


7. Alethia: a Foundational Encoder for Voice Deepfakes

🔥 8.0/10 | Top 25% | arxiv


8. OmniDenseCap: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

🔥 8.0/10 | Top 25% | arxiv


9. FoeGlass: When Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

🔥 8.0/10 | Top 25% | arxiv


10. E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

🔥 8.0/10 | Top 25% | arxiv


11. BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

🔥 8.0/10 | Top 25% | arxiv


12. Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

7.8/10 | Top 25% | arxiv


13. DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

7.8/10 | Top 25% | arxiv


14. Real-World Unsupervised Models Generalize to Predict Brain Responses to Out-of-Distribution Stimuli

7.8/10 | Top 25% | arxiv


15. AudioMosaic: Contrastive Masked Audio Representation Learning

7.5/10 | Top 25% | arxiv


16. Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

7.5/10 | Top 25% | arxiv


17. LynX: Token Interface Alignment for Video+X LLMs

7.5/10 | Top 25% | arxiv


18. Spherical Procrustes Alignment for Reliable Medical Audio Diagnosis

7.5/10 | Top 25% | arxiv


19. MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

7.5/10 | Top 25% | arxiv


20. Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

7.5/10 | Top 25% | arxiv


21. LightAVSeg: Lightweight Audio-Visual Segmentation

7.5/10 | Top 25% | arxiv


22. Robust Signal Enhancement via Fractional Detail Views and Knowledge Guided Multi-view Fusion

7.5/10 | Top 25% | arxiv


23. EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs

7.5/10 | Top 25% | arxiv


24. Long Grounded Thoughts: Synthesizing Grounded Visual Problems and Distilling Reasoning Chains at Scale

7.5/10 | Top 25% | arxiv


25. OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

7.5/10 | Top 25% | arxiv


26. Ariadne’s Thread of LipSync: Unraveling Forgeries via Inconsistency between Lip Motions and Head Poses

7.5/10 | Top 25% | arxiv


27. AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

7.5/10 | Top 25% | arxiv


28. Simultaneous Speech-to-Speech Translation Without Aligned Data

7.5/10 | Top 25% | arxiv


29. PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

7.5/10 | Top 25% | arxiv


30. OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

7.5/10 | Top 25% | arxiv


31. Speech-Audio Compositional Attacks on Multimodal LLMs and Their Defense with SALMONN-Guard

7.5/10 | Top 25% | arxiv


32. Convex Low-resource Accent-Robust Language Detection in Speech Recognition

7.5/10 | Top 25% | arxiv


33. PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs

7.5/10 | Top 25% | arxiv


34. Listening Through the Noise: Cauchy-Driven Diffusion Bridges for Robust Gastrointestinal Auscultation and Clinical Benchmarking

7.5/10 | Top 25% | arxiv


35. Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

7.5/10 | Top 25% | arxiv


36. Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

7.5/10 | Top 25% | arxiv


37. NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

7.5/10 | Top 25% | arxiv


38. MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

7.5/10 | Top 25% | arxiv


39. Verifiable Multimodal Reasoning: Fact-level Attribution with Multimodal Sources

7.5/10 | Top 25% | arxiv


40. MusicDET: Zero-Shot AI-Generated Music Detection

7.5/10 | Top 25% | arxiv


41. PCRNet: Phase-aware Complex Refinement Network for EEG-based Auditory Attention Decoding

7.5/10 | Top 25% | arxiv


42. SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering

7.5/10 | Top 25% | arxiv


43. STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation

7.5/10 | Top 25% | arxiv


44. Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

7.5/10 | Top 25% | arxiv


45. AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

7.3/10 | Top 50% | arxiv


46. Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

7.3/10 | Top 50% | arxiv


47. AVTrack: Audio-Visual Speaker Tracking in Complex Scenes

7.3/10 | Top 50% | arxiv


48. Bioacoustic Geolocation: Species Sounds as Geographic Signals

7.2/10 | Top 50% | arxiv


49. ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools — From Consensus Learning to Ambiguity-Driven Emotion Reasoning

7.2/10 | Top 50% | arxiv


50. MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

7.2/10 | Top 50% | arxiv


51. SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

7.2/10 | Top 50% | arxiv


52. PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation

7.2/10 | Top 50% | arxiv


53. Multimodal Latent Language Modeling with Next-Token Diffusion

7.2/10 | Top 50% | arxiv


54. Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration

7.0/10 | Top 50% | arxiv


55. AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

7.0/10 | Top 50% | arxiv


56. Optimality of FSQ tokens for continuous diffusion for categorical data with application to text-to-speech

7.0/10 | Top 50% | arxiv


57. JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

7.0/10 | Top 50% | arxiv


58. SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

7.0/10 | Top 50% | arxiv


59. VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion

7.0/10 | Top 50% | arxiv


60. Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

7.0/10 | Top 50% | arxiv


61. A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

7.0/10 | Top 50% | arxiv


62. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

7.0/10 | Top 50% | arxiv


63. Learning Tight Rejection Boundaries without Negatives for Strict One-Class Audio Deepfake Detection

7.0/10 | Top 50% | arxiv


64. Quaternion Self-Attention with Shared Scores

7.0/10 | Top 50% | arxiv


65. Bridging Your Imagination with Audio-Video Generation via a Unified Director

7.0/10 | Top 50% | arxiv


66. TextME: Bridging Unseen Modalities Through Text Descriptions

7.0/10 | Top 50% | arxiv


67. ReGen: Hierarchical Multi-Prompt Representation Generation for Efficient Waveform Diffusion Models

7.0/10 | Top 50% | arxiv


68. Polyphonia: Training-Free Context-Aware Music Editing with Acoustic-Informed Attention Calibration

7.0/10 | Top 50% | arxiv


69. TMD-Bench: A Multi-Level Evaluation Paradigm for Music–Dance Co-Generation

7.0/10 | Top 50% | arxiv


70. Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

7.0/10 | Top 50% | arxiv


71. Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

7.0/10 | Top 50% | arxiv


72. AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

7.0/10 | Top 50% | arxiv


73. Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

6.9/10 | Top 50% | arxiv


74. From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

6.8/10 | Top 50% | arxiv


75. Multiple Choice Learning of Low-Rank Adapters for Language Modeling

6.8/10 | Top 50% | arxiv


76. Multimodal Fusion via Self-Consistent Task-Gradient Fields

6.8/10 | Top 50% | arxiv


77. Position: Beyond Text The Text-Centric Bias in Foundation Models Must Be Revisited for a Speech-First Future

6.8/10 | Top 50% | arxiv


78. MetaBio: Learning from metadata for bioacoustics foundation models

6.5/10 | Top 50% | arxiv


79. Any-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

6.5/10 | Top 50% | arxiv


80. SAM Audio: Segment Anything in Audio

6.5/10 | Top 50% | arxiv


81. CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

6.5/10 | Top 50% | arxiv


82. HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

6.5/10 | Top 50% | arxiv


83. Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits

6.5/10 | Top 50% | arxiv


84. Hearing Without Noticing? Attention-Aware Stealthy Black-box Adversarial Audio Attacks

6.5/10 | Top 50% | arxiv


85. Two-dimensional quantization for geometry-aware audio coding

6.5/10 | Top 50% | arxiv


86. SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

6.5/10 | Top 50% | arxiv


87. REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

6.5/10 | Top 50% | arxiv


88. AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

6.5/10 | Top 50% | arxiv


89. Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

6.5/10 | Top 50% | arxiv


90. Group Cognition Learning: Making Everything Better Through Controlled Two-Stage Agents Collaboration

6.5/10 | Top 50% | arxiv


91. Rethinking Attention in Spiking Transformers: Overcoming Density Bias with Set Similarity

6.5/10 | Top 50% | arxiv


92. T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

6.5/10 | Top 50% | arxiv


93. S3Audio: Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

6.5/10 | Top 50% | arxiv


94. Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

6.5/10 | Top 50% | arxiv


95. BAT: Better Audio Transformer Guided by Convex Gated Probing

6.5/10 | Top 50% | arxiv


96. AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

6.5/10 | Top 50% | arxiv


97. CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks

6.5/10 | Top 50% | arxiv


98. Neural-Inspired Modeling of Auditory Selection and Compensation for Audio-Visual Speech Separation

6.5/10 | Top 50% | arxiv


99. FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

6.5/10 | Top 50% | arxiv


100. ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

6.0/10 | Top 50% | arxiv


101. video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

6.0/10 | Top 50% | arxiv


102. Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language

6.0/10 | Top 50% | arxiv


103. Scaling Transformers for End-to-End Discrete Audio Tokenization

6.0/10 | Top 50% | arxiv


104. Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

6.0/10 | Top 50% | arxiv


105. Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

📝 5.8/10 | Top 50% | arxiv


106. Probing Cross-modal Information Hubs in Audio-Visual LLMs

📝 5.5/10 | Top 50% | arxiv


107. OmniShow: Orchestrating Multimodal Conditions for Human-Object Interaction Video Generation

📝 5.5/10 | Top 50% | arxiv


108. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

📝 5.5/10 | Top 50% | arxiv


109. PHALAR: Phasors for Learned Musical Audio Representations

📝 5.5/10 | Top 50% | arxiv


110. Scaling Laws in Model Fine-tuning for Audio DeepFake Detection

📝 5.0/10 | Bottom 50% | arxiv


111. PRIM:Cooperative Dynamic Token Compression for Efficient Large Multimodal Models

📝 4.8/10 | Bottom 50% | arxiv


112. Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

📝 4.5/10 | Bottom 50% | arxiv


113. From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

📝 4.3/10 | Bottom 50% | arxiv


114. SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

📝 4.0/10 | Bottom 50% | arxiv


115. MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

📝 3.8/10 | Bottom 50% | arxiv


116. STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

📝 3.5/10 | Bottom 50% | arxiv


117. WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention

📝 3.5/10 | Bottom 50% | arxiv


118. \(\tau\)-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

📝 3.5/10 | Bottom 50% | arxiv


119. FakeWorld 1.0: An Omni modal Benchmark for Fake Media and Content

📝 3.5/10 | Bottom 50% | arxiv


120. LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

📝 3.5/10 | Bottom 50% | arxiv


121. IVQ: Structured and Lightweight Vector Quantization via Binary Hierarchical Composition Inspired by \(\textit{IChing}\)

📝 3.2/10 | Bottom 50% | arxiv


122. MFCL Audio: An Audio Function Calling Evaluation for Large Language Models

📝 3.0/10 | Bottom 50% | arxiv


123. Position: Towards Responsible Evaluation for Text-to-Speech

📝 2.6/10 | Bottom 50% | arxiv