Alignment

Speech/Music/Audio Paper Digest 2026-05-23

123 papers analyzed in total

⚡ Today's Overview

📥 123 papers fetched → 🔬 In-depth analysis complete

🏷️ Trending Directions

| Direction | Count | Distribution |
| #** | 4 papers | ████ |

📊 Paper Score Ranking (123 papers, sorted by score in descending order)

| Rank | Paper | Score | Tier | Main Task |
| 🥇 | INFER: Learning Implicit Neural Frequency Response Fiel | 8.5 | Top 25% | - |
| 🥈 | VocSim A Training-free Benchmark for Zero-shot Content | 8.3 | Top 25% | - |
| 🥉 | CMI-RewardBench: Evaluating Music Reward Models with Co | 8.2 | Top 25% | - |
| 4 | Language Model Augmented Semi-Supervised Statistical In | 8.2 | Top 25% | - |
| 5 | DiscoForcing: A Unified Framework for Real-Time Audio-D | 8.2 | Top 25% | - |
| 6 | Abstraction Induces the Brain Alignment of Language and | 8.0 | Top 25% | #** |
| 7 | Alethia: a Foundational Encoder for Voice Deepfakes | 8.0 | Top 25% | - |
| 8 | OmniDenseCap: Scripting Multi-Scene Videos with Time-Aw | 8.0 | Top 25% | - |
| 9 | FoeGlass: When Simple In-Context Learning Is Enough for | 8.0 | Top 25% | - |
| 10 | E-VAds: An E-commerce Short Videos Understanding Benchm | 8.0 | Top 25% | - |
| 11 | BEAT: Tokenizing and Generating Symbolic Music by Unifo | 8.0 | Top 25% | - |
| 12 | Pianist Transformer: Towards Expressive Piano Performan | 7.8 | Top 25% | - |
| 13 | DreamID-Omni: Unified Framework for Controllable Human- | 7.8 | Top 25% | - |
| 14 | Real-World Unsupervised Models Generalize to Predict Br | 7.8 | Top 25% | - |
| 15 | AudioMosaic: Contrastive Masked Audio Representation Le | 7.5 | Top 25% | - |
| 16 | Self-Guidance: Enhancing Neural Codecs via Decoder Mani | 7.5 | Top 25% | - |
| 17 | LynX: Token Interface Alignment for Video+X LLMs | 7.5 | Top 25% | #** |
| 18 | Spherical Procrustes Alignment for Reliable Medical Aud | 7.5 | Top 25% | - |
| 19 | MoST: Mixing Speech and Text with Modality-Aware Mixtur | 7.5 | Top 25% | - |
| 20 | Self-Supervised Flow Matching for Scalable Multi-Modal | 7.5 | Top 25% | - |
| 21 | LightAVSeg: Lightweight Audio-Visual Segmentation | 7.5 | Top 25% | - |
| 22 | Robust Signal Enhancement via Fractional Detail Views a | 7.5 | Top 25% | - |
| 23 | EchoingPixels: Aliasing-Resistant Joint Token Reduction | 7.5 | Top 25% | - |
| 24 | Long Grounded Thoughts: Synthesizing Grounded Visual Pr | 7.5 | Top 25% | - |
| 25 | OmniVideo-R1: Reinforcing Audio-visual Reasoning with Q | 7.5 | Top 25% | - |
| 26 | Ariadne’s Thread of LipSync: Unraveling Forgeries via I | 7.5 | Top 25% | - |
| 27 | AVI-Bench: Toward Human-like Audio-Visual Intelligence | 7.5 | Top 25% | - |
| 28 | Simultaneous Speech-to-Speech Translation Without Align | 7.5 | Top 25% | - |
| 29 | PhoStream: Benchmarking Real-World Streaming for Omnimo | 7.5 | Top 25% | - |
| 30 | OmniSIFT: Modality-Asymmetric Token Compression for Eff | 7.5 | Top 25% | - |
| 31 | Speech-Audio Compositional Attacks on Multimodal LLMs a | 7.5 | Top 25% | - |
| 32 | Convex Low-resource Accent-Robust Language Detection in | 7.5 | Top 25% | #** |
| 33 | PhaseCoder: Microphone Geometry-Agnostic Spatial Audio | 7.5 | Top 25% | - |
| 34 | Listening Through the Noise: Cauchy-Driven Diffusion Br | 7.5 | Top 25% | - |
| 35 | Dual-View Predictive Diffusion: Lightweight Speech Enha | 7.5 | Top 25% | - |
| 36 | Stream RAG: Instant and Accurate Spoken Dialogue System | 7.5 | Top 25% | - |
| 37 | NAACA: Training-Free NeuroAuditory Attentive Cognitive | 7.5 | Top 25% | - |
| 38 | MedMosaic: A Challenging Large Scale Benchmark of Diver | 7.5 | Top 25% | - |
| 39 | Verifiable Multimodal Reasoning: Fact-level Attribution | 7.5 | Top 25% | - |
| 40 | MusicDET: Zero-Shot AI-Generated Music Detection | 7.5 | Top 25% | - |
| 41 | PCRNet: Phase-aware Complex Refinement Network for EEG- | 7.5 | Top 25% | - |
| 42 | SARSteer: Safeguarding Large Audio Language Models via | 7.5 | Top 25% | - |
| 43 | STAR-VAE: Structured Topology-Aware Regularization for | 7.5 | Top 25% | - |
| 44 | Hidden in Plain Tokens: Simply Robust, Gradient-Free Wa | 7.5 | Top 25% | - |
| 45 | AVGen-Bench: A Task-Driven Benchmark for Multi-Granular | 7.3 | Top 50% | - |
| 46 | Bridging the Stability-Expressivity Gap: Synthetic Data | 7.3 | Top 50% | - |
| 47 | AVTrack: Audio-Visual Speaker Tracking in Complex Scene | 7.3 | Top 50% | - |
| 48 | Bioacoustic Geolocation: Species Sounds as Geographic S | 7.2 | Top 50% | - |
| 49 | ADEPT: RL-Aligned Agentic Decoding of Emotion via Evide | 7.2 | Top 50% | - |
| 50 | MECAT: A Multi-Experts Constructed Benchmark for Fine-G | 7.2 | Top 50% | - |
| 51 | SPEAR: A Unified SSL Framework for Learning Speech and | 7.2 | Top 50% | - |
| 52 | PADS-TAL: Padding-Annealed Diffusion Sampling in Text-A | 7.2 | Top 50% | - |
| 53 | Multimodal Latent Language Modeling with Next-Token Dif | 7.2 | Top 50% | - |
| 54 | Query-Based Asymmetric Modeling with Decoupled Input–Ou | 7.0 | Top 50% | - |
| 55 | AgentSteerTTS: A Multi-Agent Closed-Loop Framework for | 7.0 | Top 50% | - |
| 56 | Optimality of FSQ tokens for continuous diffusion for c | 7.0 | Top 50% | - |
| 57 | JAEGER: Joint 3D Audio-Visual Grounding and Reasoning i | 7.0 | Top 50% | - |
| 58 | SonicMaster: Towards Controllable All-in-One Music Rest | 7.0 | Top 50% | - |
| 59 | VIBE: Disentangling Social Dynamics via Kinematics-Info | 7.0 | Top 50% | - |
| 60 | Reasoning LLM Improves Speaker Recognition in Long-form | 7.0 | Top 50% | - |
| 61 | A Semantically Consistent Dataset for Data-Efficient Qu | 7.0 | Top 50% | - |
| 62 | The Silent Thought: Modeling Internal Cognition in Full | 7.0 | Top 50% | - |
| 63 | Learning Tight Rejection Boundaries without Negatives f | 7.0 | Top 50% | - |
| 64 | Quaternion Self-Attention with Shared Scores | 7.0 | Top 50% | - |
| 65 | Bridging Your Imagination with Audio-Video Generation v | 7.0 | Top 50% | - |
| 66 | TextME: Bridging Unseen Modalities Through Text Descrip | 7.0 | Top 50% | - |
| 67 | ReGen: Hierarchical Multi-Prompt Representation Generat | 7.0 | Top 50% | - |
| 68 | Polyphonia: Training-Free Context-Aware Music Editing w | 7.0 | Top 50% | - |
| 69 | TMD-Bench: A Multi-Level Evaluation Paradigm for Music– | 7.0 | Top 50% | - |
| 70 | Omni-Perception Policy Optimization for Multimodal Emot | 7.0 | Top 50% | - |
| 71 | Acoustic Interference: A New Paradigm Weaponizing Acous | 7.0 | Top 50% | - |
| 72 | AudioChat: Unified Audio Storytelling, Editing, and Und | 7.0 | Top 50% | - |
| 73 | Do Audio LLMs Listen or Read? Analyzing and Mitigating | 6.9 | Top 50% | - |
| 74 | From Talking to Singing: A New Challenge for Audio-Visu | 6.8 | Top 50% | - |
| 75 | Multiple Choice Learning of Low-Rank Adapters for Langu | 6.8 | Top 50% | - |
| 76 | Multimodal Fusion via Self-Consistent Task-Gradient Fie | 6.8 | Top 50% | - |
| 77 | Position: Beyond Text The Text-Centric Bias in Founda | 6.8 | Top 50% | - |
| 78 | MetaBio: Learning from metadata for bioacoustics founda | 6.5 | Top 50% | - |
| 79 | Any-Diffusion: Unified Multimodal Understanding and Gen | 6.5 | Top 50% | - |
| 80 | SAM Audio: Segment Anything in Audio | 6.5 | Top 50% | #** |
| 81 | CoCoEmo: Composable and Controllable Human-Like Emotion | 6.5 | Top 50% | - |
| 82 | HyperPotter: Spell the Charm of High-Order Interactions | 6.5 | Top 50% | - |
| 83 | Joint Enhancement and Classification using Coupled Diff | 6.5 | Top 50% | - |
| 84 | Hearing Without Noticing? Attention-Aware Stealthy Blac | 6.5 | Top 50% | - |
| 85 | Two-dimensional quantization for geometry-aware audio c | 6.5 | Top 50% | - |
| 86 | SALSA-V: Shortcut-Augmented Long-form Synchronized Audi | 6.5 | Top 50% | - |
| 87 | REST: Diffusion-based Real-time End-to-end Streaming Ta | 6.5 | Top 50% | - |
| 88 | AuTAgent: A Reinforcement Learning Framework for Tool-A | 6.5 | Top 50% | - |
| 89 | Characterizing the Predictive Impact of Modalities with | 6.5 | Top 50% | - |
| 90 | Group Cognition Learning: Making Everything Better Thro | 6.5 | Top 50% | - |
| 91 | Rethinking Attention in Spiking Transformers: Overcomin | 6.5 | Top 50% | - |
| 92 | T2AV-Compass: Towards Unified Evaluation for Text-to-Au | 6.5 | Top 50% | - |
| 93 | S3Audio: Towards Streaming Synchronized Spatial Audio G | 6.5 | Top 50% | - |
| 94 | Sparse Autoencoders for Interpretable Emotion Control i | 6.5 | Top 50% | - |
| 95 | BAT: Better Audio Transformer Guided by Convex Gated Pr | 6.5 | Top 50% | - |
| 96 | AG-REPA: Causal Layer Selection for Representation Alig | 6.5 | Top 50% | - |
| 97 | CoLA: Cross-Modal Low-rank Adaptation for Multimodal Do | 6.5 | Top 50% | - |
| 98 | Neural-Inspired Modeling of Auditory Selection and Comp | 6.5 | Top 50% | - |
| 99 | FutureOmni: Evaluating Future Forecasting from Omni-Mod | 6.5 | Top 50% | - |
| 100 | ProactiveLLM: Learning Active Interaction for Streaming | 6.0 | Top 50% | - |
| 101 | video-SALMONN S: Memory-Enhanced Streaming Audio-Visual | 6.0 | Top 50% | - |
| 102 | Zero-Shot Rankability: Revealing Latent Ordinal Structu | 6.0 | Top 50% | - |
| 103 | Scaling Transformers for End-to-End Discrete Audio Toke | 6.0 | Top 50% | - |
| 104 | Evaluating and Rewarding LALMs for Expressive Role-Play | 6.0 | Top 50% | - |
| 105 | Unlocking Speech–Text Compositional Powers: Instruction | 5.8 | Top 50% | - |
| 106 | Probing Cross-modal Information Hubs in Audio-Visual LL | 5.5 | Top 50% | - |
| 107 | OmniShow: Orchestrating Multimodal Conditions for Human | 5.5 | Top 50% | - |
| 108 | Sparse Tokens Suffice: Jailbreaking Audio Language Mode | 5.5 | Top 50% | - |
| 109 | PHALAR: Phasors for Learned Musical Audio Representatio | 5.5 | Top 50% | - |
| 110 | Scaling Laws in Model Fine-tuning for Audio DeepFake De | 5.0 | Bottom 50% | - |
| 111 | PRIM: Cooperative Dynamic Token Compression for Efficien | 4.8 | Bottom 50% | - |
| 112 | Towards Understanding Modality Interaction in Multimoda | 4.5 | Bottom 50% | - |
| 113 | From Inpainting to Editing: Unlocking Robust Mask-Free | 4.3 | Bottom 50% | - |
| 114 | SONAR: Spectral-Contrastive Audio Residuals for General | 4.0 | Bottom 50% | - |
| 115 | MoshiRAG: Asynchronous Knowledge Retrieval for Full-Dup | 3.8 | Bottom 50% | - |
| 116 | STARCaster: Spatio-Temporal AutoRegressive Video Diffus | 3.5 | Bottom 50% | - |
| 117 | WaveSSM: Multiscale State-Space Models for Non-stationa | 3.5 | Bottom 50% | - |
| 118 | \(\tau\)-Voice: Benchmarking Full-Duplex Voice Agents on | 3.5 | Bottom 50% | - |
| 119 | FakeWorld 1.0: An Omni modal Benchmark for Fake Media a | 3.5 | Bottom 50% | - |
| 120 | LALM-as-a-Judge: Benchmarking Large Audio-Language Mode | 3.5 | Bottom 50% | - |
| 121 | IVQ: Structured and Lightweight Vector Quantization via | 3.2 | Bottom 50% | - |
| 122 | MFCL Audio: An Audio Function Calling Evaluation for La | 3.0 | Bottom 50% | - |
| 123 | Position: Towards Responsible Evaluation for Text-to-Sp | 2.6 | Bottom 50% | - |

📋 Paper List

🥇 INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments
🔥 8.5/10 | Top 25% | arxiv ...