Two-dimensional quantization for geometry-aware audio coding

📄 Two-dimensional quantization for geometry-aware audio coding ✅ 6.5/10 | Top 50% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 17 words

Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

📄 Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning 📝 5.8/10 | Top 50% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 22 words

Verifiable Multimodal Reasoning: Fact-level Attribution with Multimodal Sources

📄 Verifiable Multimodal Reasoning: Fact-level Attribution with Multimodal Sources ✅ 7.5/10 | Top 25% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 19 words

VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion

📄 VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion ✅ 7.0/10 | Top 50% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 22 words

video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

📄 video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM ✅ 6.0/10 | Top 50% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 17 words

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

📄 VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio 🔥 8.3/10 | Top 25% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 22 words

WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention

📄 WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention 📝 3.5/10 | Bottom 50% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 19 words

Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language

📄 Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language ✅ 6.0/10 | Top 50% | arxiv

2026-05-23 · Updated 2026-06-19 · 1 min · 24 words

Speech/Music/Audio Paper Digest 2026-05-23

Speech/Music/Audio Paper Digest 2026-05-23: 123 papers analyzed in total

⚡ Today's overview
📥 123 papers fetched → 🔬 in-depth analysis complete

🏷️ Hot topics
Topic | Count | Distribution
#** | 4 | ████

📊 Paper score ranking (123 papers, sorted by score, descending)
Rank | Paper | Score | Tier | Main task
🥇 | INFER: Learning Implicit Neural Frequency Response Fiel | 8.5 | Top 25% | -
🥈 | VocSim: A Training-free Benchmark for Zero-shot Content | 8.3 | Top 25% | -
🥉 | CMI-RewardBench: Evaluating Music Reward Models with Co | 8.2 | Top 25% | -
4 | Language Model Augmented Semi-Supervised Statistical In | 8.2 | Top 25% | -
5 | DiscoForcing: A Unified Framework for Real-Time Audio-D | 8.2 | Top 25% | -
6 | Abstraction Induces the Brain Alignment of Language and | 8.0 | Top 25% | #**
7 | Alethia: a Foundational Encoder for Voice Deepfakes | 8.0 | Top 25% | -
8 | OmniDenseCap: Scripting Multi-Scene Videos with Time-Aw | 8.0 | Top 25% | -
9 | FoeGlass: When Simple In-Context Learning Is Enough for | 8.0 | Top 25% | -
10 | E-VAds: An E-commerce Short Videos Understanding Benchm | 8.0 | Top 25% | -
11 | BEAT: Tokenizing and Generating Symbolic Music by Unifo | 8.0 | Top 25% | -
12 | Pianist Transformer: Towards Expressive Piano Performan | 7.8 | Top 25% | -
13 | DreamID-Omni: Unified Framework for Controllable Human- | 7.8 | Top 25% | -
14 | Real-World Unsupervised Models Generalize to Predict Br | 7.8 | Top 25% | -
15 | AudioMosaic: Contrastive Masked Audio Representation Le | 7.5 | Top 25% | -
16 | Self-Guidance: Enhancing Neural Codecs via Decoder Mani | 7.5 | Top 25% | -
17 | LynX: Token Interface Alignment for Video+X LLMs | 7.5 | Top 25% | #**
18 | Spherical Procrustes Alignment for Reliable Medical Aud | 7.5 | Top 25% | -
19 | MoST: Mixing Speech and Text with Modality-Aware Mixtur | 7.5 | Top 25% | -
20 | Self-Supervised Flow Matching for Scalable Multi-Modal | 7.5 | Top 25% | -
21 | LightAVSeg: Lightweight Audio-Visual Segmentation | 7.5 | Top 25% | -
22 | Robust Signal Enhancement via Fractional Detail Views a | 7.5 | Top 25% | -
23 | EchoingPixels: Aliasing-Resistant Joint Token Reduction | 7.5 | Top 25% | -
24 | Long Grounded Thoughts: Synthesizing Grounded Visual Pr | 7.5 | Top 25% | -
25 | OmniVideo-R1: Reinforcing Audio-visual Reasoning with Q | 7.5 | Top 25% | -
26 | Ariadne’s Thread of LipSync: Unraveling Forgeries via I | 7.5 | Top 25% | -
27 | AVI-Bench: Toward Human-like Audio-Visual Intelligence | 7.5 | Top 25% | -
28 | Simultaneous Speech-to-Speech Translation Without Align | 7.5 | Top 25% | -
29 | PhoStream: Benchmarking Real-World Streaming for Omnimo | 7.5 | Top 25% | -
30 | OmniSIFT: Modality-Asymmetric Token Compression for Eff | 7.5 | Top 25% | -
31 | Speech-Audio Compositional Attacks on Multimodal LLMs a | 7.5 | Top 25% | -
32 | Convex Low-resource Accent-Robust Language Detection in | 7.5 | Top 25% | #**
33 | PhaseCoder: Microphone Geometry-Agnostic Spatial Audio | 7.5 | Top 25% | -
34 | Listening Through the Noise: Cauchy-Driven Diffusion Br | 7.5 | Top 25% | -
35 | Dual-View Predictive Diffusion: Lightweight Speech Enha | 7.5 | Top 25% | -
36 | Stream RAG: Instant and Accurate Spoken Dialogue System | 7.5 | Top 25% | -
37 | NAACA: Training-Free NeuroAuditory Attentive Cognitive | 7.5 | Top 25% | -
38 | MedMosaic: A Challenging Large Scale Benchmark of Diver | 7.5 | Top 25% | -
39 | Verifiable Multimodal Reasoning: Fact-level Attribution | 7.5 | Top 25% | -
40 | MusicDET: Zero-Shot AI-Generated Music Detection | 7.5 | Top 25% | -
41 | PCRNet: Phase-aware Complex Refinement Network for EEG- | 7.5 | Top 25% | -
42 | SARSteer: Safeguarding Large Audio Language Models via | 7.5 | Top 25% | -
43 | STAR-VAE: Structured Topology-Aware Regularization for | 7.5 | Top 25% | -
44 | Hidden in Plain Tokens: Simply Robust, Gradient-Free Wa | 7.5 | Top 25% | -
45 | AVGen-Bench: A Task-Driven Benchmark for Multi-Granular | 7.3 | Top 50% | -
46 | Bridging the Stability-Expressivity Gap: Synthetic Data | 7.3 | Top 50% | -
47 | AVTrack: Audio-Visual Speaker Tracking in Complex Scene | 7.3 | Top 50% | -
48 | Bioacoustic Geolocation: Species Sounds as Geographic S | 7.2 | Top 50% | -
49 | ADEPT: RL-Aligned Agentic Decoding of Emotion via Evide | 7.2 | Top 50% | -
50 | MECAT: A Multi-Experts Constructed Benchmark for Fine-G | 7.2 | Top 50% | -
51 | SPEAR: A Unified SSL Framework for Learning Speech and | 7.2 | Top 50% | -
52 | PADS-TAL: Padding-Annealed Diffusion Sampling in Text-A | 7.2 | Top 50% | -
53 | Multimodal Latent Language Modeling with Next-Token Dif | 7.2 | Top 50% | -
54 | Query-Based Asymmetric Modeling with Decoupled Input–Ou | 7.0 | Top 50% | -
55 | AgentSteerTTS: A Multi-Agent Closed-Loop Framework for | 7.0 | Top 50% | -
56 | Optimality of FSQ tokens for continuous diffusion for c | 7.0 | Top 50% | -
57 | JAEGER: Joint 3D Audio-Visual Grounding and Reasoning i | 7.0 | Top 50% | -
58 | SonicMaster: Towards Controllable All-in-One Music Rest | 7.0 | Top 50% | -
59 | VIBE: Disentangling Social Dynamics via Kinematics-Info | 7.0 | Top 50% | -
60 | Reasoning LLM Improves Speaker Recognition in Long-form | 7.0 | Top 50% | -
61 | A Semantically Consistent Dataset for Data-Efficient Qu | 7.0 | Top 50% | -
62 | The Silent Thought: Modeling Internal Cognition in Full | 7.0 | Top 50% | -
63 | Learning Tight Rejection Boundaries without Negatives f | 7.0 | Top 50% | -
64 | Quaternion Self-Attention with Shared Scores | 7.0 | Top 50% | -
65 | Bridging Your Imagination with Audio-Video Generation v | 7.0 | Top 50% | -
66 | TextME: Bridging Unseen Modalities Through Text Descrip | 7.0 | Top 50% | -
67 | ReGen: Hierarchical Multi-Prompt Representation Generat | 7.0 | Top 50% | -
68 | Polyphonia: Training-Free Context-Aware Music Editing w | 7.0 | Top 50% | -
69 | TMD-Bench: A Multi-Level Evaluation Paradigm for Music– | 7.0 | Top 50% | -
70 | Omni-Perception Policy Optimization for Multimodal Emot | 7.0 | Top 50% | -
71 | Acoustic Interference: A New Paradigm Weaponizing Acous | 7.0 | Top 50% | -
72 | AudioChat: Unified Audio Storytelling, Editing, and Und | 7.0 | Top 50% | -
73 | Do Audio LLMs Listen or Read? Analyzing and Mitigating | 6.9 | Top 50% | -
74 | From Talking to Singing: A New Challenge for Audio-Visu | 6.8 | Top 50% | -
75 | Multiple Choice Learning of Low-Rank Adapters for Langu | 6.8 | Top 50% | -
76 | Multimodal Fusion via Self-Consistent Task-Gradient Fie | 6.8 | Top 50% | -
77 | Position: Beyond Text The Text-Centric Bias in Founda | 6.8 | Top 50% | -
78 | MetaBio: Learning from metadata for bioacoustics founda | 6.5 | Top 50% | -
79 | Any-Diffusion: Unified Multimodal Understanding and Gen | 6.5 | Top 50% | -
80 | SAM Audio: Segment Anything in Audio | 6.5 | Top 50% | #**
81 | CoCoEmo: Composable and Controllable Human-Like Emotion | 6.5 | Top 50% | -
82 | HyperPotter: Spell the Charm of High-Order Interactions | 6.5 | Top 50% | -
83 | Joint Enhancement and Classification using Coupled Diff | 6.5 | Top 50% | -
84 | Hearing Without Noticing? Attention-Aware Stealthy Blac | 6.5 | Top 50% | -
85 | Two-dimensional quantization for geometry-aware audio c | 6.5 | Top 50% | -
86 | SALSA-V: Shortcut-Augmented Long-form Synchronized Audi | 6.5 | Top 50% | -
87 | REST: Diffusion-based Real-time End-to-end Streaming Ta | 6.5 | Top 50% | -
88 | AuTAgent: A Reinforcement Learning Framework for Tool-A | 6.5 | Top 50% | -
89 | Characterizing the Predictive Impact of Modalities with | 6.5 | Top 50% | -
90 | Group Cognition Learning: Making Everything Better Thro | 6.5 | Top 50% | -
91 | Rethinking Attention in Spiking Transformers: Overcomin | 6.5 | Top 50% | -
92 | T2AV-Compass: Towards Unified Evaluation for Text-to-Au | 6.5 | Top 50% | -
93 | S3Audio: Towards Streaming Synchronized Spatial Audio G | 6.5 | Top 50% | -
94 | Sparse Autoencoders for Interpretable Emotion Control i | 6.5 | Top 50% | -
95 | BAT: Better Audio Transformer Guided by Convex Gated Pr | 6.5 | Top 50% | -
96 | AG-REPA: Causal Layer Selection for Representation Alig | 6.5 | Top 50% | -
97 | CoLA: Cross-Modal Low-rank Adaptation for Multimodal Do | 6.5 | Top 50% | -
98 | Neural-Inspired Modeling of Auditory Selection and Comp | 6.5 | Top 50% | -
99 | FutureOmni: Evaluating Future Forecasting from Omni-Mod | 6.5 | Top 50% | -
100 | ProactiveLLM: Learning Active Interaction for Streaming | 6.0 | Top 50% | -
101 | video-SALMONN S: Memory-Enhanced Streaming Audio-Visual | 6.0 | Top 50% | -
102 | Zero-Shot Rankability: Revealing Latent Ordinal Structu | 6.0 | Top 50% | -
103 | Scaling Transformers for End-to-End Discrete Audio Toke | 6.0 | Top 50% | -
104 | Evaluating and Rewarding LALMs for Expressive Role-Play | 6.0 | Top 50% | -
105 | Unlocking Speech–Text Compositional Powers: Instruction | 5.8 | Top 50% | -
106 | Probing Cross-modal Information Hubs in Audio-Visual LL | 5.5 | Top 50% | -
107 | OmniShow: Orchestrating Multimodal Conditions for Human | 5.5 | Top 50% | -
108 | Sparse Tokens Suffice: Jailbreaking Audio Language Mode | 5.5 | Top 50% | -
109 | PHALAR: Phasors for Learned Musical Audio Representatio | 5.5 | Top 50% | -
110 | Scaling Laws in Model Fine-tuning for Audio DeepFake De | 5.0 | Bottom 50% | -
111 | PRIM: Cooperative Dynamic Token Compression for Efficien | 4.8 | Bottom 50% | -
112 | Towards Understanding Modality Interaction in Multimoda | 4.5 | Bottom 50% | -
113 | From Inpainting to Editing: Unlocking Robust Mask-Free | 4.3 | Bottom 50% | -
114 | SONAR: Spectral‑Contrastive Audio Residuals for General | 4.0 | Bottom 50% | -
115 | MoshiRAG: Asynchronous Knowledge Retrieval for Full-Dup | 3.8 | Bottom 50% | -
116 | STARCaster: Spatio-Temporal AutoRegressive Video Diffus | 3.5 | Bottom 50% | -
117 | WaveSSM: Multiscale State-Space Models for Non-stationa | 3.5 | Bottom 50% | -
118 | \(\tau\)-Voice: Benchmarking Full-Duplex Voice Agents on | 3.5 | Bottom 50% | -
119 | FakeWorld 1.0: An Omni modal Benchmark for Fake Media a | 3.5 | Bottom 50% | -
120 | LALM-as-a-Judge: Benchmarking Large Audio-Language Mode | 3.5 | Bottom 50% | -
121 | IVQ: Structured and Lightweight Vector Quantization via | 3.2 | Bottom 50% | -
122 | MFCL Audio: An Audio Function Calling Evaluation for La | 3.0 | Bottom 50% | -
123 | Position: Towards Responsible Evaluation for Text-to-Sp | 2.6 | Bottom 50% | -

📋 Paper list
🥇 INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments 🔥 8.5/10 | Top 25% | arxiv ...

2026-05-23 · Updated 2026-06-19 · 16 min · 3402 words

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

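📄 Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods #text-to-music-generation #benchmark-challenge #fair-comparison #evaluation-metrics #music-information-retrieval 🔥 9.9/10 | Top 10% | #music-generation | #benchmarking | #text-to-music-generation #benchmark-challenge | arxiv
Academic quality 6.3/7 | Impact 1.7/2 | Reproducibility 1.9/2 | Confidence 0.9

👥 Authors and affiliations
Authors: Fang-Chih Hsieh, Wei-Jaw Lee, Chun-Ping Wang, Hung-yi Lee, Hao-Wen Dong, and Yi-Hsuan Yang
Affiliations: Not explicitly listed in the paper's title or abstract. A footnote gives the website address (https://ntu-musicailab.github.io/ICME26-ATTM-Grand-Challenge/), which indicates a connection to NTU-MusicAILab.

💡 Snarky take
This paper is less a proposal of a new method than a carefully staged "Olympics of music AI for academia". It hits the field's current pain point squarely: industry giants have built high walls out of massive data and compute, leaving academia to do "interior decorating" (fine-tuning) at the foot of the wall. The paper's strengths are its strongly fairness-driven design philosophy and its open-source follow-through: mandatory training from scratch, data cleaning, and the evaluation pipeline are all provided end to end, in an attempt to put every participant back on the same starting line. The CCS metric is a good idea: using a large model as a "referee" to check, at fine granularity, whether each musical concept was actually generated is more interpretable than a single CLAP score. The problems are just as clear. Whether a 150M-parameter baseline generating 10-second clips truly reflects an architecture's potential is doubtful, and the subjective evaluation has only 35 participants, with no stated breakdown of who they are, which weakens its persuasiveness. In the end this reads more like an excellent challenge report than a methodological breakthrough; its value lies in giving the community a set of "rules of the game" and shared infrastructure.

📌 Core summary
This paper presents the technical framework and overview of the ICME 2026 Academic Text-to-Music Generation (ATTM) challenge. The challenge targets the problem that text-to-music generation is currently dominated by industry-scale data and compute, which hinders fair comparison and innovation in academic research. Its core design principle is that every submitted model must be trained from scratch on a standardized, instrumental-only subset of MTG-Jamendo. The challenge has an efficiency track (core model ≤ 500M parameters) and a performance track (no parameter limit). Evaluation is multi-stage: objective metrics (FAD, CLAP, and the novel LLM-based CCS) are used for screening first, followed by subjective MOS tests on the top systems. The paper open-sources the data preprocessing pipeline, the baseline model FluxAudio-S, and the evaluation code, aiming to promote transparent, reproducible academic research.

🔗 Open-source details
Code:
  Preprocessing pipeline (vocal separation): https://github.com/ntu-musicailab/ICME26-ATTM-GC-Preprocessing
  Audio captioning pipeline: https://github.com/ntu-musicailab/ICME26-ATTM-GC-ALM-captioning
  Official baseline model (FluxAudio-S): https://github.com/ntu-musicailab/ICME26-ATTM-GC-FluxAudio
  Evaluation code (for computing FAD and CLAP): https://github.com/ntu-musicailab/ICME26-ATTM-GC-Evaluation
Model weights: The paper provides the code repository for the official baseline FluxAudio-S (including training scripts); the weights can be obtained by training from scratch with that code. For the topline models (Stable Audio Open, MusicGen, MeanAudio), the paper uses their officially released checkpoints but gives no additional download links. No direct link is given for the Qwen3-Omni model used for CCS evaluation either.
Dataset: The raw_30s subset of MTG-Jamendo. The paper gives no direct download link, but it states the data source (the Jamendo platform, CC licenses) and the preprocessing applied (vocal separation).
Demo: Not mentioned.
Reproduction materials:
  Detailed training configuration for the baseline: a single NVIDIA RTX A6000 (48GB VRAM), 200,000 training steps, batch size 128, total training time of about 2 days 4 hours.
  The code for vocal separation and caption generation, plus the names of the model checkpoints they depend on (e.g. melband-roformer-kim-vocals).
  The exact prompts used to generate the official reference captions (Table I).
  The specific formula and procedure for the Borda count used in the evaluation method.
Open-source projects cited in the paper:
  MTG-Jamendo: the original open dataset.
  Mel-Band Roformer: the model used for vocal separation.
  Qwen2-Audio-7B-Instruct: caption generation (Pipeline A).
  Music Flamingo: caption generation (Pipeline B, first stage).
  Qwen3-4B-Instruct: caption refinement and test-prompt synthesis.
  EnCodec: auxiliary audio decoder.
  LAION-CLAP-Music (music_audioset_epoch_15_esc_90.14): feature extractor for FAD and CLAP scoring.
  Qwen3-Omni: the large language model used for the CCS metric.
  T5: text encoder.
  FluxAudio: the original architecture behind the baseline model.
  Stable Audio Open, MusicGen, MeanAudio: pretrained topline models.

🏗️ Method overview and architecture
The methodological core of this challenge is a standardized, fair benchmarking framework. Its architecture and workflow can be divided into the following interrelated modules: ...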
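
The entry above mentions objective screening with FAD, CLAP, and CCS, and a Borda count in the evaluation method, but the excerpt does not reproduce the paper's formula. As a rough illustration only, here is a textbook Borda count in Python. Applying it across exactly these three metrics, the function name `borda_scores`, the tie-breaking rule, and all numbers in `example` are assumptions made for this sketch, not details taken from the paper.

```python
from typing import Dict

# Assumed metric directions: FAD is a distance (lower is better);
# CLAP and CCS are treated as scores (higher is better).
HIGHER_IS_BETTER = {"FAD": False, "CLAP": True, "CCS": True}


def borda_scores(results: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Textbook Borda count over several metrics.

    For each metric, the best of n systems earns n-1 points and the worst
    earns 0; points are summed over metrics. Ties are broken by system
    name, which is an arbitrary choice made for this sketch.
    """
    systems = sorted(results)
    n = len(systems)
    totals = {name: 0 for name in systems}
    for metric, higher in HIGHER_IS_BETTER.items():
        # sorted() is stable, so equal values keep alphabetical order.
        ordered = sorted(systems, key=lambda s: results[s][metric], reverse=higher)
        for rank, name in enumerate(ordered):
            totals[name] += n - 1 - rank
    return totals


# Invented numbers, for illustration only (not results from the paper).
example = {
    "system_a": {"FAD": 0.40, "CLAP": 0.30, "CCS": 0.55},
    "system_b": {"FAD": 0.55, "CLAP": 0.35, "CCS": 0.60},
    "system_c": {"FAD": 0.70, "CLAP": 0.25, "CCS": 0.40},
}
totals = borda_scores(example)
for name in sorted(totals, key=totals.get, reverse=True):
    print(name, totals[name])
```

With these made-up inputs, `system_b` wins two of the three metrics and ends up ahead of `system_a`, even though `system_a` has the best FAD. That is the usual appeal of rank aggregation: no single metric's scale dominates the outcome.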

2026-05-22 · Updated 2026-06-19 · 2 min · 372 words