<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLMs on Speech/Music/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/llms/</link>
    <description>Daily AI-generated in-depth analyses of speech and AI papers</description>
    <language>zh-cn</language>
    <lastBuildDate>Sat, 23 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/llms/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>LynX: Token Interface Alignment for Video&#43;X LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-lynx-token-interface-alignment-for-videox-llms/</link>
      <pubDate>Sat, 23 May 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-lynx-token-interface-alignment-for-videox-llms/</guid>
      <description>&lt;h1 id=&#34;-lynx-token-interface-alignment-for-videox-llms&#34;&gt;📄 LynX: Token Interface Alignment for Video+X LLMs&lt;/h1&gt;
&lt;p&gt;#Video #LLMs #Token #Interface #Alignment #MultimodalIntegration #ManifoldAlignment #UnimodalData&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;7.5/10&lt;/strong&gt; | Top 25% | #Video | #LLMs #Token | &lt;a href=&#34;https://icml.cc/virtual/2026/poster/61725&#34;&gt;ICML 2026 poster&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23/&#34;&gt;← Back to Speech/Music/Audio Paper Digest 2026-05-23&lt;/a&gt;&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="-lynx-token-interface-alignment-for-videox-llms">📄 LynX: Token Interface Alignment for Video+X LLMs</h1>
<p>#Video #LLMs #Token #Interface #Alignment #MultimodalIntegration #ManifoldAlignment #UnimodalData</p>
<p>✅ <strong>7.5/10</strong> | Top 25% | #Video | #LLMs #Token | <a href="https://icml.cc/virtual/2026/poster/61725">ICML 2026 poster</a></p>
<hr>
<p><a href="/audio-paper-digest-blog/posts/2026-05-23/">← Back to Speech/Music/Audio Paper Digest 2026-05-23</a></p>
]]></content:encoded>
      <category>Video</category>
      <category>LLMs</category>
      <category>Token</category>
      <category>Interface</category>
      <category>Alignment</category>
      <category>多模态整合</category>
      <category>流形对齐</category>
      <category>单模态数据</category>
    </item>
    <item>
      <title>Speech/Music/Audio Paper Digest 2026-05-23</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23/</link>
      <pubDate>Sat, 23 May 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23/</guid>
      <description>&lt;h1 id=&#34;语音音乐音频论文速递-2026-05-23&#34;&gt;Speech/Music/Audio Paper Digest 2026-05-23&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;123&lt;/strong&gt; papers analyzed in total&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;-今日概览&#34;&gt;⚡ Today's Overview&lt;/h2&gt;
&lt;p&gt;📥 123 papers fetched → 🔬 in-depth analysis complete&lt;/p&gt;
&lt;h3 id=&#34;-热门方向&#34;&gt;🏷️ Trending Topics&lt;/h3&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th&gt;Topic&lt;/th&gt;
					&lt;th&gt;Count&lt;/th&gt;
					&lt;th&gt;Distribution&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td&gt;#**&lt;/td&gt;
					&lt;td&gt;4 papers&lt;/td&gt;
					&lt;td&gt;████&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;-论文评分排行榜123-篇按分数降序&#34;&gt;📊 Paper Score Leaderboard (123 papers, sorted by score, descending)&lt;/h3&gt;
&lt;table&gt;
	&lt;thead&gt;
			&lt;tr&gt;
					&lt;th&gt;Rank&lt;/th&gt;
					&lt;th&gt;Paper&lt;/th&gt;
					&lt;th&gt;Score&lt;/th&gt;
					&lt;th&gt;Tier&lt;/th&gt;
					&lt;th&gt;Main Task&lt;/th&gt;
			&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tbody&gt;
			&lt;tr&gt;
					&lt;td&gt;🥇&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-infer-learning-implicit-neural-frequency-response&#34;&gt;INFER: Learning Implicit Neural Frequency Response Fiel&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;🥈&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-vocsim-a-training-free-benchmark-for-zero-shot&#34;&gt;VocSim A Training-free Benchmark for Zero-shot Content &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.3&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;🥉&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-cmi-rewardbench-evaluating-music-reward-models&#34;&gt;CMI-RewardBench: Evaluating Music Reward Models with Co&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.2&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;4.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-language-model-augmented-semi-supervised&#34;&gt;Language Model Augmented Semi-Supervised Statistical In&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.2&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;5.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-discoforcing-a-unified-framework-for-real-time&#34;&gt;DiscoForcing: A Unified Framework for Real-Time Audio-D&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.2&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;6.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-abstraction-induces-the-brain-alignment-of&#34;&gt;Abstraction Induces the Brain Alignment of Language and&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.0&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;#**&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;7.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-alethia-a-foundational-encoder-for-voice-deepfakes&#34;&gt;Alethia: a Foundational Encoder for Voice Deepfakes&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.0&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;8.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-omnidensecap-scripting-multi-scene-videos-with&#34;&gt;OmniDenseCap: Scripting Multi-Scene Videos with Time-Aw&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.0&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;9.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-foeglass-when-simple-in-context-learning-is&#34;&gt;FoeGlass: When Simple In-Context Learning Is Enough for&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.0&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;10.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-e-vads-an-e-commerce-short-videos-understanding&#34;&gt;E-VAds: An E-commerce Short Videos Understanding Benchm&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.0&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;11.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-beat-tokenizing-and-generating-symbolic-music-by&#34;&gt;BEAT: Tokenizing and Generating Symbolic Music by Unifo&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;8.0&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;12.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-pianist-transformer-towards-expressive-piano&#34;&gt;Pianist Transformer: Towards Expressive Piano Performan&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.8&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;13.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-dreamid-omni-unified-framework-for-controllable&#34;&gt;DreamID-Omni: Unified Framework for Controllable Human-&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.8&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;14.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-real-world-unsupervised-models-generalize-to&#34;&gt;Real-World Unsupervised Models Generalize to Predict Br&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.8&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;15.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-audiomosaic-contrastive-masked-audio&#34;&gt;AudioMosaic: Contrastive Masked Audio Representation Le&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;16.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-self-guidance-enhancing-neural-codecs-via-decoder&#34;&gt;Self-Guidance: Enhancing Neural Codecs via Decoder Mani&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;17.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-lynx-token-interface-alignment-for-videox-llms&#34;&gt;LynX: Token Interface Alignment for Video+X LLMs&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;#**&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;18.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-spherical-procrustes-alignment-for-reliable&#34;&gt;Spherical Procrustes Alignment for Reliable Medical Aud&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;19.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-most-mixing-speech-and-text-with-modality-aware&#34;&gt;MoST: Mixing Speech and Text with Modality-Aware Mixtur&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;20.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-self-supervised-flow-matching-for-scalable-multi&#34;&gt;Self-Supervised Flow Matching for Scalable Multi-Modal &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;21.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-lightavseg-lightweight-audio-visual-segmentation&#34;&gt;LightAVSeg: Lightweight Audio-Visual Segmentation&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;22.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-robust-signal-enhancement-via-fractional-detail&#34;&gt;Robust Signal Enhancement via Fractional Detail Views a&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;23.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-echoingpixels-aliasing-resistant-joint-token&#34;&gt;EchoingPixels: Aliasing-Resistant Joint Token Reduction&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;24.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-long-grounded-thoughts-synthesizing-grounded&#34;&gt;Long Grounded Thoughts: Synthesizing Grounded Visual Pr&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;25.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-omnivideo-r1-reinforcing-audio-visual-reasoning&#34;&gt;OmniVideo-R1: Reinforcing Audio-visual Reasoning with Q&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;26.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-ariadnes-thread-of-lipsync-unraveling-forgeries&#34;&gt;Ariadne&amp;rsquo;s Thread of LipSync: Unraveling Forgeries via I&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;27.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-avi-bench-toward-human-like-audio-visual&#34;&gt;AVI-Bench: Toward Human-like Audio-Visual Intelligence &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;28.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-simultaneous-speech-to-speech-translation-without&#34;&gt;Simultaneous Speech-to-Speech Translation Without Align&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;29.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-phostream-benchmarking-real-world-streaming-for&#34;&gt;PhoStream: Benchmarking Real-World Streaming for Omnimo&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;30.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-omnisift-modality-asymmetric-token-compression&#34;&gt;OmniSIFT: Modality-Asymmetric Token Compression for Eff&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;31.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-speech-audio-compositional-attacks-on-multimodal&#34;&gt;Speech-Audio Compositional Attacks on Multimodal LLMs a&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;32.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-convex-low-resource-accent-robust-language&#34;&gt;Convex Low-resource Accent-Robust Language Detection in&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;#**&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;33.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-phasecoder-microphone-geometry-agnostic-spatial&#34;&gt;PhaseCoder: Microphone Geometry-Agnostic Spatial Audio &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;34.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-listening-through-the-noise-cauchy-driven&#34;&gt;Listening Through the Noise: Cauchy-Driven Diffusion Br&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;35.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-dual-view-predictive-diffusion-lightweight-speech&#34;&gt;Dual-View Predictive Diffusion: Lightweight Speech Enha&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;36.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-stream-rag-instant-and-accurate-spoken-dialogue&#34;&gt;Stream RAG: Instant and Accurate Spoken Dialogue System&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;37.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-naaca-training-free-neuroauditory-attentive&#34;&gt;NAACA: Training-Free NeuroAuditory Attentive Cognitive &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;38.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-medmosaic-a-challenging-large-scale-benchmark-of&#34;&gt;MedMosaic: A Challenging Large Scale Benchmark of Diver&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;39.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-verifiable-multimodal-reasoning-fact-level&#34;&gt;Verifiable Multimodal Reasoning: Fact-level Attribution&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;40.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-musicdet-zero-shot-ai-generated-music-detection&#34;&gt;MusicDET: Zero-Shot AI-Generated Music Detection&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;41.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-pcrnet-phase-aware-complex-refinement-network-for&#34;&gt;PCRNet: Phase-aware Complex Refinement Network for EEG-&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;42.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-sarsteer-safeguarding-large-audio-language-models&#34;&gt;SARSteer: Safeguarding Large Audio Language Models via &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;43.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-star-vae-structured-topology-aware-regularization&#34;&gt;STAR-VAE: Structured Topology-Aware Regularization for &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;44.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-hidden-in-plain-tokens-simply-robust-gradient&#34;&gt;Hidden in Plain Tokens: Simply Robust, Gradient-Free Wa&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.5&lt;/td&gt;
					&lt;td&gt;Top 25%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;45.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-avgen-bench-a-task-driven-benchmark-for-multi&#34;&gt;AVGen-Bench: A Task-Driven Benchmark for Multi-Granular&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.3&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;46.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-bridging-the-stability-expressivity-gap-synthetic&#34;&gt;Bridging the Stability-Expressivity Gap: Synthetic Data&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.3&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;47.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-avtrack-audio-visual-speaker-tracking-in-complex&#34;&gt;AVTrack: Audio-Visual Speaker Tracking in Complex Scene&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.3&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;48.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-bioacoustic-geolocation-species-sounds-as&#34;&gt;Bioacoustic Geolocation: Species Sounds as Geographic S&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.2&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;49.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-adept-rl-aligned-agentic-decoding-of-emotion-via&#34;&gt;ADEPT: RL-Aligned Agentic Decoding of Emotion via Evide&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.2&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;50.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-mecat-a-multi-experts-constructed-benchmark-for&#34;&gt;MECAT: A Multi-Experts Constructed Benchmark for Fine-G&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.2&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;51.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-spear-a-unified-ssl-framework-for-learning-speech&#34;&gt;SPEAR: A Unified SSL Framework for Learning Speech and &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.2&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;52.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-pads-tal-padding-annealed-diffusion-sampling-in&#34;&gt;PADS-TAL: Padding-Annealed Diffusion Sampling in Text-A&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.2&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;53.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-multimodal-latent-language-modeling-with-next&#34;&gt;Multimodal Latent Language Modeling with Next-Token Dif&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.2&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;54.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-query-based-asymmetric-modeling-with-decoupled&#34;&gt;Query-Based Asymmetric Modeling with Decoupled Input–Ou&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;55.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-agentsteertts-a-multi-agent-closed-loop-framework&#34;&gt;AgentSteerTTS: A Multi-Agent Closed-Loop Framework for &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;56.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-optimality-of-fsq-tokens-for-continuous-diffusion&#34;&gt;Optimality of FSQ tokens for continuous diffusion for c&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;57.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-jaeger-joint-3d-audio-visual-grounding-and&#34;&gt;JAEGER: Joint 3D Audio-Visual Grounding and Reasoning i&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;58.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-sonicmaster-towards-controllable-all-in-one-music&#34;&gt;SonicMaster: Towards Controllable All-in-One Music Rest&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;59.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-vibe-disentangling-social-dynamics-via-kinematics&#34;&gt;VIBE: Disentangling Social Dynamics via Kinematics-Info&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;60.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-reasoning-llm-improves-speaker-recognition-in&#34;&gt;Reasoning LLM Improves Speaker Recognition in Long-form&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;61.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-a-semantically-consistent-dataset-for-data&#34;&gt;A Semantically Consistent Dataset for Data-Efficient Qu&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;62.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-the-silent-thought-modeling-internal-cognition-in&#34;&gt;The Silent Thought: Modeling Internal Cognition in Full&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;63.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-learning-tight-rejection-boundaries-without&#34;&gt;Learning Tight Rejection Boundaries without Negatives f&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;64.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-quaternion-self-attention-with-shared-scores&#34;&gt;Quaternion Self-Attention with Shared Scores&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;65.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-bridging-your-imagination-with-audio-video&#34;&gt;Bridging Your Imagination with Audio-Video Generation v&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;66.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-textme-bridging-unseen-modalities-through-text&#34;&gt;TextME: Bridging Unseen Modalities Through Text Descrip&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;67.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-regen-hierarchical-multi-prompt-representation&#34;&gt;ReGen: Hierarchical Multi-Prompt Representation Generat&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;68.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-polyphonia-training-free-context-aware-music&#34;&gt;Polyphonia: Training-Free Context-Aware Music Editing w&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;69.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-tmd-bench-a-multi-level-evaluation-paradigm-for&#34;&gt;TMD-Bench: A Multi-Level Evaluation Paradigm for Music–&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;70.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-omni-perception-policy-optimization-for&#34;&gt;Omni-Perception Policy Optimization for Multimodal Emot&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;71.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-acoustic-interference-a-new-paradigm-weaponizing&#34;&gt;Acoustic Interference: A New Paradigm Weaponizing Acous&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;72.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-audiochat-unified-audio-storytelling-editing-and&#34;&gt;AudioChat: Unified Audio Storytelling, Editing, and Und&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;7.0&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;73.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-do-audio-llms-listen-or-read-analyzing-and&#34;&gt;Do Audio LLMs Listen or Read? Analyzing and Mitigating &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.9&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;74.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-from-talking-to-singing-a-new-challenge-for-audio&#34;&gt;From Talking to Singing: A New Challenge for Audio-Visu&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.8&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;75.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-multiple-choice-learning-of-low-rank-adapters-for&#34;&gt;Multiple Choice Learning of Low-Rank Adapters for Langu&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.8&lt;/td&gt;
					&lt;td&gt;Top 50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;76.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-multimodal-fusion-via-self-consistent-task&#34;&gt;Multimodal Fusion via Self-Consistent Task-Gradient Fie&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.8分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;77.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-position-beyond-text-the-text-centric-bias-in&#34;&gt;Position: &lt;em&gt;Beyond Text&lt;/em&gt; The Text-Centric Bias in Founda&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.8分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;78.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-metabio-learning-from-metadata-for-bioacoustics&#34;&gt;MetaBio: Learning from metadata for bioacoustics founda&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;79.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-any-diffusion-unified-multimodal-understanding&#34;&gt;Any-Diffusion: Unified Multimodal Understanding and Gen&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;80.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-sam-audio-segment-anything-in-audio&#34;&gt;SAM Audio: Segment Anything in Audio&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;#**&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;81.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-cocoemo-composable-and-controllable-human-like&#34;&gt;CoCoEmo: Composable and Controllable Human-Like Emotion&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;82.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-hyperpotter-spell-the-charm-of-high-order&#34;&gt;HyperPotter: Spell the Charm of High-Order Interactions&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;83.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-joint-enhancement-and-classification-using&#34;&gt;Joint Enhancement and Classification using Coupled Diff&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;84.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-hearing-without-noticing-attention-aware-stealthy&#34;&gt;Hearing Without Noticing? Attention-Aware Stealthy Blac&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;85.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-two-dimensional-quantization-for-geometry-aware&#34;&gt;Two-dimensional quantization for geometry-aware audio c&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;86.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-salsa-v-shortcut-augmented-long-form-synchronized&#34;&gt;SALSA-V: Shortcut-Augmented Long-form Synchronized Audi&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;87.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-rest-diffusion-based-real-time-end-to-end&#34;&gt;REST: Diffusion-based Real-time End-to-end Streaming Ta&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;88.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-autagent-a-reinforcement-learning-framework-for&#34;&gt;AuTAgent: A Reinforcement Learning Framework for Tool-A&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;89.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-characterizing-the-predictive-impact-of&#34;&gt;Characterizing the Predictive Impact of Modalities with&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;90.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-group-cognition-learning-making-everything-better&#34;&gt;Group Cognition Learning: Making Everything Better Thro&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;91.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-rethinking-attention-in-spiking-transformers&#34;&gt;Rethinking Attention in Spiking Transformers: Overcomin&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;92.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-t2av-compass-towards-unified-evaluation-for-text&#34;&gt;T2AV-Compass: Towards Unified Evaluation for Text-to-Au&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;93.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-s3audio-towards-streaming-synchronized-spatial&#34;&gt;S3Audio: Towards Streaming Synchronized Spatial Audio G&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;94.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-sparse-autoencoders-for-interpretable-emotion&#34;&gt;Sparse Autoencoders for Interpretable Emotion Control i&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;95.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-bat-better-audio-transformer-guided-by-convex&#34;&gt;BAT: Better Audio Transformer Guided by Convex Gated Pr&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;96.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-ag-repa-causal-layer-selection-for-representation&#34;&gt;AG-REPA: Causal Layer Selection for Representation Alig&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;97.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-cola-cross-modal-low-rank-adaptation-for&#34;&gt;CoLA: Cross-Modal Low-rank Adaptation for Multimodal Do&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;98.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-neural-inspired-modeling-of-auditory-selection&#34;&gt;Neural-Inspired Modeling of Auditory Selection and Comp&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;99.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-futureomni-evaluating-future-forecasting-from&#34;&gt;FutureOmni: Evaluating Future Forecasting from Omni-Mod&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;100.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-proactivellm-learning-active-interaction-for&#34;&gt;ProactiveLLM: Learning Active Interaction for Streaming&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.0分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;101.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-video-salmonn-s-memory-enhanced-streaming-audio&#34;&gt;video-SALMONN S: Memory-Enhanced Streaming Audio-Visual&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.0分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;102.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-zero-shot-rankability-revealing-latent-ordinal&#34;&gt;Zero-Shot Rankability: Revealing Latent Ordinal Structu&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.0分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;103.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-scaling-transformers-for-end-to-end-discrete&#34;&gt;Scaling Transformers for End-to-End Discrete Audio Toke&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.0分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;104.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-evaluating-and-rewarding-lalms-for-expressive&#34;&gt;Evaluating and Rewarding LALMs for Expressive Role-Play&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;6.0分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;105.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-unlocking-speechtext-compositional-powers&#34;&gt;Unlocking Speech–Text Compositional Powers: Instruction&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;5.8分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;106.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-probing-cross-modal-information-hubs-in-audio&#34;&gt;Probing Cross-modal Information Hubs in Audio-Visual LL&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;5.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;107.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-omnishow-orchestrating-multimodal-conditions-for&#34;&gt;OmniShow: Orchestrating Multimodal Conditions for Human&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;5.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;108.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-sparse-tokens-suffice-jailbreaking-audio-language&#34;&gt;Sparse Tokens Suffice: Jailbreaking Audio Language Mode&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;5.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;109.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-phalar-phasors-for-learned-musical-audio&#34;&gt;PHALAR: Phasors for Learned Musical Audio Representatio&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;5.5分&lt;/td&gt;
					&lt;td&gt;前50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;110.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-scaling-laws-in-model-fine-tuning-for-audio&#34;&gt;Scaling Laws in Model Fine-tuning for Audio DeepFake De&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;5.0分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;111.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-primcooperative-dynamic-token-compression-for&#34;&gt;PRIM: Cooperative Dynamic Token Compression for Efficien&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;4.8分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;112.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-towards-understanding-modality-interaction-in&#34;&gt;Towards Understanding Modality Interaction in Multimoda&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;4.5分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;113.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-from-inpainting-to-editing-unlocking-robust-mask&#34;&gt;From Inpainting to Editing: Unlocking Robust Mask-Free &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;4.3分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;114.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-sonar-spectralcontrastive-audio-residuals-for&#34;&gt;SONAR: Spectral‑Contrastive Audio Residuals for General&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;4.0分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;115.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-moshirag-asynchronous-knowledge-retrieval-for&#34;&gt;MoshiRAG: Asynchronous Knowledge Retrieval for Full-Dup&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.8分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;116.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-starcaster-spatio-temporal-autoregressive-video&#34;&gt;STARCaster: Spatio-Temporal AutoRegressive Video Diffus&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.5分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;117.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-wavessm-multiscale-state-space-models-for-non&#34;&gt;WaveSSM: Multiscale State-Space Models for Non-stationa&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.5分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;118.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-tau-voice-benchmarking-full-duplex-voice-agents&#34;&gt;τ-Voice: Benchmarking Full-Duplex Voice Agents on &lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.5分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;119.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-fakeworld-10-an-omni-modal-benchmark-for-fake&#34;&gt;FakeWorld 1.0: An Omni modal Benchmark for Fake Media a&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.5分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;120.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-lalm-as-a-judge-benchmarking-large-audio-language&#34;&gt;LALM-as-a-Judge: Benchmarking Large Audio-Language Mode&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.5分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;121.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-ivq-structured-and-lightweight-vector&#34;&gt;IVQ: Structured and Lightweight Vector Quantization via&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.2分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;122.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-mfcl-audio-an-audio-function-calling-evaluation&#34;&gt;MFCL Audio: An Audio Function Calling Evaluation for La&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;3.0分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
			&lt;tr&gt;
					&lt;td&gt;123.&lt;/td&gt;
					&lt;td&gt;&lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-position-towards-responsible-evaluation-for-text&#34;&gt;Position: Towards Responsible Evaluation for Text-to-Sp&lt;/a&gt;&lt;/td&gt;
					&lt;td&gt;2.6分&lt;/td&gt;
					&lt;td&gt;后50%&lt;/td&gt;
					&lt;td&gt;-&lt;/td&gt;
			&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;-论文列表&#34;&gt;📋 论文列表&lt;/h2&gt;
&lt;h3 id=&#34;-infer-learning-implicit-neural-frequency-response-fields-for-confined-acoustic-environments&#34;&gt;🥇 &lt;a href=&#34;https://nanless.github.io/audio-paper-digest-blog/posts/2026-05-23-infer-learning-implicit-neural-frequency-response&#34;&gt;INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;🔥 &lt;strong&gt;8.5/10&lt;/strong&gt; | 前25% | &lt;a href=&#34;https://icml.cc/virtual/2026/poster/66526&#34;&gt;ICML 2026&lt;/a&gt;&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="语音音乐音频论文速递-2026-05-23">语音/音乐/音频论文速递 2026-05-23</h1>
<p>共分析 <strong>123</strong> 篇论文</p>
<hr>
<h2 id="-今日概览">⚡ 今日概览</h2>
<p>📥 抓取 123 篇 → 🔬 深度分析完成</p>
<h3 id="-热门方向">🏷️ 热门方向</h3>
<table>
	<thead>
			<tr>
					<th>方向</th>
					<th>数量</th>
					<th>分布</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>#**</td>
					<td>4篇</td>
					<td>████</td>
			</tr>
	</tbody>
</table>
<h3 id="-论文评分排行榜123-篇按分数降序">📊 论文评分排行榜（123 篇，按分数降序）</h3>
<table>
	<thead>
			<tr>
					<th>排名</th>
					<th>论文</th>
					<th>评分</th>
					<th>分档</th>
					<th>主任务</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>🥇</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-infer-learning-implicit-neural-frequency-response">INFER: Learning Implicit Neural Frequency Response Fiel</a></td>
					<td>8.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>🥈</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-vocsim-a-training-free-benchmark-for-zero-shot">VocSim: A Training-free Benchmark for Zero-shot Content </a></td>
					<td>8.3分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>🥉</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-cmi-rewardbench-evaluating-music-reward-models">CMI-RewardBench: Evaluating Music Reward Models with Co</a></td>
					<td>8.2分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>4.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-language-model-augmented-semi-supervised">Language Model Augmented Semi-Supervised Statistical In</a></td>
					<td>8.2分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>5.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-discoforcing-a-unified-framework-for-real-time">DiscoForcing: A Unified Framework for Real-Time Audio-D</a></td>
					<td>8.2分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>6.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-abstraction-induces-the-brain-alignment-of">Abstraction Induces the Brain Alignment of Language and</a></td>
					<td>8.0分</td>
					<td>前25%</td>
					<td>#**</td>
			</tr>
			<tr>
					<td>7.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-alethia-a-foundational-encoder-for-voice-deepfakes">Alethia: a Foundational Encoder for Voice Deepfakes</a></td>
					<td>8.0分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>8.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-omnidensecap-scripting-multi-scene-videos-with">OmniDenseCap: Scripting Multi-Scene Videos with Time-Aw</a></td>
					<td>8.0分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>9.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-foeglass-when-simple-in-context-learning-is">FoeGlass: When Simple In-Context Learning Is Enough for</a></td>
					<td>8.0分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>10.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-e-vads-an-e-commerce-short-videos-understanding">E-VAds: An E-commerce Short Videos Understanding Benchm</a></td>
					<td>8.0分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>11.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-beat-tokenizing-and-generating-symbolic-music-by">BEAT: Tokenizing and Generating Symbolic Music by Unifo</a></td>
					<td>8.0分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>12.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-pianist-transformer-towards-expressive-piano">Pianist Transformer: Towards Expressive Piano Performan</a></td>
					<td>7.8分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>13.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-dreamid-omni-unified-framework-for-controllable">DreamID-Omni: Unified Framework for Controllable Human-</a></td>
					<td>7.8分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>14.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-real-world-unsupervised-models-generalize-to">Real-World Unsupervised Models Generalize to Predict Br</a></td>
					<td>7.8分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>15.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-audiomosaic-contrastive-masked-audio">AudioMosaic: Contrastive Masked Audio Representation Le</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>16.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-self-guidance-enhancing-neural-codecs-via-decoder">Self-Guidance: Enhancing Neural Codecs via Decoder Mani</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>17.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-lynx-token-interface-alignment-for-videox-llms">LynX: Token Interface Alignment for Video+X LLMs</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>#**</td>
			</tr>
			<tr>
					<td>18.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-spherical-procrustes-alignment-for-reliable">Spherical Procrustes Alignment for Reliable Medical Aud</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>19.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-most-mixing-speech-and-text-with-modality-aware">MoST: Mixing Speech and Text with Modality-Aware Mixtur</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>20.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-self-supervised-flow-matching-for-scalable-multi">Self-Supervised Flow Matching for Scalable Multi-Modal </a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>21.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-lightavseg-lightweight-audio-visual-segmentation">LightAVSeg: Lightweight Audio-Visual Segmentation</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>22.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-robust-signal-enhancement-via-fractional-detail">Robust Signal Enhancement via Fractional Detail Views a</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>23.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-echoingpixels-aliasing-resistant-joint-token">EchoingPixels: Aliasing-Resistant Joint Token Reduction</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>24.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-long-grounded-thoughts-synthesizing-grounded">Long Grounded Thoughts: Synthesizing Grounded Visual Pr</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>25.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-omnivideo-r1-reinforcing-audio-visual-reasoning">OmniVideo-R1: Reinforcing Audio-visual Reasoning with Q</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>26.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-ariadnes-thread-of-lipsync-unraveling-forgeries">Ariadne&rsquo;s Thread of LipSync: Unraveling Forgeries via I</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>27.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-avi-bench-toward-human-like-audio-visual">AVI-Bench: Toward Human-like Audio-Visual Intelligence </a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>28.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-simultaneous-speech-to-speech-translation-without">Simultaneous Speech-to-Speech Translation Without Align</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>29.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-phostream-benchmarking-real-world-streaming-for">PhoStream: Benchmarking Real-World Streaming for Omnimo</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>30.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-omnisift-modality-asymmetric-token-compression">OmniSIFT: Modality-Asymmetric Token Compression for Eff</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>31.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-speech-audio-compositional-attacks-on-multimodal">Speech-Audio Compositional Attacks on Multimodal LLMs a</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>32.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-convex-low-resource-accent-robust-language">Convex Low-resource Accent-Robust Language Detection in</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>#**</td>
			</tr>
			<tr>
					<td>33.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-phasecoder-microphone-geometry-agnostic-spatial">PhaseCoder: Microphone Geometry-Agnostic Spatial Audio </a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>34.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-listening-through-the-noise-cauchy-driven">Listening Through the Noise: Cauchy-Driven Diffusion Br</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>35.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-dual-view-predictive-diffusion-lightweight-speech">Dual-View Predictive Diffusion: Lightweight Speech Enha</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>36.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-stream-rag-instant-and-accurate-spoken-dialogue">Stream RAG: Instant and Accurate Spoken Dialogue System</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>37.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-naaca-training-free-neuroauditory-attentive">NAACA: Training-Free NeuroAuditory Attentive Cognitive </a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>38.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-medmosaic-a-challenging-large-scale-benchmark-of">MedMosaic: A Challenging Large Scale Benchmark of Diver</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>39.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-verifiable-multimodal-reasoning-fact-level">Verifiable Multimodal Reasoning: Fact-level Attribution</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>40.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-musicdet-zero-shot-ai-generated-music-detection">MusicDET: Zero-Shot AI-Generated Music Detection</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>41.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-pcrnet-phase-aware-complex-refinement-network-for">PCRNet: Phase-aware Complex Refinement Network for EEG-</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>42.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-sarsteer-safeguarding-large-audio-language-models">SARSteer: Safeguarding Large Audio Language Models via </a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>43.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-star-vae-structured-topology-aware-regularization">STAR-VAE: Structured Topology-Aware Regularization for </a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>44.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-hidden-in-plain-tokens-simply-robust-gradient">Hidden in Plain Tokens: Simply Robust, Gradient-Free Wa</a></td>
					<td>7.5分</td>
					<td>前25%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>45.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-avgen-bench-a-task-driven-benchmark-for-multi">AVGen-Bench: A Task-Driven Benchmark for Multi-Granular</a></td>
					<td>7.3分</td>
					<td>前50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>46.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-bridging-the-stability-expressivity-gap-synthetic">Bridging the Stability-Expressivity Gap: Synthetic Data</a></td>
					<td>7.3分</td>
					<td>前50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>47.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-avtrack-audio-visual-speaker-tracking-in-complex">AVTrack: Audio-Visual Speaker Tracking in Complex Scene</a></td>
					<td>7.3分</td>
					<td>前50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>48.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-bioacoustic-geolocation-species-sounds-as">Bioacoustic Geolocation: Species Sounds as Geographic S</a></td>
					<td>7.2分</td>
					<td>前50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>49.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-adept-rl-aligned-agentic-decoding-of-emotion-via">ADEPT: RL-Aligned Agentic Decoding of Emotion via Evide</a></td>
					<td>7.2/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>50.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-mecat-a-multi-experts-constructed-benchmark-for">MECAT: A Multi-Experts Constructed Benchmark for Fine-G</a></td>
					<td>7.2/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>51.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-spear-a-unified-ssl-framework-for-learning-speech">SPEAR: A Unified SSL Framework for Learning Speech and </a></td>
					<td>7.2/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>52.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-pads-tal-padding-annealed-diffusion-sampling-in">PADS-TAL: Padding-Annealed Diffusion Sampling in Text-A</a></td>
					<td>7.2/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>53.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-multimodal-latent-language-modeling-with-next">Multimodal Latent Language Modeling with Next-Token Dif</a></td>
					<td>7.2/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>54.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-query-based-asymmetric-modeling-with-decoupled">Query-Based Asymmetric Modeling with Decoupled Input–Ou</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>55.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-agentsteertts-a-multi-agent-closed-loop-framework">AgentSteerTTS: A Multi-Agent Closed-Loop Framework for </a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>56.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-optimality-of-fsq-tokens-for-continuous-diffusion">Optimality of FSQ tokens for continuous diffusion for c</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>57.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-jaeger-joint-3d-audio-visual-grounding-and">JAEGER: Joint 3D Audio-Visual Grounding and Reasoning i</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>58.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-sonicmaster-towards-controllable-all-in-one-music">SonicMaster: Towards Controllable All-in-One Music Rest</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>59.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-vibe-disentangling-social-dynamics-via-kinematics">VIBE: Disentangling Social Dynamics via Kinematics-Info</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>60.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-reasoning-llm-improves-speaker-recognition-in">Reasoning LLM Improves Speaker Recognition in Long-form</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>61.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-a-semantically-consistent-dataset-for-data">A Semantically Consistent Dataset for Data-Efficient Qu</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>62.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-the-silent-thought-modeling-internal-cognition-in">The Silent Thought: Modeling Internal Cognition in Full</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>63.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-learning-tight-rejection-boundaries-without">Learning Tight Rejection Boundaries without Negatives f</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>64.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-quaternion-self-attention-with-shared-scores">Quaternion Self-Attention with Shared Scores</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>65.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-bridging-your-imagination-with-audio-video">Bridging Your Imagination with Audio-Video Generation v</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>66.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-textme-bridging-unseen-modalities-through-text">TextME: Bridging Unseen Modalities Through Text Descrip</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>67.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-regen-hierarchical-multi-prompt-representation">ReGen: Hierarchical Multi-Prompt Representation Generat</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>68.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-polyphonia-training-free-context-aware-music">Polyphonia: Training-Free Context-Aware Music Editing w</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>69.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-tmd-bench-a-multi-level-evaluation-paradigm-for">TMD-Bench: A Multi-Level Evaluation Paradigm for Music–</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>70.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-omni-perception-policy-optimization-for">Omni-Perception Policy Optimization for Multimodal Emot</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>71.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-acoustic-interference-a-new-paradigm-weaponizing">Acoustic Interference: A New Paradigm Weaponizing Acous</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>72.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-audiochat-unified-audio-storytelling-editing-and">AudioChat: Unified Audio Storytelling, Editing, and Und</a></td>
					<td>7.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>73.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-do-audio-llms-listen-or-read-analyzing-and">Do Audio LLMs Listen or Read? Analyzing and Mitigating </a></td>
					<td>6.9/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>74.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-from-talking-to-singing-a-new-challenge-for-audio">From Talking to Singing: A New Challenge for Audio-Visu</a></td>
					<td>6.8/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>75.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-multiple-choice-learning-of-low-rank-adapters-for">Multiple Choice Learning of Low-Rank Adapters for Langu</a></td>
					<td>6.8/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>76.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-multimodal-fusion-via-self-consistent-task">Multimodal Fusion via Self-Consistent Task-Gradient Fie</a></td>
					<td>6.8/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>77.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-position-beyond-text-the-text-centric-bias-in">Position: <em>Beyond Text</em> The Text-Centric Bias in Founda</a></td>
					<td>6.8/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>78.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-metabio-learning-from-metadata-for-bioacoustics">MetaBio: Learning from metadata for bioacoustics founda</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>79.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-any-diffusion-unified-multimodal-understanding">Any-Diffusion: Unified Multimodal Understanding and Gen</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>80.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-sam-audio-segment-anything-in-audio">SAM Audio: Segment Anything in Audio</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>81.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-cocoemo-composable-and-controllable-human-like">CoCoEmo: Composable and Controllable Human-Like Emotion</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>82.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-hyperpotter-spell-the-charm-of-high-order">HyperPotter: Spell the Charm of High-Order Interactions</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>83.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-joint-enhancement-and-classification-using">Joint Enhancement and Classification using Coupled Diff</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>84.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-hearing-without-noticing-attention-aware-stealthy">Hearing Without Noticing? Attention-Aware Stealthy Blac</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>85.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-two-dimensional-quantization-for-geometry-aware">Two-dimensional quantization for geometry-aware audio c</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>86.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-salsa-v-shortcut-augmented-long-form-synchronized">SALSA-V: Shortcut-Augmented Long-form Synchronized Audi</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>87.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-rest-diffusion-based-real-time-end-to-end">REST: Diffusion-based Real-time End-to-end Streaming Ta</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>88.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-autagent-a-reinforcement-learning-framework-for">AuTAgent: A Reinforcement Learning Framework for Tool-A</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>89.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-characterizing-the-predictive-impact-of">Characterizing the Predictive Impact of Modalities with</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>90.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-group-cognition-learning-making-everything-better">Group Cognition Learning: Making Everything Better Thro</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>91.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-rethinking-attention-in-spiking-transformers">Rethinking Attention in Spiking Transformers: Overcomin</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>92.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-t2av-compass-towards-unified-evaluation-for-text">T2AV-Compass: Towards Unified Evaluation for Text-to-Au</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>93.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-s3audio-towards-streaming-synchronized-spatial">S3Audio: Towards Streaming Synchronized Spatial Audio G</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>94.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-sparse-autoencoders-for-interpretable-emotion">Sparse Autoencoders for Interpretable Emotion Control i</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>95.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-bat-better-audio-transformer-guided-by-convex">BAT: Better Audio Transformer Guided by Convex Gated Pr</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>96.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-ag-repa-causal-layer-selection-for-representation">AG-REPA: Causal Layer Selection for Representation Alig</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>97.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-cola-cross-modal-low-rank-adaptation-for">CoLA: Cross-Modal Low-rank Adaptation for Multimodal Do</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>98.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-neural-inspired-modeling-of-auditory-selection">Neural-Inspired Modeling of Auditory Selection and Comp</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>99.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-futureomni-evaluating-future-forecasting-from">FutureOmni: Evaluating Future Forecasting from Omni-Mod</a></td>
					<td>6.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>100.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-proactivellm-learning-active-interaction-for">ProactiveLLM: Learning Active Interaction for Streaming</a></td>
					<td>6.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>101.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-video-salmonn-s-memory-enhanced-streaming-audio">video-SALMONN S: Memory-Enhanced Streaming Audio-Visual</a></td>
					<td>6.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>102.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-zero-shot-rankability-revealing-latent-ordinal">Zero-Shot Rankability: Revealing Latent Ordinal Structu</a></td>
					<td>6.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>103.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-scaling-transformers-for-end-to-end-discrete">Scaling Transformers for End-to-End Discrete Audio Toke</a></td>
					<td>6.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>104.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-evaluating-and-rewarding-lalms-for-expressive">Evaluating and Rewarding LALMs for Expressive Role-Play</a></td>
					<td>6.0/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>105.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-unlocking-speechtext-compositional-powers">Unlocking Speech–Text Compositional Powers: Instruction</a></td>
					<td>5.8/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>106.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-probing-cross-modal-information-hubs-in-audio">Probing Cross-modal Information Hubs in Audio-Visual LL</a></td>
					<td>5.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>107.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-omnishow-orchestrating-multimodal-conditions-for">OmniShow: Orchestrating Multimodal Conditions for Human</a></td>
					<td>5.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>108.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-sparse-tokens-suffice-jailbreaking-audio-language">Sparse Tokens Suffice: Jailbreaking Audio Language Mode</a></td>
					<td>5.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>109.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-phalar-phasors-for-learned-musical-audio">PHALAR: Phasors for Learned Musical Audio Representatio</a></td>
					<td>5.5/10</td>
					<td>Top 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>110.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-scaling-laws-in-model-fine-tuning-for-audio">Scaling Laws in Model Fine-tuning for Audio DeepFake De</a></td>
					<td>5.0/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>111.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-primcooperative-dynamic-token-compression-for">PRIM: Cooperative Dynamic Token Compression for Efficien</a></td>
					<td>4.8/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>112.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-towards-understanding-modality-interaction-in">Towards Understanding Modality Interaction in Multimoda</a></td>
					<td>4.5/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>113.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-from-inpainting-to-editing-unlocking-robust-mask">From Inpainting to Editing: Unlocking Robust Mask-Free </a></td>
					<td>4.3/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>114.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-sonar-spectralcontrastive-audio-residuals-for">SONAR: Spectral‑Contrastive Audio Residuals for General</a></td>
					<td>4.0/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>115.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-moshirag-asynchronous-knowledge-retrieval-for">MoshiRAG: Asynchronous Knowledge Retrieval for Full-Dup</a></td>
					<td>3.8/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>116.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-starcaster-spatio-temporal-autoregressive-video">STARCaster: Spatio-Temporal AutoRegressive Video Diffus</a></td>
					<td>3.5/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>117.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-wavessm-multiscale-state-space-models-for-non">WaveSSM: Multiscale State-Space Models for Non-stationa</a></td>
					<td>3.5/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>118.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-tau-voice-benchmarking-full-duplex-voice-agents">\(\tau\)-Voice: Benchmarking Full-Duplex Voice Agents on </a></td>
					<td>3.5/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>119.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-fakeworld-10-an-omni-modal-benchmark-for-fake">FakeWorld 1.0: An Omni modal Benchmark for Fake Media a</a></td>
					<td>3.5/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>120.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-lalm-as-a-judge-benchmarking-large-audio-language">LALM-as-a-Judge: Benchmarking Large Audio-Language Mode</a></td>
					<td>3.5/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>121.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-ivq-structured-and-lightweight-vector">IVQ: Structured and Lightweight Vector Quantization via</a></td>
					<td>3.2/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>122.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-mfcl-audio-an-audio-function-calling-evaluation">MFCL Audio: An Audio Function Calling Evaluation for La</a></td>
					<td>3.0/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
			<tr>
					<td>123.</td>
					<td><a href="/audio-paper-digest-blog/posts/2026-05-23-position-towards-responsible-evaluation-for-text">Position: Towards Responsible Evaluation for Text-to-Sp</a></td>
					<td>2.6/10</td>
					<td>Bottom 50%</td>
					<td>-</td>
			</tr>
	</tbody>
</table>
<hr>
<h2 id="-论文列表">📋 Paper List</h2>
<h3 id="-infer-learning-implicit-neural-frequency-response-fields-for-confined-acoustic-environments">🥇 <a href="/audio-paper-digest-blog/posts/2026-05-23-infer-learning-implicit-neural-frequency-response">INFER: Learning Implicit Neural Frequency Response Fields for Confined Acoustic Environments</a></h3>
<p>🔥 <strong>8.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/66526">arxiv</a></p>
<hr>
<h3 id="-vocsim-a-training-free-benchmark-for-zero-shot-content-identity-in-single-source-audio">🥈 <a href="/audio-paper-digest-blog/posts/2026-05-23-vocsim-a-training-free-benchmark-for-zero-shot">VocSim A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio</a></h3>
<p>🔥 <strong>8.3/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/61780">arxiv</a></p>
<hr>
<h3 id="-cmi-rewardbench-evaluating-music-reward-models-with-compositional-multimodal-instruction">🥉 <a href="/audio-paper-digest-blog/posts/2026-05-23-cmi-rewardbench-evaluating-music-reward-models">CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://arxiv.org/abs/2603.00610">arxiv</a></p>
<hr>
<h3 id="4-language-model-augmented-semi-supervised-statistical-inference">4. <a href="/audio-paper-digest-blog/posts/2026-05-23-language-model-augmented-semi-supervised">Language Model Augmented Semi-Supervised Statistical Inference</a></h3>
<p>🔥 <strong>8.2/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62984">arxiv</a></p>
<hr>
<h3 id="5-discoforcing-a-unified-framework-for-real-time-audio-driven-character-control-with-diffusion-forcing">5. <a href="/audio-paper-digest-blog/posts/2026-05-23-discoforcing-a-unified-framework-for-real-time">DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing</a></h3>
<p>🔥 <strong>8.2/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/66598">arxiv</a></p>
<hr>
<h3 id="6-abstraction-induces-the-brain-alignment-of-language-and-speech-models">6. <a href="/audio-paper-digest-blog/posts/2026-05-23-abstraction-induces-the-brain-alignment-of">Abstraction Induces the Brain Alignment of Language and Speech Models</a></h3>
<p>🔥 <strong>8.0/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/61772">arxiv</a></p>
<hr>
<h3 id="7-alethia-a-foundational-encoder-for-voice-deepfakes">7. <a href="/audio-paper-digest-blog/posts/2026-05-23-alethia-a-foundational-encoder-for-voice-deepfakes">Alethia: a Foundational Encoder for Voice Deepfakes</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://arxiv.org/abs/2605.00251">arxiv</a></p>
<hr>
<h3 id="8-omnidensecap-scripting-multi-scene-videos-with-time-aware-and-structural-audio-visual-captions">8. <a href="/audio-paper-digest-blog/posts/2026-05-23-omnidensecap-scripting-multi-scene-videos-with">OmniDenseCap: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions</a></h3>
<p>🔥 <strong>8.0/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/63829">arxiv</a></p>
<hr>
<h3 id="9-foeglass-when-simple-in-context-learning-is-enough-for-red-teaming-audio-deepfake-detectors">9. <a href="/audio-paper-digest-blog/posts/2026-05-23-foeglass-when-simple-in-context-learning-is">FoeGlass: When Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors</a></h3>
<p>🔥 <strong>8.0/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64852">arxiv</a></p>
<hr>
<h3 id="10-e-vads-an-e-commerce-short-videos-understanding-benchmark-for-mllms">10. <a href="/audio-paper-digest-blog/posts/2026-05-23-e-vads-an-e-commerce-short-videos-understanding">E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs</a></h3>
<p>🔥 <strong>8.0/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64911">arxiv</a></p>
<hr>
<h3 id="11-beat-tokenizing-and-generating-symbolic-music-by-uniform-temporal-steps">11. <a href="/audio-paper-digest-blog/posts/2026-05-23-beat-tokenizing-and-generating-symbolic-music-by">BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps</a></h3>
<p>🔥 <strong>8.0/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/63349">arxiv</a></p>
<hr>
<h3 id="12-pianist-transformer-towards-expressive-piano-performance-rendering-via-scalable-self-supervised-pre-training">12. <a href="/audio-paper-digest-blog/posts/2026-05-23-pianist-transformer-towards-expressive-piano">Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training</a></h3>
<p>✅ <strong>7.8/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/61542">arxiv</a></p>
<hr>
<h3 id="13-dreamid-omni-unified-framework-for-controllable-human-centric-audio-video-generation">13. <a href="/audio-paper-digest-blog/posts/2026-05-23-dreamid-omni-unified-framework-for-controllable">DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation</a></h3>
<p>✅ <strong>7.8/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62948">arxiv</a></p>
<hr>
<h3 id="14-real-world-unsupervised-models-generalize-to-predict-brain-responses-to-out-of-distribution-stimuli">14. <a href="/audio-paper-digest-blog/posts/2026-05-23-real-world-unsupervised-models-generalize-to">Real-World Unsupervised Models Generalize to Predict Brain Responses to Out-of-Distribution Stimuli</a></h3>
<p>✅ <strong>7.8/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/65072">arxiv</a></p>
<hr>
<h3 id="15-audiomosaic-contrastive-masked-audio-representation-learning">15. <a href="/audio-paper-digest-blog/posts/2026-05-23-audiomosaic-contrastive-masked-audio">AudioMosaic: Contrastive Masked Audio Representation Learning</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64345">arxiv</a></p>
<hr>
<h3 id="16-self-guidance-enhancing-neural-codecs-via-decoder-manifold-alignment">16. <a href="/audio-paper-digest-blog/posts/2026-05-23-self-guidance-enhancing-neural-codecs-via-decoder">Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62733">arxiv</a></p>
<hr>
<h3 id="17-lynx-token-interface-alignment-for-videox-llms">17. <a href="/audio-paper-digest-blog/posts/2026-05-23-lynx-token-interface-alignment-for-videox-llms">LynX: Token Interface Alignment for Video+X LLMs</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/61725">arxiv</a></p>
<hr>
<h3 id="18-spherical-procrustes-alignment-for-reliable-medical-audio-diagnosis">18. <a href="/audio-paper-digest-blog/posts/2026-05-23-spherical-procrustes-alignment-for-reliable">Spherical Procrustes Alignment for Reliable Medical Audio Diagnosis</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62817">arxiv</a></p>
<hr>
<h3 id="19-most-mixing-speech-and-text-with-modality-aware-mixture-of-experts">19. <a href="/audio-paper-digest-blog/posts/2026-05-23-most-mixing-speech-and-text-with-modality-aware">MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/66513">arxiv</a></p>
<hr>
<h3 id="20-self-supervised-flow-matching-for-scalable-multi-modal-synthesis">20. <a href="/audio-paper-digest-blog/posts/2026-05-23-self-supervised-flow-matching-for-scalable-multi">Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/65011">arxiv</a></p>
<hr>
<h3 id="21-lightavseg-lightweight-audio-visual-segmentation">21. <a href="/audio-paper-digest-blog/posts/2026-05-23-lightavseg-lightweight-audio-visual-segmentation">LightAVSeg: Lightweight Audio-Visual Segmentation</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/63361">arxiv</a></p>
<hr>
<h3 id="22-robust-signal-enhancement-via-fractional-detail-views-and-knowledge-guided-multi-view-fusion">22. <a href="/audio-paper-digest-blog/posts/2026-05-23-robust-signal-enhancement-via-fractional-detail">Robust Signal Enhancement via Fractional Detail Views and Knowledge Guided Multi-view Fusion</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62255">arxiv</a></p>
<hr>
<h3 id="23-echoingpixels-aliasing-resistant-joint-token-reduction-for-audio-visual-llms">23. <a href="/audio-paper-digest-blog/posts/2026-05-23-echoingpixels-aliasing-resistant-joint-token">EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62697">arxiv</a></p>
<hr>
<h3 id="24-long-grounded-thoughts-synthesizing-grounded-visual-problems-and-distilling-reasoning-chains-at-scale">24. <a href="/audio-paper-digest-blog/posts/2026-05-23-long-grounded-thoughts-synthesizing-grounded">Long Grounded Thoughts: Synthesizing Grounded Visual Problems and Distilling Reasoning Chains at Scale</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/63138">arxiv</a></p>
<hr>
<h3 id="25-omnivideo-r1-reinforcing-audio-visual-reasoning-with-query-intention-and-modality-attention">25. <a href="/audio-paper-digest-blog/posts/2026-05-23-omnivideo-r1-reinforcing-audio-visual-reasoning">OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://arxiv.org/abs/2602.05847">arxiv</a></p>
<hr>
<h3 id="26-ariadne">26. <a href="/audio-paper-digest-blog/posts/2026-05-23-ariadnes-thread-of-lipsync-unraveling-forgeries">Ariadne&rsquo;s Thread of LipSync: Unraveling Forgeries via Inconsistency between Lip Motions and Head Poses</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/60674">arxiv</a></p>
<hr>
<h3 id="27-avi-bench-toward-human-like-audio-visual-intelligence-of-omni-mllms">27. <a href="/audio-paper-digest-blog/posts/2026-05-23-avi-bench-toward-human-like-audio-visual">AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/60474">arxiv</a></p>
<hr>
<h3 id="28-simultaneous-speech-to-speech-translation-without-aligned-data">28. <a href="/audio-paper-digest-blog/posts/2026-05-23-simultaneous-speech-to-speech-translation-without">Simultaneous Speech-to-Speech Translation Without Aligned Data</a></h3>
<p>🔥 <strong>8.2/10</strong> | Top 25% | <a href="https://arxiv.org/abs/2602.11072">arxiv</a></p>
<hr>
<h3 id="29-phostream-benchmarking-real-world-streaming-for-omnimodal-assistants-in-mobile-scenarios">29. <a href="/audio-paper-digest-blog/posts/2026-05-23-phostream-benchmarking-real-world-streaming-for">PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62888">arxiv</a></p>
<hr>
<h3 id="30-omnisift-modality-asymmetric-token-compression-for-efficient-omni-modal-large-language-models">30. <a href="/audio-paper-digest-blog/posts/2026-05-23-omnisift-modality-asymmetric-token-compression">OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/63063">arxiv</a></p>
<hr>
<h3 id="31-speech-audio-compositional-attacks-on-multimodal-llms-and-their-defense-with-salmonn-guard">31. <a href="/audio-paper-digest-blog/posts/2026-05-23-speech-audio-compositional-attacks-on-multimodal">Speech-Audio Compositional Attacks on Multimodal LLMs and Their Defense with SALMONN-Guard</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64737">arxiv</a></p>
<hr>
<h3 id="32-convex-low-resource-accent-robust-language-detection-in-speech-recognition">32. <a href="/audio-paper-digest-blog/posts/2026-05-23-convex-low-resource-accent-robust-language">Convex Low-resource Accent-Robust Language Detection in Speech Recognition</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64615">arxiv</a></p>
<hr>
<h3 id="33-phasecoder-microphone-geometry-agnostic-spatial-audio-understanding-for-multimodal-llms">33. <a href="/audio-paper-digest-blog/posts/2026-05-23-phasecoder-microphone-geometry-agnostic-spatial">PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/61119">arxiv</a></p>
<hr>
<h3 id="34-listening-through-the-noise-cauchy-driven-diffusion-bridges-for-robust-gastrointestinal-auscultation-and-clinical-benchmarking">34. <a href="/audio-paper-digest-blog/posts/2026-05-23-listening-through-the-noise-cauchy-driven">Listening Through the Noise: Cauchy-Driven Diffusion Bridges for Robust Gastrointestinal Auscultation and Clinical Benchmarking</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/65341">arxiv</a></p>
<hr>
<h3 id="35-dual-view-predictive-diffusion-lightweight-speech-enhancement-via-spectrogram-image-synergy">35. <a href="/audio-paper-digest-blog/posts/2026-05-23-dual-view-predictive-diffusion-lightweight-speech">Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/66431">arxiv</a></p>
<hr>
<h3 id="36-stream-rag-instant-and-accurate-spoken-dialogue-systems-with-streaming-tool-usage">36. <a href="/audio-paper-digest-blog/posts/2026-05-23-stream-rag-instant-and-accurate-spoken-dialogue">Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64464">arxiv</a></p>
<hr>
<h3 id="37-naaca-training-free-neuroauditory-attentive-cognitive-architecture-with-oscillatory-working-memory-for-salience-driven-attention-gating">37. <a href="/audio-paper-digest-blog/posts/2026-05-23-naaca-training-free-neuroauditory-attentive">NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/60523">arxiv</a></p>
<hr>
<h3 id="38-medmosaic-a-challenging-large-scale-benchmark-of-diverse-medical-audio">38. <a href="/audio-paper-digest-blog/posts/2026-05-23-medmosaic-a-challenging-large-scale-benchmark-of">MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64360">arxiv</a></p>
<hr>
<h3 id="39-verifiable-multimodal-reasoning-fact-level-attribution-with-multimodal-sources">39. <a href="/audio-paper-digest-blog/posts/2026-05-23-verifiable-multimodal-reasoning-fact-level">Verifiable Multimodal Reasoning: Fact-level Attribution with Multimodal Sources</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/64973">arxiv</a></p>
<hr>
<h3 id="40-musicdet-zero-shot-ai-generated-music-detection">40. <a href="/audio-paper-digest-blog/posts/2026-05-23-musicdet-zero-shot-ai-generated-music-detection">MusicDET: Zero-Shot AI-Generated Music Detection</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/65106">arxiv</a></p>
<hr>
<h3 id="41-pcrnet-phase-aware-complex-refinement-network-for-eeg-based-auditory-attention-decoding">41. <a href="/audio-paper-digest-blog/posts/2026-05-23-pcrnet-phase-aware-complex-refinement-network-for">PCRNet: Phase-aware Complex Refinement Network for EEG-based Auditory Attention Decoding</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/65989">arxiv</a></p>
<hr>
<h3 id="42-sarsteer-safeguarding-large-audio-language-models-via-safe-ablated-refusal-steering">42. <a href="/audio-paper-digest-blog/posts/2026-05-23-sarsteer-safeguarding-large-audio-language-models">SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/66551">arxiv</a></p>
<hr>
<h3 id="43-star-vae-structured-topology-aware-regularization-for-audio-reconstruction-and-generation">43. <a href="/audio-paper-digest-blog/posts/2026-05-23-star-vae-structured-topology-aware-regularization">STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/63959">arxiv</a></p>
<hr>
<h3 id="44-hidden-in-plain-tokens-simply-robust-gradient-free-watermark-for-synthetic-audio">44. <a href="/audio-paper-digest-blog/posts/2026-05-23-hidden-in-plain-tokens-simply-robust-gradient">Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio</a></h3>
<p>✅ <strong>7.5/10</strong> | Top 25% | <a href="https://icml.cc/virtual/2026/poster/62388">arxiv</a></p>
<hr>
<h3 id="45-avgen-bench-a-task-driven-benchmark-for-multi-granular-evaluation-of-text-to-audio-video-generation">45. <a href="/audio-paper-digest-blog/posts/2026-05-23-avgen-bench-a-task-driven-benchmark-for-multi">AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation</a></h3>
<p>✅ <strong>7.3/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63075">arxiv</a></p>
<hr>
<h3 id="46-bridging-the-stability-expressivity-gap-synthetic-data-scaling-and-preference-alignment-for-low-resource-spoken-language-models">46. <a href="/audio-paper-digest-blog/posts/2026-05-23-bridging-the-stability-expressivity-gap-synthetic">Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models</a></h3>
<p>✅ <strong>7.3/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63107">arxiv</a></p>
<hr>
<h3 id="47-avtrack-audio-visual-speaker-tracking-in-complex-scenes">47. <a href="/audio-paper-digest-blog/posts/2026-05-23-avtrack-audio-visual-speaker-tracking-in-complex">AVTrack: Audio-Visual Speaker Tracking in Complex Scenes</a></h3>
<p>✅ <strong>7.3/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65747">arxiv</a></p>
<hr>
<h3 id="48-bioacoustic-geolocation-species-sounds-as-geographic-signals">48. <a href="/audio-paper-digest-blog/posts/2026-05-23-bioacoustic-geolocation-species-sounds-as">Bioacoustic Geolocation: Species Sounds as Geographic Signals</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61581">arxiv</a></p>
<hr>
<h3 id="49-adept-rl-aligned-agentic-decoding-of-emotion-via-evidence-probing-tools--from-consensus-learning-to-ambiguity-driven-emotion-reasoning">49. <a href="/audio-paper-digest-blog/posts/2026-05-23-adept-rl-aligned-agentic-decoding-of-emotion-via">ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools — From Consensus Learning to Ambiguity-Driven Emotion Reasoning</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61830">arxiv</a></p>
<hr>
<h3 id="50-mecat-a-multi-experts-constructed-benchmark-for-fine-grained-audio-understanding-tasks">50. <a href="/audio-paper-digest-blog/posts/2026-05-23-mecat-a-multi-experts-constructed-benchmark-for">MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64529">arxiv</a></p>
<hr>
<h3 id="51-spear-a-unified-ssl-framework-for-learning-speech-and-audio-representations">51. <a href="/audio-paper-digest-blog/posts/2026-05-23-spear-a-unified-ssl-framework-for-learning-speech">SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/60930">arxiv</a></p>
<hr>
<h3 id="52-pads-tal-padding-annealed-diffusion-sampling-in-text-aware-latent-space-for-robust-and-diverse-text-to-music-generation">52. <a href="/audio-paper-digest-blog/posts/2026-05-23-pads-tal-padding-annealed-diffusion-sampling-in">PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/62907">arxiv</a></p>
<hr>
<h3 id="53-multimodal-latent-language-modeling-with-next-token-diffusion">53. <a href="/audio-paper-digest-blog/posts/2026-05-23-multimodal-latent-language-modeling-with-next">Multimodal Latent Language Modeling with Next-Token Diffusion</a></h3>
<p>✅ <strong>7.2/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64225">arxiv</a></p>
<hr>
<h3 id="54-query-based-asymmetric-modeling-with-decoupled-inputoutput-rates-for-speech-restoration">54. <a href="/audio-paper-digest-blog/posts/2026-05-23-query-based-asymmetric-modeling-with-decoupled">Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61662">arxiv</a></p>
<hr>
<h3 id="55-agentsteertts-a-multi-agent-closed-loop-framework-for-composite-instruction-text-to-speech">55. <a href="/audio-paper-digest-blog/posts/2026-05-23-agentsteertts-a-multi-agent-closed-loop-framework">AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/62320">arxiv</a></p>
<hr>
<h3 id="56-optimality-of-fsq-tokens-for-continuous-diffusion-for-categorical-data-with-application-to-text-to-speech">56. <a href="/audio-paper-digest-blog/posts/2026-05-23-optimality-of-fsq-tokens-for-continuous-diffusion">Optimality of FSQ tokens for continuous diffusion for categorical data with application to text-to-speech</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/66740">arxiv</a></p>
<hr>
<h3 id="57-jaeger-joint-3d-audio-visual-grounding-and-reasoning-in-simulated-physical-environments">57. <a href="/audio-paper-digest-blog/posts/2026-05-23-jaeger-joint-3d-audio-visual-grounding-and">JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61417">arxiv</a></p>
<hr>
<h3 id="58-sonicmaster-towards-controllable-all-in-one-music-restoration-and-mastering">58. <a href="/audio-paper-digest-blog/posts/2026-05-23-sonicmaster-towards-controllable-all-in-one-music">SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63870">arxiv</a></p>
<hr>
<h3 id="59-vibe-disentangling-social-dynamics-via-kinematics-informed-variational-inference-for-behavioral-emotion">59. <a href="/audio-paper-digest-blog/posts/2026-05-23-vibe-disentangling-social-dynamics-via-kinematics">VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/62321">arxiv</a></p>
<hr>
<h3 id="60-reasoning-llm-improves-speaker-recognition-in-long-form-tv-dramas">60. <a href="/audio-paper-digest-blog/posts/2026-05-23-reasoning-llm-improves-speaker-recognition-in">Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/62391">arxiv</a></p>
<hr>
<h3 id="61-a-semantically-consistent-dataset-for-data-efficient-query-based-universal-sound-separation">61. <a href="/audio-paper-digest-blog/posts/2026-05-23-a-semantically-consistent-dataset-for-data">A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/60936">arxiv</a></p>
<hr>
<h3 id="62-the-silent-thought-modeling-internal-cognition-in-full-duplex-spoken-dialogue-models-via-latent-reasoning">62. <a href="/audio-paper-digest-blog/posts/2026-05-23-the-silent-thought-modeling-internal-cognition-in">The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning</a></h3>
<p>🔥 <strong>8.5/10</strong> | Top 25% | <a href="https://arxiv.org/abs/2603.17837">arxiv</a></p>
<hr>
<h3 id="63-learning-tight-rejection-boundaries-without-negatives-for-strict-one-class-audio-deepfake-detection">63. <a href="/audio-paper-digest-blog/posts/2026-05-23-learning-tight-rejection-boundaries-without">Learning Tight Rejection Boundaries without Negatives for Strict One-Class Audio Deepfake Detection</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63118">arxiv</a></p>
<hr>
<h3 id="64-quaternion-self-attention-with-shared-scores">64. <a href="/audio-paper-digest-blog/posts/2026-05-23-quaternion-self-attention-with-shared-scores">Quaternion Self-Attention with Shared Scores</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63383">arxiv</a></p>
<hr>
<h3 id="65-bridging-your-imagination-with-audio-video-generation-via-a-unified-director">65. <a href="/audio-paper-digest-blog/posts/2026-05-23-bridging-your-imagination-with-audio-video">Bridging Your Imagination with Audio-Video Generation via a Unified Director</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65259">arxiv</a></p>
<hr>
<h3 id="66-textme-bridging-unseen-modalities-through-text-descriptions">66. <a href="/audio-paper-digest-blog/posts/2026-05-23-textme-bridging-unseen-modalities-through-text">TextME: Bridging Unseen Modalities Through Text Descriptions</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63946">arxiv</a></p>
<hr>
<h3 id="67-regen-hierarchical-multi-prompt-representation-generation-for-efficient-waveform-diffusion-models">67. <a href="/audio-paper-digest-blog/posts/2026-05-23-regen-hierarchical-multi-prompt-representation">ReGen: Hierarchical Multi-Prompt Representation Generation for Efficient Waveform Diffusion Models</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63856">arxiv</a></p>
<hr>
<h3 id="68-polyphonia-training-free-context-aware-music-editing-with-acoustic-informed-attention-calibration">68. <a href="/audio-paper-digest-blog/posts/2026-05-23-polyphonia-training-free-context-aware-music">Polyphonia: Training-Free Context-Aware Music Editing with Acoustic-Informed Attention Calibration</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64662">arxiv</a></p>
<hr>
<h3 id="69-tmd-bench-a-multi-level-evaluation-paradigm-for-musicdance-co-generation">69. <a href="/audio-paper-digest-blog/posts/2026-05-23-tmd-bench-a-multi-level-evaluation-paradigm-for">TMD-Bench: A Multi-Level Evaluation Paradigm for Music–Dance Co-Generation</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65225">arxiv</a></p>
<hr>
<h3 id="70-omni-perception-policy-optimization-for-multimodal-emotion-reasoning">70. <a href="/audio-paper-digest-blog/posts/2026-05-23-omni-perception-policy-optimization-for">Omni-Perception Policy Optimization for Multimodal Emotion Reasoning</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/66328">arxiv</a></p>
<hr>
<h3 id="71-acoustic-interference-a-new-paradigm-weaponizing-acoustic-latent-semantic-for-universal-jailbreak-against-large-audio-language-models">71. <a href="/audio-paper-digest-blog/posts/2026-05-23-acoustic-interference-a-new-paradigm-weaponizing">Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65189">arxiv</a></p>
<hr>
<h3 id="72-audiochat-unified-audio-storytelling-editing-and-understanding-with-transfusion-forcing">72. <a href="/audio-paper-digest-blog/posts/2026-05-23-audiochat-unified-audio-storytelling-editing-and">AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing</a></h3>
<p>✅ <strong>7.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/66763">arxiv</a></p>
<hr>
<h3 id="73-do-audio-llms-listen-or-read-analyzing-and-mitigating-paralinguistic-failures-with-voxparadox">73. <a href="/audio-paper-digest-blog/posts/2026-05-23-do-audio-llms-listen-or-read-analyzing-and">Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox</a></h3>
<p>✅ <strong>6.9/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/60946">arxiv</a></p>
<hr>
<h3 id="74-from-talking-to-singing-a-new-challenge-for-audio-visual-deepfake-detection">74. <a href="/audio-paper-digest-blog/posts/2026-05-23-from-talking-to-singing-a-new-challenge-for-audio">From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection</a></h3>
<p>✅ <strong>6.8/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/62663">arxiv</a></p>
<hr>
<h3 id="75-multiple-choice-learning-of-low-rank-adapters-for-language-modeling">75. <a href="/audio-paper-digest-blog/posts/2026-05-23-multiple-choice-learning-of-low-rank-adapters-for">Multiple Choice Learning of Low-Rank Adapters for Language Modeling</a></h3>
<p>✅ <strong>6.8/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65583">arxiv</a></p>
<hr>
<h3 id="76-multimodal-fusion-via-self-consistent-task-gradient-fields">76. <a href="/audio-paper-digest-blog/posts/2026-05-23-multimodal-fusion-via-self-consistent-task">Multimodal Fusion via Self-Consistent Task-Gradient Fields</a></h3>
<p>✅ <strong>6.8/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64750">arxiv</a></p>
<hr>
<h3 id="77-position">77. <a href="/audio-paper-digest-blog/posts/2026-05-23-position-beyond-text-the-text-centric-bias-in">Position: <em>Beyond Text</em> The Text-Centric Bias in Foundation Models Must Be Revisited for a Speech-First Future</a></h3>
<p>✅ <strong>6.8/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/67080">arxiv</a></p>
<hr>
<h3 id="78-metabio-learning-from-metadata-for-bioacoustics-foundation-models">78. <a href="/audio-paper-digest-blog/posts/2026-05-23-metabio-learning-from-metadata-for-bioacoustics">MetaBio: Learning from metadata for bioacoustics foundation models</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61415">arxiv</a></p>
<hr>
<h3 id="79-any-diffusion-unified-multimodal-understanding-and-generation-with-masked-discrete-diffusion">79. <a href="/audio-paper-digest-blog/posts/2026-05-23-any-diffusion-unified-multimodal-understanding">Any-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65305">arxiv</a></p>
<hr>
<h3 id="80-sam-audio-segment-anything-in-audio">80. <a href="/audio-paper-digest-blog/posts/2026-05-23-sam-audio-segment-anything-in-audio">SAM Audio: Segment Anything in Audio</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64195">arxiv</a></p>
<hr>
<h3 id="81-cocoemo-composable-and-controllable-human-like-emotional-tts-via-activation-steering">81. <a href="/audio-paper-digest-blog/posts/2026-05-23-cocoemo-composable-and-controllable-human-like">CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/60804">arxiv</a></p>
<hr>
<h3 id="82-hyperpotter-spell-the-charm-of-high-order-interactions-in-audio-deepfake-detection">82. <a href="/audio-paper-digest-blog/posts/2026-05-23-hyperpotter-spell-the-charm-of-high-order">HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/60926">arxiv</a></p>
<hr>
<h3 id="83-joint-enhancement-and-classification-using-coupled-diffusion-models-of-signals-and-logits">83. <a href="/audio-paper-digest-blog/posts/2026-05-23-joint-enhancement-and-classification-using">Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/60943">arxiv</a></p>
<hr>
<h3 id="84-hearing-without-noticing-attention-aware-stealthy-black-box-adversarial-audio-attacks">84. <a href="/audio-paper-digest-blog/posts/2026-05-23-hearing-without-noticing-attention-aware-stealthy">Hearing Without Noticing? Attention-Aware Stealthy Black-box Adversarial Audio Attacks</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63275">arxiv</a></p>
<hr>
<h3 id="85-two-dimensional-quantization-for-geometry-aware-audio-coding">85. <a href="/audio-paper-digest-blog/posts/2026-05-23-two-dimensional-quantization-for-geometry-aware">Two-dimensional quantization for geometry-aware audio coding</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64809">arxiv</a></p>
<hr>
<h3 id="86-salsa-v-shortcut-augmented-long-form-synchronized-audio-from-videos">86. <a href="/audio-paper-digest-blog/posts/2026-05-23-salsa-v-shortcut-augmented-long-form-synchronized">SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos</a></h3>
<p><a href="https://icml.cc/virtual/2026/poster/63032">arxiv</a></p>
<hr>
<h3 id="87-rest-diffusion-based-real-time-end-to-end-streaming-talking-head-generation-via-id-context-caching-and-asynchronous-streaming-distillation">87. <a href="/audio-paper-digest-blog/posts/2026-05-23-rest-diffusion-based-real-time-end-to-end">REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61678">arxiv</a></p>
<hr>
<h3 id="88-autagent-a-reinforcement-learning-framework-for-tool-augmented-audio-reasoning">88. <a href="/audio-paper-digest-blog/posts/2026-05-23-autagent-a-reinforcement-learning-framework-for">AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64128">arxiv</a></p>
<hr>
<h3 id="89-characterizing-the-predictive-impact-of-modalities-with-supervised-latent-variable-modeling">89. <a href="/audio-paper-digest-blog/posts/2026-05-23-characterizing-the-predictive-impact-of">Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64594">arxiv</a></p>
<hr>
<h3 id="90-group-cognition-learning-making-everything-better-through-controlled-two-stage-agents-collaboration">90. <a href="/audio-paper-digest-blog/posts/2026-05-23-group-cognition-learning-making-everything-better">Group Cognition Learning: Making Everything Better Through Controlled Two-Stage Agents Collaboration</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61162">arxiv</a></p>
<hr>
<h3 id="91-rethinking-attention-in-spiking-transformers-overcoming-density-bias-with-set-similarity">91. <a href="/audio-paper-digest-blog/posts/2026-05-23-rethinking-attention-in-spiking-transformers">Rethinking Attention in Spiking Transformers: Overcoming Density Bias with Set Similarity</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65945">arxiv</a></p>
<hr>
<h3 id="92-t2av-compass-towards-unified-evaluation-for-text-to-audio-video-generation">92. <a href="/audio-paper-digest-blog/posts/2026-05-23-t2av-compass-towards-unified-evaluation-for-text">T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65960">arxiv</a></p>
<hr>
<h3 id="93-s3audio-towards-streaming-synchronized-spatial-audio-generation-via-autoregressive-diffusion-transformer">93. <a href="/audio-paper-digest-blog/posts/2026-05-23-s3audio-towards-streaming-synchronized-spatial">S3Audio: Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64119">arxiv</a></p>
<hr>
<h3 id="94-sparse-autoencoders-for-interpretable-emotion-control-in-text-to-speech">94. <a href="/audio-paper-digest-blog/posts/2026-05-23-sparse-autoencoders-for-interpretable-emotion">Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65029">arxiv</a></p>
<hr>
<h3 id="95-bat-better-audio-transformer-guided-by-convex-gated-probing">95. <a href="/audio-paper-digest-blog/posts/2026-05-23-bat-better-audio-transformer-guided-by-convex">BAT: Better Audio Transformer Guided by Convex Gated Probing</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65873">arxiv</a></p>
<hr>
<h3 id="96-ag-repa-causal-layer-selection-for-representation-alignment-in-audio-flow-matching">96. <a href="/audio-paper-digest-blog/posts/2026-05-23-ag-repa-causal-layer-selection-for-representation">AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65899">arxiv</a></p>
<hr>
<h3 id="97-cola-cross-modal-low-rank-adaptation-for-multimodal-downstream-tasks">97. <a href="/audio-paper-digest-blog/posts/2026-05-23-cola-cross-modal-low-rank-adaptation-for">CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/65985">arxiv</a></p>
<hr>
<h3 id="98-neural-inspired-modeling-of-auditory-selection-and-compensation-for-audio-visual-speech-separation">98. <a href="/audio-paper-digest-blog/posts/2026-05-23-neural-inspired-modeling-of-auditory-selection">Neural-Inspired Modeling of Auditory Selection and Compensation for Audio-Visual Speech Separation</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/66271">arxiv</a></p>
<hr>
<h3 id="99-futureomni-evaluating-future-forecasting-from-omni-modal-context-for-multimodal-llms">99. <a href="/audio-paper-digest-blog/posts/2026-05-23-futureomni-evaluating-future-forecasting-from">FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/66092">arxiv</a></p>
<hr>
<h3 id="100-proactivellm-learning-active-interaction-for-streaming-large-language-models">100. <a href="/audio-paper-digest-blog/posts/2026-05-23-proactivellm-learning-active-interaction-for">ProactiveLLM: Learning Active Interaction for Streaming Large Language Models</a></h3>
<p>✅ <strong>6.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61196">arxiv</a></p>
<hr>
<h3 id="101-video-salmonn-s-memory-enhanced-streaming-audio-visual-llm">101. <a href="/audio-paper-digest-blog/posts/2026-05-23-video-salmonn-s-memory-enhanced-streaming-audio">video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM</a></h3>
<p>✅ <strong>6.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61140">arxiv</a></p>
<hr>
<h3 id="102-zero-shot-rankability-revealing-latent-ordinal-structure-in-multimodal-large-language-models-via-language">102. <a href="/audio-paper-digest-blog/posts/2026-05-23-zero-shot-rankability-revealing-latent-ordinal">Zero-Shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language</a></h3>
<p>✅ <strong>6.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/64000">arxiv</a></p>
<hr>
<h3 id="103-scaling-transformers-for-end-to-end-discrete-audio-tokenization">103. <a href="/audio-paper-digest-blog/posts/2026-05-23-scaling-transformers-for-end-to-end-discrete">Scaling Transformers for End-to-End Discrete Audio Tokenization</a></h3>
<p>✅ <strong>6.0/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/62215">arxiv</a></p>
<hr>
<h3 id="104-evaluating-and-rewarding-lalms-for-expressive-role-play-tts-via-mean-continuation-log-probability">104. <a href="/audio-paper-digest-blog/posts/2026-05-23-evaluating-and-rewarding-lalms-for-expressive">Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability</a></h3>
<p>✅ <strong>6.5/10</strong> | Top 50% | <a href="https://arxiv.org/abs/2601.22661">arxiv</a></p>
<hr>
<h3 id="105-unlocking-speechtext-compositional-powers-instruction-following-speech-language-models-without-instruction-tuning">105. <a href="/audio-paper-digest-blog/posts/2026-05-23-unlocking-speechtext-compositional-powers">Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning</a></h3>
<p>📝 <strong>5.8/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/63961">arxiv</a></p>
<hr>
<h3 id="106-probing-cross-modal-information-hubs-in-audio-visual-llms">106. <a href="/audio-paper-digest-blog/posts/2026-05-23-probing-cross-modal-information-hubs-in-audio">Probing Cross-modal Information Hubs in Audio-Visual LLMs</a></h3>
<p>📝 <strong>5.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61664">arxiv</a></p>
<hr>
<h3 id="107-omnishow-orchestrating-multimodal-conditions-for-human-object-interaction-video-generation">107. <a href="/audio-paper-digest-blog/posts/2026-05-23-omnishow-orchestrating-multimodal-conditions-for">OmniShow: Orchestrating Multimodal Conditions for Human-Object Interaction Video Generation</a></h3>
<p>📝 <strong>5.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/61031">arxiv</a></p>
<hr>
<h3 id="108-sparse-tokens-suffice-jailbreaking-audio-language-models-via-token-aware-gradient-optimization">108. <a href="/audio-paper-digest-blog/posts/2026-05-23-sparse-tokens-suffice-jailbreaking-audio-language">Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization</a></h3>
<p>✅ <strong>6.0/10</strong> | Top 50% | <a href="https://arxiv.org/abs/2605.04700">arxiv</a></p>
<hr>
<h3 id="109-phalar-phasors-for-learned-musical-audio-representations">109. <a href="/audio-paper-digest-blog/posts/2026-05-23-phalar-phasors-for-learned-musical-audio">PHALAR: Phasors for Learned Musical Audio Representations</a></h3>
<p>📝 <strong>5.5/10</strong> | Top 50% | <a href="https://icml.cc/virtual/2026/poster/66020">arxiv</a></p>
<hr>
<h3 id="110-scaling-laws-in-model-fine-tuning-for-audio-deepfake-detection">110. <a href="/audio-paper-digest-blog/posts/2026-05-23-scaling-laws-in-model-fine-tuning-for-audio">Scaling Laws in Model Fine-tuning for Audio DeepFake Detection</a></h3>
<p>📝 <strong>5.0/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/60632">arxiv</a></p>
<hr>
<h3 id="111-primcooperative-dynamic-token-compression-for-efficient-large-multimodal-models">111. <a href="/audio-paper-digest-blog/posts/2026-05-23-primcooperative-dynamic-token-compression-for">PRIM: Cooperative Dynamic Token Compression for Efficient Large Multimodal Models</a></h3>
<p>📝 <strong>4.8/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/60866">arxiv</a></p>
<hr>
<h3 id="112-towards-understanding-modality-interaction-in-multimodal-language-models-via-partial-information-decomposition">112. <a href="/audio-paper-digest-blog/posts/2026-05-23-towards-understanding-modality-interaction-in">Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition</a></h3>
<p>📝 <strong>4.5/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/60550">arxiv</a></p>
<hr>
<h3 id="113-from-inpainting-to-editing-unlocking-robust-mask-free-visual-dubbing-via-generative-bootstrapping">113. <a href="/audio-paper-digest-blog/posts/2026-05-23-from-inpainting-to-editing-unlocking-robust-mask">From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping</a></h3>
<p>📝 <strong>4.3/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/64235">ICML</a></p>
<hr>
<h3 id="114-sonar-spectralcontrastive-audio-residuals-for-generalizable-deepfake-detection">114. <a href="/audio-paper-digest-blog/posts/2026-05-23-sonar-spectralcontrastive-audio-residuals-for">SONAR: Spectral‑Contrastive Audio Residuals for Generalizable Deepfake Detection</a></h3>
<p>📝 <strong>4.0/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/64783">ICML</a></p>
<hr>
<h3 id="115-moshirag-asynchronous-knowledge-retrieval-for-full-duplex-speech-language-models">115. <a href="/audio-paper-digest-blog/posts/2026-05-23-moshirag-asynchronous-knowledge-retrieval-for">MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models</a></h3>
<p>📝 <strong>3.8/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/66336">ICML</a></p>
<hr>
<h3 id="116-starcaster-spatio-temporal-autoregressive-video-diffusion-for-identity--and-view-aware-talking-portraits">116. <a href="/audio-paper-digest-blog/posts/2026-05-23-starcaster-spatio-temporal-autoregressive-video">STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits</a></h3>
<p>📝 <strong>3.5/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/65916">ICML</a></p>
<hr>
<h3 id="117-wavessm-multiscale-state-space-models-for-non-stationary-signal-attention">117. <a href="/audio-paper-digest-blog/posts/2026-05-23-wavessm-multiscale-state-space-models-for-non">WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention</a></h3>
<p>📝 <strong>3.5/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/65941">ICML</a></p>
<hr>
<h3 id="118">118. <a href="/audio-paper-digest-blog/posts/2026-05-23-tau-voice-benchmarking-full-duplex-voice-agents">\(\tau\)-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains</a></h3>
<p>📝 <strong>3.5/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/66590">ICML</a></p>
<hr>
<h3 id="119-fakeworld-10-an-omni-modal-benchmark-for-fake-media-and-content">119. <a href="/audio-paper-digest-blog/posts/2026-05-23-fakeworld-10-an-omni-modal-benchmark-for-fake">FakeWorld 1.0: An Omni modal Benchmark for Fake Media and Content</a></h3>
<p>📝 <strong>3.5/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/63697">ICML</a></p>
<hr>
<h3 id="120-lalm-as-a-judge-benchmarking-large-audio-language-models-for-safety-evaluation-in-multi-turn-spoken-dialogues">120. <a href="/audio-paper-digest-blog/posts/2026-05-23-lalm-as-a-judge-benchmarking-large-audio-language">LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues</a></h3>
<p>📝 <strong>3.5/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/66557">ICML</a></p>
<hr>
<h3 id="121-ivq-structured-and-lightweight-vector-quantization-via-binary-hierarchical-composition-inspired-by">121. <a href="/audio-paper-digest-blog/posts/2026-05-23-ivq-structured-and-lightweight-vector">IVQ: Structured and Lightweight Vector Quantization via Binary Hierarchical Composition Inspired by \(\textit{IChing}\)</a></h3>
<p>📝 <strong>3.2/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/63329">ICML</a></p>
<hr>
<h3 id="122-mfcl-audio-an-audio-function-calling-evaluation-for-large-language-models">122. <a href="/audio-paper-digest-blog/posts/2026-05-23-mfcl-audio-an-audio-function-calling-evaluation">MFCL Audio: An Audio Function Calling Evaluation for Large Language Models</a></h3>
<p>📝 <strong>3.0/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/61489">ICML</a></p>
<hr>
<h3 id="123-position-towards-responsible-evaluation-for-text-to-speech">123. <a href="/audio-paper-digest-blog/posts/2026-05-23-position-towards-responsible-evaluation-for-text">Position: Towards Responsible Evaluation for Text-to-Speech</a></h3>
<p>📝 <strong>2.6/10</strong> | Bottom 50% | <a href="https://icml.cc/virtual/2026/poster/67095">ICML</a></p>
<hr>
]]></content:encoded>
      <category>ADMM</category>
      <category>Alignment</category>
      <category>ECoG</category>
      <category>Interface</category>
      <category>LLMs</category>
      <category>Token</category>
      <category>Video</category>
      <category>FMRI</category>
      <category>Brain Alignment</category>
    </item>
  </channel>
</rss>
