LynX: Token Interface Alignment for Video+X LLMs

Sat, 23 May 2026 00:00:00 +0000


#** #Video #LLMs #Token #Interface #Alignment #MultimodalIntegration #ManifoldAlignment #UnimodalData

✅ 7.5/10 | Top 25% | #** | #Video | #LLMs #Token | arXiv


Speech/Music/Audio Paper Digest 2026-05-23


123 papers analyzed in total

⚡ Today's Overview

📥 123 papers fetched → 🔬 In-depth analysis complete

🏷️ Trending Topics

Topic	Count	Distribution
#**	4 papers	████

📊 Paper Score Ranking (123 papers, sorted by score in descending order)

Rank	Paper	Score	Tier	Main Task
🥇	INFER: Learning Implicit Neural Frequency Response Fiel	8.5	Top 25%	-
🥈	VocSim A Training-free Benchmark for Zero-shot Content	8.3	Top 25%	-
🥉	CMI-RewardBench: Evaluating Music Reward Models with Co	8.2	Top 25%	-
4.	Language Model Augmented Semi-Supervised Statistical In	8.2	Top 25%	-
5.	DiscoForcing: A Unified Framework for Real-Time Audio-D	8.2	Top 25%	-
6.	Abstraction Induces the Brain Alignment of Language and	8.0	Top 25%	#**
7.	Alethia: a Foundational Encoder for Voice Deepfakes	8.0	Top 25%	-
8.	OmniDenseCap: Scripting Multi-Scene Videos with Time-Aw	8.0	Top 25%	-
9.	FoeGlass: When Simple In-Context Learning Is Enough for	8.0	Top 25%	-
10.	E-VAds: An E-commerce Short Videos Understanding Benchm	8.0	Top 25%	-
11.	BEAT: Tokenizing and Generating Symbolic Music by Unifo	8.0	Top 25%	-
12.	Pianist Transformer: Towards Expressive Piano Performan	7.8	Top 25%	-
13.	DreamID-Omni: Unified Framework for Controllable Human-	7.8	Top 25%	-
14.	Real-World Unsupervised Models Generalize to Predict Br	7.8	Top 25%	-
15.	AudioMosaic: Contrastive Masked Audio Representation Le	7.5	Top 25%	-
16.	Self-Guidance: Enhancing Neural Codecs via Decoder Mani	7.5	Top 25%	-
17.	LynX: Token Interface Alignment for Video+X LLMs	7.5	Top 25%	#**
18.	Spherical Procrustes Alignment for Reliable Medical Aud	7.5	Top 25%	-
19.	MoST: Mixing Speech and Text with Modality-Aware Mixtur	7.5	Top 25%	-
20.	Self-Supervised Flow Matching for Scalable Multi-Modal	7.5	Top 25%	-
21.	LightAVSeg: Lightweight Audio-Visual Segmentation	7.5	Top 25%	-
22.	Robust Signal Enhancement via Fractional Detail Views a	7.5	Top 25%	-
23.	EchoingPixels: Aliasing-Resistant Joint Token Reduction	7.5	Top 25%	-
24.	Long Grounded Thoughts: Synthesizing Grounded Visual Pr	7.5	Top 25%	-
25.	OmniVideo-R1: Reinforcing Audio-visual Reasoning with Q	7.5	Top 25%	-
26.	Ariadne’s Thread of LipSync: Unraveling Forgeries via I	7.5	Top 25%	-
27.	AVI-Bench: Toward Human-like Audio-Visual Intelligence	7.5	Top 25%	-
28.	Simultaneous Speech-to-Speech Translation Without Align	7.5	Top 25%	-
29.	PhoStream: Benchmarking Real-World Streaming for Omnimo	7.5	Top 25%	-
30.	OmniSIFT: Modality-Asymmetric Token Compression for Eff	7.5	Top 25%	-
31.	Speech-Audio Compositional Attacks on Multimodal LLMs a	7.5	Top 25%	-
32.	Convex Low-resource Accent-Robust Language Detection in	7.5	Top 25%	#**
33.	PhaseCoder: Microphone Geometry-Agnostic Spatial Audio	7.5	Top 25%	-
34.	Listening Through the Noise: Cauchy-Driven Diffusion Br	7.5	Top 25%	-
35.	Dual-View Predictive Diffusion: Lightweight Speech Enha	7.5	Top 25%	-
36.	Stream RAG: Instant and Accurate Spoken Dialogue System	7.5	Top 25%	-
37.	NAACA: Training-Free NeuroAuditory Attentive Cognitive	7.5	Top 25%	-
38.	MedMosaic: A Challenging Large Scale Benchmark of Diver	7.5	Top 25%	-
39.	Verifiable Multimodal Reasoning: Fact-level Attribution	7.5	Top 25%	-
40.	MusicDET: Zero-Shot AI-Generated Music Detection	7.5	Top 25%	-
41.	PCRNet: Phase-aware Complex Refinement Network for EEG-	7.5	Top 25%	-
42.	SARSteer: Safeguarding Large Audio Language Models via	7.5	Top 25%	-
43.	STAR-VAE: Structured Topology-Aware Regularization for	7.5	Top 25%	-
44.	Hidden in Plain Tokens: Simply Robust, Gradient-Free Wa	7.5	Top 25%	-
45.	AVGen-Bench: A Task-Driven Benchmark for Multi-Granular	7.3	Top 50%	-
46.	Bridging the Stability-Expressivity Gap: Synthetic Data	7.3	Top 50%	-
47.	AVTrack: Audio-Visual Speaker Tracking in Complex Scene	7.3	Top 50%	-
48.	Bioacoustic Geolocation: Species Sounds as Geographic S	7.2	Top 50%	-
49.	ADEPT: RL-Aligned Agentic Decoding of Emotion via Evide	7.2	Top 50%	-
50.	MECAT: A Multi-Experts Constructed Benchmark for Fine-G	7.2	Top 50%	-
51.	SPEAR: A Unified SSL Framework for Learning Speech and	7.2	Top 50%	-
52.	PADS-TAL: Padding-Annealed Diffusion Sampling in Text-A	7.2	Top 50%	-
53.	Multimodal Latent Language Modeling with Next-Token Dif	7.2	Top 50%	-
54.	Query-Based Asymmetric Modeling with Decoupled Input–Ou	7.0	Top 50%	-
55.	AgentSteerTTS: A Multi-Agent Closed-Loop Framework for	7.0	Top 50%	-
56.	Optimality of FSQ tokens for continuous diffusion for c	7.0	Top 50%	-
57.	JAEGER: Joint 3D Audio-Visual Grounding and Reasoning i	7.0	Top 50%	-
58.	SonicMaster: Towards Controllable All-in-One Music Rest	7.0	Top 50%	-
59.	VIBE: Disentangling Social Dynamics via Kinematics-Info	7.0	Top 50%	-
60.	Reasoning LLM Improves Speaker Recognition in Long-form	7.0	Top 50%	-
61.	A Semantically Consistent Dataset for Data-Efficient Qu	7.0	Top 50%	-
62.	The Silent Thought: Modeling Internal Cognition in Full	7.0	Top 50%	-
63.	Learning Tight Rejection Boundaries without Negatives f	7.0	Top 50%	-
64.	Quaternion Self-Attention with Shared Scores	7.0	Top 50%	-
65.	Bridging Your Imagination with Audio-Video Generation v	7.0	Top 50%	-
66.	TextME: Bridging Unseen Modalities Through Text Descrip	7.0	Top 50%	-
67.	ReGen: Hierarchical Multi-Prompt Representation Generat	7.0	Top 50%	-
68.	Polyphonia: Training-Free Context-Aware Music Editing w	7.0	Top 50%	-
69.	TMD-Bench: A Multi-Level Evaluation Paradigm for Music–	7.0	Top 50%	-
70.	Omni-Perception Policy Optimization for Multimodal Emot	7.0	Top 50%	-
71.	Acoustic Interference: A New Paradigm Weaponizing Acous	7.0	Top 50%	-
72.	AudioChat: Unified Audio Storytelling, Editing, and Und	7.0	Top 50%	-
73.	Do Audio LLMs Listen or Read? Analyzing and Mitigating	6.9	Top 50%	-
74.	From Talking to Singing: A New Challenge for Audio-Visu	6.8	Top 50%	-
75.	Multiple Choice Learning of Low-Rank Adapters for Langu	6.8	Top 50%	-
76.	Multimodal Fusion via Self-Consistent Task-Gradient Fie	6.8	Top 50%	-
77.	Position: Beyond Text The Text-Centric Bias in Founda	6.8	Top 50%	-
78.	MetaBio: Learning from metadata for bioacoustics founda	6.5	Top 50%	-
79.	Any-Diffusion: Unified Multimodal Understanding and Gen	6.5	Top 50%	-
80.	SAM Audio: Segment Anything in Audio	6.5	Top 50%	#**
81.	CoCoEmo: Composable and Controllable Human-Like Emotion	6.5	Top 50%	-
82.	HyperPotter: Spell the Charm of High-Order Interactions	6.5	Top 50%	-
83.	Joint Enhancement and Classification using Coupled Diff	6.5	Top 50%	-
84.	Hearing Without Noticing? Attention-Aware Stealthy Blac	6.5	Top 50%	-
85.	Two-dimensional quantization for geometry-aware audio c	6.5	Top 50%	-
86.	SALSA-V: Shortcut-Augmented Long-form Synchronized Audi	6.5	Top 50%	-
87.	REST: Diffusion-based Real-time End-to-end Streaming Ta	6.5	Top 50%	-
88.	AuTAgent: A Reinforcement Learning Framework for Tool-A	6.5	Top 50%	-
89.	Characterizing the Predictive Impact of Modalities with	6.5	Top 50%	-
90.	Group Cognition Learning: Making Everything Better Thro	6.5	Top 50%	-
91.	Rethinking Attention in Spiking Transformers: Overcomin	6.5	Top 50%	-
92.	T2AV-Compass: Towards Unified Evaluation for Text-to-Au	6.5	Top 50%	-
93.	S3Audio: Towards Streaming Synchronized Spatial Audio G	6.5	Top 50%	-
94.	Sparse Autoencoders for Interpretable Emotion Control i	6.5	Top 50%	-
95.	BAT: Better Audio Transformer Guided by Convex Gated Pr	6.5	Top 50%	-
96.	AG-REPA: Causal Layer Selection for Representation Alig	6.5	Top 50%	-
97.	CoLA: Cross-Modal Low-rank Adaptation for Multimodal Do	6.5	Top 50%	-
98.	Neural-Inspired Modeling of Auditory Selection and Comp	6.5	Top 50%	-
99.	FutureOmni: Evaluating Future Forecasting from Omni-Mod	6.5	Top 50%	-
100.	ProactiveLLM: Learning Active Interaction for Streaming	6.0	Top 50%	-
101.	video-SALMONN S: Memory-Enhanced Streaming Audio-Visual	6.0	Top 50%	-
102.	Zero-Shot Rankability: Revealing Latent Ordinal Structu	6.0	Top 50%	-
103.	Scaling Transformers for End-to-End Discrete Audio Toke	6.0	Top 50%	-
104.	Evaluating and Rewarding LALMs for Expressive Role-Play	6.0	Top 50%	-
105.	Unlocking Speech–Text Compositional Powers: Instruction	5.8	Top 50%	-
106.	Probing Cross-modal Information Hubs in Audio-Visual LL	5.5	Top 50%	-
107.	OmniShow: Orchestrating Multimodal Conditions for Human	5.5	Top 50%	-
108.	Sparse Tokens Suffice: Jailbreaking Audio Language Mode	5.5	Top 50%	-
109.	PHALAR: Phasors for Learned Musical Audio Representatio	5.5	Top 50%	-
110.	Scaling Laws in Model Fine-tuning for Audio DeepFake De	5.0	Bottom 50%	-
111.	PRIM: Cooperative Dynamic Token Compression for Efficien	4.8	Bottom 50%	-
112.	Towards Understanding Modality Interaction in Multimoda	4.5	Bottom 50%	-
113.	From Inpainting to Editing: Unlocking Robust Mask-Free	4.3	Bottom 50%	-
114.	SONAR: Spectral-Contrastive Audio Residuals for General	4.0	Bottom 50%	-
115.	MoshiRAG: Asynchronous Knowledge Retrieval for Full-Dup	3.8	Bottom 50%	-
116.	STARCaster: Spatio-Temporal AutoRegressive Video Diffus	3.5	Bottom 50%	-
117.	WaveSSM: Multiscale State-Space Models for Non-stationa	3.5	Bottom 50%	-
118.	τ-Voice: Benchmarking Full-Duplex Voice Agents on	3.5	Bottom 50%	-
119.	FakeWorld 1.0: An Omni modal Benchmark for Fake Media a	3.5	Bottom 50%	-
120.	LALM-as-a-Judge: Benchmarking Large Audio-Language Mode	3.5	Bottom 50%	-
121.	IVQ: Structured and Lightweight Vector Quantization via	3.2	Bottom 50%	-
122.	MFCL Audio: An Audio Function Calling Evaluation for La	3.0	Bottom 50%	-
123.	Position: Towards Responsible Evaluation for Text-to-Sp	2.6	Bottom 50%	-