Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

✅ 6.9/10 | Top 50% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 24 words

DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

✅ 7.8/10 | Top 25% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 19 words

Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

✅ 7.5/10 | Top 25% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 20 words

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

🔥 8.0/10 | Top 25% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 20 words

EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs

✅ 7.5/10 | Top 25% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 19 words

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

✅ 6.0/10 | Top 50% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 23 words

FakeWorld 1.0: An Omni modal Benchmark for Fake Media and Content

📝 3.5/10 | Bottom 50% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 22 words

FoeGlass: When Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

🔥 8.0/10 | Top 25% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 24 words

From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

📝 4.3/10 | Bottom 50% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 23 words

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

✅ 6.8/10 | Top 50% | arXiv

2026-05-23 · Updated 2026-06-19 · 1 min · 22 words