<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Cross-Modal on Speech/Audio Paper Digest</title>
    <link>https://nanless.github.io/audio-paper-digest-blog/tags/%E8%B7%A8%E6%A8%A1%E6%80%81/</link>
    <description>Recent content in Cross-Modal on Speech/Audio Paper Digest</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nanless.github.io/audio-paper-digest-blog/tags/%E8%B7%A8%E6%A8%A1%E6%80%81/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Dynamic Gated Cross-Attention Framework for Audio-Text Apparent Personality Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-dynamic-gated-cross-attention-framework-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-dynamic-gated-cross-attention-framework-for/</guid>
      <description>Audio Classification | 7.0/10</description>
    </item>
    <item>
      <title>A LLM-Driven Acoustic Semantic Enriched Framework for Underwater Acoustic Target Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-llm-driven-acoustic-semantic-enriched-framework/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-a-llm-driven-acoustic-semantic-enriched-framework/</guid>
      <description>Audio Classification | 7.0/10</description>
    </item>
    <item>
      <title>ACIR-MACL: Effective Multimodal Sentiment Analysis via Attention-Based Causal Intervention Regularization and Multi-Aspect Contrastive Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acir-macl-effective-multimodal-sentiment-analysis/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-acir-macl-effective-multimodal-sentiment-analysis/</guid>
      <description>Sentiment Analysis | 7.0/10</description>
    </item>
    <item>
      <title>An Unsupervised Alignment Feature Fusion System for Spoken Language-Based Dementia Detection</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-unsupervised-alignment-feature-fusion-system/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-an-unsupervised-alignment-feature-fusion-system/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Audience-Aware Co-speech Gesture Generation in Public Speaking via Anticipation Tokens</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audience-aware-co-speech-gesture-generation-in/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audience-aware-co-speech-gesture-generation-in/</guid>
      <description>Audio Generation | 8.0/10</description>
    </item>
    <item>
      <title>Audio-Text Jailbreak Attack on Large Audio-Language Models: Towards Generality and Stealthiness</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-text-jailbreak-attack-on-large-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-audio-text-jailbreak-attack-on-large-audio/</guid>
      <description>Audio Security | 7.0/10</description>
    </item>
    <item>
      <title>Auto-MatchCut: An Audio-Visual Retrieval Framework for Seamless Match Cutting</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auto-matchcut-an-audio-visual-retrieval-framework/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-auto-matchcut-an-audio-visual-retrieval-framework/</guid>
      <description>Cross-Modal Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-isolated-utterances-cue-guided-interaction/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-beyond-isolated-utterances-cue-guided-interaction/</guid>
      <description>Multimodal Models | 7.5/10</description>
    </item>
    <item>
      <title>Bimodal Fusion Framework for Dynamic Facial Expression Recognition In-The-Wild</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bimodal-fusion-framework-for-dynamic-facial/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-bimodal-fusion-framework-for-dynamic-facial/</guid>
      <description>Speech Emotion Recognition | 7.0/10</description>
    </item>
    <item>
      <title>CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-calm-joint-contextual-acoustic-linguistic/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-calm-joint-contextual-acoustic-linguistic/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Cross-Modal Bottleneck Fusion for Noise Robust Audio-Visual Speech Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-bottleneck-fusion-for-noise-robust/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-bottleneck-fusion-for-noise-robust/</guid>
      <description>Speech Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Cross-Modal Knowledge Distillation for Speech Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-knowledge-distillation-for-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-cross-modal-knowledge-distillation-for-speech/</guid>
      <description>Speech LLMs | 7.0/10</description>
    </item>
    <item>
      <title>DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-decaf-dynamic-envelope-context-aware-fusion-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-decaf-dynamic-envelope-context-aware-fusion-for/</guid>
      <description>Speech Enhancement | 7.0/10</description>
    </item>
    <item>
      <title>Diffemotalk: Audio-Driven Facial Animation with Fine-Grained Emotion Control via Diffusion Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diffemotalk-audio-driven-facial-animation-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-diffemotalk-audio-driven-facial-animation-with/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Disentangling Physiology from Fidelity: Latent-Guided Diffusion Models for Cross-Modal Cardiac Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-disentangling-physiology-from-fidelity-latent/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-disentangling-physiology-from-fidelity-latent/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>Do Speech LLMs Learn Crossmodal Embedding Spaces?</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-speech-llms-learn-crossmodal-embedding-spaces/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-do-speech-llms-learn-crossmodal-embedding-spaces/</guid>
      <description>Audio Retrieval | 6.5/10</description>
    </item>
    <item>
      <title>DPT-Net: Dual-Path Transformer Network with Hierarchical Fusion for EEG-based Envelope Reconstruction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dpt-net-dual-path-transformer-network-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dpt-net-dual-path-transformer-network-with/</guid>
      <description>Speech Biomarkers | 7.0/10</description>
    </item>
    <item>
      <title>Dynamic Balanced Cross-Modal Attention with Gated Sequence Restoration: Towards Robust Multimodal Sentiment Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dynamic-balanced-cross-modal-attention-with-gated/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-dynamic-balanced-cross-modal-attention-with-gated/</guid>
      <description>Cross-Modal | 7.5/10</description>
    </item>
    <item>
      <title>Estimating Hand-Related Features from Speech Using Machine Learning</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-estimating-hand-related-features-from-speech/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-estimating-hand-related-features-from-speech/</guid>
      <description>Speech Biomarkers | 5.0/10</description>
    </item>
    <item>
      <title>Face-Voice Association with Inductive Bias for Maximum Class Separation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-face-voice-association-with-inductive-bias-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-face-voice-association-with-inductive-bias-for/</guid>
      <description>Speaker Verification | 7.0/10</description>
    </item>
    <item>
      <title>From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-Modal Understanding in Multimodal LLMs</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-contrast-to-commonality-audio-commonality/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-from-contrast-to-commonality-audio-commonality/</guid>
      <description>Audio Scene Understanding | 7.5/10</description>
    </item>
    <item>
      <title>HarmoNet: Music Grounding by Short Video via Harmonic Resample and Dynamic Sparse Alignment</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-harmonet-music-grounding-by-short-video-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-harmonet-music-grounding-by-short-video-via/</guid>
      <description>Music Retrieval | 7.0/10</description>
    </item>
    <item>
      <title>ICASSP 2026 - Cross-Modal Paper List</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-096/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/icassp2026-task-096/</guid>
      <description>A total of 2 ICASSP 2026 papers in the Cross-Modal area</description>
    </item>
    <item>
      <title>Inter-Dialog Contrastive Learning for Multimodal Emotion Recognition in Conversations</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-inter-dialog-contrastive-learning-for-multimodal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-inter-dialog-contrastive-learning-for-multimodal/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>KSDIFF: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ksdiff-keyframe-augmented-speech-aware-dual-path/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-ksdiff-keyframe-augmented-speech-aware-dual-path/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>LETPAV: Lexicon-Enhanced Text with Progressive Audio-Visual Fusion for Multimodal Sentiment Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-letpav-lexicon-enhanced-text-with-progressive/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-letpav-lexicon-enhanced-text-with-progressive/</guid>
      <description>Speech Emotion Recognition | 7.5/10</description>
    </item>
    <item>
      <title>Leveraging Large Multimodal Models for Audio-Video Deepfake Detection: A Pilot Study</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-large-multimodal-models-for-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-leveraging-large-multimodal-models-for-audio/</guid>
      <description>Audio Deepfake Detection | 7.0/10</description>
    </item>
    <item>
      <title>Look, Listen and Segment: Towards Weakly Supervised Audio-Visual Semantic Segmentation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-look-listen-and-segment-towards-weakly-supervised/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-look-listen-and-segment-towards-weakly-supervised/</guid>
      <description>Audio-Visual | 7.0/10</description>
    </item>
    <item>
      <title>MCI-OTFusion: A Multimodal Model for MCI Detection and Cognitive Score Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mci-otfusion-a-multimodal-model-for-mci-detection/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mci-otfusion-a-multimodal-model-for-mci-detection/</guid>
      <description>Mild Cognitive Impairment Detection | 6.5/10</description>
    </item>
    <item>
      <title>Mitigating Shared-Private Branch Imbalance via Dual-Branch Rebalancing for Multimodal Sentiment Analysis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-shared-private-branch-imbalance-via/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mitigating-shared-private-branch-imbalance-via/</guid>
      <description>Multimodal Models | 7.5/10</description>
    </item>
    <item>
      <title>MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmeb-v3-measuring-the-performance-gaps-of-omni/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-mmeb-v3-measuring-the-performance-gaps-of-omni/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>Motionbeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-motionbeat-motion-aligned-music-representation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-motionbeat-motion-aligned-music-representation/</guid>
      <description>Dance Generation | 7.5/10</description>
    </item>
    <item>
      <title>Multi-Scale Physiologically-Motivated Alignment for Auditory Attention Decoding</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-scale-physiologically-motivated-alignment/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multi-scale-physiologically-motivated-alignment/</guid>
      <description>Auditory Attention Decoding | 7.5/10</description>
    </item>
    <item>
      <title>Multimodal Fusion-Based IPCLIP Network for Mixed Reality Surgical Assistance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-fusion-based-ipclip-network-for-mixed/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-fusion-based-ipclip-network-for-mixed/</guid>
      <description>Multimodal Models | 6.5/10</description>
    </item>
    <item>
      <title>Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-self-attention-network-with-temporal/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-multimodal-self-attention-network-with-temporal/</guid>
      <description>Speech Emotion Recognition | 8.0/10</description>
    </item>
    <item>
      <title>Natural Language to Spatial Audio Parameters: Lightweight Deterministic Rendering for Creative Authoring</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-natural-language-to-spatial-audio-parameters/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-natural-language-to-spatial-audio-parameters/</guid>
      <description>Spatial Audio | 7.5/10</description>
    </item>
    <item>
      <title>Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-nemotron-3-nano-omni-efficient-and-open/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-nemotron-3-nano-omni-efficient-and-open/</guid>
      <description>Multimodal Models | 8.5/10</description>
    </item>
    <item>
      <title>NeuroSIFT: A Biologically-Inspired Framework with Explicit Signal-Noise Separation for Robust Multimodal Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-neurosift-a-biologically-inspired-framework-with/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-neurosift-a-biologically-inspired-framework-with/</guid>
      <description>Multimodal Emotion Recognition | 8.0/10</description>
    </item>
    <item>
      <title>RCAL: Reinforced Cross-Modal Alignment for Multimodal Sentiment Analysis with Sparse Visual Frames</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rcal-reinforced-cross-modal-alignment-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-rcal-reinforced-cross-modal-alignment-for/</guid>
      <description>Multimodal Models | 8.5/10</description>
    </item>
    <item>
      <title>Reliable AI via Age-Balanced Validation: Fair Model Selection for Parkinson’s Detection from Voice</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reliable-ai-via-age-balanced-validation-fair/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-reliable-ai-via-age-balanced-validation-fair/</guid>
      <description>Speech Biomarkers | 7.5/10</description>
    </item>
    <item>
      <title>SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-savgbench-benchmarking-spatially-aligned-audio/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-savgbench-benchmarking-spatially-aligned-audio/</guid>
      <description>Benchmarking | 7.5/10</description>
    </item>
    <item>
      <title>Selective Hub Fusion with Modality-Heterogeneous Experts for Multimodal Emotion Recognition</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-selective-hub-fusion-with-modality-heterogeneous/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-selective-hub-fusion-with-modality-heterogeneous/</guid>
      <description>Multimodal Models | 6.5/10</description>
    </item>
    <item>
      <title>Sounds that Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sounds-that-shape-audio-driven-3d-mesh-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-sounds-that-shape-audio-driven-3d-mesh-generation/</guid>
      <description>Audio Generation | 7.0/10</description>
    </item>
    <item>
      <title>Spatial-CLAP: Learning Spatially-Aware Audio–Text Embeddings for Multi-Source Conditions</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spatial-clap-learning-spatially-aware-audiotext/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-spatial-clap-learning-spatially-aware-audiotext/</guid>
      <description>Spatial Audio | 8.5/10</description>
    </item>
    <item>
      <title>StereoFoley: Object-Aware Stereo Audio Generation from Video</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stereofoley-object-aware-stereo-audio-generation/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-stereofoley-object-aware-stereo-audio-generation/</guid>
      <description>Audio Generation | 7.5/10</description>
    </item>
    <item>
      <title>TextlessRAG: End-to-End Visual Document RAG by Speech without Text</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-textlessrag-end-to-end-visual-document-rag-by/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-textlessrag-end-to-end-visual-document-rag-by/</guid>
      <description>Spoken Question Answering | 8.5/10</description>
    </item>
    <item>
      <title>The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-the-structured-output-benchmark-a-multi-source/</guid>
      <description>Benchmarking | 7.0/10</description>
    </item>
    <item>
      <title>Towards Multi-View Hierarchical Video-to-Piano Generation with MIDI Guidance</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-multi-view-hierarchical-video-to-piano/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-towards-multi-view-hierarchical-video-to-piano/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>UVT-LM: Unifying Visual and Tactile Perception with Language Model</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uvt-lm-unifying-visual-and-tactile-perception/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-uvt-lm-unifying-visual-and-tactile-perception/</guid>
      <description>Cross-Modal | 7.0/10</description>
    </item>
    <item>
      <title>Visual Keys to Symphonies: Latent Diffusion for Multi-Scene Video-to-Music Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-visual-keys-to-symphonies-latent-diffusion-for/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-visual-keys-to-symphonies-latent-diffusion-for/</guid>
      <description>Music Generation | 7.5/10</description>
    </item>
    <item>
      <title>VMSP: Video-to-Music Generation with Two-Stage Alignment and Synthesis</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vmsp-video-to-music-generation-with-two-stage/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-vmsp-video-to-music-generation-with-two-stage/</guid>
      <description>Music Generation | 7.0/10</description>
    </item>
    <item>
      <title>When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-silence-matters-the-impact-of-irrelevant/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-29-when-silence-matters-the-impact-of-irrelevant/</guid>
      <description>Model Evaluation | 7.0/10</description>
    </item>
    <item>
      <title>CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-cineagi-character-consistent-movie-creation/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-cineagi-character-consistent-movie-creation/</guid>
      <description>Cross-Modal | 8.0/10</description>
    </item>
    <item>
      <title>Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-robust-audio-text-retrieval-via-cross-modal/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-28-robust-audio-text-retrieval-via-cross-modal/</guid>
      <description>Audio Retrieval | 7.5/10</description>
    </item>
    <item>
      <title>Sema: Semantic Transport for Real-Time Multimodal Agents</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-sema-semantic-transport-for-real-time-multimodal/</link>
      <pubDate>Fri, 24 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-24-sema-semantic-transport-for-real-time-multimodal/</guid>
      <description>Real-Time Processing | 6.5/10</description>
    </item>
    <item>
      <title>FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-flip-towards-understanding-and-interpreting/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-flip-towards-understanding-and-interpreting/</guid>
      <description>This paper addresses the interpretability of multilingual, multimodal sentence embeddings (e.g., SONAR, LaBSE). Its core method is a model called Factorized Linear Projection (FLiP), which linearly projects embedding vectors into the vocabulary space to extract keywords, using this as a proxy task for understanding what the embeddings encode. Compared with previous non-factorized linear probing methods (e.g., LiP) and SpLiCE, FLiP…</description>
    </item>
    <item>
      <title>ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-onote-benchmarking-omnimodal-notation-processing/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-23-onote-benchmarking-omnimodal-notation-processing/</guid>
      <description>1. **Problem**: Current multimodal LLMs have serious shortcomings in Omnimodal Notation Processing (ONP): research is fragmented, models show a severe notation bias (favoring staff notation), and evaluation widely relies on unreliable "LLM-as-a-Judge" methods, masking systematic failures in music-theory reasoning. 2. …</description>
    </item>
    <item>
      <title>Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-comparison-of-semg-encoding-accuracy-across/</link>
      <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-22-comparison-of-semg-encoding-accuracy-across/</guid>
      <description>This paper aims to select a better intermediate representation target for silent speech interfaces (SSIs). It systematically compares articulatory features (SPARC) with conventional one-hot phoneme encodings for predicting surface electromyography (sEMG) signal envelopes. Core findings: 1) across vocalized, mouthed, and subvocal speech modes, SPARC features yield significantly higher encoding accuracy than phoneme features; 2) encoding performance is comparable in the vocalized and mouthed modes, while the subvocal mode…</description>
    </item>
    <item>
      <title>Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-aligning-language-models-for-lyric-to-melody/</link>
      <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-21-aligning-language-models-for-lyric-to-melody/</guid>
      <description>This paper addresses the "constraint violation" problem in lyric-to-melody generation with large language models: models trained via supervised fine-tuning (SFT) often produce musically infeasible output (e.g., odd rhythms, out-of-range pitches). The **core contribution** is an automated alignment framework based on rule-derived constraints that requires no human annotation. The **key method** proceeds in three steps: first, apply SFT to a pretrained LLM to obtain basic generation ability; second, …</description>
    </item>
    <item>
      <title>Hierarchical Codec Diffusion for Video-to-Speech Generation</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-hierarchical-codec-diffusion-for-video-to-speech/</guid>
      <description>This paper targets the information asymmetry between the visual and speech modalities in video-to-speech (VTS) generation, arguing that existing methods ignore the hierarchical structure of speech from coarse-grained semantics to fine-grained prosody, which prevents visual conditions from aligning precisely with speech representations. To address this, the authors propose HiCoDiT (Hierarchical Codec Diffusion Transformer), …</description>
    </item>
    <item>
      <title>Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-joint-centric-dual-contrastive-alignment-with/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-joint-centric-dual-contrastive-alignment-with/</guid>
      <description>This paper tackles a key challenge in audio-text multimodal representation learning: achieving effective cross-modal alignment under low-resource, long-sequence conditions with severely imbalanced modality dimensions (high-dimensional audio, low-dimensional text), while preserving modality-specific information. The authors propose the HILBERT framework. The method first uses frozen pretrained audio (e.g., HuBERT) and text (e.g., T5) encoders to extract segment-level features, …</description>
    </item>
    <item>
      <title>The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction</title>
      <link>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-the-acoustic-camouflage-phenomenon-re-evaluating/</link>
      <pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://nanless.github.io/audio-paper-digest-blog/posts/2026-04-20-the-acoustic-camouflage-phenomenon-re-evaluating/</guid>
      <description>This study examines the utility of paralinguistic acoustic features (pitch, jitter, pauses, etc.) for predicting catastrophic stock-price drops from corporate earnings calls. Using the MAEC dataset, the authors extract features from two modalities: on the text side, FinBERT computes the sentiment-polarity difference (Sentiment Delta) between the scripted opening remarks and the spontaneous Q&amp;A; on the audio side, they extract variance features of clinical vocal-stress markers (pitch variance, jitter…</description>
    </item>
  </channel>
</rss>
