arXiv · Large Models

Vision-OPD proposes a self-distillation method for fine-grained visual understanding in multimodal large models

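Multimodal large language models (MLLMs) still fall short on fine-grained visual understanding, where the answer often hinges on small but decisive evidence in the image. Vision-OPD proposes an on-policy self-distillation method that trains the model to attend to local details, narrowing the regional-to-global perception gap. The method requires no additional annotations and improves MLLM accuracy on questions about fine details.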
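To make the idea of self-distillation across a regional-to-global gap concrete, here is a minimal sketch and not the paper's actual objective. It assumes the same model produces two sets of next-token logits for one sampled answer: one conditioned on a cropped evidence region (treated as the teacher) and one conditioned on the full image (treated as the student). The function names, the KL direction, and the toy logits are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one answer.

    teacher_logits: per-token logits from the region-conditioned pass (assumed).
    student_logits: per-token logits from the full-image pass (assumed).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy example: one token position, vocabulary of three tokens.
region_view = [[2.0, 0.0, 0.0]]   # confident when shown the crop
global_view = [[0.0, 0.0, 0.0]]   # uncertain on the full image
loss = self_distillation_loss(region_view, global_view)
```

In a real training loop the student pass would carry gradients while the teacher pass would be treated as a fixed target; the loss shrinks toward zero as the full-image predictions match the region-conditioned ones.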

Domain: arxiv.org
Rating: 4 · Major update
Published: 2026-05-18

Original abstract

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned …