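Integrating Language into Medical Visual Recognition and Reasoning: A Survey | Literature Express: Latest Papers on Deep Learning for Medical AI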

Title

Integrating language into medical visual recognition and reasoning: A survey

Introduction

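[…] detection, and semantic segmentation) is the cornerstone of countless quantitative disease assessments and treatment plans (Litjens et al., 2017), while medical visual reasoning, such as visual question answering (VQA), interprets and understands complex visual patterns, relationships, and context, which is essential for accurate diagnostic decisions (Lin et al., 2023c). End-to-end deep neural networks (DNNs) have achieved remarkable success in visual recognition and reasoning tasks (He et al., 2016; Ronneberger et al., 2015; Dosovitskiy et al., 2020; Kirillov et al., 2023). In medical visual tasks, however, a long-standing challenge remains, caused by two limitations: the laborious collection of large-scale, task-specific, expert-annotated data, and the limited understanding that DNNs can reach from visual information alone (Litjens et al., 2017).

Because of privacy and ethical considerations, large-scale medical imaging data are not easy to obtain publicly. Meanwhile, acquiring dense annotations for some medical tasks, such as semantic segmentation and object detection, is labor-intensive and time-consuming. Compared with dense expert annotations, medical reports are relatively easy to collect, since medical images are usually accompanied by corresponding reports (Shrestha et al., 2023). It is therefore natural to incorporate text supervision into visual tasks. For example, vision-language pretraining (VLP), which explores data-efficient pretraining under text supervision, has proven highly effective across a wide range of visual tasks (Radford et al., 2021; Huang et al., 2021; Wang et al., 2022b; Zhang et al., 2022a; Khare et al., 2021). Under this new paradigm, DNNs are first pretrained on large-scale medical images paired with readily available clinical reports. The pretrained model is then fine-tuned on task-specific annotated training data. The textual knowledge obtained from reports, combined with task-specific fine-tuning, helps visual models converge quickly and perform well on downstream medical tasks, including classification (Zhou et al., 2022b; Zhang et al., 2023d; Wang et al., 2023c), segmentation (Zhu et al., 2023; Luo et al., 2023), and detection (Guo et al., 2023).

Beyond VLP, many more text-guided medical tasks are worth exploring. DNNs that focus only on visual information sometimes cannot obtain enough information to understand the semantics of complex medical images accurately. Medical multimodal learning (Li et al., 2023b; Zeng et al., 2023; Li et al., 2023a) and multi-view learning (Bian et al., 2023; Jiang et al., 2022) offer inspiration here: combining several types of data gives a more comprehensive understanding of context, and complementary images and text help models identify pathology more accurately. A large body of recent work (Zhong et al., 2023; Qin et al., 2022; Ichinose et al., 2023; Sun et al., 2023) highlights the benefits of incorporating textual descriptions, including human observations and expert knowledge, into medical visual recognition tasks.

Unfortunately, the research community lacks a systematic survey that clarifies existing work on medical vision-language models (VLMs) for visual perception and reasoning tasks. Existing surveys (Xiao et al., 2024; Hartsock and Rasool, 2024) focus on medical large language models (LLMs) and multimodal large language models (MLLMs) for text tasks such as medical diagnosis and report generation, whereas our survey emphasizes visual tasks, similar to the survey of Zhang et al. (2024). In addition, recent surveys (Zhao et al., 2023b; Chen et al., 2023b; Shrestha et al., 2023) cover only medical VLP and its application to downstream tasks. We argue that the potential of VLMs goes further, and that more of the research on incorporating textual knowledge into medical visual tasks needs to be summarized and discussed. Although these methods are highly flexible and diverse, they are crucial for improving the accuracy and reliability of diagnosis and the interpretability of models in clinical settings. Building on the first survey of VLMs for visual tasks (Zhang et al., 2024), we aim to fill this gap with a comprehensive review of medical VLMs across a range of visual recognition and reasoning tasks. As shown in Fig. 1, we organize the survey into seven sections; the discussion mainly covers VLM methods, medical datasets, challenges, and future research directions. In summary, the main contributions of this survey are as follows:

- We review state-of-the-art medical VLMs for visual tasks. To the best of our knowledge, this is the first survey dedicated to medical VLMs that focuses on how textual data can be incorporated effectively into visual tasks, rather than on VLP alone.
- We categorize the methods of medical VLMs and present their key information in tables, aiming to help readers through well-structured content.
- We compile publicly available medical image–text datasets for VLM applications and compare the various methods concisely.
- Finally, we discuss the challenges and future prospects of VLMs in the medical domain.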

Abstract

Vision-Language Models (VLMs) are regarded as efficient paradigms that build a bridge between visual perception and textual interpretation. For medical visual tasks, they can benefit from expert observation and physician knowledge extracted from textual context, thereby improving the visual understanding of models. Motivated by the fact that extensive medical reports are commonly attached to medical imaging, medical VLMs have attracted growing interest, serving not only as self-supervised learning in the pretraining stage but also as a means to introduce auxiliary information into medical visual perception. To strengthen the understanding of such a promising direction, this survey aims to provide an in-depth exploration and review of medical VLMs for various visual recognition and reasoning tasks. First, we present an introduction to medical VLMs. Then, we provide preliminaries and delve into how to exploit language in medical visual tasks from diverse perspectives. Further, we investigate publicly available VLM datasets and discuss the challenges and future perspectives. We expect that this comprehensive discussion of state-of-the-art medical VLMs will help researchers recognize their significant potential.

Method

4.1. Medical vision-language pretraining

In the discussion of medical VLP literature, we categorize VLP into masked prediction, matching prediction, contrastive learning, and hybrid approaches according to deliberately designed self-supervised objective functions. An illustration of these categories is presented in Fig. 5, and we summarize the relevant papers in Table 1.
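Of the four categories, contrastive learning is perhaps the easiest to illustrate in a few lines. The sketch below is a minimal, standard-library-only version of a symmetric image–text contrastive (InfoNCE-style) loss in the spirit of CLIP (Radford et al., 2021). It is not the formulation of any particular method in Table 1: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against one target index (stable log-sum-exp)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k (image_embs[k], text_embs[k]) is the positive; every other
    combination in the batch is treated as a negative.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Image-to-text: each row should peak on its own column.
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak on its own row.
    t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_reports = [[1.0, 0.0], [0.0, 1.0]]
    swapped_reports = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", contrastive_loss(images, matched_reports))
    print("swapped:", contrastive_loss(images, swapped_reports))
```

The loss is small when each image embedding lies closest to its own report and large when the pairing is swapped. Because every non-paired report counts as a negative, two cases with similar semantics are still pushed apart, which is the drawback described in the caption of Fig. 6.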

Conclusion

In this comprehensive survey, we have delved into the emerging field of medical vision-language models (medical VLMs). Commencing with an elucidation of their background and definition, we traversed through a meticulous analysis of existing literature from five perspectives, presenting challenges and future directions. Throughout our review, we have emphasized their pivotal role in enhancing data efficacy, improving visual perception and reasoning accuracy, and advancing interpretability. These models assist physicians in not only diagnosing diseases and identifying lesions and abnormalities more accurately but also formulating treatment plans and disseminating health information.

Looking ahead, research in medical VLMs presents promising avenues for innovation and impact in both medical research and practical applications. The future of medical VLMs holds the potential to revolutionize healthcare delivery and augment clinical decision-making.

Figure

Fig. 1. The overview of medical VLMs in this review.

Fig. 2. Statistics of literature sources and publication years. (a) The pie chart illustrates the distribution of literature sources for the papers reviewed in our study. A significant portion of the selected papers are sourced from highly influential and authoritative conferences or journals in the fields of Machine Learning/Deep Learning (ML/DL) and Medicine. (b) Statistics of literature publication years. The bar chart shows the number of papers published each year up to the first quarter of 2024, indicating the trend and frequency of publications over the years. The pie chart depicts the distribution of different imaging modalities mentioned in the reviewed papers.

Fig. 3. Taxonomy of studies focusing on VLMs in the medical field.

Fig. 4. Timeline of representative works in medical vision-language models (VLMs). This illustration traces the evolution of medical VLMs from five key perspectives, highlighting significant milestones and advancements in this field.

Fig. 5. An illustration of medical vision-language pretraining. Related methods are categorized into (a) Masked prediction, (b) Matching prediction, (c) Contrastive learning, and (d) Hybrid approaches. MIM: masked image modeling. MLM: masked language modeling. ITM: image–text matching. CL: contrastive learning.

Fig. 6. Illustration of drawbacks of image–text matching (ITM) and image–text contrastive (ITC) pretraining. (a) From a global perspective, these methods pull positive image–text representations closer while they push image–text representations from different cases apart even if they share similar semantics. (b) An example of multiple-to-multiple correspondences between medical imaging and text from a local perspective.

Fig. 7. Illustration of prompt tuning. (a) Textual prompt tuning. (b) Visual prompt tuning. For brevity, the illustration of visual-textual prompt tuning is omitted, as it integrates both textual and visual prompt tuning. During the tuning process, all encoders’ parameters remain frozen. (c) A common placement of adapters among intermediate layers.

Fig. 8. Illustration of adapter tuning and LoRA (low-rank adaptation). In (a) and (b), we highlight the differences between the vanilla adapter type and the LoRA type. (a) Details of a vanilla adapter. (b) Details of a LoRA adapter. For simplicity, we have omitted the bias terms.
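The difference that Fig. 8 highlights can also be written out in code. The sketch below is a generic, standard-library-only illustration and is not taken from the survey: it assumes the common textbook forms, namely a vanilla adapter as a residual bottleneck with a nonlinearity, and LoRA as a low-rank update added in parallel to a frozen weight matrix. Bias terms are omitted, as in the figure; the function names, dimensions, and the `alpha` scaling value are assumptions.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, x) for x in vec]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def vanilla_adapter(hidden, w_down, w_up):
    """Residual bottleneck: hidden + W_up * relu(W_down * hidden).

    Inserted in series after a frozen layer; only w_down and w_up are trained.
    """
    update = matvec(w_up, relu(matvec(w_down, hidden)))
    return [h + u for h, u in zip(hidden, update)]


def lora_linear(x, w_frozen, lora_a, lora_b, alpha, rank):
    """Frozen linear map plus a parallel low-rank update: W x + (alpha / rank) * B A x.

    No nonlinearity sits between A and B, so B A can be merged into W after training.
    """
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, rank = 4, 2                       # illustrative sizes
    x = [1.0, -2.0, 0.5, 3.0]
    w_frozen = random_matrix(dim, dim, rng)
    lora_a = random_matrix(rank, dim, rng)
    lora_b = zeros(dim, rank)              # zero init: the update starts at zero
    print(lora_linear(x, w_frozen, lora_a, lora_b, alpha=4.0, rank=rank))
    print(vanilla_adapter(x, random_matrix(rank, dim, rng), zeros(dim, rank)))
```

With the up-projection (or `lora_b`) initialized to zero, both modules start out as an identity change to the frozen model. For a layer of size d_in by d_out, the LoRA branch trains rank × (d_in + d_out) parameters instead of d_in × d_out.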

Fig. 9. Illustration of visual representation learning. The upper row visualizes that image features are mapped into one-hot encoding space as vanilla classification and segmentation tasks do. In contrast, image features are mapped into the word representation space and aligned with the corresponding class-wise word representations. Compared with the former, image-word representation alignment presents a relation-dependent distribution.
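The idea in Fig. 9 can be made concrete with a toy example. In the sketch below, an image feature is classified by cosine similarity to class-wise word embeddings rather than by a one-hot output layer. Everything here is hypothetical: the class names, the three-dimensional embeddings, and the function names are invented for illustration and do not come from any model or dataset in the survey.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_by_word_alignment(image_feature, class_word_embeddings):
    """Return the best-matching class name and the similarity to every class."""
    scores = {name: cosine(image_feature, emb)
              for name, emb in class_word_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical word embeddings: two related findings and one unrelated finding.
CLASS_WORD_EMBEDDINGS = {
    "pneumonia": [0.9, 0.1, 0.0],
    "consolidation": [0.8, 0.3, 0.1],
    "fracture": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    image_feature = [1.0, 0.2, 0.0]  # hypothetical image feature in the word space
    label, scores = classify_by_word_alignment(image_feature, CLASS_WORD_EMBEDDINGS)
    print(label)
    for name, score in scores.items():
        print(f"{name}: {score:.3f}")
```

In a one-hot target space all wrong classes are equally far from the correct one. In the word space the feature scores much higher against the related class than against the unrelated one, which is the relation-dependent distribution the caption refers to.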

Fig. 10. Illustration of visual recognition enhancement. Position information can guide visual models for more accurate localization. Attributes such as shape, color, etc. capture the target’s primary characteristics, aiding in precise target identification. Descriptions of the target’s appearance enhance the model’s ability to recognize the target. Such auxiliary information can be introduced through manual prompts or automatic prompts without requiring extra annotations.

Fig. 11. An illustration of medical image–text multimodal fusion. Related methods are categorized into (a) Early fusion, (b) Intermediate fusion, (c) Late fusion, and (d) Simultaneous fusion.

Table

Table 1. Summary of medical VLP methods. MP: masked prediction. MaP: matching prediction. CL: contrastive learning. HA: hybrid approach. MLM: masked language modeling. MIM: masked image modeling. ITM: image–text matching. GCL: global contrastive learning. LCL: local contrastive learning. NA: not available.

Table 2. Summary of medical VLM transfer learning. CLS: classification. SEG: segmentation. VQA: visual question answering. AD: anomaly detection. MT: multiple tasks.

Table 3. Summary of medical VLM knowledge distillation. VRL: visual representation learning. VRE: visual reasoning enhancement. CLS: classification. SEG: segmentation. PP: position prediction. DET: detection. VLP: vision-language pretraining. RG: report generation.