【论文导读】多模态大语言模型综述(一)介绍
TIP! Right-click and select "Save link as..." to download.
视频简介
本篇综述介绍内容主要包括多模态大语言模型(MLLM)的相关概念(包括体系结构、训练策略和数据以及评估)、MLLM研究主题、MLLM幻觉、MLLM技术(包括多模态上下文学习、多模态思维链和语言模型辅助视觉推理)和展望。
综述论文:
@misc{yin2023survey,
title={A Survey on Multimodal Large Language Models},
author={Shukang Yin and Chaoyou Fu and Sirui Zhao and Ke Li and Xing Sun and Tong Xu and Enhong Chen},
year={2023},
eprint={2306.13549},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
推荐阅读:
[CLIP@OpenAI,OpenCLIP]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in CVPR, 2023.
[GPT-4@OpenAI]
OpenAI, “GPT-4 Technical Report,” Mar. 04, 2024, arXiv: arXiv:2303.08774.
[LLaVA,LLaVA-1.5]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, ‘Visual Instruction Tuning’, in Advances in Neural Information Processing Systems, Dec. 2023, pp. 34892–34916
H. Liu, C. Li, Y. Li, and Y. J. Lee, ‘Improved Baselines with Visual Instruction Tuning’, presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26296–26306.
[BLIP@Salesforce Research,BLIP-2@Salesforce Research]
J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping languageimage pre-training for unified vision-language understanding and generation,” in ICML, 2022
J. Li, D. Li, S. Savarese, and S. Hoi, ‘BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models’, in Proceedings of the 40th International Conference on Machine Learning, PMLR, Jul. 2023, pp. 19730–19742
[Chameleon@Meta AI]
Chameleon Team, ‘Chameleon: Mixed-Modal Early-Fusion Foundation Models’, May 16, 2024, arXiv: arXiv:2405.09818. doi: 10.48550/arXiv.2405.09818
[MiniGPT-4]
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, ‘MiniGPT-4: Enhancing V