模型简介
模型特点
模型能力
使用案例
🚀 InternVL3-78B Transformers 🤗 实现
本项目是基于Hugging Face 🤗 Transformers库对OpenGVLab/InternVL3-78B模型的实现。它在功能上与原始的OpenGVLab版本等效,作为原生的Transformers模型,支持核心库的各种特性,如不同的注意力机制实现(包括SDPA和FA2),并能高效处理图像、视频和文本的批量推理任务。
📜 [InternVL 1.0] 📜 [InternVL 1.5] 📜 [InternVL 2.5] 📜 [InternVL2.5-MPO] 📜 [InternVL3]
🆕 [博客] 🗨️ [聊天演示] 🤗 [HF演示] 🚀 [快速开始] 📖 [文档]

🚀 快速开始
本仓库包含了基于Hugging Face 🤗 Transformers库对 OpenGVLab/InternVL3-78B 模型的实现。它在功能上与原始的OpenGVLab版本等效,作为原生的Transformers模型,支持核心库的各种特性,如不同的注意力机制实现(包括SDPA和FA2),并能高效处理图像、视频和文本的批量推理任务。
✨ 主要特性
- 多模态处理能力:支持图像、文本和视频的输入,能够处理多种模态的信息。
- 批量推理:支持批量图像、文本和视频的输入,提高推理效率。
- 多语言支持:支持多种语言的输入和输出。
📦 安装指南
文档中未提及安装步骤,故跳过此章节。
💻 使用示例
基础用法
使用Pipeline进行推理
以下是如何使用 image-text-to-text
管道对 InternVL3
模型进行推理的示例代码:
>>> from transformers import pipeline
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-78B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
单张图像推理
此示例展示了如何使用聊天模板对InternVL模型进行单张图像推理。
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Please describe the image explicitly."},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
纯文本生成
此示例展示了如何在不提供任何图像输入的情况下,使用InternVL模型生成文本。
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "text", "text": "Write a haiku"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> print(decoded_output)
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
高级用法
批量图像和文本输入
InternVL模型还支持批量图像和文本输入。
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
批量多图像输入
InternVL模型的此实现支持每个文本对应不同数量图像的批量文本 - 图像输入。
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
>>> ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
视频输入
InternVL模型还可以处理视频输入。以下是如何使用聊天模板对视频输入进行推理的示例。
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "video",
... "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
... },
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... }
>>> ]
>>> inputs = processor.apply_chat_template(
... messages,
... return_tensors="pt",
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
>>> ).to(model.device, dtype=torch.float16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
交错的图像和视频输入
此示例展示了如何使用聊天模板处理一批包含交错图像和视频输入的聊天对话。
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
>>> ]
>>> inputs = processor.apply_chat_template(
... messages,
... padding=True,
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
>>> ).to(model.device, dtype=torch.bfloat16)
>>> outputs = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
"user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
📚 详细文档
你可以在原始检查点 OpenGVLab/InternVL3-78B 中找到更多关于InternVL3系列的信息。
🔧 技术细节
文档中未提及技术实现细节,故跳过此章节。
📄 许可证
本项目遵循MIT许可证发布。本项目使用了预训练的Qwen2.5作为组件,该组件遵循Qwen许可证。
📜 引用
如果你在研究中发现本项目有用,请考虑引用以下文献:
@article{chen2024expanding,
title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
journal={arXiv preprint arXiv:2412.05271},
year={2024}
}
@article{wang2024mpo,
title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2411.10442},
year={2024}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}
信息表格
属性 | 详情 |
---|---|
模型类型 | 图像 - 文本到文本 |
库名称 | transformers |
基础模型 | OpenGVLab/InternVL3-78B-Instruct |
基础模型关系 | 微调 |
数据集 | OpenGVLab/MMPR-v1.2 |
语言 | 多语言 |
标签 | internvl |
重要提示
⚠️ 重要提示
本仓库包含了基于Hugging Face 🤗 Transformers库对 OpenGVLab/InternVL3-78B 模型的实现。它在功能上与原始的OpenGVLab版本等效。
💡 使用建议
请注意,模型是使用特定的聊天提示格式进行训练的。使用
processor.apply_chat_template(my_conversation_dict)
来正确格式化你的提示。








