---
inference: false
language:
- th
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2-VL-7B-Instruct
---
# Typhoon2-Vision
**Typhoon2-qwen2vl-7b-vision-instruct** is a Thai 🇹🇭 vision-language model that accepts both image and video inputs. While Qwen2-VL is designed to handle image and video tasks alike, Typhoon2-VL is optimized specifically for image-based applications.

For technical details, please see our [technical report on arXiv](https://arxiv.org/abs/2412.13702).
## Model Description

Here we provide **Typhoon2-qwen2vl-7b-vision-instruct**, which is built on top of Qwen2-VL-7B-Instruct.
## Quickstart

The following code snippets show how to run the model with the `transformers` library.

Before running the code, install the required dependencies:

```bash
pip install torch transformers accelerate pillow
```
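Note that the snippets below move inputs to `"cuda"`, so a CUDA-capable GPU and a GPU build of PyTorch are assumed; on a CPU-only machine, drop the `inputs.to("cuda")` line (generation will be considerably slower).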
## How to Use the Model

Use the code below to get started with the model.

**Question:** Identify the name of the place and the country in this image, in Thai.

**Answer:** The Grand Palace, Bangkok, Thailand
```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Download the example image
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One image placeholder followed by the text prompt
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Identify the name of the place and the country in this image, in Thai."},
        ],
    }
]

# Build the chat prompt and preprocess text + image together
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the newly generated tokens are decoded
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```
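The example above fetches the image over HTTP. If the image is already on disk, PIL can open it directly; a minimal variation (the file path below is hypothetical), with the rest of the pipeline unchanged:

```python
from PIL import Image

# Open a local file instead of downloading one (path is illustrative)
image = Image.open("images/grand_palace.jpg").convert("RGB")
```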
## Processing Multiple Images
```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Two image placeholders: one per image passed to the processor, in the same order
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Identify three similarities between these two images."},
        ],
    }
]

urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
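One detail worth calling out in the snippet above: each `{"type": "image"}` entry is a placeholder that `apply_chat_template` expands, and the `images` list passed to the processor must contain the same number of images, matched to the placeholders in order. A small sanity check along these lines (illustrative, not part of the original example):

```python
# Compare the number of image placeholders with the number of loaded images
num_placeholders = sum(
    1
    for message in conversation
    for part in message["content"]
    if part.get("type") == "image"
)
assert num_placeholders == len(images), (
    f"{num_placeholders} image placeholder(s) but {len(images)} image(s) provided"
)
```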
## Tips

To balance model performance and computational cost, you can set a minimum and maximum pixel count by passing arguments to the processor.
```python
min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
```
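As a rough guide (based on how Qwen2-VL preprocesses images, where each 28 × 28 pixel patch corresponds to one visual token), the settings above allow roughly 128–2560 visual tokens per image; lowering `max_pixels` reduces memory use and latency at the cost of fine-grained visual detail, while raising `min_pixels` does the opposite.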
## Evaluation (Image)
| Benchmark | Llama-3.2-11B-Vision-Instruct | Qwen2-VL-7B-Instruct | Pathumma-llm-vision-1.0.0 | Typhoon2-qwen2vl-7b-vision-instruct |
|---|---|---|---|---|
| OCRBench (Liu et al., 2024c) | 72.84 / 51.10 | 72.31 / 57.90 | 32.74 / 25.87 | 64.38 / 49.60 |
| MMBench (Dev) (Liu et al., 2024b) | 76.54 / - | 84.10 / - | 19.51 / - | 83.66 / - |
| ChartQA (Masry et al., 2022) | 13.41 / x | 47.45 / 45.00 | 64.20 / 57.83 | 75.71 / 72.56 |
| TextVQA (Singh et al., 2019) | 32.82 / x | 91.40 / 88.70 | 32.54 / 28.84 | 91.45 / 88.97 |
| OCR (TH) (OpenThaiGPT, 2024) | 64.41 / 35.58 | 56.47 / 55.34 | 6.38 / 2.88 | 64.24 / 63.11 |
| M3Exam Images (TH) (Zhang et al., 2023c) | 25.46 / - | 32.17 / - | 29.01 / - | 33.67 / - |
| GQA (TH) (Hudson et al., 2019) | 31.33 / - | 34.55 / - | 10.20 / - | 50.25 / - |
| MTVQ (TH) (Tang et al., 2024b) | 11.21 / 4.31 | 23.39 / 13.79 | 7.63 / 1.72 | 30.59 / 21.55 |
| **Average** | 37.67 / x | 54.26 / 53.85 | 25.61 / 23.67 | 62.77 / 59.02 |
Note: In each cell, the first value is **Rouge-L**; the second value (after the `/`) is **accuracy**, normalized such that Rouge-L = 100%.
## Intended Uses & Limitations

This is an instruction-tuned model, but it is still under development. It incorporates some level of guardrails, yet it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.
## Follow Us
https://twitter.com/opentyphoon
## Support
https://discord.gg/us5gAYmrxw
## Citation

- If you find Typhoon2 useful for your work, please cite it using:
```bibtex
@misc{typhoon2,
  title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
  author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
  year={2024},
  eprint={2412.13702},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13702},
}
```