JPharmatron-7B-base: an open-source Japanese-English bilingual large language model for pharmaceutical applications and research

Jpharmatron 7B Base

Developed by EQUES

JPharmatron-7B-base is a 7-billion-parameter Japanese and English large language model designed for pharmaceutical applications and research.

Large language model

Transformers

Multilingual

#Pharmaceutical domain #Japanese-English bilingual #Continual pretraining

Released: 4/1/2025

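Model overview

The model builds on Qwen2.5-7B and was continually pretrained on 2 billion tokens from Japanese datasets, with a focus on natural language processing tasks in the pharmaceutical domain.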

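Model features

Domain focus

Designed for pharmaceutical applications and research, with domain-specific adaptation.

Multilingual support

Supports Japanese and English, which makes it suitable for cross-lingual pharmaceutical research.

Continual pretraining

Built on Qwen2.5-7B and continually pretrained on 2 billion Japanese pharmaceutical-domain tokens.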

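Model capabilities

Pharmaceutical text understanding

Cross-lingual terminology normalization

Pharmaceutical knowledge question answering

Pharmaceutical document analysis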

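Use cases

Pharmaceutical research

Pharmacist licensing exam question answering

Question answering based on the content of Japan's national pharmacist licensing examination

Performs strongly on the YakugakuQA benchmark

Cross-lingual terminology normalization

Handles drug synonyms and terminology normalization between Japanese and English

Shows competitive results on the NayoseQA benchmark

Statement consistency checking

Assesses consistency reasoning between paired statements

Outperforms some commercial models on the SogoCheck task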
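Exam-style question answering of the kind described above is commonly scored by comparing the option a model picks with an answer key. The following is a minimal sketch of that scoring step only. It is not the evaluation code from the project repository: the function names, the assumption that each answer is a single option number from 1 to 5, and the sample completions are all hypothetical.

```python
import re
import unicodedata
from typing import List, Optional


def extract_choice(completion: str) -> Optional[str]:
    """Return the first option digit (1-5) found in a completion, or None.

    NFKC normalization folds full-width digits such as "３" into ASCII "3",
    which matters for Japanese text. The 1-5 range is an assumption.
    """
    normalized = unicodedata.normalize("NFKC", completion)
    match = re.search(r"[1-5]", normalized)
    return match.group(0) if match else None


def accuracy(completions: List[str], answer_key: List[str]) -> float:
    """Fraction of completions whose extracted option matches the key."""
    if len(completions) != len(answer_key):
        raise ValueError("completions and answer_key must have equal length")
    if not answer_key:
        return 0.0
    correct = sum(
        extract_choice(text) == expected
        for text, expected in zip(completions, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical model outputs and answer key, for illustration only.
sample_completions = ["答え: ３", "2 が正しい。", "わからない"]
sample_key = ["3", "1", "4"]
print(f"accuracy = {accuracy(sample_completions, sample_key):.3f}")
```

Taking the first digit is a deliberately crude extraction rule. A completion that restates the question number or lists several options before answering would be misread, so a real harness would need a stricter answer format.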

🚀 JPharmatron-7B-base

JPharmatron-7B-base is a 7-billion-parameter large language model designed for pharmaceutical applications and research.
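As a rough guide to what the parameter count means in practice, the sketch below estimates the memory needed to hold the weights alone at a few common precisions. It assumes the nominal figure of 7 billion parameters stated above (the exact count may differ) and ignores activations, the key-value cache, and framework overhead. The helper name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 7e9  # nominal count from the model description, not exact

for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    gib = weight_memory_gib(NOMINAL_PARAMS, nbytes)
    print(f"{dtype_name}: about {gib:.1f} GiB")
```

At 2 bytes per parameter this comes to roughly 13 GiB for the weights, so actual inference needs more than that once runtime buffers are included.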

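🚀 Quick start

This model has not undergone any post-training, including instruction tuning, so using it directly for downstream tasks is not recommended. It has also not been validated for medical use or any other risk-sensitive use.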
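Because the model is a base model without instruction tuning, it is better treated as a text completer than as a chat assistant. The sketch below shows one way to do that with the Hugging Face Transformers library: build a few-shot prompt and let the model continue it. Several things here are assumptions rather than details from the model card: the repository id `EQUES/JPharmatron-7B-base`, the `質問:`/`回答:` prompt layout, the helper names, and the example question-answer pairs. `device_map="auto"` additionally requires the `accelerate` package.

```python
from typing import List, Tuple

# Assumed Hugging Face repository id; verify it before use.
MODEL_ID = "EQUES/JPharmatron-7B-base"


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format (question, answer) pairs plus a final open question.

    The prompt ends with an unanswered "回答:" so that a base model
    continues the pattern. The layout is illustrative only.
    """
    blocks = [f"質問: {q}\n回答: {a}" for q, a in examples]
    blocks.append(f"質問: {question}\n回答:")
    return "\n\n".join(blocks)


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy continuation of `prompt`; downloads and loads the full model."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Illustrative few-shot examples, not taken from any benchmark.
few_shot = [
    ("アセトアミノフェンの主な作用は何ですか。", "解熱鎮痛作用"),
    ("ワルファリンの作用を減弱させるビタミンは何ですか。", "ビタミンK"),
]
prompt = build_few_shot_prompt(
    few_shot, "メトホルミンは主にどの疾患の治療に用いられますか。"
)
print(prompt)
# To run generation (requires the model weights and sufficient memory):
# print(complete(prompt))
```

A base model will usually keep generating further question-answer pairs after the first answer, so callers typically truncate the output at the first blank line. Any output remains unvalidated for medical or other risk-sensitive use, as stated above.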

✨ Key features

Domain focus: designed for pharmaceutical applications and research.
Multilingual support: supports Japanese and English.

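📚 Documentation

🔍 Model details

Model description

JPharmatron-7B-base is based on Qwen2.5-7B and was continually pretrained on 2 billion tokens from Japanese datasets.

Developed by: EQUES Inc.
Funded by: GENIAC Project
Model type: autoregressive decoder
Languages: Japanese, English
License: CC-BY-SA-4.0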

Model sources

Repository: https://github.com/EQUES-Inc/pharma-LLM-eval
Paper: A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

📖 Citation

BibTeX:

@misc{sukeda_japanese_2025,
  title     = {A {Japanese} {Language} {Model} and {Three} {New} {Evaluation} {Benchmarks} for {Pharmaceutical} {NLP}},
  url       = {http://arxiv.org/abs/2505.16661},
  doi       = {10.48550/arXiv.2505.16661},
  abstract  = {We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.},
  urldate   = {2025-05-30},
  publisher = {arXiv},
  author    = {Sukeda, Issey and Fujii, Takuro and Buma, Kosei and Sasaki, Shunsuke and Ono, Shinnosuke},
  month     = may,
  year      = {2025},
  note      = {arXiv:2505.16661 [cs]},
  annote    = {Comment: 15 pages, 9 tables, 5 figures}
}

👥 Model card authors

@shinnosukeono

📄 License

This model is released under the CC-BY-SA-4.0 license.
