herbert-base-ner: an open-source Polish named entity recognition model that identifies persons, locations, and organizations

Herbert Base Ner

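Developed by pczarnik

A Polish named entity recognition model fine-tuned from HerBERT that recognizes three entity types: persons, locations, and organizations

Sequence labeling

Transformers

Other: #Polish named entity recognition #High-precision NER #HerBERT fine-tuning

Downloads: 394

Release date: 5/27/2023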

Model Overview

This is a named entity recognition model fine-tuned from allegro/herbert-base-cased, intended for entity recognition in Polish text.

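Model Features

Polish-specific

Optimized specifically for Polish named entity recognition; handles Polish-specific characters and grammatical structures.

High-precision recognition

Achieves 0.89 precision and 0.91 recall on the wikiann test set.

Three entity types

Recognizes three entity types: person (PER), location (LOC), and organization (ORG).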

Model Capabilities

Polish text analysis

Named entity recognition

Person name detection

Location name detection

Organization name detection

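Use Cases

Text information extraction

Personal information extraction

Extract personal information such as person names and place names from text.

Recognizes complex Polish personal and place names.

Organization information extraction

Identify the names of organizations mentioned in text.

Recognizes names of government agencies, companies, and other organizations.

Document processing

Automatic document annotation

Automatically annotate named entities in Polish documents.

Speeds up document processing and reduces manual annotation effort.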

🚀 herbert-base-ner

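herbert-base-ner is a fine-tuned HerBERT model for named entity recognition. It was trained to recognize three types of entities: person (PER), location (LOC), and organization (ORG).

Specifically, it is the allegro/herbert-base-cased model fine-tuned on the Polish subset of the wikiann dataset.

🚀 Quick Start

You can use this model for named entity recognition (NER) with the Transformers pipeline.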

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned checkpoint and its tokenizer.
model_checkpoint = "pczarnik/herbert-base-ner"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

# Build a token-classification pipeline and run it on a Polish sentence.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Nazywam się Grzegorz Brzęszczyszczykiewicz, pochodzę "\
    "z Chrząszczyżewoszczyc, pracuję w Łękołodzkim Urzędzie Powiatowym"

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.99451494, 'index': 4, 'word': 'Grzegorz</w>', 'start': 12, 'end': 20},
 {'entity': 'I-PER', 'score': 0.99758506, 'index': 5, 'word': 'B', 'start': 21, 'end': 22},
 {'entity': 'I-PER', 'score': 0.99749386, 'index': 6, 'word': 'rzę', 'start': 22, 'end': 25},
 {'entity': 'I-PER', 'score': 0.9973041, 'index': 7, 'word': 'szczy', 'start': 25, 'end': 30},
 {'entity': 'I-PER', 'score': 0.99682057, 'index': 8, 'word': 'szczy', 'start': 30, 'end': 35},
 {'entity': 'I-PER', 'score': 0.9964832, 'index': 9, 'word': 'kiewicz</w>', 'start': 35, 'end': 42},
 {'entity': 'B-LOC', 'score': 0.99427444, 'index': 14, 'word': 'Chrzą', 'start': 55, 'end': 60},
 {'entity': 'I-LOC', 'score': 0.99143463, 'index': 15, 'word': 'szczy', 'start': 60, 'end': 65},
 {'entity': 'I-LOC', 'score': 0.9922201, 'index': 16, 'word': 'że', 'start': 65, 'end': 67},
 {'entity': 'I-LOC', 'score': 0.9918464, 'index': 17, 'word': 'wo', 'start': 67, 'end': 69},
 {'entity': 'I-LOC', 'score': 0.9900766, 'index': 18, 'word': 'szczy', 'start': 69, 'end': 74},
 {'entity': 'I-LOC', 'score': 0.98823845, 'index': 19, 'word': 'c</w>', 'start': 74, 'end': 75},
 {'entity': 'B-ORG', 'score': 0.6808262, 'index': 23, 'word': 'Łę', 'start': 87, 'end': 89},
 {'entity': 'I-ORG', 'score': 0.7763973, 'index': 24, 'word': 'ko', 'start': 89, 'end': 91},
 {'entity': 'I-ORG', 'score': 0.77731717, 'index': 25, 'word': 'ło', 'start': 91, 'end': 93},
 {'entity': 'I-ORG', 'score': 0.9108255, 'index': 26, 'word': 'dzkim</w>', 'start': 93, 'end': 98},
 {'entity': 'I-ORG', 'score': 0.98050755, 'index': 27, 'word': 'Urzędzie</w>', 'start': 99, 'end': 107},
 {'entity': 'I-ORG', 'score': 0.9789752, 'index': 28, 'word': 'Powiatowym</w>', 'start': 108, 'end': 118}]
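The output above is per subword token, so long names are split into several B-/I- tagged pieces. The following is a minimal sketch of how those pieces could be merged back into whole entities using the `start`/`end` character offsets. The helper name `merge_entities`, the choice to report the minimum token score per entity, and the rule that pieces at most one character apart belong together are assumptions made for illustration; they are not part of the model or of Transformers.

```python
from typing import Dict, List

text = ("Nazywam się Grzegorz Brzęszczyszczykiewicz, pochodzę "
        "z Chrząszczyżewoszczyc, pracuję w Łękołodzkim Urzędzie Powiatowym")

# (entity, score, start, end) copied from the pipeline output above.
raw = [
    ("B-PER", 0.99451494, 12, 20), ("I-PER", 0.99758506, 21, 22),
    ("I-PER", 0.99749386, 22, 25), ("I-PER", 0.9973041, 25, 30),
    ("I-PER", 0.99682057, 30, 35), ("I-PER", 0.9964832, 35, 42),
    ("B-LOC", 0.99427444, 55, 60), ("I-LOC", 0.99143463, 60, 65),
    ("I-LOC", 0.9922201, 65, 67), ("I-LOC", 0.9918464, 67, 69),
    ("I-LOC", 0.9900766, 69, 74), ("I-LOC", 0.98823845, 74, 75),
    ("B-ORG", 0.6808262, 87, 89), ("I-ORG", 0.7763973, 89, 91),
    ("I-ORG", 0.77731717, 91, 93), ("I-ORG", 0.9108255, 93, 98),
    ("I-ORG", 0.98050755, 99, 107), ("I-ORG", 0.9789752, 108, 118),
]
tokens = [dict(entity=e, score=s, start=a, end=b) for e, s, a, b in raw]


def merge_entities(text: str, tokens: List[Dict]) -> List[Dict]:
    """Hypothetical helper: merge B-/I- tagged subword tokens into entity spans."""
    spans: List[Dict] = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:  # skip untagged ("O") tokens if present
            continue
        current = spans[-1] if spans else None
        continues = (
            prefix == "I"
            and current is not None
            and current["label"] == label
            # Assumption: pieces of one entity touch or are one space apart.
            and tok["start"] - current["end"] <= 1
        )
        if continues:
            current["end"] = tok["end"]
            current["scores"].append(tok["score"])
        else:
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "scores": [tok["score"]]})
    # Slice the original text so diacritics and spacing are preserved exactly.
    return [
        {"label": s["label"], "text": text[s["start"]:s["end"]],
         "score": min(s["scores"])}
        for s in spans
    ]


entities = merge_entities(text, tokens)
for ent in entities:
    print(ent["label"], ent["text"], round(ent["score"], 3))
```

In practice, the Transformers token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens for you; the sketch above only makes the grouping logic explicit.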

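📚 Detailed Documentation

Model Information

Property	Details
Model type	HerBERT model fine-tuned for named entity recognition
Training data	Polish subset of the wikiann dataset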

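Model Evaluation Results

Task	Dataset	Metric	Value
Token classification	wikiann (Polish test set)	Precision	0.8857142857142857
Token classification	wikiann (Polish test set)	Recall	0.9070532179048386
Token classification	wikiann (Polish test set)	F1	0.896256755412619
Token classification	wikiann (Polish test set)	Accuracy	0.9581463871961428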
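As a quick sanity check on the table, the F1 value can be recomputed from the reported precision and recall. This sketch assumes F1 here is the usual harmonic mean of the two; the source does not state how the metrics were computed, so treat this as a consistency check rather than a description of the evaluation script.

```python
# Values copied from the evaluation table above.
precision = 0.8857142857142857
recall = 0.9070532179048386
reported_f1 = 0.896256755412619

# Harmonic mean of precision and recall (assumed definition of F1).
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1: {f1:.6f}")
print(f"difference from reported: {abs(f1 - reported_f1):.2e}")
```

The recomputed value matches the reported F1 to within rounding, which suggests the three numbers are mutually consistent.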

BibTeX Entry and Citation Info

@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}

@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
}