🚀 HoogBERTa
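This repository contains the Thai pretrained language representation model (HoogBERTa_base) fine-tuned for the Named Entity Recognition (NER) task.
🚀 Quick Start
Prerequisites
Because HoogBERTa uses subword-nmt BPE encoding, the input must be pre-tokenized according to the BEST standard before it is fed into the model. AttaCut is used for this word segmentation step: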
pip install attacut
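The pre-tokenization step can be sketched as a small helper. This is a minimal sketch and not part of the HoogBERTa code base: `prepare_input` and `word_tokenize` are hypothetical names, and the word segmenter is passed in as a callable (in practice this would presumably be `attacut.tokenize`) so the helper has no dependencies of its own. It follows the same steps as the tagging code in this README: split the text on spaces, segment each chunk into words, escape literal underscores as `[!und:]`, and join the chunks with ` _ ` so that each original space is represented by a standalone underscore token.

```python
from typing import Callable, List


def prepare_input(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Pre-tokenize `text` into the space-separated form used in this README.

    `word_tokenize` is any callable that segments a string into words.
    Using `attacut.tokenize` here is an assumption of this sketch.
    """
    chunks = []
    for chunk in text.split(" "):
        words = word_tokenize(chunk)
        # Escape literal underscores, since "_" is reserved as the space marker.
        chunks.append(" ".join(words).replace("_", "[!und:]"))
    # Each original space becomes a standalone "_" token between chunks.
    return " _ ".join(chunks)


# Toy segmenter for illustration only: one "word" per character.
print(prepare_input("ab c_d", list))  # prints: a b _ c [!und:] d
```

Passing the segmenter as an argument keeps the helper easy to test with a stand-in, and lets a different BEST-style segmenter be swapped in without changing the escaping logic.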
Initializing the model
To initialize the model from the Hugging Face hub, use the following commands:
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch
tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-NER-lst20")
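NER tagging
To tag named entities, use the following commands:
from transformers import pipeline

# aggregation_strategy="none" returns one prediction per subword token
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    # Segment each chunk into words and escape literal underscores
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

# Mark each original space with a standalone "_" token
sentence = " _ ".join(all_sent)

print(nlp(sentence))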
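With `aggregation_strategy="none"`, the `transformers` token-classification pipeline returns a list of dictionaries, one per subword token, with keys such as `entity`, `score`, `index`, `word`, `start` and `end`. The following is a minimal sketch of how that output might be filtered down to the tagged tokens. `tagged_tokens` is a hypothetical helper, and the assumption that untagged tokens carry the label `"O"` should be checked against the label set of the model. The sample records are hand-written placeholders in the pipeline's output shape, not real model predictions.

```python
from typing import Dict, List, Tuple


def tagged_tokens(predictions: List[Dict], outside_label: str = "O") -> List[Tuple[str, str]]:
    """Return (word, label) pairs for tokens whose label is not `outside_label`."""
    return [(p["word"], p["entity"]) for p in predictions if p["entity"] != outside_label]


# Hand-written records in the shape the pipeline returns. The words, labels,
# and scores are placeholders for illustration, not real model output.
example = [
    {"entity": "O", "score": 0.99, "index": 1, "word": "foo", "start": 0, "end": 3},
    {"entity": "B_LOC", "score": 0.98, "index": 2, "word": "bar", "start": 4, "end": 7},
]
print(tagged_tokens(example))  # prints: [('bar', 'B_LOC')]
```

When the pipeline is called with a list of inputs, it returns one such list per input, so the helper would be applied to each element in turn.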
Batch processing
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        # Segment each chunk into words and escape literal underscores
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    # Mark each original space with a standalone "_" token
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

# Passing a list runs the pipeline on every pre-tokenized sentence
print(nlp(inputList))
📚 Documentation
Huggingface Models
HoogBERTaEncoder
- [HoogBERTa](https://huggingface.co/lst-nectec/HoogBERTa): feature extraction and masked language modeling
HoogBERTaMuliTaskTagger
- [HoogBERTa-NER-lst20](https://huggingface.co/lst-nectec/HoogBERTa-NER-lst20): Named Entity Recognition (NER) based on the LST20 dataset
- [HoogBERTa-POS-lst20](https://huggingface.co/lst-nectec/HoogBERTa-POS-lst20): Part-of-Speech (POS) tagging based on the LST20 dataset
- [HoogBERTa-SENTENCE-lst20](https://huggingface.co/lst-nectec/HoogBERTa-SENTENCE-lst20): clause boundary classification based on the LST20 dataset
Citation
Please cite as:
@inproceedings{porkaew2021hoogberta,
title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
year = {2021},
address={Online}
}
Download the full-text PDF
Check out the code on Github