language: tr
An easy-to-use named entity recognition (NER) application for Turkish.
An easy-to-use Python NER model for Turkish (BERT + transfer learning)...
Citation
If you use this model in your research, please cite the following:
@misc{yildirim2024finetuning,
title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks},
author={Savas Yildirim},
year={2024},
eprint={2401.17396},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@book{yildirim2021mastering,
title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
year={2021},
publisher={Packt Publishing Ltd}
}
Other details
Thanks to @stefan-it, I applied the following steps during training:
mkdir -p tr-data
cd tr-data
for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
cd ..
This downloads the preprocessed dataset (training, dev, and test splits, plus the label list) and places it in the tr-data folder.
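A quick way to sanity-check the download is to count the sentences and labels in each split. The sketch below assumes the files use the common CoNLL-style layout (one whitespace-separated `token label` pair per line, with a blank line between sentences). That layout is an assumption, not something stated here, so adjust the parsing if your files differ. The helper names `read_conll` and `label_counts` are illustrative.

```python
from collections import Counter
from pathlib import Path


def read_conll(path):
    """Read a CoNLL-style file into a list of sentences.

    Each sentence is a list of (token, label) tuples. Assumes one
    whitespace-separated "token label" pair per line and a blank line
    between sentences (an assumption about the file layout).
    """
    sentences, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is taken as the token, last column as the label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how often each label occurs across all tokens."""
    return Counter(label for sent in sentences for _, label in sent)


if __name__ == "__main__":
    for split in ("train.txt", "dev.txt", "test.txt"):
        path = Path("tr-data") / split
        if path.exists():
            sents = read_conll(path)
            print(split, len(sents), "sentences", dict(label_counts(sents)))
```

If the label counts include tags that are missing from labels.txt, the training script is likely to fail or mislabel, so this check is cheap insurance before a long run.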
Run fine-tuning
After downloading the dataset, fine-tuning can be started. Just set the following environment variables:
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
Then run the fine-tuning command:
python3 run_ner_old.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
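To see how SAVE_STEPS relates to the other settings, it helps to work out the number of training steps per epoch. The sketch below assumes a training set of 20,000 sentences, a single GPU, and no gradient accumulation. None of these figures is stated here, so substitute your own. Under those assumptions a batch size of 32 gives 625 steps per epoch, so SAVE_STEPS=625 would write roughly one checkpoint per epoch.

```python
import math


def steps_per_epoch(num_examples, per_gpu_batch_size, num_gpus=1):
    """Number of batches needed for one pass over the training data."""
    effective_batch = per_gpu_batch_size * num_gpus
    return math.ceil(num_examples / effective_batch)


# Values taken from the environment variables above.
BATCH_SIZE = 32
NUM_EPOCHS = 3
SAVE_STEPS = 625

# Assumed training-set size (hypothetical; replace with your own count).
NUM_TRAIN_EXAMPLES = 20_000

per_epoch = steps_per_epoch(NUM_TRAIN_EXAMPLES, BATCH_SIZE)
total_steps = per_epoch * NUM_EPOCHS
checkpoints = total_steps // SAVE_STEPS

print(f"steps per epoch: {per_epoch}")
print(f"total steps: {total_steps}")
print(f"checkpoints written: {checkpoints}")
```

If you change the batch size or train on several GPUs, rescale SAVE_STEPS the same way to keep the checkpoint frequency comparable.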
Usage
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
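By default the pipeline returns one prediction per token, so a multi-word name comes back as several entries. The sketch below shows one way to merge those entries into whole entities. It assumes each prediction is a dict with `entity`, `index`, and `word` keys, that labels follow the B-/I- scheme (for example B-PER, I-PER), and that WordPiece continuations start with `##`. The `sample` list is hand-written for illustration and is not actual model output, and `group_entities` is a hypothetical helper name.

```python
def group_entities(token_predictions):
    """Merge per-token B-/I- predictions into whole entity spans."""
    entities = []
    current = None      # entity being built: {"type": ..., "text": ...}
    last_index = None   # token position of the previous prediction

    for pred in token_predictions:
        label, word, index = pred["entity"], pred["word"], pred["index"]
        if label == "O":
            # Non-entity token: close any open entity and move on.
            if current is not None:
                entities.append(current)
                current = None
            last_index = index
            continue

        prefix, _, entity_type = label.partition("-")
        is_subword = word.startswith("##")
        adjacent = last_index is not None and index == last_index + 1

        if current is not None and adjacent and is_subword:
            # WordPiece continuation: glue onto the previous token.
            current["text"] += word[2:]
        elif (current is not None and adjacent and prefix == "I"
              and entity_type == current["type"]):
            # Next word of the same entity.
            current["text"] += " " + word
        else:
            # Start of a new entity.
            if current is not None:
                entities.append(current)
            text = word[2:] if is_subword else word
            current = {"type": entity_type, "text": text}
        last_index = index

    if current is not None:
        entities.append(current)
    return entities


# Hand-written sample in the assumed output format (not real model output).
sample = [
    {"entity": "B-PER", "index": 1, "word": "Mustafa"},
    {"entity": "I-PER", "index": 2, "word": "Kemal"},
    {"entity": "I-PER", "index": 3, "word": "Atatürk"},
    {"entity": "B-LOC", "index": 9, "word": "Samsun"},
]

for entity in group_entities(sample):
    print(entity["type"], entity["text"])
```

Recent versions of transformers also accept `aggregation_strategy="simple"` as an argument to `pipeline(...)`, which performs a similar grouping inside the library. The manual version is mainly useful when you want control over how tokens are joined.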
Some results
Dataset 1: the data above
Eval results:
- precision = 0.916400580551524
- recall = 0.9342309684101502
- f1 = 0.9252298787412536
- loss = 0.11335893666411284
Test results:
- precision = 0.9192058759362955
- recall = 0.9303010230367262
- f1 = 0.9247201697271198
- loss = 0.11182546521618497
Dataset 2:
Data provided by @kemalaraz (link: https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt)
Performance on this dataset:
Eval results:
- precision = 0.9461980692049029
- recall = 0.959309358847465
- f1 = 0.9527086063783312
- loss = 0.037054269206847804
Test results:
- precision = 0.9458370635631155
- recall = 0.9588201928530913
- f1 = 0.952284378344882
- loss = 0.035431676572445225
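The F1 figures above can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two: F1 = 2PR / (P + R). The short sketch below recomputes it for each result set. The numbers are copied from the lists above, and the dictionary keys are just illustrative labels.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results above.
reported = {
    "dataset 1 / eval": (0.916400580551524, 0.9342309684101502, 0.9252298787412536),
    "dataset 1 / test": (0.9192058759362955, 0.9303010230367262, 0.9247201697271198),
    "dataset 2 / eval": (0.9461980692049029, 0.959309358847465, 0.9527086063783312),
    "dataset 2 / test": (0.9458370635631155, 0.9588201928530913, 0.952284378344882),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.6f}, reported F1 = {f1:.6f}")
```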