---
language:
- es
license: cc-by-4.0
tags:
- anglicisms
- loanwords
- borrowing
- codeswitching
- arxiv:2203.16169
datasets:
- coalas
widget:
- text: "Las fake news sobre la celebrity se reprodujeron por los mass media en prime time."
- text: "Me gusta el cine noir y el anime."
- text: "Benching, estar en el banquillo de tu 'crush' mientras otro juega de titular."
- text: "Recetas de noviembre para el batch cooking."
- text: "Utilizaron técnicas de machine learning, big data o blockchain."
---
# anglicisms-spanish-mbert

This is a pretrained model for detecting unassimilated English lexical borrowings (a.k.a. anglicisms) in Spanish newswire. The model labels words of foreign origin (primarily from English) used in Spanish, such as *fake news*, *machine learning*, *smartwatch*, *influencer* or *streaming*.

The model is a fine-tuned version of multilingual BERT trained on the COALAS corpus for the task of detecting lexical borrowings.
The model considers two labels:

* **ENG**: for English lexical borrowings (e.g. *smartphone*, *online*, *podcast*)
* **OTHER**: for lexical borrowings from other languages (e.g. *boutique*, *anime*, *umami*)

The model uses BIO encoding to account for multi-token borrowings.
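For illustration, here is a minimal sketch (not part of the model's code) of how BIO encoding represents a multi-token borrowing, along with a hypothetical helper that collapses the tags back into spans:

```python
# Illustrative only: BIO tags for the borrowing "machine learning".
# B- marks the first token of a span, I- marks continuation tokens, O is outside.
tokens = ["Utilizaron", "técnicas", "de", "machine", "learning"]
tags   = ["O",          "O",        "O",  "B-ENG",   "I-ENG"]

def bio_to_spans(tokens, tags):
    """Collapse parallel token/BIO-tag lists into (phrase, label) spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new span begins
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # span continues
            current.append(tok)
        else:                               # outside any span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

print(bio_to_spans(tokens, tags))  # [('machine learning', 'ENG')]
```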
⚠ This is not the best-performing model for this task. For the best-performing model (F1 = 85.76), see the Flair model.
## Metrics (test set)

The following table summarizes the results obtained on the test set of the COALAS corpus.

| Label | Precision | Recall | F1    |
|:------|----------:|-------:|------:|
| ALL   | 88.09     | 79.46  | 83.55 |
| ENG   | 88.44     | 82.16  | 85.19 |
| OTHER | 37.5      | 6.52   | 11.11 |
## Dataset

This model was trained on COALAS, a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers various written media in European Spanish. The test set was designed to be as challenging as possible: it covers sources and dates not seen in the training set, it includes a high number of OOV words (92% of the borrowings in the test set are OOV), and it is heavily borrowing-dense (20 borrowings per 1,000 tokens).
| Set         | Tokens  | ENG   | OTHER | Unique |
|:------------|--------:|------:|------:|-------:|
| Training    | 231,126 | 1,493 | 28    | 380    |
| Development | 82,578  | 306   | 49    | 316    |
| Test        | 58,997  | 1,239 | 46    | 987    |
| **Total**   | 372,701 | 3,038 | 123   | 1,683  |
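As a rough sanity check (my own arithmetic, not a figure from the card), the borrowing density of the test set can be recomputed from the table above; counting each annotated borrowing once lands close to the ~20 per 1,000 tokens quoted earlier:

```python
# Approximate borrowing density of the test split: (ENG + OTHER) per 1,000 tokens.
eng, other, tokens = 1_239, 46, 58_997
density = (eng + other) / tokens * 1_000
print(round(density, 1))  # ~21.8, in the ballpark of the quoted 20 per 1,000
```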
## More information

More information about the dataset, model experimentation and error analysis can be found in the paper: *Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling*.
## How to use

```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned mBERT tokenizer and token-classification model
tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-mbert")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-mbert")

# Wrap the model in a NER pipeline to tag borrowings in raw text
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Buscamos data scientist para proyecto de machine learning."
borrowings = nlp(example)
print(borrowings)
```
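Note that the raw `ner` pipeline emits one entry per WordPiece with B-/I- tags, so multi-token borrowings come back in pieces. A hypothetical post-processing sketch (the `sample` entries below are illustrative, mimicking the pipeline's output format with scores omitted):

```python
# Illustrative pipeline-style output for "machine learning" (one dict per piece).
sample = [
    {"entity": "B-ENG", "word": "machine",  "index": 7},
    {"entity": "I-ENG", "word": "learning", "index": 8},
]

def merge_borrowings(entries):
    """Merge B-/I- WordPiece entries from the NER pipeline into full spans."""
    spans, words, label = [], [], None
    for e in entries:
        piece = e["word"]
        if piece.startswith("##"):          # WordPiece continuation of the previous token
            if words:
                words[-1] += piece[2:]
        elif e["entity"].startswith("B-"):  # a new borrowing span begins
            if words:
                spans.append((" ".join(words), label))
            words, label = [piece], e["entity"][2:]
        elif e["entity"].startswith("I-") and words:  # span continues on a new token
            words.append(piece)
    if words:
        spans.append((" ".join(words), label))
    return spans

print(merge_borrowings(sample))  # [('machine learning', 'ENG')]
```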
## Citation

If you use this model, please cite the following reference:
```bibtex
@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}
```