OpenLID: an open-source, high-performance language identification model covering 201 languages

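OpenLID

Developed by laurievb

OpenLID is a high-coverage, high-performance language identification model that supports 201 languages.

Text Classification #MultilingualIdentification #HighCoverage #fastText

Downloads: 1,854

Released: 10/24/2023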

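Model Overview

A text classification model built on the fastText framework and designed specifically for language identification.

Model Features

High coverage

Supports 201 languages.

High performance

Performs well on the FLORES-200 benchmark; the accompanying paper reports a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages.

Open data

The training data and performance metrics are public, to encourage further research.

Model Capabilities

Text classification

Language identification

Use Cases

Multilingual processing

Language detection

Identifies the language of a given text.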

🚀 OpenLID

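OpenLID is a high-coverage, high-performance language identification model. It is a fastText model that identifies 201 languages. The training data and the per-language performance figures are openly available to encourage further research.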

🚀 Quick Start

The following example downloads the model from the Hugging Face Hub and uses it to detect the language of a given text:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="laurievb/OpenLID", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")

(('__label__eng_Latn',), array([0.81148803]))

>>> model.predict("Hello, world!", k=5)

(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'), 
 array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
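
The labels returned by `model.predict` carry fastText's `__label__` prefix followed by a language code and a script code, as in `__label__eng_Latn` above. The sketch below is one possible way to turn that output into plain tuples. The helper names `parse_label` and `predict_language` are hypothetical and are not part of OpenLID or fastText. The helper also joins multi-line input into a single line first, since the fastText Python bindings raise an error when the text passed to `predict` contains a newline.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_label(label: str) -> Tuple[str, str]:
    """Split '__label__eng_Latn' into ('eng', 'Latn')."""
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    lang, _, script = label.partition("_")
    return lang, script


def predict_language(model, text: str, k: int = 1) -> List[Tuple[str, str, float]]:
    """Return up to k (language, script, probability) tuples for one text.

    `model` is any object exposing fastText's predict(text, k=...) interface.
    """
    # fastText predicts one line at a time, so collapse line breaks into spaces.
    single_line = " ".join(text.splitlines())
    labels, probs = model.predict(single_line, k=k)
    return [(*parse_label(label), float(prob)) for label, prob in zip(labels, probs)]
```

With `model` loaded as in the session above, `predict_language(model, "Hello,\nworld!", k=5)` would return a list of five such tuples, ordered from most to least probable.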

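✨ Key Features

High coverage: identifies 201 languages.
High performance: the accompanying paper reports a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages.
Open data: the training data and per-language performance figures are public, which makes further research easier.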

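📚 Documentation

Model Description

The model and its training data are described in detail in Burchell et al. (2023). The original fastText implementation is available on GitHub.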

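Limitations and Bias

Limited language coverage: the dataset and model cover only 201 languages, namely those we were able to test with the FLORES-200 evaluation benchmark.
Domain limitation: because the test set consists of sentences from a single domain (wiki articles), performance on it may not reflect how well the classifier works in other domains. Future work could create a LID test set that is representative of web data, where these classifiers are often applied.
Insufficient data auditing: most of the data was not audited by native speakers, as would be ideal. Future versions of the dataset should have more languages verified by native speakers, with a focus on the least-resourced languages.

Our work aims to broaden NLP coverage by allowing practitioners to identify relevant data in more languages. However, we note that language identification is inherently a normative activity that risks excluding minority dialects, scripts, or entire microlanguages from a macrolanguage. Choosing which languages to cover may reinforce power imbalances, as only some groups gain access to NLP technologies. In addition, errors in language identification can have a significant impact on downstream performance, particularly when a system is used as a "black box", as is often the case. The performance of our classifier is not equal across languages, which could lead to worse downstream performance for particular groups. We mitigate this by providing metrics by class.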
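
Because misidentified text can silently degrade downstream results, one common precaution is to reject low-confidence predictions instead of trusting every top label. The following is a minimal sketch of that idea and is not something the model card prescribes. The function name `filter_confident`, the 0.5 threshold, and the `und` fallback code are all illustrative assumptions, and a suitable threshold would likely need tuning per language against the published per-class metrics.

```python
from typing import Sequence, Tuple

UNDETERMINED = "und"  # ISO 639 code for "undetermined"; using it as a fallback is an assumption


def filter_confident(
    labels: Sequence[str],
    probs: Sequence[float],
    threshold: float = 0.5,  # illustrative value, not taken from the paper
) -> Tuple[str, float]:
    """Return the top label if its probability meets the threshold, else 'und'.

    Assumes `labels` and `probs` are ordered from most to least probable,
    which is how fastText's predict() returns them.
    """
    if not labels:
        return UNDETERMINED, 0.0
    top_label, top_prob = labels[0], float(probs[0])
    if top_prob >= threshold:
        return top_label, top_prob
    return UNDETERMINED, top_prob
```

The `labels` and `probs` arguments correspond to the two values returned by `model.predict`.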

Training Data

The model was trained on the OpenLID dataset, which is available from the GitHub repository.

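Training Procedure

The model was trained using fastText with the following hyperparameters. All other hyperparameters were left at their default values.

Loss: softmax
Epochs: 2
Learning rate: 0.8
Minimum number of word occurrences: 1000
Embedding dimension: 256
Character n-grams: 2-5
Word n-grams: 1
Bucket size: 1,000,000
Threads: 68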
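
As a rough illustration, the settings listed above could be passed to the fastText Python bindings as shown below. This is a sketch under assumptions, not the authors' training script: the mapping of each listed setting to a `train_supervised` keyword argument is an interpretation of the list, and the file names `train.txt` and `model.bin` are placeholders. fastText's supervised mode expects one example per line with the label prefixed by `__label__`, which the hypothetical helper `to_fasttext_line` produces.

```python
# Hyperparameters from the list above, expressed as fastText keyword arguments.
HYPERPARAMS = {
    "loss": "softmax",
    "epoch": 2,
    "lr": 0.8,
    "minCount": 1000,      # minimum number of word occurrences
    "dim": 256,            # embedding dimension
    "minn": 2,             # shortest character n-gram
    "maxn": 5,             # longest character n-gram
    "wordNgrams": 1,
    "bucket": 1_000_000,
    "thread": 68,
}


def to_fasttext_line(label: str, text: str) -> str:
    """Format one training example as '__label__<label> <text>' on a single line."""
    return f"__label__{label} {' '.join(text.split())}"


def train(train_path: str = "train.txt", model_path: str = "model.bin"):
    """Train and save a classifier; both paths are placeholders."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_supervised(input=train_path, **HYPERPARAMS)
    model.save_model(model_path)
    return model
```

The `thread` value reflects the machine used for the original run and would probably need adjusting to the cores available locally.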

Evaluation Datasets

The model was evaluated on the FLORES-200 benchmark provided by Costa-jussà et al. (2022). Further information is available in the paper.
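
The paper's headline figures are a macro-average F1 score and a false positive rate computed over language classes. As a reminder of how such per-class metrics can be derived from gold and predicted labels, here is a small self-contained sketch. It is not the authors' evaluation code. In particular, averaging the per-class false positive rates with equal weight is an assumption, and the function name `per_class_metrics` is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_metrics(
    gold: Sequence[str], pred: Sequence[str]
) -> Tuple[Dict[str, Dict[str, float]], float, float]:
    """Return per-class F1 and FPR plus their unweighted (macro) averages."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but the true class was g
            fn[g] += 1  # missed an example of g
    total = len(gold)
    gold_counts = Counter(gold)
    metrics: Dict[str, Dict[str, float]] = {}
    for c in sorted(set(gold) | set(pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1 = 2 * tp[c] / denom if denom else 0.0
        negatives = total - gold_counts[c]  # examples whose true class is not c
        fpr = fp[c] / negatives if negatives else 0.0
        metrics[c] = {"f1": f1, "fpr": fpr}
    n = len(metrics) or 1  # avoid dividing by zero on empty input
    macro_f1 = sum(m["f1"] for m in metrics.values()) / n
    macro_fpr = sum(m["fpr"] for m in metrics.values()) / n
    return metrics, macro_f1, macro_fpr
```

Reporting the per-class dictionary alongside the averages makes it visible when a good macro score hides weak performance on individual languages.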

📄 License

This model is released under the GPL-3.0 license.

BibTeX Entry and Citation Info

ACL citation (preferred)

@inproceedings{burchell-etal-2023-open,
    title = "An Open Dataset and Model for Language Identification",
    author = "Burchell, Laurie  and
      Birch, Alexandra  and
      Bogoychev, Nikolay  and
      Heafield, Kenneth",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-short.75",
    doi = "10.18653/v1/2023.acl-short.75",
    pages = "865--879",
    abstract = "Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033{\%} across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model{'}s performance, both in comparison to existing open models and by language class.",
}

ArXiv citation

@article{burchell2023open,
  title={An Open Dataset and Model for Language Identification},
  author={Burchell, Laurie and Birch, Alexandra and Bogoychev, Nikolay and Heafield, Kenneth},
  journal={arXiv preprint arXiv:2305.13820},
  year={2023}
}