fasttext-en-vectors open-source model: efficient word vector learning and text classification, with fast training on commodity hardware

Fasttext En Vectors

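Developed by Facebook

fastText is an open-source library for efficiently learning word vector representations and text classifiers, with fast training on commodity hardware.

Text embedding, English. Tags: #multilingual word vectors #lightweight text classification #fast semantic retrieval

Downloads: 956

Released: 3/16/2023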

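Model overview

This model provides pretrained English word vectors that can be used for natural language processing tasks such as text classification and language identification. It supports fast word vector lookup and word similarity computation.

Model features

Efficient training

Trains on more than a billion words in a few minutes on a standard multicore CPU

Lightweight

Models can be compressed to a size that fits on mobile devices

Multilingual support

Pretrained word vectors are available for 157 languages

Subword information

Uses character n-grams to capture morphological variation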
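To make the subword idea concrete, here is a minimal sketch of character n-gram extraction. It assumes the convention described in the fastText subword paper [1], where a word is wrapped in the boundary markers `<` and `>` before n-grams are taken. The function name `char_ngrams` is hypothetical and is not part of the fastText API; the library additionally hashes n-grams into buckets, which this sketch omits.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of `word` after padding it with '<' and '>'."""
    padded = f"<{word}>"
    # Slide a window of width n over the padded word; words that are too
    # short for a full window simply yield no n-grams.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("where"))
print(char_ngrams("baking"))
```

Because inflected forms such as "bake", "baking", and "baked" share several n-grams, their vectors share parameters, which is how morphological variation is captured.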

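Model capabilities

Word vector generation

Word similarity computation

Text classification

Language identification

Nearest-neighbor (similar word) lookup

Use cases

Natural language processing

Word similarity analysis

Compute the cosine similarity between two words

Yields a similarity score between -1 and 1

Nearest-neighbor lookup

Find the words most similar to a given word

Returns the top N words ranked by similarity

Language detection

Identify the language of the input text

Returns the language label with the highest probability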

🚀 fastText

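fastText is an open-source, free, lightweight library for learning text representations and text classifiers. It runs on standard, generic hardware, and its models can later be reduced in size to fit on mobile devices. It was first introduced in the paper Enriching Word Vectors with Subword Information [1]; further information is available on the official fastText website.

✨ Key features

Easy to use: designed for developers, domain experts, and students.
Efficient training: trains models on more than a billion words in a few minutes on a multicore CPU.
Multilingual support: includes pretrained models learned on Wikipedia, covering 157 languages.
Flexible usage: works as a command-line tool, can be linked into a C++ application, or can be used as a library for use cases ranging from experimentation and prototyping to production.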

💻 Usage examples

Basic usage

The following example shows how to load and use the pretrained vectors:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.words

['the', 'of', 'and', 'to', 'in', 'a', 'that', 'is', ...]

>>> len(model.words)

145940

>>> model['bread']

array([ 4.89417791e-01,  1.60882145e-01, -2.25947708e-01, -2.94273376e-01,
       -1.04577184e-01,  1.17962055e-01,  1.34821936e-01, -2.41778508e-01, ...])

Advanced usage

Querying the nearest neighbors of an English word vector

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-en-nearest-neighbors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.get_nearest_neighbors("bread", k=5)

[(0.5641006231307983, 'butter'), 
 (0.48875734210014343, 'loaf'), 
 (0.4491206705570221, 'eat'), 
 (0.42444291710853577, 'food'), 
 (0.4229326844215393, 'cheese')]
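For intuition about what a nearest-neighbor query computes, the sketch below ranks every other word by cosine similarity to the query word. It is a brute-force illustration over a hand-made toy vocabulary, not the library's implementation; `nearest_neighbors`, `cosine`, and `toy_vectors` are hypothetical names, and the three-dimensional vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=5):
    """Return the k (score, word) pairs most similar to `query`, best first."""
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded
    ]
    return sorted(scored, reverse=True)[:k]


# Invented toy vectors, for illustration only.
toy_vectors = {
    "bread": [1.0, 0.9, 0.0],
    "butter": [0.9, 1.0, 0.1],
    "loaf": [1.0, 0.5, 0.0],
    "car": [0.0, 0.1, 1.0],
}

neighbors = nearest_neighbors("bread", toy_vectors, k=2)
print([word for _, word in neighbors])
```

The result has the same shape as the `get_nearest_neighbors` output above: a list of `(score, word)` tuples sorted from most to least similar.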

Detecting the language of a given text

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")

(('__label__eng_Latn',), array([0.81148803]))

>>> model.predict("Hello, world!", k=5)

(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'), 
 array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
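The raw `predict` output pairs prefixed labels with probabilities, so a small post-processing step is often convenient. The sketch below assumes the label layout seen in the output above: a `__label__` prefix followed by a language code and a script name joined by an underscore. `parse_prediction` is a hypothetical helper, not a fastText function, and `sample` is a hand-copied stand-in for a real model result.

```python
def parse_prediction(prediction, prefix="__label__"):
    """Turn a (labels, probabilities) pair into (language, script, probability) tuples."""
    labels, probs = prediction
    parsed = []
    for label, prob in zip(labels, probs):
        # Drop the "__label__" prefix if present.
        name = label[len(prefix):] if label.startswith(prefix) else label
        # Split "eng_Latn" into the language code and the script name.
        language, _, script = name.partition("_")
        parsed.append((language, script, float(prob)))
    return parsed


# Stand-in for model.predict("Hello, world!", k=2), using values from the output above.
sample = (("__label__eng_Latn", "__label__vie_Latn"), [0.61224753, 0.21323682])
print(parse_prediction(sample))
```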

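📚 Detailed documentation

Intended uses and limitations

You can use the pretrained word vectors for text classification or language identification. See the tutorials and resources on the official website to find the task that interests you.

Limitations and bias

Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.

Cosine similarity can be used to measure the similarity between two word vectors. It is 1 when the two vectors point in the same direction, 0 when they are orthogonal (completely unrelated), and -1 when they point in opposite directions.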

>>> import numpy as np

>>> def cosine_similarity(word1, word2):
...     return np.dot(model[word1], model[word2]) / (np.linalg.norm(model[word1]) * np.linalg.norm(model[word2]))

>>> cosine_similarity("man", "boy")

0.061653383

>>> cosine_similarity("man", "ceo")

0.11989131

>>> cosine_similarity("woman", "ceo")

-0.08834904
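The three reference values of cosine similarity can be checked without loading a model. The following sketch uses plain Python lists in place of word vectors; `cosine_similarity_vectors` is a hypothetical name, chosen to keep it distinct from the word-based function above, and the input vectors are arbitrary.

```python
import math


def cosine_similarity_vectors(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


v = [1.0, 2.0, 3.0]
same = cosine_similarity_vectors(v, v)                          # same direction
orthogonal = cosine_similarity_vectors([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_similarity_vectors(v, [-x for x in v])        # opposite direction
print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0, and -1. Scores between word vectors, such as the ones above, fall anywhere in that range, so a negative value like the one for "woman" and "ceo" is a valid result.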

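Training data

Pretrained word vectors for 157 languages were trained with fastText on Common Crawl and Wikipedia. The models use CBOW with position weights, dimension 300, character n-grams of length 5, a window of size 5, and 10 negative samples. Three new word analogy datasets, for French, Hindi, and Polish, were released alongside them.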
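As a rough guide to how these settings map onto the fastText Python API, the sketch below collects them as keyword arguments for `fasttext.train_unsupervised`. Two caveats apply: position weights have no corresponding option here, so this only approximates the published setup, and `corpus_path` is a hypothetical path to a tokenized plain-text corpus that you would supply yourself.

```python
# Settings from the description above, expressed as train_unsupervised keyword arguments.
TRAINING_KWARGS = {
    "model": "cbow",  # CBOW architecture (without the position weights)
    "dim": 300,       # vector dimension
    "minn": 5,        # minimum character n-gram length
    "maxn": 5,        # maximum character n-gram length
    "ws": 5,          # context window size
    "neg": 10,        # number of negative samples
}


def train(corpus_path):
    """Train word vectors on the text file at `corpus_path` (hypothetical input)."""
    # Imported lazily so the settings can be inspected without fastText installed.
    import fasttext

    return fasttext.train_unsupervised(input=corpus_path, **TRAINING_KWARGS)
```

Setting `minn` and `maxn` to the same value restricts the subword features to n-grams of exactly that length.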

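Training procedure

Tokenization

Chinese uses the Stanford word segmenter.
Japanese uses Mecab.
Vietnamese uses UETsegmenter.
Languages written in the Latin, Cyrillic, Hebrew, or Greek scripts use the tokenizer from the Europarl preprocessing tools.
All remaining languages use the ICU tokenizer.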
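The tokenizer rules above amount to a small decision procedure, sketched below. The language codes (`zh`, `ja`, `vi`), the script names, and the function `choose_tokenizer` are illustrative assumptions; the sketch only returns the name of the tool and does not run any tokenizer.

```python
# Languages with a dedicated segmenter (language codes are assumed for illustration).
DEDICATED_TOKENIZERS = {
    "zh": "Stanford word segmenter",
    "ja": "Mecab",
    "vi": "UETsegmenter",
}

# Scripts handled by the tokenizer from the Europarl preprocessing tools.
EUROPARL_SCRIPTS = {"Latin", "Cyrillic", "Hebrew", "Greek"}


def choose_tokenizer(language, script):
    """Return the name of the tokenizer that the rules above select."""
    # Dedicated segmenters take priority over the script-based rule.
    if language in DEDICATED_TOKENIZERS:
        return DEDICATED_TOKENIZERS[language]
    if script in EUROPARL_SCRIPTS:
        return "Europarl tokenizer"
    return "ICU tokenizer"


print(choose_tokenizer("vi", "Latin"))
print(choose_tokenizer("fr", "Latin"))
print(choose_tokenizer("hi", "Devanagari"))
```

The language check comes first because Vietnamese is written in the Latin script but has its own segmenter.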

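More information about how these models were trained can be found in the article Learning Word Vectors for 157 Languages [4].

Evaluation datasets

The analogy evaluation datasets described in the paper cover French, Hindi, and Polish.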

BibTeX entries and citation info

Please cite [1] if you use this code for learning word representations, or [2] if you use it for text classification.

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

If you use these word vectors, please cite the following paper:

[4] E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

(* These authors contributed equally.)