fasttext-zh-vectors open-source text processing library - free support for Chinese word-vector training and text classification


Fasttext Zh Vectors

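Developed by Facebook

fastText is an open-source, free, lightweight library for learning text representations and text classifiers. It supports training Chinese word vectors and text classification tasks.

Text embeddings  Chinese  #Multilingual word vectors #Lightweight text classification #Efficient feature extraction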

Downloads: 355

Released: 3/19/2023

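Model overview

The fastText library focuses on text classification and word-vector learning. It processes large-scale text data quickly on ordinary hardware and provides pre-trained Chinese word vectors.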

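Model features

Efficient training

Trains quickly on standard hardware, on corpora on the order of a billion words

Subword information

Uses character n-grams to capture morphological variation and the features of rare words

Multiple usage scenarios

Provides a command-line tool, a C++ library, and programming interfaces, covering the full path from experimentation to production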

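Model capabilities

Word vector generation

Text classification

Semantic similarity computation

Language identification

Near-synonym discovery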

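Use cases

Natural language processing

Semantic search

Uses word vectors to compute the semantic relevance between query terms and documents

Improves the semantic matching accuracy of search results

Text classification

Automatically classifies content such as news and reviews

Enables quick construction of multi-class text classification systems

Language analysis

Language detection

Identifies the language of the input text

Supports identification of 157 languages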

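🚀 fastText (Chinese)

fastText is an open-source, free, lightweight library that lets users learn text representations and text classifiers. It runs on standard, generic hardware, and models can later be reduced in size, even enough to fit on mobile devices. The library was introduced in the paper Enriching Word Vectors with Subword Information [1]; further documentation is available on the official fastText website.

✨ Main features

Efficient learning: learns word representations and sentence classifiers efficiently.
Easy to use: designed to be simple for developers, domain experts, and students.
Multilingual support: includes pre-trained models learned on Wikipedia, covering over 157 different languages.
Multiple modes of use: can be used as a command-line tool, linked into a C++ application, or used as a library, for use cases ranging from experimentation and prototyping to production.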


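💻 Usage examples

Basic usage

The following shows how to load and use the pre-trained vectors. The sample outputs contain English tokens and appear to have been carried over from the English model card, so expect Chinese tokens and a different vocabulary size from the Chinese model: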

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-zh-vectors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.words

['the', 'of', 'and', 'to', 'in', 'a', 'that', 'is', ...]

>>> len(model.words)

145940

>>> model['bread']

array([ 4.89417791e-01,  1.60882145e-01, -2.25947708e-01, -2.94273376e-01,
       -1.04577184e-01,  1.17962055e-01,  1.34821936e-01, -2.41778508e-01, ...])

Advanced usage

Querying the nearest neighbors of an English word vector

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-en-nearest-neighbors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.get_nearest_neighbors("bread", k=5)

[(0.5641006231307983, 'butter'), 
 (0.48875734210014343, 'loaf'), 
 (0.4491206705570221, 'eat'), 
 (0.42444291710853577, 'food'), 
 (0.4229326844215393, 'cheese')]
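
To make the ranking above more concrete, here is a minimal sketch of a brute-force nearest-neighbor search. It assumes neighbors are ranked by cosine similarity and returned as (score, word) pairs, matching the shape of the output above. The function name `nearest_neighbors` and the 3-dimensional `toy_vectors` are hypothetical and chosen by hand for illustration; they are not part of fastText.

```python
import numpy as np


def nearest_neighbors(query, vectors, k=5):
    """Return the k words most similar to `query` as (score, word) pairs."""
    q = vectors[query]
    q = q / np.linalg.norm(q)  # normalize the query vector once
    scored = []
    for word, vec in vectors.items():
        if word == query:
            continue  # the query word itself is not a neighbor
        score = float(np.dot(q, vec / np.linalg.norm(vec)))
        scored.append((score, word))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, for illustration only (real vectors have 300 dimensions).
toy_vectors = {
    "bread": np.array([1.0, 0.9, 0.0]),
    "butter": np.array([0.9, 1.0, 0.1]),
    "loaf": np.array([1.0, 0.5, 0.0]),
    "car": np.array([0.0, 0.1, 1.0]),
}

neighbors = nearest_neighbors("bread", toy_vectors, k=2)
print(neighbors)
```

With real vectors, the dictionary would be replaced by lookups into the loaded model; a linear scan like this is fine for small vocabularies but slow for millions of words.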

Detecting the language of a given text

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")

(('__label__eng_Latn',), array([0.81148803]))

>>> model.predict("Hello, world!", k=5)

(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'), 
 array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
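
The raw labels above are easier to work with once the prefix is stripped. The following is a minimal sketch, assuming every label has the form `__label__<language>_<script>` as in the output shown. The helper name `parse_prediction` is hypothetical, and the sample values are copied from the first two entries of the output above.

```python
def parse_prediction(labels, probs, prefix="__label__"):
    """Turn fastText-style (labels, probabilities) into a list of dicts."""
    results = []
    for label, prob in zip(labels, probs):
        # Strip the label prefix if present.
        code = label[len(prefix):] if label.startswith(prefix) else label
        # Split "eng_Latn" into language code and script code.
        language, _, script = code.partition("_")
        results.append({"language": language, "script": script, "prob": float(prob)})
    return results


# Sample values taken from the k=5 output shown above (first two entries).
labels = ("__label__eng_Latn", "__label__vie_Latn")
probs = [0.61224753, 0.21323682]

parsed = parse_prediction(labels, probs)
print(parsed)
```

In practice the result of `model.predict(text, k=5)` could be unpacked into this helper, for example `parse_prediction(*model.predict(text, k=5))`.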

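📚 Detailed documentation

Intended uses and limitations

You can use the pre-trained word vectors for text classification or language identification. See the tutorials and resources on the official website for the tasks that interest you.

Limitations and bias

Even though the training data used for this model could be characterized as fairly neutral, the model can still make biased predictions.

Cosine similarity can be used to measure the similarity between two word vectors. It is 1 when the two vectors point in the same direction, 0 when they are orthogonal (completely unrelated), and -1 when they point in opposite directions.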

>>> import numpy as np

>>> def cosine_similarity(word1, word2):
...     return np.dot(model[word1], model[word2]) / (np.linalg.norm(model[word1]) * np.linalg.norm(model[word2]))

>>> cosine_similarity("man", "boy")

0.061653383

>>> cosine_similarity("man", "ceo")

0.11989131

>>> cosine_similarity("woman", "ceo")

-0.08834904
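
The three boundary cases described above (1, 0, and -1) can be checked without loading a model. This is a minimal sketch on hand-picked 2-dimensional vectors; the function name `cosine` and the vectors are illustrative assumptions, not values from the model.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of numbers."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine([1, 2], [1, 2])    # approximately 1
orthogonal = cosine([1, 0], [0, 1])        # approximately 0
opposite = cosine([1, 2], [-1, -2])        # approximately -1

print(same_direction, orthogonal, opposite)
```

Because the norms are floating-point values, the results may differ from the exact integers by a tiny rounding error, so comparisons should use a tolerance.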

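Training data

Pre-trained word vectors for 157 languages were trained with fastText on Common Crawl and Wikipedia. The models were trained using CBOW with position weights, with dimension 300, character n-grams of length 5, a window of size 5, and 10 negative samples. We also release three new word analogy datasets, for French, Hindi, and Polish.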
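
As a rough illustration of these settings, the sketch below maps the reported hyperparameters onto keyword arguments of `fasttext.train_unsupervised` and shows what a character n-gram of length 5 looks like. The mapping is an assumption of this sketch and does not attempt to reproduce position weights; `corpus_path` and the helper names `char_ngrams` and `train` are hypothetical. The n-gram helper follows the convention of wrapping a word in `<` and `>` boundary markers.

```python
# Hyperparameters reported above, expressed as train_unsupervised keyword
# arguments. Position weights are NOT covered by this sketch.
TRAIN_KWARGS = {
    "model": "cbow",  # CBOW architecture
    "dim": 300,       # vector dimension
    "minn": 5,        # minimum character n-gram length
    "maxn": 5,        # maximum character n-gram length
    "ws": 5,          # context window size
    "neg": 10,        # number of negative samples
}


def char_ngrams(word, n=5):
    """Character n-grams of length n over the word wrapped in < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def train(corpus_path):
    """Train vectors on a pre-tokenized text file (hypothetical path)."""
    import fasttext  # imported lazily so the rest of the sketch runs without it
    return fasttext.train_unsupervised(corpus_path, **TRAIN_KWARGS)


print(char_ngrams("where"))
print(char_ngrams("中国"))
```

A two-character Chinese word with its boundary markers is only four characters long, so under this scheme it yields no n-grams of length 5; short words are then represented by their whole-word vector alone.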

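Training procedure

Tokenization

Chinese: the Stanford word segmenter.
Japanese: Mecab.
Vietnamese: UETsegmenter.
Languages written in the Latin, Cyrillic, Hebrew, or Greek scripts: the tokenizer from the Europarl preprocessing tools.
All remaining languages: the ICU tokenizer.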
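
The selection rules above can be summarized as a small lookup. This is only a sketch of the decision logic: the language codes, script names, and the function `choose_tokenizer` are assumptions made for illustration, and the function returns tokenizer names rather than invoking the tools.

```python
# Language-specific tokenizers take priority (language codes are assumed here).
LANGUAGE_TOKENIZERS = {
    "zh": "Stanford word segmenter",
    "ja": "Mecab",
    "vi": "UETsegmenter",
}

# Scripts handled by the Europarl preprocessing tokenizer.
EUROPARL_SCRIPTS = {"Latin", "Cyrillic", "Hebrew", "Greek"}


def choose_tokenizer(lang, script):
    """Return the name of the tokenizer the rules above would select."""
    if lang in LANGUAGE_TOKENIZERS:
        return LANGUAGE_TOKENIZERS[lang]
    if script in EUROPARL_SCRIPTS:
        return "Europarl tokenizer"
    return "ICU tokenizer"  # fallback for every other language


print(choose_tokenizer("zh", "Han"))
print(choose_tokenizer("fr", "Latin"))
print(choose_tokenizer("hi", "Devanagari"))
```

Checking the language before the script matters for Vietnamese, which is written in the Latin script but has its own segmenter in the list above. For Chinese, it also means that text should be segmented into words before looking up vectors, since the vocabulary was built from segmented text.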

More information about how these models were trained can be found in the article Learning Word Vectors for 157 Languages.

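Evaluation datasets

The analogy evaluation datasets described in the paper are available for French, Hindi, and Polish.

Citation

If you use this code to learn word representations, please cite [1]; if you use it for text classification, please cite [2].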

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

If you use these word vectors, please cite the following paper:

[4] E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

(* These authors contributed equally.)