🚀 📚 llm-data-textbook-quality-fasttext-classifier-v2
This project is an educational value classifier that classifies whether a piece of text from the web has high educational value. It can be used as a filter for pretraining data curation when training large language models (LLMs), and it offers finer-grained educational value classification.
🚀 Quick Start
Updates
7 Jul 2024: Quantized model "model_quantized.bin" released.
model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model_quantized.bin"))
✨ Key Features
Educational Value Classification
"Garbage in, garbage out. A language model is only as good as its training data, irrespective of its parameter count."
This educational value classifier classifies whether a piece of text from the web has high educational value (more explicitly defined than textbook quality). It is deeply inspired by the paper Textbooks Are All You Need, in which a classifier was developed to predict the educational value of data and was then used for data filtering.
The model is trained on web/raw text, not (yet) on data formatted as instruction datasets. It can be used as a filter for pretraining data curation when training LLMs 🤖. The model has 3 labels instead of 2, offering finer granularity of educational value:
- High (top 25% educational value)
- Mid (middle 25-75% educational value)
- Low (bottom 25% educational value)
A detailed report/paper will be published once more downstream experiment results of this classifier are available. For validation of the classifier, see Analysis. The classifier has been applied to various pretraining datasets; see Benchmark.
High Performance
⚡ Built on fasttext, the model can classify more than 2,000 examples per second on CPU, so it can be used on the fly during pretraining.
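For a rough check of this throughput claim, here is a minimal timing sketch; the batch contents are placeholders, and it relies on the predict_educational_value helper shown under Advanced Usage below, so actual numbers depend on your hardware:

import time

# illustrative batch; predict_educational_value is the helper defined in the Advanced Usage section below
sample_texts = ["Logic is the study of correct reasoning."] * 2000
start = time.time()
predict_educational_value(sample_texts)
print(f"{len(sample_texts) / (time.time() - start):.0f} examples per second")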
Note that textbook quality is a subset of high quality.
Feedback
💬 Feedback is welcome! If you find this model helpful, please like it and leave a comment. I will keep working on making data curation for LLMs better and easier.
💻 Usage Examples
Basic Usage
Educational value ranges over [0, 2]; the detailed formula is described below.
predict_educational_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
# Output [1.9266871362924576]
predict_educational_value(['''"Attention Is All You Need" is a landmark[1][2] 2017 research paper authored by eight scientists working at Google, responsible for expanding 2014 attention mechanisms proposed by Bahdanau et al. into a new deep learning architecture known as the transformer. The paper is considered by some to be a founding document for modern artificial intelligence, as transformers became the main architecture of large language models.[3][4] At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but even in their paper the authors saw the potential for other tasks like question answering and for what is now called multimodal Generative AI.[5]'''])
# Output [1.8226698189973831]
predict_educational_value(['''A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2]'''])
# Output [1.7609568238258362]
predict_educational_value(['''In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.[1]'''])
# Output [1.589950144290924]
predict_educational_value(['''The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[5] The structure of the input data is captured in the Wq and Wk weights, and the Wv weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Wq), Key (Wk), and Value (Wv)—a loose and possibly misleading analogy with relational database systems.'''])
# Output [1.4657384157180786]
predict_educational_value(['''The Arsenal Football Club (commonly known as simply Arsenal) is an English professional football club based in Holloway, North London. Arsenal compete in the Premier League, the top flight of English football. In domestic football, Arsenal has won 13 league titles (including one unbeaten title), a record 14 FA Cups, two League Cups, 17 FA Community Shields, and a Football League Centenary Trophy. In European football, they have one European Cup Winners' Cup and one Inter-Cities Fairs Cup. In terms of trophies won, it is the third-most successful club in English football.[2]'''])
# Output [1.1015518307685852]
predict_educational_value(['''The 2003–04 season was Arsenal Football Club's 12th season in the Premier League and their 78th consecutive season in the top flight of English football.[3][4] It began on 1 July 2003 and concluded on 30 June 2004, with competitive matches played between August and May. The club ended the Premier League campaign as champions without a single defeat – a record of 26 wins and 12 draws. Arsenal fared less well in the cups, eliminated in the FA Cup and League Cup semi-finals to Manchester United and Middlesbrough respectively, and at the quarter-final stage of the UEFA Champions League to Chelsea.'''])
# Output [1.0146622359752655]
predict_educational_value(['''As both teams' first-choice kits featured a shade of red, Arsenal wore their yellow away strip, while Barcelona wore their traditional blue and maroon striped kit. Arsenal won the coin toss and Barcelona kicked off.[21] Barcelona almost immediately came under pressure when Thierry Henry shot straight at Barcelona goalkeeper Víctor Valdés, who conceded a corner. From the resulting corner Arsenal had another chance again courtesy of Henry, whose shot was again saved by Valdés. The next attack in the seventh minute resulted in Arsenal goalkeeper Jens Lehmann saving from Ludovic Giuly after he shot from a narrow angle. Four minutes later Barcelona were awarded a free-kick 35 yards from goal; Ronaldinho shot wide of the goal.'''])
# Output [0.7897453680634499]
From the examples above, the model clearly favours scientific knowledge. It is also interested in Arsenal Football Club; however, it considers a summary of a particular match to have little educational value.
Advanced Usage
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"))

def replace_newlines(text: str) -> str:
    # fasttext operates on single-line inputs, so collapse newlines into spaces
    return re.sub("\n+", " ", text)

score_dict = {
    '__label__': 0,
    '__label__Low': 0,
    '__label__Mid': 1,
    '__label__High': 2
}

def predict_educational_value(text_list: List[str]) -> List[float]:
    # Expected educational value in [0, 2]: sum over labels of label score * predicted probability
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list, k=-1)
    score_list = []
    for l, s in zip(*pred):
        score = 0
        for _l, _s in zip(l, s):
            score += score_dict[_l] * _s
        score_list.append(float(score))
    return score_list
predict_educational_value(["Hi"])
# Output: [3.0000010156072676e-05]
📚 Documentation
📊 Benchmark
To ensure the effectiveness of this classifier, it was applied to various datasets.
Educational value = 2 points * P(High) + 1 point * P(Mid) + 0 points * P(Low)
The score can be roughly interpreted as follows:
Educational value | Category |
---|---|
2 | High |
1 | Mid |
0 | Low |
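As a worked example with hypothetical probabilities, a document with P(High) = 0.6, P(Mid) = 0.3 and P(Low) = 0.1 would receive an educational value of 2 × 0.6 + 1 × 0.3 + 0 × 0.1 = 1.5, i.e. between Mid and High.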
* I encountered an issue, so the original allenai/dolma could not be processed.
The classifier behaves as expected:
- In general, synthetic data has higher educational value because it is designed to have high educational value by construction.
- For real data, HuggingFaceFW/fineweb and Dolma v1_7, which applied the quality filter described here, have the highest educational value among all real data.
- In general, the later a dataset was released, the higher its educational value, as the research community pays increasing attention to data quality.
- The textbook category (mostly synthetic) scores the highest because it is created for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest because of its high density of knowledge.
- Wikipedia scores relatively lower because it also contains information with little educational value (e.g. match results, awards won by movie stars).
- Web data scores low (when no filtering is applied) because it contains information from all domains.
- Memes score the lowest, as expected. Hateful memes score almost zero.
Some instruction datasets were added out of curiosity, even though the model was not trained on instruction data. There are two possible interpretations:
- They score lower than textbooks because the knowledge in conversations is usually not as dense as in textbooks, but they are generally more educational than unfiltered web data.
- The model is not good enough at judging educational value in instruction datasets.
📈 Analysis
🤖 Model Training With and Without the Classifier
The expectation is that a model trained on filtered data will outperform a model trained without the filter.
FineWeb is filtered on the fly with educational value >= 1.0.
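A minimal sketch of such on-the-fly filtering, assuming the predict_educational_value helper from the Advanced Usage section and a streaming load of HuggingFaceFW/fineweb (the exact training pipeline may differ):

from datasets import load_dataset

# stream FineWeb and keep only documents scoring at or above the threshold
dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
threshold = 1.0
for example in dataset:
    score = predict_educational_value([example["text"]])[0]
    if score >= threshold:
        ...  # pass the document on to the pretraining data pipeline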
Test 1: Model parameters: 192M; training tokens: 3.1B; 6,000 global steps.
Task | Trained on filtered FineWeb | Trained on unfiltered FineWeb | Trained on Cosmopedia |
---|---|---|---|
arc-easy | 37.37 | 34.97 | 37.45 |
arc-challenge | 23.55 | 22.95 | 23.21 |
Hellaswag | 28.02 | 27.92 | 27.78 |
MMLU | 24.71 | 23.94 | 24.65 |
TruthfulQA | 45.88 | 45.20 | 45.97 |
Winogrande | 49.49 | 50.59 | 50.67 |
With the filter, reasoning and commonsense reasoning appear better, as expected. The results are also close to those of the model trained on Cosmopedia. The MMLU result is also better; however, due to compute constraints (training time and model size), the results are close to random. A larger model will be trained to verify this conclusion further.
(To be updated soon with a larger model)
🌐 Domain Name Analysis
The expectation is that most educational value comes from the websites of universities/schools, research institutes, and organisations.
Since HuggingFaceFW/fineweb contains the URLs of the crawled websites, the average educational value per domain name was computed.
The first 10 million records were analysed. The full file is available here.
Below are the top 100 domains with at least 100 records:
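A minimal sketch of this per-domain aggregation, assuming the predict_educational_value helper above and the "url"/"text" columns of HuggingFaceFW/fineweb (the actual analysis script may differ):

from collections import defaultdict
from urllib.parse import urlparse

from datasets import load_dataset

dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
domain_scores = defaultdict(list)
for i, example in enumerate(dataset):
    if i >= 10_000_000:  # first 10M records
        break
    domain = urlparse(example["url"]).netloc
    domain_scores[domain].append(predict_educational_value([example["text"]])[0])

# average educational value per domain, keeping domains with at least 100 records
domain_avg = {d: sum(s) / len(s) for d, s in domain_scores.items() if len(s) >= 100}
top_domains = sorted(domain_avg.items(), key=lambda kv: kv[1], reverse=True)[:100]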
🧪 Classifier Rank Ordering
The Spearman rank correlation coefficient between the educational value and the test data is 0.7055, indicating a strong monotonic relationship. The educational value can therefore be used for ranking.
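For reference, a Spearman coefficient of this kind can be computed as in the following sketch (the toy labels and scores below are illustrative placeholders, not the actual test data):

from scipy.stats import spearmanr

# hypothetical reference labels (0 = Low, 1 = Mid, 2 = High) and predicted educational values
reference_labels = [0, 0, 1, 1, 2, 2]
predicted_values = [0.1, 0.4, 0.9, 1.2, 1.6, 1.9]
correlation, p_value = spearmanr(reference_labels, predicted_values)
print(correlation)  # close to 1 for this well-ordered toy example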
📄 License
This project is released under the MIT License.