🚀 📚 llm-data-textbook-quality-fasttext-classifier-v2
This project is an educational value classifier that classifies whether a piece of text from the web has high educational value. It can be used as a filter for pretraining data curation when training large language models (LLMs), and it offers finer-grained educational value classification.
🚀 Quick Start
Updates
7 Jul 2024: Quantized model "model_quantized.bin" released.
model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model_quantized.bin"))
✨ Key Features
Educational Value Classification
"Garbage in, garbage out. A language model is only as good as its training data, irrespective of its parameter count."
This educational value classifier classifies whether a piece of text from the web has high educational value (more explicitly defined than textbook quality). It is deeply inspired by the paper Textbooks Are All You Need, in which a classifier was developed to predict the educational value of data and was then used for data filtering.
The model is trained on web/raw text, not (yet) on data formatted as instruction datasets. It can be used as a filter for pretraining data curation when training LLMs 🤖. The model has 3 labels instead of 2, offering finer granularity of educational value:
- High (top 25% educational value)
- Mid (middle 25-75% educational value)
- Low (bottom 25% educational value)
A detailed report/paper will be published once more downstream experiment results of this classifier are available. For validation of the classifier, see Analysis. The classifier has been applied to various pretraining datasets; see Benchmark.
High Performance
⚡ Built on fasttext, the model can classify more than 2,000 examples per second on CPU, so it can be used on the fly during pretraining.
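For a rough check of this throughput claim, here is a minimal timing sketch; the batch contents are placeholders, and it relies on the predict_educational_value helper shown under Advanced Usage below, so actual numbers depend on your hardware:

import time

# illustrative batch; predict_educational_value is the helper defined in the Advanced Usage section below
sample_texts = ["Logic is the study of correct reasoning."] * 2000
start = time.time()
predict_educational_value(sample_texts)
print(f"{len(sample_texts) / (time.time() - start):.0f} examples per second")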
Note that textbook quality is a subset of high quality.
Feedback
💬 Feedback is welcome! If you find this model helpful, please like it and leave a comment. I will keep working on making data curation for LLMs better and easier.
💻 Usage Examples
Basic Usage
Educational value ranges over [0, 2]; the detailed formula is described below.
predict_educational_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
# Output [1.9266871362924576]
predict_educational_value(['''"Attention Is All You Need" is a landmark[1][2] 2017 research paper authored by eight scientists working at Google, responsible for expanding 2014 attention mechanisms proposed by Bahdanau et al. into a new deep learning architecture known as the transformer. The paper is considered by some to be a founding document for modern artificial intelligence, as transformers became the main architecture of large language models.[3][4] At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but even in their paper the authors saw the potential for other tasks like question answering and for what is now called multimodal Generative AI.[5]'''])
# Output [1.8226698189973831]
predict_educational_value(['''A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2]'''])
# Output [1.7609568238258362]
predict_educational_value(['''In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.[1]'''])
# Output [1.589950144290924]
predict_educational_value(['''The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[5] The structure of the input data is captured in the Wq and Wk weights, and the Wv weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Wq), Key (Wk), and Value (Wv)—a loose and possibly misleading analogy with relational database systems.'''])
# Output [1.4657384157180786]
predict_educational_value(['''The Arsenal Football Club (commonly known as simply Arsenal) is an English professional football club based in Holloway, North London. Arsenal compete in the Premier League, the top flight of English football. In domestic football, Arsenal has won 13 league titles (including one unbeaten title), a record 14 FA Cups, two League Cups, 17 FA Community Shields, and a Football League Centenary Trophy. In European football, they have one European Cup Winners' Cup and one Inter-Cities Fairs Cup. In terms of trophies won, it is the third-most successful club in English football.[2]'''])
# Output [1.1015518307685852]
predict_educational_value(['''The 2003–04 season was Arsenal Football Club's 12th season in the Premier League and their 78th consecutive season in the top flight of English football.[3][4] It began on 1 July 2003 and concluded on 30 June 2004, with competitive matches played between August and May. The club ended the Premier League campaign as champions without a single defeat – a record of 26 wins and 12 draws. Arsenal fared less well in the cups, eliminated in the FA Cup and League Cup semi-finals to Manchester United and Middlesbrough respectively, and at the quarter-final stage of the UEFA Champions League to Chelsea.'''])
# Output [1.0146622359752655]
predict_educational_value(['''As both teams' first-choice kits featured a shade of red, Arsenal wore their yellow away strip, while Barcelona wore their traditional blue and maroon striped kit. Arsenal won the coin toss and Barcelona kicked off.[21] Barcelona almost immediately came under pressure when Thierry Henry shot straight at Barcelona goalkeeper Víctor Valdés, who conceded a corner. From the resulting corner Arsenal had another chance again courtesy of Henry, whose shot was again saved by Valdés. The next attack in the seventh minute resulted in Arsenal goalkeeper Jens Lehmann saving from Ludovic Giuly after he shot from a narrow angle. Four minutes later Barcelona were awarded a free-kick 35 yards from goal; Ronaldinho shot wide of the goal.'''])
# Output [0.7897453680634499]
From the examples above, the model clearly favours scientific knowledge. It is also interested in Arsenal Football Club; however, it considers a summary of a particular match to have little educational value.
Advanced Usage
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"))

def replace_newlines(text: str) -> str:
    # fasttext operates on single-line inputs, so collapse newlines into spaces
    return re.sub("\n+", " ", text)

score_dict = {
    '__label__': 0,
    '__label__Low': 0,
    '__label__Mid': 1,
    '__label__High': 2
}

def predict_educational_value(text_list: List[str]) -> List[float]:
    # Expected educational value in [0, 2]: sum over labels of label score * predicted probability
    text_list = [replace_newlines(text) for text in text_list]
    pred = model.predict(text_list, k=-1)
    score_list = []
    for l, s in zip(*pred):
        score = 0
        for _l, _s in zip(l, s):
            score += score_dict[_l] * _s
        score_list.append(float(score))
    return score_list
predict_educational_value(["Hi"])
# Output: [3.0000010156072676e-05]
📚 Documentation
📊 Benchmark
To ensure the effectiveness of this classifier, it was applied to various datasets.
Educational value = 2 points * P(High) + 1 point * P(Mid) + 0 points * P(Low)
The score can be roughly interpreted as follows:
Educational value | Category |
---|---|
2 | High |
1 | Mid |
0 | Low |
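As a worked example with hypothetical probabilities, a document with P(High) = 0.6, P(Mid) = 0.3 and P(Low) = 0.1 would receive an educational value of 2 × 0.6 + 1 × 0.3 + 0 × 0.1 = 1.5, i.e. between Mid and High.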
* I encountered an issue, so the original allenai/dolma could not be processed.
The classifier behaves as expected:
- In general, synthetic data has higher educational value because it is designed to have high educational value by construction.
- For real data, HuggingFaceFW/fineweb and Dolma v1_7, which applied the quality filter described here, have the highest educational value among all real data.
- In general, the later a dataset was released, the higher its educational value, as the research community pays increasing attention to data quality.
- The textbook category (mostly synthetic) scores the highest because it is created for educational value, reflecting the effectiveness of this model.
- The maths/paper category scores the second highest because of its high density of knowledge.
- Wikipedia scores relatively lower because it also contains information with little educational value (e.g. match results, awards won by movie stars).
- Web data scores low (when no filtering is applied) because it contains information from all domains.
- Memes score the lowest, as expected. Hateful memes score almost zero.
Some instruction datasets were added out of curiosity, even though the model was not trained on instruction data. There are two possible interpretations:
- They score lower than textbooks because the knowledge in conversations is usually not as dense as in textbooks, but they are generally more educational than unfiltered web data.
- The model is not good enough at judging educational value in instruction datasets.
📈 Analysis
🤖 Model Training With and Without the Classifier
The expectation is that a model trained on filtered data will outperform a model trained without the filter.
FineWeb is filtered on the fly with educational value >= 1.0.
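A minimal sketch of such on-the-fly filtering, assuming the predict_educational_value helper from the Advanced Usage section and a streaming load of HuggingFaceFW/fineweb (the exact training pipeline may differ):

from datasets import load_dataset

# stream FineWeb and keep only documents scoring at or above the threshold
dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
threshold = 1.0
for example in dataset:
    score = predict_educational_value([example["text"]])[0]
    if score >= threshold:
        ...  # pass the document on to the pretraining data pipeline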
Test 1: Model parameters: 192M; training tokens: 3.1B; 6,000 global steps.
Task | Trained on filtered FineWeb | Trained on unfiltered FineWeb | Trained on Cosmopedia |
---|---|---|---|
arc-easy | 37.37 | 34.97 | 37.45 |
arc-challenge | 23.55 | 22.95 | 23.21 |
Hellaswag | 28.02 | 27.92 | 27.78 |
MMLU | 24.71 | 23.94 | 24.65 |
TruthfulQA | 45.88 | 45.20 | 45.97 |
Winogrande | 49.49 | 50.59 | 50.67 |
With the filter, reasoning and commonsense reasoning appear better, as expected. The results are also close to those of the model trained on Cosmopedia. The MMLU result is also better; however, due to compute constraints (training time and model size), the results are close to random. A larger model will be trained to verify this conclusion further.
(To be updated soon with a larger model)
🌐 Domain Name Analysis
The expectation is that most educational value comes from the websites of universities/schools, research institutes, and organisations.
Since HuggingFaceFW/fineweb contains the URLs of the crawled websites, the average educational value per domain name was computed.
The first 10 million records were analysed. The full file is available here.
Below are the top 100 domains with at least 100 records:
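A minimal sketch of this per-domain aggregation, assuming the predict_educational_value helper above and the "url"/"text" columns of HuggingFaceFW/fineweb (the actual analysis script may differ):

from collections import defaultdict
from urllib.parse import urlparse

from datasets import load_dataset

dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
domain_scores = defaultdict(list)
for i, example in enumerate(dataset):
    if i >= 10_000_000:  # first 10M records
        break
    domain = urlparse(example["url"]).netloc
    domain_scores[domain].append(predict_educational_value([example["text"]])[0])

# average educational value per domain, keeping domains with at least 100 records
domain_avg = {d: sum(s) / len(s) for d, s in domain_scores.items() if len(s) >= 100}
top_domains = sorted(domain_avg.items(), key=lambda kv: kv[1], reverse=True)[:100]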
🧪 Classifier Rank Ordering
The Spearman rank correlation coefficient between the educational value and the test data is 0.7055, indicating a strong monotonic relationship. The educational value can therefore be used for ranking.
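For reference, a Spearman coefficient of this kind can be computed as in the following sketch (the toy labels and scores below are illustrative placeholders, not the actual test data):

from scipy.stats import spearmanr

# hypothetical reference labels (0 = Low, 1 = Mid, 2 = High) and predicted educational values
reference_labels = [0, 0, 1, 1, 2, 2]
predicted_values = [0.1, 0.4, 0.9, 1.2, 1.6, 1.9]
correlation, p_value = spearmanr(reference_labels, predicted_values)
print(correlation)  # close to 1 for this well-ordered toy example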
📄 License
This project is released under the MIT License.