MathBERT-custom open-source model: mathematical language understanding for English math text, free to use

Mathbert Custom

Developed by tbs17

A BERT model pretrained on English text from the mathematics domain, focused on mathematical language understanding tasks

Large language model

Transformers

#MathLanguageUnderstanding #EducationDomain #BidirectionalContextModeling

Downloads: 214

Released: 3/2/2022

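Model overview

A Transformer model pretrained in a self-supervised fashion on a large mathematical corpus. It supports masked language modeling and next sentence prediction, and is optimized for processing math-related text.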

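Model features

Optimized for the math domain

Trained specifically on mathematical text, covering mathematical language from pre-K to graduate level

Custom vocabulary

Uses a custom vocabulary of 30,522 tokens to handle mathematical terminology better

Bidirectional context understanding

Learns bidirectional sentence representations through the MLM objective

Uncased

Treats upper-case and lower-case variants identically, which makes the model more robust to casing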

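Model capabilities

Feature extraction from mathematical text

Math problem understanding

Mathematical term prediction

Sentence-relationship prediction for mathematical text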

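Use cases

Education technology

Math question answering systems

Serves as the feature extraction module of a math question answering system

Outperforms general-purpose models on mask-filling tasks for math problem text

Math textbook analysis

Analyzes the content structure of math textbooks

Academic research

Math paper processing

Processes abstracts of arXiv math papers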

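🚀 MathBERT model (custom vocabulary)

MathBERT is a model pretrained with a masked language modeling (MLM) objective on English mathematical language ranging from pre-K to graduate level. The model is uncased: it makes no distinction between "english" and "English".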
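To make "uncased" concrete, here is a minimal sketch of the kind of normalization an uncased BERT tokenizer typically applies before WordPiece splitting: lowercasing and accent stripping. The helper name `normalize_uncased` is hypothetical, and whether the MathBERT-custom tokenizer strips accents in exactly this way is an assumption; the sketch only illustrates the idea.

```python
import unicodedata

def normalize_uncased(text: str) -> str:
    """Lowercase the text and drop combining accent marks (assumed behavior)."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(normalize_uncased("English") == normalize_uncased("english"))  # True
print(normalize_uncased("Poincaré Conjecture"))  # poincare conjecture
```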

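✨ Main features

MathBERT is a Transformer model pretrained in a self-supervised fashion on a large corpus of English mathematical text. It was pretrained on raw text only, with no human labeling; an automatic process generates the inputs and labels from the text. Specifically, it was pretrained with two objectives:

Masked language modeling (MLM): given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after another, and from autoregressive models such as GPT, which internally mask future tokens. It lets the model learn a bidirectional representation of the sentence.
Next sentence prediction (NSP): during pretraining, the model concatenates two masked sentences as input. Sometimes the two sentences were adjacent in the original text and sometimes they were not. The model has to predict whether the two sentences followed each other.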
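The two objectives can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual pretraining pipeline: the function names `mask_tokens` and `make_nsp_pair` and the parameter `p_next` are hypothetical, whitespace-split words stand in for WordPiece tokens, and the 50% chance of a true next sentence is an assumption (the text only says "sometimes"). The original BERT recipe also replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels keep the original token at
    masked positions and None everywhere else."""
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Mask about 15% of the positions, but always at least one
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

def make_nsp_pair(sentences, index, rng=None, p_next=0.5):
    """Return (sentence_a, sentence_b, is_next) for sentences[index].
    Requires index < len(sentences) - 1 and at least three sentences."""
    rng = rng or random.Random()
    sentence_a = sentences[index]
    if rng.random() < p_next:
        return sentence_a, sentences[index + 1], True
    # Negative example: any sentence except A itself and its true successor
    candidates = [i for i in range(len(sentences)) if i not in (index, index + 1)]
    return sentence_a, sentences[rng.choice(candidates)], False

rng = random.Random(0)
tokens = "the sum of the interior angles of a triangle is 180 degrees".split()
masked, labels = mask_tokens(tokens, rng=rng)
print(masked)
print(labels)
```

The labels list is what the model is trained to recover: the loss is computed only at the positions where a label is present.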

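In this way, the model learns an inner representation of mathematical language that can be used to extract features useful for downstream tasks. For example, given a dataset of labeled sentences, you can train a standard classifier using the features produced by the MathBERT model as inputs.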
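One common way to turn per-token outputs into a single feature vector per sentence is masked mean pooling, sketched below with NumPy. This is an assumed choice, not something the model card prescribes; taking the first ([CLS]) token vector is another common option. The helper name `mean_pool` is hypothetical, and the toy array stands in for a `last_hidden_state` converted to NumPy.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    hidden_states: (batch, seq_len, hidden) array of token vectors.
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (hidden_states * mask).sum(axis=1)
    # Clip so an all-padding row does not divide by zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy stand-in for model output: one sentence, three positions, last one is padding
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
features = mean_pool(hidden, mask)
print(features)  # [[2. 3.]]
```

The resulting matrix has one row per sentence and can be passed, together with the labels, to a standard classifier such as scikit-learn's `LogisticRegression`.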

💻 Usage examples

Basic usage

Here is how to use the model in PyTorch to get the features of a given text:

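from transformers import BertTokenizer, BertModel

# Load the tokenizer and encoder weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = BertModel.from_pretrained("tbs17/MathBERT-custom")
text = "Replace me by any text you'd like."
# Keep the full encoding (input_ids, token_type_ids, attention_mask)
encoded_input = tokenizer(text, return_tensors='pt')
# Unpack it so the attention mask is passed along with the input IDs
output = model(**encoded_input)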

Advanced usage

Here is how to use the model in TensorFlow to get the features of a given text:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = TFBertModel.from_pretrained("tbs17/MathBERT-custom")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

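📚 Detailed documentation

Intended uses and limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on math-related downstream tasks.

Note that the model is primarily aimed at being fine-tuned on math-related tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as math text generation, a model like GPT2 is a better fit.

Warning

MathBERT is designed specifically for math-related tasks and performs better at mask filling on math problem text than at general-purpose mask filling. Examples: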

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tbs17/MathBERT')
# Intended use: math problem text
>>> unmasker("students apply these new understandings as they reason about and perform decimal [MASK] through the hundredths place.")

[{'score': 0.832804799079895,
  'sequence': 'students apply these new understandings as they reason about and perform decimal numbers through the hundredths place.',
  'token': 3616,
  'token_str': 'numbers'},
 {'score': 0.0865366980433464,
  'sequence': 'students apply these new understandings as they reason about and perform decimals through the hundredths place.',
  'token': 2015,
  'token_str': '##s'},
 {'score': 0.03134258836507797,
  'sequence': 'students apply these new understandings as they reason about and perform decimal operations through the hundredths place.',
  'token': 3136,
  'token_str': 'operations'},
 {'score': 0.01993160881102085,
  'sequence': 'students apply these new understandings as they reason about and perform decimal placement through the hundredths place.',
  'token': 11073,
  'token_str': 'placement'},
 {'score': 0.012547064572572708,
  'sequence': 'students apply these new understandings as they reason about and perform decimal places through the hundredths place.',
  'token': 3182,
  'token_str': 'places'}]

# Not the intended use: general-purpose text
>>> unmasker("The man worked as a [MASK].")

[{'score': 0.6469377875328064,
  'sequence': 'the man worked as a book.',
  'token': 2338,
  'token_str': 'book'},
 {'score': 0.07073448598384857,
  'sequence': 'the man worked as a guide.',
  'token': 5009,
  'token_str': 'guide'},
 {'score': 0.031362924724817276,
  'sequence': 'the man worked as a text.',
  'token': 3793,
  'token_str': 'text'},
 {'score': 0.02306508645415306,
  'sequence': 'the man worked as a man.',
  'token': 2158,
  'token_str': 'man'},
 {'score': 0.020547250285744667,
  'sequence': 'the man worked as a distance.',
  'token': 3292,
  'token_str': 'distance'}]
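The decimal example above also returns a WordPiece continuation token ('##s'), which is a word fragment rather than a standalone term. A minimal post-processing sketch, assuming you only want whole-word candidates, is shown below. The helper name `whole_word_predictions` is hypothetical; the scores are copied from the first output above, with the 'sequence' and 'token' fields left out for brevity.

```python
def whole_word_predictions(predictions, top_k=3):
    """Keep predictions whose token is a whole word (no '##' WordPiece prefix)
    and return the top_k as (token, rounded score) pairs."""
    whole = [p for p in predictions if not p["token_str"].startswith("##")]
    whole.sort(key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 3)) for p in whole[:top_k]]

# Same structure as the list of dicts returned by the fill-mask pipeline
predictions = [
    {"score": 0.832804799079895, "token_str": "numbers"},
    {"score": 0.0865366980433464, "token_str": "##s"},
    {"score": 0.03134258836507797, "token_str": "operations"},
    {"score": 0.01993160881102085, "token_str": "placement"},
    {"score": 0.012547064572572708, "token_str": "places"},
]
print(whole_word_predictions(predictions))
# [('numbers', 0.833), ('operations', 0.031), ('placement', 0.02)]
```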