macbert4csc-base-chinese: an open-source Chinese spelling correction model with state-of-the-art results on the SIGHAN2015 test set


Macbert4csc Base Chinese

Developed by shibing624

A MacBERT-based Chinese spelling correction model that reaches state-of-the-art results on the SIGHAN2015 test set.

Category: large language model

Library: Transformers

Language: Chinese. License: Apache-2.0. Tags: #ChineseSpellingCorrection #SIGHAN-SOTA #MacBERT

Downloads: 9,623

Released: 3/2/2022

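Model overview

The model detects and corrects spelling errors in Chinese text. It uses a modified MacBERT architecture and is suited to a wide range of Chinese proofreading scenarios.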

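Model features

State-of-the-art performance

Reaches a character-level F1 of 89.91 and a sentence-level F1 of 77.89 on the SIGHAN2015 test set, the best reported results at the time of release.

Improved architecture

A MacBERT architecture adapted from softmaskedbert; the MLM-as-correction pre-training task improves model performance.

Comprehensive training data

Trained on the SIGHAN+Wang271K Chinese correction dataset, which contains about 270,000 high-quality correction samples.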

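Capabilities

Chinese spelling error detection

Automatic correction of Chinese text

Identification and repair of wrongly written characters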

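Use cases

Text proofreading

Everyday text correction

Automatically corrects spelling errors in everyday text such as chat messages and email.

Example: '今天新情很好' → '今天心情很好'

Formal document proofreading

Helps check the textual accuracy of formal documents such as reports and papers.

Education

Chinese learning aid

Helps learners of Chinese find and correct errors in their writing.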

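🚀 MacBERT for Chinese Spelling Correction (macbert4csc)

macbert4csc is a model for Chinese spelling correction. It replaces wrongly written characters in Chinese text with the intended ones.

Evaluation results of macbert4csc-base-chinese on the SIGHAN2015 test data: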

Level	Correction precision	Correction recall	Correction F1
Character level	93.72	86.40	89.91
Sentence level	82.64	73.66	77.89
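The F1 column can be checked against the precision and recall columns. The sketch below assumes the reported F1 is the standard harmonic mean of the two; the function name `f1_score` is chosen here for illustration and is not part of pycorrector.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table (percentages).
reported = {
    "character": (93.72, 86.40),
    "sentence": (82.64, 73.66),
}

for level, (p, r) in reported.items():
    print(f"{level}-level F1: {f1_score(p, r):.2f}")
```

Rounded to two decimals, both values agree with the table (89.91 and 77.89).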

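Because the training data includes the SIGHAN2015 training set (to reproduce the paper), the model reaches SOTA on the SIGHAN2015 test set.

The model structure is adapted from softmaskedbert, with improvements. The structure is as follows: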

[Figure: model architecture diagram]

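🚀 Quick start

The model is released as part of pycorrector, an open-source Chinese text correction project. pycorrector supports macbert4csc, which can be called as shown in the examples that follow.

💻 Usage examples

Basic usage

Call the model through the pycorrector library: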

from pycorrector.macbert.macbert_corrector import MacBertCorrector

m = MacBertCorrector("shibing624/macbert4csc-base-chinese")

i = m.correct('今天新情很好')
print(i)

Advanced usage

Call the model through the transformers library:

import operator
import torch
from transformers import BertTokenizer, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作，我也很高心。"]
with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, return_tensors='pt').to(device))

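def get_errors(corrected_text, origin_text):
    """Align the model output with the input and list each changed character.

    Returns the aligned corrected text and a list of
    (original_char, corrected_char, start, end) tuples.
    """
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # Treated as missing from the decoded output (whitespace or
            # unknown tokens): copy the character back so indices stay aligned.
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # The output differs only in case: keep the original
                # uppercase English letter instead of reporting an error.
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    # Report the corrections in order of their start position.
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details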

result = []
for ids, text in zip(outputs.logits, texts):
    _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
    corrected_text = _text[:len(text)]
    corrected_text, details = get_errors(corrected_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)

Output:

今天新情很好  =>  今天心情很好 [('新', '心', 2, 3)]
你找到你最喜欢的工作，我也很高心。  =>  你找到你最喜欢的工作，我也很高兴。 [('心', '兴', 15, 16)]
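Each entry in the detail list has the form (wrong character, corrected character, start, end), so the corrected sentence can be rebuilt from the original text and the list alone. The helper below is a minimal sketch of that idea; `apply_edits` is a name introduced here for illustration and is not a pycorrector or transformers function.

```python
from typing import List, Tuple

# (wrong_char, right_char, start, end), as printed in the output above.
Edit = Tuple[str, str, int, int]


def apply_edits(text: str, edits: List[Edit]) -> str:
    """Rebuild the corrected sentence from the original text and its edit list."""
    chars = list(text)
    # Apply from right to left so earlier indices are never shifted.
    for wrong, right, start, end in sorted(edits, key=lambda e: e[2], reverse=True):
        # Reject an edit that does not match the text it claims to fix.
        if text[start:end] != wrong:
            raise ValueError(
                f"expected {wrong!r} at [{start}:{end}], found {text[start:end]!r}"
            )
        chars[start:end] = list(right)
    return "".join(chars)


print(apply_edits("今天新情很好", [("新", "心", 2, 3)]))
```

The same helper can serve as a sanity check on stored results: if an edit no longer matches the text at its offsets, the text and the edit list have drifted apart.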

Model files

macbert4csc-base-chinese
    ├── config.json
    ├── added_tokens.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

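📚 Detailed documentation

Training datasets

SIGHAN+Wang271K Chinese correction dataset

Dataset	Corpus	Download link	Archive size
`SIGHAN+Wang271K Chinese correction dataset`	SIGHAN+Wang271K (about 270,000 samples)	Baidu Netdisk (password 01b9)	106M
`Original SIGHAN dataset`	SIGHAN13 14 15	Official csc.html	339K
`Original Wang271K dataset`	Wang271K	Automatic-Corpus-Generation, provided by dimmywang	93M

The SIGHAN+Wang271K Chinese correction dataset uses the following data format: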

[
    {
        "id": "B2-4029-3",
        "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
        "wrong_ids": [
            5,
            31
        ],
        "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。"
    }
]
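In this sample, `wrong_ids` lists the zero-based character positions where `original_text` differs from `correct_text`. The sketch below recomputes those positions as a consistency check. It assumes that both texts have the same length (substitution-only errors), which holds for this sample but is an assumption about the rest of the dataset; `diff_positions` is a name introduced here.

```python
from typing import List


def diff_positions(original: str, correct: str) -> List[int]:
    """Return the zero-based indices where two equal-length strings differ."""
    if len(original) != len(correct):
        raise ValueError("texts must have the same length to be compared by position")
    return [i for i, (a, b) in enumerate(zip(original, correct)) if a != b]


# The sample record shown above.
record = {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。",
}

positions = diff_positions(record["original_text"], record["correct_text"])
print(positions, positions == record["wrong_ids"])
```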

File layout of a trained model:

macbert4csc
    ├── config.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

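To train macbert4csc yourself, see https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert

About MacBERT

MacBERT is an improved BERT model that introduces a novel MLM-as-correction pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.

An example of the pre-training tasks: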

Task	Example
Original sentence	we use a language model to predict the probability of the next word.
MLM	we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word .
Whole word masking	we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word .
N-gram masking	we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word .
MLM as correction	we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word .

In addition to the new pre-training task, the model uses the following techniques:

Whole Word Masking (WWM)
N-gram masking
Sentence-Order Prediction (SOP)
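To make whole word masking concrete, the sketch below reproduces the "Whole word masking" row of the example table. It is a toy illustration, not MacBERT's pre-training code: the words to mask are fixed by hand rather than sampled, and the WordPiece pieces `pre` and `##ba` are inferred from the partially masked MLM row, so treat them as assumptions.

```python
from typing import List, Set


def whole_word_mask(tokens: List[str], words_to_mask: Set[int], mask: str = "[M]") -> List[str]:
    """Mask every WordPiece that belongs to one of the selected words.

    `words_to_mask` holds word indices, where a new word starts at each
    token that does not carry the "##" continuation prefix.
    """
    masked = []
    word_index = -1
    for token in tokens:
        if not token.startswith("##"):
            word_index += 1  # this piece starts a new word
        masked.append(mask if word_index in words_to_mask else token)
    return masked


# Assumed WordPiece split of the example sentence.
tokens = (
    "we use a language model to pre ##di ##ct the pro ##ba ##bility "
    "of the next word ."
).split()

# Word indices of "model", "predict" and "probability".
print(" ".join(whole_word_mask(tokens, {4, 6, 8})))
```

Unlike plain MLM, which may mask a single piece such as `pre` and leave `##di ##ct` visible, this masks all pieces of a chosen word together.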

Note that the main neural network architecture is unchanged, so MacBERT can directly replace the original BERT.

For more technical details, see the paper: Revisiting Pre-trained Models for Chinese Natural Language Processing

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Text Error Correction Tool},
  year = {2021},
  url = {https://github.com/shibing624/pycorrector},
}