bert-restore-punctuation open-source model: free to deploy, restores punctuation and capitalization in text

Bert Restore Punctuation

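Developed by felflare

A punctuation restoration model fine-tuned from the bert-base-uncased architecture on the Yelp Reviews dataset. It predicts the punctuation and capitalization of plain lowercase text.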

Sequence labeling

Transformers

English | License: MIT | #EnglishPunctuationRestoration #ASRPostProcessing #TextEnhancement

Downloads: 1,890

Released: 3/2/2022

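Model overview

The model restores punctuation and capitalization in English text, which makes it suitable for speech recognition output and other text that has lost its punctuation. It restores the following punctuation marks: ! ? . , - : ; ' as well as the upper-casing of words.

Model features

Multiple punctuation marks

Restores several punctuation marks, including periods, commas, question marks, and exclamation marks.

Capitalization restoration

Restores the initial capital letter of words, which improves readability.

Long-text processing

Handles English text of arbitrary length, which suits long-form content.

GPU acceleration

Uses GPU acceleration when a GPU is available, which speeds up processing.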
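
Whether a GPU is actually used depends on the runtime environment. The sketch below is not part of rpunct; it only shows one way to check which device would be visible, assuming the model runs on a PyTorch backend (an assumption, not something this page states). `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: inference runs on a PyTorch backend
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference would run on: {pick_device()}")
```

Running this before loading the model is a quick way to confirm that a CUDA device is visible to the process.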

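Model capabilities

Punctuation restoration

Capitalization restoration

Text processing

Long-text support

Use cases

Speech recognition post-processing

Punctuation restoration for ASR output

Restores punctuation and capitalization in the unpunctuated text produced by a speech recognition system.

Improves the readability and polish of the text.

Text preprocessing

Recovering text that has lost its punctuation

Handles text whose punctuation was lost in transmission or storage.

Restores the original text formatting, which makes later analysis easier.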
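
The model is described as working on plain lowercase text, so input from other sources may need normalizing first. The following is a minimal sketch, assuming that simple regex cleanup is adequate for the input; `normalize_for_restoration` is a hypothetical helper and not part of rpunct. Hyphens and apostrophes inside words are kept, as in the quick-start input further down (for example `state-of-the-art`).

```python
import re

# Sentence punctuation to strip so that none is left for the model to trip over.
_PUNCT = re.compile(r"[!?.,:;]")


def normalize_for_restoration(text: str) -> str:
    """Lowercase the text and drop sentence punctuation, keeping in-word hyphens and apostrophes."""
    text = _PUNCT.sub(" ", text.lower())
    return " ".join(text.split())  # collapse runs of whitespace


if __name__ == "__main__":
    print(normalize_for_restoration("Hello, World!  It's a state-of-the-art test."))
```

One limitation of this simple approach is that decimal numbers such as `3.5` are split into two tokens, so numeric text may need separate handling.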

🚀 ✨ bert-restore-punctuation

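This is a BERT-based model fine-tuned to restore punctuation and capitalization in English text. It was trained on the Yelp Reviews dataset and can handle English text that has lost its punctuation, such as automatic speech recognition (ASR) output. The model can be used directly for punctuation restoration on general English text, or fine-tuned further on domain-specific text.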

🚀 Quick start

The quickest way to use the model is as follows.

First, install the required package.

pip install rpunct

Sample Python code.

from rpunct import RestorePuncts

# The default language is English
rpunct = RestorePuncts()
restored = rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
print(restored)
# Outputs the following (wrapped here for readability):
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

The model works on English text of arbitrary length and uses GPU acceleration when it is available.
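
A common way to run a fixed-length model over arbitrarily long input is to split the text into overlapping word windows, process each window separately, and merge the results. The sketch below shows only the splitting step. The window and overlap sizes are illustrative assumptions, `split_into_windows` is a hypothetical helper, and rpunct's own internals may differ.

```python
from typing import List


def split_into_windows(text: str, window: int = 250, overlap: int = 30) -> List[List[str]]:
    """Split text into word windows; consecutive windows share `overlap` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    windows: List[List[str]] = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + window])
        if start + window >= len(words):
            break  # this window already reaches the end of the text
    return windows


if __name__ == "__main__":
    demo = " ".join(f"w{i}" for i in range(10))
    for chunk in split_into_windows(demo, window=4, overlap=1):
        print(chunk)
```

The overlap gives each window some context on both sides of a boundary, so that words near a cut are not labeled without their neighbors.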

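✨ Key features

The model predicts the punctuation and capitalization of plain lowercase text, which suits speech recognition output and other text that has lost its punctuation.
It can be used directly for punctuation restoration on general English text, or fine-tuned further on domain-specific text.
It restores the following punctuation marks: [! ? . , - : ; ' ], and also restores the upper-casing of words.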
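
Because this is a sequence labeling model, its output can be pictured as one label per word, combining an optional punctuation mark with an optional capitalization flag. The sketch below applies such labels to lowercase words. The label strings follow the names used in the per-label results table (`.+Upper`, `,`, `Upper`, `none`), but the decoding logic is an illustrative assumption and not rpunct's implementation; `apply_labels` is a hypothetical helper.

```python
from typing import List, Sequence


def apply_labels(words: Sequence[str], labels: Sequence[str]) -> str:
    """Rebuild text from lowercase words and one label per word."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    restored: List[str] = []
    for word, label in zip(words, labels):
        mark, _, flag = label.partition("+")
        if mark == "Upper":      # bare "Upper": capitalize, no punctuation
            mark, flag = "", "Upper"
        elif mark == "none":     # "none": leave the word unchanged
            mark = ""
        if flag == "Upper":
            word = word[:1].upper() + word[1:]
        # Simplification: every mark is appended to the word and followed by a space.
        restored.append(word + mark)
    return " ".join(restored)


if __name__ == "__main__":
    words = "in 2018 cornell researchers built a detector".split()
    labels = ["Upper", ",", "Upper", "none", "none", "none", "."]
    print(apply_labels(words, labels))
```

Hyphens and apostrophes would normally join to the next word without a space, which this simplified version does not attempt.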

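📚 Documentation

Training data

The number of product reviews used to fine-tune the model:

Property	Details
Model type	Fine-tuned bert-base-uncased
Training data	Yelp Reviews, 560,000 English text samples

We found that the model converged best at around 3 epochs.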
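
Training examples for this kind of model can be produced from ordinary punctuated text by stripping punctuation and case and recording what was removed as a label for each word. The sketch below is an assumption about how such labels could be derived from a review sentence. It is not the author's preprocessing code; `derive_labels` is a hypothetical helper, and the label names follow the per-label results table.

```python
from typing import List, Tuple

MARKS = "!?.,-:;'"  # the marks the model is described as restoring


def derive_labels(sentence: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lowercase word, label) pairs."""
    pairs: List[Tuple[str, str]] = []
    for token in sentence.split():
        mark = token[-1] if token[-1] in MARKS else ""
        core = token[:-1] if mark else token
        if not core:
            continue  # the token was punctuation only
        upper = core[0].isupper()
        if mark and upper:
            label = f"{mark}+Upper"
        elif mark:
            label = mark
        elif upper:
            label = "Upper"
        else:
            label = "none"
        pairs.append((core.lower(), label))
    return pairs


if __name__ == "__main__":
    for word, label in derive_labels("The food was great, but Service was slow!"):
        print(f"{word}\t{label}")
```

Only a single trailing mark per token is considered here, which keeps the label set as small as the one in the results table.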

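Accuracy

The fine-tuned model achieved the following accuracy on 45,990 held-out text samples:

Accuracy	Overall F1	Eval support
91%	90%	45,990

The per-label breakdown of model performance:

Label	Precision	Recall	F1	Support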
!	0.45	0.17	0.24	424
!+Upper	0.43	0.34	0.38	98
'	0.60	0.27	0.37	11
,	0.59	0.51	0.55	1522
,+Upper	0.52	0.50	0.51	239
-	0.00	0.00	0.00	18
.	0.69	0.84	0.75	2488
.+Upper	0.65	0.52	0.57	274
:	0.52	0.31	0.39	39
:+Upper	0.36	0.62	0.45	16
;	0.00	0.00	0.00	17
?	0.54	0.48	0.51	46
?+Upper	0.40	0.50	0.44	4
none	0.96	0.96	0.96	35352
Upper	0.84	0.82	0.83	5442
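
The overall figures can be cross-checked against the per-label table. The sketch below recomputes support-weighted and unweighted (macro) averages from the table values; weighting by support is an assumption about how the overall F1 was aggregated. The weighted F1 comes out at roughly 0.9, in line with the reported overall value, while the macro average is about 0.46 because rare labels such as `-` and `;` score 0.

```python
# (label, precision, recall, f1, support), copied from the per-label table
ROWS = [
    ("!", 0.45, 0.17, 0.24, 424),
    ("!+Upper", 0.43, 0.34, 0.38, 98),
    ("'", 0.60, 0.27, 0.37, 11),
    (",", 0.59, 0.51, 0.55, 1522),
    (",+Upper", 0.52, 0.50, 0.51, 239),
    ("-", 0.00, 0.00, 0.00, 18),
    (".", 0.69, 0.84, 0.75, 2488),
    (".+Upper", 0.65, 0.52, 0.57, 274),
    (":", 0.52, 0.31, 0.39, 39),
    (":+Upper", 0.36, 0.62, 0.45, 16),
    (";", 0.00, 0.00, 0.00, 17),
    ("?", 0.54, 0.48, 0.51, 46),
    ("?+Upper", 0.40, 0.50, 0.44, 4),
    ("none", 0.96, 0.96, 0.96, 35352),
    ("Upper", 0.84, 0.82, 0.83, 5442),
]


def weighted_mean(values, weights):
    """Average of `values`, each weighted by the matching entry in `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [row[4] for row in ROWS]
total_support = sum(supports)
weighted_f1 = weighted_mean([row[3] for row in ROWS], supports)
# For single-label classification, support-weighted recall equals accuracy.
weighted_recall = weighted_mean([row[2] for row in ROWS], supports)
macro_f1 = sum(row[3] for row in ROWS) / len(ROWS)

if __name__ == "__main__":
    print(f"total support:   {total_support}")
    print(f"weighted F1:     {weighted_f1:.3f}")
    print(f"weighted recall: {weighted_recall:.3f}")
    print(f"macro F1:        {macro_f1:.3f}")
```

The gap between the weighted and macro averages shows that the headline numbers are dominated by the `none` and `Upper` labels, which together account for most of the evaluation samples.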