🚀 ✨ bert-restore-punctuation
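This is a BERT-based model fine-tuned to restore punctuation and capitalization in English text. It was trained on the Yelp Reviews dataset and handles English text that has lost its punctuation, such as automatic speech recognition (ASR) output. The model can be used as-is to restore punctuation in general English text, or fine-tuned further on domain-specific text.
🚀 Quick Start
To get started with the model:
- First, install the required package.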
pip install rpunct
- Sample Python code.
from rpunct import RestorePuncts
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
The model works on English text of arbitrary length and uses GPU acceleration when a GPU is available.
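One common way to feed arbitrary-length text to a BERT model with a fixed maximum sequence length is to split it into overlapping word windows and process each window separately. The sketch below illustrates only that general idea; the helper name `split_into_windows` and the default window size (250 words) and overlap (30 words) are assumptions chosen for illustration, not values stated in this document.

```python
def split_into_windows(text, length=250, overlap=30):
    """Split text into word windows of at most `length` words.

    Consecutive windows share `overlap` words so that words near a
    window boundary are seen with context on both sides.
    Hypothetical helper, not part of the rpunct package.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    words = text.split()
    windows = []
    step = length - overlap
    start = 0
    while start < len(words):
        windows.append(words[start:start + length])
        if start + length >= len(words):
            break  # the last window already reaches the end of the text
        start += step
    return windows


# Small numbers make the overlap easy to see.
demo = split_into_windows("a b c d e f g h", length=4, overlap=1)
```

With `length=4` and `overlap=1`, each window starts on the last word of the previous one. How predictions for the shared words are merged afterwards is a separate design choice that this sketch does not cover.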
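✨ Key Features
- Predicts punctuation and capitalization for plain lowercase text, which suits speech recognition output and other text that has lost its punctuation.
- Can be used as-is to restore punctuation in general English text, or fine-tuned further on domain-specific text.
- Restores the following punctuation marks: [! ? . , - : ; ' ], and restores the capitalization of each word's first letter.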
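The label names in this model card's per-label results (for example `.+Upper`, `Upper`, `none`) suggest that every word receives one combined label. Here is a minimal sketch of turning such labels back into text, assuming the punctuation mark attaches after the word and `Upper` capitalizes its first letter. `apply_label` and `restore` are hypothetical helpers, not part of `rpunct`.

```python
def apply_label(word, label):
    """Apply one combined label such as ',', '.+Upper', 'Upper' or 'none'.

    Assumption: punctuation follows the word; 'Upper' capitalizes it.
    """
    if label == "none":
        return word
    if label == "Upper":
        punct, upper = "", True
    elif label.endswith("+Upper"):
        punct, upper = label[: -len("+Upper")], True
    else:
        punct, upper = label, False
    if upper:
        word = word[:1].upper() + word[1:]
    return word + punct


def restore(words, labels):
    """Join labeled words back into a single string."""
    return " ".join(apply_label(w, l) for w, l in zip(words, labels))


words = ["it", "worked", "now", "david", "tried", "again"]
labels = ["Upper", ".", "Upper", "Upper", "none", "."]
restored = restore(words, labels)
print(restored)  # It worked. Now David tried again.
```

Folding punctuation and casing into a single label keeps this a one-label-per-token classification problem, at the cost of a larger label set.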
📦 Installation
pip install rpunct
💻 Usage Examples
Basic usage
from rpunct import RestorePuncts
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
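The model is fine-tuned from `bert-base-uncased` and predicts punctuation for plain lowercase text, so input that still carries partial punctuation or mixed case may be worth normalizing first. The helper below is a hypothetical sketch of such a step (it is not part of `rpunct`): it lowercases the text and removes the sentence-level marks, while keeping hyphens and apostrophes inside words as in the example input above.

```python
import re

# Sentence-level marks to remove before restoration (assumed set).
_MARKS = re.compile(r"[!?.,:;]")


def strip_punctuation(text):
    """Lowercase text and drop ! ? . , : ; while keeping - and '."""
    text = _MARKS.sub(" ", text.lower())
    # Collapse the whitespace left behind by removed marks and newlines.
    return " ".join(text.split())


plain = strip_punctuation(
    "In 2018, Cornell researchers built a high-powered detector."
)
print(plain)  # in 2018 cornell researchers built a high-powered detector
```

This simple rule also splits decimals and abbreviations (for example `3.5` becomes `3 5`), so treat it as a starting point rather than a complete normalizer.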
📚 Documentation
Training Data
The model type and the number of product reviews used for fine-tuning are listed below:
| Attribute | Details |
|---|---|
| Model type | Fine-tuned bert-base-uncased |
| Training data | Yelp Reviews, 560,000 English text samples |
We found that the model converged best at around 3 training epochs.
Accuracy
The fine-tuned model's accuracy on 45,990 held-out text samples:
| Accuracy | Overall F1 | Evaluation samples |
|---|---|---|
| 91% | 90% | 45,990 |
The model's performance broken down by label:
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ! | 0.45 | 0.17 | 0.24 | 424 |
| !+Upper | 0.43 | 0.34 | 0.38 | 98 |
| ' | 0.60 | 0.27 | 0.37 | 11 |
| , | 0.59 | 0.51 | 0.55 | 1522 |
| ,+Upper | 0.52 | 0.50 | 0.51 | 239 |
| - | 0.00 | 0.00 | 0.00 | 18 |
| . | 0.69 | 0.84 | 0.75 | 2488 |
| .+Upper | 0.65 | 0.52 | 0.57 | 274 |
| : | 0.52 | 0.31 | 0.39 | 39 |
| :+Upper | 0.36 | 0.62 | 0.45 | 16 |
| ; | 0.00 | 0.00 | 0.00 | 17 |
| ? | 0.54 | 0.48 | 0.51 | 46 |
| ?+Upper | 0.40 | 0.50 | 0.44 | 4 |
| none | 0.96 | 0.96 | 0.96 | 35352 |
| Upper | 0.84 | 0.82 | 0.83 | 5442 |
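The per-label rows can be cross-checked against the overall figures. The sketch below assumes the overall F1 is a support-weighted average of the per-label F1 scores (this document does not say how it was computed) and uses the fact that, for single-label classification, accuracy equals the support-weighted average of per-label recall. The numbers are copied from the per-label results above.

```python
# label: (precision, recall, f1, support), copied from the per-label results.
PER_LABEL = {
    "!": (0.45, 0.17, 0.24, 424),
    "!+Upper": (0.43, 0.34, 0.38, 98),
    "'": (0.60, 0.27, 0.37, 11),
    ",": (0.59, 0.51, 0.55, 1522),
    ",+Upper": (0.52, 0.50, 0.51, 239),
    "-": (0.00, 0.00, 0.00, 18),
    ".": (0.69, 0.84, 0.75, 2488),
    ".+Upper": (0.65, 0.52, 0.57, 274),
    ":": (0.52, 0.31, 0.39, 39),
    ":+Upper": (0.36, 0.62, 0.45, 16),
    ";": (0.00, 0.00, 0.00, 17),
    "?": (0.54, 0.48, 0.51, 46),
    "?+Upper": (0.40, 0.50, 0.44, 4),
    "none": (0.96, 0.96, 0.96, 35352),
    "Upper": (0.84, 0.82, 0.83, 5442),
}

# Total number of labeled items across all classes.
total = sum(support for _, _, _, support in PER_LABEL.values())

# Accuracy as the support-weighted mean of per-label recall.
accuracy = sum(r * s for _, r, _, s in PER_LABEL.values()) / total

# Overall F1 as the support-weighted mean of per-label F1 (assumed method).
weighted_f1 = sum(f * s for _, _, f, s in PER_LABEL.values()) / total

print(total, round(accuracy, 3), round(weighted_f1, 3))
```

The supports add up to 45,990, matching the reported evaluation size, and both weighted averages come out near 0.91, in line with the reported 91% accuracy and 90% F1 given that the inputs are rounded to two decimals. The `none` and `Upper` labels account for most of the support, so they dominate both figures; the rarer punctuation labels score considerably lower.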
☕ Contact
For questions, feedback, or requests for similar models, please contact Daulet Nurmanbetov.
📄 License
This project is released under the MIT License.