Language:
- English
Datasets:
- pile-of-law/pile-of-law
Pipeline tag: fill-mask
Tags:
- legal
Pile of Law BERT large model 2 (uncased)
This is a model pretrained on English legal and administrative text, using the RoBERTa pretraining objective. It was trained with the same setup as pile-of-law/legalbert-large-1.7M-1, but with a different random seed.
Model description
Pile of Law BERT large model 2 is a transformers model based on the BERT large model (uncased) architecture, pretrained on the Pile of Law, a dataset consisting of ~256GB of English legal and administrative text for language model pretraining.
Intended uses & limitations
You can use the raw model for masked language modeling, or fine-tune it on a downstream task. Since this model was pretrained on a corpus of English legal and administrative text, legal downstream tasks are likely a better fit for it.
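As a rough illustration of fine-tuning on a downstream task, the sketch below loads the checkpoint into a sequence-classification head and trains it with the Hugging Face Trainer. The dataset name, label count, and hyperparameters are placeholders chosen for illustration, not settings used by the authors.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

model_name = "pile-of-law/legalbert-large-1.7M-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is a placeholder for a hypothetical binary legal classification task.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# "your_legal_dataset" is a placeholder; substitute any dataset with "text" and "label" columns.
dataset = load_dataset("your_legal_dataset")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                      batched=True)

args = TrainingArguments(output_dir="legalbert-finetuned", learning_rate=2e-5,
                         per_device_train_batch_size=8, num_train_epochs=3)
trainer = Trainer(model=model, args=args, train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"], tokenizer=tokenizer)
trainer.train()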
How to use
You can use the model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.5218929052352905,
'token': 4028,
'token_str': 'exception'},
{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.11434809118509293,
'token': 1151,
'token_str': 'appeal'},
{'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.06454459577798843,
'token': 5345,
'token_str': 'exclusion'},
{'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.043593790382146835,
'token': 3677,
'token_str': 'example'},
{'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.03758585825562477,
'token': 3542,
'token_str': 'objection'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "替换为您想要的任何文本。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "替换为您想要的任何文本。"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
Limitations and bias
Please see Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.
This model can have biased predictions. In the following example, where the model is used with a pipeline for masked language modeling, for the race descriptor of a criminal, the model predicts a higher score for "black" than for "white".
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])
[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.',
'score': 0.02685137465596199,
'token': 4311,
'token_str': 'black'},
{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.',
'score': 0.013632853515446186,
'token': 4249,
'token_str': 'white'}]
This bias will also affect all fine-tuned versions of this model.
Training data
The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset consisting of ~256GB of English legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
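If you want to inspect the pretraining corpus itself, it is available on the Hugging Face Hub as pile-of-law/pile-of-law. Below is a minimal sketch of streaming one subset; the subset name "r_legaladvice" is an example taken from the dataset card, and the exact loading details (configuration names, required datasets version) may differ.
from datasets import load_dataset

# Stream one subset of the Pile of Law; streaming avoids downloading the full ~256GB corpus.
pile = load_dataset("pile-of-law/pile-of-law", "r_legaladvice",
                    split="train", streaming=True)
example = next(iter(pile))
print(list(example.keys()))  # inspect the available fields of one record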
Training procedure
Preprocessing
The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to the Pile of Law with the HuggingFace WordPiece tokenizer, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. The 80-10-10 masking, corruption, and leave split from BERT is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (which are often falsely mistaken for sentences). The input is formed by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences so that the entire span is under 512 tokens. If the next sentence in the series is too long, it is not added, and the remaining context length is filled with padding tokens.
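As a rough illustration of this packing scheme, the sketch below reimplements it with a generic Hugging Face tokenizer. It is not the actual preprocessing code: it omits the masking step, uses tokenizer.encode in place of the LexNLP pipeline, and the exact boundary handling is an assumption.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

def pack_example(sentences, max_len=512, first_segment_len=256):
    """Fill sentences up to ~256 tokens, add [SEP], then keep filling until the span reaches 512 tokens."""
    ids = [tokenizer.cls_token_id]
    i = 0
    # First segment: add whole sentences while staying under first_segment_len tokens.
    while i < len(sentences):
        sent_ids = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent_ids) > first_segment_len:
            break
        ids.extend(sent_ids)
        i += 1
    ids.append(tokenizer.sep_token_id)
    # Second segment: keep adding whole sentences while the full span stays under max_len.
    while i < len(sentences):
        sent_ids = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent_ids) + 1 > max_len:
            break  # next sentence is too long; stop here and pad instead
        ids.extend(sent_ids)
        i += 1
    ids.append(tokenizer.sep_token_id)
    # Fill the remaining context length with padding tokens.
    ids += [tokenizer.pad_token_id] * (max_len - len(ids))
    return ids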
Pretraining
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially caused by the diversity of sources in the training data. Pretraining used the masked language modeling (MLM) objective described in RoBERTa, without the NSP loss. The model was pretrained with 512-length sequences for all steps.
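A minimal sketch of this objective (MLM only, no NSP) with the stated hyperparameters is shown below, using the Hugging Face data collator and Trainer. This is an illustration of the setup, not the original SambaNova training code; in particular, the split between per-device batch size and gradient accumulation used to reach the effective batch size of 128 is an assumption.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

tokenizer = BertTokenizerFast.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertForMaskedLM.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

# 15% masking with BERT's 80-10-10 mask/corrupt/leave split is the collator default.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='legalbert-mlm',
    learning_rate=5e-6,              # learning rate reported above
    per_device_train_batch_size=16,  # combined with accumulation to reach an effective batch size of 128
    gradient_accumulation_steps=8,
    max_steps=1_700_000,             # 1.7M steps
)
# train_dataset would contain the 512-token packed examples described under "Preprocessing".
# trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
# trainer.train()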
We trained two models with the same setup in parallel, using different random seeds. We selected the model with the lowest log-likelihood, pile-of-law/legalbert-large-1.7M-1 (referred to as PoL-BERT-Large), for our experiments, but also release the second model, pile-of-law/legalbert-large-1.7M-2.
Evaluation results
For fine-tuning results on the CaseHOLD variant provided by the LexGLUE paper, see the model card of pile-of-law/legalbert-large-1.7M-1.
BibTeX entry and citation info
@misc{hendersonkrass2022pileoflaw,
url = {https://arxiv.org/abs/2207.00220},
author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
publisher = {arXiv},
year = {2022}
}