Language: English
Tags:
LSG model
Transformers version >= 4.36.1
This model relies on a custom modeling file, so you need to add trust_remote_code=True when loading it.
See #13467
The LSG paper is available on ArXiv (link).
The GitHub page / conversion script is available at this link.
This model is a small version of LEGAL-BERT, without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer.
This model can handle long sequences efficiently, and is faster and more performant than Longformer or BigBird (from Transformers); it relies on Local + Sparse + Global attention (LSG).
The model expects input sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads sequences when needed (adaptive=True in the config). It is nevertheless recommended to truncate inputs with the tokenizer (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...); a minimal tokenization sketch is given below.
Encoder-decoder usage is supported but has not been extensively tested.
Implemented in PyTorch.
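To illustrate the truncation/padding recommendation above, here is a minimal tokenization sketch; pad_to_multiple_of=128 matches the default block_size listed in the parameter section and is otherwise just an assumption:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ccdv/legal-lsg-small-uncased-4096")

# Truncate long inputs and pad to a multiple of the block size
# (128 is the default block_size; adjust if you change it in the config).
inputs = tokenizer(
    "A very long legal document. " * 500,
    return_tensors="pt",
    truncation=True,
    padding=True,
    pad_to_multiple_of=128
)
print(inputs["input_ids"].shape)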

Usage
The model relies on a custom modeling file; add trust_remote_code=True to load it:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("ccdv/legal-lsg-small-uncased-4096", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ccdv/legal-lsg-small-uncased-4096")
Parameters
You can change various parameters, such as:
- the number of global tokens (num_global_tokens=1)
- the local block size (block_size=128)
- the sparse block size (sparse_block_size=128)
- the sparsity factor (sparsity_factor=2)
- mask_first_token (masks the first token since it is redundant with the first global token)
- see config.json for the full list
The default parameters work well in practice. If you are short on memory, reduce the block size, increase the sparsity factor, and remove dropout in the attention score matrix.
from transformers import AutoModel

model = AutoModel.from_pretrained("ccdv/legal-lsg-small-uncased-4096",
    trust_remote_code=True,
    num_global_tokens=16,
    block_size=64,
    sparse_block_size=64,
    attention_probs_dropout_prob=0.0,
    sparsity_factor=4,
    sparsity_type="none",
    mask_first_token=True
)
Sparse selection type
There are 6 different sparse selection patterns; the best type is task dependent. If sparse_block_size=0 or sparsity_type="none", only local attention is used. Note that for sequences with length < 2*block_size, the type has no effect. A configuration sketch is shown after the list below.
- sparsity_type="bos_pooling" (new)
  - weighted average pooling using the BOS token
  - works best in general, especially with a rather large sparsity factor (8, 16, 32)
  - additional parameters: none
- sparsity_type="norm", select the highest-norm tokens
- sparsity_type="pooling", use average pooling to merge tokens
- sparsity_type="lsh", use the LSH algorithm to cluster similar tokens
  - works best for a large sparsity factor (4+)
  - LSH relies on random projections, so inference may differ slightly with different seeds
  - additional parameters:
    - lsg_num_pre_rounds=1, pre-merge tokens n times before computing centroids
- sparsity_type="stride", use a striding mechanism per head
- sparsity_type="block_stride", use a block striding mechanism per head
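For example, here is a minimal sketch of selecting one of these patterns when loading the model; the specific values are illustrative only, not recommendations:

from transformers import AutoModel

# Illustrative values: LSH sparse selection with a larger sparsity factor.
# lsg_num_pre_rounds is the additional LSH parameter described above.
model = AutoModel.from_pretrained("ccdv/legal-lsg-small-uncased-4096",
    trust_remote_code=True,
    sparsity_type="lsh",
    sparsity_factor=4,
    lsg_num_pre_rounds=1
)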
Tasks
Fill mask example:
from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("ccdv/legal-lsg-small-uncased-4096", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ccdv/legal-lsg-small-uncased-4096")
# LEGAL-BERT uses a BERT-style tokenizer, so the mask token is [MASK]
SENTENCES = ["Paris is the [MASK] of France.", "The goal of life is [MASK]."]
pipeline = FillMaskPipeline(model, tokenizer)
output = pipeline(SENTENCES, top_k=1)
output = [o[0]["sequence"] for o in output]
> ['Paris is the capital of France.', 'The goal of life is happiness.']
Classification example:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("ccdv/legal-lsg-small-uncased-4096",
    trust_remote_code=True,
    pool_with_global=True,  # pool with a global token instead of the first token
)
tokenizer = AutoTokenizer.from_pretrained("ccdv/legal-lsg-small-uncased-4096")
SENTENCE = "This is a test for sequence classification. " * 300
token_ids = tokenizer(
    SENTENCE,
    return_tensors="pt",
    truncation=True
)
output = model(**token_ids)
> SequenceClassifierOutput(loss=None, logits=tensor([[-0.3051, -0.1762]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
Training global tokens
To train the global tokens and the classification head only:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("ccdv/legal-lsg-small-uncased-4096",
    trust_remote_code=True,
    pool_with_global=True,
    num_global_tokens=16
)
tokenizer = AutoTokenizer.from_pretrained("ccdv/legal-lsg-small-uncased-4096")
for name, param in model.named_parameters():
    if "global_embeddings" not in name:
        param.requires_grad = False
    else:
        param.requires_grad = True
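As an optional sanity check (not part of the original snippet), you can list the parameters that still require gradients:

# With the loop above, only parameters whose name contains "global_embeddings" should be listed.
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(trainable)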
LEGAL-BERT citation:
@inproceedings{chalkidis-etal-2020-legal,
    title = "{LEGAL}-{BERT}: The Muppets straight out of Law School",
    author = "Chalkidis, Ilias and
      Fergadiotis, Manos and
      Malakasiotis, Prodromos and
      Aletras, Nikolaos and
      Androutsopoulos, Ion",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    doi = "10.18653/v1/2020.findings-emnlp.261",
    pages = "2898--2904"
}