Open-source rebel-large model: end-to-end relation extraction supporting more than 200 relation types

Rebel Large

Developed by Babelscape

REBEL is a BART-based sequence-to-sequence model for end-to-end relation extraction that supports more than 200 different relation types.

Knowledge graph

Transformers

English #end-to-end relation extraction #sequence-to-sequence model #multiple relation types

Downloads: 37.57k

Released: 3/2/2022

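Model overview

REBEL simplifies the extraction of relation triplets from raw text by reframing relation extraction as a sequence-to-sequence task. It uses an autoregressive sequence-to-sequence model to generate relation triplets directly from text, which enables applications such as knowledge base population and fact-checking.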

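Model features

End-to-end relation extraction

Reduces relation extraction to a sequence-to-sequence task and generates relation triplets directly from text.

Support for many relation types

Supports more than 200 different relation types, which makes it suitable for a wide range of information extraction scenarios.

High performance

According to the paper, reaches state-of-the-art performance on most of the relation extraction and relation classification benchmarks it was fine-tuned on.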

Model capabilities

Relation extraction

Entity relation recognition

Knowledge base population

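Use cases

Knowledge base construction

Knowledge base population

Extract relation triplets from unstructured text to populate or validate a knowledge base.

Improves the coverage and accuracy of the knowledge base.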
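
As a rough illustration of the population step, the sketch below merges extracted triplets into an in-memory knowledge base and reports which facts are new. It assumes the triplets arrive as dictionaries with `head`, `type`, and `tail` keys, the shape returned by the `extract_triplets` helper in the usage examples on this page. The function name `add_to_knowledge_base`, the set-of-tuples storage, and the sample facts are all hypothetical choices made for the example, not part of REBEL.

```python
def add_to_knowledge_base(kb, triplets):
    """Add triplet dicts to kb (a set of tuples) and return the newly added facts."""
    added = []
    for t in triplets:
        fact = (t["head"], t["type"], t["tail"])
        if fact not in kb:
            kb.add(fact)
            added.append(fact)
    return added


# Hand-written sample data for illustration, not actual model output.
kb = {("Punta Cana", "country", "Dominican Republic")}
extracted = [
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
    {"head": "La Altagracia Province", "type": "country", "tail": "Dominican Republic"},
]

new_facts = add_to_knowledge_base(kb, extracted)
print(new_facts)
```

A real knowledge base would normally also link entity names to canonical identifiers before merging; this sketch compares surface strings only.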

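Information extraction

Fact-checking

Extract relation triplets from text to verify the accuracy of factual claims.

Supports automated fact-checking workflows.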
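
One simple way to use extracted triplets in such a workflow is to look up a claimed fact among the triplets extracted from a trusted text. The sketch below does only that. The helper names `normalize` and `check_claim`, the case-insensitive string matching, and the sample data are assumptions made for illustration, and "not found" is deliberately not reported as "false", since a missing triplet does not refute a claim.

```python
def normalize(fact):
    """Lower-case each part and collapse whitespace so trivial variants match."""
    return tuple(" ".join(part.lower().split()) for part in fact)


def check_claim(claim, evidence_triplets):
    """Return 'supported' if the claimed (head, type, tail) appears in the evidence."""
    evidence = {
        normalize((t["head"], t["type"], t["tail"])) for t in evidence_triplets
    }
    return "supported" if normalize(claim) in evidence else "not found"


# Hand-written sample evidence for illustration, not actual model output.
evidence = [
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
]

verdict_a = check_claim(("punta cana", "country", "Dominican  Republic"), evidence)
verdict_b = check_claim(("Punta Cana", "country", "Haiti"), evidence)
print(verdict_a, verdict_b)
```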

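🚀 REBEL: Relation Extraction By End-to-end Language generation

REBEL proposes a new linearization approach and reframes relation extraction as a sequence-to-sequence (seq2seq) task. The model can be used to extract relation triplets from raw text for downstream tasks such as knowledge graph population and fact-checking.

Multilingual update! Check out mREBEL, a multilingual version that covers more relation types and languages and also includes entity types.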

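✨ Key features

New linearization approach: represents relation triplets as a sequence of text, which simplifies the relation extraction task.
End-to-end relation extraction: a BART-based seq2seq model that performs end-to-end relation extraction for more than 200 different relation types.
Flexible: fine-tuned on a range of relation extraction and relation classification benchmarks, reaching state-of-the-art performance on most of them.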
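
To make the linearization idea concrete, here is a minimal sketch that turns triplets into a marker-delimited string. The layout (`<triplet> head <subj> tail <obj> relation`, with further `<subj> ... <obj> ...` groups appended for the same head) is inferred from how the parsing code in the usage examples on this page reads the model output. The function name `linearize_triplets` and the sample triplets are hypothetical, and the sketch takes the triplet order as given rather than reproducing any ordering rule from the paper.

```python
def linearize_triplets(triplets):
    """Turn (head, relation, tail) tuples into a REBEL-style target string."""
    # Group tails and relations under their head, keeping first-seen order.
    by_head = {}
    for head, relation, tail in triplets:
        by_head.setdefault(head, []).append((tail, relation))

    parts = []
    for head, pairs in by_head.items():
        parts.append(f"<triplet> {head}")
        for tail, relation in pairs:
            parts.append(f"<subj> {tail} <obj> {relation}")
    return " ".join(parts)


# Hand-written sample triplets for illustration, not actual model output.
example = [
    ("Punta Cana", "country", "Dominican Republic"),
    ("Punta Cana", "located in the administrative territorial entity", "La Altagracia Province"),
    ("La Altagracia Province", "country", "Dominican Republic"),
]

linearized = linearize_triplets(example)
print(linearized)
```

Because both triplets headed by "Punta Cana" share one `<triplet>` marker, the head entity is written only once, which keeps the target sequence short.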

📚 Documentation

This is the model card for the Findings of EMNLP 2021 paper REBEL: Relation Extraction By End-to-end Language generation. If you use the code, please cite this work in your paper:

@inproceedings{huguet-cabot-navigli-2021-rebel-relation,
    title = "{REBEL}: Relation Extraction By End-to-end Language generation",
    author = "Huguet Cabot, Pere-Llu{\'\i}s  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.204",
    pages = "2370--2381",
    abstract = "Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model{'}s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.",
}

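The original repository for the paper can be found here.

Note that the inference widget on the right does not output the special tokens, which are needed to tell the subject, object, and relation type apart. For a demo of REBEL and its pre-training dataset, see the Spaces demo.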

💻 Usage examples

Basic usage

from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor("Punta Cana is a resort town in the municipality of Higuey, in La Altagracia Province, the eastern most province of the Dominican Republic", return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)

Advanced usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Text to extract triplets from
text = 'Punta Cana is a resort town in the municipality of Higüey, in La Altagracia Province, the easternmost province of the Dominican Republic.'

# Tokenize text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors = 'pt')

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(model.device),
    attention_mask=model_inputs["attention_mask"].to(model.device),
    **gen_kwargs,
)

# Extract text
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

# Extract triplets
for idx, sentence in enumerate(decoded_preds):
    print(f'Prediction triplets sentence {idx}')
    print(extract_triplets(sentence))
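
Since the generation settings above return three sequences, one possible follow-up is to keep only the triplets that several of them agree on. The sketch below is one such heuristic and is not part of REBEL. The function name `vote_triplets`, the `min_votes` threshold, and the sample data are assumptions made for illustration; the input is a list with one list of triplet dictionaries per returned sequence.

```python
from collections import Counter


def vote_triplets(per_sequence_triplets, min_votes=2):
    """Keep facts that appear in at least min_votes of the returned sequences."""
    counts = Counter()
    for triplets in per_sequence_triplets:
        # A set ensures each fact is counted at most once per sequence.
        counts.update({(t["head"], t["type"], t["tail"]) for t in triplets})
    return sorted(fact for fact, votes in counts.items() if votes >= min_votes)


# Hand-written sample data standing in for three parsed beams, not model output.
beams = [
    [{"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
     {"head": "Higüey", "type": "country", "tail": "Dominican Republic"}],
    [{"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"}],
    [{"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
     {"head": "Higüey", "type": "country", "tail": "Dominican Republic"},
     {"head": "Higüey", "type": "instance of", "tail": "municipality"}],
]

agreed = vote_triplets(beams)
print(agreed)
```

With the code above, the input could be built as `[extract_triplets(s) for s in decoded_preds]`.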

📄 License

This project is licensed under cc-by-nc-sa-4.0.

📦 Model information

Property	Details
Model type	seq2seq
Training data	Babelscape/rebel-dataset
Task type	Relation extraction
Evaluation datasets	CoNLL04, NYT
CoNLL04 metric	RE+ Macro F1: 76.65
NYT metric	F1: 93.4
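
For readers unfamiliar with triplet-level scoring, the sketch below computes precision, recall, and F1 for a single example by exact match of (head, type, tail) tuples. It is not the evaluation script behind the figures in the table: those come from the authors' own setup, and a macro-averaged score such as the CoNLL04 one would additionally average per-relation-type scores. The function name `triplet_prf` and the sample data are hypothetical.

```python
def triplet_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of (head, type, tail) tuples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-written sample data for illustration only.
gold = [
    ("Punta Cana", "country", "Dominican Republic"),
    ("La Altagracia Province", "country", "Dominican Republic"),
]
predicted = [
    ("Punta Cana", "country", "Dominican Republic"),
    ("Punta Cana", "country", "Haiti"),
]

precision, recall, f1 = triplet_prf(predicted, gold)
print(precision, recall, f1)
```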