mT5_multilingual_XLSum_rust open-source model - free summarization in 45 languages

Mt5 Multilingual XLSum Rust

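Developed by spursyy

An mT5 model fine-tuned on the XL-Sum dataset in 45 languages for multilingual summarization.

Text generation | Multilingual | #Multilingual summarization #45 languages supported #News summarization

Downloads: 18

Released: 2/1/2023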

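Model Overview

This is a multilingual summarization model built on the mT5 architecture. It supports text summarization in 45 languages and is tuned specifically for news summarization.

Model Features

Multilingual support

Generates summaries in 45 languages, including a range of Asian, African and European languages.

High-quality summaries

Fine-tuned on the XL-Sum dataset to produce accurate, concise news summaries.

Based on the mT5 architecture

Uses the mT5 multilingual Transformer architecture, which offers good transfer-learning capability.

Model Capabilities

Text summarization

Multilingual processing

News content condensation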

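Use Cases

News media

Multilingual news summarization

Provides automated multilingual news summarization for international news organizations.

Quickly produces key points in multiple languages, which makes content distribution more efficient.

Content analysis

Cross-lingual content analysis

Analyzes news content in different languages and summarizes it in a single common language.

Makes it easier to compare how media in different languages report the same event.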

🚀 mT5-multilingual-XLSum

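This project contains the mT5 checkpoint fine-tuned on the 45 languages of the XL-Sum dataset. For fine-tuning details and scripts, see the paper and the official repository.

🚀 Quick Start

Requirements

This project was tested with transformers version 4.11.0.dev0.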
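To see how a local environment compares with the version named above, a small check along these lines may help. It is a sketch only: the helper names `installed_version` and `release_tuple` are made up for illustration, and the source does not say which other versions work, so the comparison is informational rather than a compatibility guarantee.

```python
from importlib import metadata

TESTED_VERSION = "4.11.0.dev0"  # version named in the model card


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Keep the leading purely numeric components: "4.11.0.dev0" -> (4, 11, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
elif release_tuple(found) < release_tuple(TESTED_VERSION):
    print(f"transformers {found} is older than the tested {TESTED_VERSION}")
else:
    print(f"transformers {found} found (model card tested with {TESTED_VERSION})")
```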

Code example

import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

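# Collapse newlines and runs of whitespace into single spaces.
# Raw strings avoid Python's invalid-escape warning for '\s'.
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))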

article_text = """Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company said.  The policy includes the termination of accounts of anti-vaccine influencers.  Tech giants have been criticised for not doing more to counter false health information on their sites.  In July, US President Joe Biden said social media platforms were largely responsible for people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue.  YouTube, which is owned by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation about Covid vaccines.  In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B.  "We're expanding our medical misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and effective by local health authorities and the WHO," the post said, referring to the World Health Organization."""

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)
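The example above cleans the article with WHITESPACE_HANDLER before tokenizing it. The sketch below shows what that step does on a tiny input. The function name `normalize_whitespace` is an illustrative stand-in, not part of the original example; it uses a single `\s+` substitution, which gives the same result as the two nested substitutions because `\s` already matches newlines.

```python
import re


def normalize_whitespace(text):
    """Trim the text and collapse every run of whitespace into one space."""
    return re.sub(r"\s+", " ", text.strip())


raw = "  First line.\n\nSecond\tline.  "
print(normalize_whitespace(raw))
```

Passing the article through this step first keeps stray line breaks and double spaces from being tokenized, which matters when the input is truncated to 512 tokens.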

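✨ Key Features

Multilingual support: the model is fine-tuned on the 45 languages of the XL-Sum dataset and can handle summarization in each of them.
Measured performance: ROUGE-1, ROUGE-2 and ROUGE-L scores are reported for every language on the XL-Sum test set.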
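For readers unfamiliar with ROUGE, here is a minimal sketch of the idea behind ROUGE-1: unigram overlap between a reference summary and a generated one. It assumes simple lowercasing and whitespace tokenization, which is not suitable for languages without spaces between words, and it is not the scoring script behind the reported numbers. The function name `rouge1` is illustrative.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return (precision, recall, f1) of unigram overlap, each in [0, 1]."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("the cat sat on the mat", "the cat sat")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

ROUGE-2 applies the same counting to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram overlap.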

📚 Documentation

Benchmarks

Scores on the XL-Sum test set:

Language	ROUGE-1 / ROUGE-2 / ROUGE-L
Amharic	20.0485 / 7.4111 / 18.0753
Arabic	34.9107 / 14.7937 / 29.1623
Azerbaijani	21.4227 / 9.5214 / 19.3331
Bengali	29.5653 / 12.1095 / 25.1315
Burmese	15.9626 / 5.1477 / 14.1819
Chinese (Simplified)	39.4071 / 17.7913 / 33.406
Chinese (Traditional)	37.1866 / 17.1432 / 31.6184
English	37.601 / 15.1536 / 29.8817
French	35.3398 / 16.1739 / 28.2041
Gujarati	21.9619 / 7.7417 / 19.86
Hausa	39.4375 / 17.6786 / 31.6667
Hindi	38.5882 / 16.8802 / 32.0132
Igbo	31.6148 / 10.1605 / 24.5309
Indonesian	37.0049 / 17.0181 / 30.7561
Japanese	48.1544 / 23.8482 / 37.3636
Kirundi	31.9907 / 14.3685 / 25.8305
Korean	23.6745 / 11.4478 / 22.3619
Kyrgyz	18.3751 / 7.9608 / 16.5033
Marathi	22.0141 / 9.5439 / 19.9208
Nepali	26.6547 / 10.2479 / 24.2847
Oromo	18.7025 / 6.1694 / 16.1862
Pashto	38.4743 / 15.5475 / 31.9065
Persian	36.9425 / 16.1934 / 30.0701
Pidgin	37.9574 / 15.1234 / 29.872
Portuguese	37.1676 / 15.9022 / 28.5586
Punjabi	30.6973 / 12.2058 / 25.515
Russian	32.2164 / 13.6386 / 26.1689
Scottish Gaelic	29.0231 / 10.9893 / 22.8814
Serbian (Cyrillic)	23.7841 / 7.9816 / 20.1379
Serbian (Latin)	21.6443 / 6.6573 / 18.2336
Sinhala	27.2901 / 13.3815 / 23.4699
Somali	31.5563 / 11.5818 / 24.2232
Spanish	31.5071 / 11.8767 / 24.0746
Swahili	37.6673 / 17.8534 / 30.9146
Tamil	24.3326 / 11.0553 / 22.0741
Telugu	19.8571 / 7.0337 / 17.6101
Thai	37.3951 / 17.275 / 28.8796
Tigrinya	25.321 / 8.0157 / 21.1729
Turkish	32.9304 / 15.5709 / 29.2622
Ukrainian	23.9908 / 10.1431 / 20.9199
Urdu	39.5579 / 18.3733 / 32.8442
Uzbek	16.8281 / 6.3406 / 15.4055
Vietnamese	32.8826 / 16.2247 / 26.0844
Welsh	32.6599 / 11.596 / 26.1164
Yoruba	31.6595 / 11.6599 / 25.0898
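Because each row of the table is a language name, a tab, and three slash-separated scores, it is easy to load for comparison. The sketch below parses three rows copied from the table (language names given in English) and picks the language with the highest ROUGE-2. The helper `parse_row` and the variable names are illustrative, and the three-row sample is only there to keep the example short.

```python
# Three rows copied from the benchmark table: language, tab, R1 / R2 / RL.
ROWS = """\
English\t37.601 / 15.1536 / 29.8817
Japanese\t48.1544 / 23.8482 / 37.3636
Burmese\t15.9626 / 5.1477 / 14.1819
"""


def parse_row(line):
    """Split one table row into (language, rouge1, rouge2, rougeL)."""
    language, scores = line.split("\t")
    r1, r2, rl = (float(s) for s in scores.split("/"))
    return language, r1, r2, rl


table = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Rank by ROUGE-2 (index 2) and average it over the sampled rows.
best = max(table, key=lambda row: row[2])
mean_r2 = sum(row[2] for row in table) / len(table)

print(f"highest ROUGE-2 in sample: {best[0]} ({best[2]})")
print(f"mean ROUGE-2 over {len(table)} rows: {mean_r2:.4f}")
```

Pasting all 45 rows into `ROWS` would give the ranking for the whole table without changing the code.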

Citation

If you use this model, please cite the following paper:

@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}