bert-turkish-text-classification open-source model: classifies Turkish text into 7 predefined categories

Bert Turkish Text Classification

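Developed by savasy

A Turkish text classification model fine-tuned from a BERT architecture. It assigns Turkish text to one of 7 predefined categories.

Tags: Text Classification, Other, #Turkish BERT, #Multi-class classification, #News topic identification

Downloads: 523

Released: 3/2/2022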

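Model overview

The model is built for Turkish text classification and assigns each text to one of 7 categories: world, economy, culture, health, politics, sport, and technology.

Model features

Optimized for Turkish

Fine-tuned from a Turkish BERT model specifically for the Turkish text classification task.

Multi-class classification

Supports 7 distinct categories that cover the main news domains.

Easy to use

Loads through the transformers pipeline API in a few lines, which makes it easy to integrate into applications.

Model capabilities

Turkish text classification

Multi-class prediction

Text content analysis

Use cases

News classification

Automatic news classification

Automatically assigns Turkish news articles to the 7 predefined categories.

Accuracy is stated to match the level reported in the paper listed in the citation section.

Content analysis

Social media content analysis

Analyzes the topic distribution of Turkish social media content.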
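A topic distribution can be computed by counting the category predicted for each post and dividing by the total. The snippet below is a minimal sketch of that bookkeeping only. The function name `topic_distribution` and the sample labels are hypothetical and made up for illustration; they are not output from the model.

```python
from collections import Counter


def topic_distribution(labels):
    """Return the share of each predicted category, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical predicted categories for five posts (not real model output).
predicted = ["sport", "politics", "sport", "economy", "sport"]
distribution = topic_distribution(predicted)
print(distribution)
```

In practice the `predicted` list would be filled with the decoded category of each classified post.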

🚀 Turkish Text Classification

This model is fine-tuned from https://github.com/stefan-it/turkish-bert on a text classification dataset that contains the following 7 categories:

code_to_label = {
    'LABEL_0': 'world',
    'LABEL_1': 'economy',
    'LABEL_2': 'culture',
    'LABEL_3': 'health',
    'LABEL_4': 'politics',
    'LABEL_5': 'sport',
    'LABEL_6': 'technology',
}
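The transformers text-classification pipeline returns a list of dicts with `label` and `score` keys, so the mapping can be applied to a whole result list at once. The helper below is a small sketch of that step. The name `decode_predictions` is hypothetical, the category names are English renderings of the 7 categories, and the sample input only imitates the shape of pipeline output.

```python
CODE_TO_LABEL = {
    "LABEL_0": "world",
    "LABEL_1": "economy",
    "LABEL_2": "culture",
    "LABEL_3": "health",
    "LABEL_4": "politics",
    "LABEL_5": "sport",
    "LABEL_6": "technology",
}


def decode_predictions(predictions, code_to_label=CODE_TO_LABEL):
    """Replace generic LABEL_n codes with category names, keeping the scores.

    Codes missing from the mapping are passed through unchanged.
    """
    return [
        {"label": code_to_label.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


# Sample shaped like pipeline output (values are for illustration only).
sample = [{"label": "LABEL_2", "score": 0.4753005802631378}]
decoded = decode_predictions(sample)
print(decoded)
```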

🚀 Quick start

First, install the transformers library:

pip install transformers

# Import the libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification")

# Build and load the model; this can take a while depending on your network connection
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification")

# Create the pipeline ("sentiment-analysis" is an alias of the "text-classification" task)
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Apply the model
nlp("bla bla")
# [{'label': 'LABEL_2', 'score': 0.4753005802631378}]

# Map the generic label codes to category names
code_to_label = {
    'LABEL_0': 'world',
    'LABEL_1': 'economy',
    'LABEL_2': 'culture',
    'LABEL_3': 'health',
    'LABEL_4': 'politics',
    'LABEL_5': 'sport',
    'LABEL_6': 'technology',
}

code_to_label[nlp("bla bla")[0]['label']]
# > 'culture'
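The pipeline call above reports only the top category. For a single-label classifier like this one, the reported score is a softmax over the model's 7 output logits, so the full ranking can be recovered from the logits. The sketch below shows that conversion in plain Python. The logits are invented for illustration and are not real model output, and the names `softmax` and `rank_categories` are hypothetical helpers.

```python
import math

LABELS = ["world", "economy", "culture", "health", "politics", "sport", "technology"]


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_categories(logits, labels=LABELS):
    """Pair each category with its probability, most likely first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for one input text (made up, not model output).
example_logits = [0.1, -0.3, 1.2, 0.0, 0.4, -0.8, 0.2]
ranking = rank_categories(example_logits)
for name, prob in ranking:
    print(f"{name:<12}{prob:.3f}")
```

A flat ranking, where the top probability is close to the others, suggests the input does not clearly belong to any of the 7 categories.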

💻 Usage examples

Advanced usage

# Load the data for Turkish text classification
import pandas as pd
# https://www.kaggle.com/savasy/ttc4900
df = pd.read_csv("7allV03.csv")
df.columns = ["labels", "text"]
df.labels = pd.Categorical(df.labels)

# The train/eval split is elided in the original
train_df = ...
eval_df = ...

# Model
from simpletransformers.classification import ClassificationModel
import torch, sklearn.metrics

# Use the GPU when one is available
cuda_available = torch.cuda.is_available()

model_args = {
    "use_early_stopping": True,
    "early_stopping_delta": 0.01,
    "early_stopping_metric": "mcc",
    "early_stopping_metric_minimize": False,
    "early_stopping_patience": 5,
    "evaluate_during_training_steps": 1000,
    "fp16": False,
    "num_train_epochs": 3
}

model = ClassificationModel(
    "bert",
    "dbmdz/bert-base-turkish-cased",
    use_cuda=cuda_available,
    args=model_args,
    num_labels=7
)
model.train_model(train_df, acc=sklearn.metrics.accuracy_score)
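The early-stopping settings above combine a patience of 5 evaluations with a minimum improvement (delta) of 0.01 on a metric where higher is better. The sketch below illustrates the general patience/delta idea in plain Python. It is a simplified illustration under those assumptions, not simpletransformers' actual implementation, and both the function name `should_stop` and the MCC values are hypothetical.

```python
def should_stop(scores, patience=5, delta=0.01):
    """Return the index of the evaluation at which training would stop, or None.

    Higher scores are better. An evaluation counts as an improvement only
    if it beats the best score so far by more than `delta`.
    """
    best = None
    stale = 0  # consecutive evaluations without a sufficient improvement
    for i, score in enumerate(scores):
        if best is None or score - best > delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Hypothetical MCC values from successive evaluations (not real results).
mcc_history = [0.60, 0.70, 0.705, 0.70, 0.708, 0.709, 0.70, 0.75]
stop_at = should_stop(mcc_history)
print(stop_at)
```

In this made-up history the gains after the second evaluation all stay below the delta, so the later jump to 0.75 is never reached.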

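📚 Documentation

For detailed usage of Turkish text classification, see the Python notebook.

🔧 Technical details

This model is fine-tuned from https://github.com/stefan-it/turkish-bert on the following Turkish benchmark dataset: https://www.kaggle.com/savasy/ttc4900. For training other models, see https://simpletransformers.ai/.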

📄 License

Citation

To cite this work, refer to the following publications:

@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks}, 
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@book{yildirim2021mastering,
  title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
  author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
  year={2021},
  publisher={Packt Publishing Ltd}
}