bert-base-turkish-sentiment-cased: an open-source model for classifying the sentiment polarity of Turkish text

Bert Base Turkish Sentiment Cased

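Developed by savasy

A sentiment analysis model fine-tuned from the Turkish BERTurk model. It classifies the sentiment polarity (positive or negative) of Turkish text.

Text Classification · Other · #Turkish sentiment analysis #Fine-tuned BERT model #High accuracy (95.4%)

Downloads: 17.32k

Released: 3/2/2022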

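Model overview

The model was developed specifically for Turkish sentiment analysis. It uses the BERT architecture and was trained on Turkish product reviews, movie reviews, and tweets to identify the sentiment polarity of Turkish text.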

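Model features

High accuracy

Reaches about 95.4% accuracy in the reported evaluation run (7,999 examples)

Training data from multiple sources

Combines product reviews, movie reviews, and tweets, covering several text types

Turkish-specific base model

Built on BERTurk, a BERT model pretrained for Turkish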

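Model capabilities

Sentiment analysis of Turkish text

Positive/negative sentiment classification

Sentiment classification of product reviews

Sentiment classification of movie reviews

Sentiment analysis of social media text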

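Use cases

E-commerce

Product review analysis

Analyze the sentiment of product reviews on Turkish e-commerce platforms

Covers reviews of books, DVDs, electronics, and similar products

Film and entertainment

Movie review analysis

Analyze the sentiment of user reviews on Turkish movie websites

Separates positive reviews (rating ≥ 4) from negative reviews (rating ≤ 2)

Social media monitoring

Tweet sentiment analysis

Monitor public sentiment in Turkish-language tweets

Applicable to brand reputation management and public opinion monitoring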

🚀 BERT-based sentiment analysis model for Turkish

This model performs sentiment analysis and is built on BERTurk, a BERT model for Turkish. Model: https://huggingface.co/savasy/bert-base-turkish-sentiment-cased , BERTurk model: https://huggingface.co/dbmdz/bert-base-turkish-cased

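🚀 Quick start

Follow these steps to run sentiment analysis with the model.

Install the required library:

pip install transformers

Then run sentiment analysis with the following code: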

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True

p = sa("Film çok kötü ve çok sahteydi")
print(p)
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
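
The pipeline returns generic label names, so downstream code usually maps them to readable polarities. The helper below is a minimal sketch of that step. It assumes, based on the example outputs above, that `LABEL_1` means positive and `LABEL_0` means negative; the names `POLARITY` and `to_polarity` are illustrative and are not part of the model or of transformers.

```python
# Assumed mapping, inferred from the example outputs above.
POLARITY = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_polarity(prediction):
    """Convert one pipeline result dict into a (polarity, score) tuple."""
    label = prediction["label"]
    if label not in POLARITY:
        # Fail loudly instead of guessing if the label scheme differs.
        raise ValueError(f"unexpected label: {label!r}")
    return POLARITY[label], float(prediction["score"])


# Result dicts copied from the examples above, so no model download is needed.
positive_example = to_polarity({"label": "LABEL_1", "score": 0.9871089})
negative_example = to_polarity({"label": "LABEL_0", "score": 0.9975505})
print(positive_example)
print(negative_example)
```

With a loaded pipeline, the same helper would be applied as `to_polarity(sa(text)[0])`.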

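✨ Key features

Built on BERTurk and suited to Turkish sentiment analysis tasks.
Comes with training and usage examples to help users get started quickly.
Reached about 95.4% accuracy in the reported experiment.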

📦 Installation

To use the model, install the transformers library with the following command:

pip install transformers

💻 Usage examples

Advanced usage

Sentiment analysis of reviews stored in a file

Assume the file has one review per line, followed by a tab and its label (1 or 0):

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

input_file = "/path/to/your/file/yourfile.tsv"

# i counts the evaluated rows, crr counts the correct predictions
i, crr = 0, 0
with open(input_file, encoding="utf-8") as f:
    for line in f:
        fields = line.strip().split("\t")
        if len(fields) == 2:  # expect exactly: review text, label

            i = i + 1
            if i % 100 == 0:
                print(i)  # progress report every 100 rows

            pred = sa(fields[0])
            # "LABEL_1" -> "1", "LABEL_0" -> "0"
            pred = pred[0]["label"].split("_")[1]

            if pred == fields[1]:
                crr = crr + 1

# correct predictions, rows evaluated, accuracy (guarding against an empty file)
print(crr, i, crr / i if i else 0.0)
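
The evaluation loop can also be separated from the model so that it is easy to check on a small file first. The sketch below is one possible arrangement, not the author's code: `evaluate_tsv` and `keyword_stub` are hypothetical names, and the stub classifier only stands in for the real pipeline so the example runs without downloading anything.

```python
import csv
import os
import tempfile


def evaluate_tsv(path, classify):
    """Return (correct, total) for a file of `text<TAB>label` rows.

    `classify` is any callable that maps a text to the string "0" or "1".
    """
    correct = total = 0
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            text, gold = row
            total += 1
            correct += classify(text) == gold.strip()
    return correct, total


def keyword_stub(text):
    # Stand-in for the real model: predicts "1" when "kaliteli" appears.
    return "1" if "kaliteli" in text else "0"


# Build a tiny two-row demo file (plus one blank line that must be skipped).
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("bu telefon çok kaliteli\t1\nFilm çok kötü\t0\n\n")
    demo_path = tmp.name

correct, total = evaluate_tsv(demo_path, keyword_stub)
os.remove(demo_path)
print(correct, total, correct / total if total else 0.0)
```

To score the real model, pass a wrapper around the pipeline instead of the stub, for example `lambda text: sa(text)[0]["label"].split("_")[1]`.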

📚 Detailed documentation

Dataset

The dataset is taken from studies [2] and [3], and the two sources were merged.

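Study [2] collected movie and product reviews. The products are books, DVDs, electronics, and kitchen products.

The movie dataset comes from a cinema web page (Beyazperde) and contains 5,331 positive and 5,331 negative sentences. Reviews on the page are rated from 0 to 5 by the users who wrote them. The study treats a review as positive if its rating is greater than or equal to 4, and as negative if its rating is less than or equal to 2.

The authors also built a Turkish product review benchmark from an online retailer's web page, covering reviews of several product categories (books, DVDs, etc.). These reviews are rated from 1 to 5, and most of them have a rating of 5. Each category has 700 positive and 700 negative reviews; the average rating is 2.27 for the negative reviews and 4.5 for the positive ones. This dataset is also used in study [1].

Study [3] collected a tweet dataset. It proposes a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.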
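
The rating rule described for the movie reviews can be written as a small function. This is a sketch under two assumptions that the text does not state: reviews with ratings strictly between 2 and 4 are left out, and positive is encoded as 1 and negative as 0 (matching the 1/0 labels used in the file example). The name `rating_to_label` is illustrative.

```python
def rating_to_label(rating):
    """Map a user rating to 1 (positive), 0 (negative), or None (left out)."""
    if rating >= 4:
        return 1
    if rating <= 2:
        return 0
    # Assumption: ratings strictly between 2 and 4 belong to neither class.
    return None


ratings = [5, 4, 3, 2, 0]
labels = [rating_to_label(r) for r in ratings]
print(labels)
```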

Merged dataset

Size	File
8000	dev.tsv
8262	test.tsv
32000	train.tsv
48290	Total

The three split sizes listed above sum to 48,262.

References for the dataset

[1] Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. doi:10.1007/978-981-15-1216-2_12.

[2] Demirtas, Erkin and Mykola Pechenizkiy. (2013). Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM '13).

[3] Hayran, A., Sert, M. (2017). Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques. IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey.

Training

export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2

# The base checkpoint below is the one given in the source command (uncased),
# although the BERTurk checkpoint linked earlier on this page is the cased one.
python3 run_glue.py \
  --model_type bert \
  --model_name_or_path dbmdz/bert-base-turkish-uncased \
  --task_name "$TASK_NAME" \
  --do_train \
  --do_eval \
  --data_dir "$GLUE_DIR" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
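
The command reads its data from the directory given by `--data_dir`. As far as I know, the SST-2 task in the `run_glue.py` versions of that period reads `train.tsv` and `dev.tsv`, each with a header row followed by `sentence<TAB>label` rows; treat that layout as an assumption and check it against the script version you use. The sketch below writes one split in that layout. The helper `write_sst2_split` is hypothetical, the sample rows reuse the two sentences from the usage example, and the labels assume 1 = positive and 0 = negative.

```python
import os
import tempfile


def write_sst2_split(path, examples):
    """Write (sentence, label) pairs as a header row plus tab-separated rows."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("sentence\tlabel\n")
        for sentence, label in examples:
            # Collapse tabs and newlines so each example stays on one row.
            clean = " ".join(sentence.split())
            f.write(f"{clean}\t{label}\n")


examples = [
    ("bu telefon modelleri çok kaliteli , her parçası çok özel bence", 1),
    ("Film çok kötü ve çok sahteydi", 0),
]

# A temporary directory stands in for the real data directory here.
with tempfile.TemporaryDirectory() as data_dir:
    dev_path = os.path.join(data_dir, "dev.tsv")
    write_sst2_split(dev_path, examples)
    with open(dev_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines[0].split("\t"))
print(len(lines) - 1, "examples")
```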

Results

05/10/2020 17:00:43 - INFO - transformers.trainer -   ****** Running Evaluation ******
05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999
05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8
Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
05/10/2020 17:01:17 - INFO - __main__ -   ****** Eval results sst-2 ******
05/10/2020 17:01:17 - INFO - __main__ -     acc = 0.9539942492811602
05/10/2020 17:01:17 - INFO - __main__ -     loss = 0.16348013816401363

The accuracy is about 95.4%.
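
The figures in the log are consistent with one another, which a few lines of arithmetic can confirm. The snippet below only uses numbers copied from the log; the variable names are illustrative.

```python
import math

num_examples = 7999          # "Num examples" in the log
batch_size = 8               # "Batch size" in the log
acc = 0.9539942492811602     # "acc" in the log

# Evaluation steps: the last batch is partial, so round up.
num_batches = math.ceil(num_examples / batch_size)

# Accuracy is correct / total, so the number of correct predictions follows.
num_correct = round(acc * num_examples)

print(num_batches)           # matches the 1000/1000 progress bar
print(num_correct)
print(f"{acc:.1%}")
```

Under this reading, the reported accuracy corresponds to 7,631 correct predictions out of 7,999 evaluated examples.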

📄 Citation

If you use this model in your research, please cite:

@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks}, 
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@book{yildirim2021mastering,
  title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
  author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
  year={2021},
  publisher={Packt Publishing Ltd}
}