bert-offensive-lang-detection-tr open-source model: free detection of offensive language in Turkish text

Bert Offensive Lang Detection Tr

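Developed by TURKCELL

A BERT-based Turkish text classification model that detects offensive language in text.

Text classification

Transformers

Open-source license: MIT #Turkish text classification #Social media content moderation #BERT fine-tuning

Downloads: 43

Released: 1/30/2024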

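Model overview

This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased specifically for offensive language detection in Turkish.

Model features

Optimized for Turkish

Tailored to the characteristics of Turkish, including character handling and text preprocessing.

Comprehensive preprocessing pipeline

Includes accented-character conversion, lowercasing, user-mention removal, and other text-cleaning steps.

Handling of imbalanced data

Tuned for an imbalanced dataset in which offensive samples are the minority.

Model capabilities

Turkish text classification

Offensive language detection

Social media text analysis

Use cases

Content moderation

Social media comment moderation

Automatically identifies and filters offensive comments on social media.

Can help reduce the manual moderation workload.

Online community management

Detects inappropriate speech in forums and discussion boards.

Helps maintain a healthy environment for online discussion.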

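🚀 Turkish Offensive Language Detection Model

This project is a model for detecting offensive language in Turkish. It fine-tunes a pretrained BERT model on the OffensEval 2020 dataset to identify offensive content in text.

🚀 Quick start

Install the required libraries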

pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing

Model initialization

# Load the model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

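Check whether a sentence is offensive

The clean_text helper called below is defined later in this document, together with the preprocessing functions it depends on; define those first.

import numpy as np
import torch

# Run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # inference mode: disables dropout

def is_offensive(sentence):
    # Map class indices to human-readable labels
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')

    # Move the input tensors to the same device as the model
    test_sample = {k: v.to(device) for k, v in test_sample.items()}

    # No gradients are needed for prediction
    with torch.no_grad():
        output = model(**test_sample)
    y_pred = np.argmax(output.logits.cpu().numpy(), axis=1)

    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]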

is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
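
is_offensive returns only the most likely label. Where a moderation queue needs a confidence score or an adjustable cut-off, the two logits can be converted to probabilities first. The following is a minimal sketch, not part of the original model card: offensive_probability, should_flag, and the 0.5 threshold are hypothetical, and it assumes index 1 is the offensive class, as in the label mapping above.

```python
import math

def offensive_probability(logits):
    """Softmax over one row of two logits; returns P(offensive).

    Assumes index 1 is the 'offensive' class, as in the label mapping above.
    """
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def should_flag(logits, threshold=0.5):
    # Hypothetical cut-off; raise it to flag fewer comments
    return offensive_probability(logits) >= threshold

example_logits = [-0.5, 1.5]                     # made-up values for illustration
print(offensive_probability(example_logits))
print(should_flag(example_logits))
```

With the model loaded as above, a row of logits for one sentence can be obtained with output.logits[0].tolist().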

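✨ Key features

Fine-tuned from the dbmdz/bert-base-turkish-128k-uncased model.
Trained on the OffensEval 2020 dataset, which contains 31,756 annotated tweets.
Analyzes the performance impact of several preprocessing steps to identify an effective preprocessing strategy.
Reaches 89% accuracy on the test set.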

💻 Usage examples

Basic usage

Preprocessing functions

from turkish.deasciifier import Deasciifier

def deasciifier(text):
    # Restore Turkish characters in text that was typed with ASCII-only letters
    converter = Deasciifier(text)
    return converter.convert_to_turkish()

def remove_circumflex(text):
    # Map circumflexed vowels to their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }

    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Lowercase with Turkish casing rules: dotless 'I' -> 'ı', dotted 'İ' -> 'i'
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
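
The reason for a dedicated turkish_lower is that Python's built-in str.lower() follows the default Unicode casing rules, which do not match Turkish for the letters I and İ. The sketch below illustrates the difference with a condensed variant of the function above; it keeps only the two mappings that actually differ from str.lower(), and the sample word is chosen purely for illustration.

```python
def turkish_lower(text):
    # Condensed variant: only 'I' and 'İ' behave differently from str.lower()
    turkish_map = {'I': 'ı', 'İ': 'i'}
    return ''.join(turkish_map.get(c, c).lower() for c in text)

word = "IŞIK"
print(word.lower())          # default casing turns 'I' into dotted 'i'
print(turkish_lower(word))   # Turkish casing turns 'I' into dotless 'ı'

# Default casing of 'İ' produces two code points: 'i' plus U+0307 (combining dot above)
print([hex(ord(c)) for c in "İ".lower()])
print([hex(ord(c)) for c in turkish_lower("İ")])
```

Without the mapping, the same word would reach the tokenizer in a different form from the Turkish-lowercased text, which is why the lowercasing step runs before tokenization.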

Text cleaning function

import re

def clean_text(text):
    # Replace circumflexed characters with their plain equivalents
    text = remove_circumflex(text)
    # Convert the text to lowercase
    text = turkish_lower(text)
    # Convert ASCII-only text to Turkish characters
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and punctuated emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols and pictographs
                           u"\U0001F680-\U0001F6FF"  # transport and map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)

    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text
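
To see what the regex steps of clean_text do without installing the deasciifier, they can be tried in isolation. The sketch below is a reduced illustration, not a replacement for clean_text: strip_social_markup is a hypothetical name, and it applies only the mention, hashtag, URL, punctuation, and whitespace steps, using the same patterns as above.

```python
import re

def strip_social_markup(text):
    text = re.sub(r"@\S*", " ", text)                      # user mentions
    text = re.sub(r"#\S+", " ", text)                      # hashtags
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)   # URLs
    text = re.sub(r"[^\w\s]", " ", text)                   # punctuation
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

tweet = "@USER Merhaba dünya! #selam https://t.co/abc :)"
print(strip_social_markup(tweet))  # Merhaba dünya
```

Turkish letters such as ü and ı survive the punctuation step because \w matches Unicode word characters for str patterns in Python 3.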

📚 Detailed documentation

Model description

This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased on the OffensEval 2020 dataset. The Offenseval-tr dataset contains 31,756 annotated tweets.

Dataset distribution

	Non-offensive (0)	Offensive (1)
Training set	25625	6131
Test set	2812	716
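
The counts above make the class imbalance concrete and help put the reported accuracy in context. The sketch below computes each split's class shares, the accuracy a classifier would get by always predicting the majority class, and inverse-frequency class weights. The weighting scheme is an assumption shown for illustration only: the source does not say which technique was used to handle the imbalance, and the function names are hypothetical.

```python
# Counts copied from the dataset distribution table
train = {"non-offensive": 25625, "offensive": 6131}
test = {"non-offensive": 2812, "offensive": 716}

def class_shares(counts):
    """Fraction of samples belonging to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def inverse_frequency_weights(counts):
    """One common weighting scheme: total / (num_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

train_shares = class_shares(train)
majority_baseline = max(class_shares(test).values())  # always predict the larger class
weights = inverse_frequency_weights(train)

print(f"offensive share of training set: {train_shares['offensive']:.3f}")
print(f"majority-class accuracy on test set: {majority_baseline:.3f}")
print({label: round(w, 3) for label, w in weights.items()})
```

Offensive tweets make up roughly a fifth of each split, so a model that always answers "non-offensive" would already score about 80% on the test set; the 89% accuracy is best read against that baseline.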

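Preprocessing steps

Step	Description
Accented-character conversion	Converts accented characters to their unaccented equivalents
Lowercasing	Converts all text to lowercase
Remove @user mentions	Removes @user-style user mentions from the text
Remove hashtag expressions	Removes #hashtag-style expressions from the text
Remove URLs	Removes URLs from the text
Remove punctuation and punctuated emoticons	Removes punctuation marks and emoticons made of punctuation
Remove emojis	Removes emojis from the text
Deasciification	Converts ASCII text to text containing Turkish characters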