# 🚀 Multimodal Classification Models (Tamil, Malayalam, Telugu)

This repository contains deep learning models for text and audio classification in three languages: Tamil, Malayalam, and Telugu. The models address text and audio classification in a multilingual setting and provide a practical tool for related research and applications.
## 🚀 Quick Start

### Clone the repository

```bash
git clone https://huggingface.co/vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages
cd Multimodal_Hate_Speech_Detection_in_Dravidian_languages
```

### Install dependencies

Make sure Python is installed, then run:

```bash
pip install -r requirements.txt
```
## ✨ Key Features

- Text and audio classification for three Dravidian languages (Tamil, Malayalam, Telugu).
- The text models use `xlm-roberta-large` for feature extraction, combined with a deep learning classifier.
- The audio models use MFCC feature extraction with a CNN-based classifier.
## 💻 Usage Examples

### Basic Usage

#### Load the models

```python
import pickle

import numpy as np
import tensorflow as tf
import torch
from transformers import AutoTokenizer, AutoModel

# Label encoders map predicted class indices back to human-readable labels
with open("text_label_encoders/tamil_label_encoder.pkl", "rb") as f:
    tamil_text_label_encoder = pickle.load(f)
with open("audio_label_encoders/tamil_audio_label_encoder.pkl", "rb") as f:
    tamil_audio_label_encoder = pickle.load(f)

text_model = tf.keras.models.load_model("text_models/tamil_text_model.h5")
audio_model = tf.keras.models.load_model("audio_models/tamil_audio_model.keras")
```
#### Text classification

```python
import advertools as adv
from indicnlp.tokenize import indic_tokenize

# Tamil stopword list from advertools
stopwords = list(sorted(adv.stopwords["tamil"]))

def preprocess_tamil_text(text):
    """Tokenize Tamil text and remove stopwords."""
    tokens = list(indic_tokenize.trivial_tokenize(text, lang="ta"))
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)

def extract_embeddings(model_name, texts):
    """Return mean-pooled last-hidden-state embeddings for a list of texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    embeddings = []
    batch_size = 16
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True,
                                       max_length=128, return_tensors="pt")
            outputs = model(**encoded_inputs)
            batch_embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
            embeddings.extend(batch_embeddings)
    return np.array(embeddings)

feature_extractor = "xlm-roberta-large"
text = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
processed_text = preprocess_tamil_text(text)
text_embeddings = extract_embeddings(feature_extractor, [processed_text])
text_predictions = text_model.predict(text_embeddings)
predicted_label = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
print("Predicted Label:", predicted_label[0])
```
#### Audio classification

```python
import librosa

def extract_audio_features(file_path, sr=22050, n_mfcc=40):
    """Load an audio file and return its mean MFCC vector (shape: (n_mfcc,))."""
    audio, _ = librosa.load(file_path, sr=sr)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfccs.T, axis=0)

def predict_audio(file_path):
    features = extract_audio_features(file_path)
    # The CNN expects input of shape (batch, 40, 1, 1)
    reshaped_features = features.reshape((1, 40, 1, 1))
    predicted_class = np.argmax(audio_model.predict(reshaped_features), axis=1)
    predicted_label = tamil_audio_label_encoder.inverse_transform(predicted_class)
    return predicted_label[0]

audio_file = "test_audio.wav"
predicted_audio_label = predict_audio(audio_file)
print("Predicted Audio Label:", predicted_audio_label)
```
### Advanced Usage

#### Batch-processing a dataset

```python
import os

import pandas as pd

def load_dataset(base_dir="../test", lang="tamil"):
    """Pair each transcript with its audio file, or "Nil" if the audio is missing."""
    dataset = []
    lang_dir = os.path.join(base_dir, lang)
    audio_dir = os.path.join(lang_dir, "audio")
    text_dir = os.path.join(lang_dir, "text")
    text_file = os.path.join(text_dir, [f for f in os.listdir(text_dir) if f.endswith(".xlsx")][0])
    text_df = pd.read_excel(text_file)
    audio_files = set(os.listdir(audio_dir))
    for file in text_df["File Name"]:
        transcript_row = text_df.loc[text_df["File Name"] == file]
        transcript = transcript_row.iloc[0]["Transcript"] if not transcript_row.empty else ""
        if (file + ".wav") in audio_files:
            dataset.append({"File Name": os.path.join(audio_dir, file + ".wav"), "Transcript": transcript})
        else:
            dataset.append({"File Name": "Nil", "Transcript": transcript})
    return pd.DataFrame(dataset)

dataset_df = load_dataset()
dataset_df["Transcript"] = dataset_df["Transcript"].apply(preprocess_tamil_text)
text_embeddings = extract_embeddings(feature_extractor, dataset_df["Transcript"].tolist())
text_predictions = text_model.predict(text_embeddings)
text_labels = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
dataset_df["Predicted Text Label"] = text_labels
dataset_df["Predicted Audio Label"] = dataset_df["File Name"].apply(
    lambda x: predict_audio(x) if x != "Nil" else "No Audio"
)
dataset_df.to_csv("predictions.tsv", sep="\t", index=False)
```
#### Deploying to Hugging Face

```bash
pip install huggingface_hub
huggingface-cli login
```

```python
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="text_models/tamil_text_model.h5",
    path_in_repo="text_models/tamil_text_model.h5",
    repo_id="<your-hf-repo>",
)
```
## 📚 Documentation

### Directory structure

```
├── audio_label_encoders/   # Label encoders for the audio models
├── audio_models/           # Trained audio classification models
├── text_label_encoders/    # Label encoders for the text models
└── text_models/            # Trained text classification models
```

Each folder contains three files, one each for Tamil, Malayalam, and Telugu.
### Evaluation Results

| Task                 | Dataset                                | Metric   | Value  |
|----------------------|----------------------------------------|----------|--------|
| Text classification  | Dravidian-language hate speech dataset | Macro F1 | 0.6438 |
| Audio classification | Dravidian-language hate speech dataset | Macro F1 | 0.88   |
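For reference, macro F1 (the metric reported above) averages per-class F1 scores, weighting every class equally regardless of class size. A minimal sketch with scikit-learn, using made-up labels rather than the actual test set:

```python
# Illustration only: the labels below are invented, not from the real dataset.
from sklearn.metrics import f1_score

y_true = ["hate", "non-hate", "hate", "non-hate", "hate"]
y_pred = ["hate", "non-hate", "non-hate", "non-hate", "hate"]

# average="macro" computes F1 per class, then takes the unweighted mean
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 4))  # → 0.8
```

This is why macro F1 is a stricter metric than accuracy on imbalanced hate-speech data: a model that ignores the minority class is penalized on that class's F1.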
## 🔧 Technical Details

### Text models

Use `xlm-roberta-large` for feature extraction, combined with a deep learning classifier to classify the text.
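The classifier architecture itself is not documented here; a minimal sketch of a dense classifier over the 1024-dimensional `xlm-roberta-large` embeddings might look like the following (the hidden sizes, dropout rate, and two-class output are illustrative assumptions, not the shipped model):

```python
import tensorflow as tf

EMBEDDING_DIM = 1024  # hidden size of xlm-roberta-large
NUM_CLASSES = 2       # assumed, e.g. hate / non-hate

# Small feed-forward head over mean-pooled sentence embeddings
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(EMBEDDING_DIM,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Such a head consumes the `extract_embeddings` output directly, so the transformer never needs fine-tuning at inference time.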
### Audio models

Use MFCC (Mel-frequency cepstral coefficient) feature extraction, combined with a CNN-based classifier to classify the audio.
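As a sketch of how a CNN can consume the `(1, 40, 1, 1)`-shaped MFCC input used in `predict_audio` above (filter counts, pooling, and the class count are assumptions for illustration, not the repository's actual architecture):

```python
import tensorflow as tf

NUM_CLASSES = 2  # assumed

# Convolve along the 40 MFCC coefficients; the other two axes are singleton
audio_cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40, 1, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=(3, 1), activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```

Averaging the MFCC matrix over time (as `extract_audio_features` does) yields a fixed-length vector, which is what lets a fixed-input CNN like this handle clips of any duration.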
## 📄 License

This project is licensed under CC BY-NC 4.0.
## 📬 Contact

For questions or suggestions for improvement, feel free to open an issue or reach out by email.