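🚀 PII Detection Model - Fine-Tuned Phi3 Mini
This repository contains a version of the Phi3 Mini model fine-tuned to detect personally identifiable information (PII). The model is trained specifically to identify a range of PII entities in text, which makes it useful for data redaction, privacy protection, and compliance with data protection regulations.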
🚀 Quick Start
The model detects personally identifiable information (PII) in text, for use cases such as data privacy protection. The steps for using it follow.
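✨ Key Features
- Precise identification: recognizes many types of PII entities across categories such as personal information, contact details, and address information.
- Broad applicability: suitable for data redaction, privacy protection, and data protection compliance tasks.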
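📦 Installation
To use this model, install the transformers library. The usage example requests PyTorch tensors (return_tensors="pt"), so PyTorch is needed as well:
pip install transformers torch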
💻 Usage Example
Basic usage
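import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Use a GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
# generate() is called below, so the model is loaded with a causal-LM head
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)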
input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email Nathen15@hotmail.com."
model_prompt = f"""### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.
### Input:
{input_text}
### Output: """
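# Tokenize the prompt and move the tensors to `device`
inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
# do_sample=True means the output can differ between runs
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
# The decoded text contains the prompt followed by the generated completion
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)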
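The decoded response holds the prompt followed by the completion, so the JSON part still has to be separated out. A minimal sketch of that step, assuming the completion is a single JSON object that maps entity labels to extracted values (the exact output shape is not documented here, and the helper name extract_entities is hypothetical):

```python
import json

OUTPUT_MARKER = "### Output:"

def extract_entities(response: str) -> dict:
    """Return the JSON object that follows the last '### Output:' marker.

    Assumes the model emits one JSON object; returns {} if none can be parsed
    (for example when generation was cut off by max_new_tokens).
    """
    # Keep only the text that comes after the prompt's output marker
    completion = response.rsplit(OUTPUT_MARKER, 1)[-1]
    start = completion.find("{")
    if start == -1:
        return {}
    try:
        # raw_decode parses the first JSON value and ignores any trailing text
        obj, _ = json.JSONDecoder().raw_decode(completion[start:])
    except json.JSONDecodeError:
        return {}
    return obj if isinstance(obj, dict) else {}

# Hand-written sample response for illustration, not real model output
sample = 'prompt text\n### Output: {"firstname": "Abner", "email": "Nathen15@hotmail.com"} extra'
print(extract_entities(sample))
```

Returning an empty dict on failure keeps the caller simple; a stricter pipeline might prefer to raise so that truncated outputs are not silently treated as "no PII found".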
📚 Documentation
Model overview
Model architecture
Detectable PII entities
The model can detect the following PII entities:
- Personal information:
  - firstname (first name)
  - middlename (middle name)
  - lastname (last name)
  - sex (sex)
  - dob (date of birth)
  - age (age)
  - gender (gender)
  - height (height)
  - eyecolor (eye color)
- Contact information:
  - email (email address)
  - phonenumber (phone number)
  - url (URL)
  - username (username)
  - useragent (user agent)
- Address information:
  - street (street)
  - city (city)
  - state (state)
  - county (county)
  - zipcode (ZIP code)
  - country (country)
  - secondaryaddress (secondary address)
  - buildingnumber (building number)
  - ordinaldirection (ordinal direction)
- Geographic information:
  - nearbygpscoordinate (nearby GPS coordinate)
- Organization information:
  - companyname (company name)
  - jobtitle (job title)
  - jobarea (job area)
  - jobtype (job type)
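- Financial information:
  - accountname (account name)
  - accountnumber (account number)
  - creditcardnumber (credit card number)
  - creditcardcvv (credit card CVV)
  - creditcardissuer (credit card issuer)
  - iban (International Bank Account Number)
  - bic (Bank Identifier Code)
  - currency (currency)
  - currencyname (currency name)
  - currencysymbol (currency symbol)
  - currencycode (currency code)
  - amount (amount)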
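- Unique identifiers:
  - pin (personal identification number)
  - ssn (Social Security number)
  - imei (phone IMEI number)
  - mac (MAC address)
  - vehiclevin (vehicle identification number, VIN)
  - vehiclevrm (vehicle registration mark, VRM)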
- Cryptocurrency information:
  - bitcoinaddress (Bitcoin address)
  - litecoinaddress (Litecoin address)
  - ethereumaddress (Ethereum address)
- Other information:
  - ip (IP address)
  - ipv4 (IPv4 address)
  - ipv6 (IPv6 address)
  - maskednumber (masked number)
  - password (password)
  - time (time)
  - ordinaldirection (ordinal direction)
  - prefix (prefix)
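For the data redaction use case, the detected entities can be swapped for placeholders named after their labels. A minimal sketch, assuming the extracted entities are available as a dict that maps a label from the list above to the matched text (the function name redact and the [LABEL] placeholder style are illustrative choices, not part of the model):

```python
def redact(text: str, entities: dict, labels=None) -> str:
    """Replace each extracted value in text with a [LABEL] placeholder.

    labels optionally restricts redaction to a subset of entity labels.
    """
    # Replace longer values first so a short value cannot split a longer one
    items = sorted(entities.items(), key=lambda kv: len(str(kv[1])), reverse=True)
    for label, value in items:
        if labels is not None and label not in labels:
            continue
        value = str(value)
        if value:  # skip empty values, which would match everywhere
            text = text.replace(value, f"[{label.upper()}]")
    return text

text = "Hi Abner, please reply to Nathen15@hotmail.com."
# Hand-written entities for illustration, not real model output
entities = {"firstname": "Abner", "email": "Nathen15@hotmail.com"}
print(redact(text, entities))
print(redact(text, entities, labels={"email"}))
```

Plain string replacement redacts every occurrence of a value, which is usually what redaction wants, but it can also hit unrelated text when a value is very short (for example a one-digit age).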
Prompt format
### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.
### Input:
Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977. Don't miss the event at 176,Apt. 388.
### Output:
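Building the prompt by hand for every input is error-prone, so a small helper can assemble it. The sketch below mirrors the "### Instruction: / ### Input: / ### Output:" template from the usage example; build_prompt is a hypothetical name, and the two-label list is only for brevity. How the model behaves with a label list other than the full one shown above is not documented here, so passing the full list is the safer assumption.

```python
def build_prompt(input_text: str, labels: list[str]) -> str:
    """Assemble the instruction prompt for one input text."""
    instruction = (
        "Identify and extract the following PII entities from the text, "
        f"if present: {', '.join(labels)}. Return the output in JSON format."
    )
    # Same section markers and trailing space as the usage example's template
    return f"### Instruction:\n{instruction}\n### Input:\n{input_text}\n### Output: "

# Shortened label list for illustration only
prompt = build_prompt("Contact me at 555-0100.", ["email", "phonenumber"])
print(prompt)
```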
📄 License
This project is released under the MIT License.