---
license: mit
library_name: peft
tags:
- text-generation
- nlp
- llm
- pii
---
# PII Guardian Phi-3 Mini LoRA

This repository contains a fine-tuned model for detecting Personally Identifiable Information (PII), built on Phi-3 Mini with LoRA adapters applied to the query, key, and value attention layers. The model is tuned to identify a wide range of PII entity types accurately and efficiently.
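
For context, Phi-3 fuses the query, key, and value projections into a single `qkv_proj` module, so a LoRA adapter over those layers targets that module. The configuration below is only an illustrative sketch; the rank, alpha, and dropout values are assumptions, not the actual training hyperparameters:

```python
from peft import LoraConfig

# Illustrative LoRA config targeting Phi-3's fused query/key/value
# projection; r, lora_alpha, and lora_dropout are placeholder values,
# not the hyperparameters used to train this adapter.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```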

## How to Use

You can use the following Python code to load and run the model:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the best available device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")

# Use bf16 only where the hardware supports it natively; fall back to fp16
is_bf16_support = device == "cuda" and torch.cuda.is_bf16_supported()
print(f"bf16 supported: {is_bf16_support}")

# Load the base model and tokenizer
base_model = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load the fine-tuned PII detection model with LoRA
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    device_map=device,
    torch_dtype=torch.bfloat16 if is_bf16_support else torch.float16,
)
pii_model = PeftModel.from_pretrained(model, "ab-ai/PII-Guardian-Phi3-Mini-LORA")

# Example input text
input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email [email protected]."

# Define the model prompt
model_prompt = f"""### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.

### Input:
{input_text}

### Output:
"""

# Tokenize the prompt and generate a response
inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
# Greedy decoding keeps the extraction deterministic; adjust
# max_new_tokens if your inputs contain many entities
outputs = pii_model.generate(**inputs, do_sample=False, max_new_tokens=120)

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)  # {'firstname': ['Abner'], 'dob': ['23/03/1926'], 'email': ['[email protected]']}
```
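
The decoded response should be a JSON object mapping entity types to lists of detected values. A minimal parsing sketch is below; note that the sample output above uses Python-style single quotes rather than strict JSON, so an `ast.literal_eval` fallback is included (the `parse_pii` helper is illustrative, not part of the model's API):

```python
import ast
import json

# Parse the model's response into a dict of entity_type -> [values].
# Falls back to ast.literal_eval because the model may emit
# Python-style quoting instead of strict JSON.
def parse_pii(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)

entities = parse_pii(response.strip())
for entity_type, values in entities.items():
    print(f"{entity_type}: {values}")
```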