---
license: apache-2.0
datasets:
- dmitva/human_ai_generated_text
---

# 0xnu/AGTD-v0.1

The `0xnu/AGTD-v0.1` model is a text classifier that distinguishes human-generated text from AI-generated text. The approach is detailed in the accompanying study, accessible [here](https://arxiv.org/abs/2311.15565).

## Instruction Format

```
<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
```
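
As a rough illustration of the layout above, the following sketch assembles the same string from a list of turns. The helper name `format_conversation` and the `(instruction, answer)` tuple shape are hypothetical and not part of the model's API; in practice BOS and EOS are token IDs rather than literal text, so this only shows the ordering of the markers.

```python
def format_conversation(turns):
    """Lay out (instruction, answer) pairs in the template's marker order.

    An answer of None leaves the slot empty, as in the final turn of the
    template above. Hypothetical helper, for illustration only.
    """
    parts = ["<BOS>", "[CLS]"]
    for instruction, answer in turns:
        parts += ["[INST]", instruction, "[/INST]"]
        if answer:
            parts.append(answer)
        parts.append("[SEP]")
    parts.append("[EOS]")
    return " ".join(parts)


example = format_conversation([
    ("Instruction", "Model answer"),
    ("Follow-up instruction", None),
])
print(example)
```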

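Code for tokenizing instructions with the new format (`bos_id` and `eos_id` stand for the tokenizer's BOS and EOS token IDs):

```Python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")

def tokenize(text):
    # Encode without letting the tokenizer add special tokens on its own
    return tok.encode(text, add_special_tokens=False)

def build_input_ids(turns, bos_id, eos_id):
    # turns: ordered list of (user_message, bot_message) pairs
    ids = [bos_id] + tokenize("[CLS]")
    for user_message, bot_message in turns:
        ids += tokenize("[INST]") + tokenize(user_message) + tokenize("[/INST]")
        ids += tokenize(bot_message) + tokenize("[SEP]")
    return ids + [eos_id]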
```

Notes:

- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are integrated based on their definitions in the tokenizer configuration.
- `[INST]` and `[/INST]` are utilized to encapsulate instructions.
- The tokenize method should not automatically add BOS or EOS tokens but should add a prefix space.
- The `do_lower_case` parameter indicates that text is lowercased before tokenization, so casing does not affect the result.
- The `clean_up_tokenization_spaces` parameter removes unnecessary spaces introduced by the tokenization process.
- The `tokenize_chinese_chars` parameter controls whether Chinese characters are split into individual tokens.
- The maximum model length is set to 512 tokens.
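
To make the 512-token limit concrete, here is a minimal sketch of trimming an over-long ID sequence. The helper `truncate_ids`, the placeholder IDs, and the policy of keeping the EOS ID as the final token are all assumptions for illustration; the model card does not specify how over-long inputs should be handled.

```python
MAX_LENGTH = 512  # maximum model length stated in the notes above


def truncate_ids(input_ids, eos_id, max_length=MAX_LENGTH):
    """Trim a list of token IDs to max_length, ending with eos_id.

    Keeping EOS as the last token is an assumed policy, not taken
    from the model card.
    """
    if len(input_ids) <= max_length:
        return list(input_ids)
    return list(input_ids[: max_length - 1]) + [eos_id]


# Placeholder IDs standing in for a real tokenized input
eos = 9999
long_ids = list(range(600)) + [eos]
short_ids = truncate_ids(long_ids, eos_id=eos)
print(len(short_ids), short_ids[-1])
```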

## Run the model

```Python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "0xnu/AGTD-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Input text
text = "This model trains on a diverse dataset and serves functions in applications requiring a mechanism for distinguishing between human and AI-generated text."

# Preprocess the text
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

# Run the model
outputs = model(**inputs)

# Interpret the output
logits = outputs.logits

# Apply softmax to convert logits to probabilities
probabilities = torch.softmax(logits, dim=1)

# Assuming the first class is 'human' and the second class is 'ai'
human_prob, ai_prob = probabilities.detach().numpy()[0]

# Print probabilities
print(f"Human Probability: {human_prob:.4f}")
print(f"AI Probability: {ai_prob:.4f}")

# Determine if the text is human or AI-generated
if human_prob > ai_prob:
    print("The text is likely human-generated.")
else:
    print("The text is likely AI-generated.")
```
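
The softmax and comparison steps in the example above can also be followed without any dependencies. The sketch below uses made-up logits and, like the example, assumes index 0 is the human class and index 1 the AI class; the `softmax` helper here is a plain-Python stand-in for `torch.softmax`.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits; index 0 is assumed to be 'human', index 1 'ai'
example_logits = [2.0, 0.5]
human_prob, ai_prob = softmax(example_logits)
label = "human" if human_prob > ai_prob else "ai"
print(f"human={human_prob:.4f} ai={ai_prob:.4f} -> {label}")
```

Because softmax preserves the ordering of the logits, comparing the two probabilities gives the same label as comparing the raw logits; the probabilities are mainly useful for reporting a score alongside the label.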