---
license: apache-2.0
datasets:
- dmitva/human_ai_generated_text
---

# 0xnu/AGTD-v0.1

The "0xnu/AGTD-v0.1" model distinguishes between human-written and AI-generated text, providing accurate and efficient text analysis and classification. The full methodology and results are detailed in the study, accessible [here](https://arxiv.org/abs/2311.15565).

## Instruction Format

```
<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
```

Pseudo-code for tokenizing instructions with the new format:

```Python
def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

token_ids = (
    [BOS_ID] +
    tokenize("[CLS]") + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
    tokenize(BOT_MESSAGE_1) + tokenize("[SEP]") +
    # … repeat for the intermediate turns …
    tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
    tokenize(BOT_MESSAGE_N) + tokenize("[SEP]") +
    [EOS_ID]
)
```
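
The pseudo-code above assumes `tok`, `BOS_ID`, and `EOS_ID` are already defined. A minimal sketch of one way to bind them from this checkpoint's tokenizer (an assumption, not part of the study: BERT-style vocabularies often define no dedicated BOS/EOS tokens, so this falls back to `[CLS]`/`[SEP]`):

```Python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the checkpoint.
tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")

# Assumption: reuse [CLS]/[SEP] ids when no dedicated BOS/EOS tokens exist.
BOS_ID = tok.bos_token_id if tok.bos_token_id is not None else tok.cls_token_id
EOS_ID = tok.eos_token_id if tok.eos_token_id is not None else tok.sep_token_id
```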

Notes:

- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are integrated according to their definitions in the tokenizer configuration.
- `[INST]` and `[/INST]` encapsulate instructions.
- The `tokenize` method should not add BOS or EOS tokens automatically, but it should add a prefix space.
- The `do_lower_case` parameter indicates that text is lowercased for consistent tokenization.
- `clean_up_tokenization_spaces` removes unnecessary spaces during tokenization.
- The `tokenize_chinese_chars` parameter enables special handling of Chinese characters.
- The maximum model length is 512 tokens, so longer inputs must be truncated; see the sketch below.
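
These settings live in the tokenizer configuration, so they can be checked directly rather than taken on faith. A short sketch that inspects the relevant attributes and enforces the 512-token limit (attribute names follow the standard `transformers` tokenizer API; the sample string is hypothetical):

```Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")

# Inspect the configuration options described above.
print(tokenizer.model_max_length)              # expected: 512
print(tokenizer.clean_up_tokenization_spaces)  # whether decoded text is cleaned
# do_lower_case is backend-specific; guard the lookup in case it is absent.
print(getattr(tokenizer, "do_lower_case", None))

# Enforce the 512-token limit explicitly when encoding long inputs.
inputs = tokenizer("some long document ...", truncation=True,
                   max_length=512, return_tensors="pt")
print(inputs["input_ids"].shape)
```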

## Run the model

```Python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "0xnu/AGTD-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Input text
text = "This model trains on a diverse dataset and serves functions in applications requiring a mechanism for distinguishing between human and AI-generated text."

# Preprocess the text
inputs = tokenizer(text, return_tensors='pt')

# Run the model
outputs = model(**inputs)

# Interpret the output
logits = outputs.logits

# Apply softmax to convert logits to probabilities
probabilities = torch.softmax(logits, dim=1)

# Assuming the first class is 'human' and the second class is 'ai'
human_prob, ai_prob = probabilities.detach().numpy()[0]

# Print probabilities
print(f"Human Probability: {human_prob:.4f}")
print(f"AI Probability: {ai_prob:.4f}")

# Determine if the text is human or AI-generated
if human_prob > ai_prob:
    print("The text is likely human-generated.")
else:
    print("The text is likely AI-generated.")
```
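
For repeated use, the same steps can be wrapped in a small helper. A sketch reusing `tokenizer` and `model` from the block above (the `classify_text` name is an assumption, and the human/AI label order is carried over from the comment in the example):

```Python
import torch

def classify_text(text: str) -> dict:
    """Return human/AI probabilities for a single text."""
    # Truncate to the model's 512-token limit to avoid oversized inputs.
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    # Inference only: disable gradient tracking.
    with torch.no_grad():
        logits = model(**inputs).logits
    human_prob, ai_prob = torch.softmax(logits, dim=1)[0].tolist()
    return {"human": human_prob, "ai": ai_prob}

print(classify_text("A quick sample sentence to score."))
```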