metadata
license: apache-2.0
datasets:
- dmitva/human_ai_generated_text
0xnu/AGTD-v0.1
The "0xnu/AGTD-v0.1" model is a text classifier that distinguishes human-written text from AI-generated text. It is presented as a significant advance on this task, offering high accuracy and efficiency in text analysis and classification. The approach is detailed in the accompanying study, accessible here.
Instruction Format
<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
Pseudo-code for tokenizing instructions with the new format:
def tokenize(text):
    return tok.encode(text, add_special_tokens=False)
[BOS_ID] +
tokenize("[CLS]") + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
tokenize(BOT_MESSAGE_1) + tokenize("[SEP]") +
…
tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
tokenize(BOT_MESSAGE_N) + tokenize("[SEP]") + [EOS_ID]
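The pseudo-code above can be turned into a small runnable sketch. This is one possible reading of the format, not the author's implementation. The names `build_input_ids` and `toy_encode` and the ID values 1 and 2 are hypothetical placeholders. In practice `encode` would wrap something like `tok.encode(text, add_special_tokens=False)`, and the BOS and EOS IDs would come from the tokenizer.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_input_ids(
    turns: Sequence[Tuple[str, str]],
    encode: Callable[[str], List[int]],
    bos_id: int,
    eos_id: int,
) -> List[int]:
    """Concatenate token IDs turn by turn, following the instruction format."""
    ids = [bos_id] + encode("[CLS]")
    for user_msg, bot_msg in turns:
        # Each turn: [INST] user [/INST] answer [SEP]
        ids += encode("[INST]") + encode(user_msg) + encode("[/INST]")
        ids += encode(bot_msg) + encode("[SEP]")
    return ids + [eos_id]


# Toy stand-in encoder (assumption, for illustration only):
# one integer ID per whitespace-separated word, starting at 10.
vocab: Dict[str, int] = {}


def toy_encode(text: str) -> List[int]:
    return [vocab.setdefault(word, len(vocab) + 10) for word in text.split()]


example_ids = build_input_ids(
    [("Is this text AI-written?", "Likely human."), ("And this one?", "Likely AI.")],
    toy_encode,
    bos_id=1,  # hypothetical BOS ID
    eos_id=2,  # hypothetical EOS ID
)
print(example_ids)
```

Passing the encoder in as a callable keeps the sketch independent of any particular tokenizer, so the same function can be checked with a toy encoder and later used with the real one.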
Notes:
- [CLS], [SEP], [PAD], [UNK], and [MASK] tokens are integrated based on their definitions in the tokenizer configuration.
- [INST] and [/INST] are utilized to encapsulate instructions.
- The tokenize method should not automatically add BOS or EOS tokens but should add a prefix space.
- The do_lower_case parameter indicates that text should be in lowercase for consistent tokenization.
- clean_up_tokenization_spaces removes unnecessary spaces in the tokenization process.
- The tokenize_chinese_chars parameter indicates special handling for Chinese characters.
- The maximum model length is set to 512 tokens.
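To make the effect of these settings concrete, here is a simplified sketch of the preprocessing they imply. It assumes BERT-style behaviour, where `tokenize_chinese_chars` isolates each CJK character as its own token, and it truncates whitespace-separated words rather than real subword tokens. The function names and the single Unicode range checked are illustrative assumptions, not the model's actual tokenizer.

```python
from typing import List

MAX_LEN = 512  # maximum model length stated in the notes


def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block (simplified)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def normalize(
    text: str,
    do_lower_case: bool = True,
    tokenize_chinese_chars: bool = True,
) -> List[str]:
    """Lowercase, isolate CJK characters, split on whitespace, truncate."""
    if do_lower_case:
        text = text.lower()
    if tokenize_chinese_chars:
        # Pad each CJK character with spaces so it becomes its own token.
        text = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    # split() also collapses repeated spaces; slicing enforces the length limit.
    return text.split()[:MAX_LEN]


print(normalize("Hello 世界"))
```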
Run the model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "0xnu/AGTD-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
# Input text
text = "This model trains on a diverse dataset and serves functions in applications requiring a mechanism for distinguishing between human and AI-generated text."
# Preprocess the text, truncating to the model's 512-token maximum
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
# Run the model
outputs = model(**inputs)
# Interpret the output
logits = outputs.logits
# Apply softmax to convert logits to probabilities
probabilities = torch.softmax(logits, dim=1)
# Assuming the first class is 'human' and the second class is 'ai'
human_prob, ai_prob = probabilities.detach().numpy()[0]
# Print probabilities
print(f"Human Probability: {human_prob:.4f}")
print(f"AI Probability: {ai_prob:.4f}")
# Determine if the text is human or AI-generated
if human_prob > ai_prob:
    print("The text is likely human-generated.")
else:
    print("The text is likely AI-generated.")
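The softmax step can also be checked by hand without loading the model. The sketch below uses plain Python and hypothetical logits. As in the script above, it assumes index 0 is 'human' and index 1 is 'ai'. The helper names `softmax` and `label_from_logits` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(
    logits: Sequence[float],
    labels: Tuple[str, ...] = ("human", "ai"),  # assumed label order
) -> Tuple[str, List[float]]:
    """Return the most probable label together with all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical logits, not real model output.
label, probs = label_from_logits([2.0, 0.0])
print(label, [round(p, 4) for p in probs])
```

If the label order matters for a downstream decision, it is worth confirming it against the model configuration rather than relying on this assumption.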