
	---
	license: apache-2.0
	datasets:
	- dmitva/human_ai_generated_text
	---

	# 0xnu/AGTD-v0.1

	The "0xnu/AGTD-v0.1" model classifies text as human-generated or AI-generated. The approach and its evaluation are detailed in the accompanying study, accessible [here](https://arxiv.org/abs/2311.15565).

	## Instruction Format

	```
	<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
	```

	Pseudo-code for tokenizing instructions with the new format:

	```Python
	def tokenize(text):
	    # `tok` is the model's tokenizer; special tokens are added explicitly below
	    return tok.encode(text, add_special_tokens=False)

	[BOS_ID] +
	tokenize("[CLS]") + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
	tokenize(BOT_MESSAGE_1) + tokenize("[SEP]") +
	…
	tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
	tokenize(BOT_MESSAGE_N) + tokenize("[SEP]") + [EOS_ID]
	```
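
The pseudo-code above can be sketched as a small runnable function. This is a minimal illustration, not the author's implementation: `build_input_ids`, `toy_encode`, and the id values `bos_id=1` and `eos_id=2` are hypothetical names and numbers chosen for the example. In practice `encode` would wrap the real tokenizer with `add_special_tokens=False`.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_input_ids(
    turns: Sequence[Tuple[str, str]],
    encode: Callable[[str], List[int]],
    bos_id: int,
    eos_id: int,
) -> List[int]:
    """Assemble token ids for a conversation of (user, bot) message pairs.

    `encode` must not add special tokens on its own.
    """
    ids = [bos_id] + encode("[CLS]")
    for user_message, bot_message in turns:
        ids += encode("[INST]") + encode(user_message) + encode("[/INST]")
        ids += encode(bot_message) + encode("[SEP]")
    return ids + [eos_id]


# Toy stand-in encoder (assumption, for demonstration only):
# assigns one id per whitespace-separated piece, starting at 10.
vocab: Dict[str, int] = {}


def toy_encode(text: str) -> List[int]:
    return [vocab.setdefault(piece, len(vocab) + 10) for piece in text.split()]


ids = build_input_ids([("Is this human?", "human")], toy_encode, bos_id=1, eos_id=2)
print(ids)
```

Passing the encoder in as an argument keeps the assembly logic independent of any particular tokenizer, so the same function can be exercised with the toy encoder or with the real one.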

	Notes:

	- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are integrated based on their definitions in the tokenizer configuration.
	- `[INST]` and `[/INST]` are utilized to encapsulate instructions.
	- The tokenize method should not automatically add BOS or EOS tokens but should add a prefix space.
	- The `do_lower_case` parameter indicates that text is lowercased before tokenization for consistency.
	- `clean_up_tokenization_spaces` removes unnecessary spaces in the tokenization process.
	- The `tokenize_chinese_chars` parameter indicates special handling for Chinese characters.
	- The maximum model length is set to 512 tokens.
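
To make the 512-token limit concrete, here is a minimal sketch of truncating a manually assembled id sequence. The helper name `truncate_ids`, the `eos_id` value, and the choice to keep a trailing EOS after truncation are assumptions for illustration; they are not taken from the model's tokenizer configuration.

```python
from typing import List

MAX_LENGTH = 512  # maximum model length stated in the notes above


def truncate_ids(ids: List[int], eos_id: int, max_length: int = MAX_LENGTH) -> List[int]:
    """Return at most `max_length` ids; if truncated, the last id is `eos_id`."""
    if len(ids) <= max_length:
        return list(ids)
    # Keep the first max_length - 1 ids and close the sequence with EOS.
    return list(ids[: max_length - 1]) + [eos_id]


long_ids = list(range(1000, 1600))  # 600 hypothetical token ids
short_ids = truncate_ids(long_ids, eos_id=2)
print(len(short_ids))
```

When the tokenizer is called directly on raw text, passing `truncation=True` and `max_length=512` achieves the same effect without a manual helper.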

	## Run the model

	```Python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_id = "0xnu/AGTD-v0.1"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

	# Input text
	text = "This model trains on a diverse dataset and serves functions in applications requiring a mechanism for distinguishing between human and AI-generated text."

	# Preprocess the text, truncating to the 512-token maximum model length
	inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

	# Run the model
	outputs = model(**inputs)

	# Interpret the output
	logits = outputs.logits

	# Apply softmax to convert logits to probabilities
	probabilities = torch.softmax(logits, dim=1)

	# Assuming the first class is 'human' and the second class is 'ai'
	# (check model.config.id2label to confirm the label order)
	human_prob, ai_prob = probabilities.detach().numpy()[0]

	# Print probabilities
	print(f"Human Probability: {human_prob:.4f}")
	print(f"AI Probability: {ai_prob:.4f}")

	# Determine if the text is human or AI-generated
	if human_prob > ai_prob:
	    print("The text is likely human-generated.")
	else:
	    print("The text is likely AI-generated.")
	```
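
The softmax and decision steps above can also be shown without any model dependency. The following is a minimal pure-Python sketch; the function names `softmax` and `interpret` and the example logits `[2.0, 0.0]` are hypothetical, and the label order `("human", "ai")` is the same assumption made in the snippet above.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(
    logits: Sequence[float], labels: Sequence[str] = ("human", "ai")
) -> Dict[str, float]:
    """Map each label to its probability (label order is an assumption)."""
    return dict(zip(labels, softmax(logits)))


scores = interpret([2.0, 0.0])  # hypothetical logits for one input
label = max(scores, key=scores.get)
print(f"{label}: {scores[label]:.4f}")
```

Returning a label-to-probability mapping rather than a bare tuple makes it harder to mix up the class order when the result is passed around.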