---
library_name: transformers
inference: false
license: cc-by-sa-4.0
base_model:
- nqzfaizal77ai/swiftstrike-aero-init-580m
---
**Swiftstrike Aero Model (Falcon Pruned Model)**

This model is a fine-tuned version of the Swiftstrike Aero model, tailored for context-aware keyword search on culture-related topics. It is designed to process 1-block contexts, equivalent to approximately 384 tokens, or roughly one standard-length Wikipedia paragraph.
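As a rough illustration of the 1-block budget, the sketch below estimates whether a paragraph fits within ~384 tokens. The words-to-tokens ratio of 1.3 is an assumed heuristic, not a property of this model; for an exact count, tokenize the text with this model's own tokenizer.

```python
# Rough 1-block budget check. The words->tokens ratio of 1.3 is an
# assumed heuristic; for exact counts, use the model's tokenizer.
BLOCK_TOKEN_BUDGET = 384

def estimate_tokens(text: str) -> int:
    """Estimate the token count from the whitespace-separated word count."""
    return round(len(text.split()) * 1.3)

def fits_one_block(text: str) -> bool:
    """Return True if the text likely fits within one 1-block context."""
    return estimate_tokens(text) <= BLOCK_TOKEN_BUDGET

paragraph = "Culture is an umbrella term which encompasses social behavior and norms."
print(fits_one_block(paragraph))  # a short paragraph easily fits
```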
**Training Data (Part 1 Culture Context Wikipedia)**
The model was trained on a multi-stage dataset derived from Wikipedia's culture-related content:
1. **Base Dataset:**
- 13,000 rows of capitalized and lowercased words extracted from sentences in Wikipedia's culture articles.
2. **Sentence-Level Dataset:**
- 2,300 rows of full sentences from Wikipedia's culture articles.
3. **1-Block Context Dataset:**
- 500 rows of 1-block contexts (approximately one paragraph each) from Wikipedia's culture articles.
**Dataset Organization**
The dataset is structured hierarchically, with each tier representing an increasing level of complexity:
1. **Part:** Individual components or elements.
2. **Merge Part:** Combination of two or more parts.
3. **Fragment:** Combination of two or more merge parts.
4. **Sub-Unit:** Combination of two or more fragments.
5. **Unit:** Combination of two or more sub-units.
6. **Super-Unit:** Combination of two or more units.
7. **Mega-Unit:** Combination of two or more super-units.
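The tiers above can be sketched as repeated grouping, where each level is built by combining two or more elements of the level below. The tier names come from the list above; the grouping function and the minimum group size of two are illustrative assumptions.

```python
# Illustrative sketch of the hierarchy: each tier combines two or more
# elements of the tier below. A group size of 2 is the minimum case.
LEVELS = ["part", "merge part", "fragment", "sub-unit",
          "unit", "super-unit", "mega-unit"]

def combine(items, group_size=2):
    """Group consecutive elements into chunks of `group_size`."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]

# Eight parts collapse one tier at a time up the hierarchy.
tier = [f"part{i}" for i in range(8)]
for name in LEVELS[1:]:
    if len(tier) == 1:
        break
    tier = combine(tier)
    print(f"{name}: {len(tier)} element(s)")
```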
**How to Use**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import display, HTML

# Load the model and tokenizer
model_name = "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

torch.manual_seed(3077)

input_text = "The cultural impact of the internet is"
inputs = tokenizer(input_text, return_tensors="pt")

def print_with_border(text):
    """Display the given text inside a bordered HTML box."""
    display(HTML(f"<div style='border: 1px solid black; padding: 10px;'>{text}</div>"))

# Example: stochastic (sampling) decoding
output = model.generate(**inputs,
                        do_sample=True,
                        top_k=50,
                        top_p=0.95,
                        repetition_penalty=1.2,
                        max_length=100)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)

# Example: greedy decoding
output = model.generate(**inputs,
                        do_sample=False,
                        max_length=100)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)
```