---
library_name: transformers
inference: false
license: cc-by-sa-4.0
base_model:
- nqzfaizal77ai/swiftstrike-aero-init-580m
---
|

**Swiftstrike Aero Model (Pruned Falcon Model)**

This model is a fine-tuned version of the Swiftstrike Aero model, tailored for context-aware keyword search related to culture. It is designed to process 1-block contexts, i.e. approximately 384 tokens, or roughly one standard-length Wikipedia paragraph.
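As a rough sanity check, a caller can estimate whether a passage fits the 1-block budget before generation. The sketch below uses a crude words-per-token ratio; the helper name and the 0.75 ratio are assumptions, not properties of this model's tokenizer, which should be used for exact counts.

```python
# Rough sketch: estimate whether text fits the ~384-token 1-block context.
# The 0.75 words-per-token ratio is a heuristic assumption; use the model's
# actual tokenizer for exact counts.
def fits_one_block(text: str, max_tokens: int = 384, words_per_token: float = 0.75) -> bool:
    """Return True if `text` is estimated to fit in one 1-block context."""
    estimated_tokens = len(text.split()) / words_per_token
    return estimated_tokens <= max_tokens


print(fits_one_block("The cultural impact of the internet is"))  # True: short prompt fits
```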
|

**Training Data (Part 1: Culture Context, Wikipedia)**

The model was trained on a multi-stage dataset derived from Wikipedia's culture-related content:

1. **Base Dataset:**
   - 13,000 rows of capitalized and lowercase words extracted from sentences in Wikipedia's culture articles.
2. **Sentence-Level Dataset:**
   - 2,300 rows of full sentences from Wikipedia's culture data.
3. **1-Block Context Dataset:**
   - 500 rows of 1-block contexts (approximately one paragraph each) from Wikipedia's culture data.
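The three stages can be read as a curriculum, moving from the shortest training unit to the longest. A minimal sketch of that ordering (the row counts come from the list above; the stage names and structure are illustrative, not published artifacts):

```python
# Hypothetical curriculum sketch: the three stages described above, ordered
# from shortest to longest training unit. Row counts come from the model
# card; everything else here is illustrative.
stages = [
    {"name": "word-level", "rows": 13_000, "unit": "word"},
    {"name": "sentence-level", "rows": 2_300, "unit": "sentence"},
    {"name": "1-block-context", "rows": 500, "unit": "paragraph (~384 tokens)"},
]

for stage in stages:
    print(f"{stage['name']}: {stage['rows']} rows of {stage['unit']} data")
```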
|

**Dataset Organization**

The dataset is organized hierarchically, with each level combining elements of the level below it:

1. **Part:** Individual components or elements.
2. **Merge Part:** Combination of two or more parts.
3. **Fragment:** Combination of two or more merge parts.
4. **Sub-Unit:** Combination of two or more fragments.
5. **Unit:** Combination of two or more sub-units.
6. **Super-Unit:** Combination of two or more units.
7. **Mega-Unit:** Combination of two or more super-units.
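The seven levels above amount to a bottom-up merge: each level groups two or more items from the level below. A minimal illustrative sketch, assuming a fixed group size of 2 (the `combine` helper and group size are hypothetical; the card only says "two or more"):

```python
# Illustrative only: build each hierarchy level by joining consecutive
# groups from the level below. Group size 2 is an assumption.
def combine(items, group_size=2):
    """Join consecutive groups of `group_size` items into next-level items."""
    return [
        " ".join(items[i:i + group_size])
        for i in range(0, len(items), group_size)
    ]


parts = ["part1", "part2", "part3", "part4"]
merge_parts = combine(parts)      # two parts -> one merge part
fragments = combine(merge_parts)  # two merge parts -> one fragment
print(fragments)  # ['part1 part2 part3 part4']
```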
|

**How to Use**

|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import display, HTML

model_name = "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc"

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

torch.manual_seed(3077)

input_text = "The cultural impact of the internet is"
inputs = tokenizer(input_text, return_tensors="pt")


def print_with_border(text):
    """Display the given text inside a bordered box (for notebook use)."""
    display(HTML(f"<div style='border: 1px solid black; padding: 10px;'>{text}</div>"))


# Example usage: stochastic (sampling) decoding
output = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    max_length=100,
)

# Decode the generated output to a string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)

# Example usage: greedy decoding
output = model.generate(**inputs, do_sample=False, max_length=100)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)
```