---
library_name: transformers
inference: false
license: cc-by-sa-4.0
base_model:
  - nqzfaizal77ai/swiftstrike-aero-init-580m
---

# Swiftstrike Aero Model (Falcon Pruned Model)

This model is a fine-tuned version of the Swiftstrike Aero Model, specifically tailored for context-aware keyword searches related to culture. It is designed to process 1-block contexts, equivalent to approximately 384 tokens, or roughly one standard-length Wikipedia paragraph.
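Because generation quality is tied to this 1-block budget, it can help to check how many tokens a candidate paragraph occupies before prompting. The snippet below is a minimal sketch of such a check; the example paragraph is arbitrary, and the 384-token figure is taken from the description above.

```python
from transformers import AutoTokenizer

# A minimal sketch (not from the model card): count tokens to see whether a
# paragraph fits within the ~384-token 1-block context described above.
tokenizer = AutoTokenizer.from_pretrained(
    "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc",
    trust_remote_code=True,
)

paragraph = "Culture encompasses the social behaviour, institutions, and norms found in human societies."
num_tokens = len(tokenizer(paragraph)["input_ids"])
print(f"{num_tokens} tokens -> fits in one block: {num_tokens <= 384}")
```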

## Training Data (Part 1: Wikipedia Culture Context)

The model was trained on a multi-stage dataset derived from Wikipedia's culture-related content (a hypothetical sketch of the three stages follows the list):

1. **Base dataset:** 13,000 rows of capitalized and lowercase words extracted from Wikipedia's culture sentences.
2. **Sentence-level dataset:** 2,300 rows of full sentences from Wikipedia's culture data.
3. **1-block context dataset:** 500 rows of 1-block contexts (approximately one paragraph each) from Wikipedia's culture data.
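To make the staging concrete, here is a purely hypothetical illustration of what a row at each stage might look like; the actual dataset files are not published with this card, so the formats and contents below are assumed rather than real samples.

```python
# Hypothetical rows per stage (illustrative only; not drawn from the real dataset).
stage_examples = {
    "base_words": ["Culture", "culture", "Heritage", "heritage"],              # ~13,000 rows
    "sentences": ["Culture is transmitted through language and tradition."],   # ~2,300 rows
    "one_block_contexts": ["A paragraph-length passage of roughly 384 tokens ..."],  # ~500 rows
}

for stage, rows in stage_examples.items():
    print(stage, "->", rows[0])
```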

## Dataset Organization

The dataset is structured hierarchically, with each tier representing increasing complexity (a minimal sketch of the hierarchy follows the list):

1. **Part:** an individual component or element.
2. **Merge Part:** a combination of two or more parts.
3. **Fragment:** a combination of two or more merge parts.
4. **Sub-Unit:** a combination of two or more fragments.
5. **Unit:** a combination of two or more sub-units.
6. **Super-Unit:** a combination of two or more units.
7. **Mega-Unit:** a combination of two or more super-units.
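As a rough illustration of how these tiers nest, the sketch below encodes the ordering in Python; the tier names are taken from the list above, while the helper itself is assumed and not part of any released tooling.

```python
# Hierarchy tiers from smallest to largest grouping, as described above.
HIERARCHY = [
    "part",        # individual component or element
    "merge_part",  # two or more parts
    "fragment",    # two or more merge parts
    "sub_unit",    # two or more fragments
    "unit",        # two or more sub-units
    "super_unit",  # two or more units
    "mega_unit",   # two or more super-units
]

def level_of(name: str) -> int:
    """Return the 1-based complexity level of a tier label (assumed helper)."""
    return HIERARCHY.index(name) + 1

print(level_of("fragment"))  # 3
```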

## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import display, HTML

# Load the model and tokenizer
model_name = "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc"

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

torch.manual_seed(3077)

input_text = "The cultural impact of the internet is"
inputs = tokenizer(input_text, return_tensors="pt")


def print_with_border(text):
    """Display the given text inside a simple HTML border (for notebooks)."""
    display(HTML(f"<div style='border: 1px solid black; padding: 10px;'>{text}</div>"))


# Example usage: stochastic (sampling) decode
output = model.generate(**inputs,
                        do_sample=True,
                        top_k=50,
                        top_p=0.95,
                        repetition_penalty=1.2,
                        max_length=100)

# Decode the generated output to a string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)

# Example usage: greedy decode
output = model.generate(**inputs,
                        do_sample=False,
                        max_length=100)

# Decode the generated output to a string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)
```
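Note that `print_with_border` relies on IPython's HTML rendering, so it only displays inside a notebook. Outside a notebook, a plain `print` of the decoded text works just as well; this fallback is a small addition, not part of the original example.

```python
# Plain-terminal fallback: skip the HTML wrapper and print the decoded text directly.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```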