---
library_name: transformers
inference: false
license: cc-by-sa-4.0
base_model:
  - nqzfaizal77ai/swiftstrike-aero-init-580m
---

# Swiftstrike Aero Model (Falcon Pruned Model)

This model is a fine-tuned version of the Swiftstrike Aero Model, specifically tailored for context-aware keyword searches related to culture. It is designed to process 1-block contexts, equivalent to approximately 384 tokens, or roughly one standard-length Wikipedia paragraph.
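Because generation quality is tied to this 1-block budget, it can help to check how many tokens a candidate paragraph occupies before prompting. The snippet below is a minimal sketch of such a check; the example paragraph is arbitrary, and the 384-token figure is taken from the description above.

```python
from transformers import AutoTokenizer

# A minimal sketch (not from the model card): count tokens to see whether a
# paragraph fits within the ~384-token 1-block context described above.
tokenizer = AutoTokenizer.from_pretrained(
    "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc",
    trust_remote_code=True,
)

paragraph = "Culture encompasses the social behaviour, institutions, and norms found in human societies."
num_tokens = len(tokenizer(paragraph)["input_ids"])
print(f"{num_tokens} tokens -> fits in one block: {num_tokens <= 384}")
```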

## Training Data (Part 1: Wikipedia Culture Context)

The model was trained on a multi-stage dataset derived from Wikipedia's culture-related content (a hypothetical sketch of the three stages follows the list):

1. **Base dataset:** 13,000 rows of capitalized and lowercase words extracted from Wikipedia's culture sentences.
2. **Sentence-level dataset:** 2,300 rows of full sentences from Wikipedia's culture data.
3. **1-block context dataset:** 500 rows of 1-block contexts (approximately one paragraph each) from Wikipedia's culture data.
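To make the staging concrete, here is a purely hypothetical illustration of what a row at each stage might look like; the actual dataset files are not published with this card, so the formats and contents below are assumed rather than real samples.

```python
# Hypothetical rows per stage (illustrative only; not drawn from the real dataset).
stage_examples = {
    "base_words": ["Culture", "culture", "Heritage", "heritage"],              # ~13,000 rows
    "sentences": ["Culture is transmitted through language and tradition."],   # ~2,300 rows
    "one_block_contexts": ["A paragraph-length passage of roughly 384 tokens ..."],  # ~500 rows
}

for stage, rows in stage_examples.items():
    print(stage, "->", rows[0])
```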

## Dataset Organization

The dataset is structured hierarchically, with each tier representing increasing complexity (a minimal sketch of the hierarchy follows the list):

1. **Part:** an individual component or element.
2. **Merge Part:** a combination of two or more parts.
3. **Fragment:** a combination of two or more merge parts.
4. **Sub-Unit:** a combination of two or more fragments.
5. **Unit:** a combination of two or more sub-units.
6. **Super-Unit:** a combination of two or more units.
7. **Mega-Unit:** a combination of two or more super-units.
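As a rough illustration of how these tiers nest, the sketch below encodes the ordering in Python; the tier names are taken from the list above, while the helper itself is assumed and not part of any released tooling.

```python
# Hierarchy tiers from smallest to largest grouping, as described above.
HIERARCHY = [
    "part",        # individual component or element
    "merge_part",  # two or more parts
    "fragment",    # two or more merge parts
    "sub_unit",    # two or more fragments
    "unit",        # two or more sub-units
    "super_unit",  # two or more units
    "mega_unit",   # two or more super-units
]

def level_of(name: str) -> int:
    """Return the 1-based complexity level of a tier label (assumed helper)."""
    return HIERARCHY.index(name) + 1

print(level_of("fragment"))  # 3
```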

## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import display, HTML

# Load the model and tokenizer
model_name = "nqzfaizal77ai/sa-145m-en-wikipedia-culture-part1-1bc"

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

torch.manual_seed(3077)

input_text = "The cultural impact of the internet is"
inputs = tokenizer(input_text, return_tensors="pt")


def print_with_border(text):
    """Display the given text inside a simple HTML border (for notebooks)."""
    display(HTML(f"<div style='border: 1px solid black; padding: 10px;'>{text}</div>"))


# Example usage: stochastic (sampling) decode
output = model.generate(**inputs,
                        do_sample=True,
                        top_k=50,
                        top_p=0.95,
                        repetition_penalty=1.2,
                        max_length=100)

# Decode the generated output to a string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)

# Example usage: greedy decode
output = model.generate(**inputs,
                        do_sample=False,
                        max_length=100)

# Decode the generated output to a string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True).replace("\n", "<br>")
print_with_border(generated_text)
```
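Note that `print_with_border` relies on IPython's HTML rendering, so it only displays inside a notebook. Outside a notebook, a plain `print` of the decoded text works just as well; this fallback is a small addition, not part of the original example.

```python
# Plain-terminal fallback: skip the HTML wrapper and print the decoded text directly.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```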