---
license: apache-2.0
language:
- ks
tags:
- Text
---

# Kashmiri Text Generation Model

## Model Overview

This is a transformer-based text generation model for the Kashmiri language. It uses a decoder-only architecture with positional encoding and self-attention.

## TRY LIVE DEMO ON SPACES

[VIEW HERE (Click)](https://huggingface.co/spaces/Omarrran/kashmiri_text_generation_trail)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66afb3f1eaf3e876595627bf/OY88f69T0yxwODUz7iDK8.png)

## Intended Use

- **Primary Use**: Generating coherent Kashmiri text continuations from given prompts
- **Intended Users**: Researchers and developers working with Kashmiri language processing
- **Out-of-Scope Uses**: Not intended for production deployment without further evaluation

## Model Architecture

- **Type**: Decoder-only Transformer
- **Components**:
  - Embedding Layer
  - Positional Encoding
  - Transformer Decoder Layers
  - Linear Output Layer
- **Implementation**: PyTorch

## Model Details

- **Architecture:** Custom Transformer Decoder
- **Vocabulary Size:** 36,100
- **Embedding Dimension:** 256
- **Number of Layers:** 4
- **Number of Attention Heads:** 8
- **Sequence Length:** 64
- **Training Data:** Kashmiri text corpus

## Technical Specifications

- **Framework**: PyTorch
- **Input**: Text prompts in Kashmiri
- **Output**: Generated text continuation
- **Model Parameters**:
  - Embedding Dimension: specified in `model_config.json`
  - Number of Layers: specified in `model_config.json`
  - Number of Attention Heads: specified in `model_config.json`
  - Sequence Length: specified in `model_config.json`
  - Dropout Rate: 0.2

## Files Structure

```
root/
├── model.pt             # Trained model weights
├── word_to_int.json     # Word to integer mapping
├── int_to_word.json     # Integer to word mapping
└── model_config.json    # Model configuration
```

## NOTE

1. Ensure all required files are present in the root directory.

## Setup in Google Colab

1. Create a new Google Colab notebook.
2. Copy and paste the following code into a code cell:

```python
!git clone https://huggingface.co/Omarrran/Kashur_gpt_version_1
```

## Required Files

The model requires the following files, which are downloaded from the HuggingFace repository:

- `model.pt`: The trained model weights
- `model_config.json`: Model configuration parameters
- `word_to_int.json`: Vocabulary mapping from words to integers
- `int_to_word.json`: Vocabulary mapping from integers to words

After cloning, move the downloaded files into the Colab root directory (`/content/`):

```python
import os
import shutil

# Define the source and destination paths
source_path = "/content/Kashur_gpt_version_1/"
destination_path = "/content/"

# Loop through all files in the source directory and move them to the destination
for filename in os.listdir(source_path):
    file_path = os.path.join(source_path, filename)
    if os.path.isfile(file_path):
        shutil.move(file_path, destination_path)

print(f"All files from {source_path} moved to {destination_path}")
```

## Usage

### 1. Import Required Libraries

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import json
import os
```
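Before defining the model, it can help to confirm that the four required files listed above are actually present in the working directory. The following optional check is a minimal sketch that uses only the file names given in this card:

```python
import os

# Optional check: confirm the required files are in the working directory
required_files = ["model.pt", "model_config.json", "word_to_int.json", "int_to_word.json"]
missing = [f for f in required_files if not os.path.exists(f)]
if missing:
    raise FileNotFoundError(f"Missing required files: {missing}")
print("All required files found.")
```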
### 2. Device Configuration and Model Definition

```python
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def generate_square_subsequent_mask(sz):
    # Causal mask: position i may only attend to positions <= i
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


class TextGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, sequence_length):
        super(TextGen, self).__init__()
        self.pos_encoder = PositionalEncoding(max_len=sequence_length, d_model=embed_dim)
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer=self.decoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        emb = self.emb(x)
        input_mask = generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.pos_encoder(emb)
        x = self.decoder(x, memory=x, tgt_mask=input_mask, memory_mask=input_mask)
        x = self.dropout(x)
        out = self.linear(x)
        return out


def load_model():
    # Load configuration
    with open('model_config.json', 'r') as f:
        config = json.load(f)

    # Load vocabularies
    with open('word_to_int.json', 'r', encoding='utf-8') as f:
        word_to_int = json.load(f)
    with open('int_to_word.json', 'r', encoding='utf-8') as f:
        int_to_word = json.load(f)

    # Initialize model
    model = TextGen(
        vocab_size=config['vocab_size'],
        embed_dim=config['embed_dim'],
        num_layers=config['num_layers'],
        num_heads=config['num_heads'],
        sequence_length=config['sequence_length']
    ).to(device)

    # Load model weights
    model.load_state_dict(torch.load('model.pt', map_location=device))
    model.eval()

    return model, word_to_int, int_to_word, config['sequence_length']


@torch.no_grad()
def generate_text(model, prompt, word_to_int, int_to_word, sequence_length, max_length=100, temperature=1.0):
    model.eval()
    words = prompt.split()
    current_seq = torch.LongTensor([word_to_int.get(w, 0) for w in words]).unsqueeze(0).to(device)

    for _ in range(max_length):
        # Keep only the most recent sequence_length tokens as context
        if current_seq.size(1) > sequence_length:
            current_seq = current_seq[:, -sequence_length:]
        output = model(current_seq)
        next_token_logits = output[:, -1, :] / temperature
        next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)
        current_seq = torch.cat([current_seq, next_token], dim=1)
        next_word = int_to_word.get(str(next_token.item()), "")
        words.append(next_word)
        if next_word == ".":
            break

    return " ".join(words)


if __name__ == "__main__":
    # Load model and required files
    model, word_to_int, int_to_word, sequence_length = load_model()
```

### Load the Model

The model loads automatically when the code above is run, on the GPU if one is available and otherwise on the CPU.
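Once `load_model()` has run (as in the `__main__` block above), an optional sanity check can confirm the device in use, the parameter count, and the vocabulary size. This is a minimal sketch that only uses objects already defined above:

```python
# Optional sanity check; assumes load_model() has already been run as above
num_params = sum(p.numel() for p in model.parameters())
print(f"Device in use: {device}")
print(f"Trainable parameters: {num_params:,}")
print(f"Vocabulary size: {len(word_to_int)}")
print(f"Context window (sequence_length): {sequence_length}")
```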
### 3. Generate Text

To generate text, use the following format:

```python
# Example prompt (in Kashmiri)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی"  # Replace with your Kashmiri text prompt

generated_text = generate_text(
    model,
    prompt,
    word_to_int,
    int_to_word,
    sequence_length,
    max_length=100  # Adjust this value to control output length
)

print(f"Generated text: {generated_text}")
```

## Generation Parameters

You can adjust the following parameters for text generation:

- **`max_length`**: Maximum number of words to generate (default: 100)
- **`temperature`**: Controls randomness in generation (default: 1.0)
  - Higher values (>1.0) make the output more diverse and random
  - Lower values (<1.0) make the output more focused and deterministic
- **Sequence Length**: Maximum context window size (specified in `model_config.json`)

A short temperature example is included at the end of this card.

## Limitations

- The model operates on word-level tokenization
- Limited by the maximum sequence length specified in the configuration
- Generation stops at the first period (.) encountered
- Performance may vary with the quality and length of the input prompt

## Performance Considerations

- Runs on both CPU and CUDA-enabled GPUs
- Memory usage scales with sequence length and batch size
- Inference speed depends on hardware capabilities and generation parameters

## Dependencies

- Python 3.6+
- PyTorch
- `math`, `json`, `os` (Python standard library)

## License

Apache-2.0 (see the metadata at the top of this card).

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kashmiri_text_gen,
  author       = {Haq Nawaz Malik},
  title        = {Kashmiri Text Generation Model},
  year         = {2024},
  note         = {Preprint},
  howpublished = {\url{https://huggingface.co/Omarrran/kashmiri_text_gen_model}}
}
```

## Contact

[Add contact information for model maintainers]

## Updates and Maintenance

- Version: 1.0
- Last Updated: 26-10-2024
- An updated version is in progress
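## Appendix: Temperature Example

As an illustration of the `temperature` parameter described above, the sketch below samples the same example prompt at a few different temperatures. It is an optional usage example and assumes the model, vocabularies, and `sequence_length` have already been loaded with `load_model()`:

```python
# Compare generations at different temperature values (model must already be loaded)
prompt = " دِتم مصمت۔یم بگُل غلام چھُ آں تس اکھ حمزہ گویی"  # same example prompt as above

for temp in (0.7, 1.0, 1.3):
    sample = generate_text(
        model,
        prompt,
        word_to_int,
        int_to_word,
        sequence_length,
        max_length=50,
        temperature=temp,
    )
    print(f"temperature={temp}:\n{sample}\n")
```

Lower temperatures tend to repeat high-probability words, while higher temperatures produce more varied but less predictable continuations.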