---
license: apache-2.0
datasets:
- agentlans/high-quality-english-sentences
- agentlans/finewebedu-sentences
language:
- en
base_model:
- agentlans/pythia-14m-finewebedu-sentences
pipeline_tag: text-generation
library_name: transformers
---

# Pythia-14M Fine-Tuned for High-Quality English Sentence Generation

This model is a fine-tuned version of the Pythia-14M language model, optimized for generating high-quality English sentences. It builds upon the base model [agentlans/pythia-14m-finewebedu-sentences](https://huggingface.co/agentlans/pythia-14m-finewebedu-sentences) and has been further trained on a curated dataset of well-formed English sentences, [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences).

## Model Description

The model is based on the Pythia-14M architecture, which is a relatively compact language model. It has been fine-tuned specifically for generating (mostly) grammatically correct and coherent English sentences across a variety of topics and styles.

## Intended Uses & Limitations

This model is designed for:

- Generating high-quality English sentences
- Completing partial sentences
- Assisting with writing tasks that require well-formed English

Limitations:

- Not suitable for tasks requiring deep domain knowledge
- May struggle with very long-form text generation
- Fails on non-English text
- It's tiny, so don't expect too much

## Training Data

The model was fine-tuned on a combination of datasets:

- Web-scraped educational content (finewebedu)
- High-quality web text (fineweb)
- Filtered Common Crawl data (C4)

For the composition and preprocessing of the training data, see [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences).
## How To Use

To generate 10 random sentences starting from an empty string on a CUDA device:

```python
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='agentlans/pythia-14m-sentences', device='cuda')
set_seed(1234)

results = generator("", max_length=100, num_return_sequences=10, do_sample=True)
for x in results:
    print(x['generated_text'])
```

Output:

```text
The most common cause of the number of diseases is the common cause of death.
And there are many people in the war.
The average household income is 35.5 percent.
He was the most influential theologians of the country in this world.
On the other hand, the students will be able to learn the value of the current and the time.
However, the effect of the study would be greater than that of a drug-related drug drug.
To understand today, our nation's largest international commitment to the use of new technology and technology across the country.
On Sunday, the UK was first held in the state of the Australian, where a foreign trade union was used since the first year.
I've said that the program is most effective in education in the middle of the world.
So a year, it is important to identify a community where a student has a disability.
```

To let the model continue a sentence:

```python
results = generator("The meaning of life is", max_length=100, num_return_sequences=10, do_sample=True)
for x in results:
    print(x['generated_text'])
```

Output:

```text
The meaning of life is one of the most extraordinary stories of the great world, and some of the most brilliant examples of the world of science.
The meaning of life is to develop.
The meaning of life is to the person, or to make it a personal impression of what is the case for the reader.
The meaning of life is no longer the most important concept of the human language.
The meaning of life is the form of a personal or personal character.
The meaning of life is the world's real and our future.
The meaning of life is the true one of the nation's largest historical experiences.
The meaning of life is the basis of the Church's first, the church of the Holy Spirit, and a living faith.
The meaning of life is that the law requires that the truth be lost.
The meaning of life is the best reason for the poor and poor economy.
```

## Training Procedure

The model was trained using the following hyperparameters:

- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- LR scheduler: linear
- Number of epochs: 3.0

## Evaluation Results

On the evaluation set, the model achieved:

- Loss: 6.2540
- Accuracy: 0.1776

## Ethical Considerations

As with any text generation model, users should be aware of potential biases in the training data that may be reflected in the model's outputs. The model should not be used to generate or propagate harmful content.

## Technical Specifications

- Library: Transformers 4.45.1
- Framework: PyTorch 2.4.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.20.0
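A quick way to interpret the evaluation loss reported above: it is a per-token cross-entropy, so exponentiating it gives the model's perplexity on the evaluation set (assuming natural-log loss, as reported by the Transformers `Trainer`):

```python
import math

eval_loss = 6.2540  # per-token cross-entropy from the evaluation set above

# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.1f}")  # roughly 520
```

A perplexity in this range is unsurprising for a 14M-parameter model and matches the "it's tiny" caveat above.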