synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
Runtime error8👀Note Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
Running on Zero84🐠Instruction Synthesizer
Note Instruction pre-training is a new approach that enhances LLM pretraining by using instruction-response pairs from an instruction synthesizer instead of raw data.
Running on Zero69🐦⬛Magpie
Note Magpie is a data synthesis pipeline that creates high-quality alignment data without relying on prompt engineering or seed questions. Instead, it generates instruction data by prompting aligned LLMs with a pre-query template.
Running on Zero7💬Bonito
Note This is a demo for Bonito, an open-source model for conditional task generation, which involves converting unannotated text into task-specific synthetic instruction tuning data.