Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published 4 days ago • 69
LLM Compiler Collection Meta LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning. • 4 items • Updated 5 days ago • 128
Article Going multimodal: How Prezi is leveraging the Hub and the Expert Support Program to accelerate their ML roadmap 14 days ago • 6
Embedding Model Datasets Collection A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 66 items • Updated 12 days ago • 43
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published 12 days ago • 75
FP8 LLMs for vLLM Collection Accurate FP8 quantized models by Neural Magic, ready for use with vLLM! • 15 items • Updated about 7 hours ago • 16
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 25
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 6
Magpie-Pro Collection Dataset built with Meta Llama 3 70B. Models are fine-tuned from Llama 3 8B. • 8 items • Updated 1 day ago • 14
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback Paper • 2406.00888 • Published about 1 month ago • 29
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark Paper • 2406.01574 • Published 29 days ago • 42
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated 12 days ago • 20
Article Synthetic dataset generation techniques: generating custom sentence similarity data By davanstrien • May 23 • 12
Article Train custom AI models with the trainer API and adapt them to 🤗 By not-lain • 3 days ago • 28
Model Merging Collection Model Merging is a very popular technique nowadays in LLMs. Here is a chronological list of papers on the space that will help you get started with it! • 30 items • Updated 20 days ago • 192
Article seemore: Implement a Vision Language Model from Scratch By AviSoori1x • 9 days ago • 48
TransformerFAM: Feedback attention is working memory Paper • 2404.09173 • Published Apr 14 • 42
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences Paper • 2404.03715 • Published Apr 4 • 58
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order Paper • 2404.00399 • Published Mar 30 • 40
DIBT Prompt collective SPIN Collection This collection contains resources related to the replication of SPIN with the dibt prompt collective dataset • 8 items • Updated Mar 12 • 7
Awesome Document AI Collection A collection of open-source document AI • 27 items • Updated Mar 11 • 41
Pre-trained LMs ES Collection Monolingual language models pre-trained on Spanish and related languages. • 20 items • Updated May 6 • 6
Instruction-Tuned Models ES Collection Instruction-tuned models in Spanish and other related languages • 7 items • Updated May 6 • 4
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper • 2402.13753 • Published Feb 21 • 106
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models Paper • 2402.13064 • Published Feb 20 • 46
User-LLM: Efficient LLM Contextualization with User Embeddings Paper • 2402.13598 • Published Feb 21 • 18
Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains Paper • 2402.05140 • Published Feb 6 • 18
Instruction-tuned Language Models are Better Knowledge Learners Paper • 2402.12847 • Published Feb 20 • 24
OLMo Suite Collection Artifacts for the first set of OLMo models. • 14 items • Updated 7 days ago • 37
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss Paper • 2402.10790 • Published Feb 16 • 40
AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts Paper • 2402.07625 • Published Feb 12 • 10
datasets-SPIN Collection Generated synthetic data used to finetune SPIN. • 8 items • Updated Feb 9 • 10
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback Paper • 2402.01391 • Published Feb 2 • 41
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception Paper • 2401.16158 • Published Jan 29 • 16
The Rise and Potential of Large Language Model Based Agents: A Survey Paper • 2309.07864 • Published Sep 14, 2023 • 5
Canonical models Collection This collection lists all the historical (pre-"Hub") canonical model checkpoints, i.e. repos that were not under an org or user namespace • 68 items • Updated Feb 13 • 13
Improving Text Embeddings with Large Language Models Paper • 2401.00368 • Published Dec 31, 2023 • 77
haiku Collection This is a collection of synthetic datasets built to help improve the ability of open language models to better write haikus through the use of DPO • 3 items • Updated 12 days ago • 4