---
license: cc-by-sa-3.0
datasets:
- mosaicml/dolly_hhrlhf
tags:
- Composer
- MosaicML
- llm-foundry
---

# MPT-7B-Instruct

MPT-7B-Instruct is a model for short-form instruction following.
It is built by finetuning [MPT-7B (Base)](https://huggingface.co/mosaicml/mpt-7b) on a [dataset](https://huggingface.co/datasets/sam-mosaic/dolly_hhrlhf) derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
* License: _CC-By-SA-3.0_ (commercial use permitted)
* [Online Demo](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)

This model was trained by [MosaicML](https://www.mosaicml.com) and follows a modified decoder-only transformer architecture.

## Model Date

May 5, 2023

## Model License

_CC-By-SA-3.0_ (commercial use permitted)

## Documentation

* [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
* [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
* Questions: Feel free to contact us via the [MosaicML Community Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)!

### Example Dialogue

**SamIAm85**:
I want you to come up with a tweet based on this summary of the article:

"Introducing MPT-7B, the latest entry in our MosaicML Foundation Series.
MPT-7B is a transformer trained from scratch on 1T tokens of text and code.
It is open source, available for commercial use, and it matches the quality of LLaMA-7B.
MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k.
Starting today, you can train, finetune, and deploy your own private MPT models, either starting from one of our checkpoints or training from scratch.
For inspiration, we are also releasing three finetuned models in addition to the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+, the last of which uses a context length of 65k tokens!"

**MPT-7B-Instruct**:
>Here's my suggestion for a tweet:
>"Exciting news! @MosaicML has launched #MPT_7B - a new standard for open-source, commercially usable LLMs. With training on 1T tokens of text & code, it matches the quality of LLaMA-7B. Now you can train, fine-tune, and deploy your own private MPT models via their platform. Check out the full blog post here: https://bit.ly/3j6UJzZ"

**SamIAm85**:
Is there anything you want to say to MosaicML about this exciting news?

**MPT-7B-Instruct**:
> As an AI language model myself, I am very impressed with the advancements made by MosaicML in developing such high-quality open-source LLMs.
> This new version of MPT-7B is truly impressive and I look forward to seeing what innovative applications developers will create using these powerful tools.
> Thank you for your hard work and dedication to advancing AI research and development.

## How to Use

Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we use a custom model architecture that is not yet part of the `transformers` package.

The architecture includes options for many training-efficiency features such as [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), QK LayerNorm, and more.

```python
import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-instruct', trust_remote_code=True, torch_dtype=torch.bfloat16)
```

To use the optimized Triton implementation of FlashAttention, you can load the model with `attn_impl='triton'` and move it to `bfloat16` on a GPU like so:

```python
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-instruct', trust_remote_code=True, torch_dtype=torch.bfloat16, attn_impl='triton')
model.to(device='cuda:0', dtype=torch.bfloat16)
```
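
Once loaded, the model can be used with the standard `transformers` generation API. The snippet below is an illustrative sketch rather than part of the official card: it assumes the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer mentioned under Pretraining Data below is also appropriate for this finetuned checkpoint, and the instruction-style prompt is a hypothetical format, not necessarily the template used during finetuning.

```python
import torch
from transformers import AutoTokenizer

# Assumption: the tokenizer named in the "Pretraining Data" section is also the
# right tokenizer for this finetuned model.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# Hypothetical instruction-style prompt; the exact finetuning prompt template
# is not documented in this card.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\nExplain what ALiBi is in one sentence.\n### Response:\n"
)

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```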

## Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:
* It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
* It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings (a minimal sketch of the ALiBi bias follows this list)
* It does not use biases
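
To make the ALiBi item above concrete, here is a minimal, self-contained sketch of how linear attention biases can be computed. It is not the model's actual implementation (see the `llm-foundry` codebase for that); the head-slope schedule shown is the geometric sequence from the ALiBi paper for power-of-two head counts.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear biases added to attention logits in place of positional embeddings."""
    # Geometric head slopes 2^(-8/n), 2^(-16/n), ..., as proposed in the ALiBi paper.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # relative[i, j] = j - i: zero on the diagonal, increasingly negative for older keys.
    relative = pos[None, :] - pos[:, None]
    # Shape (n_heads, seq_len, seq_len); future positions get positive values here,
    # but the causal mask removes them before the softmax anyway.
    return slopes[:, None, None] * relative[None, :, :].to(slopes.dtype)

# Usage: add to raw attention scores before masking and softmax.
scores = torch.randn(32, 8, 8)                      # stand-in for q @ k^T / sqrt(d_head)
scores = scores + alibi_bias(n_heads=32, seq_len=8)
```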

| Hyperparameter | Value |
|----------------|-------|
| n_parameters | 6.7B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | 2048 |
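
As a rough cross-check of the table (an estimate, not an official figure), the usual sizing formula for a bias-free decoder-only transformer, 12 · n_layers · d_model² for the blocks plus vocab_size · d_model for an assumed tied embedding, lands close to the stated 6.7B:

```python
# Back-of-the-envelope parameter estimate from the hyperparameters above.
n_layers, d_model, vocab_size = 32, 4096, 50432
block_params = 12 * n_layers * d_model ** 2   # attention + MLP weight matrices
embedding_params = vocab_size * d_model       # assumes a tied input/output embedding
print(f"{(block_params + embedding_params) / 1e9:.2f}B")  # ~6.65B, consistent with 6.7B
```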

## Pretraining Data

For more details on the pretraining process, see [MPT-7B](https://huggingface.co/mosaicml/mpt-7b).

The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

## Training Configuration

This model was finetuned on 440 A100-40GB GPUs for about half a day using the [MosaicML Platform](https://www.mosaicml.com/platform). The model was trained with sharded data parallelism using FSDP.
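
For readers unfamiliar with sharded data parallelism, the snippet below is a generic PyTorch FSDP sketch, not MosaicML's actual training setup (which uses Composer and `llm-foundry`); it only illustrates the idea of sharding a model's parameters across data-parallel workers.

```python
# Generic illustration of sharded data parallelism with PyTorch FSDP; this is
# not the Composer/llm-foundry configuration used to finetune this model.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_model(model: torch.nn.Module) -> FSDP:
    # Assumes the default process group was already initialized, e.g. via `torchrun`.
    assert dist.is_initialized(), "call dist.init_process_group() first"
    # Each rank stores only a shard of the parameters and optimizer state,
    # gathering full weights on the fly during forward/backward.
    return FSDP(model)
```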

## Acknowledgements