MosaicML's MPT-7B-Chat GGML
These files are GGML format model files for MosaicML's MPT-7B-Chat.
Please note that these GGMLs are not compatible with llama.cpp, or currently with text-generation-webui. Please see below for a list of tools known to work with these model files.
KoboldCpp just added GPU accelerated (OpenCL) support for MPT models, so that is the client I recommend using for these models.
Note: Please make sure you're using KoboldCpp version 1.32.3 or later, as a number of MPT-related bugs are fixed.
Repositories available
- 4, 5, and 8-bit GGML models for CPU+GPU inference
- Unquantised fp16 model in pytorch format, for GPU inference and for further conversions
Prompt template
<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.<|im_end|>
<|im_start|>user
prompt goes here<|im_end|>
<|im_start|>assistant
A note regarding context length: 4K
The base model has an 4K context length.
KoboldCpp supports 4K context if you manually set it to 4K by adjusting the text box above the slider, like in this example:
(Set it to 4K, not 8K for this model.)
Compatibilty
These files are not compatible with text-generation-webui, llama.cpp, or llama-cpp-python.
Currently they can be used with:
- KoboldCpp, a powerful inference engine based on llama.cpp, with good UI and GPU accelerated support for MPT models: KoboldCpp
- The ctransformers Python library, which includes LangChain support: ctransformers
- The LoLLMS Web UI which uses ctransformers: LoLLMS Web UI
- rustformers' llm
- The example
mpt
binary provided with ggml
As other options become available I will endeavour to update them here (do let me know in the Community tab if I've missed something!)
Tutorial for using LoLLMS Web UI
Provided files
Name | Quant method | Bits | Size | Max RAM required | Use case |
---|---|---|---|---|---|
mpt-7b-chat.ggmlv0.q4_0.bin | q4_0 | 4 | 16.85 GB | 19.35 GB | 4-bit. |
mpt-7b-chat.ggmlv0.q4_1.bin | q4_1 | 4 | 18.73 GB | 21.23 GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
mpt-7b-chat.ggmlv0.q5_0.bin | q5_0 | 5 | 20.60 GB | 23.10 GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
mpt-7b-chat.ggmlv0.q5_1.bin | q5_1 | 5 | 22.47 GB | 24.97 GB | 5-bit. Even higher accuracy, resource usage and slower inference. |
mpt-7b-chat.ggmlv0.q8_0.bin | q8_0 | 8 | 31.83 GB | 34.33 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
Discord
For further support, and discussions on these models and AI in general, join us at:
Thanks, and how to contribute.
Thanks to the chirper.ai team!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- Patreon: https://patreon.com/TheBlokeAI
- Ko-Fi: https://ko-fi.com/TheBlokeAI
Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
Patreon special mentions: zynix , ya boyyy, Trenton Dambrowitz, Imad Khwaja, Alps Aficionado, chris gileta, John Detwiler, Willem Michiel, RoA, Mano Prime, Rainer Wilmers, Fred von Graf, Matthew Berman, Ghost , Nathan LeClaire, Iucharbius , Ai Maven, Illia Dulskyi, Joseph William Delisle, Space Cruiser, Lone Striker, Karl Bernard, Eugene Pentland, Greatston Gnanesh, Jonathan Leane, Randy H, Pierre Kircher, Willian Hasse, Stephen Murray, Alex , terasurfer , Edmond Seymore, Oscar Rangel, Luke Pendergrass, Asp the Wyvern, Junyu Yang, David Flickinger, Luke, Spiking Neurons AB, subjectnull, Pyrater, Nikolai Manek, senxiiz, Ajan Kanaga, Johann-Peter Hartmann, Artur Olbinski, Kevin Schuppel, Derek Yates, Kalila, K, Talal Aujan, Khalefa Al-Ahmad, Gabriel Puliatti, John Villwock, WelcomeToTheClub, Daniel P. Andersen, Preetika Verma, Deep Realms, Fen Risland, trip7s trip, webtim, Sean Connelly, Michael Levine, Chris McCloskey, biorpg, vamX, Viktor Bowallius, Cory Kujawski
Thank you to all my generous patrons and donaters!
Original model card: MosaicML's MPT-7B-chat
MPT-7B-Chat
MPT-7B-Chat is a chatbot-like model for dialogue generation. It was built by finetuning MPT-7B on the ShareGPT-Vicuna, HC3, Alpaca, HH-RLHF, and Evol-Instruct datasets.
- License: CC-By-NC-SA-4.0 (non-commercial use only)
- Demo on Hugging Face Spaces
This model was trained by MosaicML and follows a modified decoder-only transformer architecture.
Model Date
May 5, 2023
Model License
CC-By-NC-SA-4.0 (non-commercial use only)
Documentation
- Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs
- Codebase (mosaicml/llm-foundry repo)
- Questions: Feel free to contact us via the MosaicML Community Slack!
Example Dialogue
SamIAm85:
I want you to come up with a tweet based on this summary of the article: "Introducing MPT-7B, the latest entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on IT tokens of text and code. It is open source and it matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Starting today, you can train, finetune, and deploy your own private MPT models, either starting from one of our checkpoints or training from scratch. For inspiration, we are also releasing three finetuned models in addition to the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-Story Writer-65k+, the last of which uses a context length of 65k tokens!
MPT-7B-Instruct:
Here's my suggestion for a tweet: "Exciting news! @MosaicML has launched #MPT_7B - a new standard for open-source LLMs. With training on 17 tokens of text & code, it matches the quality of LLaMA-7B. Now you can train, fine-tune, and deploy your own private MPT models via their platform. Check out the full blog post here: https://bit.ly/3j6UJzZ"
SamIAm85:
Is there anything you want to say to MosaicML about this exciting news?
MPT-7B-Instruct:
As an Al language model myself, I am very impressed with the advancements made by MosaicML in developing such high-quality open-source LLMs. This new version of MPT-7B is truly impressive and I look forward to seeing what innovative applications developers will create using these powerful tools. Thank you for your hard work and dedication to advancing Al research and development.
How to Use
This model is best used with the MosaicML llm-foundry repository for training and finetuning.
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
'mosaicml/mpt-7b-chat',
trust_remote_code=True
)
Note: This model requires that trust_remote_code=True
be passed to the from_pretrained
method.
This is because we use a custom MPT
model architecture that is not yet part of the Hugging Face transformers
package.
MPT
includes options for many training efficiency features such as FlashAttention, ALiBi, QK LayerNorm, and more.
To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0
) with attn_impl='triton'
and with bfloat16
precision:
import torch
import transformers
name = 'mosaicml/mpt-7b-chat'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0' # For fast initialization directly on GPU!
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
torch_dtype=torch.bfloat16, # Load model weights in bfloat16
trust_remote_code=True
)
Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
import transformers
name = 'mosaicml/mpt-7b-chat'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
trust_remote_code=True
)
This model was trained with the EleutherAI/gpt-neox-20b tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
The model can then be used, for example, within a text-generation pipeline.
Note: when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
from transformers import pipeline
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
with torch.autocast('cuda', dtype=torch.bfloat16):
print(
pipe('Here is a recipe for vegan banana bread:\n',
max_new_tokens=100,
do_sample=True,
use_cache=True))
Model Description
The architecture is a modification of a standard decoder-only transformer.
The model has been modified from a standard transformer in the following ways:
- It uses FlashAttention
- It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings
- It does not use biases
Hyperparameter | Value |
---|---|
n_parameters | 6.7B |
n_layers | 32 |
n_heads | 32 |
d_model | 4096 |
vocab size | 50432 |
sequence length | 2048 |
Training Configuration
This model was trained on 8 A100-80GBs for about 8.2 hours, followed by training for 6.7 hours on 32 A100-40GBs using the MosaicML Platform. The model was trained with sharded data parallelism using FSDP and used the AdamW optimizer.
Limitations and Biases
The following language is modified from EleutherAI's GPT-NeoX-20B
MPT-7B-Chat can produce factually incorrect output, and should not be relied on to produce factually accurate information. MPT-7B-Chat was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Acknowledgements
This model was finetuned by Sam Havens and the MosaicML NLP team
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please cosult an attorney before using this model for commercial purposes.
MosaicML Platform
If you're interested in training and deploying your own MPT or LLMs on the MosaicML Platform, sign up here.
Citation
Please cite this model using the following format:
@online{MosaicML2023Introducing,
author = {MosaicML NLP Team},
title = {Introducing MPT-7B: A New Standard for Open-Source,
ly Usable LLMs},
year = {2023},
url = {www.mosaicml.com/blog/mpt-7b},
note = {Accessed: 2023-03-28}, % change this date
urldate = {2023-03-28} % change this date
}