Question regarding token length and memory

#34
by jdc4429 - opened

Can we get an idea of how much VRAM is needed for the different token lengths?
i.e. the maximum tokens possible with 8GB, 12GB, 16GB, 24GB, 48GB of VRAM...

Hello!

I'm also interested in this. Besides that, is it possible to run it on a desktop GPU, like an RTX 4070, even if it takes longer than usual?
Also, do you at MosaicML plan on launching a smaller version of StoryWriter, like a 32k context one?

@rodrigofarias , you'll notice that the MPT-7B-StoryWriter model has roughly the same memory footprint as MPT-7B and MPT-7B-Instruct/Chat, which is roughly 12 GB for the model weights. This is because the linear bias matrices in ALiBi can simply be grown or shrunk to match the desired context length (see this video by Ofir Press). If you would like to work with a 32k context length, you can simply do:

import transformers

name = 'mosaicml/mpt-7b-storywriter'

# Override the maximum sequence length before loading the model
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 32768  # (input + output) tokens can be defined by the user

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True
)
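For completeness, here is a minimal generation sketch built on the model loaded above. The prompt and max_new_tokens values are only illustrative, and the tokenizer is assumed to be the one published in the same repo:

tokenizer = transformers.AutoTokenizer.from_pretrained(name)

prompt = "Once upon a time"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors='pt')

# Generate a continuation; the length here is just an example
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))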

The distinction here is that MPT-7B-StoryWriter has been finetuned on much longer texts than those used for the MPT-7B-Instruct/Chat finetuning.

@jdc4429 for inference, once the input sequence length is increased, the forward pass takes up more memory (roughly quadratic in the sequence length, since the QK^T matrix has shape [batch, ..., T, T], where T is the maximum sequence length). The linear bias ALiBi matrix has the same dimensions as QK^T and can be increased/decreased accordingly. We don't have a table of the memory requirements for max sequence lengths of 2048, 4096, 8192, etc., but it should be straightforward to profile.
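Since profiling came up, here is a rough sketch of how one might measure peak VRAM at different sequence lengths. The sequence lengths and batch size are arbitrary assumptions, and the model is loaded in bfloat16 so the weights fit more comfortably on a single GPU:

import torch
import transformers

name = 'mosaicml/mpt-7b-storywriter'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 65536

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True
).to('cuda')

for seq_len in [2048, 4096, 8192, 16384]:  # example lengths to profile
  torch.cuda.reset_peak_memory_stats()
  # Random token ids stand in for a real prompt of this length
  dummy = torch.randint(0, config.vocab_size, (1, seq_len), device='cuda')
  with torch.no_grad():
    model(dummy)
  peak_gb = torch.cuda.max_memory_allocated() / 1e9
  print(f'seq_len={seq_len}: peak {peak_gb:.1f} GB')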

There are also a few community efforts to quantize MPT-7B-StoryWriter that you might find interesting:

sam-mosaic changed discussion status to closed
