Out of memory when passing large external memories
Hi (again :)),
I'm having trouble when I try to run model.generate with a lot of external memories (10 documents that together come to approximately 100,000 words). Even when I run with topk=0 it runs out of memory after an hour and does not finish a single question. Ideally, I would like to be able to run the model over the full ~100,000 tokens with topk=10. I am using an instance with 72 GiB of memory.
Here is how I am loading my model:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Build the config before instantiating the model
configuration = transformers.AutoConfig.from_pretrained("normalcomputing/extended-mind-mpt-7b", trust_remote_code=True)
configuration.max_seq_len = 2048
configuration.init_device = "meta"
configuration.attn_config['alibi'] = True
configuration.attn_config['attn_impl'] = 'torch'  # the string 'torch', not the torch module
configuration.use_cache = True

generator = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", device_map="cpu", config=configuration, trust_remote_code=True)
generator.empty_memories()

tokenizer = AutoTokenizer.from_pretrained("normalcomputing/extended-mind-mpt-7b", padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
And this is the tokenisation and generation:
import random
from tqdm import tqdm

device = "cpu"  # matches device_map="cpu" above

for question, question_index in tqdm(zip(question_data, question_indices), total=len(question_indices)):
    print(question['answers'])
    userprompt = question['question']
    # Get the documents for this question (copied, so the dataset isn't mutated)
    docs = list(question['contexts'])
    # Add the contexts of all 10 questions, in random order
    doc_indices = random.sample(range(10), 10)
    for i in doc_indices:
        docs.extend(data[i]['contexts'])
    # Create external memories
    external_memories = " ".join(docs)
    memory_ids = tokenizer(external_memories, return_tensors='pt')['input_ids'].to(device)
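The memories are then attached and generation runs; a simplified sketch of that part of the loop (I'm assuming the memories can be attached after loading, whereas the model card passes external_memories to from_pretrained, and that topk is forwarded to generate, so adjust to however your version does it):

    # Sketch of the rest of the loop; the memory-attachment API is an assumption
    generator.empty_memories()
    generator.external_memories = memory_ids
    input_ids = tokenizer(userprompt, return_tensors='pt')['input_ids'].to(device)
    output = generator.generate(input_ids, max_length=input_ids.shape[-1] + 100, topk=10)
    print(tokenizer.decode(output[0], skip_special_tokens=True))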
If you have any input on what should be changed in order to run the model with this many memories, I would be very happy to hear it! Is there, for example, a possibility of sending the tokenised external memories into the model in batches?
Hey! I'd recommend using memory_type=faiss, for starters. You can also try increasing the stride parameter in the generate_cache method. This may result in lower-quality memories, but will be faster! The stride is used analogously to the one in this tutorial, if you want to check it out: https://huggingface.co/docs/transformers/en/perplexity. Let me know if that helps!
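In code, something like this (a sketch, not exact: whether memory_type lives on the config and the precise generate_cache signature may differ in your version of the modeling code, so check there):

# Assumption: memory_type is set on the config; faiss stores the memory
# vectors in an index instead of as in-memory tensors
configuration.memory_type = 'faiss'
generator = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", device_map="cpu", config=configuration, trust_remote_code=True)
# A larger stride means fewer forward passes over the memory tokens
# (roughly input_length // stride of them), at some cost to memory quality
generator.generate_cache(memory_ids, stride=2048)  # assumption: memory tokens passed positionally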
Thanks a lot for your quick response!
I have tried setting memory_type=faiss and increasing stride to 2048; however, it still runs out of memory. Is there a way to estimate how much memory is expected to be used with large external memories? Then I can try to upgrade my resources to match these requirements :)
If you're using faiss, the main cost is generating the cache before you pass the vectors to the db store. That cost (if you're using stride=2048) is roughly n = input_length // 2048 passes through the model. (You'll need memory for the model + ~2048 inputs, as well as for the growing vector db.)
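To put rough numbers on it (my assumptions: ~1.3 tokens per word, fp32 CPU tensors, full keys and values kept for every layer; MPT-7B has 32 layers at d_model=4096):

n_tokens = int(100_000 * 1.3)        # ~130k memory tokens
n_passes = n_tokens // 2048          # forward passes with stride=2048
bytes_per_token = 32 * 2 * 4096 * 4  # layers * (K,V) * d_model * fp32 = 1 MiB
cache_gib = n_tokens * bytes_per_token / 2**30
print(n_passes, f"{cache_gib:.0f} GiB")  # 63 passes, ~127 GiB

So at fp32 the raw key-value vectors alone would already exceed 72 GiB; halving the precision or reducing what's kept per token shrinks that proportionally. Hope that helps!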
It does indeed! Thanks :)