How much VRAM is needed to run this model, for example at the bare minimum sequence length?
I have 3 GPUs (an NVIDIA 4070 Ti with 12 GB, an NVIDIA 4060 Ti with 16 GB, and an NVIDIA Tesla T4 with 16 GB), and I can't get the model to split across them using this:
"
from transformers import AutoModel
from torch.nn import DataParallel

# NV-Embed-v1 ships custom modeling code, so trust_remote_code is required
embedding_model = AutoModel.from_pretrained("nvidia/NV-Embed-v1", trust_remote_code=True)

# wrap each top-level submodule in DataParallel
for module_key, module in embedding_model._modules.items():
    embedding_model._modules[module_key] = DataParallel(module)
"
and changing the batch size using this:
"
# get the embeddings with a DataLoader (splitting the datasets into multiple mini-batches);
# queries, passages, query_prefix and passage_prefix are assumed to be defined earlier
batch_size = 2
max_length = 512
query_embeddings = embedding_model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length)
passage_embeddings = embedding_model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length)
"
Even with the max embedding length set to 512, I still get OOM on both the 4070 Ti and the 4060 Ti. So how much VRAM does this model need, and what can I do to run it on my system?
You will realize it only loaded onto one GPU, not all three. That's why you get the OOM error.
How could we load it across multiple GPUs, or is that not possible out of the box?
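One approach that is sometimes used (a sketch, not from this thread) is to let accelerate shard the checkpoint across all available GPUs with device_map="auto" instead of wrapping modules in DataParallel, since DataParallel replicates the whole model on every card and does not reduce per-GPU memory. The per-GPU caps below are illustrative guesses for these cards, not measured values, and this assumes accelerate is installed:
"
import torch
from transformers import AutoModel

embedding_model = AutoModel.from_pretrained(
    "nvidia/NV-Embed-v1",
    trust_remote_code=True,      # the repo ships custom modeling code
    torch_dtype=torch.float16,   # half-precision weights roughly halve memory vs. float32
    device_map="auto",           # let accelerate spread layers across the visible GPUs
    max_memory={0: "11GiB", 1: "15GiB", 2: "15GiB"},  # illustrative caps, leave some headroom per card
)
"
With device_map the layers live on different cards and accelerate moves activations between them automatically, so no DataParallel wrapping is needed; whether encoding then fits still depends on the batch size and max_length you pass.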