metadata

license: apache-2.0

Model Card for Zamba

Zamba-7B-v1-phase1 is a hybrid model between Mamba, a state-space model, and transformers. It uses a mamba backbone with a shared transformer layer every 6 blocks. Zamba was trained using next-token prediction. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba-7B-v1-phase-1 was pre-trained on 1T tokens of text and code data sourced from open web-datasets. Unlike Zamba-v1, this model represents the checkpoint after pure prertaining only on web-datasets. We envision its use primarily as a comparison tool to explore the effects of our annealing process.

Quick start

Presequities

Zamba requires you use transformers version 4.39.0 or higher:

pip install transformers>=4.39.0

In order to run optimized Mamba implementations on a CUDA device, you first need to install mamba-ssm and causal-conv1d:

pip install mamba-ssm causal-conv1d>=1.2.0

You can run the model not using the optimized Mamba kernels, but it is not recommended as it will result in significantly higher latency.

To run on CPU, please specify use_mamba_kernels=False when loading the model using AutoModelForCausalLM.from_pretrained.

Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1-phase1")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1-phase1", device_map="auto", torch_dtype=torch.bfloat16)

input_text = "A funny prompt would be "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Notice

Zamba is a pretrained base model and therefore does not have any moderation mechanism.