Share your recommended configurations for speed.
#76 by archonlith
I'm just getting started running some generation scenarios, but I'd rather not spend all day searching for the parameters that best balance speed and accuracy. What works for you, configuration-wise?
I have been testing with:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in bfloat16 up front; passing torch_dtype to the pipeline has no
# effect once the model object is already instantiated.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

gen_cfg = GenerationConfig.from_model_config(model.config)
gen_cfg.max_new_tokens = 2000
gen_cfg.max_time = 90.0  # stop generation after 90 seconds

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    generation_config=gen_cfg,
)
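
For what it's worth, this is roughly how I've been calling it, following the sampling settings from the Falcon model card; the prompt is just a placeholder:

# Hypothetical prompt; do_sample/top_k values mirror the Falcon model card example.
outputs = pipeline(
    "Write a short poem about the sea.",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])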