compressed-llm/llama-2-13b-chat-sqllm
Text Generation
•
Updated
•
5
None defined yet.
This repo contains compressed LLMs used in the Decoding Compressed Trust project. The models are prepared by Visual Informatics Group @ University of Texas at Austin (VITA-group) and Center for Applied Scientific Computing at LLNL.
License: MIT License
Simplified lists:
Setup environment
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq # for gptq
How to use pruned models
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2
model_path = f'compressed-llm/{base_model}_{comp_method}'
model = AutoModelForCausalLM.from_pretrained(
model_path,
revision=f's{comp_degree}',
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
How to use wanda+gptq models
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model = AutoGPTQForCausalLM.from_quantized(
model_path,
# inject_fused_attention=False, # or
disable_exllama=True,
device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
tokenizer.decode(outputs[0])
How to use gptq models
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
# model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
# tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'
model = AutoGPTQForCausalLM.from_quantized(
model_path,
# inject_fused_attention=False, # or
disable_exllama=True,
device_map='auto',
revision='2bit_128g',
)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
tokenizer.decode(outputs[0])
If you are using models in this hub, please consider citing our papers.
@article{hong2024comptrust,
title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Bo Li},
journal={arXiv},
year={2024}
}
Some of the models were used in previous publications.
@article{jaiswal2023emergence,
title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
journal={arXiv},
year={2023}
}
@article{jaiswal2023compressing,
title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
author={Ajay Jaiswal and Zhe Gan and Xianzhi Du and Bowen Zhang and Zhangyang Wang and Yinfei Yang},
year={2023},
journal={arXiv},
}
Main credits to Ajay Jaiswal, Jinhao Duan, Zhangheng Li and Junyuan Hong. We also appreciate Zhenyu Zhang, Lu Yin, and Shiwei Liu in some preparations.
For any question, please contact Junyuan Hong.