4-bit Quantized Llama 3 Model
Description
This repository hosts the 4-bit quantized version of the Llama 3 model. Optimized for reduced memory usage and faster inference, this model is suitable for deployment in environments where computational resources are limited.
Model Details
- Model Type: Transformer-based language model.
- Quantization: 4-bit precision.
- Advantages:
  - Memory Efficiency: Stores weights in 4 bits, cutting weight memory to roughly a quarter of a 16-bit model and allowing deployment on devices with limited RAM.
  - Inference Speed: Can speed up inference, though the gain depends on the hardware's support for low-bit computation.
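As a rough illustration of the memory claim, the sketch below estimates weight storage at different precisions. The parameter count of 8 billion is an assumption taken from the "8b" in the repository name, and the helper `weight_memory_gb` is hypothetical. The estimate covers weights only; it ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 8 billion parameters, per the "8b" in the repo name.
NUM_PARAMS = 8e9

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the 4-bit weights need about 4 GB, compared with about 16 GB at 16-bit precision.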
How to Use
To utilize this model efficiently, follow the steps below:
Loading the Quantized Model
Load the model with load_in_4bit=True so that the weights are kept in 4-bit precision (this requires the bitsandbytes and accelerate packages):
from transformers import AutoModelForCausalLM
model_4bit = AutoModelForCausalLM.from_pretrained("SweatyCrayfish/llama-3-8b-quantized", device_map="auto", load_in_4bit=True)
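Once loaded, the model can be used for text generation like any other causal language model. The following is a minimal sketch, not the author's documented workflow: `generate_text` and `demo` are hypothetical helpers, the prompt is arbitrary, and it assumes the repository also ships tokenizer files. Running `demo` would need transformers, bitsandbytes, and a CUDA-capable GPU.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Tokenize a prompt, generate a continuation, and decode it to text."""
    # Move the tokenized inputs to the device the model was placed on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # generate() returns a batch; take the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def demo() -> None:
    """Hypothetical end-to-end run; not called automatically."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "SweatyCrayfish/llama-3-8b-quantized"
    # Assumption: tokenizer files are available in the same repository.
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(
        repo, device_map="auto", load_in_4bit=True
    )
    print(generate_text(model, tokenizer, "Quantization reduces memory use because"))
```

The decoded string includes the prompt followed by the generated continuation, since the output sequence from `generate` starts with the input tokens.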
Adjusting Precision of Components
Components that are not quantized, such as layer norms, are converted to torch.float16 by default. Pass torch_dtype to choose a different precision:
import torch
from transformers import AutoModelForCausalLM
model_4bit = AutoModelForCausalLM.from_pretrained("SweatyCrayfish/llama-3-8b-quantized", load_in_4bit=True, torch_dtype=torch.float32)
print(model_4bit.model.layers[-1].post_attention_layernorm.weight.dtype)
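Instead of inspecting a single layer, it can help to summarize the dtypes of all parameters at once. The sketch below uses a hypothetical helper, `dtype_summary`, that works with any object exposing `.parameters()`, as PyTorch modules do. With bitsandbytes, quantized weights typically appear as packed torch.uint8 tensors, so the element count for that dtype reflects packed storage and not the original parameter count.

```python
from collections import Counter


def dtype_summary(model) -> Counter:
    """Count parameter elements per dtype for an object exposing .parameters()."""
    counts = Counter()
    for param in model.parameters():
        counts[str(param.dtype)] += param.numel()
    return counts


# Usage with the model loaded above (needs the actual model in memory):
# for dtype, n_elements in dtype_summary(model_4bit).most_common():
#     print(dtype, n_elements)
```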
Citation
Original repository and citations:
@article{llama3modelcard,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}