---
license: apache-2.0
language:
- zh
- en
tags:
- code
---
# Chinese-CodeLlama-7B-PT

Starting from Chinese-LLaMA-2-7B, we further expanded the vocabulary from 55296 to 75548 tokens; most of the added tokens are code tokens. On [MBPP](https://huggingface.co/datasets/mbpp), we measured a tokenizer compression rate of 38.59%.
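
This card does not define "compression rate", so the following is only a minimal sketch under one plausible reading: the fraction of tokens saved when the expanded tokenizer encodes the same corpus as the original one. The function name `compression_rate` and the per-sample token counts are hypothetical and are not measurements from MBPP.

```python
def compression_rate(baseline_token_counts, expanded_token_counts):
    """Fraction of tokens saved by the expanded tokenizer over a corpus.

    Assumed definition: 1 - (tokens after expansion / tokens before expansion).
    """
    baseline_total = sum(baseline_token_counts)
    expanded_total = sum(expanded_token_counts)
    if baseline_total == 0:
        raise ValueError("baseline corpus produced no tokens")
    return 1.0 - expanded_total / baseline_total


# Vocabulary sizes as stated in this card.
OLD_VOCAB_SIZE = 55296
NEW_VOCAB_SIZE = 75548
added_tokens = NEW_VOCAB_SIZE - OLD_VOCAB_SIZE

# Hypothetical per-sample token counts, for illustration only.
baseline_counts = [100, 60, 40]
expanded_counts = [60, 36, 24]
rate = compression_rate(baseline_counts, expanded_counts)

print(f"added tokens: {added_tokens}")
print(f"compression rate: {rate:.2%}")
```

In practice the two lists would come from encoding each MBPP sample with the original and the expanded tokenizer and recording the sequence lengths.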

We pre-trained the model with LoRA at rank 8. The trainable LoRA layers are `q_proj` and `v_proj`, while the `embed_tokens` and `lm_head` layers were trained with full parameters. All trainable parameters are float32.
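
To give a rough sense of scale for this setup, the sketch below counts the trainable parameters. It assumes the standard LLaMA-2-7B shape (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj`, untied `embed_tokens` and `lm_head`), none of which is stated in this card, so treat the result as an estimate rather than a reported figure.

```python
# Assumed LLaMA-2-7B architecture values (not stated in this card).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values taken from this card.
VOCAB_SIZE = 75548
LORA_RANK = 8
LORA_TARGETS = ("q_proj", "v_proj")
BYTES_PER_FLOAT32 = 4


def lora_params_per_module(in_features, out_features, rank):
    """LoRA adds two low-rank matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


# LoRA adapters on q_proj and v_proj in every decoder layer.
lora_params = (
    NUM_LAYERS
    * len(LORA_TARGETS)
    * lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
)

# embed_tokens and lm_head are trained in full: one vocab x hidden matrix each.
full_params = 2 * VOCAB_SIZE * HIDDEN_SIZE

trainable_params = lora_params + full_params
trainable_bytes = trainable_params * BYTES_PER_FLOAT32

print(f"LoRA parameters:          {lora_params:,}")
print(f"Full-layer parameters:    {full_params:,}")
print(f"Total trainable:          {trainable_params:,}")
print(f"float32 weight size (GiB): {trainable_bytes / 2**30:.2f}")
```

Under these assumptions the fully trained embedding and output layers account for almost all of the trainable parameters, with the rank-8 adapters contributing only a small share.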

The training data contains approximately 400 million tokens drawn from high-quality code datasets on HuggingFace.

In addition, we applied `memory_efficient_attention` during pre-training, which saves a lot of GPU memory. If you want to quickly use this technique in your own LLaMA model, you can refer to my GitHub: https://github.com/FrankMinions/memory_efficient_adapter.
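
This card does not describe how the attention kernel works, so the following is only a general, pure-Python illustration of the idea commonly behind memory-efficient attention: process keys and values in chunks and keep a running softmax, so the full score vector is never stored at once. It is not the code from the linked repository, and the function names `naive_attention` and `chunked_attention` are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def chunked_attention(q, keys, values, chunk_size=2):
    """Same output, but only `chunk_size` scores are held at a time.

    Requires at least one key/value pair.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy single-query example with made-up numbers.
query = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

reference = naive_attention(query, keys, values)
chunked = chunked_attention(query, keys, values, chunk_size=2)
print(max(abs(a - b) for a, b in zip(reference, chunked)))
```

The two functions agree up to floating-point rounding; the chunked version trades a little extra arithmetic for a working set whose size depends on `chunk_size` rather than on the sequence length.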

Our model can be used as a base for SFT, and we hope to contribute more valuable work in the Chinese-language field.