---
license: apache-2.0
language:
- zh
- en
tags:
- code
---
# Chinese-CodeLlama-7B-PT
We further expanded the vocabulary of Chinese-LLaMA-2-7B from 55,296 to 75,548 tokens; notably, most of the added tokens are code tokens. On [MBPP](https://huggingface.co/datasets/mbpp), we measured the tokenizer's compression rate at 4.509 `bytes/token`, and we plan to reduce this value in future work to improve training and inference efficiency.
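The compression rate is simply total UTF-8 bytes divided by total tokens over a corpus. A minimal sketch of how one might reproduce it (the `transformers`/`datasets` calls in the comments are illustrative, and the MBPP field name is an assumption):

```python
def compression_rate(texts, encode):
    """Average bytes per token: total UTF-8 bytes / total tokens produced
    by `encode`, a callable mapping a string to a list of token ids."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_bytes / total_tokens

# With the real tokenizer it would look roughly like (requires `transformers`
# and `datasets`; model id and dataset field are illustrative):
# from transformers import AutoTokenizer
# from datasets import load_dataset
# tok = AutoTokenizer.from_pretrained("frankminors123/Chinese-CodeLlama-7B-PT")
# texts = [ex["text"] for ex in load_dataset("mbpp", split="test")]
# print(compression_rate(texts, tok.encode))
```

A lower value means each token covers fewer bytes on average, so code-heavy vocabularies that merge common code fragments push the number down.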
We pre-trained the model with LoRA at rank 8, with the trainable LoRA layers attached to `q_proj` and `v_proj`; at the same time, the `embed_tokens` and `lm_head` layers were trained with full parameters. All trainable parameters are float32.
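As a sketch of this setup, the parameter-selection rule can be written as a small helper; the parameter-name patterns below follow the usual LLaMA module names, and the `peft` config in the comments is an illustrative way to express the same thing, not necessarily the author's exact code:

```python
FULL_PARAM_MODULES = ("embed_tokens", "lm_head")  # trained with full parameters
LORA_TARGET_MODULES = ("q_proj", "v_proj")        # receive rank-8 LoRA adapters

def training_mode(param_name: str) -> str:
    """Classify a parameter by how it is trained in this setup:
    'full' (all weights updated), 'lora' (low-rank adapter), or 'frozen'."""
    if any(m in param_name for m in FULL_PARAM_MODULES):
        return "full"
    if any(m in param_name for m in LORA_TARGET_MODULES):
        return "lora"
    return "frozen"

# With the `peft` library, roughly (illustrative):
# from peft import LoraConfig
# config = LoraConfig(r=8, target_modules=["q_proj", "v_proj"],
#                     modules_to_save=["embed_tokens", "lm_head"])
```

Training `embed_tokens` and `lm_head` in full is important here because the expanded vocabulary adds ~20k new rows to both matrices, which start untrained and cannot be learned through low-rank adapters alone.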
The training data contains approximately 400 million tokens drawn from high-quality code datasets on Hugging Face.
In addition, we applied `memory_efficient_attention` during pre-training, which saves a large amount of GPU memory. If you want to quickly use this technique in your LLaMA model, see my GitHub repository: https://github.com/FrankMinions/memory_efficient_adapter.
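Memory-efficient attention (as in xFormers) avoids materializing the full `seq_len × seq_len` attention matrix by processing keys in chunks with an online softmax. A toy pure-Python sketch of that underlying idea for a single query row (not the linked implementation):

```python
import math

def attention_row(q, keys, values, chunk=2):
    """Attention output for one query, streaming over key/value chunks with an
    online softmax, so the full row of scores is never held in memory at once."""
    m = -math.inf            # running maximum of scores (for numerical stability)
    denom = 0.0              # running softmax denominator
    out = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        for k, v in zip(keys[start:start + chunk], values[start:start + chunk]):
            s = sum(qi * ki for qi, ki in zip(q, k))      # dot-product score
            new_m = max(m, s)
            scale = math.exp(m - new_m) if m != -math.inf else 0.0
            w = math.exp(s - new_m)
            denom = denom * scale + w                     # rescale old terms
            out = [o * scale + w * vi for o, vi in zip(out, v)]
            m = new_m
    return [o / denom for o in out]
```

The streaming rescale is the same trick used by FlashAttention-style kernels; the result is identical to softmax attention, but peak memory grows with the chunk size rather than the sequence length.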
Our model can be used for SFT, and we hope to contribute more valuable work to the Chinese NLP field.
The second version of our fine-tuned model, [Chinese-CodeLlama-7B-SFT-V2](https://huggingface.co/frankminors123/Chinese-CodeLlama-7B-SFT-V2), has been launched. We used a sequence length of 1k for pre-training (this model) and continued training at this length during the fine-tuning stage. Thanks to a larger base period for the rotary positional embeddings, it supports up to 15k context length extrapolation at inference time.
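The context extension works because a larger RoPE base lowers every rotation frequency, stretching the rotation periods so positions beyond the 1k training length remain distinguishable. A minimal sketch (the base values below are illustrative; the card does not state the exact base used):

```python
import math

def rope_inv_freq(head_dim: int, base: float):
    """Inverse frequencies of rotary positional embeddings for one head:
    theta_i = base^(-2i / head_dim), one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

default = rope_inv_freq(128, 10_000.0)     # standard LLaMA base
larger = rope_inv_freq(128, 1_000_000.0)   # illustrative enlarged base
# Every frequency is lowered (the i=0 term stays at 1.0), so each rotation
# period is longer, which enables length extrapolation at inference time.
assert all(l <= d for l, d in zip(larger, default))
```

In practice this corresponds to raising the `rope_theta` value in the model configuration before (or during) continued training.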