# Fast-Inference with Ctranslate2

Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

quantized version of Salesforce/codet5p-770m-py

pip install hf-hub-ctranslate2>=2.0.7

Converted on 2023-05-20 using

ct2-transformers-converter --model Salesforce/codet5p-770m-py --output_dir /home/michael/tmp-ct2fast-codet5p-770m-py --force --copy_files merges.txt README.md tokenizer_config.json vocab.json special_tokens_map.json added_tokens.json .gitattributes --quantization float16

Checkpoint compatible to ctranslate2>=3.13.0 and hf-hub-ctranslate2>=2.0.6

compute_type=int8_float16 for device="cuda"
compute_type=int8 for device="cpu"

from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
from transformers import AutoTokenizer

model_name = "michaelfeil/ct2fast-codet5p-770m-py"
# use either TranslatorCT2fromHfHub or GeneratorCT2fromHfHub here, depending on model.
model = TranslatorCT2fromHfHub(
        # load in int8 on CUDA
        model_name_or_path=model_name, 
        device="cuda",
        compute_type="int8_float16",
        tokenizer=AutoTokenizer.from_pretrained("Salesforce/codet5p-770m-py")
)
outputs = model.generate(
    text=["def print_hello_world():", "def hello_name(name:"],
    decode_tok_kwargs=dict(skip_special_tokens=True),
    max_decoding_length=64,
    end_token=["def"]
)
print(outputs)

Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be idential to original huggingface repo.

Original description

CodeT5+ 770M (further tuned on Python)

Model description

CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks. It is introduced in the paper:

CodeT5+: Open Code Large Language Models for Code Understanding and Generation by Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution).

Compared to the original CodeT5 family (base: 220M, large: 770M), CodeT5+ is pretrained with a diverse set of pretraining tasks including span denoising, causal language modeling, contrastive learning, and text-code matching to learn rich representations from both unimodal code data and bimodal code-text data. Additionally, it employs a simple yet effective compute-efficient pretraining method to initialize the model components with frozen off-the-shelf LLMs such as CodeGen to efficiently scale up the model (i.e. 2B, 6B, 16B), and adopts a "shallow encoder and deep decoder" architecture. Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B) following Code Alpaca.

How to use

This model can be easily loaded using the T5ForConditionalGeneration functionality and employs the same tokenizer as original CodeT5.

from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m-py"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# ==>     print('Hello World!')

Pretraining data

This checkpoint is trained on the stricter permissive subset of the deduplicated version of the github-code dataset. The data is preprocessed by reserving only permissively licensed code ("mit" “apache-2”, “bsd-3-clause”, “bsd-2-clause”, “cc0-1.0”, “unlicense”, “isc”). Supported languages (9 in total) are as follows: c, c++, c-sharp, go, java, javascript, php, python, ruby.

Training procedure

This checkpoint is first trained on the multilingual unimodal code data at the first-stage pretraining, which includes a diverse set of pretraining tasks including span denoising and two variants of causal language modeling. After that, it is further trained on the Python subset with the causal language modeling objective for another epoch to better adapt for Python code generation. Please refer to the paper for more details.

Evaluation results

CodeT5+ models have been comprehensively evaluated on a wide range of code understanding and generation tasks in various settings: zero-shot, finetuning, and instruction-tuning. Specifically, CodeT5+ yields substantial performance gains on many downstream tasks compared to their SoTA baselines, e.g., 8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4). In 2 math programming tasks on MathQA-Python and GSM8K-Python, CodeT5+ models of below billion-parameter sizes significantly outperform many LLMs of up to 137B parameters. Particularly, in the zero-shot text-to-code generation task on HumanEval benchmark, InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 mode Please refer to the paper for more details.

Specifically for this checkpoint, it achieves 15.5% pass@1 on HumanEval in the zero-shot setting, which is comparable to much larger LLMs such as Incoder 6B’s 15.2%, GPT-NeoX 20B’s 15.4%, and PaLM 62B’s 15.9%.

BibTeX entry and citation info

@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}