Text2Text Generation
Transformers
PyTorch
t5
text-generation-inference
Inference Endpoints
codet5-large / README.md
yuewang-sf's picture
Update README.md
119c23b
---
license: bsd-3-clause
---
# CodeT5 (large-size model 770M)
## Model description
CodeT5 is a family of encoder-decoder language models for code from the paper: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.
The checkpoint included in this repository is denoted as **CodeT5-large** (770M), which is introduced by the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.
## Training data
CodeT5-large was pretrained on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP). See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
## Training procedure
CodeT5-large was pretrained using masked span prediction objective for 150 epochs. See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
## Evaluation results
We validate the effectiveness of this checkpoint pretrained with simplified strategies on [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) benchmark. See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
## How to use
This model can be easily loaded using the `T5ForConditionalGeneration` functionality:
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids
# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
## BibTeX entry and citation info
```bibtex
@inproceedings{CodeT52021,
author = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
title = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
booktitle = {EMNLP},
pages = {8696--8708},
publisher = {Association for Computational Linguistics},
year = {2021}
}
@article{CodeRL2022
author = {Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi},
title = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
journal = {arXiv preprint},
volume = {abs/2207.01780},
year = {2022}
}
```