---
license: bsd-3-clause
---

# CodeT5 (large-sized model, 770M)
|
|
|
## Model description
|
|
|
CodeT5 is a family of encoder-decoder language models for code from the paper: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.
|
|
|
The checkpoint included in this repository is denoted as **CodeT5-large** (770M), which was introduced in the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
|
|
|
## Training data
|
|
|
CodeT5-large was pretrained on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP). See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
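
For a quick look at this corpus, a `code_search_net` dataset is available on the Hugging Face Hub. The sketch below is only an illustration: it assumes that dataset ID and the standard `datasets` API (loading a script-based dataset may require `trust_remote_code=True` or an older `datasets` version), and it prints whatever fields the first example happens to contain rather than assuming particular column names.

```python
# Minimal sketch: peek at CodeSearchNet via the `datasets` library.
# Assumes the `code_search_net` dataset ID on the Hugging Face Hub; on some
# `datasets` versions this script-based dataset needs trust_remote_code=True.
from datasets import load_dataset

# One configuration per language (ruby, javascript, go, python, java, php).
ds = load_dataset("code_search_net", "python", split="train", streaming=True)

example = next(iter(ds))
for key, value in example.items():
    print(key, "->", str(value)[:80])
```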
|
|
|
## Training procedure
|
|
|
CodeT5-large was pretrained using the masked span prediction objective for 150 epochs. See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
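
To make the objective concrete, here is a minimal, self-contained sketch of T5-style masked span prediction (span corruption): contiguous spans of the input are replaced with sentinel tokens, and the target is the sequence of sentinels followed by the original span contents. The helper and the hand-picked spans below are illustrative only, not the exact preprocessing pipeline used for CodeT5.

```python
# Illustrative sketch of T5-style masked span prediction (not CodeT5's actual preprocessing).
# Masked spans become sentinel tokens (<extra_id_0>, <extra_id_1>, ...) in the source;
# the target lists each sentinel followed by the tokens it replaced.

def mask_spans(tokens, spans):
    """tokens: list of str; spans: sorted, non-overlapping (start, end) index pairs to mask."""
    source, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[prev:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        prev = end
    source.extend(tokens[prev:])
    # T5 convention: close the target with one final sentinel.
    target.append(f"<extra_id_{len(spans)}>")
    return " ".join(source), " ".join(target)

tokens = "def greet ( user ) : print ( user )".split()
src, tgt = mask_spans(tokens, [(1, 2), (7, 9)])
print(src)  # def <extra_id_0> ( user ) : print <extra_id_1> )
print(tgt)  # <extra_id_0> greet <extra_id_1> ( user <extra_id_2>
```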
|
|
|
|
|
## Evaluation results
|
We validate the effectiveness of this checkpoint, which was pretrained with simplified strategies, on the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) benchmark. See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
|
|
|
|
|
## How to use
|
|
|
This model can be loaded with the `T5ForConditionalGeneration` class:
|
|
|
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# generate a single sequence that fills in the masked span (<extra_id_0>)
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
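
Continuing from the snippet above, you can also ask `generate` for several candidate completions of the masked span. The decoding settings below (nucleus sampling with `top_p` and `num_return_sequences`) are illustrative choices, not the ones used in the paper:

```python
# Sample a few alternative completions for the masked span (settings are illustrative).
generated_ids = model.generate(
    input_ids,
    max_length=8,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3,
)
for candidate in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(candidate)
```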
|
|
|
## BibTeX entry and citation info
|
|
|
```bibtex
@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author  = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title   = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal = {arXiv preprint},
  volume  = {abs/2207.01780},
  year    = {2022}
}
```