metadata

language:
  - code
  - en
license: apache-2.0
tags:
  - code
  - gpt2
  - generation
datasets:
  - codeparrot/codeparrot-clean
  - openai_humaneval
  - semeru/code-text-python
  - semeru/galeras-causal4se-3k-levenshtein
metrics:
  - evaluate-metric/code_eval

Compatibilized CodeParrot 🦜 (small)

This is the compatibilized version of CodeParrot 🦜, a GPT-2 model (110M parameters) trained to generate Python code.

The compatibilization is based on the sequential-rationales process formulated by Vafa et al.

Usage

You can load the CodeParrot model and tokenizer directly in transformers and use the Galeras dataset to build prompts for sampling from the model:

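from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
model = AutoModelForCausalLM.from_pretrained("semeru/compatible-codeparrot-small")

# df_sampled_code is assumed to be a pandas DataFrame of Galeras samples
# with 'ground_truth' and 'prompt' columns; it is not defined in this snippet.
# Number of tokens in each ground-truth snippet
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
# Token ids of each prompt
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']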

Training

The model was trained on the cleaned CodeParrot 🦜 dataset with the following settings:

Config	Value
Batch size	192
Context size	1024
Training steps	150,000
Gradient accumulation	1
Gradient checkpointing	False
Learning rate	5e-4
Weight decay	0.1
Warmup steps	2000
Schedule	Cosine
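The learning-rate rows of the table can be read as a single curve. The sketch below assumes a linear warmup from zero to the peak rate followed by a cosine decay to zero over the remaining steps; the card only states "Cosine" with 2000 warmup steps, so the exact shape and the function name `lr_at_step` are illustrative assumptions, not the training code itself.

```python
import math

def lr_at_step(step, peak_lr=5e-4, warmup_steps=2000, total_steps=150_000):
    """Learning rate at a given step (assumed shape: linear warmup, cosine decay to 0)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr during warmup
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1]
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    # Half-cosine from peak_lr down to 0
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the rate at a few points of the run
for step in (0, 1000, 2000, 76_000, 150_000):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the rate peaks at 5e-4 exactly when warmup ends and falls to half of that midway through the decay phase.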

Training ran on 16 x A100 (40GB) GPUs. With these settings the model processed roughly 29 billion tokens (batch size × context size × training steps).
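The token total follows from the table by simple arithmetic. The sketch below assumes the batch size of 192 is the global number of sequences per optimizer step and that every sequence fills the full 1024-token context; neither is stated explicitly in the card, although the result is consistent with the figure quoted above.

```python
# Values taken from the training configuration table
batch_size = 192             # sequences per step (assumed to be the global batch)
context_size = 1024          # tokens per sequence (assumed fully packed)
training_steps = 150_000
gradient_accumulation = 1

tokens_per_step = batch_size * context_size * gradient_accumulation
total_tokens = tokens_per_step * training_steps

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens:,} tokens in total")
```

This gives 196,608 tokens per step and 29,491,200,000 tokens overall, or about 29.5 billion.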

Performance

We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:

Metric	Value
pass@1	3.80%
pass@10	6.57%
pass@100	12.78%

The pass@k metric estimates the probability that at least one out of k generated solutions for a problem passes its unit tests.
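To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval: given n sampled solutions for a problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. That the numbers above were computed exactly this way is an assumption; the function name `pass_at_k` and the sample counts in the usage lines are illustrative only.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 200 samples drawn, 10 of them pass the tests
for k in (1, 10, 100):
    print(f"pass@{k}: {pass_at_k(200, 10, k):.2%}")
```

The benchmark score is the mean of this per-problem value across all problems, which is why pass@k grows with k even when the per-sample success rate is fixed.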

Resources

Dataset: full, train, valid
Code: repository
Spaces: generation, highlighting