---
language:
  - code
license: apache-2.0
tags:
  - code
  - gpt2
  - generation
datasets:
  - codeparrot/github-code-clean
  - openai_humaneval
metrics:
  - evaluate-metric/code_eval
---

# CodeParrot-Multi 🦜 (small)

CodeParrot-Multi 🦜 is a GPT-2 model (110M parameters) trained to generate code in 9 programming languages: Java, JavaScript, PHP, Python, C#, C++, Go, Ruby and TypeScript.

## Usage

You can load the CodeParrot-Multi model and tokenizer directly in transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)  # forward pass: returns next-token logits, not generated text
```

or with a pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
outputs = pipe("def hello_world():")
```
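
For longer completions you can also sample from the model with `generate`. The snippet below is a minimal sketch; the decoding parameters (`max_new_tokens`, `temperature`, `top_p`) are illustrative choices, not values recommended by this card:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
# Sample a completion; the decoding settings are illustrative, not from the model card.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```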

## Training

The model was trained on the near-deduplicated subset of the GitHub code dataset (`codeparrot/github-code-clean`) with the following settings:

| Config | Value |
|--------|-------|
| Batch size | 192 |
| Context size | 1024 |
| Training steps | 300,000 |
| Gradient accumulation | 2 |
| Gradient checkpointing | False |
| Learning rate | 5e-4 |
| Weight decay | 0.1 |
| Warmup steps | 2000 |
| Schedule | Cosine |

The training was executed on 16 x A100 (40GB) GPUs. With this setup, the model was trained on roughly 58 billion tokens.
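
As an illustration only, not the original training script, the settings above map roughly onto `transformers.TrainingArguments` as sketched below; the per-device batch size of 6 is inferred from 192 = 16 GPUs x 6 x 2 gradient-accumulation steps:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments;
# not the original CodeParrot training script.
training_args = TrainingArguments(
    output_dir="codeparrot-small-multi",
    per_device_train_batch_size=6,   # 16 GPUs x 6 x 2 accumulation steps = 192 effective batch size
    gradient_accumulation_steps=2,
    gradient_checkpointing=False,
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
    max_steps=300_000,
)
```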

## Performance

We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:

| Metric | Value |
|--------|-------|
| pass@1 | --% |
| pass@10 | --% |
| pass@100 | --% |

The pass@k metric measures the probability that at least one out of k generations passes the unit tests.
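
The pass@k scores can be computed with the `code_eval` metric from the `evaluate` library (listed in the metadata above). The snippet below is a minimal sketch with a toy problem, not a HumanEval task:

```python
import os

from evaluate import load

# code_eval executes untrusted model-generated code, so it must be enabled explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load("code_eval")

# Toy example: two candidate completions for a single problem and its unit test.
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a * b", "def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)  # {'pass@1': 0.5, 'pass@2': 1.0}
```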

## Resources