THIS MODEL HAS EVAL DATA LEAKED INTO THE DATASET
THIS IS NOT AN OFFICIAL MODEL CARD
NewHope: Harnessing 99% of GPT-4's Programming Capabilities
We introduce NewHope, a fine-tuned chat model based on llama-2-13b that aims to provide strong coding capability. NewHope handles multiple languages, including Python, C++, Java, JavaScript, Go, and more. Preliminary evaluation on HumanEval shows that NewHope reaches 66.5% Pass@1, about 99% of GPT-4's reported 67.0%.
Contact: SLAM (SUFE Large AI Model) is a research group at Shanghai University of Finance and Economics. [email protected]
TODO: We will release more evaluation results and training details later.
Evaluation Results
We evaluated NewHope on HumanEval using the official evaluation script by OpenAI. We compared the Pass@1 metric of NewHope with other models. The results of other models are from PapersWithCode.
Model | Pass@1 |
---|---|
GPT-4 | 67.0 |
PanGu-Coder2 15B | 61.6 |
WizardCoder 15B | 57.3 |
phi-1 1.3B | 50.6 |
GPT-3.5 | 48.1 |
phi-1-small | 45.0 |
PaLM-Coder | 36.0 |
CodeGeeX2-6B | 35.9 |
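For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator introduced alongside HumanEval, where `n` samples are drawn per problem and `c` of them pass the unit tests. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, not part of the official script, and the per-problem counts in the usage lines are made up. With `k = 1` the estimate reduces to the fraction of passing samples.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, one sample each (so pass@1 = 2/3).
example_results = [(1, 1), (1, 0), (1, 1)]
example_score = benchmark_pass_at_k(example_results, k=1)
```

The scores in the table are this quantity expressed as a percentage.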
Model Weights
We have open-sourced the NewHope model weights.
We are uploading the model weights. The weights will be available in a few hours.
Usage
To load the NewHope model using Transformers, use the following code:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM
base_model = "SLAM-group/NewHope"
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
# `model.config.use_cache` defaults to `False`. For inference, set `model.config.use_cache = True`.
Note: Hugging Face Transformers 4.31.0 or later is required to load this model!
You can ask NewHope to generate code from instructions. Here is a simple example of how NewHope generates code with the expected prompt format:
# Suppose required tokenizer and model have already been loaded
instruction = "Write a Python function to tell me what the date is today."
prompt = f"<s> ### Instruction:\n{instruction}\n\n### Response:\n"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=2048)[0]
decoded_output = tokenizer.decode(output, skip_special_tokens=True).split("### Response:\n")[-1].strip()
print(decoded_output)
You can also chat with NewHope over multiple turns using the following prompt format:
<s> ### Instruction:\nQ1\n\n### Response:\nA1</s><s> ### Instruction:\nQ2\n\n### Response:\nA2</s>
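The multi-turn format can be assembled programmatically. The sketch below is one way to do it; `build_dialog_prompt` is a hypothetical helper, not something shipped with the model. It joins the completed turns, each closed with `</s>`, and leaves the final response open for the model to fill in.

```python
def build_dialog_prompt(history, instruction):
    """Build a NewHope-style dialog prompt.

    history: list of (question, answer) pairs from earlier turns.
    instruction: the new question the model should answer next.
    """
    parts = []
    for question, answer in history:
        # Completed turns are wrapped in <s> ... </s>.
        parts.append(f"<s> ### Instruction:\n{question}\n\n### Response:\n{answer}</s>")
    # The final turn is left open so generation continues after "### Response:".
    parts.append(f"<s> ### Instruction:\n{instruction}\n\n### Response:\n")
    return "".join(parts)


# Placeholder turns, mirroring the Q1/A1/Q2 pattern in the format line.
dialog_prompt = build_dialog_prompt([("Q1", "A1")], "Q2")
```

Because the special tokens are written into the string, tokenize the result with `add_special_tokens=False`, as in the single-turn example.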
Evaluation
Local setup
Install HumanEval for evaluation.
Install dependencies
pip install -r requirements.txt
For HumanEval, we use the following prompt:
example_input = 'def is_odd(number: int) -> bool:\n """ Check whether the given number is odd\n >>> is_odd(3)\n True\n >>> is_odd(6)\n False\n """\n'
example_output = 'def is_odd(number: int) -> bool:\n """ Check whether the given number is odd\n >>> is_odd(3)\n True\n >>> is_odd(6)\n False\n """\n return number % 2 == 1'
task_in_humaneval = "REPLACE `task_in_humaneval` WITH THE SPECIFIC TASK IN HUMANEVAL DATA"
prompt = f"<s> ### Instruction:\nComplete the given function below:\n\n{example_input}\n\n### Response:\n{example_output}</s><s> ### Instruction:\nComplete the given function below:\n\n{task_in_humaneval}\n\n### Response:\n"
To reproduce the results on HumanEval, use the following script:
python complete.py --base_model SLAM-group/NewHope --output_dir output --n_gpu 8
The above script will generate samples.jsonl in output_dir, which can be directly evaluated by HumanEval. We conducted the experiment with fp16 on 8xA800 80GB GPUs, reaching 66.5% on Pass@1 (vs. GPT-4's 67.0%).
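The internals of `complete.py` are not shown here, so the following is only a sketch of the file format the HumanEval harness reads, as described in its README: one JSON object per line with a `task_id` and a `completion`. The helper names and the two completions are placeholders for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_samples(samples, path):
    """Write (task_id, completion) pairs as JSON Lines, one object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for task_id, completion in samples:
            f.write(json.dumps({"task_id": task_id, "completion": completion}) + "\n")


def read_samples(path):
    """Read the JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder completions; real ones would come from model.generate().
demo_samples = [
    ("HumanEval/0", "    return False\n"),
    ("HumanEval/1", "    return []\n"),
]

output_dir = Path(tempfile.mkdtemp())  # stands in for --output_dir
samples_path = output_dir / "samples.jsonl"
write_samples(demo_samples, samples_path)
records = read_samples(samples_path)
```

A file in this shape can then be scored with HumanEval's `evaluate_functional_correctness` command.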
Citation
@misc{2023newhope,
title={NewHope: Harnessing 99% of GPT-4's Programming Capabilities},
author={Wanyun Cui and Qianle Wang},
howpublished = {https://github.com/SLAM-group/newhope},
year={2023}
}
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 51.9 |
ARC (25-shot) | 61.09 |
HellaSwag (10-shot) | 84.03 |
MMLU (5-shot) | 55.73 |
TruthfulQA (0-shot) | 44.96 |
Winogrande (5-shot) | 74.98 |
GSM8K (5-shot) | 15.85 |
DROP (3-shot) | 26.66 |
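As a quick sanity check, the reported average appears to be the unweighted mean of the seven benchmark scores. The snippet below recomputes it from the table values; treating the average as a plain mean is an assumption on our part, though it matches the reported figure.

```python
# Scores copied from the table above.
leaderboard_scores = {
    "ARC (25-shot)": 61.09,
    "HellaSwag (10-shot)": 84.03,
    "MMLU (5-shot)": 55.73,
    "TruthfulQA (0-shot)": 44.96,
    "Winogrande (5-shot)": 74.98,
    "GSM8K (5-shot)": 15.85,
    "DROP (3-shot)": 26.66,
}

# Unweighted mean over all seven benchmarks (assumed aggregation).
average = sum(leaderboard_scores.values()) / len(leaderboard_scores)
print(f"Avg. = {average:.1f}")  # prints "Avg. = 51.9"
```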