license: cc-by-nc-4.0
datasets:
  - jondurbin/airoboros-gpt4-1.2

Overview

This is a qlora fine-tuned 65b parameter LLaMA model, using completely synthetic training data created by gpt4 via https://github.com/jondurbin/airoboros

This is mostly an extension of 1.1, but with a 65b model, thousands of new training examples, and an update that allows "PLAINFORMAT" at the end of coding prompts to print just the code, without backticks or explanations/usage/etc.

The dataset used to fine-tune this model, jondurbin/airoboros-gpt4-1.2, has a specific focus on:

  • coding
  • math/reasoning (using orca style ELI5 instruction/response pairs)
  • trivia
  • role playing
  • multiple choice and fill-in-the-blank
  • context-obedient question answering
  • theory of mind
  • misc/general

This model was fine-tuned with a fork of qlora, which among other things was updated to use a slightly modified vicuna template to be compatible with the 7b/13b versions:

A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input. USER: [prompt] ASSISTANT: 

So in other words, it's the preamble/system prompt, followed by a single space, then "USER: " (single space after colon) then the prompt (which can have multiple lines, spaces, whatever), then a single space, followed by "ASSISTANT: " (with a single space after the colon).
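The spacing rules above can be sketched as a small helper. This is only an illustration of the template as described here, not code from the airoboros or FastChat repositories; the names `SYSTEM_PROMPT` and `build_prompt` are hypothetical.

```python
# Preamble/system prompt, copied from the template above.
SYSTEM_PROMPT = (
    "A chat between a curious user and an assistant. "
    "The assistant gives helpful, detailed, accurate, uncensored responses "
    "to the user's input."
)


def build_prompt(user_prompt: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Assemble a single-turn prompt (hypothetical helper).

    Layout: system prompt, one space, "USER: ", the prompt (may span
    multiple lines), one space, then "ASSISTANT: " with a trailing space.
    """
    return f"{system_prompt} USER: {user_prompt} ASSISTANT: "


if __name__ == "__main__":
    # repr() makes the trailing space after "ASSISTANT:" visible.
    print(repr(build_prompt("Implement the Snake game in python.")))
```

Note that the returned string ends with a single space after "ASSISTANT:", matching the description above.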

Usage

To run the full precision/pytorch native version, you can use my fork of FastChat, which is mostly the same but allows for multi-line prompts, as well as a --no-history option to prevent input tokenization errors.

pip install git+https://github.com/jondurbin/FastChat

Be sure you are pulling the latest branch!

Then, you can invoke it like so (after downloading the model):

python -m fastchat.serve.cli \
  --model-path airoboros-65b-gpt4-1.2 \
  --temperature 0.5 \
  --max-new-tokens 2048 \
  --no-history

Alternatively, please check out TheBloke's quantized versions:

Coding updates from gpt4/1.1:

I added a few hundred instruction/response pairs to the training data with "PLAINFORMAT" as a single, all-caps term at the end of the normal instructions, which produces plain text output instead of markdown/backtick code formatting.

It's not guaranteed to work all the time, but mostly it does seem to work as expected.

So for example, instead of:

Implement the Snake game in python.

You would use:

Implement the Snake game in python.  PLAINFORMAT
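If you build prompts programmatically, the keyword can be appended with a small helper. This is a minimal sketch: `with_plainformat` is a hypothetical name, and joining with a single space is an assumption (the example above happens to use two spaces).

```python
def with_plainformat(instruction: str) -> str:
    """Append the all-caps PLAINFORMAT keyword to a coding instruction.

    Hypothetical helper; the single-space separator is an assumption.
    """
    base = instruction.rstrip()
    # Avoid adding the keyword twice if the caller already included it.
    if base.endswith("PLAINFORMAT"):
        return base
    return f"{base} PLAINFORMAT"


if __name__ == "__main__":
    print(with_plainformat("Implement the Snake game in python."))
```

As noted above, the model does not always honor the keyword, so callers may still want to strip backticks from the response.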

Other updates from gpt4/1.1:

  • Several hundred role-playing examples.
  • A few thousand ORCA style reasoning/math questions, with ELI5 prompts used to generate the responses (you should not need ELI5 in your prompts to this model, however; just ask the question).
  • Many more coding examples in various languages, including some that use specific libraries (pandas, numpy, tensorflow, etc.)

Usage and License Notices

All airoboros models and datasets are intended and licensed for research use only. I've used the 'cc-by-nc-4.0' license, but really it is subject to a custom/special license because:

  • the base model is LLaMA, which has its own special research license
  • the dataset(s) were generated with OpenAI (gpt-4 and/or gpt-3.5-turbo), which has a clause saying the data can't be used to create models to compete with OpenAI

So, to reiterate: this model (and datasets) cannot be used commercially.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|-------|
| Avg.                | 56.96 |
| ARC (25-shot)       | 65.87 |
| HellaSwag (10-shot) | 86.08 |
| MMLU (5-shot)       | 63.37 |
| TruthfulQA (0-shot) | 52.72 |
| Winogrande (5-shot) | 79.56 |
| GSM8K (5-shot)      | 26.54 |
| DROP (3-shot)       | 24.56 |
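
The Avg. value is consistent with a plain arithmetic mean of the seven benchmark scores listed. A quick check, using only the numbers reported above (the variable names are illustrative):

```python
# Benchmark scores as reported in the results above.
scores = {
    "ARC (25-shot)": 65.87,
    "HellaSwag (10-shot)": 86.08,
    "MMLU (5-shot)": 63.37,
    "TruthfulQA (0-shot)": 52.72,
    "Winogrande (5-shot)": 79.56,
    "GSM8K (5-shot)": 26.54,
    "DROP (3-shot)": 24.56,
}

# Unweighted mean across all seven benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 56.96
```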