	---
	license: other
	tags:
	- moe
	- merge
	- mergekit
	- Mistral
	- openchat/openchat-3.5-1210
	- beowolx/CodeNinja-1.0-OpenChat-7B
	- maywell/PiVoT-0.1-Starling-LM-RP
	- WizardLM/WizardMath-7B-V1.1
	license_name: microsoft-research-license
	license_link: https://huggingface.co/WizardLM/WizardMath-7B-V1.1/resolve/main/LICENSE
	model-index:
	- name: Beyonder-4x7B-v2
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 68.77
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 86.8
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 65.1
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 60.68
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 80.9
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 71.72
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
	      name: Open LLM Leaderboard
	---

	![](https://i.imgur.com/vq1QHEA.jpg)

	# Beyonder-4x7B-v2

	This model is a Mixture of Experts (MoE) made with [mergekit](https://github.com/cg123/mergekit) (mixtral branch). It uses the following base models:
	* [openchat/openchat-3.5-1210](https://huggingface.co/openchat/openchat-3.5-1210)
	* [beowolx/CodeNinja-1.0-OpenChat-7B](https://huggingface.co/beowolx/CodeNinja-1.0-OpenChat-7B)
	* [maywell/PiVoT-0.1-Starling-LM-RP](https://huggingface.co/maywell/PiVoT-0.1-Starling-LM-RP)
	* [WizardLM/WizardMath-7B-V1.1](https://huggingface.co/WizardLM/WizardMath-7B-V1.1)

	The recommended context length is 8k.

	## ⚡ Quantized models

	Thanks to TheBloke and bartowski for the quantized models:

	* GGUF: https://huggingface.co/TheBloke/Beyonder-4x7B-v2-GGUF
	* AWQ: https://huggingface.co/TheBloke/Beyonder-4x7B-v2-AWQ
	* GPTQ: https://huggingface.co/TheBloke/Beyonder-4x7B-v2-GPTQ
	* EXL2: https://huggingface.co/bartowski/Beyonder-4x7B-v2-exl2

	## 🏆 Evaluation

	Beyonder-4x7B-v2 is competitive with Mixtral-8x7B-Instruct-v0.1 on the Open LLM Leaderboard, while only having 4 experts instead of 8.

	![](https://i.imgur.com/5raBff0.png)

	It also displays a significant improvement over the individual experts.

	![](https://i.imgur.com/7Idwkb0.png)

	It also performs very well compared to other models on the Nous benchmark suite. It is almost as good as the best Yi-34B fine-tune, which is a much bigger model: Beyonder has 24.2B parameters, but only two experts are selected during inference (so ~12B active parameters), vs. 34B parameters for Yi-34B.

	| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
	|--------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
	|[Beyonder-4x7B-v2](https://huggingface.co/shadowml/Beyonder-4x7B-v2)| 45.29| 75.95| <u>60.86</u>| 46.4| 57.13|
	|[NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B)| 43.67| 73.24| 55.37| 41.76| 53.51|
	|[OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B)| 42.75| 72.99| 52.99| 40.94| 52.42|
	|[Nous-Hermes-2-SOLAR-10.7B](https://huggingface.co/NousResearch/Nous-Hermes-2-SOLAR-10.7B)| 47.79| 74.69| 55.92| 44.84| 55.81|
	|[Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)| <u>50.27</u>| <u>76.00</u>| 60.34| <u>46.69</u>| <u>58.33</u>|
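
The parameter comparison above can be checked with back-of-the-envelope arithmetic. The sketch below assumes the standard Mistral-7B dimensions (hidden size 4096, MLP size 14336, 32 layers, 32 query heads and 8 key/value heads, vocabulary of 32000) and a Mixtral-style layout in which only the MLP blocks are replicated per expert, while attention, embeddings, and norms are shared. These values are assumptions, not read from this model's config, so the result is approximate.

```python
# Rough parameter count for a Mixtral-style MoE built from Mistral-7B experts.
# All dimensions below are assumed (standard Mistral-7B values), not read from the model.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
VOCAB = 32000
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = HIDDEN // N_HEADS


def moe_params(num_experts: int, experts_used: int) -> int:
    """Parameters involved when `experts_used` of `num_experts` MLP blocks run."""
    kv_dim = N_KV_HEADS * HEAD_DIM
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q, o and k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections of one expert
    router = HIDDEN * num_experts  # bias-free linear gate
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    per_layer = attention + experts_used * mlp + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # input embeddings and output head
    return LAYERS * per_layer + embeddings + HIDDEN  # last term: final norm


total = moe_params(num_experts=4, experts_used=4)
active = moe_params(num_experts=4, experts_used=2)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the total comes to about 24.2B, matching the figure quoted above, and roughly 12.9B parameters are active per token, in line with the ~12B estimate.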

	### AGIEval
	|             Task             |Version| Metric |Value|   |Stderr|
	|------------------------------|------:|--------|----:|---|-----:|
	|agieval_aqua_rat              |      0|acc     |23.62|±  |  2.67|
	|                              |       |acc_norm|23.62|±  |  2.67|
	|agieval_logiqa_en             |      0|acc     |41.47|±  |  1.93|
	|                              |       |acc_norm|43.01|±  |  1.94|
	|agieval_lsat_ar               |      0|acc     |23.04|±  |  2.78|
	|                              |       |acc_norm|23.48|±  |  2.80|
	|agieval_lsat_lr               |      0|acc     |51.57|±  |  2.22|
	|                              |       |acc_norm|52.94|±  |  2.21|
	|agieval_lsat_rc               |      0|acc     |64.31|±  |  2.93|
	|                              |       |acc_norm|64.68|±  |  2.92|
	|agieval_sat_en                |      0|acc     |79.13|±  |  2.84|
	|                              |       |acc_norm|79.13|±  |  2.84|
	|agieval_sat_en_without_passage|      0|acc     |43.20|±  |  3.46|
	|                              |       |acc_norm|43.20|±  |  3.46|
	|agieval_sat_math              |      0|acc     |34.55|±  |  3.21|
	|                              |       |acc_norm|32.27|±  |  3.16|

	### GPT4All
	|    Task     |Version| Metric |Value|   |Stderr|
	|-------------|------:|--------|----:|---|-----:|
	|arc_challenge|      0|acc     |61.86|±  |  1.42|
	|             |       |acc_norm|64.51|±  |  1.40|
	|arc_easy     |      0|acc     |85.06|±  |  0.73|
	|             |       |acc_norm|82.45|±  |  0.78|
	|boolq        |      1|acc     |88.35|±  |  0.56|
	|hellaswag    |      0|acc     |68.04|±  |  0.47|
	|             |       |acc_norm|85.12|±  |  0.36|
	|openbookqa   |      0|acc     |37.80|±  |  2.17|
	|             |       |acc_norm|48.60|±  |  2.24|
	|piqa         |      0|acc     |83.08|±  |  0.87|
	|             |       |acc_norm|83.95|±  |  0.86|
	|winogrande   |      0|acc     |78.69|±  |  1.15|
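
The GPT4All score of 75.95 in the summary table appears to be the mean over the seven tasks above, taking `acc_norm` where it is reported and `acc` otherwise. That aggregation rule is inferred from the numbers, not stated in this README; the sketch below checks it with values copied from the table.

```python
# Per-task GPT4All results copied from the table above: {task: {metric: value}}.
results = {
    "arc_challenge": {"acc": 61.86, "acc_norm": 64.51},
    "arc_easy": {"acc": 85.06, "acc_norm": 82.45},
    "boolq": {"acc": 88.35},
    "hellaswag": {"acc": 68.04, "acc_norm": 85.12},
    "openbookqa": {"acc": 37.80, "acc_norm": 48.60},
    "piqa": {"acc": 83.08, "acc_norm": 83.95},
    "winogrande": {"acc": 78.69},
}


def suite_score(task_results: dict) -> float:
    """Mean over tasks, preferring acc_norm and falling back to acc (assumed rule)."""
    per_task = [m.get("acc_norm", m["acc"]) for m in task_results.values()]
    return sum(per_task) / len(per_task)


print(round(suite_score(results), 2))  # 75.95
```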

	### TruthfulQA
	|    Task     |Version|Metric|Value|   |Stderr|
	|-------------|------:|------|----:|---|-----:|
	|truthfulqa_mc|      1|mc1   |44.55|±  |  1.74|
	|             |       |mc2   |60.86|±  |  1.57|

	### Bigbench
	|                      Task                      |Version|       Metric        |Value|   |Stderr|
	|------------------------------------------------|------:|---------------------|----:|---|-----:|
	|bigbench_causal_judgement                       |      0|multiple_choice_grade|58.95|±  |  3.58|
	|bigbench_date_understanding                     |      0|multiple_choice_grade|66.40|±  |  2.46|
	|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|48.84|±  |  3.12|
	|bigbench_geometric_shapes                       |      0|multiple_choice_grade|22.56|±  |  2.21|
	|                                                |       |exact_str_match      |13.37|±  |  1.80|
	|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|30.40|±  |  2.06|
	|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|20.57|±  |  1.53|
	|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|52.00|±  |  2.89|
	|bigbench_movie_recommendation                   |      0|multiple_choice_grade|44.40|±  |  2.22|
	|bigbench_navigate                               |      0|multiple_choice_grade|52.10|±  |  1.58|
	|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|69.75|±  |  1.03|
	|bigbench_ruin_names                             |      0|multiple_choice_grade|55.36|±  |  2.35|
	|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|23.65|±  |  1.35|
	|bigbench_snarks                                 |      0|multiple_choice_grade|77.35|±  |  3.12|
	|bigbench_sports_understanding                   |      0|multiple_choice_grade|73.02|±  |  1.41|
	|bigbench_temporal_sequences                     |      0|multiple_choice_grade|46.80|±  |  1.58|
	|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.08|±  |  1.17|
	|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|19.03|±  |  0.94|
	|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|52.00|±  |  2.89|

	## 🧩 Configuration

	```yaml
	base_model: mlabonne/Marcoro14-7B-slerp
	experts:
	  - source_model: openchat/openchat-3.5-1210
	    positive_prompts:
	    - "chat"
	    - "assistant"
	    - "tell me"
	    - "explain"
	  - source_model: beowolx/CodeNinja-1.0-OpenChat-7B
	    positive_prompts:
	    - "code"
	    - "python"
	    - "javascript"
	    - "programming"
	    - "algorithm"
	  - source_model: maywell/PiVoT-0.1-Starling-LM-RP
	    positive_prompts:
	    - "storywriting"
	    - "write"
	    - "scene"
	    - "story"
	    - "character"
	  - source_model: WizardLM/WizardMath-7B-V1.1
	    positive_prompts:
	    - "reason"
	    - "math"
	    - "mathematics"
	    - "solve"
	    - "count"
	```
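
The configuration defines four experts and, as noted in the evaluation section, only two of them run for each token. A minimal pure-Python sketch of that Mixtral-style top-2 routing follows. The function names, the toy router logits, and the stand-in expert functions are hypothetical illustrations, not mergekit or transformers code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    # Renormalize over the selected experts so their weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(token, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token."""
    output = [0.0] * len(token)
    for index, weight in top2_route(router_logits):
        expert_output = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_output)]
    return output


# Toy stand-ins for the four experts: each just scales the token vector.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.0]  # hypothetical router scores for one token

routes = top2_route(router_logits)
print([index for index, _ in routes])  # [1, 3]
print(moe_forward([1.0, 1.0], router_logits, experts))
```

In a real model the router logits come from a learned linear layer applied to each token's hidden state, and the experts are full MLP blocks; the selection and weighting logic is the part this sketch illustrates.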

	## 💻 Usage

	Here's a [notebook](https://colab.research.google.com/drive/1ypy8fEAJe9RkNmNQR1BduOzy2Qn6CnMl#scrollTo=myLRfwjZcIyP) to run this model in 4-bit precision using a free T4 GPU on Google Colab.

	```python
	# Notebook cell syntax: the leading "!" runs pip in a shell (drop it outside Jupyter/Colab)
	!pip install -qU transformers bitsandbytes accelerate

	from transformers import AutoTokenizer
	import transformers
	import torch

	model = "mlabonne/Beyonder-4x7B-v2"

	# Build a text-generation pipeline with the weights loaded in 4-bit precision
	tokenizer = AutoTokenizer.from_pretrained(model)
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model,
	    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
	)

	# Format the conversation with the model's chat template, then sample a reply
	messages = [{"role": "user", "content": "Explain what a Mixture of Experts is in less than 100 words."}]
	prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
	print(outputs[0]["generated_text"])
	```

	Output:

	> A Mixture of Experts (ME) is a machine learning technique that combines multiple expert models to make predictions or decisions. Each expert model is specialized in a different aspect of the problem, and their outputs are combined to produce a more accurate and robust solution. This approach allows the model to leverage the strengths of individual experts and compensate for their weaknesses, improving overall performance.

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_mlabonne__Beyonder-4x7B-v2).

	|             Metric              |Value|
	|---------------------------------|----:|
	|Avg.                             |72.33|
	|AI2 Reasoning Challenge (25-Shot)|68.77|
	|HellaSwag (10-Shot)              |86.80|
	|MMLU (5-Shot)                    |65.10|
	|TruthfulQA (0-shot)              |60.68|
	|Winogrande (5-shot)              |80.90|
	|GSM8k (5-shot)                   |71.72|
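
The `Avg.` row is the arithmetic mean of the six benchmark scores. A quick check, with the values copied from the table above:

```python
# Open LLM Leaderboard scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 68.77,
    "HellaSwag (10-Shot)": 86.80,
    "MMLU (5-Shot)": 65.10,
    "TruthfulQA (0-shot)": 60.68,
    "Winogrande (5-shot)": 80.90,
    "GSM8k (5-shot)": 71.72,
}

average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 72.33
```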