---
language:
- en
license: apache-2.0
tags:
- open-source
- code
- math
- chemistry
- biology
- text-generation
- question-answering
datasets:
- Open-Orca/SlimOrca
- glaiveai/glaive-code-assistant
- camel-ai/physics
- camel-ai/math
- camel-ai/chemistry
- camel-ai/biology
- WizardLM/WizardLM_evol_instruct_V2_196k
- microsoft/orca-math-word-problems-200k
- grimulkan/theory-of-mind
- Vezora/Tested-22k-Python-Alpaca
- m-a-p/Code-Feedback
- Locutusque/arc-cot
- jondurbin/airoboros-2.1
- WizardLM/WizardLM_evol_instruct_70k
pipeline_tag: text-generation
---

# OpenCerebrum-1.0-7B-SFT

OpenCerebrum-1.0-7B-SFT is an open-source language model fine-tuned from the alpindale/Mistral-7B-v0.2-hf base model on a diverse dataset that aims to replicate the capabilities of AetherResearch's proprietary Cerebrum model.

The model was fine-tuned on approximately 1.2 million examples across 14 datasets spanning coding, math, science, reasoning, and general instruction-following. The goal was to assemble public datasets that could help the model achieve strong performance on benchmarks where Cerebrum excels.

## Model Details

- **Base Model:** alpindale/Mistral-7B-v0.2-hf
- **Parameters:** 7 billion 
- **Fine-Tuning Dataset Size:** ~1,200,000 examples
- **Fine-Tuning Data:** Amalgamation of 14 public datasets
- **Language:** English
- **License:** Apache 2.0

## Intended Use

OpenCerebrum-1.0-7B-SFT is intended to be a powerful open-source model for coding, math, science, and general question-answering and text generation tasks. Its diverse fine-tuning data aims to equip it with broad knowledge and reasoning capabilities.

However, as an open-source replica trained only on publicly available data rather than Cerebrum's original training data, it may not match Cerebrum's full performance. Biases and limitations of the fine-tuning data may also be reflected in the model's outputs.
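For the text-generation use described above, a minimal loading sketch with the Hugging Face `transformers` library might look like the following. The repository id in `MODEL_ID` is an assumption (this card does not state the full Hub path), the helper name `generate` is hypothetical, and the plain-text prompt is only a placeholder, since no prompt template is documented here.

```python
# Assumed Hub repository id; adjust it to wherever the weights are actually hosted.
MODEL_ID = "Locutusque/OpenCerebrum-1.0-7B-SFT"


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model and return a completion for `prompt` (hypothetical helper)."""
    # Imported lazily so the module can be inspected without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # half precision to reduce memory use
        device_map="auto",           # requires the `accelerate` package
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Explain why the sum of two odd integers is even."))
```

Calling `generate` downloads the weights on first use, so it needs network access and enough GPU or CPU memory for a 7B-parameter model.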

## Limitations and Biases

- The model may have biases and limitations inherited from its fine-tuning datasets. Thorough testing is needed to characterize these.
- Even with 1.2 million training examples, the fine-tuning data remains limited compared to the proprietary Cerebrum data.
- Because the model is built on a 7B-parameter base model, it is subject to the compute and memory constraints of that size and may lag behind larger models.
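To put the memory constraint in concrete terms, the sketch below estimates the space needed just to hold the weights at common precisions. It assumes the nominal 7 billion parameters stated in this card and ignores activations, the KV cache, and framework overhead, so real usage will be higher; the function name and the dtype table are illustrative, not part of any library.

```python
# Bytes needed to store one parameter at each precision (illustrative table).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate weight storage in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 7e9  # nominal size from this card, not an exact count
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(nominal_params, dtype):.1f} GiB")
```

Under these assumptions, half-precision weights need roughly half the memory of full-precision weights, which is why 16-bit or quantized loading is the usual choice on a single consumer GPU.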

## Training Details

The model was fine-tuned on the 14 datasets listed in this card's metadata, totaling approximately 1.2 million examples. Default training hyperparameters were used. In the future, the fine-tuning dataset may be condensed to more closely match the 5,000-example dataset reputedly used for the original Cerebrum model.