@Locutusque on Hugging Face: "Exciting news! 🎉 I've created the OpenCerebrum datasets, open-source…"


Locutusque

posted an update Mar 27


Exciting news! 🎉 I've created the OpenCerebrum datasets, open-source alternatives to Aether Research's proprietary Cerebrum dataset.

The first, OpenCerebrum SFT, is a text-generation and question-answering dataset with ~1.2M examples, curated from sources like Open-Orca, glaiveai, camel-ai, and more! 📚

The second, OpenCerebrum DPO, is a smaller dataset with ~21k examples, intended for Direct Preference Optimization (DPO). It's curated from sources like jondurbin, argilla, grimulkan, and others. 📊

Both datasets are licensed under Apache-2.0 and are available in English. They're ready for use in your projects, and I welcome any feedback for future improvements! 🚀

Locutusque/OpenCerebrum-dpo
Locutusque/OpenCerebrum-SFT
Locutusque/OpenCerebrum-1.0-7b-SFT
Locutusque/OpenCerebrum-1.0-7b-DPO
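
For anyone who wants a quick look at the data before committing to a full download, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The repository IDs are the ones listed above; the `peek` helper is a hypothetical name, and the `train` split name is an assumption, so check each dataset card if it fails. No column names are assumed here.

```python
from typing import Any, Dict

# Repository IDs as listed in the post.
SFT_REPO = "Locutusque/OpenCerebrum-SFT"
DPO_REPO = "Locutusque/OpenCerebrum-dpo"


def peek(repo_id: str, split: str = "train") -> Dict[str, Any]:
    """Return the first record of a Hub dataset without downloading all of it.

    The default split name "train" is an assumption, not confirmed by the post.
    """
    # Imported lazily so this module loads even if `datasets` is not installed
    # (install it with `pip install datasets`).
    from datasets import load_dataset

    # streaming=True yields records on demand instead of fetching every file.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return next(iter(stream))
```

Calling `peek(SFT_REPO)` or `peek(DPO_REPO)` returns one record as a plain dict; inspecting its keys first is safer than guessing the schema.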

Fizzarolli

Mar 27

feels a bit disingenuous to try and claim that it's an "Open Cerebrum" to me? the entire point of cerebrum's work, from my perspective, is their dataset in the first place w/ its relatively small size, targeted concepts, and (presumably) human-written-ness (or at least it's what they imply). a collection of synthetic data from random datasets, even with care taken to filter things around, doesn't reaaaally feel very close to me?

regardless, nice work! even if it's not an exact replication in my book it could always be useful for something

Locutusque

Mar 27

You're right. I did mention in the dataset card that it does not match the size of the Cerebrum dataset. Matching it is something I'm going to try to achieve in the future, and this release is a way to test how I would go about structuring such a dataset. For now I'm trying to achieve the same performance, then I'll work towards structuring it similarly to the Cerebrum dataset. Thank you for holding me accountable about this.

osanseviero

Mar 27

This is very cool! @dvilasuero check this out!

lewtun

Mar 27

Super cool release, thank you for sharing these datasets with the community! I'm not familiar with Aether Research or their Cerebrum dataset - is this something that has been used to train other open LLMs?

Locutusque

Mar 27

https://huggingface.co/AetherResearch/Cerebrum-1.0-7b. As I mentioned earlier, although OpenCerebrum is a bit different from the proprietary dataset created by Aether Research, it serves as a foundation for hopefully achieving that in the future.
