Base Model or Finetuned Version?
It's not really clear from your description whether this is the extracted base model or whether you already did finetuning on top of it.
If the latter, which data and prompt format did you use?
I'd be interested in just the extracted base model without any additional finetuning.
@jphme
You're good! It's a fine-tune. I will release the base model along with the v2 LoRA, so anyone who wants to fine-tune it with LoRA, either from my checkpoint or from scratch, can. My wifi bandwidth can only go so far, and I haven't slept since Mixtral 22B dropped 😅. Also, the safetensors files are almost done uploading; I'd say about 15 more minutes.
@flozi00 Yeah, all computation was done locally; my room's a bit toasty right now 😂. Here is my Twitter, I just followed you (from your HF profile) so we can DM: https://twitter.com/mejia_petit . I would love to talk more about this! (Preferably tomorrow, I haven't slept since Mixtral 22B dropped.)
Please drop a quantized version.
@jphme @flozi00 I have untrained versions of each raw extracted expert as dense Mistral 22B models.
https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-0
The other experts are also on my profile.
There is also this one, which is a linear merge of all experts into one model: https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-lerp
They all output gibberish for the most part; expert 2 seems to be the most coherent from my limited tests. Expert 0 has the lowest perplexity on wikitext, but I wasn't able to generate coherent text with it.
I'll be sharing code and evals in the next few hours.
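In the meantime, here is a rough sketch of what such an extraction (and the lerp merge) could look like. This is not the code referenced above, just an illustration: it assumes the Hugging Face transformers module naming for Mixtral (`block_sparse_moe.experts.{j}.w1/w2/w3`) and Mistral (`mlp.gate_proj/up_proj/down_proj`), and the repo id is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, MistralConfig, MistralForCausalLM

MOE_ID = "mistralai/Mixtral-8x22B-v0.1"  # source MoE checkpoint (assumption)
EXPERT = 0                               # which expert to extract
NUM_EXPERTS = 8
LERP = False                             # True = average all experts instead

moe = AutoModelForCausalLM.from_pretrained(MOE_ID, torch_dtype=torch.bfloat16)

# Dense Mistral config with the same dimensions as a single Mixtral expert.
cfg = MistralConfig(
    vocab_size=moe.config.vocab_size,
    hidden_size=moe.config.hidden_size,
    intermediate_size=moe.config.intermediate_size,
    num_hidden_layers=moe.config.num_hidden_layers,
    num_attention_heads=moe.config.num_attention_heads,
    num_key_value_heads=moe.config.num_key_value_heads,
    max_position_embeddings=moe.config.max_position_embeddings,
    rms_norm_eps=moe.config.rms_norm_eps,
    rope_theta=moe.config.rope_theta,
)
dense = MistralForCausalLM(cfg).to(torch.bfloat16)

src, dst = moe.state_dict(), dense.state_dict()
proj_map = {"gate_proj": "w1", "up_proj": "w3", "down_proj": "w2"}

for name in dst:
    if ".mlp." in name:
        # e.g. "model.layers.12.mlp.gate_proj.weight"
        parts = name.split(".")
        layer, proj = parts[2], parts[4]
        experts = [
            src[f"model.layers.{layer}.block_sparse_moe.experts.{j}.{proj_map[proj]}.weight"]
            for j in range(NUM_EXPERTS)
        ]
        dst[name] = torch.stack(experts).mean(dim=0) if LERP else experts[EXPERT]
    elif name in src:
        # Attention, norms, embeddings and lm_head share names between the two
        # architectures; the MoE router (block_sparse_moe.gate) is dropped entirely.
        dst[name] = src[name]

dense.load_state_dict(dst)
dense.save_pretrained("Unmixtraled-22B-lerp" if LERP else f"Unmixtraled-22B-expert-{EXPERT}")
```

Note that instantiating the full 8x22B in bf16 like this needs a few hundred GB of memory; streaming the safetensors shards from disk is the more practical route.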
Would love to know the prompt format as well.
Thank you.
@Winmodel Working to get an AWQ quant of this, debugging a few errors.
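For reference, the usual AutoAWQ flow looks roughly like the sketch below; the source repo id is an assumption and the config values are just AutoAWQ's common defaults, not necessarily what the finished quant will use.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Vezora/Mistral-22B-v0.1"   # assumed source repo
quant_path = "Mistral-22B-v0.1-AWQ"      # hypothetical output dir
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model, run AWQ calibration, then save the 4-bit weights.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```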
Those were some of my findings as well. I also found expert 2 to be the only one to write consistent English words. Granted, its output was completely unrelated to what I asked, but at least it was an attempt; every other expert truly did become an "expert in language", as shown here: https://huggingface.co/blog/moe#what-does-an-expert-learn. Some would have finicky spacing and symbols, and some were just mangled nonsense.
Alpaca! V2 is almost done, and it's also Alpaca, but in multi-turn raw format. (So it's the same thing for you, just more work prepping the dataset for me.)
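For anyone prompting v1, the standard single-turn Alpaca template looks like this; whether the optional `### Input:` block was also used is an assumption left out here.

```python
# Standard Alpaca prompt template (no-input variant); v2 reportedly keeps the
# same format but multi-turn.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(instruction="Summarize what a mixture-of-experts layer does.")
print(prompt)
```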
Great! I look forward to V2 :)
Thank you!! V2 is essentially the test to see whether using all experts equally is the best thing to do, or just using a single one; by increasing the data size by 8x I will easily be able to verify the knowledge of the model. There are still other methods I have yet to try, so I'm not done. I'm going to keep going until I get a 22B that outperforms Mistral 7B, as expected of a 22B model.
I am uploading some GGUF quants (with importance matrix) here: https://huggingface.co/qwp4w3hyb/Mistral-22B-v0.1-iMat-GGUF
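For anyone wanting to make this kind of quant themselves, the typical llama.cpp importance-matrix workflow looks roughly like the sketch below. The binary and file names are assumptions (newer llama.cpp builds prefix the tools with `llama-`), and this is not necessarily how that particular repo was produced.

```python
import subprocess

F16_GGUF = "mistral-22b-v0.1-f16.gguf"     # full-precision GGUF export (hypothetical name)
CALIB_TXT = "calibration.txt"              # plain-text calibration corpus
IMATRIX = "imatrix.dat"
OUT_GGUF = "mistral-22b-v0.1-Q4_K_M.gguf"  # hypothetical output name

# 1) Run the model over the calibration text and record activation statistics.
subprocess.run(["./imatrix", "-m", F16_GGUF, "-f", CALIB_TXT, "-o", IMATRIX], check=True)

# 2) Quantize to Q4_K_M, weighting quantization error by the collected importance matrix.
subprocess.run(
    ["./quantize", "--imatrix", IMATRIX, F16_GGUF, OUT_GGUF, "Q4_K_M"],
    check=True,
)
```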