Wait? 4x13b model?

#1 · by mirek190 - opened

Yeah, the transformers library supports defining your own MoE and fine-tuning with it.
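Roughly something like this (a toy sketch with made-up sizes, not this repo's actual recipe; it assumes you're using the Mixtral-style MoE classes in transformers):

```python
# Sketch: defining a small custom MoE model in transformers via MixtralConfig.
# All sizes here are toy values, not the 4x13B model's real dimensions.
from transformers import MixtralConfig, MixtralForCausalLM

config = MixtralConfig(
    hidden_size=1024,
    intermediate_size=2816,
    num_hidden_layers=8,
    num_attention_heads=16,
    num_local_experts=4,      # the "4x" part: four expert MLPs per layer
    num_experts_per_tok=2,    # the router activates the top-2 experts per token
)
model = MixtralForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))
```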

An MoE router is implemented as an nn.Linear followed by a softmax, and then the top-k experts are chosen for each token. But I am still curious how it is done here (see the sketch after the edit below).

Edit:
Here is the discussion: https://huggingface.co/Undi95/Llamix2-MLewd-4x13B/discussions/1?not-for-all-audiences=true
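For what it's worth, the routing idea looks roughly like this (a minimal PyTorch sketch of a top-k gate, not this model's actual code; the class name, sizes, and expert MLPs are invented for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal MoE layer: a linear router scores experts, softmax, keep top-k per token."""
    def __init__(self, hidden_size, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for every token.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Each "expert" here is a small feed-forward block (stand-in for a 13B expert's MLP).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, hidden_size)
        logits = self.gate(x)                                # (tokens, num_experts)
        weights = F.softmax(logits, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)    # keep only the top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)      # renormalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(8, 512)           # 8 tokens, hidden size 512
print(TopKMoE(512)(x).shape)      # torch.Size([8, 512])
```

Only the chosen experts run on each token, which is why a 4x13B model is much cheaper per token than its total parameter count suggests.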

This is probably a dumb question, but I won't know the answer until I ask, and my research hasn't quite made it clear to me. How do I determine the max context size I can use with this? I see it limited to 2048 in the default SillyTavern setup, but I've seen it mentioned that you can turn it up higher in some cases. If I'm asking in the wrong place, I apologize and then ask: where's the right place?

@smartdavik the max is 4096 for this model, as it's a Llama 2 based model, which has a 4096-token context.
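If you want to check for yourself, you can usually read it straight out of the model's config (a rough sketch; this assumes the repo's config.json exposes the standard Llama-style max_position_embeddings field):

```python
from transformers import AutoConfig

# Load just the config (no weights) to inspect the model's native context window.
config = AutoConfig.from_pretrained("Undi95/Llamix2-MLewd-4x13B")

# Llama-2-family configs store the trained context length here; expect 4096.
print(config.max_position_embeddings)
```

Going past that number without some RoPE scaling trick will generally make the output fall apart.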
