Can you create a version trained on a bigger dataset, with more coding languages and capabilities? I can provide the data

#1
by rombodawg - opened

As the title says, I have created a dataset that would be perfect for training the V2 of your model.
I would have made the model myself, but I simply don't have the resources to create it, sadly.
Here is the dataset I am talking about below; feel free to use it.
https://huggingface.co/datasets/rombodawg/MegaCodeTraining112k
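
It loads like any other Hub dataset if you want to inspect it (a minimal sketch; the "train" split name is the usual Hub default, and the prints are just for schema discovery):

```python
# Minimal sketch: load the dataset from the Hub and inspect its schema.
# The "train" split name is the standard Hub default and an assumption here.
from datasets import load_dataset

ds = load_dataset("rombodawg/MegaCodeTraining112k", split="train")
print(ds.column_names)  # discover the actual fields
print(ds[0])            # peek at the first example
```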

Awesome! Thanks for your kindness.
Okay, let me retrain a better version with your dataset as soon as possible.
When it is available, I will let you know. : )

Any progress? I have a new dataset now too:

https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored

Cool! Almost done (1638/1840 steps)! I guess deepse/CodeUp-Llama-2-13b-chat-hf, trained with 190K code instruction samples, will be available by the day after tomorrow. : )

@rombodawg How can I contact you privately? My email is [email protected] . I have a plan for the data and want to discuss it with you more. : )

Message me on Discord.
My Discord name is rombodawg.

This seems like a great model in the works and I'm very interested in taking it for a test drive. Are you going to update these model files or release a v2 model when this round of training is done?

If you are interested, I have some ideas, scripts, and the beginnings of some datasets I've made from scratch to give LLMs broader knowledge on several subjects that every coding model seems weak on. I am building datasets around several things, such as more current knowledge on code for LM development, training-efficiency improvements, and groundbreaking projects that introduce new technologies. I am also focusing on stretching coding context with larger scripting examples, rather than the small, narrow scripting examples that are common in most datasets.

I only have CPU rigs with large RAM to train with, and it's too slow to be practical, so I've been looking for someone with access to GPU rigs. If you'd like to collaborate, let me know.

Is this a question for me or juyongjiang? Because I don't have powerful PC hardware for training. I made my datasets with Notepad++.

Yeah, it was aimed at juyongjiang. Sorry I didn't make that clear.

Notepad++ has some nice column-editing features that have come in handy in the past. Unfortunately, some of my datasets have outgrown Notepad++'s ability to edit them, so I have to edit the big datasets with scripts. Speaking of scripts, how have you been uncensoring your datasets? Are you using scripts like the ones ehartford, ewof, or anon8231489123 use on their datasets?

I've been creating my own custom scripts using ChatGPT, borrowing some code from sources like the ones you mentioned.
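
Something along these lines (a minimal sketch of that kind of filter, not my actual script; the phrase list and the JSONL field name are illustrative assumptions):

```python
# Sketch of a dataset "uncensoring" filter: drop examples whose responses
# contain common refusal/moralizing phrases, in the spirit of the public
# filter scripts mentioned above. The phrase list and the "output" field
# name are assumptions for illustration.
import json

REFUSAL_PHRASES = [
    "as an ai language model",
    "i cannot fulfill that request",
    "i'm sorry, but i cannot",
]

def is_censored(example: dict) -> bool:
    text = example.get("output", "").lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

with open("dataset.jsonl") as src, open("dataset_filtered.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        if not is_censored(example):
            dst.write(json.dumps(example) + "\n")
```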

@JuLuComputing

Awesome!!

"There are several things I am building datasets around, such as, more current knowledge on code for LM development, training efficiency improvements, and groundbreaking projects that add new technologies. I am also focusing on stretching coding context with larger scripting examples, rather than small and narrow scripting examples that are common in most datasets."

I am very interested in this. How can I contact you privately? Let's discuss more details. : )

@JuLuComputing Sure, I will release it as CodeUp-alpha. Almost done (1790/1840 steps)! : )

Fantastic! I'm interested in trying it out.

Also, I just now sent you an email about collaboration. :-)

@JuLuComputing Great! I have released CodeUp-alpha-13b-hf. Please give it a try: https://huggingface.co/deepse/CodeUp-alpha-13b-hf. : )
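
Loading it with transformers should look something like this (a minimal sketch; the plain prompt below is just an example, not an official prompt template):

```python
# Minimal sketch for trying the released checkpoint with transformers.
# device_map="auto" requires the accelerate package; the bare prompt is
# an assumption about the expected input format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepse/CodeUp-alpha-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```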

@juyongjiang
Awesome! Can you link my dataset in your readme, right after the license, like this:

datasets:
  - rombodawg/Legacy_MegaCodeTraining200k

I'm assuming you used that version; if not, you can link the updated uncensored version like this:

datasets:
  - rombodawg/2XUNCENSORED_MegaCodeTraining188k
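
(For reference, the same front-matter change can also be made programmatically with huggingface_hub; a minimal sketch, with a placeholder token:)

```python
# Sketch: add the dataset link to the model card's YAML metadata via the Hub.
# metadata_update edits the README front matter in place; overwrite=True is
# needed if a "datasets" field already exists. The token is a placeholder.
from huggingface_hub import metadata_update

metadata_update(
    "deepse/CodeUp-alpha-13b-hf",
    {"datasets": ["rombodawg/Legacy_MegaCodeTraining200k"]},
    overwrite=True,
    token="hf_xxx",  # a write token for the repo
)
```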

@rombodawg Cool! No problem. Done!

@juyongjiang I've made a much bigger and better coding dataset, if you are interested in making a version 2:
https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_2.2m_Evol
