KrisPi/Wizard-Coder-0.66-Redmond-Hermes-0.33-ct2fast

This model is a merge of 66% Wizard Coder and 33% Redmond Hermes Coder (which is itself a Wizard Coder fine-tune):

https://huggingface.co/NousResearch/Redmond-Hermes-Coder

https://huggingface.co/WizardLM/WizardCoder-15B-V1.0

The merge was done in the most basic way, by averaging the weight values.
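A minimal sketch of what such a value average could look like, assuming both checkpoints expose the same tensor names and shapes. The function name, the toy flattened "tensors", and the normalisation of the 0.66/0.33 shares so they sum to 1 are assumptions made for illustration, not the exact script used for this merge.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(a: Weights, b: Weights,
                  share_a: float = 0.66, share_b: float = 0.33) -> Weights:
    """Element-wise weighted average of two checkpoints with identical layouts."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must contain the same tensor names")
    total = share_a + share_b  # assumption: normalise the shares so they sum to 1
    wa, wb = share_a / total, share_b / total
    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [wa * x + wb * y for x, y in zip(a[name], b[name])]
    return merged


# Toy example with flattened values; real checkpoints would hold framework tensors.
wizard = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
hermes = {"layer.weight": [4.0, 5.0], "layer.bias": [3.0]}
merged = merge_weights(wizard, hermes)
```

With real checkpoints the same loop would run over the two state dicts, tensor by tensor.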

CTranslate2 is used for quantization and inference, achieving as much as 37 tokens/s on an RTX 3090 GPU.
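For anyone who wants to compare throughput on their own hardware, here is a rough sketch of how a tokens/s figure could be measured. The `generate` callable is hypothetical (any function that takes a prompt and returns the list of newly generated token ids), and the injectable `clock` is only there to make the helper easy to test.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[str], List[int]],
                      prompt: str,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Return generated tokens divided by wall-clock seconds for one call."""
    start = clock()
    new_tokens = generate(prompt)  # hypothetical backend call
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(new_tokens) / elapsed
```

Averaging several runs with a fixed maximum output length gives a more stable number than a single call.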

Inference is done with text-generation-webui:

I added the code from this pull request and updated requirements.txt: https://github.com/oobabooga/text-generation-webui/pull/2828

One extra change is needed in the code: replace reply = apply_extensions('output', reply) with reply = apply_extensions('output', reply, state)

The idea was to get back some of the coding abilities that were lost in the fine-tune while retaining at least basic capabilities to summarize text and work with context. This experiment was also focused on using CT2 for its speed.

I believe the presented approach is the best available compromise between speed, coding accuracy, and some general LLM use.

Please note that the CT2 8-bit quant seems to get better HumanEval scores than --load-in-8bit.

The community now mostly focuses on teaching non-coding models to code, as making coding models more general seems near impossible. However, my daily use is focused on DevOps questions, summarizing content, and script development. Further development will be around intent analysis for integration with TODO lists and calendars, extracting actions and notes from my voice transcriptions. This model doesn't seem to work well enough on those tasks, so next time I will attempt actual fine-tunes of Wizard Coder or just run two models at the same time. I hope to fit under 24GB of VRAM, which means I will also evaluate 4-bit quantization.

My initial testing was checking if the model finds:

Overflow: "what is mistake in following C++ code: int a = 1e9+7; int b = 1e9+9; int c = a*b; cout << c;"

Out of bounds: "what is bug in the following C++ code: int a = 100; vector <int> b(a); b[a] = 20; cout << b[a] << '\n';"

and whether it proposes using "docker update" for "how to stop docker container so it doesnt start every reboot"
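For reference, the arithmetic behind the two C++ prompts can be checked quickly. This is only a sanity check of the expected answers, assuming a 32-bit signed int as on typical platforms; Python integers do not overflow, so the product below is the exact value.

```python
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed int (assumed int width)

a = 10**9 + 7
b = 10**9 + 9
product = a * b  # exact value, Python ints are arbitrary precision

print(product)              # 1000000016000000063
print(product > INT32_MAX)  # True: the product does not fit in a 32-bit int

size = 100
valid_indices = range(size)   # a vector of 100 elements has indices 0..99
print(size in valid_indices)  # False: b[100] is past the end
```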

I ran those prompts in a loop with different presets and ended up picking this preset:

['temperature'] = 1.31
['top_p'] = 0.29
['top_k'] = 72
['repetition_penalty'] = 1.09
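The loop over prompts and presets could be sketched roughly as below. The `generate` callable, the number of runs, and the keywords used to decide whether an answer is correct are all assumptions for illustration; only the prompts and the preset values come from this card.

```python
from typing import Callable, Dict, List, Tuple

CUSTOM_PRESET = {"temperature": 1.31, "top_p": 0.29,
                 "top_k": 72, "repetition_penalty": 1.09}

# (prompt, keyword expected in a correct answer); the keywords are assumptions.
CHECKS: List[Tuple[str, str]] = [
    ("what is mistake in following C++ code: int a = 1e9+7; int b = 1e9+9; "
     "int c = a*b; cout << c;", "overflow"),
    ("what is bug in the following C++ code: int a = 100; vector <int> b(a); "
     "b[a] = 20; cout << b[a] << '\\n';", "out of bounds"),
    ("how to stop docker container so it doesnt start every reboot",
     "docker update"),
]


def score_preset(generate: Callable[..., str],
                 preset: Dict[str, float],
                 checks: List[Tuple[str, str]] = CHECKS,
                 runs: int = 5) -> float:
    """Fraction of sampled replies that mention the expected keyword."""
    hits = 0
    for prompt, keyword in checks:
        for _ in range(runs):
            reply = generate(prompt, **preset)  # hypothetical backend call
            if keyword.lower() in reply.lower():
                hits += 1
    return hits / (len(checks) * runs)
```

Calling `score_preset` for each candidate preset and keeping the highest score is one simple way to pick a winner; keyword matching is crude, so reading the replies is still worthwhile.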

Testing with the above prompts has shown that Hermes Coder CT2 was not able to answer correctly most of the time, while Wizard Coder and this merge were. The merged model seems to retain the ability to use "### Input:" in the prompt and became more sensitive to non-coding instructions (Wizard Coder almost completely disregards them).

At the bottom you can see EvalPlus benchmarks of the three mentioned models. They all seem to perform similarly with the default preset. I'm not sure whether I'm running the benchmark incorrectly or whether those quants don't work properly with the default preset, as I noticed that the custom preset considerably improved the result.

I would greatly appreciate it if anyone could confirm how good this model is with the proposed preset, as the result I got really positively surprised me (it seems better than any other Wizard Coder 8-bit quant).

CT2 int8_float16 merge, custom preset:
Base {'pass@1': 0.47560975609756095}
Base + Extra {'pass@1': 0.45121951219512196}

For summarization I propose the following prompt:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: Please provide a concise, summary for each topic presented in the input below. Ensure clarity, coherence, and avoid redundant information.

### Input: [CONTENT TO SUMMARIZE]

### Response:The summary for each topic presented in the input is as follows:

Optionally iterate over the output with the following prompt:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: Rewrite summary from Input. Fix typos, add missing spaces. Ensure clarity, coherence, and remove redundant information.

### Input: [OUTPUT FROM PREVIOUS PROMPT]

### Response:
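The two prompts above can be chained into a small two-pass pipeline. This is a sketch under assumptions: `generate` is a hypothetical callable that takes a prompt and returns the model's reply, and the exact line breaks between the prompt sections are my guess at the usual instruction layout. The prompt wording itself is copied from above.

```python
from typing import Callable

PREAMBLE = ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.")
SUMMARIZE = ("Please provide a concise, summary for each topic presented in the "
             "input below. Ensure clarity, coherence, and avoid redundant information.")
REWRITE = ("Rewrite summary from Input. Fix typos, add missing spaces. "
           "Ensure clarity, coherence, and remove redundant information.")
RESPONSE_PREFIX = "The summary for each topic presented in the input is as follows:"


def build_prompt(instruction: str, input_text: str, response_prefix: str = "") -> str:
    """Assemble the instruction / input / response prompt (line breaks assumed)."""
    return (f"{PREAMBLE}\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:{response_prefix}")


def summarize(generate: Callable[[str], str], content: str, refine: bool = True) -> str:
    """First pass summarizes; the optional second pass cleans up the draft."""
    draft = generate(build_prompt(SUMMARIZE, content, RESPONSE_PREFIX))
    if not refine:
        return draft
    return generate(build_prompt(REWRITE, draft))
```

Pre-filling the response with the fixed prefix in the first pass nudges the model straight into the summary; the second pass leaves the response empty, as in the prompt above.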

HumanEval was run using https://github.com/my-other-github-account/llm-humaneval-benchmarks/ and:

sudo docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples results/{model_name}.jsonl

Custom preset:
['temperature'] = 1.31
['top_p'] = 0.29
['top_k'] = 72
['repetition_penalty'] = 1.09

CT2 int8_float16 merge, custom preset:
Base {'pass@1': 0.47560975609756095}
Base + Extra {'pass@1': 0.45121951219512196}
One of the worse reruns:
Base {'pass@1': 0.4573170731707317}
Base + Extra {'pass@1': 0.4146341463414634}

CT2 int8_float16 Wizard Coder, custom preset:
Base {'pass@1': 0.43902439024390244}
Base + Extra {'pass@1': 0.3597560975609756}
Retry:
Base {'pass@1': 0.42073170731707316}
Base + Extra {'pass@1': 0.3475609756097561}

Full-weight Wizard Coder loaded with --load-in-8bit, custom preset:
Base {'pass@1': 0.3475609756097561}
Base + Extra {'pass@1': 0.3170731707317073}

Default llm-humaneval-benchmarks preset:
['temperature'] = 1
['top_p'] = 1
['top_k'] = 0
['repetition_penalty'] = 1

CT2 int8_float16 - this model:
Base {'pass@1': 0.4634146341463415}
Base + Extra {'pass@1': 0.4024390243902439}

CT2 int8_float16 Redmond Hermes Coder:
Base {'pass@1': 0.4695121951219512}
Base + Extra {'pass@1': 0.4146341463414634}

CT2 int8_float16 Wizard Coder:
Base {'pass@1': 0.4695121951219512}
Base + Extra {'pass@1': 0.3902439024390244}

Full-weight Wizard Coder loaded with --load-in-8bit, default preset:
Base {'pass@1': 0.43902439024390244}
Base + Extra {'pass@1': 0.3719512195121951}

Full-weight merged model loaded with --load-in-8bit, default preset:
Base {'pass@1': 0.43902439024390244}
Base + Extra {'pass@1': 0.3902439024390244}

Full-weight Hermes Coder model loaded with --load-in-8bit, default preset:
Base {'pass@1': 0.4451219512195122}
Base + Extra {'pass@1': 0.4146341463414634}
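To put the pass@1 fractions in perspective, they can be converted back into a number of solved problems. HumanEval contains 164 problems; the sketch below assumes one generated sample per problem, so that pass@1 is simply solved / 164. The helper name and the short labels are mine.

```python
HUMANEVAL_PROBLEMS = 164  # size of the HumanEval problem set


def solved_count(pass_at_1: float, n_problems: int = HUMANEVAL_PROBLEMS) -> int:
    """Number of problems solved, assuming one sample per problem."""
    return round(pass_at_1 * n_problems)


# Base scores with the custom preset, copied from the results above.
results = {
    "CT2 merge": 0.47560975609756095,
    "CT2 Wizard Coder": 0.43902439024390244,
    "load-in-8bit Wizard Coder": 0.3475609756097561,
}

for name, score in results.items():
    print(f"{name}: {solved_count(score)}/{HUMANEVAL_PROBLEMS}")
# CT2 merge: 78/164
# CT2 Wizard Coder: 72/164
# load-in-8bit Wizard Coder: 57/164
```

Seen this way, the gap between the merge and CT2 Wizard Coder with the custom preset is six problems, which is within the spread of the reruns listed above.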