metadata
license: llama3
library_name: transformers
tags:
  - nsfw
  - not-for-all-audiences
  - llama-3
  - text-generation-inference
  - moe
  - mergekit
  - merge
model-index:
  - name: Llama-Salad-4x8B-V3
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 66.54
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 31.93
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 8.53
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 7.05
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 6.45
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 27.98
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3
          name: Open LLM Leaderboard

Llama-Salad-4x8B-V3

Changes in V3:

  • Uses L3-8B-Stheno-v3.2 as the base model instead of Meta-Llama-3-8B-Instruct
  • Removed opus-v1.2-llama-3-8b-instruct-run3.5-epoch2.5 and added Einstein-v6.1-Llama3-8B
  • Replaced Llama-3-Soliloquy-8B-v2 with L3-8B-Stheno-v3.2 as the roleplay expert

I was clearly wrong when I said V2 would be difficult to improve on: V3 is significantly better in just about every aspect. Stheno-v3.2 fixed all of the issues present in Stheno-v3.1, making it my favorite roleplay model and the best base model for Llama-3 MoE merges.

The one thing I still want to improve is the conversational expert. Meta-Llama-3-8B-Instruct is good for that use case, but I'm sure there's a better one out there. I tried llama-3-cat-8b-instruct-v1, but it absolutely tanked the merged model's situational awareness and made it produce blatantly contradictory statements.

Quantization Formats

GGUF

Details

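Models Used

  • NousResearch/Meta-Llama-3-8B-Instruct
  • Weyaxi/Einstein-v6.1-Llama3-8B
  • migtissera/Llama-3-8B-Synthia-v3.5
  • Sao10K/L3-8B-Stheno-v3.2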

Merge Config

base_model: Sao10K/L3-8B-Stheno-v3.2
gate_mode: hidden
dtype: bfloat16
experts_per_token: 2
experts:
  - source_model: NousResearch/Meta-Llama-3-8B-Instruct
    positive_prompts:
    - "chat"
    - "conversation"
  - source_model: Weyaxi/Einstein-v6.1-Llama3-8B
    positive_prompts:
    - "science"
    - "physics"
    - "chemistry"
    - "biology"
    - "math"
    - "step-by-step"
    - "logical reasoning"
    - "multilingual"
    - "translation"
    - "language translation"
    - "foreign language"
    negative_prompts:
    - "programming language"
  - source_model: migtissera/Llama-3-8B-Synthia-v3.5
    positive_prompts:
    - "summarize"
    - "paraphrase"
    - "list"
    - "explain"
    - "define"
    - "analyze"
    - "rephrase"
    - "elaborate"
    - "programming language"
    - "JavaScript"
    - "Python programming language"
    - "Rust programming language"
    - "C++ programming language"
    - "GO programming language"
    - "Ruby programming language"
    - "Haskell programming language"
    - "SQL query language"
    - "CSS markup styling language"
    - "code"
  - source_model: Sao10K/L3-8B-Stheno-v3.2
    positive_prompts:
    - "characters"
    - "scene"
    - "roleplay"
    - "erotic roleplay"
    - "sexual fetish"
    - "NSFW"
    - "creative writing"
    - "storytelling"
    - "narration"
    - "narrative setting"
    - "narrative plot"
    - "narrative exposition"
    - "narrative theme"
    - "narrative climax"
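
To make the routing fields in the config above more concrete, here is a minimal sketch of how two of the four experts could be chosen for a single token. It assumes Mixtral-style routing (one router logit per expert from a dot product with the token's hidden state, then a softmax over the selected experts), which is the architecture mergekit MoE merges are generally understood to produce. The `route` helper, the 3-dimensional gate vectors, and the token state are all hypothetical toy values for illustration; none of them come from mergekit, transformers, or the actual model.

```python
import math

# Expert order follows the `experts:` list in the merge config.
EXPERTS = [
    "NousResearch/Meta-Llama-3-8B-Instruct",
    "Weyaxi/Einstein-v6.1-Llama3-8B",
    "migtissera/Llama-3-8B-Synthia-v3.5",
    "Sao10K/L3-8B-Stheno-v3.2",
]
EXPERTS_PER_TOKEN = 2  # `experts_per_token` in the config


def route(hidden, gate_vectors, k=EXPERTS_PER_TOKEN):
    """Hypothetical helper: pick the top-k experts for one token.

    Returns a list of (expert_index, mixing_weight) pairs.
    """
    # One router logit per expert: dot product of hidden state and gate vector.
    logits = [sum(h * g for h, g in zip(hidden, gate)) for gate in gate_vectors]
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits so the mixing weights sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up gate vectors and token state (assumption, for illustration only).
toy_gates = [
    [1.0, 0.0, 0.0],  # conversation expert
    [0.0, 1.0, 0.0],  # science / translation expert
    [0.0, 0.0, 1.0],  # analysis / code expert
    [0.5, 0.0, 0.5],  # roleplay / storytelling expert
]
toy_hidden = [0.2, 1.5, 0.9]

for index, weight in route(toy_hidden, toy_gates):
    print(f"{EXPERTS[index]}: {weight:.2f}")
```

With `gate_mode: hidden`, the real gate vectors would not be written by hand: per mergekit's documentation, they are derived from the base model's hidden-state representations of each expert's positive and negative prompts.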

Open LLM Leaderboard Evaluation Results

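Detailed results can be found on the Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=HiroseKoichi/Llama-Salad-4x8B-V3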

| Metric              | Value |
|---------------------|------:|
| Avg.                | 24.75 |
| IFEval (0-Shot)     | 66.54 |
| BBH (3-Shot)        | 31.93 |
| MATH Lvl 5 (4-Shot) |  8.53 |
| GPQA (0-shot)       |  7.05 |
| MuSR (0-shot)       |  6.45 |
| MMLU-PRO (5-shot)   | 27.98 |
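
The Avg. value is consistent with an unweighted mean of the six benchmark scores listed above. A quick check, assuming that is how the leaderboard computes it (the `scores` dictionary is just the reported values copied by hand):

```python
# Benchmark scores copied from the reported results.
scores = {
    "IFEval (0-Shot)": 66.54,
    "BBH (3-Shot)": 31.93,
    "MATH Lvl 5 (4-Shot)": 8.53,
    "GPQA (0-shot)": 7.05,
    "MuSR (0-shot)": 6.45,
    "MMLU-PRO (5-shot)": 27.98,
}

# Unweighted arithmetic mean across all six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"Avg. {average:.2f}")  # prints "Avg. 24.75"
```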