---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
language:
  - en
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - li
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - ns
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - rm
  - ro
  - ru
  - sa
  - si
  - sc
  - sd
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tn
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yi
  - yo
  - zu
datasets:
  - yahma/alpaca-cleaned
  - gbharti/wealth-alpaca_lora
  - databricks/databricks-dolly-15k
  - VMware/open-instruct
  - saillab/taco-datasets
  - xu-song/cc100-samples
  - jordiclive/wikipedia-summary-dataset
  - bigcode/the-stack-smol-xs
  - m-a-p/CodeFeedback-Filtered-Instruction
  - jtatman/python-code-dataset-500k
  - iamtarun/python_code_instructions_18k_alpaca
  - HuggingFaceH4/CodeAlpaca_20K
  - cognitivecomputations/dolphin-coder
  - fblgit/simple-math
  - gair-prox/open-web-math-pro
  - rvv-karma/Math-QA
  - ajibawa-2023/Maths-College
  - microsoft/orca-math-word-problems-200k
  - meta-math/MetaMathQA
  - TIGER-Lab/MathInstruct
  - TIGER-Lab/WebInstructSub
  - SkunkworksAI/reasoning-0.01
  - KingNish/reasoning-base-20k
  - Magpie-Align/Magpie-Reasoning-150K
  - thesven/gsm8k-reasoning
  - AlgorithmicResearchGroup/math_reasoning_autoformalization_track
  - badrex/llm-emoji-dataset
tags:
  - litgpt
  - litdata
---

# tangled-llama-t-32k-base-v0.1


A pretrained language model based on the Llama architecture, with about 25M parameters. It was trained on 22.1B (22,111,299,936) tokens from more than 3.6M (3,597,088) dataset rows.

This model is not intended for direct use; it serves as a base for continued pretraining and finetuning on a downstream task. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained on sequences of 2K (2,048) tokens.
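
Since the checkpoint is published for the `transformers` library, loading it for a quick sanity check could look like the sketch below. This is a minimal example under assumptions, not an official usage recipe; in particular, the repository id is an assumption based on the model name.

```python
# Minimal sketch: load the checkpoint with transformers and sample a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tangledgroup/tangled-llama-t-32k-base-v0.1"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# This is a raw base model: prompt it with plain text to continue,
# not with chat-formatted turns.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```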

The objective is to streamline the model into a cognitive and reasoning core, eliminating redundant knowledge.
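
As a rough sketch of what a downstream finetuning step could look like with the same litgpt tooling used for the evaluations below. The subcommand, data module, and flags here are assumptions, so check the litgpt documentation for the exact interface:

```bash
# Hypothetical LoRA finetuning run on the pretrained checkpoint;
# subcommand, data module, and flag names are assumptions, not verified.
litgpt finetune_lora out/pretrain/final/ \
  --data Alpaca2k \
  --train.max_seq_length 2048 \
  --out_dir out/finetune/downstream/
```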

Training curves: loss, val_loss, val_ppl, epoch, and learning_rate (plots not shown).

## lm-evaluation-harness

```bash
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

|                 Tasks                 |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_challenge | 1|none | 0|acc |↑ |0.1971|± |0.0116|
| | |none | 0|acc_norm |↑ |0.2423|± |0.0125|
|gsm8k | 3|flexible-extract| 5|exact_match|↑ |0.0099|± |0.0027|
| | |strict-match | 5|exact_match|↑ |0.0000|± |0.0000|
|hellaswag | 1|none | 0|acc |↑ |0.2608|± |0.0044|
| | |none | 0|acc_norm |↑ |0.2665|± |0.0044|
|mmlu | 2|none | |acc |↑ |0.2451|± |0.0036|
| - humanities | 2|none | |acc |↑ |0.2470|± |0.0063|
| - formal_logic | 1|none | 0|acc |↑ |0.3254|± |0.0419|
| - high_school_european_history | 1|none | 0|acc |↑ |0.2545|± |0.0340|
| - high_school_us_history | 1|none | 0|acc |↑ |0.2745|± |0.0313|
| - high_school_world_history | 1|none | 0|acc |↑ |0.2194|± |0.0269|
| - international_law | 1|none | 0|acc |↑ |0.2231|± |0.0380|
| - jurisprudence | 1|none | 0|acc |↑ |0.2685|± |0.0428|
| - logical_fallacies | 1|none | 0|acc |↑ |0.2025|± |0.0316|
| - moral_disputes | 1|none | 0|acc |↑ |0.2457|± |0.0232|
| - moral_scenarios | 1|none | 0|acc |↑ |0.2670|± |0.0148|
| - philosophy | 1|none | 0|acc |↑ |0.1865|± |0.0221|
| - prehistory | 1|none | 0|acc |↑ |0.2500|± |0.0241|
| - professional_law | 1|none | 0|acc |↑ |0.2523|± |0.0111|
| - world_religions | 1|none | 0|acc |↑ |0.1871|± |0.0299|
| - other | 2|none | |acc |↑ |0.2456|± |0.0077|
| - business_ethics | 1|none | 0|acc |↑ |0.3400|± |0.0476|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.2113|± |0.0251|
| - college_medicine | 1|none | 0|acc |↑ |0.2543|± |0.0332|
| - global_facts | 1|none | 0|acc |↑ |0.1800|± |0.0386|
| - human_aging | 1|none | 0|acc |↑ |0.1749|± |0.0255|
| - management | 1|none | 0|acc |↑ |0.3398|± |0.0469|
| - marketing | 1|none | 0|acc |↑ |0.2479|± |0.0283|
| - medical_genetics | 1|none | 0|acc |↑ |0.3100|± |0.0465|
| - miscellaneous | 1|none | 0|acc |↑ |0.2171|± |0.0147|
| - nutrition | 1|none | 0|acc |↑ |0.2647|± |0.0253|
| - professional_accounting | 1|none | 0|acc |↑ |0.2270|± |0.0250|
| - professional_medicine | 1|none | 0|acc |↑ |0.2978|± |0.0278|
| - virology | 1|none | 0|acc |↑ |0.3133|± |0.0361|
| - social sciences | 2|none | |acc |↑ |0.2584|± |0.0079|
| - econometrics | 1|none | 0|acc |↑ |0.2193|± |0.0389|
| - high_school_geography | 1|none | 0|acc |↑ |0.2677|± |0.0315|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |0.2435|± |0.0310|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.2538|± |0.0221|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.2647|± |0.0287|
| - high_school_psychology | 1|none | 0|acc |↑ |0.2679|± |0.0190|
| - human_sexuality | 1|none | 0|acc |↑ |0.3435|± |0.0416|
| - professional_psychology | 1|none | 0|acc |↑ |0.2190|± |0.0167|
| - public_relations | 1|none | 0|acc |↑ |0.2091|± |0.0390|
| - security_studies | 1|none | 0|acc |↑ |0.2980|± |0.0293|
| - sociology | 1|none | 0|acc |↑ |0.2836|± |0.0319|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.3000|± |0.0461|
| - stem | 2|none | |acc |↑ |0.2287|± |0.0075|
| - abstract_algebra | 1|none | 0|acc |↑ |0.2100|± |0.0409|
| - anatomy | 1|none | 0|acc |↑ |0.2000|± |0.0346|
| - astronomy | 1|none | 0|acc |↑ |0.2434|± |0.0349|
| - college_biology | 1|none | 0|acc |↑ |0.3333|± |0.0394|
| - college_chemistry | 1|none | 0|acc |↑ |0.3000|± |0.0461|
| - college_computer_science | 1|none | 0|acc |↑ |0.2600|± |0.0441|
| - college_mathematics | 1|none | 0|acc |↑ |0.3100|± |0.0465|
| - college_physics | 1|none | 0|acc |↑ |0.2353|± |0.0422|
| - computer_security | 1|none | 0|acc |↑ |0.2300|± |0.0423|
| - conceptual_physics | 1|none | 0|acc |↑ |0.2085|± |0.0266|
| - electrical_engineering | 1|none | 0|acc |↑ |0.2621|± |0.0366|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.2011|± |0.0206|
| - high_school_biology | 1|none | 0|acc |↑ |0.2097|± |0.0232|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.2217|± |0.0292|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.2300|± |0.0423|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.1926|± |0.0240|
| - high_school_physics | 1|none | 0|acc |↑ |0.2318|± |0.0345|
| - high_school_statistics | 1|none | 0|acc |↑ |0.1806|± |0.0262|
| - machine_learning | 1|none | 0|acc |↑ |0.2857|± |0.0429|
|truthfulqa_mc2 | 2|none | 0|acc |↑ |0.4880|± |0.0161|
|winogrande | 1|none | 0|acc |↑ |0.5185|± |0.0140|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|------:|------|-----:|------|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |0.2451|±  |0.0036|
| - humanities     |      2|none  |      |acc   |0.2470|±  |0.0063|
| - other          |      2|none  |      |acc   |0.2456|±  |0.0077|
| - social sciences|      2|none  |      |acc   |0.2584|±  |0.0079|
| - stem           |      2|none  |      |acc   |0.2287|±  |0.0075|

```bash
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

|Tasks|Version|Filter|n-shot|Metric|Value|   |Stderr|
|-----|------:|------|-----:|------|----:|---|-----:|
|leaderboard|N/A| | | | | | |
| - leaderboard_bbh|N/A| | | | | | |
| - leaderboard_bbh_boolean_expressions|1|none|3|acc_norm|0.4600|±|0.0316|
| - leaderboard_bbh_causal_judgement|1|none|3|acc_norm|0.5027|±|0.0367|
| - leaderboard_bbh_date_understanding|1|none|3|acc_norm|0.1720|±|0.0239|
| - leaderboard_bbh_disambiguation_qa|1|none|3|acc_norm|0.2960|±|0.0289|
| - leaderboard_bbh_formal_fallacies|1|none|3|acc_norm|0.4880|±|0.0317|
| - leaderboard_bbh_geometric_shapes|1|none|3|acc_norm|0.0000|±|0|
| - leaderboard_bbh_hyperbaton|1|none|3|acc_norm|0.5160|±|0.0317|
| - leaderboard_bbh_logical_deduction_five_objects|1|none|3|acc_norm|0.2000|±|0.0253|
| - leaderboard_bbh_logical_deduction_seven_objects|1|none|3|acc_norm|0.1480|±|0.0225|
| - leaderboard_bbh_logical_deduction_three_objects|1|none|3|acc_norm|0.3160|±|0.0295|
| - leaderboard_bbh_movie_recommendation|1|none|3|acc_norm|0.2360|±|0.0269|
| - leaderboard_bbh_navigate|1|none|3|acc_norm|0.4680|±|0.0316|
| - leaderboard_bbh_object_counting|1|none|3|acc_norm|0.0480|±|0.0135|
| - leaderboard_bbh_penguins_in_a_table|1|none|3|acc_norm|0.1918|±|0.0327|
| - leaderboard_bbh_reasoning_about_colored_objects|1|none|3|acc_norm|0.1440|±|0.0222|
| - leaderboard_bbh_ruin_names|1|none|3|acc_norm|0.2360|±|0.0269|
| - leaderboard_bbh_salient_translation_error_detection|1|none|3|acc_norm|0.1360|±|0.0217|
| - leaderboard_bbh_snarks|1|none|3|acc_norm|0.5225|±|0.0375|
| - leaderboard_bbh_sports_understanding|1|none|3|acc_norm|0.4560|±|0.0316|
| - leaderboard_bbh_temporal_sequences|1|none|3|acc_norm|0.2960|±|0.0289|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects|1|none|3|acc_norm|0.2120|±|0.0259|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|1|none|3|acc_norm|0.1840|±|0.0246|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|1|none|3|acc_norm|0.3160|±|0.0295|
| - leaderboard_bbh_web_of_lies|1|none|3|acc_norm|0.5200|±|0.0317|
| - leaderboard_gpqa|N/A| | | | | | |
| - leaderboard_gpqa_diamond|1|none|0|acc_norm|0.2172|±|0.0294|
| - leaderboard_gpqa_extended|1|none|0|acc_norm|0.2454|±|0.0184|
| - leaderboard_gpqa_main|1|none|0|acc_norm|0.2478|±|0.0204|
| - leaderboard_ifeval|3|none|0|inst_level_loose_acc|0.1727|±|N/A|
| | |none|0|inst_level_strict_acc|0.1559|±|N/A|
| | |none|0|prompt_level_loose_acc|0.0832|±|0.0119|
| | |none|0|prompt_level_strict_acc|0.0795|±|0.0116|
| - leaderboard_math_hard|N/A| | | | | | |
| - leaderboard_math_algebra_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_math_counting_and_prob_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_math_geometry_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_math_intermediate_algebra_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_math_num_theory_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_math_prealgebra_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_math_precalculus_hard|1|none|4|exact_match|0.0000|±|0|
| - leaderboard_mmlu_pro|0.1|none|5|acc|0.1135|±|0.0029|
| - leaderboard_musr|N/A| | | | | | |
| - leaderboard_musr_murder_mysteries|1|none|0|acc_norm|0.5240|±|0.0316|
| - leaderboard_musr_object_placements|1|none|0|acc_norm|0.2734|±|0.0279|
| - leaderboard_musr_team_allocation|1|none|0|acc_norm|0.3000|±|0.0290|

```bash
litgpt evaluate --tasks 'bbh_zeroshot,bbh_fewshot,bbh_cot_fewshot,bbh_cot_zeroshot' --out_dir 'evaluate-bigbenchhard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```bash
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```bash
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```bash
litgpt evaluate --tasks 'mmlu_multilingual,mgsm' --out_dir 'evaluate-multilinguals/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```bash
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

|Tasks |Version|     Filter     |n-shot|  Metric   |Value |   |Stderr|
|------|------:|----------------|-----:|-----------|-----:|---|-----:|
|gsm8k |      3|flexible-extract|     5|exact_match|0.0099|±  |0.0027|
|      |       |strict-match    |     5|exact_match|0.0000|±  |0.0000|
|mathqa|      1|none            |     0|acc        |0.2121|±  |0.0075|
|      |       |none            |     0|acc_norm   |0.2114|±  |0.0075|

```bash
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```