# tangled-llama-t-32k-base-v0.1
A pretrained language model based on the Llama architecture, with about 25M parameters. It was trained on 22.1B (22,111,299,936) tokens drawn from more than 3.6M (3,597,088) dataset rows.

This model is not intended for direct use; rather, it is a base for continued pretraining and finetuning on downstream tasks. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained on sequences of 2K (2,048) tokens.

The objective is to streamline the model down to a cognitive/reasoning core, eliminating redundant knowledge from the model.
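For continued pretraining or finetuning, the checkpoint can be loaded like any other Llama-style causal language model. Below is a minimal sketch assuming the checkpoint has been exported to Hugging Face `transformers` format; the model id is a placeholder, so substitute the actual Hub repo id or a local path:

```python
# Minimal sketch: load the checkpoint as a Llama-style causal LM and sample a
# raw continuation. The model id below is a placeholder; replace it with the
# actual Hub repo id or a local transformers-format export of the checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tangled-llama-t-32k-base-v0.1"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The base model is not instruction-tuned, so expect plain next-token continuations.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```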
## lm-evaluation-harness
```sh
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

|Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|arc_challenge | 1|none | 0|acc |↑ |0.1971|± |0.0116|
| | |none | 0|acc_norm |↑ |0.2423|± |0.0125|
|gsm8k | 3|flexible-extract| 5|exact_match|↑ |0.0099|± |0.0027|
| | |strict-match | 5|exact_match|↑ |0.0000|± |0.0000|
|hellaswag | 1|none | 0|acc |↑ |0.2608|± |0.0044|
| | |none | 0|acc_norm |↑ |0.2665|± |0.0044|
|mmlu | 2|none | |acc |↑ |0.2451|± |0.0036|
| - humanities | 2|none | |acc |↑ |0.2470|± |0.0063|
| - formal_logic | 1|none | 0|acc |↑ |0.3254|± |0.0419|
| - high_school_european_history | 1|none | 0|acc |↑ |0.2545|± |0.0340|
| - high_school_us_history | 1|none | 0|acc |↑ |0.2745|± |0.0313|
| - high_school_world_history | 1|none | 0|acc |↑ |0.2194|± |0.0269|
| - international_law | 1|none | 0|acc |↑ |0.2231|± |0.0380|
| - jurisprudence | 1|none | 0|acc |↑ |0.2685|± |0.0428|
| - logical_fallacies | 1|none | 0|acc |↑ |0.2025|± |0.0316|
| - moral_disputes | 1|none | 0|acc |↑ |0.2457|± |0.0232|
| - moral_scenarios | 1|none | 0|acc |↑ |0.2670|± |0.0148|
| - philosophy | 1|none | 0|acc |↑ |0.1865|± |0.0221|
| - prehistory | 1|none | 0|acc |↑ |0.2500|± |0.0241|
| - professional_law | 1|none | 0|acc |↑ |0.2523|± |0.0111|
| - world_religions | 1|none | 0|acc |↑ |0.1871|± |0.0299|
| - other | 2|none | |acc |↑ |0.2456|± |0.0077|
| - business_ethics | 1|none | 0|acc |↑ |0.3400|± |0.0476|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.2113|± |0.0251|
| - college_medicine | 1|none | 0|acc |↑ |0.2543|± |0.0332|
| - global_facts | 1|none | 0|acc |↑ |0.1800|± |0.0386|
| - human_aging | 1|none | 0|acc |↑ |0.1749|± |0.0255|
| - management | 1|none | 0|acc |↑ |0.3398|± |0.0469|
| - marketing | 1|none | 0|acc |↑ |0.2479|± |0.0283|
| - medical_genetics | 1|none | 0|acc |↑ |0.3100|± |0.0465|
| - miscellaneous | 1|none | 0|acc |↑ |0.2171|± |0.0147|
| - nutrition | 1|none | 0|acc |↑ |0.2647|± |0.0253|
| - professional_accounting | 1|none | 0|acc |↑ |0.2270|± |0.0250|
| - professional_medicine | 1|none | 0|acc |↑ |0.2978|± |0.0278|
| - virology | 1|none | 0|acc |↑ |0.3133|± |0.0361|
| - social sciences | 2|none | |acc |↑ |0.2584|± |0.0079|
| - econometrics | 1|none | 0|acc |↑ |0.2193|± |0.0389|
| - high_school_geography | 1|none | 0|acc |↑ |0.2677|± |0.0315|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |0.2435|± |0.0310|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.2538|± |0.0221|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.2647|± |0.0287|
| - high_school_psychology | 1|none | 0|acc |↑ |0.2679|± |0.0190|
| - human_sexuality | 1|none | 0|acc |↑ |0.3435|± |0.0416|
| - professional_psychology | 1|none | 0|acc |↑ |0.2190|± |0.0167|
| - public_relations | 1|none | 0|acc |↑ |0.2091|± |0.0390|
| - security_studies | 1|none | 0|acc |↑ |0.2980|± |0.0293|
| - sociology | 1|none | 0|acc |↑ |0.2836|± |0.0319|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.3000|± |0.0461|
| - stem | 2|none | |acc |↑ |0.2287|± |0.0075|
| - abstract_algebra | 1|none | 0|acc |↑ |0.2100|± |0.0409|
| - anatomy | 1|none | 0|acc |↑ |0.2000|± |0.0346|
| - astronomy | 1|none | 0|acc |↑ |0.2434|± |0.0349|
| - college_biology | 1|none | 0|acc |↑ |0.3333|± |0.0394|
| - college_chemistry | 1|none | 0|acc |↑ |0.3000|± |0.0461|
| - college_computer_science | 1|none | 0|acc |↑ |0.2600|± |0.0441|
| - college_mathematics | 1|none | 0|acc |↑ |0.3100|± |0.0465|
| - college_physics | 1|none | 0|acc |↑ |0.2353|± |0.0422|
| - computer_security | 1|none | 0|acc |↑ |0.2300|± |0.0423|
| - conceptual_physics | 1|none | 0|acc |↑ |0.2085|± |0.0266|
| - electrical_engineering | 1|none | 0|acc |↑ |0.2621|± |0.0366|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.2011|± |0.0206|
| - high_school_biology | 1|none | 0|acc |↑ |0.2097|± |0.0232|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.2217|± |0.0292|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.2300|± |0.0423|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.1926|± |0.0240|
| - high_school_physics | 1|none | 0|acc |↑ |0.2318|± |0.0345|
| - high_school_statistics | 1|none | 0|acc |↑ |0.1806|± |0.0262|
| - machine_learning | 1|none | 0|acc |↑ |0.2857|± |0.0429|
|truthfulqa_mc2 | 2|none | 0|acc |↑ |0.4880|± |0.0161|
|winogrande | 1|none | 0|acc |↑ |0.5185|± |0.0140|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.2451|±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.2470|±  |0.0063|
| - other          |      2|none  |      |acc   |↑  |0.2456|±  |0.0077|
| - social sciences|      2|none  |      |acc   |↑  |0.2584|±  |0.0079|
| - stem           |      2|none  |      |acc   |↑  |0.2287|±  |0.0075|

```sh
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

|Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----------------------------------------------------------|------:|------|-----:|-----------------------|---|-----:|---|-----:|
|leaderboard | N/A| | | | | | | |
| - leaderboard_bbh | N/A| | | | | | | |
|  - leaderboard_bbh_boolean_expressions | 1|none | 3|acc_norm |↑ |0.4600|± |0.0316|
|  - leaderboard_bbh_causal_judgement | 1|none | 3|acc_norm |↑ |0.5027|± |0.0367|
|  - leaderboard_bbh_date_understanding | 1|none | 3|acc_norm |↑ |0.1720|± |0.0239|
|  - leaderboard_bbh_disambiguation_qa | 1|none | 3|acc_norm |↑ |0.2960|± |0.0289|
|  - leaderboard_bbh_formal_fallacies | 1|none | 3|acc_norm |↑ |0.4880|± |0.0317|
|  - leaderboard_bbh_geometric_shapes | 1|none | 3|acc_norm |↑ |0.0000|± |0|
|  - leaderboard_bbh_hyperbaton | 1|none | 3|acc_norm |↑ |0.5160|± |0.0317|
|  - leaderboard_bbh_logical_deduction_five_objects | 1|none | 3|acc_norm |↑ |0.2000|± |0.0253|
|  - leaderboard_bbh_logical_deduction_seven_objects | 1|none | 3|acc_norm |↑ |0.1480|± |0.0225|
|  - leaderboard_bbh_logical_deduction_three_objects | 1|none | 3|acc_norm |↑ |0.3160|± |0.0295|
|  - leaderboard_bbh_movie_recommendation | 1|none | 3|acc_norm |↑ |0.2360|± |0.0269|
|  - leaderboard_bbh_navigate | 1|none | 3|acc_norm |↑ |0.4680|± |0.0316|
|  - leaderboard_bbh_object_counting | 1|none | 3|acc_norm |↑ |0.0480|± |0.0135|
|  - leaderboard_bbh_penguins_in_a_table | 1|none | 3|acc_norm |↑ |0.1918|± |0.0327|
|  - leaderboard_bbh_reasoning_about_colored_objects | 1|none | 3|acc_norm |↑ |0.1440|± |0.0222|
|  - leaderboard_bbh_ruin_names | 1|none | 3|acc_norm |↑ |0.2360|± |0.0269|
|  - leaderboard_bbh_salient_translation_error_detection | 1|none | 3|acc_norm |↑ |0.1360|± |0.0217|
|  - leaderboard_bbh_snarks | 1|none | 3|acc_norm |↑ |0.5225|± |0.0375|
|  - leaderboard_bbh_sports_understanding | 1|none | 3|acc_norm |↑ |0.4560|± |0.0316|
|  - leaderboard_bbh_temporal_sequences | 1|none | 3|acc_norm |↑ |0.2960|± |0.0289|
|  - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1|none | 3|acc_norm |↑ |0.2120|± |0.0259|
|  - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1|none | 3|acc_norm |↑ |0.1840|± |0.0246|
|  - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1|none | 3|acc_norm |↑ |0.3160|± |0.0295|
|  - leaderboard_bbh_web_of_lies | 1|none | 3|acc_norm |↑ |0.5200|± |0.0317|
| - leaderboard_gpqa | N/A| | | | | | | |
|  - leaderboard_gpqa_diamond | 1|none | 0|acc_norm |↑ |0.2172|± |0.0294|
|  - leaderboard_gpqa_extended | 1|none | 0|acc_norm |↑ |0.2454|± |0.0184|
|  - leaderboard_gpqa_main | 1|none | 0|acc_norm |↑ |0.2478|± |0.0204|
| - leaderboard_ifeval | 3|none | 0|inst_level_loose_acc |↑ |0.1727|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.1559|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.0832|± |0.0119|
| | |none | 0|prompt_level_strict_acc|↑ |0.0795|± |0.0116|
| - leaderboard_math_hard | N/A| | | | | | | |
|  - leaderboard_math_algebra_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
|  - leaderboard_math_counting_and_prob_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
|  - leaderboard_math_geometry_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
|  - leaderboard_math_intermediate_algebra_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
|  - leaderboard_math_num_theory_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
|  - leaderboard_math_prealgebra_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
|  - leaderboard_math_precalculus_hard | 1|none | 4|exact_match |↑ |0.0000|± |0|
| - leaderboard_mmlu_pro | 0.1|none | 5|acc |↑ |0.1135|± |0.0029|
| - leaderboard_musr | N/A| | | | | | | |
|  - leaderboard_musr_murder_mysteries | 1|none | 0|acc_norm |↑ |0.5240|± |0.0316|
|  - leaderboard_musr_object_placements | 1|none | 0|acc_norm |↑ |0.2734|± |0.0279|
|  - leaderboard_musr_team_allocation | 1|none | 0|acc_norm |↑ |0.3000|± |0.0290|

```sh
litgpt evaluate --tasks 'bbh_zeroshot,bbh_fewshot,bbh_cot_fewshot,bbh_cot_zeroshot' --out_dir 'evaluate-bigbenchhard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```sh
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```sh
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```sh
litgpt evaluate --tasks 'mmlu_multilingual,mgsm' --out_dir 'evaluate-multilinguals/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

```sh
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

|Tasks |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k |      3|flexible-extract|     5|exact_match|↑  |0.0099|±  |0.0027|
|      |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|
|mathqa|      1|none            |     0|acc        |↑  |0.2121|±  |0.0075|
|      |       |none            |     0|acc_norm   |↑  |0.2114|±  |0.0075|

```sh
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
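The same checkpoint can also be run through lm-evaluation-harness directly via its Python API, outside of the `litgpt evaluate` wrapper. A minimal sketch, assuming a Hugging Face-format export of the checkpoint at a hypothetical path `out/pretrain/final/hf`:

```python
# Minimal sketch: reproduce part of the evaluation with the lm-evaluation-harness
# Python API. The checkpoint path is a hypothetical example; point it at a
# transformers-format export of the pretrained model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=out/pretrain/final/hf,dtype=bfloat16",
    tasks=["hellaswag", "winogrande", "arc_challenge"],
    batch_size=4,
)

# Per-task metrics (acc, acc_norm, stderr, ...) live under the "results" key.
for task, metrics in results["results"].items():
    print(task, metrics)
```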