---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
language:
- am
- ar
- as
- az
- be
- bg
- bn
- bn_rom
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- he
- hi
- hi_rom
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lg
- li
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- my_zaw
- ne
- nl
- 'no'
- ns
- om
- or
- pa
- pl
- ps
- pt
- qu
- rm
- ro
- ru
- sa
- si
- sc
- sd
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- ta_rom
- te
- te_rom
- th
- tl
- tn
- tr
- ug
- uk
- ur
- ur_rom
- uz
- vi
- wo
- xh
- yi
- yo
- zh-Hans
- zh-Hant
- zu
datasets:
- yahma/alpaca-cleaned
- gbharti/wealth-alpaca_lora
- saillab/taco-datasets
- xu-song/cc100-samples
- ontocord/fineweb-permissive-multilingual-2m
- MuskumPillerum/General-Knowledge
- yirenc/general_knowledge_boolean
- nampdn-ai/tiny-textbooks
- nampdn-ai/tiny-codes
- bigcode/the-stack-smol-xs
- m-a-p/CodeFeedback-Filtered-Instruction
- jtatman/python-code-dataset-500k
- iamtarun/python_code_instructions_18k_alpaca
- HuggingFaceH4/CodeAlpaca_20K
- gair-prox/open-web-math-pro
- rvv-karma/Math-QA
- ajibawa-2023/Maths-College
- microsoft/orca-math-word-problems-200k
- fblgit/simple-math
- SkunkworksAI/reasoning-0.01
- badrex/llm-emoji-dataset
tags:
- litgpt
- litdata
---

# tangled-llama-154m-32k-base-v0.1
A pretrained language model based on the Llama architecture with about 154M parameters. It was trained on 11.4B (11,422,750,857) tokens drawn from more than 0.8M (796,399) dataset rows.

This model is not intended for immediate use; it is a base checkpoint meant for continued pretraining and finetuning on downstream tasks. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained with sequences of 2K (2048) tokens.
The objective is to retain a streamlined cognitive and reasoning core while eliminating redundant memorized knowledge from the model.
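As a quick sanity check before any continued pretraining or finetuning, the checkpoint can be loaded with the `transformers` library. This is a minimal sketch; `REPO_ID` below is a placeholder for this model's Hugging Face Hub id, not a value taken from this card.

```python
# Minimal sketch: load the base checkpoint with transformers and sample a continuation.
# REPO_ID is a placeholder; replace it with the full "<org>/<name>" Hub id of this model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "tangled-llama-154m-32k-base-v0.1"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, torch_dtype=torch.bfloat16)

# This is a base (non-instruct) model, so prompt it with plain text continuation.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For finetuning or continued pretraining, the same checkpoint directory can also be used with the litgpt CLI shown in the evaluation commands below.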
Logged pretraining curves (plots omitted here): loss / val_loss, val_ppl, epoch, learning_rate.
## lm-evaluation-harness
```bash
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.1877 | ± | 0.0114 |
|  |  | none | 0 | acc_norm | ↑ | 0.2389 | ± | 0.0125 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0136 | ± | 0.0032 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0008 | ± | 0.0008 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2660 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2697 | ± | 0.0044 |
| mmlu | 2 | none |  | acc | ↑ | 0.2377 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2372 | ± | 0.0062 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.2619 | ± | 0.0393 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2182 | ± | 0.0323 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0304 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2447 | ± | 0.0280 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2066 | ± | 0.0370 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2778 | ± | 0.0433 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2025 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2514 | ± | 0.0234 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2425 | ± | 0.0143 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1768 | ± | 0.0217 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2562 | ± | 0.0243 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2379 | ± | 0.0109 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.2515 | ± | 0.0333 |
| - other | 2 | none |  | acc | ↑ | 0.2462 | ± | 0.0077 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2377 | ± | 0.0262 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2370 | ± | 0.0324 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0435 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2108 | ± | 0.0274 |
| - management | 1 | none | 0 | acc | ↑ | 0.1942 | ± | 0.0392 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2436 | ± | 0.0281 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2554 | ± | 0.0156 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2778 | ± | 0.0256 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2411 | ± | 0.0255 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.2279 | ± | 0.0255 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2530 | ± | 0.0338 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2281 | ± | 0.0395 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.1465 | ± | 0.0252 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2539 | ± | 0.0314 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2333 | ± | 0.0214 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2269 | ± | 0.0272 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2330 | ± | 0.0181 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2859 | ± | 0.0183 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2364 | ± | 0.0407 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.3388 | ± | 0.0303 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2637 | ± | 0.0312 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - stem | 2 | none |  | acc | ↑ | 0.2157 | ± | 0.0073 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2222 | ± | 0.0359 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.1842 | ± | 0.0315 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2639 | ± | 0.0369 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2059 | ± | 0.0402 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2553 | ± | 0.0285 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2414 | ± | 0.0357 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2328 | ± | 0.0218 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.1839 | ± | 0.0220 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.1626 | ± | 0.0260 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0423 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2037 | ± | 0.0246 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.1921 | ± | 0.0322 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.1852 | ± | 0.0265 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.2857 | ± | 0.0429 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.4650 | ± | 0.0161 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4957 | ± | 0.0141 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2377 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2372 | ± | 0.0062 |
| - other | 2 | none |  | acc | ↑ | 0.2462 | ± | 0.0077 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - stem | 2 | none |  | acc | ↑ | 0.2157 | ± | 0.0073 |
```bash
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| leaderboard | N/A |  |  |  |  |  |  |  |
| - leaderboard_bbh | N/A |  |  |  |  |  |  |  |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.4720 | ± | 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5187 | ± | 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.2000 | ± | 0.0253 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.3560 | ± | 0.0303 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.4640 | ± | 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.0800 | ± | 0.0172 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.5160 | ± | 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1760 | ± | 0.0241 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1600 | ± | 0.0232 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3320 | ± | 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.2640 | ± | 0.0279 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.5840 | ± | 0.0312 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.0840 | ± | 0.0176 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.1986 | ± | 0.0331 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.1720 | ± | 0.0239 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.2360 | ± | 0.0269 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.1560 | ± | 0.0230 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.4663 | ± | 0.0375 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4520 | ± | 0.0315 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2160 | ± | 0.0261 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2200 | ± | 0.0263 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1480 | ± | 0.0225 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3200 | ± | 0.0296 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
| - leaderboard_gpqa | N/A |  |  |  |  |  |  |  |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2020 | ± | 0.0286 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.2656 | ± | 0.0189 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.2656 | ± | 0.0209 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.2290 | ± | N/A |
|  |  | none | 0 | inst_level_strict_acc | ↑ | 0.1990 | ± | N/A |
|  |  | none | 0 | prompt_level_loose_acc | ↑ | 0.1128 | ± | 0.0136 |
|  |  | none | 0 | prompt_level_strict_acc | ↑ | 0.1017 | ± | 0.0130 |
| - leaderboard_math_hard | N/A |  |  |  |  |  |  |  |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | ↑ | 0.0000 | ± | 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.1104 | ± | 0.0029 |
| - leaderboard_musr | N/A |  |  |  |  |  |  |  |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.4920 | ± | 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.2891 | ± | 0.0284 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3440 | ± | 0.0301 |
```bash
litgpt evaluate --tasks 'bbh_zeroshot,bbh_fewshot,bbh_cot_fewshot,bbh_cot_zeroshot' --out_dir 'evaluate-bigbenchhard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
None
```bash
litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2377 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2372 | ± | 0.0062 |
| - formal_logic | 1 | none | 0 | acc | ↑ | 0.2619 | ± | 0.0393 |
| - high_school_european_history | 1 | none | 0 | acc | ↑ | 0.2182 | ± | 0.0323 |
| - high_school_us_history | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0304 |
| - high_school_world_history | 1 | none | 0 | acc | ↑ | 0.2447 | ± | 0.0280 |
| - international_law | 1 | none | 0 | acc | ↑ | 0.2066 | ± | 0.0370 |
| - jurisprudence | 1 | none | 0 | acc | ↑ | 0.2778 | ± | 0.0433 |
| - logical_fallacies | 1 | none | 0 | acc | ↑ | 0.2025 | ± | 0.0316 |
| - moral_disputes | 1 | none | 0 | acc | ↑ | 0.2514 | ± | 0.0234 |
| - moral_scenarios | 1 | none | 0 | acc | ↑ | 0.2425 | ± | 0.0143 |
| - philosophy | 1 | none | 0 | acc | ↑ | 0.1768 | ± | 0.0217 |
| - prehistory | 1 | none | 0 | acc | ↑ | 0.2562 | ± | 0.0243 |
| - professional_law | 1 | none | 0 | acc | ↑ | 0.2379 | ± | 0.0109 |
| - world_religions | 1 | none | 0 | acc | ↑ | 0.2515 | ± | 0.0333 |
| - other | 2 | none |  | acc | ↑ | 0.2462 | ± | 0.0077 |
| - business_ethics | 1 | none | 0 | acc | ↑ | 0.2800 | ± | 0.0451 |
| - clinical_knowledge | 1 | none | 0 | acc | ↑ | 0.2377 | ± | 0.0262 |
| - college_medicine | 1 | none | 0 | acc | ↑ | 0.2370 | ± | 0.0324 |
| - global_facts | 1 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0435 |
| - human_aging | 1 | none | 0 | acc | ↑ | 0.2108 | ± | 0.0274 |
| - management | 1 | none | 0 | acc | ↑ | 0.1942 | ± | 0.0392 |
| - marketing | 1 | none | 0 | acc | ↑ | 0.2436 | ± | 0.0281 |
| - medical_genetics | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - miscellaneous | 1 | none | 0 | acc | ↑ | 0.2554 | ± | 0.0156 |
| - nutrition | 1 | none | 0 | acc | ↑ | 0.2778 | ± | 0.0256 |
| - professional_accounting | 1 | none | 0 | acc | ↑ | 0.2411 | ± | 0.0255 |
| - professional_medicine | 1 | none | 0 | acc | ↑ | 0.2279 | ± | 0.0255 |
| - virology | 1 | none | 0 | acc | ↑ | 0.2530 | ± | 0.0338 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - econometrics | 1 | none | 0 | acc | ↑ | 0.2281 | ± | 0.0395 |
| - high_school_geography | 1 | none | 0 | acc | ↑ | 0.1465 | ± | 0.0252 |
| - high_school_government_and_politics | 1 | none | 0 | acc | ↑ | 0.2539 | ± | 0.0314 |
| - high_school_macroeconomics | 1 | none | 0 | acc | ↑ | 0.2333 | ± | 0.0214 |
| - high_school_microeconomics | 1 | none | 0 | acc | ↑ | 0.2269 | ± | 0.0272 |
| - high_school_psychology | 1 | none | 0 | acc | ↑ | 0.2330 | ± | 0.0181 |
| - human_sexuality | 1 | none | 0 | acc | ↑ | 0.2824 | ± | 0.0395 |
| - professional_psychology | 1 | none | 0 | acc | ↑ | 0.2859 | ± | 0.0183 |
| - public_relations | 1 | none | 0 | acc | ↑ | 0.2364 | ± | 0.0407 |
| - security_studies | 1 | none | 0 | acc | ↑ | 0.3388 | ± | 0.0303 |
| - sociology | 1 | none | 0 | acc | ↑ | 0.2637 | ± | 0.0312 |
| - us_foreign_policy | 1 | none | 0 | acc | ↑ | 0.2700 | ± | 0.0446 |
| - stem | 2 | none |  | acc | ↑ | 0.2157 | ± | 0.0073 |
| - abstract_algebra | 1 | none | 0 | acc | ↑ | 0.2000 | ± | 0.0402 |
| - anatomy | 1 | none | 0 | acc | ↑ | 0.2222 | ± | 0.0359 |
| - astronomy | 1 | none | 0 | acc | ↑ | 0.1842 | ± | 0.0315 |
| - college_biology | 1 | none | 0 | acc | ↑ | 0.2639 | ± | 0.0369 |
| - college_chemistry | 1 | none | 0 | acc | ↑ | 0.2100 | ± | 0.0409 |
| - college_computer_science | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - college_mathematics | 1 | none | 0 | acc | ↑ | 0.2200 | ± | 0.0416 |
| - college_physics | 1 | none | 0 | acc | ↑ | 0.2059 | ± | 0.0402 |
| - computer_security | 1 | none | 0 | acc | ↑ | 0.2400 | ± | 0.0429 |
| - conceptual_physics | 1 | none | 0 | acc | ↑ | 0.2553 | ± | 0.0285 |
| - electrical_engineering | 1 | none | 0 | acc | ↑ | 0.2414 | ± | 0.0357 |
| - elementary_mathematics | 1 | none | 0 | acc | ↑ | 0.2328 | ± | 0.0218 |
| - high_school_biology | 1 | none | 0 | acc | ↑ | 0.1839 | ± | 0.0220 |
| - high_school_chemistry | 1 | none | 0 | acc | ↑ | 0.1626 | ± | 0.0260 |
| - high_school_computer_science | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0423 |
| - high_school_mathematics | 1 | none | 0 | acc | ↑ | 0.2037 | ± | 0.0246 |
| - high_school_physics | 1 | none | 0 | acc | ↑ | 0.1921 | ± | 0.0322 |
| - high_school_statistics | 1 | none | 0 | acc | ↑ | 0.1852 | ± | 0.0265 |
| - machine_learning | 1 | none | 0 | acc | ↑ | 0.2857 | ± | 0.0429 |
| mmlu_pro | 2 | custom-extract |  | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - biology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - business | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - chemistry | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - computer_science | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - economics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - engineering | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - health | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - history | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - law | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - math | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - other | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - philosophy | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - physics | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| - psychology | 1 | custom-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |

| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none |  | acc | ↑ | 0.2377 | ± | 0.0036 |
| - humanities | 2 | none |  | acc | ↑ | 0.2372 | ± | 0.0062 |
| - other | 2 | none |  | acc | ↑ | 0.2462 | ± | 0.0077 |
| - social sciences | 2 | none |  | acc | ↑ | 0.2525 | ± | 0.0078 |
| - stem | 2 | none |  | acc | ↑ | 0.2157 | ± | 0.0073 |
| mmlu_pro | 2 | custom-extract |  | exact_match | ↑ | 0.0000 | ± | 0.0000 |
```bash
litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.1903 | ± | 0.0115 |
|  |  | none | 0 | acc_norm | ↑ | 0.2406 | ± | 0.0125 |
| boolq | 2 | none | 0 | acc | ↑ | 0.5838 | ± | 0.0086 |
| gpqa_diamond_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1212 | ± | 0.0233 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.1465 | ± | 0.0252 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.2273 | ± | 0.0299 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_diamond_n_shot | 2 | none | 0 | acc | ↑ | 0.2475 | ± | 0.0307 |
|  |  | none | 0 | acc_norm | ↑ | 0.2475 | ± | 0.0307 |
| gpqa_diamond_zeroshot | 1 | none | 0 | acc | ↑ | 0.2273 | ± | 0.0299 |
|  |  | none | 0 | acc_norm | ↑ | 0.2273 | ± | 0.0299 |
| gpqa_extended_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1392 | ± | 0.0148 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.1502 | ± | 0.0153 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.2289 | ± | 0.0180 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_extended_n_shot | 2 | none | 0 | acc | ↑ | 0.2344 | ± | 0.0181 |
|  |  | none | 0 | acc_norm | ↑ | 0.2344 | ± | 0.0181 |
| gpqa_extended_zeroshot | 1 | none | 0 | acc | ↑ | 0.2582 | ± | 0.0187 |
|  |  | none | 0 | acc_norm | ↑ | 0.2582 | ± | 0.0187 |
| gpqa_main_cot_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.1429 | ± | 0.0166 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_cot_zeroshot | 1 | flexible-extract | 0 | exact_match | ↑ | 0.1629 | ± | 0.0175 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_generative_n_shot | 2 | flexible-extract | 0 | exact_match | ↑ | 0.2366 | ± | 0.0201 |
|  |  | strict-match | 0 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| gpqa_main_n_shot | 2 | none | 0 | acc | ↑ | 0.2500 | ± | 0.0205 |
|  |  | none | 0 | acc_norm | ↑ | 0.2500 | ± | 0.0205 |
| gpqa_main_zeroshot | 1 | none | 0 | acc | ↑ | 0.2746 | ± | 0.0211 |
|  |  | none | 0 | acc_norm | ↑ | 0.2746 | ± | 0.0211 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2658 | ± | 0.0044 |
|  |  | none | 0 | acc_norm | ↑ | 0.2690 | ± | 0.0044 |
| openbookqa | 1 | none | 0 | acc | ↑ | 0.1380 | ± | 0.0154 |
|  |  | none | 0 | acc_norm | ↑ | 0.2740 | ± | 0.0200 |
| piqa | 1 | none | 0 | acc | ↑ | 0.5555 | ± | 0.0116 |
|  |  | none | 0 | acc_norm | ↑ | 0.5571 | ± | 0.0116 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.4650 | ± | 0.0160 |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4949 | ± | 0.0141 |
```bash
litgpt evaluate --tasks 'mmlu_multilingual,mgsm' --out_dir 'evaluate-multilinguals/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
None
```bash
litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0136 | ± | 0.0032 |
|  |  | strict-match | 5 | exact_match | ↑ | 0.0008 | ± | 0.0008 |
| mathqa | 1 | none | 0 | acc | ↑ | 0.2191 | ± | 0.0076 |
|  |  | none | 0 | acc_norm | ↑ | 0.2181 | ± | 0.0076 |
```bash
litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| qasper_bool | 1 | none | 0 | f1 | ↑ | 0.8215 | ± | 0.0222 |
| qasper_freeform | 2 | none | 0 | f1_abstractive | ↑ | 0.0390 | ± | 0.0045 |
| wikitext | 2 | none | 0 | bits_per_byte | ↓ | 2.6525 | ± | N/A |
|  |  | none | 0 | byte_perplexity | ↓ | 6.2874 | ± | N/A |
|  |  | none | 0 | word_perplexity | ↓ | 18611.9448 | ± | N/A |
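For reference, the two byte-level wikitext metrics above are two views of the same quantity: byte_perplexity = 2^bits_per_byte. A quick check (not part of the evaluation run) confirms the reported values are mutually consistent:

```python
# Sanity check: byte_perplexity should equal 2 ** bits_per_byte.
bits_per_byte = 2.6525
byte_perplexity = 2 ** bits_per_byte
print(round(byte_perplexity, 4))  # ≈ 6.29, matching the reported 6.2874 up to rounding
```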