TurkuNLP/gpt3-finnish-8B

Generative Pretrained Transformer with 8B parameters for Finnish.

The TurkuNLP Finnish GPT-3 models are a family of pretrained monolingual GPT-style language models based on the BLOOM architecture. Note that they are pure language models: they have not been instruction-finetuned for dialogue or question answering.

These models are intended to be used as foundation models that can be, for example, instruction-finetuned to serve as modern chat models.

All models are trained for 300B tokens.
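
Because these are plain language models, the most direct way to use one is to let it continue a prompt. The following is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub under the name in this card's title and loads through the generic `transformers` auto classes; the helper name `generate_continuation` and its defaults are illustrative, not part of the release.

```python
MODEL_ID = "TurkuNLP/gpt3-finnish-8B"  # repository name taken from this card's title


def generate_continuation(prompt: str, model_id: str = MODEL_ID,
                          max_new_tokens: int = 32) -> str:
    """Return the prompt followed by a model-generated continuation.

    Hypothetical helper for illustration; it is not shipped with the model.
    """
    # Imported lazily so that defining the helper does not require
    # transformers (or a model download) until it is actually called.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize the prompt into PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_continuation("Suomen pääkaupunki on")` downloads the full checkpoint on first use. Since the model is not instruction-tuned, phrase the input as text to be continued rather than as a question or command.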

Parameters

Model	Layers	Dim	Heads	Params
Small	12	768	12	186M
Medium	24	1024	16	437M
Large	24	1536	16	881M
XL	24	2064	24	1.5B
"3B"	32	2560	32	2.8B
"8B"	32	4096	32	7.5B
"13B"	40	5120	40	13.3B
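
As a rough cross-check of the table, the parameter count of a GPT-style model can be approximated from its layer count and hidden dimension. The sketch below assumes standard transformer blocks (about 4·d² attention weights plus 8·d² feed-forward weights per layer), a single shared embedding matrix, and a vocabulary of 131,072 tokens. The vocabulary size is an assumption that this card does not state, and biases and layer norms are ignored, so the result is only an estimate and not the authors' own accounting.

```python
# (layers, hidden dim, heads) copied from the table above.
CONFIGS = {
    "Small": (12, 768, 12),
    "Medium": (24, 1024, 16),
    "Large": (24, 1536, 16),
    "XL": (24, 2064, 24),
    "3B": (32, 2560, 32),
    "8B": (32, 4096, 32),
    "13B": (40, 5120, 40),
}

ASSUMED_VOCAB_SIZE = 131_072  # assumption: not stated in this card


def approx_params(layers: int, dim: int, vocab_size: int = ASSUMED_VOCAB_SIZE) -> int:
    """Estimate parameters as 12 * layers * dim^2 plus one embedding matrix."""
    per_layer = 12 * dim * dim      # ~4*d^2 attention + ~8*d^2 feed-forward
    embeddings = vocab_size * dim   # input embeddings, assumed tied with the output layer
    return layers * per_layer + embeddings


for name, (layers, dim, heads) in CONFIGS.items():
    estimate_m = approx_params(layers, dim) / 1e6
    print(f"{name:>6}: ~{estimate_m:,.0f}M parameters, head dim {dim / heads:g}")
```

Where the estimate and the listed value disagree, treat that as a limit of the approximation or of the assumed vocabulary size rather than as a correction to the table.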

Datasets

We used a combination of multiple Finnish resources.

Finnish Internet Parsebank https://turkunlp.org/finnish_nlp.html
mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/mc4
Common Crawl Finnish https://TODO
Finnish Wikipedia https://fi.wikipedia.org/wiki
Lönnrot Projekti Lönnrot http://www.lonnrot.net/
ePub National library "epub" collection
Lehdet National library "lehdet" collection
Suomi24 The Suomi 24 Corpus 2001-2020 http://urn.fi/urn:nbn:fi:lb-2021101527
Reddit r/Suomi submissions and comments https://www.reddit.com/r/Suomi
STT Finnish News Agency Archive 1992-2018 http://urn.fi/urn:nbn:fi:lb-2019041501
Yle Finnish News Archive 2011-2018 http://urn.fi/urn:nbn:fi:lb-2017070501
Yle Finnish News Archive 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050401
Yle News Archive Easy-to-read Finnish 2011-2018 http://urn.fi/urn:nbn:fi:lb-2019050901
Yle News Archive Easy-to-read Finnish 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050701
ROOTS TODO

Sampling ratios

Dataset	Chars	Ratio	Weight	W.Ratio
Parsebank	35.0B	16.9%	1.5	22.7%
mC4-Fi	46.3B	22.4%	1.0	20.0%
CC-Fi	79.6B	38.5%	1.0	34.4%
Fiwiki	0.8B	0.4%	3.0	1.0%
Lönnrot	0.8B	0.4%	3.0	1.0%
Yle	1.6B	0.8%	2.0	1.4%
STT	2.2B	1.1%	2.0	1.9%
ePub	13.5B	6.5%	1.0	5.8%
Lehdet	5.8B	2.8%	1.0	2.5%
Suomi24	20.6B	9.9%	1.0	8.9%
Reddit-Fi	0.7B	0.4%	1.0	0.3%
TOTAL	207.0B	100.0%	N/A	100.0%
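
The W.Ratio column appears to be each dataset's character count multiplied by its weight, then normalized over all datasets. That reading is inferred from the numbers in the table rather than stated by the authors; the sketch below recomputes the column under that assumption, using the rounded character counts as listed.

```python
# (characters in billions, sampling weight) copied from the table above.
DATASETS = {
    "Parsebank": (35.0, 1.5),
    "mC4-Fi": (46.3, 1.0),
    "CC-Fi": (79.6, 1.0),
    "Fiwiki": (0.8, 3.0),
    "Lönnrot": (0.8, 3.0),
    "Yle": (1.6, 2.0),
    "STT": (2.2, 2.0),
    "ePub": (13.5, 1.0),
    "Lehdet": (5.8, 1.0),
    "Suomi24": (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}


def weighted_ratios(datasets: dict) -> dict:
    """Return each dataset's share of the weighted character total, in percent."""
    weighted = {name: chars * weight for name, (chars, weight) in datasets.items()}
    total = sum(weighted.values())
    return {name: 100.0 * value / total for name, value in weighted.items()}


ratios = weighted_ratios(DATASETS)
for name, ratio in ratios.items():
    print(f"{name:<10} {ratio:5.1f}%")
```

Under this reading, a weight above 1.0 upsamples a source relative to its raw size: Parsebank, for example, moves from 16.9% of the raw characters to 22.7% of the sampled data.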

More documentation and a paper are coming soon.