Icebreaker tokenizer

Developed by: Sigurdur Haukur Birgisson
Model type: GPT2Tokenizer
Language(s) (NLP): Icelandic

This is a BPE tokenizer trained on the Icelandic Gigaword Corpus (IGC), News 1. It can be used for training Icelandic language models.

Model Details

BPE tokenizer trained on the first 242,553 files of the News 1 portion of the unannotated IGC 2022 dataset by Árnastofnun.

It has a vocabulary size of 3,200.
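For readers unfamiliar with BPE, the following is a minimal, illustrative sketch of how merge rules are learned and applied. It is not the procedure used to train this tokenizer: it works on characters rather than bytes (GPT-2 style tokenizers are byte-level), and the function names `learn_bpe`, `merge_pair`, and `encode`, as well as the tiny corpus, are made up for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level)."""
    # Each distinct word becomes a tuple of single-character symbols, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into a single symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword symbols by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only for illustration.
corpus = ["heimur", "heimur", "heima", "heim", "halló"]
merges = learn_bpe(corpus, num_merges=3)
print(merges)
print(encode("heimili", merges))
```

In a real tokenizer the number of merges is determined by the target vocabulary size: training starts from a base alphabet and keeps adding merged symbols until the vocabulary reaches the requested size.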

Use the code below to get started with the tokenizer.

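from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
# Encode a sample sentence; the result contains input_ids and attention_mask
tokens = tokenizer("Halló heimur!")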

Sigurdur Haukur Birgisson: [email protected]