---
language:
- is
library_name: transformers
---

# Icelandic Tokenizer

## Overview

This BPE (Byte-Pair Encoding) tokenizer was built for the Icelandic GPT model available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and segments Icelandic text into subword tokens.

## Usage

Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:

```python
from transformers import GPT2Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")

# GPT-2-style tokenizers have no pad token by default; reuse EOS for padding
tokenizer.pad_token_id = tokenizer.eos_token_id

# Tokenize a sample sentence into token IDs
input_ids = tokenizer("Halló heimur!")["input_ids"]
print(input_ids)
```

## Citation

If you use this tokenizer in your work, please cite the original source of the training data:

```bibtex
@misc{20.500.12537/254,
  title  = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
  url    = {http://hdl.handle.net/20.500.12537/254},
  note   = {{CLARIN}-{IS}},
  year   = {2022}
}
```

## Feedback

We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.

Happy tokenizing!

Sigurdur Haukur Birgisson

(readme created with chatgpt)
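For readers unfamiliar with BPE, the core of the training procedure can be sketched in a few lines of pure Python: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. This is only an illustrative toy (the word list, frequencies, and merge count below are invented); the real tokenizer's vocabulary and merge rules were learned from the Icelandic Gigaword Corpus and are loaded via `transformers`.

```python
# Toy sketch of the BPE merge loop -- not the actual ice-tokenizer training code.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented toy corpus: word -> frequency, each word split into characters.
words = {tuple("lágt"): 5, tuple("lág"): 3, tuple("mál"): 2}
for _ in range(3):  # three merge steps are enough for this toy example
    words = merge_pair(words, most_frequent_pair(words))

print(words)  # frequent words collapse into single tokens, rare ones stay split
```

After three merges the frequent word "lágt" becomes a single symbol while the rarer "mál" remains character-split, which is exactly the frequency-driven subword behavior the tokenizer applies to Icelandic text at scale.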