
donut-base-ascii

This is "naver-clova-ix/donut-base" with all non-ASCII tokens removed. That makes the model a good fit for basic English use cases where the text is primarily a-z, A-Z, 0-9, and basic punctuation.
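For reference, loading the model could look like the sketch below. It assumes the repository id `nbroad/donut-base-ascii` and that the repository ships a processor config alongside the weights; `load_donut_ascii` is a hypothetical helper name, not something defined by this model card.

```python
MODEL_ID = "nbroad/donut-base-ascii"  # assumed repository id


def load_donut_ascii(model_id: str = MODEL_ID):
    """Load the processor and model (hypothetical helper, downloads weights)."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    return processor, model


if __name__ == "__main__":
    processor, model = load_donut_ascii()
    # The tokenizer should be noticeably smaller than the original one.
    print(len(processor.tokenizer))
```

Because the model has not been trained beyond the original, it would normally be fine-tuned on a downstream task in the same way as "naver-clova-ix/donut-base".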

The original model, "naver-clova-ix/donut-base", did not have a token for "1", so that has also been added. The notebook remove-donut-tokens.ipynb details the whole process.
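The notebook is the authoritative description of the process. As a rough illustration only, the core filtering idea (deciding which vocabulary entries count as ASCII) might be sketched as follows. The helper names `is_ascii_token` and `split_vocab` are hypothetical, and the sketch assumes a SentencePiece-style vocabulary in which the word-boundary marker U+2581 should not count against a token.

```python
from __future__ import annotations

SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker (itself non-ASCII)


def is_ascii_token(token: str) -> bool:
    """True if the token is ASCII once the word-boundary marker is ignored."""
    return token.replace(SPIECE_UNDERLINE, "").isascii()


def split_vocab(vocab: dict[str, int]) -> tuple[list[str], list[str]]:
    """Split a token -> id mapping into (tokens to keep, tokens to drop)."""
    keep = [tok for tok in vocab if is_ascii_token(tok)]
    drop = [tok for tok in vocab if not is_ascii_token(tok)]
    return keep, drop


if __name__ == "__main__":
    # Requires transformers and network access; assumption, not the notebook's code.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("naver-clova-ix/donut-base")
    keep, drop = split_vocab(tokenizer.get_vocab())
    print(f"keep={len(keep)} drop={len(drop)}")
```

Actually removing the tokens also means shrinking the decoder's embedding and output layers to match the new vocabulary, which this sketch does not attempt.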

This model has not been trained any further than the original model.

I made a whole video about it: https://youtu.be/Uzr553x1gdM

I ran a quick generation speed test comparing this model against the default model, both on its own and with bad_words_ids. The bad_words_ids list covered only 12k tokens rather than the full 30k that were removed, and generation was still noticeably slower.
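To make the comparison concrete, here is a minimal sketch of how a bad_words_ids list could be built from a tokenizer vocabulary. `generate` in transformers accepts `bad_words_ids` as a list of token-id lists; the helper `build_bad_words_ids` and its `limit` parameter are hypothetical, and this is not necessarily how the linked speed script builds its list.

```python
from __future__ import annotations

SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker


def build_bad_words_ids(vocab: dict[str, int], limit: int | None = None) -> list[list[int]]:
    """Return one single-id entry per non-ASCII token, lowest ids first."""
    ids = sorted(
        token_id
        for token, token_id in vocab.items()
        if not token.replace(SPIECE_UNDERLINE, "").isascii()
    )
    if limit is not None:
        ids = ids[:limit]  # e.g. cap the list at roughly 12k entries
    return [[token_id] for token_id in ids]


# Hypothetical usage, assuming `model`, `processor`, and `pixel_values` exist:
#   bad_words_ids = build_bad_words_ids(processor.tokenizer.get_vocab(), limit=12_000)
#   model.generate(pixel_values, max_new_tokens=10, bad_words_ids=bad_words_ids)
```

With this approach the banned tokens stay in the vocabulary and are suppressed at every decoding step, whereas removing them from the model avoids that per-step work entirely.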

Speed script here
Launched with this
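The linked script is the source of the numbers reported here. For readers who want to run a similar measurement, a generic wall-clock timing helper might look like the sketch below. `time_ms` is a hypothetical name, and the warmup and repeat counts are arbitrary choices rather than the settings used in the original test.

```python
import time
from statistics import mean
from typing import Callable


def time_ms(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Average wall-clock time of fn() in milliseconds over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warmup runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


# Hypothetical usage, assuming `model` and `pixel_values` exist:
#   time_ms(lambda: model.generate(pixel_values, max_new_tokens=10))
```

On a GPU, work is queued asynchronously, so the timed callable would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before returning.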

| Approach | Time to generate 10 tokens |
| --- | --- |
| "naver-clova-ix/donut-base" | 205 ms |
| "naver-clova-ix/donut-base" + 12k bad_words_ids | 280 ms |
| "donut-base-ascii" | 195 ms |
