Fine-tuning text recognition models for other languages
Hi, I recently came across doctr, and in my case it gives way better OCR results than tesseract. However, it makes "silly" mistakes because it doesn't know Polish. I would like to fine-tune the recognition model for Polish. I'm wondering how you solved the dataset problem for French: did you generate artificial data? How many samples were needed to fine-tune the model for French?
Thanks in advance!
Tomek
Hi Tomek :) ,
Mindee has created an internal real dataset of ~500k different documents (~10M word crops) to train the models from scratch. The Polish vocabulary is very similar to the French one, so it should be possible to fine-tune the pretrained models with less data (~10k-20k samples).
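Since the Polish character set largely overlaps with the French one, a natural starting point is to take a base vocabulary and append the Polish-specific diacritics before fine-tuning. A minimal sketch of that step, assuming a simplified base alphabet (doctr actually ships ready-made vocabularies in `doctr.datasets.VOCABS`, which you would use instead of the illustrative string below):

```python
# Sketch: extend a base character set with Polish diacritics.
# `base_vocab` below is illustrative only; in practice you would start
# from doctr's built-in vocab, e.g. doctr.datasets.VOCABS["french"].

base_vocab = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
polish_extra = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"

def extend_vocab(base: str, extra: str) -> str:
    """Append characters from `extra` that are missing from `base`,
    preserving order and avoiding duplicates."""
    return base + "".join(ch for ch in extra if ch not in base)

polish_vocab = extend_vocab(base_vocab, polish_extra)
print(polish_vocab)
```

The resulting string can then be passed as the vocabulary when instantiating a recognition model for fine-tuning (the exact plumbing, e.g. a `vocab` argument or a training-script flag, depends on the doctr version you use).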
Best regards,
Felix
Thanks Felix,
we will give it a try :)