---
license: apache-2.0
language:
  - en
  - fr
  - de
---

OCRerrcr is a small language model specialized for the detection of OCR errors.

OCRerrcr was trained by Eliot Jones for PleIAs on a sample of 1,000 documents with labelled OCR errors, drawn from open data documents (Finance Commons) and cultural heritage sources (Common Corpus).

To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, OCRoscope, which scales significantly better but is also significantly less accurate, especially for documents with fewer mistakes.

The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.

Example

The following is a low-error example sentence taken from Common Corpus:

They did not approach cer, but turned away and passed irom her presence, filled with sorrow and moved with sympathy, which her intense emotions seemed to communicate to even these thoughtless young men of the tho plains.

The OCRerrcr detection for this sentence (with formatting added for clarity):

They did not approach <er>cer,</er> but turned away and passed <er>irom</er> her presence, filled with sorrow and moved with sympathy, which her intense emotions seemed to communicate to even these thoughtless young men of the <er>tho</er> plains.
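
As a rough illustration of how tagged output like the above could be turned into an error rate estimate, here is a minimal Python sketch. It assumes the detection is available as a string with `<er>...</er>` markers, which is the display formatting used in this README and not necessarily the model's raw output. The helper names (`flagged_spans`, `strip_tags`, `word_error_rate`) are hypothetical, and counting flagged words over total whitespace-separated words is only one possible definition of the rate.

```python
import re
from typing import List

# Matches one flagged span, e.g. "<er>irom</er>" (README display format, assumed here).
ER_SPAN = re.compile(r"<er>(.*?)</er>", re.DOTALL)


def flagged_spans(tagged: str) -> List[str]:
    """Return the text of every span marked as an OCR error."""
    return [span.strip() for span in ER_SPAN.findall(tagged)]


def strip_tags(tagged: str) -> str:
    """Remove the <er> markers, keeping the text they wrap."""
    return ER_SPAN.sub(lambda m: m.group(1), tagged)


def word_error_rate(tagged: str) -> float:
    """Share of whitespace-separated words that fall inside flagged spans."""
    total = len(strip_tags(tagged).split())
    if total == 0:
        return 0.0
    flagged = sum(len(span.split()) for span in flagged_spans(tagged))
    return flagged / total


tagged = (
    "They did not approach <er>cer,</er> but turned away and passed "
    "<er>irom</er> her presence, filled with sorrow and moved with sympathy, "
    "which her intense emotions seemed to communicate to even these "
    "thoughtless young men of the <er>tho</er> plains."
)

print(flagged_spans(tagged))
print(f"{word_error_rate(tagged):.3f}")
```

A word-level ratio is easy to compare across documents, but a character-level ratio may be more appropriate when flagged spans include punctuation, as with `cer,` in this example.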