musadac/VilanOCR-Urdu-English-Chinese · Apply for community grant: Academic project

Digitizing multilingual documents is a crucial step towards preserving and promoting the linguistic and cultural heritage of the world's diverse communities, and it has become increasingly important in language documentation and revitalization efforts. Low-resource languages, however, present unique challenges that can hinder this work. Digitizing handwritten documents is likewise an increasingly important area of research, as organizations strive to use their vast amounts of unstructured data to drive informed decision-making. Yet current state-of-the-art text extraction approaches focus primarily on monolingual documents written in a single script, such as English text in the Latin script, which is a significant limitation for organizations that handle bilingual documents written in different scripts. In this study, we tackle this problem by proposing a novel approach for digitizing handwritten bilingual documents that contain both Urdu and English. Our approach first extracts the text from the documents and then parses the relevant entities so that the information can be stored in a structured format. This structured format enables organizations to analyze the information in the documents effectively and make data-driven decisions.
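
To make the two-stage idea (text extraction, then entity parsing into a structured format) more concrete, here is a minimal Python sketch of the second stage only. It is not the project's actual implementation: the sample `ocr_lines`, the `field: value` line layout, and the helper names `detect_script` and `parse_entities` are all assumptions made for illustration, and the extraction model itself is stubbed out with hard-coded text.

```python
import json
import unicodedata

# Hypothetical output of the text-extraction stage. In a real pipeline these
# lines would come from the OCR model; they are hard-coded here as an assumption.
ocr_lines = [
    "Name: Ali Khan",
    "نام: علی خان",
    "Date: 2023-05-14",
]


def detect_script(text):
    """Label text as 'urdu', 'english', 'mixed', or 'unknown'.

    Uses Unicode character names: Urdu is written in the Arabic script,
    so any character whose name starts with 'ARABIC' counts as Urdu here.
    """
    has_urdu = any(unicodedata.name(c, "").startswith("ARABIC") for c in text)
    has_latin = any(unicodedata.name(c, "").startswith("LATIN") for c in text)
    if has_urdu and has_latin:
        return "mixed"
    if has_urdu:
        return "urdu"
    if has_latin:
        return "english"
    return "unknown"


def parse_entities(lines):
    """Turn 'field: value' lines into structured records (assumed layout)."""
    records = []
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that do not look like a field
        records.append(
            {
                "field": key.strip(),
                "value": value.strip(),
                "script": detect_script(line),
            }
        )
    return records


records = parse_entities(ocr_lines)
# ensure_ascii=False keeps the Urdu text readable in the serialized output.
structured = json.dumps(records, ensure_ascii=False, indent=2)
```

The resulting `structured` string is plain JSON, so it can be stored or queried with ordinary tooling. Real handwritten forms would need a far more robust entity parser than a colon split.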