OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
Abstract
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given OCR's imperfect predictions and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various forms of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbations to generate a set of structured data with varying degrees of each. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
Community
We introduced OHR-Bench, the first benchmark designed to evaluate the cascading impact of OCR quality on RAG systems. Our contributions include:
- A dataset of unstructured PDFs with ground truth parsing results from six RAG application areas, challenging current OCR solutions.
- Q&As derived from multimodal elements of PDFs, serving as an ideal testbed for assessing OCR's impact on RAG.
- Identification of two primary OCR noise types, Semantic Noise and Formatting Noise, along with perturbation datasets created by manually introducing these noise types based on real-world OCR results.
Using these datasets, we conduct a comprehensive analysis of OCR's impact on RAG systems, paving the way for future research.
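To make the two noise types concrete, here is a minimal, illustrative sketch of how such perturbations could be simulated. This is not the benchmark's actual perturbation code; the confusion table and stripping rules are simplified assumptions for illustration. Semantic Noise is modeled as character-level misrecognition (visually similar character swaps, a common OCR confusion pattern), while Formatting Noise is modeled as loss of structural markup such as headings and emphasis:

```python
import random


def add_semantic_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Simulate Semantic Noise: swap characters for visually
    similar ones, mimicking OCR misrecognition.
    The confusion table here is a small illustrative subset."""
    confusions = {"l": "1", "O": "0", "o": "0", "I": "l", "S": "5", "B": "8"}
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in confusions and rng.random() < rate:
            out.append(confusions[ch])
        else:
            out.append(ch)
    return "".join(out)


def add_formatting_noise(markdown: str) -> str:
    """Simulate Formatting Noise: strip structural markup
    (heading markers, bold emphasis) that OCR often fails
    to recover from a PDF's layout."""
    lines = []
    for line in markdown.splitlines():
        lines.append(line.lstrip("#").strip().replace("**", ""))
    return "\n".join(lines)
```

With `rate` controlling the perturbation intensity, a family of knowledge bases with graded noise levels can be generated from the same clean ground truth, which is the kind of controlled setup needed to measure how retrieval and generation degrade as noise increases.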
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Knowledge Checking in Retrieval-augmented Generation: A Representation Perspective (2024)
- Toward Optimal Search and Retrieval for RAG (2024)
- Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems (2024)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering (2024)
- Deploying Large Language Models With Retrieval Augmented Generation (2024)
- Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report (2024)
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems (2024)