Unifying Vision, Text, and Layout for Universal Document Processing
Abstract
We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.
Community
- Proposes Universal Document Processing (UDOP), a foundation model built around a Vision-Text-Layout (VTL) Transformer. One unified generative pretraining scheme covers document image generation, document QA (understanding), classification, etc. This differs from plain text autoregressive generation because documents also have a layout (forms, tables, etc.), so UDOP converts layout into tokens. The VTL Transformer consists of a unified (multimodal) encoder plus a text-layout decoder and a vision decoder. A document page has text tokens with corresponding bounding boxes (detected through OCR). The page image is patchified and encoded into D-dimensional vision tokens, while text tokens are embedded via vocabulary lookup; each text embedding is fused with the embedding of the vision patch that contains its bounding box, and the remaining text tokens (e.g., the task prompt) and vision patches with no text are appended separately (see the layout-tokenization and embedding-fusion sketch after this list). The layout modality (text bounding-box coordinates) is discretized into layout tokens by normalizing the coordinates and scaling by the layout vocabulary size. Position information is injected as a 2D relative attention bias (as in T5 and TILT). The text-layout decoder is a unidirectional Transformer decoder that generates tokens in a sequence-to-sequence manner; the vision (image) decoder is inspired by the MAE decoder (reconstructing masked patches).
- Pretraining combines several self-supervised objectives: joint text-layout reconstruction (mask words; the model predicts each word together with its bounding-box layout tokens; a sketch of this target format follows the list), layout modeling (predict the layout tokens for given text) and visual text recognition (predict the text within a given bounding box), and masked image reconstruction (mask vision patches, pass the inputs through the unified encoder, and feed the output, concatenated with character embeddings of the text, to the vision decoder to regenerate image pixels). Supervised pretraining tasks (using labeled data) include document classification (RVL-CDIP), layout analysis (predict the bounding boxes that cover a mentioned entity) on PubLayNet, information extraction (given a text query, predict the entity label and bounding boxes) on multiple datasets (DocBank, Kleister Charity (KLC), PWC, DeepForm), question answering (on WebSRC, VisualMRC, DocVQA, etc.), and document natural language inference (NLI), predicting the entailment relationship between sentences, on TabFact. UDOP follows the architecture and tokenizer of T5 and uses the IIT-CDIP collection (11M scanned documents with OCR) for self-supervised training; curriculum learning increases the image resolution over three phases during training, and the self-supervised pretrained models are then fine-tuned on the (dataset-specific) supervised tasks.
- Outperforms T5-based and LayoutLM-family models on QA, information extraction, and NLI. Analysis shows that masked image reconstruction facilitates document editing and content customization: edit the text and layout tokens for the masked patches and regenerate the pixels. Auxiliary training (as in TILT) improves performance further. A two-tower variant, UDOP-Dual, with separate encoders for text-layout and vision is also proposed, but the single unified encoder performs better. The appendix contains more visualizations, UDOP-Dual comparisons, supervised pretraining task definitions (with dataset details), curriculum-learning ablations, performance variance, and limitations. From UNC Chapel Hill and Microsoft.
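A minimal sketch of the layout tokenization and text-vision embedding fusion described in the first bullet. This is an illustration, not the released code: the function names, the `layout_vocab_size=500` default, and the row-major patch grid are assumptions.

```python
import torch


def bbox_to_layout_tokens(bbox, page_width, page_height, layout_vocab_size=500):
    """Discretize an (x0, y0, x1, y1) bounding box into layout token ids by
    normalizing each coordinate to [0, 1] and scaling by the layout vocab size.
    The vocab size of 500 is an illustrative choice, not necessarily the paper's."""
    x0, y0, x1, y1 = bbox
    normalized = (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height)
    return [min(int(v * layout_vocab_size), layout_vocab_size - 1) for v in normalized]


def fuse_text_vision(text_emb, patch_emb, bbox_centers, grid_size):
    """Add to each text-token embedding the embedding of the image patch whose
    grid cell contains the token's bounding-box center.

    text_emb:     (num_tokens, d)
    patch_emb:    (grid_size * grid_size, d), row-major patch grid
    bbox_centers: (num_tokens, 2) normalized (x, y) centers in [0, 1]"""
    col = (bbox_centers[:, 0] * grid_size).long().clamp(max=grid_size - 1)
    row = (bbox_centers[:, 1] * grid_size).long().clamp(max=grid_size - 1)
    patch_index = row * grid_size + col
    return text_emb + patch_emb[patch_index]
```

Vision patches that contain no text, plus prompt-only text tokens, would then be appended to this fused sequence before it enters the unified encoder.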
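One way to picture the joint text-layout reconstruction objective from the second bullet is the sequence construction below. This is a hedged sketch assuming T5-style `<extra_id_*>` sentinels and hypothetical `<loc_*>` layout tokens; the actual vocabulary and span-level masking granularity in UDOP may differ.

```python
def build_joint_text_layout_example(words, layout_tokens, masked_positions):
    """words:            list of OCR words
    layout_tokens:    per-word lists of discretized bbox tokens (4 ints each)
    masked_positions: set of word indices to mask

    Masked words become sentinel tokens in the input; the target interleaves
    each sentinel with the recovered word and its layout tokens."""
    input_tokens, target_tokens = [], []
    sentinel = 0
    for i, word in enumerate(words):
        if i in masked_positions:
            input_tokens.append(f"<extra_id_{sentinel}>")
            target_tokens.append(f"<extra_id_{sentinel}>")
            target_tokens.append(word)
            target_tokens.extend(f"<loc_{t}>" for t in layout_tokens[i])
            sentinel += 1
        else:
            input_tokens.append(word)
    return input_tokens, target_tokens


words = ["Net", "income", "rose", "12%", "in", "Q3"]
layout = [[61, 10, 68, 12], [70, 10, 84, 12], [86, 10, 93, 12],
          [95, 10, 101, 12], [103, 10, 106, 12], [108, 10, 113, 12]]
inp, tgt = build_joint_text_layout_example(words, layout, masked_positions={1, 3})
# inp: ['Net', '<extra_id_0>', 'rose', '<extra_id_1>', 'in', 'Q3']
# tgt: ['<extra_id_0>', 'income', '<loc_70>', '<loc_10>', '<loc_84>', '<loc_12>',
#       '<extra_id_1>', '12%', '<loc_95>', '<loc_10>', '<loc_101>', '<loc_12>']
```

The layout-modeling and visual-text-recognition objectives reuse the same sequence format, but swap which side (text or layout tokens) appears in the input versus the target.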
Links: Blog, arXiv, HuggingFace Space (collection), HF Transformers, GitHub (main codebase, tutorial)