---
datasets:
- lmms-lab/DocVQA
language:
- en
library_name: transformers
license: mit
tags:
- document
pipeline_tag: sentence-similarity
---

# LayoutLM-Byne-v0.1

## The new SOTA in page retrieval from visually-rich documents.

[![Logo](https://armalytix.s3.eu-west-2.amazonaws.com/TRUST+THE+COUNSEL+(1).png "Logo")](https://bynedocs.com "Logo")

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg 'Open in Colab')](https://colab.research.google.com/drive/1YkPtCOrXdDMTv_gm14VoZeofJoNRotzO?usp=sharing)

We're glad to introduce one of the first document-page embedding models, LayoutLM-Byne-v0.1.

With the rise of multimodal LLMs, it is increasingly common to apply a model directly to a document without first pre-processing it into text, as conventional RAG pipelines do. This approach is significantly more robust than text-only RAG on a large subset of documents, especially visually rich ones.

On the other hand, there is a significant lack of research focused on extracting the relevant page from a PDF or DOCX document. Most practitioners parse each page into text and apply regular text embeddings, losing much of the positional context in the process.

LayoutLM [1] is an excellent fit for this problem: at its core it is a regular BERT-like model, yet it is uniquely capable of encoding positional information about the text alongside the text itself.

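
To make the idea concrete, here is a minimal sketch of how LayoutLM pairs each token with a bounding box normalised to a 0-1000 page grid, shown with the public `microsoft/layoutlm-base-uncased` checkpoint. The mean-pooled hidden state at the end is just one possible way to obtain a page embedding, not necessarily the scheme used by LayoutLM-Byne; the exact pre-processing and pooling for this model are covered in the Colab workbook.

```python
# Illustrative only: LayoutLM consumes token IDs together with per-token
# (x0, y0, x1, y1) boxes normalised to a 0-1000 page grid.
import torch
from transformers import LayoutLMTokenizer, LayoutLMModel

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")
model.eval()

# Example words and boxes, e.g. as produced by an OCR engine or a PDF parser.
words = ["Invoice", "Total:", "$1,250.00"]
word_boxes = [[80, 40, 220, 70], [600, 800, 700, 830], [710, 800, 860, 830]]

# Words may split into several sub-tokens; each sub-token keeps its word's box.
tokens, boxes = [], []
for word, box in zip(words, word_boxes):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    boxes.extend([box] * len(word_tokens))

# Add the special tokens and their conventional boxes.
input_ids = tokenizer.convert_tokens_to_ids(
    [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
)
boxes = [[0, 0, 0, 0]] + boxes + [[1000, 1000, 1000, 1000]]

with torch.no_grad():
    outputs = model(
        input_ids=torch.tensor([input_ids]),
        bbox=torch.tensor([boxes]),
    )

# One possible pooling choice (an assumption, not necessarily what this model uses).
page_embedding = outputs.last_hidden_state.mean(dim=1)
print(page_embedding.shape)  # torch.Size([1, 768])
```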

We have fine-tuned the model on the DocVQA [2] dataset; the results below show a potential improvement over the current SOTA:

| Model                            | HR@3       | HR@5       | HR@10      |
|----------------------------------|------------|------------|------------|
| all-mpnet-base-v2                | 0.2500     | 0.2900     | 0.3600     |
| gte-base-en-v1.5                 | 0.3454     | 0.3899     | 0.4554     |
| snowflake-arctic-embed-m-v1.5    | **0.3548** | 0.4042     | 0.4573     |
| LayoutLM-Byne (our model)        | 0.3491     | **0.4269** | **0.5436** |
| Improvement over best competitor | -1.61%     | +5.62%     | +18.87%    |
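
Here, HR@k denotes the hit rate at k: the fraction of queries whose ground-truth page appears among the top-k retrieved pages. A small, hypothetical helper (not part of this repository) that computes it from ranked results:

```python
# Hypothetical helper for the HR@k metric reported above: a query counts as a
# "hit" if its ground-truth page index appears among its top-k ranked pages.
from typing import List

def hit_rate_at_k(ranked_pages: List[List[int]], gold_pages: List[int], k: int) -> float:
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_pages, gold_pages))
    return hits / len(gold_pages)

# Two queries: ranked page indices per query, and the correct page for each.
ranked = [[3, 7, 1, 4], [5, 2, 9, 0]]
gold = [1, 0]
print(hit_rate_at_k(ranked, gold, k=3))  # 0.5: only the first query hits within the top 3
```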

It is important to highlight that the model is still in alpha, so further work is required to reveal its potential.

### Usage

Please refer to the [Colab workbook](https://colab.research.google.com/drive/1YkPtCOrXdDMTv_gm14VoZeofJoNRotzO?usp=sharing) or the blog post to learn more!
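
As a rough, non-authoritative sketch of the retrieval step: once every page (and the query) has been turned into an embedding, for example with a pooling scheme like the one sketched above, pages can be ranked by cosine similarity. The tensors below are random stand-ins; the exact checkpoint, pre-processing, and pooling are defined in the Colab workbook.

```python
# Rough sketch of the retrieval step with random stand-in embeddings; in practice
# the page and query embeddings would come from LayoutLM-Byne (see the Colab workbook).
import torch
import torch.nn.functional as F

page_embeddings = torch.randn(20, 768)  # 20 pages, 768-dim embeddings (stand-ins)
query_embedding = torch.randn(1, 768)   # embedded user query (stand-in)

scores = F.cosine_similarity(query_embedding, page_embeddings, dim=-1)
top_k = torch.topk(scores, k=5)
print("Top-5 candidate pages:", top_k.indices.tolist())
```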

### Get in touch

Reach out to [[email protected]](mailto:[email protected]) if you'd like help with deploying the model in a commercial setting.

[1] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

[2] Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2200-2209). |