---
datasets:
- lmms-lab/DocVQA
language:
- en
library_name: transformers
license: mit
tags:
- document
pipeline_tag: sentence-similarity
---
# LayoutLM-Byne-v0.1
## The new SOTA in page retrieval from visually-rich documents.

[![Logo](https://armalytix.s3.eu-west-2.amazonaws.com/TRUST+THE+COUNSEL+(1).png "Logo")](https://bynedocs.com "Logo")

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg 'Open in Colab')](https://colab.research.google.com/drive/1YkPtCOrXdDMTv_gm14VoZeofJoNRotzO?usp=sharing)


We're glad to introduce one of the first document page embedding models, LayoutLM-Byne-v0.1.

With the rise of multimodal LLMs, practitioners increasingly apply models directly to a document instead of pre-processing it into text first, as conventional RAG pipelines do. This approach is significantly more robust than text-only RAG on a large subset of documents, especially visually rich ones.

On the other hand, little research has focused on retrieving the relevant page from a PDF or DOCX document. Most practitioners parse each page into text and apply regular text embeddings, losing much of the positional context in the process.

LayoutLM [1] is well suited to this problem: at its core it is a regular BERT-like model, but it can embed positional information about the text (the bounding box of each token) alongside the text itself.
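To make the positional input concrete: LayoutLM expects each token's bounding box as `(x0, y0, x1, y1)` integers on a 0-1000 grid, independent of the page's pixel size. The snippet below is a minimal sketch of that normalization step only. The OCR words, pixel boxes, and page dimensions are made-up illustration values, and `normalize_box` is a hypothetical helper name, not part of this model's released code.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page pixels, top-left origin.
Box = Tuple[float, float, float, float]


def normalize_box(box: Box, page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space box to the 0-1000 integer grid LayoutLM expects."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so slightly out-of-page OCR boxes still land in the valid range.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical OCR output for one page: (word, pixel box).
words = [("Invoice", (50, 40, 210, 80)), ("Total:", (50, 700, 130, 730))]
page_width, page_height = 850, 1100  # assumed page size in pixels

boxes = [normalize_box(box, page_width, page_height) for _, box in words]
print(boxes)
```

The normalized boxes are passed to the model alongside the token ids, with one box per token; words split into several subword tokens typically reuse the word's box.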

We fine-tuned the model on the DocVQA [2] dataset. The results show a potential improvement over the current SOTA:

| Model                           | HR@3           | HR@5           | HR@10          |
|---------------------------------|----------------|----------------|----------------|
| all-mpnet-base-v2               | 0.2500         | 0.2900         | 0.3600         |
| gte-base-en-v1.5                | 0.3454         | 0.3899         | 0.4554         |
| snowflake-arctic-embed-m-v1.5   | **0.3548**     | 0.4042         | 0.4573         |
| LayoutLM-Byne (our model)       | 0.3491         | **0.4269**     | **0.5436**     |
| Improvement over best competitor| -1.61%         | +5.62%         | +18.87%        |
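For readers who want to reproduce the last row, here is a small sketch of how such numbers can be computed. It assumes HR@k denotes hit rate at k (the share of queries whose relevant page appears among the top k retrieved pages) and that the improvement is relative to the best competing score in each column. The toy rankings are invented for illustration; only the scores passed to `relative_improvement` come from the table.

```python
from typing import Dict, List


def hit_rate_at_k(ranked: Dict[str, List[int]], relevant: Dict[str, int], k: int) -> float:
    """Fraction of queries whose relevant page is among the top-k ranked pages."""
    hits = sum(1 for query, pages in ranked.items() if relevant[query] in pages[:k])
    return hits / len(ranked)


def relative_improvement(ours: float, best_other: float) -> float:
    """Relative change in percent of our score over the best competing score."""
    return 100 * (ours - best_other) / best_other


# Toy data (hypothetical): ranked page ids per query and the one relevant page.
ranked = {"q1": [3, 1, 7, 2], "q2": [5, 4, 9, 8], "q3": [2, 6, 1, 0]}
relevant = {"q1": 7, "q2": 8, "q3": 4}

hr3 = hit_rate_at_k(ranked, relevant, 3)  # only q1 hits within the top 3
hr4 = hit_rate_at_k(ranked, relevant, 4)  # q1 and q2 hit within the top 4

# Scores taken from the table: LayoutLM-Byne vs. the best competitor per column.
gain_hr3 = relative_improvement(0.3491, 0.3548)
gain_hr5 = relative_improvement(0.4269, 0.4042)
gain_hr10 = relative_improvement(0.5436, 0.4573)
print(round(gain_hr3, 2), round(gain_hr5, 2), round(gain_hr10, 2))
```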

Note that the model is still in alpha, so further work is needed to realize its full potential.

### Usage
Please refer to the [Colab workbook](https://colab.research.google.com/drive/1YkPtCOrXdDMTv_gm14VoZeofJoNRotzO?usp=sharing) or the [blog post](https://blog.bynedocs.com/layoutlm-byne-v0.1-the-new-sota-in-page-retrieval-from-visually-rich-documents) to learn more!
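Once every page has an embedding, retrieval reduces to ranking pages by similarity to the query embedding. The sketch below illustrates only that ranking step with plain Python. It assumes cosine similarity as the scoring function and uses made-up three-dimensional vectors in place of real model outputs; `rank_pages` is a hypothetical helper, and the Colab notebook remains the reference for how embeddings are actually produced.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pages(query_vec: Sequence[float], page_vecs: List[Sequence[float]], top_k: int = 3) -> List[int]:
    """Return the indices of the top_k pages, most similar first."""
    scores = [cosine(query_vec, page) for page in page_vecs]
    order = sorted(range(len(page_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Placeholder embeddings (hypothetical): one vector per page, plus one query vector.
page_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
query_vec = [0.2, 0.9, 0.1]

top = rank_pages(query_vec, page_vecs, top_k=2)
print(top)
```

For large collections, the same ranking is usually delegated to a vector index rather than computed with a linear scan.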

### Get in touch
Reach out to [[email protected]](mailto:[email protected]) if you'd like help with deploying the model in a commercial setting.

[1] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

[2] Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2200-2209).