	---
	datasets:
	- lmms-lab/DocVQA
	language:
	- en
	library_name: transformers
	license: mit
	tags:
	- document
	pipeline_tag: sentence-similarity
	---

	# LayoutLM-Byne-v0.1
	## The new SOTA in page retrieval from visually-rich documents.

	[![Logo](https://armalytix.s3.eu-west-2.amazonaws.com/TRUST+THE+COUNSEL+(1).png "Logo")](https://bynedocs.com "Logo")

	We're glad to introduce one of the first document page embedding models, LayoutLM-Byne-v0.1.

	With the rise of multimodal LLMs, practitioners increasingly apply models directly to a document instead of pre-processing it into text first, as earlier RAG pipelines did. This approach is significantly more robust than text-only RAG on a large subset of documents, especially visually rich ones.

	On the other hand, there is little research on retrieving the relevant page from a PDF or DOCX document. Most practitioners parse each page into text and embed that text with a regular text embedding model, losing much of the positional context in the process.

	LayoutLM [1] is a good fit for this problem because, at its core, it is a regular BERT-like model that can also embed positional information about the text alongside the text itself.
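
To make "positional information alongside the text" concrete: the Hugging Face `transformers` implementation of LayoutLM takes one bounding box per token, in `(x0, y0, x1, y1)` order, scaled to a 0-1000 range relative to the page size. The sketch below only shows that normalization step. The words, pixel coordinates, and page size are made up for illustration, and this is not necessarily the exact pre-processing used to train LayoutLM-Byne.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range LayoutLM expects."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output for an 800 x 1000 pixel page: (word, pixel bounding box).
page_width, page_height = 800, 1000
ocr_words = [
    ("Invoice", (80, 100, 400, 150)),
    ("Total", (600, 900, 760, 950)),
]

words = [word for word, _ in ocr_words]
boxes = [normalize_bbox(box, page_width, page_height) for _, box in ocr_words]

for word, box in zip(words, boxes):
    print(word, box)
```

Each word keeps its location on the page, so two pages with the same text but a different layout produce different model inputs.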

	We fine-tuned the model on the DocVQA [2] dataset. The results show a potential improvement over the current SOTA: the model is slightly behind the best competitor on HR@3 and ahead of it on HR@5 and HR@10.

	| Model                            | HR@3   | HR@5   | HR@10   |
	|----------------------------------|--------|--------|---------|
	| all-mpnet-base-v2                | 0.2500 | 0.2900 | 0.3600  |
	| gte-base-en-v1.5                 | 0.3454 | 0.3899 | 0.4554  |
	| snowflake-arctic-embed-m-v1.5    | 0.3548 | 0.4042 | 0.4573  |
	| LayoutLM-Byne (our model)        | 0.3491 | 0.4269 | 0.5436  |
	| Improvement over best competitor | -1.61% | +5.62% | +18.87% |

	It is important to highlight that the model is still in alpha, so further work is required to reveal its full potential.
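
The "Improvement over best competitor" row in the table above can be checked with a few lines of arithmetic. The sketch below assumes that HR@k denotes hit rate at k (the fraction of queries whose correct page appears among the top k retrieved pages) and that the improvement is relative to the best baseline in each column. The toy rankings and page ids are hypothetical; only the scores are copied from the table.

```python
def hit_rate_at_k(rankings, gold_pages, k):
    """Fraction of queries whose gold page appears among the top-k ranked pages."""
    hits = sum(1 for ranked, gold in zip(rankings, gold_pages) if gold in ranked[:k])
    return hits / len(gold_pages)


def relative_improvement(ours, baseline):
    """Relative change of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Toy example with hypothetical page ids: one ranked list per query, best match first.
rankings = [
    ["p3", "p1", "p7"],
    ["p2", "p5", "p9"],
    ["p4", "p8", "p6"],
    ["p1", "p2", "p3"],
]
gold_pages = ["p1", "p9", "p5", "p1"]
print(hit_rate_at_k(rankings, gold_pages, 1))  # 0.25
print(hit_rate_at_k(rankings, gold_pages, 3))  # 0.75

# Scores copied from the results table.
baselines = {
    "all-mpnet-base-v2": {"HR@3": 0.2500, "HR@5": 0.2900, "HR@10": 0.3600},
    "gte-base-en-v1.5": {"HR@3": 0.3454, "HR@5": 0.3899, "HR@10": 0.4554},
    "snowflake-arctic-embed-m-v1.5": {"HR@3": 0.3548, "HR@5": 0.4042, "HR@10": 0.4573},
}
ours = {"HR@3": 0.3491, "HR@5": 0.4269, "HR@10": 0.5436}

improvements = {}
for metric, our_score in ours.items():
    best = max(scores[metric] for scores in baselines.values())
    improvements[metric] = round(relative_improvement(our_score, best), 2)

print(improvements)  # {'HR@3': -1.61, 'HR@5': 5.62, 'HR@10': 18.87}
```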

	### Usage
	Please refer to the [Colab notebook](https://colab.research.google.com/drive/1YkPtCOrXdDMTv_gm14VoZeofJoNRotzO?usp=sharing) or the blog post to learn more!
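
Once a query embedding and one embedding per page are available, page retrieval reduces to a nearest-neighbour lookup. The sketch below is a generic illustration, not the notebook's code: it assumes the embeddings were already computed elsewhere, uses cosine similarity as the scoring function (an assumption), and uses made-up three-dimensional vectors and page ids.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_pages(query_embedding, page_embeddings, k=3):
    """Return the ids of the k pages most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, embedding), page_id)
        for page_id, embedding in page_embeddings.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Hypothetical, tiny embeddings; real ones would come from the embedding model.
page_embeddings = {
    "page_1": [1.0, 0.0, 0.0],
    "page_2": [0.0, 1.0, 0.0],
    "page_3": [0.7, 0.7, 0.0],
}
query_embedding = [0.9, 0.1, 0.0]

print(top_k_pages(query_embedding, page_embeddings, k=2))  # ['page_1', 'page_3']
```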

	### Get in touch
	Reach out to [[email protected]](mailto:[email protected]) if you'd like help deploying the model in a commercial setting.

	[1] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

	[2] Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2200-2209).