
	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: image-to-text
	---

	# Nougat-LaTeX-based

	- Model type: [Donut](https://huggingface.co/docs/transformers/model_doc/donut)
	- Finetuned from: [facebook/nougat-base](https://huggingface.co/facebook/nougat-base)
	- Repository: [source code](https://github.com/NormXU/nougat-latex-ocr)

	Nougat-LaTeX-based is fine-tuned from [facebook/nougat-base](https://huggingface.co/facebook/nougat-base) on [im2latex-100k](https://zenodo.org/record/56198#.V2px0jXT6eA) to improve its ability to generate LaTeX code from images.
	The original encoder input size of Nougat is unsuitable for equation image segments: rescaling them to that size can introduce artifacts that degrade the quality of the generated LaTeX code. To address this, Nougat-LaTeX-based adjusts the input resolution and uses an adaptive padding approach so that equation image segments in the wild are resized to closely match the resolution of the training data.
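
One way to picture the resize-then-pad idea described above is the sketch below. It is not the repository's implementation: the function name `fit_and_pad`, the "shrink only, never enlarge" rule, and the symmetric padding are all assumptions made for illustration, and the target size is a parameter rather than the model's actual input resolution.

```python
from typing import Tuple


def fit_and_pad(width: int, height: int,
                target_w: int, target_h: int) -> Tuple[int, int, int, int, int, int]:
    """Return (new_w, new_h, left, top, right, bottom) for a width x height image.

    Hypothetical helper: the image is shrunk only if it exceeds the target
    canvas, the aspect ratio is preserved, and the leftover space is split
    into (near-)symmetric padding.
    """
    if min(width, height, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")

    # Scale factor: never above 1.0, so small equations are not blown up.
    scale = min(1.0, target_w / width, target_h / height)
    new_w = min(target_w, max(1, round(width * scale)))
    new_h = min(target_h, max(1, round(height * scale)))

    # Distribute the remaining pixels around the resized image.
    pad_w = target_w - new_w
    pad_h = target_h - new_h
    left, top = pad_w // 2, pad_h // 2
    return new_w, new_h, left, top, pad_w - left, pad_h - top
```

For example, with a hypothetical 560 x 224 canvas, `fit_and_pad(1120, 224, 560, 224)` halves the image to 560 x 112 and pads 56 pixels above and below, while a 100 x 50 crop is left at its original size and only padded.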


	### Evaluation
	The model is evaluated on an image-equation pair dataset collected from Wikipedia, arXiv, and im2latex-100k, curated by [lukas-blecher](https://github.com/lukas-blecher/LaTeX-OCR#data).

	| model | token_acc ↑ | normed edit distance ↓ |
	| --- | --- | --- |
	| pix2tex | 0.5346 | 0.10312 |
	| pix2tex* | 0.60 | 0.10 |
	| nougat-latex-based | 0.623850 | 0.06180 |

	pix2tex is a ResNet + ViT + Text Decoder architecture introduced in [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR).

	pix2tex*: reported from [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR); pix2tex: my evaluation with the released [checkpoint](https://github.com/lukas-blecher/LaTeX-OCR/releases/tag/v0.0.1); nougat-latex-based: evaluated on results generated with a beam-search strategy.
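
To make the "normed edit distance" column concrete, here is a minimal sketch of one common definition: the Levenshtein distance divided by the length of the longer sequence. The exact normalization and tokenization used for the numbers above are not stated here, so treat the function names and the choice of divisor as assumptions.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: Sequence, ref: Sequence) -> float:
    """Edit distance scaled to [0, 1] by the longer length (assumed convention)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0
```

Under this definition, a prediction of `\frac{a}{c}` against a reference of `\frac{a}{b}` differs by one character out of eleven, giving a distance of about 0.09. Lower is better, matching the ↓ in the table header.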


	## Requirements
	```text
	pip install "transformers>=4.34.0"
	```

	## Uses
	```python
	import torch
	from PIL import Image
	from transformers import VisionEncoderDecoderModel
	from transformers.models.nougat import NougatTokenizerFast
	# NougatLaTexProcessor is not part of transformers; it is expected to come
	# from the source-code repository listed at the top of this card.
	from nougat_latex import NougatLaTexProcessor

	model_name = "Norm/nougat-latex-base"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# init model
	model = VisionEncoderDecoderModel.from_pretrained(model_name).to(device)

	# init tokenizer and image processor
	tokenizer = NougatTokenizerFast.from_pretrained(model_name)
	latex_processor = NougatLaTexProcessor.from_pretrained(model_name)

	# load the equation image and make sure it has three channels
	image = Image.open("path/to/latex/image.png")
	if image.mode != "RGB":
	    image = image.convert("RGB")

	pixel_values = latex_processor(image, return_tensors="pt").pixel_values

	# start decoding from the BOS token
	decoder_input_ids = tokenizer(
	    tokenizer.bos_token, add_special_tokens=False, return_tensors="pt"
	).input_ids

	# generate with beam search
	with torch.no_grad():
	    outputs = model.generate(
	        pixel_values.to(device),
	        decoder_input_ids=decoder_input_ids.to(device),
	        max_length=model.decoder.config.max_length,
	        early_stopping=True,
	        pad_token_id=tokenizer.pad_token_id,
	        eos_token_id=tokenizer.eos_token_id,
	        use_cache=True,
	        num_beams=5,
	        bad_words_ids=[[tokenizer.unk_token_id]],
	        return_dict_in_generate=True,
	    )

	# decode and strip the special tokens from the output
	sequence = tokenizer.batch_decode(outputs.sequences)[0]
	sequence = (
	    sequence.replace(tokenizer.eos_token, "")
	    .replace(tokenizer.pad_token, "")
	    .replace(tokenizer.bos_token, "")
	)
	print(sequence)
	```
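
The final clean-up step in the snippet above can be factored into a small helper when decoding several images. This is only a sketch: `strip_special_tokens` is a hypothetical name, the trailing `strip()` is an addition that the original snippet does not perform, and the token strings in the usage note are placeholders rather than values read from the tokenizer.

```python
from typing import Iterable, Optional


def strip_special_tokens(sequence: str,
                         special_tokens: Iterable[Optional[str]]) -> str:
    """Remove every listed special token from a decoded string."""
    for token in special_tokens:
        if token:  # skip tokens the tokenizer does not define (None or "")
            sequence = sequence.replace(token, "")
    # Extra step (assumption): trim whitespace left behind by removed tokens.
    return sequence.strip()
```

With the tokenizer from the snippet above, the call would look like `strip_special_tokens(sequence, [tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token])`. With placeholder tokens `"<s>"`, `"</s>"`, and `"<pad>"`, an input of `"<s>\frac{a}{b}</s><pad>"` comes back as `\frac{a}{b}`.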