## What's the point of this?
LaTeX is the de facto standard markup language for typesetting equations in academic papers.
It is extremely feature-rich and flexible, but also very verbose.
This makes it great for typesetting complex equations, but inconvenient for quick note-taking on the fly.
For example, here's a short equation from [this Wikipedia page](https://en.wikipedia.org/wiki/Quantum_electrodynamics) on quantum electrodynamics
and the corresponding LaTeX code:
![Example](https://wikimedia.org/api/rest_v1/media/math/render/svg/6faab1adbb88a567a52e55b2012e836a011a0675)
```
{\displaystyle {\mathcal {L}}={\bar {\psi }}(i\gamma ^{\mu }D_{\mu }-m)\psi -{\frac {1}{4}}F_{\mu \nu }F^{\mu \nu },}
```
This demo is a first step toward solving this problem.
Eventually, you'll be able to take a quick partial screenshot of a paper,
and a program built on this model will generate the corresponding LaTeX source code
so that you can copy and paste it straight into your personal notes.
No more endless googling of obscure LaTeX syntax!
## How does it work?
Because this problem involves looking at an image and generating valid LaTeX code,
the model needs to combine techniques from both computer vision (CV) and natural language processing (NLP).
Other projects aim to solve the same problem with some very interesting models.
These generally involve an "encoder" that looks at the image and extracts and encodes the information about the equation,
and a "decoder" that takes that information and translates it into what is hopefully both valid and accurate LaTeX code.
The "encode" part can be done using classic CNN architectures commonly used for CV tasks, or newer vision transformer architectures.
The "decode" part can be done with LSTMs or transformer decoders, using an attention mechanism so that the decoder can handle long-range dependencies, e.g. remembering to close a bracket that was opened many tokens earlier.
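
To make the attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single query. It is a toy illustration, not the code used in this project or in TrOCR: the function names `softmax` and `attend` and the two-dimensional vectors are made up for the example. The query is built to resemble the first stored position (standing in for an "open bracket"), so that position receives almost all of the attention weight.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights over the positions and the
    weighted average of the value vectors (the "context").
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Toy data (hypothetical): three earlier positions; the first one
# stands in for an "open bracket" seen a long time ago.
keys = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]  # the decoder "asks" about the open bracket

weights, context = attend(query, keys, values)
```

Because the weights depend on content rather than distance, the first position can be arbitrarily far back in the sequence and still dominate the result.
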

I chose to tackle this problem with transfer learning, taking an existing OCR model and fine-tuning it for this task.
The biggest reason for this is computing constraints:
GPU hours are expensive, so I wanted training to be reasonably fast, on the order of a couple of hours.
There are other benefits to this approach as well,
e.g. the architecture is already proven to be robust.
I chose [TrOCR](https://arxiv.org/abs/2109.10282), a model trained at Microsoft for text recognition tasks, which uses a transformer architecture for both the encoder and the decoder.
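
As a rough sketch of what running a TrOCR model looks like, the snippet below uses the `TrOCRProcessor` and `VisionEncoderDecoderModel` classes from the Hugging Face `transformers` library. The checkpoint name, the helper names `load_trocr` and `image_to_text`, and the file name `equation.png` are assumptions for illustration; the checkpoint is a stock pretrained one, not the fine-tuned model behind this demo, so it should be expected to output plain text rather than LaTeX.

```python
# Assumption: a stock TrOCR checkpoint from the Hugging Face Hub,
# not necessarily the one this project started from.
CHECKPOINT = "microsoft/trocr-small-printed"


def load_trocr(checkpoint=CHECKPOINT):
    """Load the image processor/tokenizer and the encoder-decoder model."""
    # Imported here so this file can be read and imported without
    # the heavy dependencies (transformers, torch) being triggered.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    return processor, model


def image_to_text(image_path, processor, model, max_new_tokens=100):
    """Run one image through the encoder and decode the generated tokens."""
    from PIL import Image

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example usage (downloads the checkpoint on first run;
# "equation.png" is a hypothetical file name):
#   processor, model = load_trocr()
#   print(image_to_text("equation.png", processor, model))
```

Fine-tuning keeps this same encoder-decoder structure and continues training it on image and formula pairs.
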

For the data, I used the `im2latex-100k` dataset, which contains roughly 100k formulas and their images.
Some of the preprocessing steps were done by Harvard NLP for the [`im2markup` project](https://github.com/harvardnlp/im2markup).
To limit the scope of the project and simplify the task, I restricted the training data to equations containing 100 LaTeX tokens or fewer.
This covers most single-line equations, including fractions, subscripts, symbols, etc., but does not cover large multi-line equations, some of which can have up to 500 LaTeX tokens.
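
The length cutoff can be sketched as a simple filter. This assumes the formulas are already tokenized and stored as space-separated token strings, so that counting tokens is just a whitespace split; the names `count_tokens` and `keep_short` and the sample formulas are made up for illustration and are not taken from the training notebook.

```python
MAX_TOKENS = 100  # cutoff described in the text


def count_tokens(formula: str) -> int:
    """Count tokens, assuming they are separated by whitespace."""
    return len(formula.split())


def keep_short(formulas, max_tokens=MAX_TOKENS):
    """Keep only formulas with at most `max_tokens` tokens."""
    return [f for f in formulas if count_tokens(f) <= max_tokens]


# Hypothetical samples: one short formula and one 500-token formula.
samples = [
    r"\frac { 1 } { 4 } F _ { \mu \nu } F ^ { \mu \nu }",
    " ".join(["x"] * 500),
]
short = keep_short(samples)  # only the first sample survives
```
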

Training was done on a Kaggle GPU kernel and took roughly 3 hours.
You can find the full training code [on Kaggle](https://www.kaggle.com/code/younghoshin/finetuning-trocr/notebook).
## What's next?
There are several improvements that I'm hoping to make to this project.
### More robust prediction
If you've tried the examples above (randomly sampled from the test set), you may have noticed that the model's predictions aren't quite perfect: the model occasionally misses, duplicates, or mistakes tokens.
More training on the existing dataset could help with this.
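
One way to put a number on these errors is a token-level edit distance, which counts how many missing, extra, or wrong tokens separate a prediction from the reference. The sketch below is a standard Levenshtein distance over token lists. It is offered as a possible metric, not as the one used to evaluate this model, and the function name and sample formula are illustrative.

```python
def token_edit_distance(predicted, reference):
    """Minimum number of token insertions, deletions, and
    substitutions needed to turn `predicted` into `reference`."""
    # prev[j] = distance between the predicted tokens seen so far
    # and the first j reference tokens.
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if p == r else 1
            curr.append(min(
                prev[j] + 1,         # extra token in the prediction
                curr[j - 1] + 1,     # token missing from the prediction
                prev[j - 1] + cost,  # match, or a mistaken token
            ))
        prev = curr
    return prev[-1]


reference = r"\frac { 1 } { 4 }".split()
missed = r"\frac { 1 } { 4".split()          # closing brace missed
duplicated = r"\frac { 1 } { 4 } }".split()  # closing brace duplicated
mistaken = r"\frac { 1 } { 7 }".split()      # one token mistaken
```

Each of the three faulty predictions is one edit away from the reference, so dividing the distance by the reference length gives a rough per-token error rate.
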
### More data
There's a lot of LaTeX data available on the internet besides `im2latex-100k`, e.g. on arXiv and Wikipedia.
It's just waiting to be scraped and used for this project.
This means many hours of scraping, cleaning, and processing, but a more diverse set of input images could improve model accuracy significantly.
### Faster and smaller model
The model currently takes a few seconds to process a single image.
I would love to improve performance so that it can run in one second or less, maybe even on mobile devices.
This might be impossible with TrOCR, which is a fairly large model designed for use on GPUs.
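
To track progress toward the one-second goal, per-image latency can be measured with a small timing helper like the sketch below. The helper name `median_latency` is made up for this example, and `time.sleep` stands in for the real prediction call, which is not shown here.

```python
import statistics
import time


def median_latency(fn, *args, repeats=5):
    """Call `fn(*args)` several times and return the median
    wall-clock time per call, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload: replace time.sleep with the actual prediction
# function and its image argument to time the model itself.
latency = median_latency(time.sleep, 0.01, repeats=3)
```

The median is used rather than the mean so that a single slow first call (for example, one that includes model warm-up) does not dominate the result.
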
<p style='text-align: center'>Made by Young Ho Shin</p>
<p style='text-align: center'>
<a href = "mailto: [email protected]">Email</a> |
<a href='https://www.github.com/yhshin11'>GitHub</a> |
<a href='https://www.linkedin.com/in/young-ho-shin/'>LinkedIn</a>
</p>