I've been working through the first two lessons of
[the fastai course](https://course.fast.ai/). For lesson one I trained a model
to recognise my cat, Mr Blupus. For lesson two the emphasis is on getting those
models out in the world as some kind of demo or application.
[Gradio](https://gradio.app) and
[Huggingface Spaces](https://huggingface.co/spaces) make it super easy to get a
prototype of your model on the internet.
This MVP app runs two models to mimic the experience of what a final deployed
version of the project might look like:
- The first model (a classification model trained with fastai, available on the
Huggingface Hub
[here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
testable as a standalone demo
[here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier))
classifies each page of the PDF and determines which pages are redacted. I've
written about how I trained this model
[here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
- The second model (an object detection model trained using
[IceVision](https://airctic.com/), itself built partly on top of fastai)
detects which parts of the image are redacted. This is a model I've been
working on for a while and I described my process in a series of blog posts
(see below).
This MVP app does several things (a rough sketch of how these pieces might fit
together in code follows the list):
- it extracts any pages it considers to contain redactions and displays that
subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
displays some text alerting you to which specific pages were redacted.
- if you click the "Analyse and extract redacted images" checkbox, it will:
  - pass the pages it considered redacted through the object detection model
  - calculate what proportion of the total area of the image was redacted, as
  well as what proportion of the actual content (i.e. excluding margins etc.
  where there is no content)
  - create a PDF that you can download that contains only the redacted images,
  with an overlay of the redactions that it was able to identify along with
  the confidence score for each item.
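To give a sense of how these pieces might fit together, here is a minimal sketch
of the kind of handler the app could use, assuming a fastai classifier exported
with `learn.export()`, a class label of `"redacted"`, and a hypothetical
`detect_redactions` helper standing in for the IceVision inference step; the
real Space's wiring and names differ.

```python
from fastai.vision.all import load_learner
from pdf2image import convert_from_path

# Exported fastai classifier (filename is an assumption).
classifier = load_learner("redaction-classifier.pkl")


def detect_redactions(image):
    """Hypothetical wrapper around the IceVision detector.

    Returns a list of (x1, y1, x2, y2) boxes for the redactions it finds.
    """
    ...


def analyse_pdf(pdf_path, extract_redactions=False):
    """Classify each page, then optionally run the detector on the redacted ones."""
    pages = convert_from_path(pdf_path)  # one PIL image per page
    redacted = []
    for page_num, page in enumerate(pages, start=1):
        label, _, _ = classifier.predict(page)
        if label == "redacted":
            redacted.append((page_num, page))

    report = f"Redacted pages: {[n for n, _ in redacted]}"
    if not extract_redactions:
        return report

    for page_num, image in redacted:
        boxes = detect_redactions(image)
        page_area = image.width * image.height
        covered = sum((x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in boxes)
        report += f"\nPage {page_num}: {covered / page_area:.1%} of the page is redacted"
    return report
```

The content-area proportion and the downloadable PDF with confidence overlays
follow the same pattern, with some extra geometry and a PDF writer on top.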
## The Dataset
I downloaded a few thousand publicly-available FOIA documents from a government
website. I split the PDFs up into individual `.jpg` files and then used
[Prodigy](https://prodi.gy/) to annotate the data. (This process was described
in [a blogpost written last
year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
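The PDF-to-image step itself only needs a few lines. Below is a hedged sketch
using `pdf2image` (which wraps poppler); the directory names and DPI are
assumptions rather than the exact setup used at the time.

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires the poppler utilities to be installed

pdf_dir, image_dir = Path("pdfs"), Path("images")
image_dir.mkdir(exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    # Render each page and save it as an individual .jpg, ready for annotation in Prodigy
    for page_num, page in enumerate(convert_from_path(pdf, dpi=150), start=1):
        page.save(image_dir / f"{pdf.stem}_page_{page_num}.jpg", "JPEG")
```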
For the object detection model, the process was quite a bit more involved and I
direct you to the series of articles referenced below in the 'Further Reading' section.
## Training the models
I trained the classification model with fastai's flexible `vision_learner`, fine-tuning
`resnet18`, which was both smaller than `resnet34` (no surprises there) and less
liable to early overfitting. I trained the model for 10 epochs.
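In code, that fine-tuning step looks roughly like the following; the folder
layout, image size and validation split are assumptions, not the exact settings
used.

```python
from fastai.vision.all import (
    ImageDataLoaders, Resize, error_rate, resnet18, vision_learner
)

# Assumes page images live in class-named folders, e.g. data/pages/redacted/
# and data/pages/unredacted/ (the layout is an assumption).
dls = ImageDataLoaders.from_folder(
    "data/pages", valid_pct=0.2, seed=42, item_tfms=Resize(460)
)

learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(10)  # 10 epochs, as described above

learn.export("redaction-classifier.pkl")  # export for use in the Gradio app
```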
The object detection model was trained using IceVision, with VFNet as the
model architecture and `resnet50` as the backbone. I trained the model for 50 epochs and
reached 89% accuracy on the validation data.
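With IceVision this looks roughly like the snippet below, following the
library's usual getting-started pattern; the backbone identifier, batch size and
learning rate are assumptions, and the parsing of annotations into `train_ds` /
`valid_ds` is omitted.

```python
from icevision.all import *

# Assumes `train_ds`, `valid_ds` and `class_map` have already been built from the
# annotated records with an IceVision parser and transforms (omitted here).
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x(pretrained=True)
model = model_type.model(backbone=backbone, num_classes=len(class_map))

train_dl = model_type.train_dl(train_ds, batch_size=8, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=8, num_workers=4, shuffle=False)

learn = model_type.fastai.learner(
    dls=[train_dl, valid_dl],
    model=model,
    metrics=[COCOMetric(metric_type=COCOMetricType.bbox)],
)
learn.fine_tune(50, 1e-4)  # 50 epochs, as described above
```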
## Further Reading
This initial dataset spurred an ongoing interest in the domain and I've since
been working on the problem of object detection, i.e. identifying exactly which
parts of the image contain redactions.
Some of the key blog posts I've written about this project:
- How to annotate data for an object detection problem with Prodigy
([link](https://mlops.systems/redactionmodel/computervision/datalabelling/2021/11/29/prodigy-object-detection-training.html))
- How to create synthetic images to supplement a small dataset
([link](https://mlops.systems/redactionmodel/computervision/python/tools/2022/02/10/synthetic-image-data.html))
- How to use error analysis and visual tools like FiftyOne to improve model
performance
([link](https://mlops.systems/redactionmodel/computervision/tools/debugging/jupyter/2022/03/12/fiftyone-computervision.html))
- Creating more synthetic data focused on the tasks my model finds hard
([link](https://mlops.systems/tools/redactionmodel/computervision/2022/04/06/synthetic-data-results.html))
- Data validation for object detection / computer vision (a three-part series:
[part 1](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/19/data-validation-great-expectations-part-1.html),
[part 2](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/26/data-validation-great-expectations-part-2.html),
[part 3](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/28/data-validation-great-expectations-part-3.html))