---
license: mit
tags:
- vidore
---

# ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

In June 2024, [ColPali](https://arxiv.org/abs/2407.01449) was introduced as an OCR-free document retrieval model built on top of [PaliGemma](https://arxiv.org/abs/2407.07726), shifting the paradigm of PDF document retrieval by processing page images directly instead of relying on error-prone and resource-heavy OCR pipelines. However, with three billion parameters, ColPali can be computationally expensive, especially for large document databases. In contrast, text retrieval models like [ColBERT](https://arxiv.org/abs/2004.12832) are far more efficient at just a few hundred million parameters, but they still require error-prone and expensive OCR pipelines to extract the text in the first place. To bridge this gap, we introduce ColFlor, an OCR-free visual document retrieval model with only 174 million parameters. ColFlor is 17 times smaller than ColPali, 9.8 times faster at encoding queries, and 5.25 times faster at encoding images, with only a 1.8% drop in performance on text-rich English documents.
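
To make the retrieval mechanism concrete, here is a minimal sketch of the ColBERT-style late-interaction (MaxSim) scoring that ColBERT, ColPali, and ColFlor share. The `maxsim_score` helper below is an illustrative assumption, not the library API; in practice you would use `processor.score` from the inference script further down, which implements this in batched form.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative ColBERT-style MaxSim score for one query/document pair.

    query_emb: (num_query_tokens, dim) multi-vector query embedding.
    doc_emb:   (num_doc_tokens, dim) multi-vector document (page) embedding.
    """
    # Token-to-token similarities: (num_query_tokens, num_doc_tokens).
    sim = query_emb @ doc_emb.T
    # Each query token keeps its best-matching document token (MaxSim),
    # and the per-token maxima are summed into a single relevance score.
    return sim.max(dim=1).values.sum()
```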

<p align="center"><img width=800 src="https://github.com/AhmedMasryKU/colflor/blob/main/assets/colflor_n32.png?raw=true"/></p>

More details about the model can be found in the [ColFlor blog post](https://huggingface.co/blog/ahmed-masry/colflor).

## Usage

First, clone the GitHub repository and install its dependencies:

```bash
git clone https://github.com/AhmedMasryKU/colflor
cd colflor
pip install -e .
```
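
If the editable install succeeded, importing the ColFlor classes should work; a quick sanity check:

```python
# If this runs without error, colpali_engine (with the ColFlor classes) is importable.
from colpali_engine.models import ColFlor, ColFlorProcessor

print("ColFlor classes imported successfully.")
```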

Then, run the following inference code:

```python
import pprint
from typing import List, cast

import torch
import typer
from datasets import Dataset, load_dataset
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm

from colpali_engine.models import ColFlor, ColFlorProcessor
from colpali_engine.utils.processing_utils import BaseVisualRetrieverProcessor
from colpali_engine.utils.torch_utils import ListDataset, get_torch_device


def main():
    """
    Example script to run inference with ColFlor.
    """

    device = get_torch_device("auto")
    print(f"Device used: {device}")

    # Model name
    model_name = "ahmed-masry/ColFlor"

    # Load model
    model = ColFlor.from_pretrained(
        model_name,
        # torch_dtype=torch.bfloat16,  # uncomment to load the model in bfloat16
        device_map=device,
    ).eval()

    # Load processor
    processor = cast(ColFlorProcessor, ColFlorProcessor.from_pretrained(model_name))

    if not isinstance(processor, BaseVisualRetrieverProcessor):
        raise ValueError("Processor should be a BaseVisualRetrieverProcessor")

    # NOTE: Only the first 16 images are used for demonstration purposes
    dataset = cast(Dataset, load_dataset("vidore/docvqa_test_subsampled", split="test[:16]"))
    images = dataset["image"]

    # Select a few queries for demonstration purposes
    query_indices = [12, 15]
    queries = [dataset[idx]["query"] for idx in query_indices]
    print("Selected queries:")
    pprint.pprint(dict(zip(query_indices, queries)))

    # Run inference - docs
    dataloader = DataLoader(
        dataset=ListDataset[Image.Image](images),
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: processor.process_images(x),
    )
    ds: List[torch.Tensor] = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings_doc = model(**batch_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

    # Run inference - queries
    dataloader = DataLoader(
        dataset=ListDataset[str](queries),
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: processor.process_queries(x),
    )

    qs: List[torch.Tensor] = []
    for batch_query in dataloader:
        with torch.no_grad():
            batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
            embeddings_query = model(**batch_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # Run scoring
    scores = processor.score(qs, ds).cpu().numpy()
    idx_top_1 = scores.argmax(axis=1)
    print("Indices of the top-1 retrieved documents for each query:", idx_top_1)

    # Sanity check
    if idx_top_1.tolist() == query_indices:
        print("The top-1 retrieved documents are correct.")
    else:
        print("The top-1 retrieved documents are incorrect.")

    return


if __name__ == "__main__":
    typer.run(main)

```
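
The script above only checks the top-1 hit per query. For actual retrieval you usually want a ranked list; a small sketch that continues inside `main()` after the scoring step, reusing the `scores` matrix and `query_indices` from the script (NumPy is assumed available as a dependency of the packages above):

```python
import numpy as np

# scores has shape (num_queries, num_documents), as produced by processor.score above.
k = 5
top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the k best pages per query
for query_idx, doc_indices in zip(query_indices, top_k):
    print(f"Query {query_idx}: top-{k} pages -> {doc_indices.tolist()}")
```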

## Limitations

 - **Figures**: While ColFlor performs reasonably on figures, there is still a sizable performance gap between it and larger models such as ColPali.
 - **Multilinguality**: The current version of the model supports only English and performs poorly on other languages.

## License

We release this model under the MIT license.

## Contact

If you have any questions about this work, feel free to reach out to **Ahmed Masry** at **[email protected]** or **[email protected]**. 

## Acknowledgement
This work was carried out at the Intelligent Visualization Lab at York University in Canada. It was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Canada Foundation for Innovation (CFI). Additionally, it received support through a GCP credits award from Google's PaliGemma Academic Program.

We appreciate the well-documented training and evaluation GitHub repositories provided by the ColPali team, which were essential to our model development.
This model card is adapted from the [ColPali model card](https://huggingface.co/vidore/colpali).

## Citation

If you plan to use ColFlor in your research, please consider citing us as follows:

```bibtex
@misc{masry2024colflor,
    title={ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models},
    url={https://huggingface.co/blog/ahmed-masry/colflor},
    author={Masry, Ahmed},
    month={October},
    year={2024}
}
```