bokesyo 
posted an update Aug 17
What Happens When RAG Systems Become Fully Vision-Language Model-Based?
HF Demo: bokesyo/MiniCPMV-RAG-PDFQA
Multimodal Dense Retriever: RhapsodyAI/minicpm-visual-embedding-v0
Generation Model: openbmb/MiniCPM-V-2_6
Github: https://github.com/RhapsodyAILab/MiniCPM-V-Embedding-v0-Train

The Vision-Language Model Dense Retriever MiniCPM-Visual-Embedding-v0 reads PDFs directly -- no OCR required. With strong OCR and visual understanding capabilities, it generates multimodal dense representations, letting you build and search through your personal library with ease.

Ask a question, and it retrieves the most relevant pages. MiniCPM-V-2.6 then answers based on the retrieved pages, drawing on its strong multi-image understanding.
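The embed → retrieve → answer flow described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: `embed_page` and `embed_query` are hypothetical stand-ins for MiniCPM-Visual-Embedding-v0 (simulated here with random unit vectors), and the final answering step with MiniCPM-V-2.6 is only noted in a comment.

```python
import numpy as np

DIM = 128  # placeholder embedding dimension, not the real model's
rng = np.random.default_rng(0)


def embed_page(page_image) -> np.ndarray:
    # Stand-in for minicpm-visual-embedding-v0: in the real system this
    # encodes the page *image* directly (no OCR) into a dense vector.
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def embed_query(question: str) -> np.ndarray:
    # Stand-in for the retriever's query side.
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def retrieve(question: str, page_index: np.ndarray, top_k: int = 3) -> list:
    # Cosine similarity over unit vectors reduces to a dot product.
    q = embed_query(question)
    scores = page_index @ q
    return np.argsort(scores)[::-1][:top_k].tolist()


# Index 50 "pages" (the online demo's limit), then retrieve the best matches.
page_index = np.stack([embed_page(p) for p in range(50)])
pages = retrieve("What is the training objective?", page_index)
# In the real pipeline, the retrieved page images would then be passed to
# MiniCPM-V-2.6 for multi-image question answering.
print(pages)
```

The key design point is that the index stores one dense vector per page image, so retrieval works the same way for visually rich and text-heavy PDFs alike.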

Whether you’re working with a visually rich or text-oriented PDF, it helps you quickly find the information you need.

It operates just like a human: reading, storing, retrieving, and answering with full visual comprehension.

Currently, the online demo supports PDFs with up to 50 pages due to GPU time limits. For longer PDFs or entire books, you can deploy it on your own machine.

This is what I was searching for for a long time. :) Thanks

·

Wow, this is also what I was planning to do for a long time! :)

Fantastic work! 👍

·

Thanks!