bokesyo 
posted an update Aug 17
What Happens When RAG Systems Become Fully Vision-Language Model-Based?
HF Demo: bokesyo/MiniCPMV-RAG-PDFQA
Multimodal Dense Retriever: RhapsodyAI/minicpm-visual-embedding-v0
Generation Model: openbmb/MiniCPM-V-2_6
Github: https://github.com/RhapsodyAILab/MiniCPM-V-Embedding-v0-Train

The Vision-Language Model Dense Retriever MiniCPM-Visual-Embedding-v0 reads PDFs directly -- no OCR required. With strong OCR and visual understanding capabilities, it generates multimodal dense representations, letting you build and search through your personal library with ease.

Ask a question, and it retrieves the most relevant pages. MiniCPM-V-2.6 then answers based on the retrieved pages, drawing on its strong multi-image understanding.
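The embed → retrieve → answer flow described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: `embed_page` and `embed_query` are hypothetical stand-ins for MiniCPM-Visual-Embedding-v0 (simulated here with random unit vectors), and the final answering step with MiniCPM-V-2.6 is only noted in a comment.

```python
import numpy as np

DIM = 128  # placeholder embedding dimension, not the real model's
rng = np.random.default_rng(0)


def embed_page(page_image) -> np.ndarray:
    # Stand-in for minicpm-visual-embedding-v0: in the real system this
    # encodes the page *image* directly (no OCR) into a dense vector.
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def embed_query(question: str) -> np.ndarray:
    # Stand-in for the retriever's query side.
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def retrieve(question: str, page_index: np.ndarray, top_k: int = 3) -> list:
    # Cosine similarity over unit vectors reduces to a dot product.
    q = embed_query(question)
    scores = page_index @ q
    return np.argsort(scores)[::-1][:top_k].tolist()


# Index 50 "pages" (the online demo's limit), then retrieve the best matches.
page_index = np.stack([embed_page(p) for p in range(50)])
pages = retrieve("What is the training objective?", page_index)
# In the real pipeline, the retrieved page images would then be passed to
# MiniCPM-V-2.6 for multi-image question answering.
print(pages)
```

The key design point is that the index stores one dense vector per page image, so retrieval works the same way for visually rich and text-heavy PDFs alike.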

Whether you’re working with a visually rich or text-oriented PDF, it helps you quickly find the information you need.

It operates just like a human: reading, storing, retrieving, and answering with full visual comprehension.

Currently, the online demo supports PDFs with up to 50 pages due to GPU time limits. For longer PDFs or entire books, you can deploy it on your own machine.

This is what I was searching for for a long time. :) Thanks

·

Wow, this is also what I was planning to do for a long time! :)

Fantastic work! 👍

·

Thanks!