Multi-Modal LLM Gradio App
Project Overview
This project is a multi-modal language model Gradio app that accepts text, image, and audio inputs, and outputs text responses. The app mimics a ChatGPT-style interface, allowing users to interact using multiple input modes.
The app leverages:
- CLIP for image processing
- Whisper for audio transcription (ASR)
- A text-based model (like GPT or Phi) for generating text responses
Features
- Text Input: Users can input text directly for response generation.
- Image Input: Users can upload images, which are processed by the CLIP model.
- Audio Input: Users can upload or record audio, which is transcribed by the Whisper model and then passed to the language model for a response.
- ChatGPT-Like Interface: A simple, intuitive interface that handles multi-modal inputs and returns text output; a minimal wiring sketch follows this list.
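The sketch below shows one way such a Gradio interface could be wired; it is a minimal illustration, not the exact layout of app.py, and `generate_response` is a hypothetical handler standing in for the app's real pipeline.

```python
# Minimal Gradio sketch: three input modes, one text output.
# `generate_response` is a hypothetical placeholder for the app's real handler.
import gradio as gr

def generate_response(text, image, audio):
    # Placeholder: report which inputs were provided; the real app would
    # run CLIP / Whisper / the text model here.
    provided = [name for name, value in
                [("text", text), ("image", image), ("audio", audio)] if value]
    return f"Received inputs: {', '.join(provided) or 'none'}"

demo = gr.Interface(
    fn=generate_response,
    inputs=[
        gr.Textbox(label="Text prompt"),
        gr.Image(type="filepath", label="Image"),
        gr.Audio(type="filepath", label="Audio"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal LLM",
)

if __name__ == "__main__":
    demo.launch()
```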
Installation
Clone the repository:
git clone https://huggingface.co/spaces/Vasudevakrishna/MultiModel_LLM_ERAV2
cd MultiModel_LLM_ERAV2
Install dependencies:
pip install -r requirements.txt
Run the app:
python app.py
How It Works
- Text Processing: Input text is passed to a language model (like GPT or Phi) to generate a response.
- Image Processing: Images are processed using CLIP, which extracts image embeddings. These embeddings are then mapped into a representation the text model can consume.
- Audio Processing: Audio files are transcribed into text using Whisper, and the transcription is passed to the language model for response generation. A minimal sketch of all three paths follows this list.
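The snippet below sketches the three processing paths with Hugging Face transformers. It is a simplified illustration under assumptions: the specific checkpoints (whisper-base, clip-vit-base-patch32, phi-2) and the omitted projection layer may differ from what the app actually uses.

```python
# Sketch of the three paths described above; model names are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

# Audio path: transcribe speech to text with Whisper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

def transcribe(audio_path: str) -> str:
    return asr(audio_path)["text"]

# Image path: extract CLIP image embeddings. In the full app these would be
# projected into the text model's input space; that layer is omitted here.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path: str) -> torch.Tensor:
    inputs = clip_processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_image_features(**inputs)  # shape (1, 512)

# Text path: generate a response with a small causal LM (Phi used here).
generator = pipeline("text-generation", model="microsoft/phi-2")

def respond(prompt: str) -> str:
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```

For audio input, the app would call something like `respond(transcribe(audio_path))`; for images, the projected embedding is injected alongside the user's text prompt before generation.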
Usage
- Text Input: Enter text in the provided textbox and click "Submit" to generate a response.
- Image Input: Upload an image and click "Submit" to generate a response based on the image.
- Audio Input: Upload or record an audio file, then click "Submit" to transcribe it and generate a response.
Future Improvements
- Add advanced features like drag-and-drop file upload or live audio recording for a better user experience.
- Speed up image processing by computing CLIP embeddings in real time, given additional GPU resources.
- Implement end-to-end training of all components for better response quality.
License
This project is licensed under the MIT License.