Multi-Modal LLM Gradio App
Project Overview
This project is a multi-modal language model Gradio app that accepts text, image, and audio inputs, and outputs text responses. The app mimics a ChatGPT-style interface, allowing users to interact using multiple input modes.
The app leverages:
- CLIP for image processing
- Whisper for audio transcription (ASR)
- A text-based model (like GPT or Phi) for generating text responses
Features
- Text Input: Users can input text directly for response generation.
- Image Input: Users can upload images, which are processed by the CLIP model.
- Audio Input: Users can upload or record audio, which is transcribed by the Whisper model and then passed to the language model for a response.
- ChatGPT-Like Interface: A simple, intuitive interface that handles multi-modal inputs and returns text output; a minimal wiring sketch follows this list.
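The sketch below shows one way such a Gradio interface could be wired; it is a minimal illustration, not the exact layout of app.py, and `generate_response` is a hypothetical handler standing in for the app's real pipeline.

```python
# Minimal Gradio sketch: three input modes, one text output.
# `generate_response` is a hypothetical placeholder for the app's real handler.
import gradio as gr

def generate_response(text, image, audio):
    # Placeholder: report which inputs were provided; the real app would
    # run CLIP / Whisper / the text model here.
    provided = [name for name, value in
                [("text", text), ("image", image), ("audio", audio)] if value]
    return f"Received inputs: {', '.join(provided) or 'none'}"

demo = gr.Interface(
    fn=generate_response,
    inputs=[
        gr.Textbox(label="Text prompt"),
        gr.Image(type="filepath", label="Image"),
        gr.Audio(type="filepath", label="Audio"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal LLM",
)

if __name__ == "__main__":
    demo.launch()
```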
Installation
Clone the repository:
git clone https://huggingface.co/spaces/Vasudevakrishna/MultiModel_LLM_ERAV2
cd MultiModel_LLM_ERAV2
Install dependencies:
pip install -r requirements.txt
Run the app:
python app.py
How It Works
- Text Processing: Input text is passed to a language model (like GPT or Phi) to generate a response.
- Image Processing: Images are processed using CLIP, which extracts image embeddings. These embeddings are then mapped into a representation the text model can consume.
- Audio Processing: Audio files are transcribed into text using Whisper, and the transcription is passed to the language model for response generation. A minimal sketch of all three paths follows this list.
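The snippet below sketches the three processing paths with Hugging Face transformers. It is a simplified illustration under assumptions: the specific checkpoints (whisper-base, clip-vit-base-patch32, phi-2) and the omitted projection layer may differ from what the app actually uses.

```python
# Sketch of the three paths described above; model names are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

# Audio path: transcribe speech to text with Whisper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

def transcribe(audio_path: str) -> str:
    return asr(audio_path)["text"]

# Image path: extract CLIP image embeddings. In the full app these would be
# projected into the text model's input space; that layer is omitted here.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path: str) -> torch.Tensor:
    inputs = clip_processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_image_features(**inputs)  # shape (1, 512)

# Text path: generate a response with a small causal LM (Phi used here).
generator = pipeline("text-generation", model="microsoft/phi-2")

def respond(prompt: str) -> str:
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```

For audio input, the app would call something like `respond(transcribe(audio_path))`; for images, the projected embedding is injected alongside the user's text prompt before generation.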
Usage
- Text Input: Enter text in the provided textbox and click "Submit" to generate a response.
- Image Input: Upload an image and click "Submit" to generate a response based on the image.
- Audio Input: Upload or record an audio file, then click "Submit" to transcribe it and generate a response.
Future Improvements
- Add advanced features like drag-and-drop file upload or live audio recording for a better user experience.
- Speed up image processing by computing CLIP embeddings in real time, given additional GPU resources.
- Implement end-to-end training of all components for better response quality.
License
This project is licensed under the MIT License.