
Multi-Modal LLM Gradio App

Project Overview

This project is a multi-modal language model Gradio app that accepts text, image, and audio inputs, and outputs text responses. The app mimics a ChatGPT-style interface, allowing users to interact using multiple input modes.

The app leverages the following models (a minimal loading sketch follows this list):

  • CLIP for image processing
  • Whisper for audio transcription (ASR)
  • A text-based model (like GPT or Phi) for generating text responses
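
The sketch below shows one way these three models could be loaded with Hugging Face transformers. The checkpoint names (clip-vit-base-patch32, whisper-base, phi-2) and variable names are assumptions for illustration and may differ from what this Space actually uses.

    # Illustrative model loading; checkpoint names are assumptions.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        CLIPModel,
        CLIPProcessor,
        pipeline,
    )

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # CLIP: turns images into embeddings
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Whisper: transcribes audio to text (ASR)
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-base",
        device=0 if device == "cuda" else -1,
    )

    # Text model (e.g. Phi): generates the final response
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    text_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2").to(device)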

Features

  • Text Input: Users can input text directly for response generation.
  • Image Input: Users can upload images, which are processed by the CLIP model.
  • Audio Input: Users can upload or record audio files, which are transcribed by the Whisper model and then processed for response.
  • ChatGPT-Like Interface: A simple, intuitive interface that handles multi-modal inputs and returns a text response (a layout sketch follows this list).
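
A rough sketch of such a multi-modal Gradio layout is shown below. The component labels and the respond() handler are placeholders, not the app's actual code.

    # Sketch of a multi-modal Gradio layout; respond() is a placeholder handler.
    import gradio as gr

    def respond(text, image, audio):
        # In the real app this would route each input through CLIP / Whisper /
        # the text model and return the generated reply.
        return f"Received text={bool(text)}, image={image is not None}, audio={audio is not None}"

    with gr.Blocks() as demo:
        text_in = gr.Textbox(label="Text input")
        image_in = gr.Image(type="pil", label="Image input")
        audio_in = gr.Audio(type="filepath", label="Audio input (upload or record)")
        reply = gr.Textbox(label="Response")
        gr.Button("Submit").click(respond, inputs=[text_in, image_in, audio_in], outputs=reply)

    demo.launch()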

Installation

  1. Clone the repository:

    git clone https://huggingface.co/spaces/Vasudevakrishna/MultiModel_LLM_ERAV2
    cd MultiModel_LLM_ERAV2
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Run the app:

    python app.py
    

How It Works

  1. Text Processing: Input text is passed to a language model (like GPT or Phi) to generate a response.
  2. Image Processing: Images are processed using CLIP, which extracts embeddings. These embeddings are then converted into a format understandable by the text model.
  3. Audio Processing: Audio files are transcribed to text using Whisper, and the transcript is passed to the language model for response generation (a combined pipeline sketch follows this list).
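
The flow above can be summarized in a single function. The sketch below reuses the asr, clip_model, clip_processor, tokenizer, text_model, and device objects from the loading sketch in the overview; all names are illustrative, and the projection of CLIP embeddings into the text model's embedding space is omitted.

    # Illustrative end-to-end flow; object names come from the loading sketch above.
    from PIL import Image

    def generate_response(text=None, image_path=None, audio_path=None, max_new_tokens=128):
        prompt_parts = []

        # Audio -> text via Whisper
        if audio_path is not None:
            transcript = asr(audio_path)["text"]
            prompt_parts.append(transcript)

        # Image -> CLIP embeddings (projection into the text model's
        # embedding space is omitted in this sketch)
        if image_path is not None:
            pixel_inputs = clip_processor(images=Image.open(image_path), return_tensors="pt").to(device)
            image_embeds = clip_model.get_image_features(**pixel_inputs)
            prompt_parts.append("[image context]")  # placeholder for the projected embeddings

        # Text goes straight into the prompt
        if text:
            prompt_parts.append(text)

        inputs = tokenizer("\n".join(prompt_parts), return_tensors="pt").to(device)
        output_ids = text_model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)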

Usage

  • Text Input: Enter text in the provided textbox and click "Submit" to generate a response.
  • Image Input: Upload an image and click "Submit" to generate a response based on the image.
  • Audio Input: Upload or record an audio file and click "Submit" to transcribe it and generate a response.

Future Improvements

  • Add advanced features like drag-and-drop file upload or live audio recording for a better user experience.
  • Speed up image processing by computing CLIP embeddings in real time, given additional GPU resources.
  • Implement end-to-end training of all components for better response quality.

License

This project is licensed under the MIT License.