Image Captioning App

This is a mod of Wi-zz/joy-caption-pre-alpha and fancyfeast/joy-caption-alpha-two. Thanks to dominic1021, IceHibiki, BullseyeMxP, Wakeme.

Notice: I plan to contribute these changes back to Wi-zz's repository once the code is in better shape.

Overview

This application generates descriptive captions for images using machine learning models. It processes single images or entire directories, combining a CLIP vision model with an LLM to produce contextual captions. It also supports natural-language captioning of NSFW images. This mod extends the original author's work with the aim of improving performance; the original Space is at https://huggingface.co/spaces/fancyfeast/joy-caption-alpha-two.

Features

Single image and batch processing
Multiple directory support
Custom output directory
Adjustable batch size
Progress tracking
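
As a rough illustration of how single files and multiple directories might be gathered into one work list, here is a minimal sketch. It is not taken from app.py: the function name `collect_images` and the extension set `IMAGE_EXTENSIONS` are assumptions made for this example, and the real script may scan directories differently (for instance, recursively).

```python
from pathlib import Path

# Assumed set of extensions; the actual app may accept a different list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def collect_images(inputs):
    """Expand a mix of image files and directories into a flat list of image paths.

    Directories are scanned non-recursively; anything that is not an
    existing image file or directory is silently ignored.
    """
    found = []
    for raw in inputs:
        path = Path(raw)
        if path.is_dir():
            # Sort so the order is stable across runs and platforms.
            found.extend(
                sorted(
                    p for p in path.iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
                )
            )
        elif path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            found.append(path)
    return found
```

Called as `collect_images(["/path/to/dir1", "/path/to/dir2", "image.jpg"])`, this would return the images from both directories followed by the single file, provided they exist.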

Usage

| Command | Description |
| --- | --- |
| `python app.py image.jpg` | Process a single image |
| `python app.py /path/to/directory` | Process all images in a directory |
| `python app.py /path/to/dir1 /path/to/dir2` | Process multiple directories |
| `python app.py /path/to/dir --output /path/to/output` | Specify output directory |
| `python app.py /path/to/dir --bs 8` | Set batch size (default: 4) |
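
The commands above imply a command-line interface with one or more positional inputs plus `--output` and `--bs` options. The sketch below shows one way such an interface could be declared with `argparse`. It is an assumption-based illustration, not the parser used in app.py: the function name `build_parser`, the argument name `inputs`, and the help texts are invented here; only the flag names and the default batch size of 4 come from the usage list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Caption a single image or every image in one or more directories.",
    )
    # One or more image files and/or directories.
    parser.add_argument("inputs", nargs="+", help="image files or directories to process")
    # Optional output directory; None means the caller decides where captions go.
    parser.add_argument("--output", default=None, help="directory to write captions to")
    # Batch size, documented default of 4.
    parser.add_argument("--bs", type=int, default=4, help="batch size (default: 4)")
    return parser


# Example: equivalent to `python app.py /path/to/dir --bs 8`
args = build_parser().parse_args(["/path/to/dir", "--bs", "8"])
print(args.inputs, args.output, args.bs)
```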

Technical Details

Models: CLIP (vision), LLM (language), custom ImageAdapter
Optimization: CUDA-enabled GPU support
Error Handling: Skips problematic images in batch processing
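
To make the error-handling behaviour more concrete, the following sketch groups inputs into fixed-size batches and skips any item whose loading raises an exception. It is a simplified stand-in under stated assumptions: `iter_batches` and the `load` callable are hypothetical names, and the real script may log or handle failures differently. In practice `load` would be something like an image-opening function.

```python
def iter_batches(paths, load, batch_size=4):
    """Yield lists of (path, loaded_item) pairs, at most batch_size per list.

    Items for which load(path) raises are reported and skipped, so one
    unreadable image does not abort the whole run.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for path in paths:
        try:
            item = load(path)
        except Exception as exc:  # broad on purpose: any load failure means "skip"
            print(f"Skipping {path}: {exc}")
            continue
        batch.append((path, item))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield batch
```

Because skipped items never enter a batch, every yielded batch except possibly the last contains exactly `batch_size` successfully loaded items.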

Requirements

Python 3.x
PyTorch
Transformers library
PEFT library
CUDA-capable GPU (recommended)
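
Since a CUDA-capable GPU is recommended rather than required, a quick check like the one below can show which device PyTorch would use on a given machine. This is a small sketch, not code from app.py: the helper name `pick_device` is an assumption, and it falls back to "cpu" when PyTorch is not installed so that it can be run before the installation steps.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet; report CPU so the check still runs.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Selected device: {pick_device()}")
```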

Installation

Windows

git clone https://huggingface.co/John6666/joy-caption-alpha-two-cli-mod
cd joy-caption-alpha-two-cli-mod
python -m venv venv
.\venv\Scripts\activate
# Change as per https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Linux

git clone https://huggingface.co/John6666/joy-caption-alpha-two-cli-mod
cd joy-caption-alpha-two-cli-mod
python3 -m venv venv
source venv/bin/activate
pip3 install torch torchvision torchaudio
pip3 install -r requirements.txt

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.