# AI Guard Vision Model Card

## Overview

AI Guard Vision is a Vision Transformer (ViT)-based model for image classification. Its primary objective is to accurately distinguish real photographs from AI-generated synthetic images, addressing the growing challenge of detecting manipulated or fake visual content and helping preserve trust and integrity in digital media.
## Model Summary

- Model Type: Vision Transformer (ViT), vit-base-patch16-224
- Objective: Real vs. AI-generated image classification
- License: Apache 2.0
- Fine-tuned From: google/vit-base-patch16-224
- Training Dataset: CIFAKE dataset
- Developer: Aashish Kumar, IIIT Manipur
## Applications & Use Cases

- Content Moderation: Identifying AI-generated images across media platforms.
- Digital Forensics: Verifying the authenticity of visual content for investigative purposes.
- Trust Preservation: Helping maintain the integrity of digital ecosystems by combating misinformation spread through fake images.
## How to Use the Model

```python
import torch
from PIL import Image
from pillow_heif import register_heif_opener, register_avif_opener
from transformers import AutoImageProcessor, ViTForImageClassification

# Enable Pillow support for HEIF/HEIC and AVIF inputs.
register_heif_opener()
register_avif_opener()

# Load the processor and model once, rather than on every call.
image_processor = AutoImageProcessor.from_pretrained("AashishKumar/AIvisionGuard-v2")
model = ViTForImageClassification.from_pretrained("AashishKumar/AIvisionGuard-v2")
model.eval()

def get_prediction(img):
    image = Image.open(img).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax turns the raw logits into probabilities so the scores are interpretable.
    probs = logits.softmax(dim=-1)
    top2 = probs.topk(2)
    return [
        {"label": model.config.id2label[i], "score": s}
        for i, s in zip(top2.indices.squeeze().tolist(), top2.values.squeeze().tolist())
    ]
```
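For example, calling the helper on a local file (the file name here is illustrative):

```python
# Any format Pillow can open works, including HEIF/AVIF via the openers above.
predictions = get_prediction("photo.jpg")
print(predictions)
# -> two {"label": ..., "score": ...} entries, highest-probability label first
```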
## Dataset Information

The model was fine-tuned on the CIFAKE dataset, which contains both real and AI-generated synthetic images:
- Real Images: Collected from the CIFAR-10 dataset.
- Fake Images: Generated using Stable Diffusion 1.4.
- Training Data: 100,000 images (50,000 per class).
- Testing Data: 20,000 images (10,000 per class).
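As a minimal sketch of loading this data for training, assuming the CIFAKE images are unpacked into one folder per class (the directory names below are assumptions, not part of the dataset release):

```python
from torchvision import datasets, transforms

# Resize to the ViT input resolution; per-channel normalization is omitted
# here and should come from the model's image processor in practice.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumed layout: cifake/train/REAL, cifake/train/FAKE, and likewise for test.
train_data = datasets.ImageFolder("cifake/train", transform=preprocess)
test_data = datasets.ImageFolder("cifake/test", transform=preprocess)
print(len(train_data), len(test_data))  # expected: 100000 and 20000
```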
## Model Architecture

- Transformer Encoder Layers: Use self-attention to relate image patches across the whole image.
- Positional Encodings: Preserve each patch's spatial position, since self-attention alone is order-invariant.
- Pretrained Weights: Pretrained on ImageNet-21k and fine-tuned on ImageNet 2012 for enhanced performance.
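For reference, vit-base-patch16-224 splits each 224x224 input into 16x16 pixel patches, so the token sequence the encoder attends over is easy to derive:

```python
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size   # 14
num_patches = patches_per_side ** 2           # 196 patch tokens
print(num_patches + 1)                        # 197 tokens once the [CLS] token is prepended
```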
## Why Vision Transformer?

- Scalability and Performance: Self-attention captures global image context rather than only local neighborhoods, and scales well with data and model size.
- State-of-the-Art Accuracy: Transformer-based classifiers have matched or surpassed traditional CNN models on many image benchmarks.
## Training Details

- Learning Rate: 1e-7
- Batch Size: 64
- Epochs: 100
- Training Time: 1 hr 36 min
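A minimal training-loop sketch with these hyperparameters, reusing train_data from the dataset sketch above (the AdamW optimizer is an assumption; the card does not state which optimizer was used):

```python
import torch
from torch.utils.data import DataLoader
from transformers import ViTForImageClassification

LR, BATCH_SIZE, EPOCHS = 1e-7, 64, 100  # hyperparameters from the list above

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=2,                  # REAL vs. FAKE
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)

model.train()
for epoch in range(EPOCHS):
    for pixel_values, labels in loader:
        loss = model(pixel_values=pixel_values, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```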
## Evaluation Metrics

The model was evaluated on the CIFAKE test set, with the following results:

- Accuracy: 92%
- F1 Score: 0.89
- Precision: 0.85
- Recall: 0.88

| Model          | Accuracy | F1-Score | Precision | Recall |
|----------------|----------|----------|-----------|--------|
| Baseline       | 85%      | 0.82     | 0.78      | 0.80   |
| Augmented      | 88%      | 0.85     | 0.83      | 0.84   |
| Fine-tuned ViT | 92%      | 0.89     | 0.85      | 0.88   |
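These metrics can be computed from the model's test-set predictions with scikit-learn (y_true and y_pred below are placeholder labels, not the actual evaluation outputs):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1]  # placeholder ground-truth labels (0 = real, 1 = fake)
y_pred = [0, 1, 1, 1]  # placeholder model predictions

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```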
## System Workflow

- Frontend: ReactJS
- Backend: Python Flask
- Database: PostgreSQL (Supabase)
- Model: Deployed via the PyTorch and TensorFlow frameworks
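A minimal sketch of the Flask side of this workflow, wrapping the get_prediction helper from above (the route and field names are illustrative, not the project's actual API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a multipart upload with the image under the "file" field;
    # the FileStorage object is file-like, so Image.open can read it directly.
    img = request.files["file"]
    return jsonify(get_prediction(img))

if __name__ == "__main__":
    app.run()
```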
## Strengths and Limitations

Strengths:

- High Accuracy: Reaches 92% accuracy on the CIFAKE test set in distinguishing real from synthetic images.
- Pretrained on ImageNet-21k: Allows for efficient transfer learning and robust generalization.

Limitations:

- Synthetic Image Diversity: The model may underperform on novel synthetic images that differ substantially from the training data, whose fakes come from Stable Diffusion 1.4 only.
- Data Bias: Like all machine learning models, its predictions may reflect biases present in the training data.
## Conclusion and Future Work

This model provides a highly effective tool for detecting AI-generated synthetic images and has promising applications in content moderation, digital forensics, and trust preservation. Future improvements may include:
- Hybrid Architectures: Combining transformers with convolutional layers for improved performance.
- Multimodal Detection: Incorporating additional modalities (e.g., metadata or contextual information) for more comprehensive detection.