license: mit
metrics:
- accuracy
- f1
Model Card for Model ID
This repo contains models that rate media into the categories PG, PG13, R, X, and XXX. They are single-modality models that can be combined into an ensemble or a multimodal model. In the multimodal model, the single-modality models act as processor components that create the inputs for a smaller multilayer perceptron (MLP).
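As a rough illustration of the ensemble option, the per-class probabilities from each single-modality rater can be averaged (soft voting). This is only a sketch: the function name `soft_vote`, the example probabilities, and the label order are assumptions, not the ensembling code used for this repo.

```python
LABELS = ["PG", "PG13", "R", "X", "XXX"]  # assumed label order


def soft_vote(prob_rows, weights=None):
    """Average per-model class probabilities; return (label, averaged probs)."""
    if not prob_rows:
        raise ValueError("need at least one model's probabilities")
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    avg = [
        sum(w * row[i] for w, row in zip(weights, prob_rows)) / total
        for i in range(len(LABELS))
    ]
    best = max(range(len(LABELS)), key=avg.__getitem__)
    return LABELS[best], avg


# Made-up outputs from an image rater and a prompt rater for one sample.
image_probs = [0.10, 0.20, 0.60, 0.05, 0.05]
text_probs = [0.05, 0.50, 0.30, 0.10, 0.05]
label, avg = soft_vote([image_probs, text_probs])
print(label, [round(p, 3) for p in avg])
```

Passing `weights` lets one rater count more than another, for example when one modality is known to be more reliable on the evaluation data.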
Model Details
Model Description
The main model here is the multimodal model trained on 7/22/24. It was trained using a weighted soft F1 loss with emphasis on class 0 (PG). It uses finetuned resnet18, ViT, resnet50 with cross validation, prompt Bert, and prompt Roberta in the MultimodalProcessor. The processor passes each modality through the corresponding models and returns the last hidden layer of each. These vectors are concatenated to form the input to the multimodal model's MLP.
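The two ideas in this paragraph, concatenating per-model feature vectors into an MLP and training with a class-weighted soft F1 loss, might look roughly like the PyTorch sketch below. Everything in it is an assumption made for illustration: the names `FusionMLP` and `weighted_soft_f1_loss`, the hidden size, the feature dimensions, and the class weights are hypothetical and are not taken from the repo's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionMLP(nn.Module):
    """Small MLP head over concatenated per-model feature vectors."""

    def __init__(self, feature_dims, hidden_dim=256, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(feature_dims), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, features):
        # features: list of (batch, dim_i) tensors, one per single-modality model
        return self.net(torch.cat(features, dim=1))


def weighted_soft_f1_loss(logits, targets, class_weights, eps=1e-8):
    """1 minus the weighted mean of per-class soft F1 scores."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=logits.shape[1]).float()
    # Probabilities act as soft counts, so the score stays differentiable.
    tp = (probs * onehot).sum(dim=0)
    fp = (probs * (1 - onehot)).sum(dim=0)
    fn = ((1 - probs) * onehot).sum(dim=0)
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    weights = class_weights / class_weights.sum()
    return 1 - (weights * soft_f1).sum()


# Hypothetical feature sizes for resnet18, ViT, resnet50, promptBert, promptRoberta.
dims = [512, 768, 2048, 768, 1024]
torch.manual_seed(0)
features = [torch.randn(4, d) for d in dims]  # random stand-ins for hidden states
targets = torch.tensor([0, 1, 2, 0])
head = FusionMLP(dims)
logits = head(features)
class_weights = torch.tensor([2.0, 1.0, 1.0, 1.0, 1.0])  # assumed emphasis on class 0 (PG)
loss = weighted_soft_f1_loss(logits, targets, class_weights)
loss.backward()
```

Because the soft F1 is built from probabilities rather than hard predictions, gradients flow back into the MLP, and raising the weight on class 0 makes errors on PG cost more than errors on the other classes.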
Each model was trained on the same balanced, downsampled dataset found here. Please note: this dataset contains some mislabeled data in every label. The resnet50-CV is the only model that may have seen different training/test data because of the cross-validation search; however, no data used for evaluation appeared in the training/test sets. The evaluation data is a private dataset labeled by Wolfgang Black and Seb at CivitAI.
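For readers unfamiliar with balanced downsampling, the idea is to cut every label down to the size of the rarest one. The snippet below is a minimal sketch under that assumption; the function name `balanced_downsample` and the toy file names are hypothetical, and the actual dataset may have been built differently.

```python
import random
from collections import defaultdict


def balanced_downsample(samples, seed=0):
    """Keep the same number of samples per label, set by the rarest label."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    n_keep = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_keep))
    rng.shuffle(balanced)
    return balanced


# Toy data: the file names are placeholders.
toy = [(f"img_{i}.png", "PG") for i in range(6)]
toy += [(f"img_{i}.png", "R") for i in range(6, 9)]
toy += [(f"img_{i}.png", "XXX") for i in range(9, 11)]
balanced = balanced_downsample(toy)
counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```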
- Developed by: Wolfgang Black
- Model type: Multimodal
- Language(s) (NLP): English
- Finetuned from model [optional]: Various, due to the multimodal nature; however, only the MLP was truly trained from scratch.
Model Sources [optional]
ResNets
- Link - https://pytorch.org/vision/main/models/resnet.html
- Note: models were initialized with weights = 'ImageNetV1'
ViT
- Repository: https://huggingface.co/google/vit-base-patch16-224
- Paper [optional]: https://arxiv.org/abs/2010.11929
DistilBert
This model is the basis for promptBert
- Repository: https://huggingface.co/distilbert/distilbert-base-uncased
- Paper: https://arxiv.org/abs/1910.01108
Roberta
This model is the basis for promptRoberta
- Repository: https://huggingface.co/FacebookAI/roberta-large-mnli
- Paper: https://arxiv.org/abs/1907.11692
Uses
These models should be used to classify generated images or text into movie-style ratings.
How to Get Started with the Model
Warning: I did not include the code necessary for the Multimodal Config, Processor, or Model here. The code snippet below assumes users have that code.
import torch
from src.multimodal_model import MultimodalConfig, MultimodalModel, MultimodalProcessor

model_dir = ''  # where the multimodal directory is
config = MultimodalConfig.from_pretrained(model_dir)
# both lines below assume the composite models exist in the directories specified by config
model = MultimodalModel(config).from_pretrained(model_dir)
processor = MultimodalProcessor(models=config.models)
model.eval()
# `inputs` must be built with the processor beforehand; it assumes
# a pil.Image, text = None | str(prompt), tags = None | str(tags), label = None | str
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs['logits']
# map the highest-scoring class index to its rating label
prediction = model.config.id2label[torch.argmax(logits, dim=1).item()]
Out-of-Scope Use
Currently, all models are untested on videos.
Bias, Risks, and Limitations
Models are entirely finetuned (in the case of composite models) or trained (MLP) on generated images and may not work well on real images or non-digital media.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. These include poor labels for PG13/R due to personal bias in the dataset labeling, and the fact that all training data consists of generated images.
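One way for downstream users to check how much the PG13/R label noise affects their own data is to compute F1 per class rather than a single aggregate score. The sketch below does this in plain Python; the function name `per_class_f1`, the label order, and the example labels and predictions are all made up for illustration and are not evaluation results for these models.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


LABELS = ["PG", "PG13", "R", "X", "XXX"]  # assumed label order
# Made-up labels and predictions, only to show the call.
y_true = ["PG", "PG", "PG13", "R", "R", "X", "XXX", "PG13"]
y_pred = ["PG", "PG", "R", "PG13", "R", "X", "XXX", "PG13"]
scores = per_class_f1(y_true, y_pred, LABELS)
for label in LABELS:
    print(f"{label}: {scores[label]:.2f}")
```

A noticeably lower score on PG13 and R than on the other classes would be consistent with the labeling bias described above.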