---
license: mit
metrics:
- accuracy
- f1
---

# Model Card for Model ID

This repo contains models used to rate media into the categories PG, PG13, R, X, and XXX. These are single-modality models that are combined into an ensemble or a multimodal model. In the multimodal model, the single-modality models serve as processor components that create the inputs for a smaller Multilayer Perceptron (MLP).

## Model Details

### Model Description

The main model here is the multimodal model trained 7/22/24. It was trained using a weighted soft F1 loss with emphasis on class 0 (PG). The MultiModalProcessor uses finetuned resnet18, ViT, resnet50 with cross validation, prompt Bert, and prompt Roberta. The processor passes each modality through the appropriate models and returns the last hidden layer of each. These vectors are concatenated to create the input to the multimodal model's MLP.

Each model was trained on the same balanced, downsampled dataset found [here](https://civitai.com/models/544550/training-data-for-image-classification). Please note: this dataset contains some mislabeled data across each label. The resnet50-CV is the only model which may have different training/test set data due to the cross validation search; however, no data used for evaluation was found in the training/test sets. The evaluation data is a private dataset labeled by Wolfgang Black and Seb at CivitAI.

- **Developed by:** Wolfgang Black
- **Model type:** Multimodal
- **Language(s) (NLP):** English
- **Finetuned from model [optional]:** Various, due to the multimodal nature; however, only the MLP was truly trained from scratch.

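
The Model Description mentions a weighted soft F1 loss with emphasis on class 0 (PG), but the exact formulation is not given in this card. The following is a minimal, framework-free sketch of one common way to define such a loss, assuming softmax probabilities stand in for hard predictions and a per-class weight vector scales each class's contribution. The function names, the `eps` smoothing term, and the example weights are all hypothetical and are not taken from the training code; in real training the same arithmetic would be written with tensor operations so that it is differentiable.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_soft_f1_loss(logits, labels, class_weights, eps=1e-8):
    """Weighted mean of (1 - soft F1) across classes.

    logits: list of rows, one row of raw scores per sample
    labels: list of integer class ids, one per sample
    class_weights: one non-negative weight per class (hypothetical values)
    """
    probs = [softmax(row) for row in logits]
    loss = 0.0
    for c in range(len(class_weights)):
        # Soft counts: probabilities replace hard 0/1 predictions.
        tp = sum(p[c] for p, y in zip(probs, labels) if y == c)
        fp = sum(p[c] for p, y in zip(probs, labels) if y != c)
        fn = sum(1.0 - p[c] for p, y in zip(probs, labels) if y == c)
        soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
        loss += class_weights[c] * (1.0 - soft_f1)
    return loss / sum(class_weights)


# Hypothetical weights: class 0 (PG) counts twice as much as the others.
example_weights = [2.0, 1.0, 1.0, 1.0, 1.0]
example_logits = [[0.0] * 5 for _ in range(5)]  # uninformative predictions
example_labels = [0, 1, 2, 3, 4]
example_loss = weighted_soft_f1_loss(example_logits, example_labels, example_weights)
```

With weights like these, a mistake on a PG sample raises the loss more than the same mistake on another class, which is one way to read "emphasis on class 0".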
### Model Sources [optional]

#### ResNets
- **Link:** https://pytorch.org/vision/main/models/resnet.html
- Note: models were initialized with `weights = 'ImageNetV1'`

#### ViT
- **Repository:** https://huggingface.co/google/vit-base-patch16-224
- **Paper [optional]:** https://arxiv.org/abs/2010.11929

#### DistilBert
This model is the basis for promptBert.
- **Repository:** https://huggingface.co/distilbert/distilbert-base-uncased
- **Paper:** https://arxiv.org/abs/1910.01108

#### Roberta
This model is the basis for promptRoberta.
- **Repository:** https://huggingface.co/FacebookAI/roberta-large-mnli
- **Paper:** https://arxiv.org/abs/1907.11692

## Uses

These models should be used to classify generated images or text into movie ratings.

## How to Get Started with the Model

`Warning`: I did not include the code necessary for the Multimodal Config, Processor, or Model here. The code snippet below assumes users have that code.

```
import torch

from src.multimodal_model import MultimodalConfig, MultimodalModel, MultimodalProcessor

model_dir = ''  # where the multimodal directory is
config = MultimodalConfig.from_pretrained(model_dir)
# assumes composite models exist in directories as specified by config
model = MultimodalModel(config).from_pretrained(model_dir)
# assumes composite models exist in directories as specified by config
processor = MultimodalProcessor(models=config.models)

# `inputs` must be built with the processor before this point; the exact call is not shown here.
# assumes inputs as PIL.Image, text = None | str(prompt), tags = None | str(tags), label = None | str
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs['logits']
# index of the highest-scoring class, mapped to its rating label
prediction = model.config.id2label[torch.argmax(logits, dim=1).item()]
```

### Out-of-Scope Use

Currently, all models are untested on videos.

## Bias, Risks, and Limitations

Models are entirely finetuned (in the case of composite models) or trained (MLP) on generated images and may not work well on real images or non-digital media.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. This includes the poor labels for PG13/R due to personal bias in the dataset, as well as the fact that all training data is generated images.