doohickey/neopian-diffusion

crumb committed on Nov 19, 2022

Commit 1fb760d • 1 Parent(s): 3e4e2ea

Update README.md

Files changed (1)

README.md +2 -2


@@ -9,7 +9,7 @@ license: creativeml-openrail-m
 # Neopian-Diffusion
-Stable Diffusion models, starting with [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), trained on images extracted from gifs from https://www.neopets.com/funimages.phtml. CLIP ViT-B/32 (OpenAI) was used to filter the best matching frame of the GIF for every given caption/GIF pair. The frame with the minimum spherical distance was chosen and saved for training. In total this amounts to 1950 images around 100x100px. The DreamBooth models were finetuned at 448x448px on a Colab T4 with the term "low-resolution" concatenated onto 1/3 of prompts, to hopefully combat artifacting in the final results (see this link for a hypothesis from someone on Discord about using negative terms while training Textual Inversions https://cdn.discordapp.com/attachments/1008246088148463648/1041538692432527470/image.png).
+Stable Diffusion models, starting with [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), trained on images extracted from gifs from https://www.neopets.com/funimages.phtml. CLIP ViT-B/32 (OpenAI) was used to filter the best matching frame of the GIF for every given caption/GIF pair. The frame with the minimum spherical distance was chosen and saved for training. In total this amounts to 1950 images around 100x100px. The DreamBooth models were finetuned on a Colab T4 with the term "low-resolution" concatenated onto prompts at varying weights, to hopefully combat artifacting in the final results (see this link for a hypothesis from someone on Discord about using negative terms while training Textual Inversions https://cdn.discordapp.com/attachments/1008246088148463648/1041538692432527470/image.png).
 Example chosen frame of GIF from CLIP
 | Caption | Unprocessed GIF | Chosen Frame |
@@ -18,7 +18,7 @@ Example chosen frame of GIF from CLIP
 ## Training Details
-Stage 1 (0-12k steps) The text encoder was trained along with the UNet at half precision for 15% of the total 8,000 steps (1,200 steps), and then the UNet was trained alone for the rest. I used a polynomial learning rate decay starting at 2e-6 (the default in fast-DreamBooth).
+Stage 1 (0-12k steps) The text encoder was trained along with the UNet at half precision for 15% of the total 8,000 steps (1,200 steps), and then the UNet was trained alone for the rest. I used a polynomial learning rate decay starting at 2e-6 (the default in fast-DreamBooth). "low quality" concatenated onto 1/3 of the prompts.
 ## How to use with `diffusers` library (section from [openjourney](https://huggingface.co/openjourney/openjourney))
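
Reviewer note on the frame-selection step in the README text: the README says the frame with the minimum spherical distance to the caption was kept, but it does not give the formula or the code. The sketch below is one plausible reading, not the author's implementation. It assumes the caption and each GIF frame have already been embedded (for example by CLIP ViT-B/32) into plain lists of floats, and it uses a spherical distance formulation common in CLIP-guidance notebooks. The function names `spherical_distance` and `pick_best_frame` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spherical_distance(a, b):
    """Assumed formulation: 2 * arcsin(chord / 2) ** 2 on unit vectors.

    The README does not state which formula was used; this one grows
    monotonically with the angle between the two embeddings.
    """
    a, b = normalize(a), normalize(b)
    chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Clamp to guard against rounding pushing the argument above 1.
    return 2 * math.asin(min(chord / 2, 1.0)) ** 2


def pick_best_frame(caption_embedding, frame_embeddings):
    """Return the index of the frame closest to the caption."""
    distances = [spherical_distance(caption_embedding, f) for f in frame_embeddings]
    return min(range(len(distances)), key=distances.__getitem__)


# Toy embeddings standing in for real CLIP outputs (illustrative values only).
caption = [1.0, 0.0, 0.0]
frames = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
best = pick_best_frame(caption, frames)
```

Because this distance depends only on the angle between the vectors, picking its minimum selects the same frame as picking the maximum cosine similarity.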