crumb committed on
Commit
1fb760d
1 Parent(s): 3e4e2ea

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -9,7 +9,7 @@ license: creativeml-openrail-m
 
  # Neopian-Diffusion
 
- Stable Diffusion models, starting with [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), trained on images extracted from gifs from https://www.neopets.com/funimages.phtml. CLIP ViT-B/32 (OpenAI) was used to filter the best matching frame of the GIF for every given caption/GIF pair. The frame with the minimum spherical distance was chosen and saved for training. In total this amounts to 1950 images around 100x100px. The DreamBooth models were finetuned at 448x448px on a Colab T4 with the term "low-resolution" concatenated onto 1/3 of prompts, to hopefully combat artifacting in the final results (see this link for a hypothesis from someone on Discord about using negative terms while training Textual Inversions https://cdn.discordapp.com/attachments/1008246088148463648/1041538692432527470/image.png).
+ Stable Diffusion models, starting with [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), trained on images extracted from gifs from https://www.neopets.com/funimages.phtml. CLIP ViT-B/32 (OpenAI) was used to filter the best matching frame of the GIF for every given caption/GIF pair. The frame with the minimum spherical distance was chosen and saved for training. In total this amounts to 1950 images around 100x100px. The DreamBooth models were finetuned on a Colab T4 with the term "low-resolution" concatenated onto prompts at varying weights, to hopefully combat artifacting in the final results (see this link for a hypothesis from someone on Discord about using negative terms while training Textual Inversions https://cdn.discordapp.com/attachments/1008246088148463648/1041538692432527470/image.png).
 
  Example chosen frame of GIF from CLIP
  | Caption | Unprocessed GIF | Chosen Frame |
@@ -18,7 +18,7 @@ Example chosen frame of GIF from CLIP
 
  ## Training Details
 
- Stage 1 (0-12k steps) The text encoder was trained along with the UNet at half precision for 15% of the total 8,000 steps (1,200 steps), and then the UNet was trained alone for the rest. I used a polynomial learning rate decay starting at 2e-6 (the default in fast-DreamBooth).
+ Stage 1 (0-12k steps) The text encoder was trained along with the UNet at half precision for 15% of the total 8,000 steps (1,200 steps), and then the UNet was trained alone for the rest. I used a polynomial learning rate decay starting at 2e-6 (the default in fast-DreamBooth). "low quality" concatenated onto 1/3 of the prompts.
 
 
  ## How to use with `diffusers` library (section from [openjourney](https://huggingface.co/openjourney/openjourney))
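
Reviewer note: the README paragraph touched by this commit says the training frame for each caption/GIF pair is the one with the minimum spherical distance to the caption. The sketch below illustrates that selection step only. It is not the author's code. It assumes the caption and frame embeddings have already been computed (for example by CLIP ViT-B/32) and are non-zero vectors, and it takes "spherical distance" to mean the angle between two embeddings. Other monotonic variants of that angle would pick the same frame. The names `spherical_distance` and `choose_frame` are hypothetical.

```python
import math
from typing import Sequence


def spherical_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors.

    Assumption: this is what the README means by "spherical distance".
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp so floating-point rounding cannot push the cosine outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def choose_frame(
    caption_embedding: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
) -> int:
    """Return the index of the frame closest to the caption on the unit sphere."""
    distances = [spherical_distance(caption_embedding, f) for f in frame_embeddings]
    return min(range(len(distances)), key=distances.__getitem__)


# Toy 3-dimensional embeddings standing in for real CLIP outputs.
caption = [1.0, 0.0, 0.0]
frames = [
    [0.0, 1.0, 0.0],  # orthogonal to the caption
    [0.9, 0.1, 0.0],  # nearly aligned with the caption
    [0.5, 0.5, 0.5],  # partially aligned
]
best = choose_frame(caption, frames)
print(best)
```

With real data, `frame_embeddings` would hold one image embedding per GIF frame and `caption_embedding` the text embedding of the caption, both taken from the same CLIP model.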