---
license: creativeml-openrail-m
---

This is the beta version of the yama-no-susume character model (ヤマノススメ, known in English as Encouragement of Climb).
Unlike most models out there, this one can generate **multi-character scenes**, not just images of a single character.
Of course, the results are still hit-or-miss, but it is possible to get **as many as 5 characters** right in one shot, and otherwise you can always rely on inpainting.
Here are two examples (the first one done with some inpainting):

_Coming soon_


### Dataset description

The dataset contains around 40K images, with the following composition:
- 11424 anime screenshots from the four seasons of the anime
- 726 fan art images
- ~30K customized regularization images

The model is trained with a specific weighting scheme to balance the different concepts.
For example, the three categories above have weights of 0.3, 0.2, and 0.5 respectively.
Each category is itself split into many sub-categories in a hierarchical way.
For more details on the data preparation process, please refer to https://github.com/cyber-meow/anime_screenshot_pipeline
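
To make the weighting concrete, here is a minimal sketch of how such category weights could be turned into per-image repeat multipliers, assuming a flat set of categories (the real scheme is hierarchical); the numbers come from the composition above, but the function and names are illustrative rather than the pipeline's actual code.

```python
# Minimal sketch: turn category weights into per-image repeat multipliers.
# Counts and weights come from the composition above; the function itself
# is illustrative, not the actual pipeline code.

def repeat_multipliers(weights, counts):
    """Multiplier so that each category fills its target share of an epoch."""
    total = sum(counts.values())
    return {cat: weights[cat] * total / counts[cat] for cat in weights}

counts = {"screenshots": 11424, "fanart": 726, "regularization": 30000}
weights = {"screenshots": 0.3, "fanart": 0.2, "regularization": 0.5}

print(repeat_multipliers(weights, counts))
# {'screenshots': ~1.11, 'fanart': ~11.61, 'regularization': ~0.70}
```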


### Training Details

#### Trainer
The model was trained with [EveryDream1](https://github.com/victorchall/EveryDream-trainer), as
EveryDream seems to be the only trainer out there that supports sample weighting (through the use of `multiply.txt`).
Note that for future training it makes sense to migrate to [EveryDream2](https://github.com/victorchall/EveryDream2trainer).
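
For illustration, here is a sketch of how the multipliers from the previous snippet could be written out for EveryDream, assuming one folder per category (the folder names are hypothetical); as I understand it, EveryDream reads a `multiply.txt` file in each image folder and repeats that folder's images accordingly.

```python
from pathlib import Path

# Hypothetical folder layout; the multiplier values are the ones computed
# in the dataset sketch above. EveryDream picks up `multiply.txt` from each
# folder and scales how often that folder's images are repeated.
multipliers = {
    "data/screenshots": 1.11,
    "data/fanart": 11.61,
    "data/regularization": 0.70,
}

for folder, mult in multipliers.items():
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)
    (path / "multiply.txt").write_text(str(mult))
```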

#### Hardware and cost
The model was trained on RunPod with an A6000 and cost me around 80 dollars.
However, I estimate that a model of similar quality could be trained for less than 20 dollars on RunPod.

#### Hyperparameter specification

- The model was first trained for 18000 steps, at batch size 8, lr 1e-6, resolution 640, and a conditional dropping rate of 15%.
- After this, I modified the captions a little and trained the model for another 22000 steps, at batch size 8, lr 1e-6, resolution 704, and a conditional dropping rate of 15%.

Note that because the weighting scheme translates into a different multiplier for each image,
the notions of repeats and epochs have a rather different meaning here.
For example, depending on the weighting, an epoch contains 400K~600K images (some images are used multiple times),
so I did not even finish an entire epoch with the 40000 steps at batch size 8.
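
For concreteness, the arithmetic behind this claim:

```python
steps = 18000 + 22000    # both training runs combined
batch_size = 8
images_seen = steps * batch_size
print(images_seen)       # 320000, i.e. below the 400K~600K weighted epoch size
```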

### Failures

I tried several things with this model (this is why I trained for so long), but most of them failed.

- I put the number of people at the beginning of the captions, but after 40000 steps the model still cannot count
(it can generate anywhere from 3 to 5 people when prompted with `3people`; see the caption sketch after this list).
- I used some tokens to describe face positions within a 5x5 grid, but the model did not learn anything about these tokens.
I think this is due to either 1) face position being too abstract a concept to learn, 2) data imbalance, as I did not balance my training for this, or 3) the captions not being focused enough on these concepts (they are much longer and contain other information).
- As mentioned, the model can generate multi-character scenes, but the success rate drops as the number of characters in the scene increases.
Character bleeding remains a hard problem to solve.
- The model is trained with 5% weight for hand images, but I doubt this helps in any way.
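
For illustration, here is what such captions could look like; apart from the `3people`-style count token described above, the grid-token spelling and the helper function are hypothetical, not the actual training format.

```python
# Illustrative caption builder for the two auxiliary token schemes above.
# The grid-token spelling "fp<row><col>" is hypothetical; only the
# "<n>people" count token is taken from the description above.

def build_caption(characters, face_cells, tags):
    count = f"{len(characters)}people"           # people count comes first
    grid = [f"fp{r}{c}" for r, c in face_cells]  # 5x5 grid cells (row, col)
    return ", ".join([count, *grid, *characters, *tags])

print(build_caption(["aoi", "hinata"], [(1, 1), (1, 3)], ["outdoors", "mountain"]))
# -> "2people, fp11, fp13, aoi, hinata, outdoors, mountain"
```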

Actually, I doubt whether the last 22000 steps really improved the model.
This is how I get my $20 estimate, taking into account that we can simply train at resolution 512 on a 3090 with ED2.


### More Example Generations

_Coming soon_