alea31415 committed
Commit 4997da9
1 Parent(s): 08f4713

Update README.md

Files changed (1):
  1. README.md +21 -10
README.md CHANGED
@@ -2,6 +2,8 @@
license: creativeml-openrail-m
---

+ **General advice: Having a good dataset is more important than anything else**
+
### Trigger words

```
@@ -14,23 +16,24 @@ For `0324_all_aniscreen_tags`, I accidentally tag all the character images with
For `0325_aniscreen_fanart_styles`, things are done correctly (anime screenshots tagged as `aniscreen`, fanart tagged as `fanart`).


- ### Settings
+ ### Setting

Default settings are
- loha net dim 8, conv dim 4, alpha 1
- - lr 2e-4 constant scheduler throuout
+ - lr 2e-4 constant scheduler throughout
- Adam8bit
- resolution 512
- clip skip 1

The file names indicate how each setting deviates from this default setup.
- The configuration json files can otherwsie be found in the `config` subdirectories that lies in each folder.
+ The configuration json files can otherwise be found in the `config` subdirectory of each folder.
However, some experiments concern the effect of tags, for which I regenerate the txt files; in that case the difference cannot be seen from the configuration files.
For now this concerns `05tag`, for which each tag is only kept with probability 0.5 (see the sketch after this hunk).

### Some observations

- For a thorough comparaison please refer to the `generated_samples` folder.
+ For a thorough comparison please refer to the `generated_samples` folder.
+

#### Captioning

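The regeneration script for the `05tag` runs is not included here; below is a minimal sketch of what per-tag dropout at probability 0.5 could look like, assuming comma-separated booru-style tags in sidecar `.txt` caption files. The folder layout and the choice to always keep the leading trigger tag are my assumptions, not the author's exact procedure.

```python
import random
from pathlib import Path

def drop_tags(caption: str, p_keep: float = 0.5) -> str:
    """Keep each comma-separated tag with probability p_keep.

    The first tag is always kept, assuming it is the trigger word.
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for i, t in enumerate(tags) if i == 0 or random.random() < p_keep]
    return ", ".join(kept)

# Regenerate every sidecar caption file under the dataset folder.
for txt in Path("dataset").rglob("*.txt"):
    txt.write_text(drop_tags(txt.read_text()))
```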
@@ -44,13 +47,13 @@ Having all the tags (bottom three rows) remove the traits from subjects if these

#### The effect of style images on characters

- I do beleive regularization images are important, far more important than tweaking any hyperparameters. They slow down training but also make sure that the undesired aspect are less baked into the model if we have images of other types, even if they are not for the subjects we train for.
+ I do believe regularization images are important, far more important than tweaking any hyperparameters. They slow down training, but they also make sure that undesired aspects are less baked into the model when we have images of other types, even if they are not of the subjects we train for.

Comparing the models trained with and without style images, we can see that models trained with general style images have less anime style baked in. The difference is particularly clear for Tilty, who only has anime screenshots for training.

![00103-20230327084923](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00103-20230327084923.png)

- On the other hand, the default clothes seem to be better trained when there is no regularization image. While this may seem beneficial, it is worth noticing that I keep all the output tags. Therefore, in a sense we only want to get the outputs when we prompt them explicitly. The magic of having the trigger words to fill in what is not in caption seems to be more pronouncing when we have regularization images. In any case, this magic will not work forever as we will eventually start overfitting. The following image show that we get images that are much closer after putting clothes in prompts.
+ On the other hand, the default clothes seem to be better trained when there are no regularization images. While this may seem beneficial, it is worth noting that I keep all the outfit tags. Therefore, in a sense we only want to get a certain outfit when we prompt it explicitly. The magic of the trigger words filling in what is not in the caption seems to be more pronounced when we have regularization images. In any case, this magic will not work forever, as we will eventually start overfitting. The following image shows that we get much closer outputs after putting the clothes in the prompts.

![00105-20230327090703](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00105-20230327090703.png)

@@ -72,7 +75,7 @@ For example, if you want better background it can be simpler to switch the model

This is one of the most debated topics in LoRA training.
Both the original paper and the initial implementation of LoRA for SD suggest using quite small ranks.
- However, the 128 dim/alpha became the unfortunate default in many implementations for some times, which resulted in files with more than 100mb.
+ However, 128 dim/alpha became the unfortunate default in many implementations for some time, which resulted in files of more than 100 MB.
Ever since LoCon was introduced, we have again advocated the use of smaller dimensions, with alpha defaulting to 1.

As for LoHa, I have been insisting that the values that I am using here (net dim 8, conv dim 4, alpha 1) should be more than enough in most cases.
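For concreteness, the defaults above would be expressed roughly as follows when training with kohya-ss sd-scripts and the LyCORIS network module. This is a sketch, not the author's actual config: the flag names follow recent sd-scripts versions and may differ in older ones. Note also that LoRA-style updates are scaled by alpha/dim, so alpha 1 with net dim 8 means a 1/8 scale on the learned update.

```python
# Sketch of the default setup as kohya-ss sd-scripts (train_network.py)
# arguments; the json files under `config` presumably encode the same values.
default_args = {
    "network_module": "lycoris.kohya",
    "network_dim": 8,                  # loha net dim
    "network_alpha": 1,                # update scaled by alpha/dim = 1/8
    "network_args": ["conv_dim=4", "conv_alpha=1", "algo=loha"],
    "learning_rate": 2e-4,
    "lr_scheduler": "constant",
    "optimizer_type": "AdamW8bit",     # "Adam8bit" in the list above
    "resolution": "512,512",
    "clip_skip": 1,
}
```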
@@ -123,8 +126,16 @@ Since the outputs of Dadaptation seems to change more over time, I guess it may

It is often suggested to set the text encoder learning rate to be smaller than that of the unet.
This of course makes training slower, while the benefit is hard to evaluate.
- In one experiment I half the text encoder learning rate and train the model two times longer.
- After spending some time here are two situations that reveal the potential benefit of this practice.
+ To begin, let me show how it actually slows down the training process (see the sketch after this hunk for how the two rates are set). Contrary to common belief, it affects style training more than character training. I halve the text encoder learning rate for the following experiments.
+
+ - This is what you get for characters. If the trigger words are put in properly, you barely see any difference, not to mention the single-character training that most people focus on. The interesting point, however, is the blending between Mahiro and Mihari caused by the shared `Oyama` in their trigger words. A larger text encoder learning rate helps reduce this blending faster.
+ ![00106-20230327112316](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00106-20230327112316.png)
+
+ - For styles, you can see that training with a lower text encoder learning rate is indeed slower (the largest difference shows up for ke-ta and momoko).
+ ![00107-20230327112855](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00107-20230327112855.png)
+ ![00017-20230325211523](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00017-20230325211523.png)
+
+ To compensate, I train the model twice as long in total. After spending some time, here are two situations that reveal the potential benefit of a smaller text encoder learning rate.

- In my training set I have anime screenshots, tagged with `aniscreen`, and fanarts, tagged with `fanart`.
Although they are balanced to have the same weight, the consistency of anime screenshots seems to drive the characters toward this style by default.
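Mechanically, a smaller text encoder learning rate just means two optimizer parameter groups (kohya-ss sd-scripts exposes this as `--unet_lr` and `--text_encoder_lr`). A minimal PyTorch sketch, with plain AdamW standing in for the Adam8bit used in these runs and dummy modules in place of the real networks:

```python
import torch
from torch import nn

# Dummy stand-ins for the trainable unet / text encoder parameters.
unet_like = nn.Linear(768, 768)
text_encoder_like = nn.Linear(768, 768)

# Two parameter groups: the text encoder trains at half the unet rate.
optimizer = torch.optim.AdamW([
    {"params": unet_like.parameters(), "lr": 2e-4},
    {"params": text_encoder_like.parameters(), "lr": 1e-4},
])
```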
@@ -139,7 +150,7 @@ This aspect is difficult to test, but it seems to be confirmed by this "umbrella

![00083-20230327015201](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00083-20230327015201.png)

- There may be some disadvantages as well but his needs to be further explored.
+ There may be other disadvantages besides slower training, but this needs to be further explored.
In any case, I still believe that if we want the best result we should avoid text encoder training completely and do [pivotal tuning](https://github.com/cloneofsimo/lora/discussions/121) instead.
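Pivotal tuning replaces text encoder fine-tuning with learning new token embeddings (textual inversion) while the network is trained on top of a frozen text encoder. As a rough illustration of the embedding half only, here is a sketch using Hugging Face transformers; the model name, token string, and learning rate are illustrative, not taken from this repository:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14"  # the SD 1.x text encoder
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)

# Register a new trigger token and grow the embedding table to match.
tokenizer.add_tokens(["<my-char>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<my-char>")

# Freeze everything except the embedding table.
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)
original = embeddings.weight.detach().clone()
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)

# In the training loop, after loss.backward() and optimizer.step(),
# restore every row except the new token's so only it is learned:
with torch.no_grad():
    keep = torch.ones(embeddings.weight.shape[0], dtype=torch.bool)
    keep[new_id] = False
    embeddings.weight[keep] = original[keep]
```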