---
license: creativeml-openrail-m
---

### Trigger words

```
Anisphia, Euphyllia, Tilty, OyamaMahiro, OyamaMihari
by onono imoko, by momoko, by mochizuki kei, by kantoku, by ke-ta
aniscreen, fanart
```

For `0324_all_aniscreen_tags`, I accidentally tagged all the character images with `aniscreen`.  
For `0325_aniscreen_fanart_styles`, the tagging is correct (anime screenshots are tagged `aniscreen`, fanart is tagged `fanart`).


### Settings

Default settings are
- loha net dim 8, conv dim 4, alpha 1
- lr 2e-4 constant scheduler throughout
- Adam8bit
- resolution 512
- clip skip 1

The file names indicate how each setting differs from this default setup.
The configuration json files can otherwise be found in the `config` subdirectory of each folder.
However, some experiments concern the effect of tags. For these I regenerate the txt files, so the difference cannot be seen from the configuration file.
For now this concerns only `05tag`, for which tags are used with probability 0.5.
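
One way to read "tags are used with probability 0.5" is that each tag is kept independently with probability 0.5 when the txt file is regenerated. The sketch below follows that reading. The function name `drop_tags`, the `protected` set, and the comma-separated caption format are assumptions for illustration, not the script actually used for `05tag`.

```python
import random


def drop_tags(caption, keep_prob=0.5, protected=(), rng=random):
    """Keep each comma-separated tag with probability keep_prob.

    Tags listed in `protected` (for example trigger words) are always kept.
    This is an illustrative assumption, not the original script.
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tags if t in protected or rng.random() < keep_prob]
    return ", ".join(kept)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so regenerated captions are reproducible
    caption = "Anisphia, aniscreen, 1girl, smile, outdoors"  # hypothetical caption
    print(drop_tags(caption, protected={"Anisphia", "aniscreen"}, rng=rng))
```

Whether trigger words should be protected from dropping is a design choice; here they are kept so that the concept stays attached to its trigger word.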

### Some observations

For a thorough comparison please refer to the `generated_samples` folder.

#### Captioning

The dataset is, in general, the most important factor of all.
The common wisdom that we should prune any tag whose trait we want attached to the trigger word is exactly the way to go.
Using no tags at all (top three rows) is terrible, especially for style training.
Keeping all the tags (bottom three rows) removes the traits from the subjects if these tags are not used during sampling (not completely true but more or less the case, see also the discussion below).

![00066-20230326090858](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00066-20230326090858.png)
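
The pruning rule described above can be sketched as follows. The helper `prune_caption` and the example tags are hypothetical: the idea is only that tags describing traits the trigger word should absorb are removed from the caption, while everything else stays.

```python
def prune_caption(caption, trigger, absorbed):
    """Put the trigger word first and drop the tags it should absorb.

    `absorbed` is the set of tags (hair color, eye color, ...) that we want
    the trigger word to carry on its own. All names here are illustrative.
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tags if t != trigger and t not in absorbed]
    return ", ".join([trigger] + kept)


if __name__ == "__main__":
    caption = "1girl, blonde hair, green eyes, smile"  # hypothetical tagger output
    print(prune_caption(caption, "Anisphia", {"blonde hair", "green eyes"}))
```

Tags that are kept (such as outfits) remain promptable at sampling time, which matches the behavior discussed in the next section.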


#### The effect of style images on characters

I do believe regularization images are important, far more important than tweaking any hyperparameters. They slow down training, but when we have images of other types they also make sure that undesired aspects are less baked into the model, even if these images are not of the subjects we train for.

Comparing the models trained with and without style images, we can see that models trained with general style images have less anime style baked in. The difference is particularly clear for Tilty, who only has anime screenshots for training.  

![00103-20230327084923](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00103-20230327084923.png)

On the other hand, the default clothes seem to be better trained when there are no regularization images. While this may seem beneficial, it is worth noting that I keep all the outfit tags. Therefore, in a sense we only want to get the outfits when we prompt them explicitly. The magic of having the trigger words fill in what is not in the caption seems to be more pronounced when we have regularization images. In any case, this magic will not work forever, as we will eventually start overfitting. The following image shows that we get images that are much closer after putting the clothes in the prompts.

![00105-20230327090703](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00105-20230327090703.png)

In any case, if your regularization images are properly tagged with a lot of concepts, you always have the benefit of being able to combine these concepts with the main things you train for.


#### Training resolution

The most prominent benefit of training at higher resolution is that it helps generate more complex and detailed backgrounds.
Chances are that you also get more details in the outfit, the pupils, etc.
However, training at higher resolution is quite time-consuming, and most of the time it is probably not worth it.
For example, if you want better backgrounds it can be simpler to switch the model (unless, say, you are actually training a background LoRa).

![00045-20230326045748](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00045-20230326045748.png)
![00044-20230326044731](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00044-20230326044731.png)


#### Network dimension and alpha

This is one of the most debated topics in LoRa training.  
Both the original paper and the initial implementation of LoRa for SD suggest using quite small ranks.  
However, 128 dim/alpha unfortunately became the default in many implementations for some time, which resulted in files larger than 100 MB.  
Ever since LoCon was introduced, we have again advocated the use of smaller dimensions, with alpha defaulting to 1.

As for LoHa, I have been insisting that the values I am using here (net dim 8, conv dim 4, alpha 1) should be more than enough in most cases.  
These values do not come from nowhere. In fact, after some analysis, it turns out that almost every model fine-tuned from SD has the information of its weight difference matrix concentrated in fewer than 64 ranks (this applies even to WD 1.5).  
Therefore, 64 should be enough, provided we can get to the good point.  
Nonetheless, optimization is quite tricky. Changing the dimension not only increases expressive power but also modifies the optimization landscape. It is exactly for the latter reason that alpha was introduced.  
It turns out that it might be easier to get better results with a larger dimension, which explains the success of compression after training.  
In fact, for my 60K umamusume dataset I have a LoCon extracted from a fine-tuned model, but I failed to train a LoCon on it directly.
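
The kind of rank analysis mentioned above could look like the following minimal sketch. It is not the analysis that was actually run: the function name `ranks_needed`, the energy criterion (fraction of squared singular values), and the 0.99 threshold are all assumptions chosen for illustration.

```python
import numpy as np


def ranks_needed(w_tuned, w_base, energy=0.99):
    """Smallest k such that the top-k singular values of (w_tuned - w_base)
    hold `energy` of its squared Frobenius norm. Threshold is an assumption."""
    delta = np.asarray(w_tuned, dtype=np.float64) - np.asarray(w_base, dtype=np.float64)
    s = np.linalg.svd(delta, compute_uv=False)  # singular values, descending
    total = np.sum(s ** 2)
    if total == 0.0:
        return 0  # identical weights, nothing to represent
    cumulative = np.cumsum(s ** 2) / total
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(320, 320))
    # Synthetic "fine-tuned" weight: base plus a rank-16 update.
    tuned = base + rng.normal(size=(320, 16)) @ rng.normal(size=(16, 320))
    print(ranks_needed(tuned, base))
```

For convolution weights, the 4D tensor would first need to be reshaped to a 2D matrix (for example output channels by everything else) before applying the same function.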

To clarify all this, I test the following three setups for LoHa with net dim 32 and conv dim 16
- lr 2e-4, alpha 1
- lr 5e-4, alpha 1
- lr 2e-4, net alpha 16, conv alpha 8
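
To make the three setups easier to compare, the sketch below computes the multiplier applied to the learned update, assuming the usual LoRA-style convention where the update is scaled by `alpha / dim`. The helper name `effective_scale` is hypothetical, and the convention itself is an assumption about the training code used here.

```python
def effective_scale(alpha, dim):
    """Multiplier on the low-rank update under the assumed alpha / dim rule."""
    return alpha / dim


# (net alpha, net dim, conv alpha, conv dim) for the default and the larger setups
SETUPS = {
    "default dim 8/4, alpha 1": (1, 8, 1, 4),
    "dim 32/16, alpha 1": (1, 32, 1, 16),
    "dim 32/16, alpha 16/8": (16, 32, 8, 16),
}

if __name__ == "__main__":
    for name, (net_alpha, net_dim, conv_alpha, conv_dim) in SETUPS.items():
        net = effective_scale(net_alpha, net_dim)
        conv = effective_scale(conv_alpha, conv_dim)
        print(f"{name}: net scale {net}, conv scale {conv}")
```

Under this assumption, going from alpha 1 to "half alpha" at dim 32/16 multiplies the update by 16 (net) and 8 (conv), which gives some intuition for why it can act like a larger learning rate.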

I made the following observations
- I get good results with the latter two configurations, which confirms that increasing alpha and increasing the learning rate have similar effects. More precisely, I get better backgrounds and better separation between fanart and screenshot styles (only for Mihari and Euphyllia though) compared to the dimension 8/4 LoHas.
- Each of them, however, has its own strengths and weaknesses. The 5e-4 one works better for `Euphyllia; fanart`
![00081-20230327011335](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00081-20230327011335.png)
- Among all the modules I trained, only the dim 32/16 half alpha one can almost consistently output the correct outfit for Mihari
![00084-20230327021752](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00084-20230327021752.png)
- They seem to give better results for style training in general.
![00091-20230327040330](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00091-20230327040330.png)
![00094-20230327052628](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00094-20230327052628.png)
![00095-20230327055221](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00095-20230327055221.png)
- They seem to provide better style transfer. Please refer to the end (image of 144 MB).

One interesting observation is that in the first image we get better backgrounds for the small LoHa trained at higher resolution and for the larger LoHa trained only at resolution 512. This again suggests that we may be able to get good results with a small dimension if it is trained properly. It is however unclear how to achieve that. Simply increasing the learning rate to 5e-4 does not seem to be sufficient in this case (as can be seen from the above images).

Finally, these results do not mean that you should always use a larger dimension, as you probably do not need all the details that the additional dimensions bring.


#### Optimizer, learning rate scheduler, and learning rate

This is probably the most important thing to tune once you have a good dataset, but I don't have much to say here.  
You should just find the one that works.  
Some people suggest the lr finder strategy https://followfoxai.substack.com/p/find-optimal-learning-rates-for-stable

I tested several things, and here is what I can say
- Setting the learning rate larger of course makes training faster as long as it does not fry things up. Here switching the learning rate from 2e-4 to 5e-4 increases the likeness. Would it however be better to train longer with a smaller learning rate? This still needs more testing. (I will zoom in on the case where we only change the text encoder learning rate below.)
- The cosine scheduler learns slower than the constant scheduler for a fixed learning rate.
- It seems that Dadaptation trains faster at styles but slower at characters. Why?  
Since the outputs of Dadaptation seem to change more over time, I guess it may just have picked a larger learning rate. Does this then mean that a larger learning rate would pick up the style first?
![00074-20230326204643](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00074-20230326204643.png)
![00097-20230327063406](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00097-20230327063406.png)
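
One plausible reason for the cosine scheduler learning slower is simply that its average learning rate is lower. The sketch below compares the two, assuming a cosine decay from the base learning rate to zero with no warmup or restarts (the exact scheduler settings used in the experiments may differ).

```python
import math


def constant_lr(step, total_steps, base_lr):
    """Constant schedule: the learning rate never changes."""
    return base_lr


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr to 0, no warmup or restarts (assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def mean_lr(schedule, total_steps, base_lr):
    """Average learning rate over the whole run."""
    return sum(schedule(s, total_steps, base_lr) for s in range(total_steps)) / total_steps


if __name__ == "__main__":
    steps, lr = 1000, 2e-4
    ratio = mean_lr(cosine_lr, steps, lr) / mean_lr(constant_lr, steps, lr)
    print(f"cosine / constant average learning rate: {ratio:.3f}")
```

Under these assumptions the cosine run sees roughly half the average learning rate of the constant run for the same number of steps.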


#### Text encoder learning rate

It is often suggested to set the text encoder learning rate smaller than that of the unet.
This of course makes training slower, while the benefit is hard to evaluate.
In one experiment I halve the text encoder learning rate and train the model twice as long.
After spending some time on this, here are two situations that reveal the potential benefit of this practice.

- In my training set I have anime screenshots, tagged with `aniscreen`, and fanart, tagged with `fanart`.
Although they are balanced to have the same weight, the consistency of the anime screenshots seems to drive the characters toward this style by default.
Putting `aniscreen` in the negative prompt causes bad results in general, but the model trained with the lower text encoder learning rate seems to survive the best.
Note that Tilty (second image) is only trained with anime screenshots.

![00165-20230327023658](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00165-20230327023658.png)
![00085-20230327030828](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00085-20230327030828.png)

- Training at a lower text encoder learning rate should better preserve the model's ability to understand the prompt.
This aspect is difficult to test, but it seems to be confirmed by this "umbrella" experiment (though some other setups, such as LoRa and higher dimension, seem to give even better results).

![00083-20230327015201](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00083-20230327015201.png)

There may be some disadvantages as well, but this needs to be explored further.  
In any case, I still believe that to get the best result we should avoid text encoder training completely and do [pivotal tuning](https://github.com/cloneofsimo/lora/discussions/121) instead.
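
The halved text encoder learning rate used in the experiment of this section can be expressed as two optimizer parameter groups. The sketch below only builds the group dictionaries, in the `{"params": ..., "lr": ...}` layout that PyTorch optimizers accept. The function name `build_param_groups` and the `te_ratio` argument are hypothetical, and this is not the trainer's actual code.

```python
def build_param_groups(unet_params, text_encoder_params, unet_lr=2e-4, te_ratio=0.5):
    """Return per-module parameter groups with a scaled text encoder lr.

    te_ratio=0.5 corresponds to halving the text encoder learning rate.
    """
    return [
        {"params": list(unet_params), "lr": unet_lr},
        {"params": list(text_encoder_params), "lr": unet_lr * te_ratio},
    ]


if __name__ == "__main__":
    # Placeholder strings stand in for real parameter tensors.
    groups = build_param_groups(["unet_weight"], ["text_encoder_weight"])
    for group in groups:
        print(group["lr"], group["params"])
```

With real parameters, the returned list could be passed directly as the first argument of an optimizer constructor.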


#### LoRa, LoCon, LoHa

It may seem weird to mention this so late, but honestly I do not find that they give very different results here.  
The common belief is that LoHa trains more style than LoCon, which in turn trains more style than LoRa.  
This seems to be mostly true, but the difference is quite subtle. Moreover, I would rather use the word "texture" instead of style.  
I specifically tested whether any of them would be more favorable when transferred to a different base model. No conclusion here.

- LoHa
![00067-20230326093940](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00067-20230326093940.png)
- LoCon
![00068-20230326095613](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00068-20230326095613.png)
- LoRa
![00069-20230326102713](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00069-20230326102713.png)
- Without additional network
![00070-20230326103743](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00070-20230326103743.png)

Some remarks
- In the above images, LoHa has dim 8/4, LoCon has dim 16/8, and LoRa has dim 8. LoHa and LoCon thus have roughly the same size (25 MB) while LoRa is smaller (11 MB). LoRa with its smaller dimension seems to train faster here.  
- Some comparisons between LoHa and LoCon do suggest that LoHa indeed trains faster at texture while LoCon trains faster at higher-level traits. The difference is however very small, so it is not really conclusive.
![00034-20230325234457](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00034-20230325234457.png)
![00035-20230325235521](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00035-20230325235521.png)
- In an [early experiment](https://civitai.com/models/17336/roukin8-character-lohaloconfullckpt-8) I saw that LoHa and LoCon training led to quite different results. One possible explanation is that I train on NAI here while I trained on [BP](https://huggingface.co/Crosstyan/BPModel) in that experiment.


#### Clip skip 1 versus 2

People say that we should train on clip skip 2 for anime models, but honestly I cannot see any difference. The only important thing is to use the same clip skip for training and sampling.

![00013-20230325200652](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00013-20230325200652.png)
![00014-20230325203156](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/00014-20230325203156.png)


#### Style Transfer

The simple rule seems to be that we get better style transfer if the styles are better trained.  
Although it is impossible to draw any conclusion from a single image, dim 32/16 half alpha is clearly the winner here, followed by dim 32/16 5e-4.  
Among the remaining ones, LoRa and Dadaptation are probably slightly better. This can be explained by the fact that they both train faster (LoRa has a smaller dimension while Dadaptation supposedly uses a larger learning rate) and thus the model just knows the styles better. However, the Dadaptation LoHa completely fails at altering the style of Tilty, who only has anime screenshots in the training set. After some tests I find that this can be fixed by weighting the prompts differently.


![xyz_grid-0000-20230327073826](https://huggingface.co/alea31415/LyCORIS-experiments/resolve/main/generated_samples/xyz_grid-0000-20230327073826.png)


### Dataset

Here is the composition of the dataset
```
17_characters~fanart~OyamaMihari: 53
19_characters~fanart~OyamaMahiro+OyamaMihari: 47
1_artists~kantoku: 2190
24_characters~fanart~Anisphia: 37
28_characters~screenshots~Anisphia+Tilty: 24
2_artists~ke-ta: 738
2_artists~momoko: 762
2_characters~screenshots~Euphyllia: 235
3_characters~fanart~OyamaMahiro: 299
3_characters~screenshots~Anisphia: 217
3_characters~screenshots~OyamaMahiro: 210
3_characters~screenshots~OyamaMahiro+OyamaMihari: 199
3_characters~screenshots~OyamaMihari: 177
4_characters~screenshots~Anisphia+Euphyllia: 165
57_characters~fanart~Euphyllia: 16
5_artists~mochizuki_kei: 426
5_artists~onono_imoko: 373
7_characters~screenshots~Tilty: 95
9_characters~fanart~Anisphia+Euphyllia: 97
```
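
If the leading number of each folder name is read as a per-epoch repeat count (the `<repeats>_<name>` folder convention of kohya-style trainers, which is an assumption here), the effective weight of each folder is repeats times image count. The sketch below parses a few lines of the listing above; `effective_counts` is a hypothetical helper.

```python
SAMPLE = """
17_characters~fanart~OyamaMihari: 53
3_characters~screenshots~OyamaMihari: 177
1_artists~kantoku: 2190
5_artists~mochizuki_kei: 426
"""


def effective_counts(listing):
    """Map folder label -> repeats * image count.

    Assumes each line looks like '<repeats>_<label>: <count>'.
    """
    result = {}
    for line in listing.strip().splitlines():
        name, count = line.rsplit(":", 1)
        repeats, label = name.split("_", 1)
        result[label] = int(repeats) * int(count)
    return result


if __name__ == "__main__":
    for label, weight in effective_counts(SAMPLE).items():
        print(f"{label}: {weight}")
```

Running the same function on the full listing gives a quick view of how the folders are weighted against each other within one epoch.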