Best way to stay close to a main image while changing details, poses, surroundings, etc.?
Has anyone experimented with how to stay close to one image while changing other things in it?
I am thinking of, for example, generating an image of a person and then trying to change what's in the background, or even what the person is holding in their hand, without changing the face too much. Exploring the latent space by interpolating to seeds near the original seed works well for finding random new features, but changing the prompt, e.g. by adding "... reading a book" to the description of the person, can change the whole image even when starting with the same seed.
Using img2img doesn't work in many cases, as the person in the image may need to move, for example, and img2img seems to optimize for detecting the same features at the same places as in the seed image, which would be the person who did not move (e.g. did not sit down to read a book).
I wonder if one could somehow extract features about the generated person which could then be incorporated into new images, like "a person *(having features with seed X)* sitting down to read a book", where the features part is extracted from an image by detecting, for example, the facial features and generating a seed that is understood by the network and leads it to generate similar faces.
That's a cool use case!
Could maybe a combination of image segmentation (https://huggingface.co/tasks/image-segmentation) and the inpainting pipeline (https://github.com/huggingface/diffusers#in-painting-using-stable-diffusion) work here?
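Something along these lines, as a rough, untested sketch: segment the person, then inpaint everything *except* the person so the background changes while the subject stays. The model IDs, the "person" label, the white-means-repaint mask convention, and the file names are assumptions (and the inpainting argument names have changed between diffusers versions):

```python
import PIL.Image
import PIL.ImageOps
import torch
from diffusers import StableDiffusionInpaintPipeline
from transformers import pipeline

# Load the previously generated image of the person (path is hypothetical)
image = PIL.Image.open("person.png").convert("RGB").resize((512, 512))

# 1. Segment the person with an off-the-shelf model
#    (default model; any panoptic/instance model that yields a "person" segment works)
segmenter = pipeline("image-segmentation")
segments = segmenter(image)
person_mask = next(s["mask"] for s in segments if s["label"] == "person")

# 2. Invert the mask: inpainting repaints the white area, and here we want to
#    keep the person and regenerate everything else
background_mask = PIL.ImageOps.invert(person_mask.convert("L"))

# 3. Repaint only the background with a new prompt
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = inpaint(
    prompt="a person in a cozy library full of books",
    image=image,
    mask_image=background_mask,
).images[0]
result.save("person_new_background.png")
```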
Also cc @multimodalart @valhalla @pcuenq
I need to test the inpainting pipeline. The problem is that img2img and probably the inpainting pipeline start with an image which they then change, but I would like to preserve features like "red hair" that should be independent of the pose of the person. Segmenting and moving things around can surely work for some images (maybe one would like to generate a few persons, then place them in a group image and "diffuse them together").
The question is: can one extract some of the actual information from the latent space? Let's say large eyes, red hair and certain facial features, and then generate random images with these features but, for example, totally different poses of the person with large eyes and red hair?
#27 looks like it could be helpful for this. Let's say I generated a random image for "female person" and got these features; then I could produce a prompt that may reproduce this image by extracting more information than the initial prompt contained. This could not describe the facial features in detail, but something like "red hair" can probably be extracted. For the other features, I guess the initial noise would need to be modified.
Currently I use a really simple approach for generating similar images:
- Generate a random initialization `init1` and keep it fixed
- In each iteration generate a random initialization `init2`
- Create n steps interpolating between `init1` and `init2`
For good images one can for example use `slerp(t, init1, init2)` with `t` from `linspace(0, 0.3, 6)`.
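For reference, a minimal sketch of this approach in diffusers, assuming the usual slerp formulation and the `latents` argument of `StableDiffusionPipeline`; the model ID, prompt and seeds are just placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

def slerp(t, v0, v1):
    """Spherical interpolation between two latent tensors."""
    v0_f, v1_f = v0.flatten().float(), v1.flatten().float()
    dot = torch.sum(v0_f * v1_f) / (v0_f.norm() * v1_f.norm())
    theta = torch.acos(dot.clamp(-1.0, 1.0))
    out = (torch.sin((1 - t) * theta) * v0_f + torch.sin(t * theta) * v1_f) / torch.sin(theta)
    return out.reshape(v0.shape).to(v0.dtype)

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent shape for 512x512 output

def random_latents(seed):
    gen = torch.Generator("cuda").manual_seed(seed)
    return torch.randn(shape, generator=gen, device="cuda", dtype=torch.float16)

init1 = random_latents(0)                 # kept fixed
prompt = "a green bird at a riverside"

for i in range(1, 5):                     # a few random "directions"
    init2 = random_latents(i)
    for t in torch.linspace(0.05, 0.3, 6).tolist():  # stay close to init1
        latents = slerp(t, init1, init2)
        image = pipe(prompt, latents=latents).images[0]
        image.save(f"bird_seed{i}_t{t:.2f}.png")
```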
Some examples generated using seeds interpolated between `init1` and different random initializations `init2` with `t` in [0.05, 0.3]:
The diagonal riverside is always preserved quite well, but I would like to have something that helps keep the bird looking the same.
In these images one could segment the first bird and insert it into the other images, but maybe I would like to have the bird flying in one image, and then one needs to get the information "the bird is green" to the new position of the bird.
It may also be useful to be able to weight the attention. When I describe a character and then add "in front of the Eiffel tower", I get a whole new image in which the character is much smaller (and looks different) to make room for the Eiffel tower.
It would be useful if the network would first look for the character (and detect the same shape) and only then give attention to the Eiffel tower in the pixels that were rendered as a single-color background before.
This could also be a use case for the inpainting pipeline, but it would require a good segmentation, and the character may previously have had a more complex background rather than a single color, in which case segmentation is a hard task as well.
Experimenting with it, it looks like it would be interesting to build some UI to interpolate between an arbitrary number of seeds, combining features found in different seeds to steer the image in a direction.
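Even without a UI, combining more than two seeds could look something like the following rough sketch, reusing the `slerp` helper from the snippet above; the sequential slerp with per-seed weights is just one arbitrary scheme, not an established method:

```python
import torch

def combine_latents(base, others, weights):
    """Pull `base` a little towards each of the other latents by its weight."""
    out = base
    for latent, w in zip(others, weights):
        out = slerp(w, out, latent)  # slerp from the interpolation sketch above
    return out

shape = (1, 4, 64, 64)

def latents_for_seed(seed):
    return torch.randn(shape, generator=torch.Generator().manual_seed(seed))

mixed = combine_latents(
    latents_for_seed(0),
    [latents_for_seed(1), latents_for_seed(2), latents_for_seed(3)],
    weights=[0.2, 0.1, 0.15],
)
# image = pipe(prompt, latents=mixed.to("cuda", torch.float16)).images[0]
```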
Very interesting research, @xalex. Eventually, some UI for things like drawing masks for inpainting on the fly, or doing advanced prompt engineering, would make Stable Diffusion even more awesome.
For now, @bloc97 has made the Cross Attention Control repo, which lets you fine-tune the output, effectively giving much more artistic control.
^ CrossAttentionControl repo
Adding the link here: https://github.com/bloc97/CrossAttentionControl
This looks interesting, because it does not change the initialization but the attention. Maybe one could also try to extend inpainting approaches to inpaint the same noise.
Let's say I had rendered a knight: I wonder if I could segment the knight, paste the noise from that area into the noise for a fantasy background, steer the prompt/attention a bit, and get the knight integrated into the background better than by pasting the image and then using img2img.
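The pasting itself could be a simple mask blend of the two initial latents. An untested sketch: the latent shape, the placeholder rectangle mask and the seeds are assumptions; in practice the mask would come from segmenting the knight and downsampling to latent resolution:

```python
import torch

shape = (1, 4, 64, 64)  # Stable Diffusion latent shape for 512x512 images

def latents_for_seed(seed):
    gen = torch.Generator("cuda").manual_seed(seed)
    return torch.randn(shape, generator=gen, device="cuda", dtype=torch.float16)

knight_latents = latents_for_seed(1234)      # the seed that produced the knight
background_latents = latents_for_seed(5678)  # a new seed for the fantasy background

# 1 where the knight was in the original image (here just a placeholder rectangle);
# ideally a segmentation mask resized to the 64x64 latent resolution
mask = torch.zeros(1, 1, 64, 64, device="cuda", dtype=torch.float16)
mask[:, :, 16:56, 20:44] = 1.0

mixed_latents = mask * knight_latents + (1 - mask) * background_latents

# image = pipe("a knight riding through a fantasy forest",
#              latents=mixed_latents).images[0]
```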
Still, I am not sure whether any of these approaches allows getting the same knight looking in another direction. When I have a knight (as image, prompt, seed) who is riding from left to right and I now want a front-facing image, I would need to steer the features that make up his face and, e.g., the armor of his horse, not the image/image initialization.
Steering via prompts only works roughly for many things. So one could probably use a technique that finds the weights that contributed to a part of an image (maybe based on approaches like diffusers-interpret) and then changes them a bit to find out which weights steer what. If one then has weights for the face, one could try to keep them fixed when generating an image from another perspective.
I wonder if the CrossAttentionControl code should be added to diffusers. It seems that it still needs some convenience functions, e.g. to map words to their tokens. And what happens when a word consists of several tokens and they have different weights?
diffusers could add some abstraction to, for example, allow weighting with text operators like "beetle car -animal -insect".
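As far as I know there is no such operator syntax in diffusers, but the `negative_prompt` argument gets part of the way there. A hedged sketch for the "beetle car -animal -insect" example (model ID assumed, and this is only a rough approximation of per-word weights):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a beetle car",
    negative_prompt="animal, insect",  # steer the sampler away from these concepts
).images[0]
image.save("beetle_car.png")
```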
@xalex any update on this? Were you able to achieve it?
I would also be interested if there is any solution to this :)
Is it solved?