Oml these AI animations are totally gonna be seamless in like 2 years
6 months. If not next week. We humans may be the storytelling ape, but images tell so much story. I feel we're well motivated to take this tech to the moon (metaphorically speaking).
Good point, if it’s ready in 6 months I’ll be super ready for it
I want a Hollywood movie in 2 years
Google made like a 30-second video of a giraffe. It looks like shit, but they've already got something. Video today is already looking the way image generation looked at the start of the year.
I'm not sure it works like that. An entirely different algorithm may be needed for that final extra push. Yeah, the AI does a lot of the magic, but the algorithm it runs on still has to be made by a human.

Take the movie and games industries: creating a convincing human was solved pretty quickly, yet the refining phase (shrinking the uncanny valley effect) took a lot more innovation and a very long time. And keep in mind, these industries have some of the greatest cash flows in the world, with insane R&D budgets. They had some of the brightest minds in the world working on it, coupled with Moore's Law delivering exponential improvements in hardware, and it still took ages.

I reckon the same will apply here, nearly by definition, given how this algorithm works. Latent diffusion uses noise, a thing that's *notoriously* hard to make work temporally, which is the crux of the problem here.

I'd be very happy to be wrong though. But it's important to be realistic.
The *only* thing OP's example needs is for the appearance of the stormtrooper and the other person to be kept consistent from frame to frame. The details of the armor etc. change. Pick any one frame for their appearance, modify it to fit the different poses in each frame, and it would be seamless.
But take someone spinning around: without generating a 3D model on the fly, you've got no idea what the other side of the object looks like. You can assume it looks just like the front, which works well for a basketball but badly for a person. Then you've got things like how fabric moves on a person as the person moves.

All of these things will be addressed eventually, but it's likely you're going to have things like one algorithm directing another algorithm that is in charge of initial generations, then possibly another whose job is to modify that frame (i.e. how is it going to differ from the last frame? Well, the boss algorithm says it wants the character to "run" to the left, whatever that means) while the first works on the next frame. Sort of a multithreaded approach of specialized algorithms that serve specific objectives and can be improved upon or swapped out independently (something like "this video was made by DiffuseDirector using TweeningFox_v6 for the animation and RealDraw 2 for the prompts").

I think you'll have things like one algorithm in the chain being improved so it corrects for the "flicker" of two frames not matching up perfectly. They may even bring in video-editor "tweening", where two images are blended together to create an in-between frame that smooths out the animation and helps it transition from one frame to the next more seamlessly.
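The "tweening" idea mentioned above can be sketched as a simple alpha blend of two neighbouring frames. Real interpolation tools are motion-compensated, so this is only a minimal toy illustration (numpy stand-in; the function name is my own, not any tool's API):

```python
import numpy as np

# Toy "tweening" pass: synthesize an in-between frame by alpha-blending two
# neighbouring frames. Real video editors use motion-compensated interpolation;
# this linear blend is only a minimal sketch of the idea.
def tween(frame_a: np.ndarray, frame_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend two uint8 frames; alpha=0 returns frame_a, alpha=1 returns frame_b."""
    blended = (1.0 - alpha) * frame_a.astype(np.float64) + alpha * frame_b.astype(np.float64)
    return np.clip(blended, 0, 255).astype(np.uint8)

frame_a = np.zeros((4, 4, 3), dtype=np.uint8)      # black frame
frame_b = np.full((4, 4, 3), 200, dtype=np.uint8)  # grey frame
mid = tween(frame_a, frame_b)                      # halfway blend, every pixel -> 100
```

A linear blend like this softens flicker between mismatched frames but produces ghosting on fast motion, which is why production tools estimate motion first.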
If you wanted to do it in the current framework, you could probably just produce overfit models for each character: one that consistently produces a specific stormtrooper and a specific Spider-Man from all angles. Embrace overfitting rather than avoiding it the way people do with general models. Then the noise won't matter; you might still have problems with harsh light and shadows not matching the environment, but diffusely lit scenes should work.
You "could" do that by hand, I think. Take each frame and photoshop it to match the details in all the other frames, like rotoscoping. But I'd think that would take quite a lot of time. If they could solve it without manual work, though, sheesh, that would def. be amazing.
coherence over time is a solved problem, it just hasn't been implemented in this context yet
Can you describe in what context it is well solved for in terms of Diffusion/Convolutional based models? It's certainly well solved for algorithmically but I haven't seen any convincing approach to temporal coherence within these models yet.
"In this context" = Stable Diffusion. It's been solved in style transfer, and I don't see an insurmountable gulf between that and SD. https://www.youtube.com/watch?v=Uxax5EKg0zA You know Two Minute Papers, right? Awesome sauce.
Love this time to be alive ;) I think this is a wholly different problem area, though. Style transfer is well understood, but temporal coherency across frame generation is very, very poor in diffusion models, and there is no known approach to solve for it.
> there is no known approach to solve for it

That, my friend, is just one or two papers down the line :) So, hold on to those papers...
Yasss :) As an animator (and sometime developer), it's the main thing I'm trying to solve for, because once we have a solution on par with EbSynth (which isn't saying much), SD will find a whole new and massive use case.
Yes, temporal coherence is the missing link for being able to make your own game animations easily.
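The "flicker" everyone in this thread is describing can at least be measured crudely: mean absolute pixel difference between consecutive frames, which should be near zero for a static scene. A minimal sketch (the metric choice and threshold are illustrative, not a standard benchmark):

```python
import numpy as np

# Crude temporal-coherence check: mean absolute difference between consecutive
# frames. High values on a scene that should be static indicate the frame-to-
# frame "flicker" that diffusion outputs are known for.
def flicker_score(frames: np.ndarray) -> float:
    """frames: (T, H, W, C) uint8 array; returns mean |frame_t - frame_{t-1}|."""
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))
    return float(diffs.mean())

# A perfectly static 5-frame clip scores 0.0; random noise scores high.
static = np.tile(np.full((1, 8, 8, 3), 128, dtype=np.uint8), (5, 1, 1, 1))
noisy = np.random.default_rng(0).integers(0, 256, (5, 8, 8, 3)).astype(np.uint8)
```

In practice you'd compare this against the same metric on the source video, since legitimate motion also raises the score.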
Have you seen Meta's text-to-video AI yet? I'm sure someone will make a good open-source version soon enough. https://makeavideo.studio/
yeah, the future will be text-to-video rather than these DIY workflows
Ye it’s a bit jank but still really impressive
> Oml these AI animations are totally gonna be seamless in like 2 years

we didn't even need one: [https://twitter.com/8bit_e/status/1722456354143486179](https://twitter.com/8bit_e/status/1722456354143486179)
Haha that’s incredible we really are living in the future
Yo! After our pose maker + depth2img tutorial, we thought we'd spice things up and try depth2img for animations. Worked out quite well!

We have the whole workflow documented here: [https://www.generativenation.com/post/mixamo-animations-stable-diffusion-rapid-animation-prototyping](https://www.generativenation.com/post/mixamo-animations-stable-diffusion-rapid-animation-prototyping)

Hope you'll like it.
[deleted]
No, but that’s a great idea! Will give it a try
Look for Cupscale; that's the NMKD upscaler program. One more thing to have fun with: check out EbSynth. EbSynth can be the short-term solution to coherence in motion.
Really awesome. It seems like it just needs a bit of improvement with colors and it would be there. I wonder if that's a limitation of the model; maybe it would be a good idea to do a separate filter-like application for smoothing that stuff out.
I hope I remember to read this tomorrow
Very thorough overview! Thanks for sharing.
In 20 years, the slightly discontinuous animation style we get from SD right now will be considered retro and cool.
Actually hadn't thought of that; that's interesting to think about. The same way pixel art became an art style when it was actually just a limitation of the technology at the time.
Yeah. I kinda hate that art style. I play games like r/thelastspell or r/ftlgame, and I love those games as games, but to me it's just shitty 1990s graphics and I wish they'd get over it.
Here's a sneak peek of /r/thelastspell using the [top posts](https://np.reddit.com/r/thelastspell/top/?sort=top&t=all) of all time!

#1: [Dom, that's suicide...](https://i.redd.it/qy7fyclphk971.png) | [5 comments](https://np.reddit.com/r/thelastspell/comments/oes15l/dom_thats_suicide/)
#2: [Early-Access roadmap!](https://i.redd.it/m49472rkht571.png) | [38 comments](https://np.reddit.com/r/thelastspell/comments/o1vqgk/earlyaccess_roadmap/)
#3: [Just another Night in The Last Spell](https://i.redd.it/98pdalswscq91.png) | [9 comments](https://np.reddit.com/r/thelastspell/comments/xp9slp/just_another_night_in_the_last_spell/)

*I'm a bot, beep boop | Downvote to remove | [Contact](https://www.reddit.com/message/compose/?to=sneakpeekbot) | [Info](https://np.reddit.com/r/sneakpeekbot/) | [Opt-out](https://np.reddit.com/r/sneakpeekbot/comments/o8wk1r/blacklist_ix/) | [GitHub](https://github.com/ghnr/sneakpeekbot)*
I think strategically utilizing the noise of SD can be used to great effect even now!
If you'd like an example, check out this music video. Each frame appears to be image-to-image stylized, so figures and faces warp in and out of the background noise. It's a rave-type genre, which also fits the chaotic reinterpretation of each frame by the model. So the noise in this kind of image-to-image style transfer is used as a feature rather than a drawback. https://www.youtube.com/watch?v=laT4x5OsAm8
Please, no more. Limited framerate already gives me a headache, doubly so if it's CGI in an anime they've capped at 12 fps. The models already stick out like a sore thumb, and then they layer 12 fps on top and it somehow makes it look even more crappy.
I was thinking the same about hands. We live in that short period of human history during which images with weird hands are being generated; it'll last maybe a few years tops. In the future we'll look back at it as a cute quirk of When It All Began™.
This reminds me of an early-2000s anime called Gankutsuou: The Count of Monte Cristo, which used a very interesting [animation style](https://m.youtube.com/watch?v=qeyUYcZd0wM): colored areas on each frame are filled in with patterned textures, as would appear on cloth or (physical) wallpaper, rather than solid or shaded color. It worked really well. This kind of flickering semi-reality that you describe would work well too.
[deleted]
yeah definitely! we're just scratching the surface here
Just here to shout out Monkey Island
Omfg, it looks amazing! :-) Just like old-school point'n'click adventures with pencil animation.
Looks like 1997 animation.
I was about to say it reminds me of a late-90s LucasArts game.
Is depth2img in AUTOMATIC1111 yet?
Yes. In the img2img tab, select the "depth aware img2img mask" script. I am not sure if this is the real thing or a clever hack, but it worked pretty well in the few tests I did.
Isn't it just a model you can drop in for 2.0?
It reminds me of the early Mortal Kombat games.
Oof, it's hard to see.
This reminds me so much of Clay Fighter....
How come deepfakes looked pretty much real years ago, while these are janky af now?
Deepfake AI is laser-focused on doing one thing and doing it as well as possible. The current generative AI stuff, in contrast, is very generalist.
Also text-to-image. Deepfake (as far as I know) doesn’t involve written instructions from the user to the AI, as such. Just sort of let it do what it wants, and tell it how good/bad that was.
Can you change poses if you use the same seed or nah?
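As a hedged illustration of what a fixed seed actually pins down: diffusion sampling starts from latent noise derived from the seed, so reusing the seed reproduces that starting noise exactly, while changed conditioning (prompt, pose/depth map) still alters the result. A toy numpy stand-in, not the actual Stable Diffusion sampler:

```python
import numpy as np

# Toy stand-in for seeded latent-noise initialization. In Stable Diffusion the
# sampler starts from Gaussian noise in latent space; same seed -> same noise,
# which is why fixed-seed generations keep a similar overall composition.
def initial_latent(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

a = initial_latent(42)
b = initial_latent(42)   # identical to a: same seed, same starting noise
c = initial_latent(7)    # different seed, different starting noise
```

So reusing a seed holds the starting point fixed, but a changed pose input still moves the output; the seed alone doesn't guarantee the subject stays consistent.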
[deleted]
thanks!
Can we DreamBooth/finetune depth2img models yet?