UNSTABLE SCULPTURES — Diffusion Models are the Next Frontier of AAA Game Development
Exploring a future of AAA game development focused around controlling diffusion models
June 12, 2025
This is the third post in an unplanned trilogy of posts that look at the state of the games industry and where it’s headed over the next few decades. You can find the other posts here:
I’ve been telling people for the past few years (with no industry insider information!) that I think the next major wave of consoles will have tensor processing units (TPUs) in them to do local model inference. I’ve never really written about why I think this though, and my spicy hot take from a few years ago now seems more obvious in a post-Stable Diffusion/ChatGPT world.
However, I still wanted to take the time to put some of these predictions to paper, along with my reasoning. Additionally, despite telling anyone who would listen to my take here, I wasn’t actually sure WHY you would want this. It just “felt” like what would happen. But a few years on I’ve got some better ideas, and I think these ideas fundamentally alter what “playing a game” and “making a game” will mean in the coming console generations.
ECONOMICS
The driving force behind new console releases since the dawn of video game consoles has been more computing power coupled with stronger graphics processing units. However, games rarely utilize raw compute to do “more things,” and instead use compute only insofar as it helps them push the envelope of graphical fidelity and spectacle, mostly because this is what sells games. There has been a symbiotic relationship between high-end AAA game creators and new console manufacturers — developers make games that showcase the “power” of the consoles, consumers buy the new spectacle, console sales go up, repeat.
To even state this as a dynamic feels strange, as it has been completely embedded in games production and the unquestioned model for decades. The idea of an alternative is hard to conceive.
However, we have reached a plateau with photorealistic graphics in games. Fidelity and spectacle still increase, but the gains are marginal compared to the generational leaps that used to happen between console generations.
This is bad for high-end AAA studios, as their defensibility (both in terms of their product but also justifying incredibly large budgets to publishing partners) has been predicated on their ability to deliver high-fidelity graphics to consumers (notably not “game design”) that outmatches previous console generations. If a studio is unable to produce “better” graphics than a previous console generation, that dilutes the value of the console itself.
DEMOCRATIZATION AND COMMODIFICATION OF AAA
At the same time, the tools for game creation have become democratized. The floor developers operate at continues to rise higher and higher. You aren’t going to get Remedy-quality graphics out of the box in Unreal - but you can get 80% there. Graphics, the driving force of premium content and consumer demand, is now available writ large and as such has started to become commoditized.
And notably, 80% of the top end seems like enough. Or more that consumers don’t care so much for spectacle and fidelity as primary purchasing motivators when everything can operate at 80% of AAA. Big failures recently in games like Concord and Redfall can (imo) largely be chalked up to “we thought we could still sell a game on just high fidelity asset production”.
These failures are not one-offs though — they are canaries in the coal mine and represent the shifting waters of the previous “deal” of AAA production, games, consumers, and consoles. Clinging to IP strength as a lifeboat (which is what most people would have you believe was why the aforementioned games failed — they were new brands) is only a temporary fix here. This rejection of the current high end will come for all games. Consumers largely seem to be rejecting the premise AAA is founded on, and only rewarding the highest of high ends with the same old logic. This is not a healthy ecosystem, as it will squeeze out everything else (and then itself).
I think there is socioeconomic reasoning for this as well which acts as a compounding factor for these issues. Creating a game with really good graphics can feel pretentious. This high-gloss media that so desperately wants you to engage with it can feel disjointed to engage with if your life exists in a capitalist dystopia. These games are played largely for spectacle, but spectacle can feel or veer towards condescension, dangling some luscious possibility in front of your face that is at odds with the circumstances in which you consume it.
Which is also to say that shit-aesthetic games like Among Us, Vampire Survivors, and Slay the Spire are ascendant. They make no pretensions, and there is very little between your experience of playing the game and the way the game presents itself to you.
Western AAA has no idea what to do about this. They will keep spending hundreds of millions of dollars chasing bad logic and just hoping they are the last ones to be found out in some sort of desperate game of studio Russian roulette.
Eastern AAA has a better handle here, with games like Dave the Diver (secretly a Tencent game) proving what’s possible when you put AA/AAA resources at “lower aesthetic” rungs. I think their historical engagement with mobile here has been a major boon in the long term, as they have a better handle on understanding how/where players derive long term value from a game vs. just continual ante-ing of the spectacle. However they face burn along other dimensions of value, as players experience fatigue and also deteriorating socioeconomic circumstances that turn them away from the monetization mechanics these games employ.
Western AAA’s attempts to fold some of this design into their premium games with schemes like battle passes are just half-answers to this dilemma — “why do people want our games?” If someone doesn’t or can’t pay for it, you instead offer it for free and charge inside the game, making back your game’s budget by inducing death by a thousand cuts on your players. The balance sheet looks better here, but this doesn’t address the main issue: customers don’t view your game as enough of a premium experience to buy it in the first place.
I also recommend listening to this episode of the Game Craft podcast on “Cycles”. Mitch Lasky and Blake Robbins spend the whole episode sort of talking about the “state” of games right now, with every indicator of “Content Innovation,” “Distribution Innovation,” and “Technological Innovation” being at a low point, leading to industry contraction and conservatism. In short — AAA NEEDS a major jolt that is able to span all three.
Given all this, I think it’s easy to think that premium can no longer exist in games. I don’t think this is true though. I think it more means that we are bad at producing premium content now that premium has been commoditized. There is effectively no gap between indie and AAA, and AAA is floundering trying to capture and recoup value.
Which then raises the question: is there an actual frontier for “premium” games? Graphics have been the driving force here, but with that now moot, is “premium” or AAA over?
I don’t think so — there’s one permutation of course where “premium” experiences as a concept just completely go away, but that seems unlikely in our current politics. Large corporate entities will always want to be able to extract premium value where possible, and short of them locking off consoles to third party developers (highly unlikely), they will want to find a way to build premium content to differentiate themselves.
So if it’s no longer graphics, what’s next?
DIFFUSION
My prediction here had been, as indicated at the top of this, “machine learning models”. I’ve been saying this for a few years now, but I didn’t really have a clear understanding (even to myself) of what form that would take. There are obvious things like generating text or images on the fly that would be nice to have locally for inference speed, but also any modern console with an internet connection can already do this with currently available APIs.
Microsoft or Sony could ship a “baked” version of something like ChatGPT on a console, but that still seems a bit small-minded (and again, API costs and speed are negligible). So for what then?
Well, here’s a peek:
Not impressed yet? Consider this similarly shitty example:
Behold the future of gaming, in all its fuzzily and imprecisely rendered glory.
In neither of these videos are you actually watching someone play Doom or Minecraft. Instead, you’re seeing a stream of AI generated images that make it look like you’re playing the respective game. But there is no “game” actually there.
The DOOM video comes as part of a paper titled “Diffusion Models are Real-Time Game Engines” released by Google Research that detailed their generative diffusion game engine GameNGen. The core idea behind the “engine” is that they trained an ML model to associate player input with visual output.
Then, given player input, the “engine” produces commensurate visual output, no traditional engine actually required. It’s sort of like if you had a way to prompt an image model accurately to show each subsequent frame of a game that didn’t exist. However, current image APIs over the internet would give you a response time of around 10,000ms. Games need to be hundreds of times faster, targeting roughly 16ms per frame. GameNGen gets there with, you guessed it, local inference:
GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of autoregressive generation
This is the use case for local inference on game consoles: streaming frames for games powered by diffusion models (for visuals at least, but maybe more, as in the above examples).
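To make the idea concrete, here’s a minimal sketch of what an action-conditioned frame loop looks like. Everything here is a hypothetical stand-in — `NextFrameModel` and its `predict` method are invented for illustration, not GameNGen’s actual API:

```python
import time

class NextFrameModel:
    """Hypothetical stand-in for a trained action-conditioned diffusion model."""
    def predict(self, recent_frames, recent_actions):
        # A real model would denoise a latent conditioned on a window of
        # past frames and player inputs; here we just return a placeholder.
        return f"frame[{len(recent_frames)}] after action={recent_actions[-1]}"

def game_loop(model, read_input, present, n_frames=3, budget_s=1 / 20):
    """Target ~20 FPS: read input, predict the next frame, display it."""
    frames, actions = ["initial_frame"], []
    for _ in range(n_frames):
        start = time.monotonic()
        actions.append(read_input())        # player input this tick
        frame = model.predict(frames, actions)
        frames.append(frame)                # autoregressive context grows
        present(frame)                      # show the generated frame
        # Sleep off any leftover frame budget (20 FPS, per GameNGen's numbers).
        time.sleep(max(0.0, budget_s - (time.monotonic() - start)))
    return frames

shown = []
frames = game_loop(NextFrameModel(), read_input=lambda: "MOVE_FORWARD",
                   present=shown.append)
```

The point of the sketch is the shape of the loop: there is no simulation step anywhere — the model’s output *is* the game, and the frame budget is the hard constraint that forces inference to be local.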
GO ON…
Something that inspired me to write this post was this video from Veo 3:
https://www.reddit.com/r/singularity/s/0Bs8XZVNdY
Especially the scenes that looked like a Grand Theft Auto knockoff game (complete with boat-like camera control). They looked COMPELLING.
To be clear, this is a pure video model — there is no realtime input here or GameNGen in the mix. But it’s not hard to see the gap closing imminently between “video of fake GTA game” and “playable fake GTA” when you consider how fast the space of AI is moving. GameNGen itself was even more ambitious here than possibly necessary, as the training phase saw a large reinforcement learning effort to associate player input and image data. I think you could be far more naive on the visual side but pair it with a spare environment simulation and get far better cohesion and effectively “swappable” visuals. Regardless, the biggest barrier right now is the tradeoff between speed (image/video gen throughput) and quality (fidelity of experience), and that gap is rapidly closing.
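A minimal sketch of that pairing, with all names invented for illustration: a tiny simulation owns ground truth, and the diffusion layer is reduced to rendering the same state under whatever visual style you ask for:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorldState:
    """Sparse ground-truth state: just enough structure to keep cohesion."""
    player_x: float
    player_y: float
    time_of_day: str

def step(state: WorldState, action: str) -> WorldState:
    """Deterministic simulation step; the diffusion layer never owns truth."""
    moves = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
    dx, dy = moves.get(action, (0, 0))
    return replace(state, player_x=state.player_x + dx,
                   player_y=state.player_y + dy)

def to_prompt(state: WorldState, style: str) -> str:
    """Condition the visual model on the same state, in any swappable style."""
    return (f"{style} render: player at ({state.player_x}, {state.player_y}), "
            f"{state.time_of_day} lighting")

state = step(WorldState(0, 0, "dusk"), "east")
```

Because the simulation is authoritative, the same `WorldState` can condition a photoreal render, a watercolor one, or anything else — which is exactly the “swappable visuals” property.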
Instead of needing a globally distributed team with a budget somewhere around $500-900M to make a single frame of “real” GTA 6 possible, you can generate a frame of GTA for effectively the cost of a few circuits firing. The whole AAA industry, effectively compressed into a prompt.
We aren’t there yet though (obviously). These generators are also notoriously entropic (by design), so even if you had access to the requisite throughput, you may just be generating a strobing seizure of images. The GameNGen paper notes that the model only really has ~3 seconds of memory, and games need cohesion on the order of 10s to 100s of hours.
AN EXPENSIVE, CHALLENGING, AND REWARDING ROAD
Which is to say there is a big chunky technical problem that needs to be solved and the people who do it need to be well capitalized, smart, and skilled in game development. It’s an effort I’d roughly describe as “stabilizing diffusion model output”, and, I think, this work will become the domain of AAA games for the next decade.
AAA will strive to fully contain the models and coax them into behaving consistently, while indies will be fine with (or even embrace) the fuzziness of an unstable experience (or just keep making other things).
This is a huge opportunity, and the winner(s) will effectively redefine gaming for the next few decades. It’s a tall order, but one that I think gaming’s finest could definitely pull off, provided they are well resourced and sheltered inside of a studio model.
There’s a lot of work to be done and a lot to build, and a lot of it will cost a lot of money. This is good, as it will act properly as a moat between “premium” games and everything else (and remember right now there basically is no moat).
I can think, off the top of my head, of plenty of ways to deploy capital towards a “premium” diffusion-primary game:
- Developing the underlying “engine” of the game. GameNGen had no engine, and maybe I’m being naive here to think we would need one moving forward, but I feel like the models are far too unstable to not have some ground truth simulation in the background. I suspect this will be a schism in AAA where different studios take different tactics and become known for them. I think we’ll start seeing a lot of “engines” that almost act like rapid grey-boxing tools, but with a lot of additional tools that feed diffusion-produced output back to the engine.
- Scene object annotation tools in combination with something like ControlNet to feed the understanding of a scene to a model to improve consistency between frames.
- OCR-ing image output and feeding the results back into the underlying engine so it understands what is actually being shown.
- Similar for audio: capturing diffusion-created sounds and replaying them back later for consistency.
- Relatedly, finding ways to actually “buffer” the diffusion output and edit it before showing it to the player, to remove any erroneous details or add elements that improve consistency.
- Developing a “world grammar” (not dissimilar to USD) that can be used to generate a prompt for the model
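For that last item, here’s a hedged sketch of what a “world grammar” might reduce to in practice — a structured scene description compiled into a deterministic prompt. The schema is entirely invented for illustration (real systems would presumably look much more like USD):

```python
# Hypothetical "world grammar": a structured, USD-flavored scene description
# that compiles down to a conditioning prompt for each frame.
scene = {
    "environment": {"location": "town square", "weather": "overcast"},
    "entities": [
        {"id": "npc_01", "kind": "pedestrian", "pos": (3, 7)},
        {"id": "npc_02", "kind": "pedestrian", "pos": (5, 2)},
    ],
}

def compile_prompt(scene: dict) -> str:
    """Flatten the grammar into a deterministic prompt string, so the same
    scene always conditions the model the same way (aiding frame-to-frame
    consistency)."""
    env = scene["environment"]
    parts = [f"{env['location']}, {env['weather']}"]
    # Sort entities by id so insertion order never changes the prompt.
    for e in sorted(scene["entities"], key=lambda e: e["id"]):
        parts.append(f"{e['kind']} {e['id']} at {e['pos']}")
    return "; ".join(parts)
```

The determinism is the whole point: if the grammar is the single source of truth and compiles identically every frame, the model’s conditioning only changes when the world actually changes.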
Every one of these is its own domain. There may be 10-50 companies for any one of them. And there are others far beyond this as well. Doing all of these well will be expensive and difficult, but the upside is that you are redefining gaming and game development.
BUT WHY DO THIS AT ALL?
I think it’s easy to see this as a marginal cost savings on making a game, with the added complexity on top of the technical infrastructure to “control” the models.
In some senses I think this is true. But it belies a greater point about making games this way.
If you have the technical infrastructure set up to be able to render a cohesive scene of people walking around a town square in Madrid (or wherever), it is trivially easy to use a greater proportion of that pre-work to then make a scene of people walking around a space station. Or Atlantis. Or Atlanta.
There is almost no marginal cost to changing this, and it can happen with the same model install a user already has — a user can even control it, generating and simulating wherever or whatever they want.
The reuse factor here is off the charts. You can make one “game” that is effectively 1000000 “different” games for any one player. It opens up the possibility for an effectively limitless number of permutations of a game. A game can be fully customized to exactly what the player wants.
Instead of shipping patches, you get model updates that improve consistency over time. GTA7 could be GTA WORLD — players prompt the game for where they want it set, and then off they go. The cost for someone like Rockstar to do this the “old way” is unfathomable and would be simply impossible in practice.
The lure of this way of making games will be impossible to ignore. I also do not think many of the current AAA studios are well set up to make these changes, and they will likely fall by the wayside. They are completely indexed on a “production of high-end graphics is the be-all end-all of development” mindset.
WHAT DIES
I think the coming of this is acutely bad for current AAA because these models will excel at reproducing the types of experiences they offer. In trying for so long to make games more like films, game companies have now set themselves up to be usurped by literal (generative) videos. Games where the whole play experience is focused around a well-defined character in the middle of a scene, framed like a movie, where all that really matters is what’s literally in front of you - that stuff is going to get eaten so fast.
It’s honestly maybe even possible that GTA 6 gets “nerd sniped” by diffusion models. It’s fun to play in Miami, sure. But why not make a GTA in your own town? London? Jakarta? A diffusion model powered game can do this, maybe within a year. Rockstar is probably just shaking and hoping it doesn’t happen, as it would completely diffuse demand for 6.
The “beautiful 2D platformer” is also probably gone. If you’ve got a good underlying platforming engine and environment, you can easily pass that to a model to generate whatever artistic style you want.
Racing is similarly over. Especially as they are maybe even closest to already operating like this, with underlying simulations effectively puppeting an irrelevant graphical layer.
WHAT SURVIVES
Because “games that mostly look and feel like films” are prime targets for disruption here, this is also to say that the further you get from the hegemonic “cinematic third person action adventure game,” the better set up you are long term to deliver a differentiated product that diffusion is bad at.
Strategy games especially stand out here, which is to say Europa Universalis 5/Paradox has nothing to worry about. Indies in general as a category and broader more “systems driven” games will still be in demand.
Games where EXACTNESS matters will still be around (RTS, MOBAs). Games that are more about audiovisual experiences will definitely get their lunch eaten.
WHAT THRIVES
The other opportunity here is that this is a technology that enables a whole new class of games. When you can effectively “will” any frame into existence, without an implicit need for an underlying technology to support that frame - you can do almost anything. You can, on the fly, potentially change your car racing game into a property management game, and then into a life sim, and then back to a car racer. And these transitions are potentially not even anticipated by the developer but would implicitly work. Actually, here are more capital-intensive ideas (and questions):
How do you interpolate between “game models”? Can games actually interact with each other?
Do companies/consoles produce “models” that you build on, akin to an SDK?
What is the interface and protocol handoff between models? Is that the system SDK, or is it something application developers will manage with some open standard?
I think it’s easy to think that somehow with this technology we get one big “everything game”, but that seems unlikely to me (or more that someone will do it and it will be bad). People DO like differentiated experiences, and may gravitate towards games that are still largely about “one” thing. Games will differentiate on their own models and the inner workings of their own “engines”.
It is also worth saying that access to local inference can also be something that works for trad games in more subtle ways (also when people talk about “AI in games” this is usually what they are talking about). Roguelikes can generate infinite items with appropriate visuals. Visual Novels can use LLMs for off-kilter player interactions. A game can be fully voiced by AI, etc.
In some ways, democratized tooling has already erased AAA defensibility, as genre frontiers become more possible at the indie level. So (new) AAA needs new non-democratized frontiers to maintain a premium, and leaning into the capital expensive work of wresting local models into control is a perfect way for them to move out of the mire of “traditional” genre games into something infeasible for a small team.
BEYOND GAMES
Game consoles are strange dinosaurs of a previous era of technological progress. They used to be bastions of possibility and a driver of graphics innovation, but the rise of the modern PC GPU (why didn’t I buy NVIDIA stock when I was 14….) and general “gamer computer culture” means that PCs have largely caught up with, and surpassed, the computing power of a console, so a console’s hegemonic control of the state of the art lasts a few months at best (if at all).
However, right now it is not common to have a TPU in your computer. Largely because you don’t need one unless you are personally doing custom inference work on a local model. They are also expensive and difficult to program as they lack a common interface like CUDA.
Consoles then are well-poised here to also just offer general value to developers and customers by offering machines with easily programmable TPUs. With these tactics, they (through market share/consumers) can actually push through new technology to mass audiences once again while also controlling the flow of that technology, making them technical leaders again.
In fact, I’d even go as far to say that these new “consoles” will only be partially branded as such. I don’t necessarily expect to see literally something like XBOX SERIES X2 (Now with TPU!), but maybe instead something like XBOX T - a new product line/console form factor.
By conventional game logic, the idea would be that you then sell this with premier titles. This may still be the case. However, I think back to the PS3 “cell processor” days, when people bought PS3s for their computing power alone, and I could see a similar thing happening here. These console form factors become the way to buy general purpose TPU compute. Good for games, sure, but also a great present if your kid is into using AI models and wants something more geared towards producing art. It’s not exactly an Amiga… but it’s not dissimilar in the way the Amiga shipped its own creative toolset with the machine itself. The console will become a shortcut to consumer access to (and APIs for) TPUs.
PROJECTING
To be clear, these will look like absolute shit at first (as we’ve seen above). People will make fun of the games that start to sketch out what this looks like. “Traditional” gamers will bemoan them. But they will slowly (or quickly) get better and better, and all the obvious criticisms will start to go away.
Especially so when you consider the same graphical leaps between console generations that I previously talked about are also happening in AI models on a much more compressed timeframe. As people like to say, “these are the worst they are ever going to be”.
These improvements only took two years in Midjourney (top is latest v7, bottom is v1)!
Leaps like this do not exist in traditional games anymore. It’s part of the ennui of new console launches now. Everyone feels like we’re going through the motions, and we’re all waiting around for the real next thing, and until then just playing the hits.
It’s also good this is becoming possible right now, as AAA is in the process of collapse, flooding the market with tons of highly talented people. Hire them! Diffusion and rendering are one part of this, but the technical know-how to build an engine that is able to support these features will need game engineering-flavored people.
I’ve seen at least one company already potentially pursuing ideas here — Odyssey.world, but their vision seems a bit quaint compared to what I’m describing. This is a full-on reconfiguration of what AAA game development looks like. There will be many players, and many studios. There will be new middleware, new franchises, etc.
It’s going to be a bumpy ride as this happens. In a lot of ways it will feel like we’re back at square one again, though we’re coming to it now with decades of knowledge about what games can and can’t do.
I’m excited for it though. I think it’s going to be wild to watch happen. Like watching a prompted image slowly come to life as its initial blurry rendition begins to fade into focus, the same will happen with diffusion-powered games. Slow at first, murky, unclear, then finding its legs and stepping into maturity. Replaying the rise of games 50 years ago from children’s toys to globe-spanning entertainment medium (and also still children’s toys).