Using Gemini 2.5 and Claude Code To Generate An AI 2027 Wargame
Seeing how far current LLMs can get with designing and programming a complex turn-based strategy game
April 30, 2025
I had an idea to try and convert the recent AI 2027 report into a game. Beyond that, I again wanted to push at some of the limits of AI coding and also get some reps in with testing a newer workflow of using models like Gemini 2.5 and o3 to “plan” and then getting Claude Code to do the actual work.
I still think the game is a good idea, but I’m rapidly losing steam and may just cut and run, due in part to some limitations of vibing. Before I drop it, though, I wanted to write a little bit about the workflow and what worked and what didn’t.
I also put up a repo on Github where you can look at the more load-bearing documentation I created for how the game works:
https://github.com/kkukshtel/ai-2027-game
Here’s a screenshot of the game so far, and where it will likely stay:
Background
The way I conceived of this game was as some mixture of Twilight Struggle, Imperial Struggle, and Daybreak. If I were to napkin-design it, I’d come up with a basic area-control/influence game played over abstract connections between geographical centers, with event cards split up by year, based on info extrapolated from the report.
The specific “mixture” I basically left up to Gemini and I prompted it as such:
I’m looking to create a web-based strategy game that simulates AI takeoff scenarios. The gameplay mechanics should be somewhat similar to games like Twilight Struggle, Imperial Struggle, and Daybreak (rules attached), and should focus around the rising spheres of global influence around developing and maintaining AGI and how that affects other key parts of the world. The game will only ever be played in a desktop web browser, so we don’t necessarily need to think of responsive design. Players can play as either the USA or China.
Create a high level ruleset about how the game is played, complete with general rules and structure of play. Then create an implementation plan for building the game website. Then create a component reference for the components that need to be built. Then create instructions to create the full game, with a strong step by step workflow that a junior engineer could follow along with to create the game.
Consider all the attached files when thinking about how the game should be created as well as what themes it should represent through play.
Consider AI 2027 here as the template for the scenarios possible, as well as the attached research papers.
Attached were the rulesets for the three games, as well as some additional research papers going into the specific geopolitical struggles that will likely emerge over the next year as we, collectively, race to AGI.
Gemini thought for a bit, then produced this (sorry for the long quote but I think it’s useful in this context!):
Phase 1: High-Level Ruleset - “AGI Race: Brinkmanship”
1. Game Objective: Players compete as either the USA or China to achieve global dominance through the development and control of Artificial General Intelligence (AGI) and potentially Artificial Superintelligence (ASI). Victory is achieved by accumulating sufficient Victory Points (VP), achieving a decisive AGI lead under specific conditions, or forcing the opponent into a state of irreversible global instability.
2. Core Concepts:
- Influence: Represents a player’s control and integration within key global Domains relevant to AI development and deployment (R&D, Compute & Hardware, Policy & Governance, Public Opinion, Military Integration). Influence is placed in specific Hubs (e.g., Silicon Valley, Shenzhen, EU AI Centers, Global South Tech Hubs) located on the game map.
- AGI Development Track: A central track measuring each player’s progress towards AGI/ASI. Advancing requires significant investment and unlocks capabilities but also potential risks.
- Global Stability Track: Represents the world’s socio-political and economic stability. Starts high. Certain actions (aggressive AI deployment, military integration, major security incidents, public panic) decrease stability. Low stability makes actions harder, increases risks, and triggers Crisis Events. Reaching 0 Stability can cause game loss for the responsible player. Analogous to DEFCON / Daybreak’s Temperature.
- Cards: The primary driver of actions. Cards represent technological breakthroughs, strategic investments, geopolitical events, policy decisions, incidents, and public shifts. Cards are affiliated with USA, China, or Neutral.
- Operations Points (Ops): Cards provide Ops points of different types: Research Ops, Compute Ops, Influence Ops. These are used to take actions.
- Events: Cards also have events that can be triggered, often providing powerful, specific effects. Playing an opponent’s card for Ops must trigger their event.
- Domains: The key spheres of competition (R&D, Compute, Policy, Public Opinion, Military). Control over Hubs within these domains contributes to scoring.
- Crisis Cards: Drawn when Global Stability drops or the AGI track advances significantly. Represent unforeseen negative consequences (AI escapes, major cyberattacks, economic collapse, geopolitical flashpoints, extreme public backlash). Often require players to expend resources to mitigate.
3. Game Components (Conceptual):
- Game Board: Depicts global Hubs grouped by Domains/Regions. Connections show potential influence spread. Tracks for VP, AGI Development, and Global Stability.
- Cards: Decks for different Eras reflecting AI progress (e.g., Early [~2025-26], Mid [~2027], Late [~2028+]). Scoring cards tied to Domains. Crisis Card deck.
- Tokens: Influence markers (USA/China), AGI progress markers, Stability marker, VP marker, miscellaneous markers for event effects or resource tracking (e.g., Alignment Effort, Compute Bottleneck).
- Player Aids: Summarizing actions, costs, scoring conditions.
4. Game Sequence (Turn Structure): Each turn represents roughly 6 months to a year.
- A. Stability Phase: Check current Global Stability. If low, potentially draw Crisis cards or apply negative effects. Improve Stability slightly (representing natural calming/adaptation).
- B. Research Phase: Reveal new technology/event card(s) for the turn (optional mechanic). Players may commit resources to secret projects.
- C. Card Phase: Deal cards to players based on the current Era (e.g., 7 cards).
- D. Headline Phase: Players secretly select one card to play for its Event only. Card with higher Compute Ops value resolves first (representing faster processing/deployment). Ties resolved by USA player first. Scoring cards cannot be Headlines.
- E. Action Rounds: Players alternate playing cards, typically 6 rounds per turn. On their round, a player plays one card either for its Ops value or its Event.
  - Playing for Ops: Use the card’s Ops points to perform actions. If the card is associated with the opponent, the Event MUST trigger for the opponent. Player chooses Ops before or after opponent’s event resolves.
  - Playing for Event: Trigger the card’s Event. Can only play own or Neutral events this way. Card is typically discarded (or removed if specified).
- F. Crisis Phase: Players may need to resolve or mitigate active Crisis cards.
- G. Cleanup Phase: Discard down to hand limit (e.g., 1 card). Check victory conditions. Advance Turn marker.
5. Core Actions (Using Ops):
- Place Influence: Spend Influence Ops to place influence markers in Hubs. Cost depends on target Hub stability, existing influence (yours/opponent’s), and adjacency. Controlling adjacent Hubs makes placement cheaper. Placing in opponent-controlled Hubs is expensive.
- Advance AGI Track: Spend Research Ops and Compute Ops to advance your marker. Costs increase significantly along the track. Reaching milestones may unlock abilities, add new Era cards, or trigger Stability checks/Crisis draws.
- Strengthen Compute/Hardware: Spend Compute Ops to gain temporary advantages, build resilience against cyberattacks, or meet prerequisites for certain Research advances (abstracted resource).
- Mitigate Stability/Crisis: Spend Ops (usually Influence or Research) to increase Global Stability or mitigate specific Crisis card effects.
- Launch Project: Spend Ops on secret projects (optional mechanic, cards face down, revealed later for effect).
6. Scoring:
- Triggered by playing Domain Scoring cards during Action Rounds.
- Players compare Influence levels in Hubs relevant to that Domain (Presence, Domination, Control mechanics similar to Twilight Struggle).
- VP awarded based on dominance level. Bonus VP for controlling key Hubs or achieving specific event conditions.
- VP marker adjusted on the track.
7. Winning:
- Reach 20 VP immediately.
- Opponent causes Global Stability to reach 0.
- Achieve a decisive AGI lead (e.g., reach the final space on the AGI track while the opponent is significantly behind and Global Stability is above a certain threshold).
- End of final turn: Final scoring round (score all Domains); the player with the VP advantage wins. Tiebreakers based on AGI track position or final Stability.
I’m happy (or sad?) to report that this was, in broad strokes, basically the design I had napkin-created. Players have hands of cards that can be played as an event and resolved, or played for their ops value (hello Twilight Struggle) to gain that resource, which can then be spent placing tokens to control hubs (Shenzhen, Taiwan, Silicon Valley, DC, etc.). There is also an AGI track players can advance with ops toward victory.
This all largely sounds good! It seems like it would work, right? Well, kind of.
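For concreteness, here’s roughly the state that ruleset implies. This is a hypothetical TypeScript sketch of my own, not anything Gemini produced or what’s actually in the repo:

```typescript
// Hypothetical sketch of the game state implied by the ruleset above.
// Names and numbers are illustrative, not taken from the actual repo.
type Superpower = "USA" | "CHN";
type Domain = "RnD" | "Compute" | "Policy" | "PublicOpinion" | "Military";

interface Hub {
  id: string;                          // e.g. "silicon-valley", "shenzhen"
  domain: Domain;
  adjacent: string[];                  // connected hubs, for influence-spread costs
  influence: Record<Superpower, number>;
}

interface GameState {
  turn: number;                        // each turn is roughly 6-12 months
  era: "Early" | "Mid" | "Late";
  vp: number;                          // positive favors USA, negative favors China
  globalStability: number;             // starts high; hitting 0 ends the game
  agiTrack: Record<Superpower, number>;
  hubs: Hub[];
  hands: Record<Superpower, string[]>; // card ids
}
```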
AI (Still) Won’t Kill Gamedev
I wrote before that AI won’t kill gamedev. After this exercise I believe it even more than I did then, even with two years of model progress and my own increasing skill at using these models.
Here’s the rub: Gemini was great at the highest-level thinking. Truly an “idea guy”. Being able to throw it the rulesets of three exemplar games in the relevant genre and have it create a rough ruleset based on those influences was great. Incredible, even, that this is possible at all (maybe I’m even feeling the AGI).
However, off the bat I could tell the design was bad. It was clearly the “average” of the given rulesets, and as such feels like a bit of a dull mush of its own priors. It lacked any bite or jagged edges. It also lacked things like balance. By its own accounting, players could play every card in their hand for ops points. These ops points could be spent on the AGI track for advancement. The ops awarded by cards were high enough that players could basically beat the game on the first hand by simply playing all their cards for ops and advancing the AGI track to the max.
This is not an untenable design issue, and there are various “answers” to it (e.g. only allow players to advance the AGI track once per turn, or only let one card per turn be played for ops, etc.). Solving it, though, requires me, as a designer, to now consider all of the rules of the game and “fix” this. I could either do that myself, or prompt the AI with the issue and have it fix it. Though both are possible, this points to the underlying problem: the AI’s design was poor and broken. “Fixing” the rules with the AI is effectively the “same” work as just doing the design — the AI is redundant.
Gemini, as seen above, provided a high-level ruleset, captured in those bullet points. This is all well and good until you sit down to actually play the game. The rulebook for Twilight Struggle is about 10 pages of small-print, dual-column text. Imperial Struggle is about 15. Daybreak, the simplest of the three, is around 3-5 in the same format. Teasing out the implications of all those rules would take a lot of prompting back and forth, with almost no guarantee of internal consistency, or even that a given rule is “good” rather than just a patch over a design hole you told the AI existed.
The issue here is that you cannot extrapolate compelling and balanced gameplay from priors or even a high-level description. If someone described Gloomhaven in broad strokes, actually creating Gloomhaven from that point would be incredibly difficult. Instead of a direct line from the “vision” to the reality of the game, there is something of a constellation of possibilities orbiting the seed of the idea. Many of the implied games are bad, some may be good, some may be great. None are a given.
This is an issue for LLMs, which are still largely glorified next word predictors. This is obviously incredibly useful and powerful in a lot of cases, but just following the median extrapolation from a given starting point will either take you somewhere “average” or basically tunnel you exactly back to your reference points.
The Ops (Operations) mechanic of Gemini’s rules is a great example of this. Twilight Struggle popularized the card-driven wargame design through this system. Gemini has no such innovation here, even though ostensibly it could break all the rules and be “inspired” by the listed games to iterate on their concepts and create something new.
The issue is that these innovations are non-obvious from priors. Games are, in a lot of ways, nonsensical. Their rulesets do not easily follow predefined patterns that could be replicated with slight variation. They are much more chaotic and entropic, littered with special cases and exceptions. “Bad” ideas from one game that may be publicly bemoaned and ridiculed (hence “downvoting” their use in a training corpus) can in fact be repurposed in newer designs in ways that redefine the landscape for a genre and get hailed as revolutionary.
Making a game’s rules “fun” is also a total crapshoot: platform collision boxes are often slightly longer than the platforms themselves to give players the thrill of barely making it. It is common in first-person shooters to make the final bullet of a magazine marginally more powerful, and to have the last few bars of health be worth “more”, amping up the tension of edge cases. Games notoriously “cheat” on displayed percentages because humans are bad at reasoning about probability. The list goes on. You can’t predictably program this, because these cases are nonsensical.
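To make that concrete, here’s the kind of deliberately “wrong” check a platformer ships, as a generic TypeScript sketch of the pattern (not code from any particular game):

```typescript
// A "forgiving" landing check: the effective platform is slightly wider than
// its visible sprite, so near-misses still count. The constant is tuned by
// feel, not derived from anything logical.
const EDGE_FORGIVENESS_PX = 6;

interface Platform { x: number; y: number; width: number; }
interface Player { x: number; y: number; velocityY: number; }

function landsOn(player: Player, platform: Platform): boolean {
  const falling = player.velocityY > 0;
  const withinX =
    player.x >= platform.x - EDGE_FORGIVENESS_PX &&
    player.x <= platform.x + platform.width + EDGE_FORGIVENESS_PX;
  const atSurface = Math.abs(player.y - platform.y) < 2;
  return falling && withinX && atSurface;
}
```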
The gameplay experience evoked by the code that goes into a game cannot necessarily be derived from pattern-matching priors. You can’t really interpolate between Mario and Celeste to come up with Shovel Knight. Throwing Twilight Struggle, Imperial Struggle, and Daybreak against each other for Gemini then produces a median work, truly less than the sum of its parts.
It’s worth saying that I can obviously prompt Gemini again. Maybe I can push it harder, and iterate across a lot of prompting to start crafting a better ruleset myself. But in doing that work I’m mostly just doing the “normal” work of designing, with the LLM acting as a (maybe too agreeable?) sounding board.
My point here is only to gesture at the “default” behavior of such a model: it isn’t, by default, a magic box of incredible ideas… yet.
Content Design
Despite the above, this experience was still incredibly instructive around the margins of the “work” itself as I tried out some new tooling and techniques. I want to discuss some of the technical specifics of that below, but first, to continue on the beat of “design”, I want to note that one place where Gemini really shined (IMO) was content design (as opposed to systems/rules design).
If you’ve used LLMs to any large degree on somewhat related problems this shouldn’t be a surprise — in some ways they are basically like “find (math-style) derivatives of this text corpus”. Feeding Gemini all the referenced PDFs and prompting it to come up with event cards and hubs was honestly awesome and I think well observed from the reference material. Here’s some examples:
Mid Era Cards (M01 - M60) ~ Acceleration & Confrontation (2027)
(Focus: More capable agents, AI for R&D takes off, model theft, hardware race intensifies, serious policy/treaty attempts, military AI risks)
- M01: Agent-1 Equivalent Deployed (Internal) (NEU) | Ops: 4R | Event: Both players advance AGI Track by 1. Player with more R&D Hubs controlled may place 1 Influence there. | REMOVE
- M02: AI Accelerates Scientific Discovery (NEU) | Ops: 3R | Event: Draw 2 cards. Discard 1 card. May immediately spend up to 2R Ops.
- M03: Superhuman Coding Achieved (NEU) | Ops: 3R | Event: The player whose AGI Track marker is ahead gains 2 VP. If tied, no VP. | REMOVE
- M04: China Steals Agent-X Weights (CHN) | Ops: 1I | Event: China advances AGI Track by 2. Decrease Stability -2. USA may look at China’s hand and discard 1 card. | REMOVE
- M05: US Retaliatory Cyberattack (USA) | Ops: 2I | Event: If China played “China Steals Agent-X” this turn or last turn, China loses 2 VP and removes 2 Influence from any Compute Hub. Otherwise, no effect. Decrease Stability -1.
- M06: Domestic Semiconductor Fab Online (USA) | Ops: 4C | Event: Place 3 USA Influence in USA Compute Hubs. USA gains “Secure Chips” marker: Opponent events cannot remove influence from USA Compute Hubs. | PERSISTENT
- M07: SMIC Breakthrough (CHN) | Ops: 4C | Event: Place 3 CHN Influence in CHN Compute Hubs. China ignores USA “Export Controls Imposed” event.
Who knows if these numbers are any good, but in terms of pure thematic relationship to rules as well as relevance to provided influences, I think these are pretty good! The crisis card effects are similarly well observed:
Immediate Choice Crises
Cards that present an immediate choice to players:
- C02: Critical Infrastructure Cyberattack
- Effect Text: Immediate: Randomly select USA or China. Target player immediately chooses ONE: (a) Remove all own Influence from any 2 Compute Hubs they control OR (b) Discard 3 cards from hand. Global Stability decreases -1. Remove this Crisis immediately after the choice is made and effect applied.
- C10: Critical Energy Grid Hack
- Effect Text: Immediate: Global Stability decreases -1. The player controlling more Compute Hubs immediately chooses ONE: (a) Discard 2 cards from hand OR (b) Lose 2 VP. Remove this Crisis immediately.
- C14: Diplomatic AI Misinterprets Signals
- Effect Text: Immediate: Global Stability decreases -1. Both players must immediately remove 1 own Influence from one Military Hub they control that is adjacent to the opponent’s superpower space OR adjacent to a hub in the opponent’s home region (Player chooses which if multiple qualify). Penalty: If a player cannot remove influence as required (no controlled hubs meet criteria), they lose 1 VP instead. Remove this Crisis immediately.
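As an aside on structure: cards like these flatten nicely into plain data, something along the lines of this hypothetical sketch (the field names are mine; the repo’s actual schema may differ):

```typescript
// Hypothetical shape for the event/crisis card content above.
// Field names are illustrative; the repo's actual schema may differ.
type Affiliation = "USA" | "CHN" | "NEU";
type OpsType = "R" | "C" | "I"; // Research, Compute, Influence

interface EventCard {
  id: string;                 // e.g. "M04"
  title: string;
  affiliation: Affiliation;
  ops: { type: OpsType; value: number };
  eventText: string;          // resolved by a rules engine (or a human referee)
  removeAfterPlay?: boolean;  // the REMOVE tag
  persistent?: boolean;       // the PERSISTENT tag
}

const m04: EventCard = {
  id: "M04",
  title: "China Steals Agent-X Weights",
  affiliation: "CHN",
  ops: { type: "I", value: 1 },
  eventText:
    "China advances AGI Track by 2. Decrease Stability -2. USA may look at China's hand and discard 1 card.",
  removeAfterPlay: true,
};
```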
I mentioned this at the top, but you can check out the Github repo here for a much bigger overview of the content! I recommend it!
Okay, on to the technical stuff.
Gemini 2.5 handoff to Claude Code
An emerging workflow for AI tools, while we are still pre-AGI and no one model is the absolute definite best final final_final_V2_final version, is to use strong “thinking” models like o3 or Gemini 2.5 to “plan” your work before you actually do it. This plan can then be passed off to more “tactical” models and workflows like Claude Code, or Cline (with o4-mini), giving them something of a compounded ability that leverages the best of what each model is good at.
I feel that this, in practice, was fantastic. I used Gemini 2.5 to set up a sprint plan for implementing the game, and then passed that off to Claude Code. I also got it to give me unit tests to verify the work, and Claude Code would implement those as well. Great!
I was honestly surprised that, after mostly just letting it work by itself through the sprints, once the frontend sprint was done the frontend actually worked nearly flawlessly. Again, totally going off of feel here, but I suspect that if Claude Code had done this by itself, it would have had a much harder time on the first run.
This also sort of echoes a different philosophy I took here compared to the work I did with mood. With mood I basically just prompted for disparate bugs and features, stopping and starting chats as I moved on to whatever was next.
For this project, I instead tried to generate a ton of documentation for how things should work first, then fed that to Claude bit by bit. I didn’t want to have to babysit every individual task, so passing off these sprint documents seemed like an effective way to get things done. Gemini did break the sprints up into individual task groups, and rather than prompting Claude Code with something like “do sprint 4” (which is what I did), a better approach would have been to prompt it task by task, so it doesn’t try to operate over the whole sprint at once and can instead focus on a single task with measurable results.
Whole sprints would result in a LOT of context being added to the model, as you can see below. Ideally I would have had it do just one task at a time, and once it’s done, clear the model history and start over. Otherwise, I’m pulling nearly every doc in the repo into context almost immediately.
Having specific documentation and requirements in the repo was largely great though. It allowed Claude to more rapidly understand what something was supposed to do, or the intended effect, instead of just reading the code as written.
Claude Code Best Practice
My use of Claude Code on the mood update was relatively naive — I used it mostly as a local Claude client plus its “agentic” properties. However Claude Code can do a lot more than I had really leveraged, so I wanted to try some of that here, guided in part by Anthropic’s great blog post on using Claude Code.
One major one was the use of the CLAUDE.md file in the repo. This is something akin to an as-needed system prompt that can cascade to subfolders that also contain their own CLAUDE.md file. It’s a great place to store information about where files are located, what to do when adding tests, etc. Adding new info to this file can be done with a shortcut (#) or you can just instruct Claude to update it based on some other knowledge.
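As a rough illustration of the kind of thing that ends up in there (a made-up example, not this project’s actual file):

```markdown
# CLAUDE.md (illustrative example)

## Layout
- Frontend code lives in `frontend/`
- Supabase edge functions live in `supabase/functions/`
- Game content docs (cards, hubs, rules) live in `docs/`

## Conventions
- Run all commands from the repo root.
- New edge functions get a matching Deno test next to them.
- Check `docs/rules.md` before changing any game-rule logic.
```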
After working on some part of the design that came from Gemini, or some other issue I was debugging, I would often just end with “add your learnings to CLAUDE.md”. I also ended up using a slash command for this.
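A custom slash command is just a Markdown file checked into `.claude/commands/`, something like this hypothetical one:

```markdown
<!-- .claude/commands/log-learnings.md (hypothetical) -->
Review what we changed this session and append any non-obvious learnings
(gotchas, file locations, commands that worked) to CLAUDE.md under a
"Learnings" heading. Keep each entry to one line.
```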
MCP
I didn’t get around to really needing MCP (Model Context Protocol) on mood; it was something off on the horizon that looked interesting, but I wasn’t sure yet how it would affect my own workflow.
Having now used it with Claude Code, it feels a bit like a superpower. Specifically, I used the Supabase and Puppeteer servers, and each was a total gamechanger. With the Supabase one, Claude Code could basically edit and update resources on Supabase directly for me, without me needing to set up and route API calls, etc., and it was notably well integrated into the chat client itself. This immediately saved me a ton of back and forth between the app and Supabase.
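For reference, wiring these up amounts to a small config entry per server, roughly like the sketch below; the package names and flags here are from memory, so treat this as the general shape rather than gospel:

```json
{
  "mcpServers": {
    "puppeteer": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
    },
    "supabase": {
      "command": "npx",
      "args": ["-y", "@supabase/mcp-server-supabase@latest"],
      "env": { "SUPABASE_ACCESS_TOKEN": "<personal-access-token>" }
    }
  }
}
```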
Puppeteer also felt like a surprising win. The ability to instruct the LLM to do functional tests and have it open your own browser, puppet it, and return that data back to the LLM felt incredible. Its choices and functionality here were mostly “just OK,” to the degree that I’m not totally sure the tests were that effective, but I suspect this will get better over time.
For games specifically, you really need tools like image diffing to ensure things are correct, but you also need to set up functional tests more manually and run them to see if they fail. Again referencing “games are illogical”: tests also have to cover strange functionality or expectations that aren’t obviously derived from the structure of the code itself, and are instead based on adversarial/belligerent player behavior.
There’s more I want to do with MCP in the future, but this little effort felt like a nice way to get my head around it.
Testing
Testing more broadly was sort of hell. Unit tests ran mostly okay, leveraging Deno’s built-in test runner for Supabase functions, but getting Claude to work with a local Supabase instance to run DB-related tests was more of a crapshoot. I mentioned the blog LLM Blindspots in that big mood post, and it similarly finds that environment setup in general is something LLMs are very bad at.
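The unit-test side, at least, followed a predictable pattern, basically Deno’s standard test runner, as in this simplified sketch (`advanceAgiTrack` is a stand-in of mine, not the repo’s real API):

```typescript
// Simplified sketch of a Deno test for a pure game-rule function.
// advanceAgiTrack is a stand-in, not the repo's real API.
import { assertEquals, assertThrows } from "jsr:@std/assert@1";

function advanceAgiTrack(position: number, opsSpent: number): number {
  const cost = 2 + position; // costs escalate further along the track
  if (opsSpent < cost) throw new Error("not enough ops");
  return position + 1;
}

Deno.test("AGI track costs escalate along the track", () => {
  assertEquals(advanceAgiTrack(0, 2), 1);    // position 0 costs 2 ops
  assertThrows(() => advanceAgiTrack(3, 2)); // position 3 needs 5 ops
});
```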
The local-instance side feels like a bit of an interop problem — Supabase’s MCP server works for remote projects but (afaik) not local development. Standing up and tearing down the DB proved difficult, and Claude Code kept wanting to ignore the issues and instead mock the DB calls to get around them. Given Supabase’s API surface area, though, that mocking quickly became a bit unwieldy.
Largely though, this issue feels like it’s more in Supabase’s court — their local dev bootstrap environment, mixed with the migrations and seeding… it feels like there should be a better way. Maybe just let me push some JSON config that it hosts as if it were the server config? IDK!
Current Working Directory
LLM Blindspots also hits the nail on the head here:
Sonnet 3.7 is very bad at keeping track of what the current working directory is. Endeavor very hard to setup the project so that all commands can be run from a single directory.
Multiple times I would check on my directory after some command was run and find something like three separate supabase directories nested under each other, each with duplicate code. It’s hard to see this happening directly during a chat session, which makes it harder to stop before the LLM tunnels too far into reimplementing everything again. Making sure everything is legible from the top level is incredibly important, and having something like “the backend code is in backend/ and the frontend is in frontend/” in your CLAUDE.md file is not a bad idea.
Some Sort of Conclusion
Largely, the Gemini 2.5 -> Claude Code workflow worked. I’ll definitely use it more in the future. The specific affordances of Claude Code (MCP, CLAUDE.md, slash commands, etc.) also worked really well.
Where this all fell down was really just on the substance of the design itself. After working with it, I knew a better approach would be for me to design the game myself and then build an implementation plan against that. I would still, however, strongly consider using an LLM to at least ideate on content.
Vibecoding has a real sunk-cost problem. Not only should you get the LLM to stop digging, you also need to recognize when you’re doing too much digging yourself. Because the moment-to-moment effort of vibecoding is “easier” than doing the actual work, it’s easy to spend more time stubbornly trying to vibecode your way out than it would have taken to just do it right in the first place.
Hence this blog post — I’m stopping digging on this project. On to the next one.