When I built my AI narrative testbed in summer 2025, the tradeoffs for real-time NPC dialogue using LLMs were brutal. You could have quality, or you could have speed, but getting both at a reasonable cost felt like wishful thinking. A lot has changed.

This week I ran a fresh benchmark across four models (Qwen 3.5 Flash, Gemini 3 Flash Preview, Claude Sonnet 4.6, and Claude Opus 4.6) to see where things stand in March 2026. The results tell a genuinely interesting story about where this technology is heading.

The Testbed

Before the numbers, a quick word on what I was actually testing. The system is a fully 3D interactive narrative experience, with voiced avatars and real-time lip syncing. Each NPC understands its character, its role in the story, and everything that’s happened in the narrative up to that point, but crucially, it’s also allowed to think creatively and add new details, which get stored in a shared knowledge database. NPCs know about each other. They’ll share opinions about events, drop hints about other characters, and gradually reveal the story through natural conversation.
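The shared knowledge database is the interesting architectural piece here, so it's worth making concrete. This is a minimal sketch of the pattern as I've described it, not the testbed's actual schema; all names (`SharedKnowledge`, `Fact`, `context_for`) are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    source_npc: str  # which character invented or revealed the detail
    text: str        # the detail itself

@dataclass
class SharedKnowledge:
    facts: list[Fact] = field(default_factory=list)

    def add(self, source_npc: str, text: str) -> None:
        """Store a detail an NPC improvised so other NPCs can reference it."""
        self.facts.append(Fact(source_npc, text))

    def context_for(self, npc: str) -> str:
        """What everyone else has established, for injection into this NPC's prompt."""
        return "\n".join(
            f"{f.source_npc}: {f.text}" for f in self.facts if f.source_npc != npc
        )

kb = SharedKnowledge()
kb.add("Claudia", "Victor kept copies of an old film reel.")
kb.add("Rory", "There was a stain on the lobby carpet after dinner.")
print(kb.context_for("Claudia"))  # only Rory's fact, so Claudia can hint at it
```

The asymmetry in `context_for` is the point: each NPC's prompt carries what the *others* have established, which is what lets characters drop hints about each other.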

The whole system is 100% data-driven. Stories can be swapped in and out without touching the code. The LLM layer is built on OpenRouter, which means switching models is essentially a config change. That’s what made this benchmark possible: the same story, the same character (Claudia Marlowe, a former actress with a complicated relationship to a murder victim), the same prompts — just different models under the hood.

The story itself is a classic locked-room murder mystery set in a storm-battered hotel. The player character, Evelyn, is investigating. Claudia is one of several suspects, each with their own knowledge, agenda, and evasions.

Two Test Conditions

I ran two distinct types of interaction, and they tell different parts of the story.

Generated Dialogue Options: The player selects from AI-generated conversation options rather than speaking freely. The NPC responds to structured prompts designed to probe the narrative. This tests how well the model handles story-critical dialogue — staying in character, referencing other NPCs correctly, maintaining narrative consistency while still being creative.

Free Voice (push-to-talk): I spoke directly to Claudia and tried to get her to talk about her film career, something she’s meant to be coy and evasive about. This tests something harder: can the model maintain a specific personality trait under conversational pressure? Can it be interesting while saying nothing?

The Numbers

All latency measurements are end-to-end: from the player’s input completing to the NPC’s voice beginning (via ElevenLabs TTS). This is the number that matters for player experience — it’s the silence they feel.

| Model | LLM Latency | Perceived Latency | Cost/Exchange | Tok/s |
|---|---|---|---|---|
| Qwen 3.5 Flash | ~1,686ms | ~3.2s (PTT) / 7.4s (opt) | ~$0.000107 | ~51 |
| Gemini 3 Flash Preview | ~2,487ms | ~4.0s (PTT) / 7.9s (opt) | ~$0.000385 | ~79 |
| Claude Sonnet 4.6 | ~6,203ms | ~8.6s | ~$0.009728 | ~40 |
| Claude Opus 4.6 | ~7,532ms | ~9.5-10.1s | ~$0.049965 | ~34 |

PTT = Push-to-talk free voice mode. The generated options mode includes a dead-air calculation (the gap between player selection and NPC response), which is why perceived latency is higher in that mode.
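For clarity, here's how I think about the perceived-latency number decomposing. The component split (dead air, LLM generation, time to first TTS audio) is my framing of what the end-to-end measurement contains; the individual component values below are an illustrative fit to the Qwen rows, not separately measured figures.

```python
def perceived_latency_ms(llm_ms: float, tts_first_audio_ms: float,
                         dead_air_ms: float = 0.0) -> float:
    """Silence the player hears: dead air (options mode only), plus LLM
    generation, plus time until the first TTS audio frame plays."""
    return dead_air_ms + llm_ms + tts_first_audio_ms

# Illustrative fit to the Qwen row: ~1,686ms LLM plus ~1,514ms to first audio
print(perceived_latency_ms(1686, 1514))        # 3200.0 -- push-to-talk
print(perceived_latency_ms(1686, 1514, 4200))  # 7400.0 -- options mode
```

The options-mode gap dominates, which is why the two modes report such different perceived numbers for the same model.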

The cost column deserves a moment. Running 100 NPC exchanges costs roughly $0.01 with Qwen, $0.04 with Gemini, $0.97 with Sonnet, and $5.00 with Opus. At game scale, across dozens of NPCs and thousands of players, that's not a trivial difference.
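The scaling arithmetic is worth having in one place. This just projects the per-exchange figures from the table; the helper name is mine.

```python
# Per-exchange costs from the benchmark table, projected to volume.
COST_PER_EXCHANGE = {
    "Qwen 3.5 Flash":         0.000107,
    "Gemini 3 Flash Preview": 0.000385,
    "Claude Sonnet 4.6":      0.009728,
    "Claude Opus 4.6":        0.049965,
}

def cost_at_scale(model: str, exchanges: int) -> float:
    """Total dollar cost for a given number of NPC exchanges."""
    return COST_PER_EXCHANGE[model] * exchanges

for model in COST_PER_EXCHANGE:
    print(f"{model}: ${cost_at_scale(model, 100):.2f} per 100 exchanges")
```

A single long conversation makes the spread vivid: 20 exchanges is about $0.002 on Qwen and about $1.00 on Opus.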

What the Dialogue Actually Looked Like

Numbers only tell part of the story. Here’s what I actually observed in the transcripts.

Qwen 3.5 Flash — Fast, functional, investigative

Qwen does the job. It stays in character, correctly references other NPCs from the knowledge base, and responds quickly enough to feel genuinely conversational. But there’s a pattern in the transcripts: Claudia tends to pivot to suspects almost immediately. She’s doing her narrative function — seeding clues, pointing at Julian Voss, mentioning Martha Kline — but she does it in a way that feels a little mechanical. Like she’s narrating rather than being.

In the film conversation test, she invents plausible movie titles (“The Velvet Silence”, “Midnight’s Regret”) which is actually charming, but the evasiveness feels thin. She gives you the deflection without really making you feel like you’re talking to someone who has something to hide.

For high-volume background NPCs, ambient chatter, and low-stakes interaction, Qwen at effectively zero cost per exchange is compelling. Especially given that highly quantized versions can already be run locally — which for some use cases removes the cost question entirely.

Gemini 3 Flash Preview — The sweet spot

This is the model that surprised me most. It’s roughly 50% slower than Qwen on LLM latency alone, but still very much in real-time territory at ~4 seconds total perceived. And the quality jump is noticeable.

In the voice test, when pressed about her film career, Gemini’s Claudia responded:

“Evelyn, my dear, the past is a museum where the lights have been turned off for a reason.”

That’s a good line. It’s evasive, it’s in character, and it redirects naturally to the present drama (“I saw Rory Finch scrubbing at a stain that certainly wasn’t there during dinner”). The personality trait — coy, theatrical, deflective — is being maintained under conversational pressure rather than abandoned for exposition.

At roughly 3.6x the cost of Qwen per exchange, but still well under a tenth of a cent, Gemini 3 Flash feels like the right answer for main story NPCs in a real-time context.

Claude Sonnet 4.6 — Great writing, borderline latency

The quality step up is genuinely clear. Claudia’s backstory emerges organically, and the dark edges of her character land properly:

“Victor held a performance. One I gave twenty years ago that never made the official filmography, if you take my meaning. The kind of work a woman does when she’s desperate and young and doesn’t yet understand that desperate men keep copies.”

That’s excellent NPC writing. But at 6.2 seconds average LLM latency and ~8.6 seconds total perceived, you’re at the edge of what most players will tolerate. It’s not impossible — a game designed around slightly slower, more deliberate NPC exchanges could make this work. But it’s not real-time in the way Gemini is real-time.

At roughly 25x the cost of Gemini per exchange, it’s hard to justify for shipped games unless the interaction model specifically accommodates the latency.

Claude Opus 4.6 — The benchmark for quality, not practicality

The dialogue is the best in the dataset, full stop. Characterisation is richer, responses feel genuinely improvised. But the numbers make it impractical for real-time use: average latency of 7.5 seconds, with spikes to 13.7 seconds. That’s immersion-breaking. The variance alone is a problem — inconsistent latency is arguably worse than consistently slow latency from a feel standpoint.
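The variance point is easy to see with a quick summary of a latency distribution. The sample values here are illustrative, anchored to the ~7.5s mean and 13.7s spike I observed for Opus, not my raw measurements.

```python
import statistics

# Illustrative Opus-like latency samples (ms): same mean, one nasty spike.
samples_ms = [5800, 6400, 7100, 7900, 6200, 13700, 5600, 7500]

mean = statistics.mean(samples_ms)
worst = max(samples_ms)
print(f"mean {mean:.0f}ms, worst {worst}ms")  # the player remembers the worst
```

A player calibrates to the average and then gets hit by the spike, which is why variance breaks immersion in a way a consistently slow response doesn't.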

At ~$0.05 per exchange, a single 20-exchange conversation costs $1.00. For a shipped game, that’s untenable.

Where Opus does make sense is in offline roles: generating NPC backstories, populating the knowledge database, writing the authored narrative that the real-time models then draw from. Use the best model to build the world; use the fast model to inhabit it.

The Bigger Picture

What this benchmark shows isn’t just a cost/quality tradeoff — it’s a map of where to use each tier of model in a complete system.

Qwen 3.5 Flash (and locally-run quantized models): Ambient NPCs, background chatter, low-stakes interaction. Essentially free. Local deployment in the near term makes this a serious option for studios that can’t or won’t use cloud APIs.

Gemini 3 Flash: Main story NPCs, real-time conversation. The quality is there for narrative work, the latency is genuinely playable, and the cost is negligible at scale.

Claude Sonnet: Cutscene-adjacent or turn-based interactions where a few extra seconds is acceptable. Niche but viable for the right design.

Claude Opus: Offline content generation. Not real-time, but invaluable for building the authored layer that makes the real-time layer work.
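The tier map above reduces to a small routing table. A sketch under my own naming: the role keys and model slugs are placeholders, not verified provider IDs.

```python
# The four tiers from the text as a role -> model routing table.
TIER_FOR_ROLE = {
    "ambient":           "qwen/qwen-3.5-flash",          # background chatter
    "main_story":        "google/gemini-3-flash",        # real-time narrative NPCs
    "turn_based":        "anthropic/claude-sonnet-4.6",  # latency-tolerant scenes
    "offline_authoring": "anthropic/claude-opus-4.6",    # build the world offline
}

def model_for(role: str) -> str:
    # Default to the real-time sweet spot when a role is unclassified.
    return TIER_FOR_ROLE.get(role, TIER_FOR_ROLE["main_story"])

print(model_for("ambient"))
print(model_for("cutscene"))  # unclassified -> falls back to the main-story tier
```

The useful property is that interaction design, not model preference, drives the lookup: classify the encounter, and the tier follows.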

The trajectory here is important. Six months ago, asking a small, fast model to handle the kind of creative evasiveness Claudia demonstrates in the Gemini transcripts would have been optimistic. The gap between “good enough” and “frontier quality” is closing faster than most people realise — and the local model story is about to get very interesting.

For indie developers especially, the question is no longer whether AI-driven NPCs are technically feasible. The question is which tier of model matches which part of your interaction design.

Methodology Notes

All four models ran through OpenRouter against the same story data, the same character (Claudia Marlowe), and the same prompts; only the model ID changed between runs. Latency is end-to-end, from the player's input completing to the NPC's voice beginning via ElevenLabs TTS, and costs are per-exchange averages.

If you’re working on AI-driven narrative or NPC systems and want to compare notes, I’d love to hear what you’re finding. Drop a comment or reach out directly.