Gemini 3.5 Flash Is Impressive. Here's What We Actually Found.

Published: May 21, 2026

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

There's an unwritten rule in AI model releases: Pro is smart, Flash is fast, and you pick your tradeoff. Google just broke that rule.

Announced at Google I/O on May 19, 2026, Gemini 3.5 Flash is the first model in the new Gemini 3.5 family — and it does something no Flash-tier model has done before: outperform the previous flagship Pro model across coding and agentic benchmarks, while still running at Flash speeds.

The Context

Gemini 3.1 Pro launched in February 2026 and immediately topped the Artificial Analysis Intelligence Index on complex visual reasoning and multimodal tasks. It was Google's flagship, released just three months ago.

3.5 Flash is now better than it on most coding and agentic benchmarks. And it's faster.

The Benchmarks

Category	Benchmark	Gemini 3.5 Flash	Gemini 3 Flash	Gemini 3.1 Pro	Claude Sonnet 4.6	Claude Opus 4.7	GPT-5.5
Coding	Terminal-Bench 2.1 (agentic terminal coding)	76.2%	58.0%	70.3%	—	66.1%	78.2%
Coding	SWE-Bench Pro (diverse agentic coding tasks)	55.1%	49.6%	54.2%	—	64.3%	58.6%
Agentic	MCP Atlas (multi-step workflows using MCP)	83.6%	62.0%	78.2%	69.5%	79.1%	75.3%
Agentic	Toolathlon (real-world general tool use)	56.5%	49.4%	—	—	—	55.6%
UI Control	OSWorld-Verified (agentic computer use)	78.4%	65.1%	76.2%	72.5%	78.0%	78.7%
Expert Tasks	Finance Agent v2 (financial analysis and decision-making)	57.9%	42.6%	43.0%	51.0%	51.5%	51.8%
Expert Tasks	GDPval-AA (economically valuable knowledge work, Elo)	1656	1204	1314	1676	1753	1769
Multimodal	CharXiv Reasoning (information synthesis from complex charts)	84.2%	80.3%	83.3%	72.4%	82.1%	84.1%
Multimodal	MMMU-Pro (multimodal understanding and reasoning)	83.6%	81.2%	80.5%	74.5%	75.2%	81.2%
Multimodal	Blueprint-Bench 2 (agentic spatial reasoning)	33.6%	0.0%	26.5%	6.7%	24.5%	36.2%
Long Context	MRCR v2 — 128k (long context retrieval)	77.3%	67.2%	84.9%	84.9%	59.3%	94.8%
Long Context	MRCR v2 — 1M (long context retrieval)	26.6%	22.1%	26.3%	—	—	—
Reasoning	Humanity's Last Exam (academic reasoning, text + multimodal)	40.2%	33.7%	44.4%	33.2%	46.9%	41.4%
Reasoning	ARC-AGI-2 (abstract reasoning puzzles)	72.1%	33.6%	77.1%	58.3%	75.8%	84.6%

^Source:^{Google DeepMind — Gemini 3.5 Flash}

Flash leads across agentic, tool-use, and multimodal benchmarks. In coding, it beats Gemini 3.1 Pro on both tasks, though GPT-5.5 and Claude Opus 4.7 lead their respective categories. On deep reasoning and long-context retrieval, flagship Pro models retain an edge — a gap Google appears to be holding for the forthcoming 3.5 Pro.

Why Google led with Flash, not Pro

Google's decision to lead the 3.5 series with Flash — not Pro — is a signal. For the workflows that matter most in production today — agents, tool use, coding loops — raw reasoning depth matters less than the combination of quality, speed, and cost.

Running four times faster than comparable frontier models and priced at $1.50 / $9.00 per million input/output tokens, Flash makes agentic pipelines dramatically cheaper to run at scale.

Production evaluations support this. Box's CTO Ben Kus reported that 3.5 Flash beat the previous Flash generation by 19.6% on real-world enterprise workflows, with life sciences data extraction accuracy improving by 96.4%. JetBrains' Nick Frolov noted a 10–20% improvement in coding performance over the previous Flash generation.

Does Gemini 3.5 Flash hold up on your endpoint?

Official benchmarks use proprietary harnesses, full task sets, and the vendor's own evaluation stack. The relevant question for platform teams is different: what do you get on your base URL, with your model ids, on prompts you can rerun?

We ran a 15-prompt text-only harness through TrueFoundry AI Gateway across the same three categories Google highlighted — CharXiv-style, MMMU-Pro-style, and Finance Agent v2-style — scored against reference answers.

Model	Accuracy	Mean latency	Total cost	Cost / correct
Claude Opus 4.7	66.7%(10/15)	2,538 ms	$0.045	$0.0045
GPT-5.5	60.0% (9/15)	3,017 ms	$0.020	$0.0022
Gemini 3.5 Flash	46.7% (7/15)	3,529 ms	$0.091	$0.0130

Suite	Claude Opus 4.7	GPT-5.5	Gemini 3.5 Flash
CharXiv-style	80%	80%	80%
MMMU-Pro-style	80%	80%	60%
Finance Agent v2-style	40%	20%	0%

This run does not refute Google's official numbers — they use different harnesses and a different evaluation stack. What it shows is that benchmark rankings don't automatically transfer to your endpoint. On our slice, Flash's Finance-style score was 0/5, with failures driven by long completions that didn't match the expected format. The cost picture was equally stark: Flash carried the highest total spend and the fewest correct answers, putting its cost per correct answer at ~6× GPT-5.5.

The metric that matters when models are interchangeable behind a gateway is cost per correct answer: price per token × tokens per attempt ÷ probability of a usable response.

The 1M-token context window

Gemini 3.5 Flash supports a one-million-token context window — enough to hold an entire codebase, a lengthy regulatory document, or the full trace of a long-running autonomous task in a single session. Retrieval benchmarks suggest the window is genuinely usable at that length, rather than degrading at the long tail.

Gemini Spark and what Google is signaling

Also announced at I/O: Gemini Spark, Google's new 24/7 personal AI agent, is powered by 3.5 Flash. The model is now the default across the Gemini app and AI Mode in Google Search globally. Google is deploying 3.5 Flash as the production default for both their highest-traffic consumer products and their most ambitious agentic experiments — not as a stepping stone.

What to watch for

3.5 Pro next month. Google confirmed 3.5 Pro is already in internal use. If 3.5 Flash already beats 3.1 Pro on most benchmarks, the question is what 3.5 Pro does on the reasoning and long-context tasks where Flash still trails.

MCP Atlas leadership. Flash's lead on MCP Atlas — the benchmark for multi-step tool workflows using the Model Context Protocol — signals that Google has made tool orchestration a first-class training objective. For teams building MCP-native architectures, this is worth taking seriously.

Run it on TrueFoundry

TrueFoundry AI Gateway gives you access to Gemini 3.5 Flash alongside GPT-5.5, Claude Opus 4.7, and other frontier models through a single endpoint — the same setup used for the validation above. Unified request tracing, cost attribution by model and team, no separate API keys per provider.

Try it · Quick start · Book a demo

_{Official benchmark data:}_{Google DeepMind — Gemini 3.5 Flash}_{, May 19, 2026. TrueFoundry validation run: May 20, 2026, 15-prompt text-only harness via TrueFoundry AI Gateway.}

‍

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now