Best AI Models, Ranked — June 2026 Leaderboard for Text,

Last verified: 2026-06-15 (SGT/MYT) · Next update: mid-July 2026. Mindber Data Drop v2026.06. Every figure is attributed to its published source and date — see Method & sources below. Prices are blended/illustrative and change frequently; confirm against each provider's live pricing before committing spend.

By Mindber Research · AI model tracking. Figures checked against the cited leaderboards on 2026-06-15.

How we assessed this: AI-assisted editorial analysis that aggregates published results from independent leaderboards (Artificial Analysis, vals.ai, Scale AI SEAL, tbench.ai, τ²-bench, LMArena) and vendor pricing pages, as of June 2026. Mindber did not run its own private benchmarks and this is not hands-on product testing. Every number is attributed to its origin and date; any figure we could not confirm against a live source was dropped, not guessed.

The short answer: there is no single "best AI model" in June 2026 — there is a best model per job, per budget. Right now Claude Fable 5 tops raw capability, GPT-5.5 leads coding agents, Gemini 3.1 Pro is the best frontier value, open-weight models (MiniMax-M3, DeepSeek V4, Qwen3.7 Max) close most of the gap for a fraction of the price, Nano Banana 2 and GPT Image lead image, and Veo 3.1 / Kling 3.0 lead video now that Sora 2 has been retired.

Below is the full breakdown — and, more importantly, the part most leaderboards skip: which numbers are real and which to ignore. For the live Mindber view of the same field, see the Model Arena board and the weekly LLM rankings; to compare two models head-to-head on price and capability, use the compare tool.

Three numbers that frame June 2026

65

Claude Fable 5 — top of the Artificial Analysis Intelligence Index, ~8 points clear of the value tier

Artificial Analysis, June 2026

83.4%

Codex CLI on GPT-5.5 — the agentic-coding lead on Terminal-Bench 2.1, ahead of Claude Code on Opus 4.8 (78.9%)

tbench.ai, June 2026

~$0.18

DeepSeek V4 Pro blended per 1M tokens — frontier-adjacent quality at roughly one-tenth the price of the top closed models

Artificial Analysis, June 2026

TL;DR — best model by category (June 2026)

Job	Top pick	Best value alternative	The number that matters
Text & reasoning	Claude Fable 5	Gemini 3.1 Pro / Qwen3.7 Max	AA Intelligence Index 65 vs 57
Coding (model)	Claude Fable 5 / Opus 4.8	DeepSeek V4 / MiniMax-M3	SWE-bench Verified — but read the caveat
Coding agent (tool)	GPT-5.5 (Codex CLI)	Claude Opus 4.8 (Claude Code)	Terminal-Bench 2.1: 83.4% vs 78.9%
General agent / tool use	GPT-5.5	GLM-5 family (customer-service tasks)	Benchmark-dependent — no universal winner
Image	Nano Banana 2	Seedream 5.0 (volume)	Human-preference Arena + per-image cost
Video	Veo 3.1 (cinematic + audio)	Kling 3.0 (~$0.10/sec)	Sora 2 is being shut down — migrate off it
Cheapest at frontier quality	DeepSeek V4 Pro	MiniMax-M3	~$0.18–0.22 blended /1M tokens
Fastest output	Mercury 2	Gemini 3.1 Flash-Lite	~889 t/s vs ~326 t/s

Capability figures: Artificial Analysis Intelligence Index, June 2026 (381 models). Coding: vals.ai SWE-bench Verified + Scale AI SEAL. Agents: tbench.ai Terminal-Bench 2.1. We attribute every number to its source and date — see Method below.

What changed this month

The frontier moved again in late May and early June:

Claude Fable 5 went GA on 9 June 2026 ($10 / $50 per 1M tokens, 1M-token context). It debuted at #1 on the Artificial Analysis Intelligence Index (65) and top of SWE-bench Verified (95.0%). We unpack access, safeguards and prompts in the Claude Fable 5 guide, and you can see Anthropic's own framing in its announcement.
Claude Opus 4.8 shipped 28 May 2026 ($5 / $25). It posts 88.6% on SWE-bench Verified and 74.6% on Terminal-Bench 2.1 — the strongest price-to-capability point in the Claude line. See its scorecard and the Opus 4.8 cost calculator for break-even math.
GPT-5.5 (23 April 2026) is OpenAI's default everyday model, with a reported ~60% drop in hallucinations versus GPT-5.4. It currently leads agentic coding via Codex; current rates are on the OpenAI pricing page.
Sora 2 is being retired. OpenAI shut the Sora web/app on 26 April 2026; the API shuts down 24 September 2026. Do not start new video pipelines on it.
Open weights nearly caught up. DeepSeek V4, MiniMax-M3 and Qwen3.7 Max now sit within ~0.2 points of Gemini 3.1 Pro on SWE-bench Verified — at roughly one-tenth the token price.

The headline takeaway: the top of the table is now a plateau, not a gap. The interesting decisions in 2026 are about cost, speed, and fit — not about chasing the #1 row.

1) Text & reasoning

The cleanest single capability number is the Artificial Analysis Intelligence Index — a composite of GPQA Diamond, MMLU-Pro, AIME, LiveCodeBench and several other benchmarks, normalised to one score.

#	Model	Creator	Intelligence Index	Blended price /1M	Context
1	Claude Fable 5 (max effort)	Anthropic	65	$7.70	1M
2	Claude Opus 4.8 (max)	Anthropic	61	$3.85	1M
3	GPT-5.5 (xhigh)	OpenAI	60	$4.35	922k
4	GPT-5.5 (high)	OpenAI	59	$4.35	922k
5	Gemini 3.1 Pro Preview	Google	57	$1.74	1M
5	Qwen3.7 Max	Alibaba	57	$1.43	1M
5	Claude Opus 4.7 (max)	Anthropic	57	$3.85	1M
8	Gemini 3.5 Flash	Google	55	$1.31	1M
8	MiniMax-M3 (open)	MiniMax	55	$0.22	1M
10	Kimi K2.6 (open)	Moonshot	54	$0.70	256k

Source: Artificial Analysis Intelligence Index, June 2026.

Read it like this: the top five are separated by ~8 points across a broad reasoning suite — close enough that for most real workloads they're interchangeable on quality. Where they separate hard is price. Gemini 3.1 Pro delivers index-57 reasoning at $1.74; Qwen3.7 Max matches it at $1.43; MiniMax-M3 lands index-55 at $0.22. Paying Fable-5 prices ($7.70 blended) only makes sense for the genuinely hardest 5–10% of tasks. If your spend is dominated by a high volume of medium-difficulty calls, the value tier is not a compromise — it's the correct default, and you can sanity-check the trade on the Mindber rankings.
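To make the price gap concrete, here is a minimal sketch that multiplies the blended prices from the table by a monthly token volume. The 2-billion-token workload is an illustrative assumption, not a sourced figure, and real blended prices shift with your input/output mix.

```python
# Blended prices (USD per 1M tokens) copied from the table above.
BLENDED_PRICE_PER_M = {
    "Claude Fable 5": 7.70,
    "Gemini 3.1 Pro": 1.74,
    "Qwen3.7 Max": 1.43,
    "MiniMax-M3": 0.22,
}


def monthly_cost(price_per_million: float, tokens_per_month: int) -> float:
    """Return monthly spend in USD for a blended price and a token volume."""
    return price_per_million * tokens_per_month / 1_000_000


# Hypothetical workload: 2 billion tokens per month (assumption for illustration).
TOKENS_PER_MONTH = 2_000_000_000

costs = {
    model: monthly_cost(price, TOKENS_PER_MONTH)
    for model, price in BLENDED_PRICE_PER_M.items()
}

for model, cost in costs.items():
    print(f"{model:<16} ${cost:>10,.2f}")
```

At that assumed volume the same traffic costs about $15,400 a month on Fable 5 and about $440 on MiniMax-M3, which is why the default matters more than the top row.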

Human preference vs benchmarks: LMArena (blind A/B voting) and the Intelligence Index measure different things — one captures what people like, the other what models can do. The Claude and Gemini families trade the top of LMArena's text board, and those rankings shift week to week. When the two leaderboards disagree, that gap usually means a model is either over- or under-tuned for chat style, not that one source is "wrong." This is exactly why Mindber's scoring methodology keeps capability and preference as separate axes rather than collapsing them into one number.

2) Coding

This is the category with the most misleading numbers on the internet, so read carefully.

#	Model	SWE-bench Verified	Price /1M (in/out)
1	Claude Fable 5	95.0%	$10 / $50
2	Claude Opus 4.8	88.6%	$5 / $25
3	GPT-5.5	82.6%	$5 / $30
4	Claude Opus 4.7	~82%	$5 / $25
5	MiniMax-M3 (open)	80.5%	$0.30 / $1.20
6	Gemini 3.5 Flash	78.8%	$1.31 blended

Source: vals.ai SWE-bench Verified, June 2026. (Reported Opus 4.7 scores vary 82–88% across harnesses — see caveat.)

⚠️ The reality check most leaderboards won't give you

SWE-bench Verified is partly saturated and partly memorised. OpenAI's own audit found that frontier models can reproduce verbatim "gold" patches for some tasks — the 500 Python issues leaked into training data before the benchmark was widely published. OpenAI stopped reporting Verified scores and now points to SWE-bench Pro instead.

On Scale AI's standardised SEAL leaderboard (identical scaffolding for every model), the numbers collapse:

Best public standardised score: ~59.1% (GPT-5.4 xHigh)
Private commercial set: no model exceeds ~47.1%
Typical drop moving from Verified → Pro: 15–35 points

So when you see "95% on SWE-bench," translate it to: "saturated benchmark, real-world success rate is roughly half that on unseen, harder code." Use Pro / standardised numbers for procurement decisions, and Verified only for rough relative ranking. The deeper lesson is one Mindber's verification methodology leans on hard: a headline benchmark number is a starting hypothesis, not a purchase decision.

3) Agents & tool use

For agentic work, the harness matters as much as the model. The same model scores differently inside Codex CLI, Claude Code, or a custom scaffold, so agent leaderboards rank agent + model pairs, not models alone.

Terminal-Bench 2.1 (operate a real computer via terminal — compile code, set up servers, run data workflows):

#	Agent + model	Score
1	Codex CLI on GPT-5.5	83.4%
2	Claude Code on Opus 4.8	78.9%
3	Gemini CLI on Gemini 3.1 Pro	70.7% (±2.9)

Source: tbench.ai, June 2026.

Customer-service / structured tool use (τ²-bench): a different picture entirely — GLM-family models (e.g. GLM-4.7-Flash at 98.8%) top the retail/airline tool-calling tasks. A model that wins terminal automation can lose at multi-turn customer-service tool use. Pick your agent by the task you actually run, not by a single board — and if you're unsure which models even belong on your shortlist, start from the AI tools directory filtered to your use case.

4) Image generation

The image race has split into clear lanes — there is no overall #1, only a best-per-lane.

Best all-rounder / character consistency: Nano Banana 2 (Gemini 3.1 Flash Image). Native 4K, keeps faces and styles stable across edits — ideal for serial content (mascots, storyboards, campaigns). Premium at ~$0.13–0.24/image.
Best text & typography: GPT Image (1.5 / 2). A "thinking" latent space that reasons through spatial instructions — the one model you can trust to spell a headline correctly. Consistently top-rated on Arena.ai for prompt adherence.
Best value / high-volume: Seedream 5.0 (ByteDance). Production-grade 4K at ~$0.026–0.032/image — built for e-commerce catalogues and content calendars.
Best for logos & posters: Ideogram v3.
Best for brand/style locking & open weights: Flux 2 Pro (dev/pro/max tiers).
Best for non-English prompts: Qwen Image (strong on Chinese, Arabic, Spanish).
Fastest: Z-Image Turbo (~1 second per image).

For Southeast Asian / multilingual creators: Qwen Image and Seedream handle Chinese and mixed-script prompts more reliably than Western-tuned models, and Seedream's per-image economics make batch product shots realistic on a small budget. You can browse the image-generation field, with Mindber scores and live pricing, in the discovery directory.
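As a rough illustration of those per-image economics, the sketch below prices a batch job using the per-image ranges quoted in the list above. The 5,000-image catalogue size is a hypothetical workload, and the ranges are the approximate figures from this article, not live vendor quotes.

```python
# Approximate per-image price ranges (USD) quoted in the list above.
PRICE_RANGE = {
    "Nano Banana 2": (0.13, 0.24),
    "Seedream 5.0": (0.026, 0.032),
}


def batch_cost(model: str, images: int) -> tuple[float, float]:
    """Return the (low, high) cost in USD of generating `images` images."""
    low, high = PRICE_RANGE[model]
    return (round(low * images, 2), round(high * images, 2))


# Hypothetical catalogue size (assumption for illustration).
CATALOGUE_SIZE = 5_000

for model in PRICE_RANGE:
    low, high = batch_cost(model, CATALOGUE_SIZE)
    print(f"{model:<14} ${low:,.2f} to ${high:,.2f}")
```

For that assumed batch, Seedream 5.0 lands at roughly $130 to $160 against $650 to $1,200 for Nano Banana 2, before counting retries or rejected outputs.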

5) Video generation

The big story is a departure: Sora 2 is being shut down (web/app 26 April 2026; API 24 September 2026). If you're on it, plan your migration now. Here's the field that remains:

Best cinematic quality + native audio: Veo 3.1 (Google). The only model generating 48kHz synchronised dialogue — not just sound effects. Best photorealism on human subjects and natural light. ~$0.15–1.20 per 5-second clip by tier.
Best value: Kling 3.0 (Kuaishou). Native 4K, 60fps, multilingual lip-sync, ~$0.10/second — the iteration workhorse.
Hottest image-to-video: Seedance 2.0 (ByteDance). Strong stylised motion and short-form vertical content.
New frontier contender: HappyHorse-1.0 (Alibaba). Joint audio-video, 7-language lip-sync, climbing the Artificial Analysis video board; live on fal.ai.
Best creative control: Runway Gen-4.5. Motion brushes, scene consistency and a real timeline editor — it lost the leaderboard lead but still wins for directed, multi-shot work.
Best HDR: Luma Ray3.14 (native 16-bit HDR).

Note: video arena scores live on different scales (LMArena text-to-video vs Artificial Analysis), so cross-board number comparisons are unreliable. Treat these as lane leaders, not a single ranked ladder.

6) Best value & open-weight (the bootstrap lane)

If you're shipping a product and watching margins, this is the most important table in this report. Open weights are now frontier-adjacent at a fraction of the cost:

Model	Index	Price /1M	Why pick it
Gemini 3.1 Pro	57	$1.74	Best closed frontier value
Qwen3.7 Max	57	$1.43	Frontier reasoning, 1M context, strong multilingual
MiniMax-M3 (open)	55	$0.22	Near-frontier, open weights, 1M context
Kimi K2.6 (open)	54	$0.70	Strong open reasoning
DeepSeek V4 Pro (open)	52	$0.18	Cheapest credible workhorse; cache hits drop input further
GLM-5.1 (open)	51	$0.90	Strong tool use / agentic

Source: Artificial Analysis, June 2026.

The routing play: the cost-optimal setup isn't one model — it's a router. Pin ~80% of traffic to a cheap workhorse (DeepSeek V4 / MiniMax-M3 / a small Gemini Flash) and reserve a frontier model (Opus 4.8 / Fable 5) for the hard 20%. Done right, this beats any single-model subscription on both cost and quality. The economics of that split — and why the rate card is only a fraction of the real bill — are worked through end-to-end in The True Cost of AI Tools 2026.
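The routing arithmetic can be sketched as follows. This models price only: it uses the blended prices from the tables above and an 80/20 split, and it ignores retries, caching and quality differences. The `route` function and its difficulty score are hypothetical; a real router needs its own classifier or heuristic to decide what counts as hard.

```python
# Blended prices (USD per 1M tokens) from the tables above.
CHEAP = ("DeepSeek V4 Pro", 0.18)
FRONTIER = ("Claude Opus 4.8", 3.85)


def effective_price(cheap_price: float, frontier_price: float, cheap_share: float) -> float:
    """Weighted blended price when `cheap_share` of tokens go to the cheap model."""
    return cheap_share * cheap_price + (1 - cheap_share) * frontier_price


def route(difficulty: float, threshold: float = 0.8) -> str:
    """Pick a model name from a difficulty score in [0, 1].

    The score is assumed to come from a hypothetical upstream classifier;
    the 0.8 threshold only yields an 80/20 split if scores are spread evenly.
    """
    return FRONTIER[0] if difficulty >= threshold else CHEAP[0]


routed = effective_price(CHEAP[1], FRONTIER[1], cheap_share=0.8)
saving = 1 - routed / FRONTIER[1]

print(f"Routed price: ${routed:.3f} per 1M tokens")
print(f"Saving vs all-frontier: {saving:.0%}")
print(route(0.95), "|", route(0.10))
```

Under these assumptions the routed price works out to about $0.91 per 1M tokens against $3.85 for sending everything to Opus 4.8, roughly 76% less. The saving shrinks quickly if the router misclassifies easy work as hard, so the split is worth measuring rather than assuming.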

7) Speed (for real-time & long agent chains)

When latency compounds across many sequential steps, throughput becomes the deciding metric:

Mercury 2 (Inception, diffusion LLM) — ~889 tokens/sec
Granite 4.0 H Small (IBM) — ~524 t/s
Step 3.7 Flash — ~385 t/s
gpt-oss-120b (high) — ~338 t/s
Gemini 3.1 Flash-Lite — ~326 t/s

Source: Artificial Analysis median output speed, June 2026. For chat UX, anything over ~150 t/s feels instant; speed matters most for agentic loops and batch jobs, where every extra second is multiplied by the number of sequential steps in the chain.
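A quick way to see the compounding is to multiply it out. The sketch below estimates generation time for a sequential agent chain from output speed alone. The chain shape (40 steps of 600 output tokens each) is an assumed example, and the estimate ignores time to first token, network latency and tool execution, all of which add to the real figure.

```python
# Median output speeds (tokens/sec) from the list above, plus the ~150 t/s chat threshold.
OUTPUT_SPEED_TPS = {
    "Mercury 2": 889,
    "Gemini 3.1 Flash-Lite": 326,
    "150 t/s baseline": 150,
}


def chain_generation_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total time spent generating output across a chain of sequential steps."""
    return steps * tokens_per_step / tokens_per_second


# Hypothetical agent chain: 40 sequential steps, 600 output tokens each.
STEPS = 40
TOKENS_PER_STEP = 600

for name, tps in OUTPUT_SPEED_TPS.items():
    seconds = chain_generation_seconds(STEPS, TOKENS_PER_STEP, tps)
    print(f"{name:<22} {seconds:6.1f} s")
```

For this assumed chain, generation alone takes about 27 seconds at 889 t/s, about 74 seconds at 326 t/s, and 160 seconds at 150 t/s. A speed that feels instant in a single chat reply becomes minutes of waiting once the steps run back to back.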

How to actually pick a model

Stop optimising for the #1 row. Match the model to the job:

Hardest reasoning, money no object → Claude Fable 5 or Opus 4.8.
Best quality per dollar at the frontier → Gemini 3.1 Pro or Qwen3.7 Max.
Self-hosting / data residency / lowest cost → MiniMax-M3, DeepSeek V4, or Qwen3.7 Max.
Coding inside an agent → GPT-5.5 via Codex, or Opus 4.8 via Claude Code.
Image — general → Nano Banana 2; text in image → GPT Image; high volume → Seedream 5.
Video — cinematic + audio → Veo 3.1; value/iteration → Kling 3.0.
Real-time / high throughput → Mercury 2 or a Flash-tier model.

The decision grid below is the same logic in a form you can hand to a buyer:

The buyer's decision grid

Quality over cost

Hardest reasoning

Claude Fable 5 (index 65) or Opus 4.8 (61)
Worth it for the hardest 5–10% of tasks
Route easy work elsewhere — don't default here

Quality per dollar

Best value at frontier

Gemini 3.1 Pro ($1.74) or Qwen3.7 Max ($1.43)
Index 57 — within ~8 points of the top
The correct default for most production traffic

Margins or data residency

Lowest cost / self-host

MiniMax-M3 ($0.22), DeepSeek V4 ($0.18)
Open weights, 1M context, self-hostable
Cache hits drop the input rate further

Harness matters as much as model

Coding inside an agent

GPT-5.5 via Codex tops Terminal-Bench 2.1
Opus 4.8 via Claude Code is close behind
Rank agent+model pairs, not models alone

Best-per-lane, no overall #1

Image & video

Image: Nano Banana 2 / GPT Image / Seedream 5
Video: Veo 3.1 (audio) or Kling 3.0 (value)
Sora 2 API closes 24 Sep 2026 — migrate

Latency compounds in agent loops

Real-time / high throughput

Mercury 2 (~889 t/s) or a Flash-tier model
>150 t/s already feels instant in chat
Speed is decisive for batch + multi-step chains

FAQ

What is the best AI model right now (June 2026)?

For raw capability, Claude Fable 5 leads the Artificial Analysis Intelligence Index (65). But "best" depends on the task: GPT-5.5 leads agentic coding, Gemini 3.1 Pro is the best value, and open models like MiniMax-M3 are best for cost-sensitive deployment. The live Mindber view is on the Model Arena board.

Is Claude better than GPT-5.5?

On the composite Intelligence Index, Claude Fable 5 (65) and Opus 4.8 (61) sit above GPT-5.5 (60). On agentic coding (Terminal-Bench 2.1), GPT-5.5 via Codex (83.4%) currently edges Opus 4.8 via Claude Code (78.9%). They're close enough that workflow fit and price usually decide — the Opus 4.8 cost calculator helps with the money side.

What is the best free or open-source AI model?

MiniMax-M3 (Intelligence Index 55) is the strongest near-frontier open-weight model, followed by Kimi K2.6 (54) and DeepSeek V4 Pro (52). All are self-hostable and dramatically cheaper than closed frontier models.

What is the cheapest good AI model?

DeepSeek V4 Pro (~$0.18 blended /1M tokens, index 52) and MiniMax-M3 (~$0.22, index 55) offer frontier-adjacent quality at roughly one-tenth the price of top closed models.

What is the best AI model for coding?

By model: Claude Fable 5 / Opus 4.8 lead SWE-bench Verified. By coding agent: GPT-5.5 (Codex) tops Terminal-Bench 2.1. Note SWE-bench Verified is partly saturated — check SWE-bench Pro for real-world signal.

Why are SWE-bench scores so high — are they real?

Treat 90%+ SWE-bench Verified scores with caution. The benchmark has known training-data contamination; OpenAI stopped reporting it. On Scale's standardised SEAL leaderboard the best public score is ~59%, and no model exceeds ~47% on the private set. Real-world coding success is roughly half the Verified headline.

What is the best AI image generator in 2026?

Nano Banana 2 for general use and character consistency, GPT Image for text/typography, and Seedream 5.0 for high-volume, cost-sensitive production.

What is the best AI video generator now that Sora is gone?

Veo 3.1 for cinematic quality with native synchronised audio, and Kling 3.0 for the best value (~$0.10/second). Sora 2's API shuts down 24 September 2026.

How often is this leaderboard updated?

Monthly. This is the June 2026 edition; the next refresh lands mid-July 2026. Between editions, the Model Arena board and What's New feed track launches as they land.

Method & sources

We don't run our own private benchmarks or invent scores. This leaderboard aggregates published results from independent sources and attributes every figure to its origin and date — that transparency is the point, and it is the same standard our scoring methodology holds every product page to.

Capability / price / speed: Artificial Analysis Intelligence Index (381 models), June 2026.
Coding: vals.ai (SWE-bench Verified) and Scale AI SEAL (SWE-bench Pro, standardised scaffolding), June 2026.
Agents: tbench.ai (Terminal-Bench 2.1) and τ²-bench, June 2026.
Human preference: LMArena (blind A/B voting), June 2026.
Vendor pricing & specs: Anthropic, OpenAI and Google Gemini pricing pages, June 2026.

Prices are blended/illustrative and change frequently — confirm against each provider's live pricing before committing spend. Some research-preview models (e.g. Mythos-tier previews) appear on leaderboards but are not generally available; we rank the publicly usable field. For the full picture of what a model actually costs once retries, output asymmetry, and idle seats are counted, read The True Cost of AI Tools 2026.

Spotted an error or a new release we missed? Tell us. Reader reports are the fastest way to improve a leaderboard.

Explore more on Mindber: the live Model Arena ranking · What's New · the weekly LLM rankings · the full AI tools directory · all our guides.
