Best AI Models, Ranked — June 2026 Leaderboard for Text, Coding, Agents, Image & Video (With Sources)

Guide · Updated June 15, 2026 · 16 min read

The best AI model in June 2026 depends on the job: Claude Fable 5 leads capability, GPT-5.5 coding agents, Gemini 3.1 Pro value. Ranked, with sources.

#best AI model 2026 · #AI model leaderboard · #best LLM 2026 · #best AI model for coding · #Claude Fable 5 · #GPT-5.5 · #Gemini 3.1 Pro · #open-weight models · #AI models

Last verified: 2026-06-15 (SGT/MYT) · Next update: mid-July 2026. Mindber Data Drop v2026.06. Every figure is attributed to its published source and date — see Method & sources below. Prices are blended/illustrative and change frequently; confirm against each provider's live pricing before committing spend.

By Mindber Research · AI model tracking. Figures checked against the cited leaderboards on 2026-06-15.

How we assessed this: AI-assisted editorial analysis that aggregates published results from independent leaderboards (Artificial Analysis, vals.ai, Scale AI SEAL, tbench.ai, τ²-bench, LMArena) and vendor pricing pages, as of June 2026. Mindber did not run its own private benchmarks and this is not hands-on product testing. Every number is attributed to its origin and date; any figure we could not confirm against a live source was dropped, not guessed.

The short answer: there is no single "best AI model" in June 2026; there is a best model per job, per budget. Right now Claude Fable 5 tops raw capability, GPT-5.5 leads coding agents, Gemini 3.1 Pro is the best frontier value, open-weight models (MiniMax-M3, DeepSeek V4, Qwen3.7 Max) close most of the gap for a fraction of the price, Nano Banana 2 and GPT Image lead image, and Veo 3.1 / Kling 3.0 lead video now that Sora 2 is being retired (web/app already closed, API closing 24 September 2026).

Below is the full breakdown — and, more importantly, the part most leaderboards skip: which numbers are real and which to ignore. For the live Mindber view of the same field, see the Model Arena board and the weekly LLM rankings; to compare two models head-to-head on price and capability, use the compare tool.

Three numbers that frame June 2026

  • 65: Claude Fable 5, top of the Artificial Analysis Intelligence Index, ~8 points clear of the value tier (Artificial Analysis, June 2026)
  • 83.4%: Codex CLI on GPT-5.5, the agentic-coding lead on Terminal-Bench 2.1, ahead of Claude Code on Opus 4.8 at 78.9% (tbench.ai, June 2026)
  • ~$0.18: DeepSeek V4 Pro blended per 1M tokens, frontier-adjacent quality at roughly one-tenth the price of the top closed models (Artificial Analysis, June 2026)

TL;DR — best model by category (June 2026)

| Job | Top pick | Best value alternative | The number that matters |
|---|---|---|---|
| Text & reasoning | Claude Fable 5 | Gemini 3.1 Pro / Qwen3.7 Max | AA Intelligence Index 65 vs 57 |
| Coding (model) | Claude Fable 5 / Opus 4.8 | DeepSeek V4 / MiniMax-M3 | SWE-bench Verified (but read the caveat) |
| Coding agent (tool) | GPT-5.5 (Codex CLI) | Claude Opus 4.8 (Claude Code) | Terminal-Bench 2.1: 83.4% vs 78.9% |
| General agent / tool use | GPT-5.5 | GLM-5 family (customer-service tasks) | Benchmark-dependent; no universal winner |
| Image | Nano Banana 2 | Seedream 5.0 (volume) | Human-preference Arena + per-image cost |
| Video | Veo 3.1 (cinematic + audio) | Kling 3.0 (~$0.10/sec) | Sora 2 is being shut down; migrate off it |
| Cheapest at frontier quality | DeepSeek V4 Pro | MiniMax-M3 | ~$0.18–0.22 blended /1M tokens |
| Fastest output | Mercury 2 | Gemini 3.1 Flash-Lite | ~889 t/s vs ~326 t/s |

Capability figures: Artificial Analysis Intelligence Index, June 2026 (381 models). Coding: vals.ai SWE-bench Verified + Scale AI SEAL. Agents: tbench.ai Terminal-Bench 2.1. We attribute every number to its source and date — see Method below.

What changed this month

The frontier moved again in late May and early June:

  • Claude Fable 5 went GA on 9 June 2026 ($10 / $50 per 1M tokens, 1M-token context). It debuted at #1 on the Artificial Analysis Intelligence Index (65) and top of SWE-bench Verified (95.0%). We unpack access, safeguards and prompts in the Claude Fable 5 guide, and you can see Anthropic's own framing in its announcement.
  • Claude Opus 4.8 shipped 28 May 2026 ($5 / $25). It posts 88.6% on SWE-bench Verified and 78.9% on Terminal-Bench 2.1 (via Claude Code), the strongest price-to-capability point in the Claude line. See its scorecard and the Opus 4.8 cost calculator for break-even math.
  • GPT-5.5 (23 April 2026) is OpenAI's default everyday model, with a reported ~60% drop in hallucinations versus GPT-5.4. It currently leads agentic coding via Codex; current rates are on the OpenAI pricing page.
  • Sora 2 is being retired. OpenAI shut the Sora web/app on 26 April 2026; the API shuts down 24 September 2026. Do not start new video pipelines on it.
  • Open weights nearly caught up. DeepSeek V4, MiniMax-M3 and Qwen3.7 Max now sit within ~0.2 points of Gemini 3.1 Pro on SWE-bench Verified — at roughly one-tenth the token price.

The headline takeaway: the top of the table is now a plateau, not a gap. The interesting decisions in 2026 are about cost, speed, and fit — not about chasing the #1 row.

The frontier is a plateau, not a ladder

For most real workloads, the top five reasoning models are interchangeable on quality — they separate on price, latency, and how well they fit your harness. Chasing the #1 row is the most common way teams overpay. Pick the cheapest model that clears your task's quality bar, and reserve the frontier tier for the genuinely hard slice.

1) Text & reasoning

The cleanest single capability number is the Artificial Analysis Intelligence Index — a composite of GPQA Diamond, MMLU-Pro, AIME, LiveCodeBench and several other benchmarks, normalised to one score.

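| # | Model | Creator | Intelligence Index | Blended price /1M | Context |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 (max effort) | Anthropic | 65 | $7.70 | 1M |
| 2 | Claude Opus 4.8 (max) | Anthropic | 61 | $3.85 | 1M |
| 3 | GPT-5.5 (xhigh) | OpenAI | 60 | $4.35 | 922k |
| 4 | GPT-5.5 (high) | OpenAI | 59 | $4.35 | 922k |
| 5 | Gemini 3.1 Pro Preview | Google | 57 | $1.74 | 1M |
| 5 | Qwen3.7 Max | Alibaba | 57 | $1.43 | 1M |
| 5 | Claude Opus 4.7 (max) | Anthropic | 57 | $3.85 | 1M |
| 8 | Gemini 3.5 Flash | Google | 55 | $1.31 | 1M |
| 8 | MiniMax-M3 (open) | MiniMax | 55 | $0.22 | 1M |
| 10 | Kimi K2.6 (open) | Moonshot | 54 | $0.70 | 256k |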

Source: Artificial Analysis Intelligence Index, June 2026.

Read it like this: the top five are separated by ~8 points across a broad reasoning suite — close enough that for most real workloads they're interchangeable on quality. Where they separate hard is price. Gemini 3.1 Pro delivers index-57 reasoning at $1.74; Qwen3.7 Max matches it at $1.43; MiniMax-M3 lands index-55 at $0.22. Paying Fable-5 prices ($7.70 blended) only makes sense for the genuinely hardest 5–10% of tasks. If your spend is dominated by a high volume of medium-difficulty calls, the value tier is not a compromise — it's the correct default, and you can sanity-check the trade on the Mindber rankings.
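The "cheapest model that clears the bar" rule can be sketched in a few lines. This is a minimal illustration, assuming the Intelligence Index is an acceptable proxy for your task's quality bar (it may not be); the rows are copied from the table above, and the function name `cheapest_clearing` is hypothetical.

```python
# Rows copied from the Intelligence Index table (Artificial Analysis, June 2026).
# Each row: (model, intelligence_index, blended_usd_per_1m_tokens)
MODELS = [
    ("Claude Fable 5 (max effort)", 65, 7.70),
    ("Claude Opus 4.8 (max)", 61, 3.85),
    ("GPT-5.5 (xhigh)", 60, 4.35),
    ("GPT-5.5 (high)", 59, 4.35),
    ("Gemini 3.1 Pro Preview", 57, 1.74),
    ("Qwen3.7 Max", 57, 1.43),
    ("Claude Opus 4.7 (max)", 57, 3.85),
    ("Gemini 3.5 Flash", 55, 1.31),
    ("MiniMax-M3 (open)", 55, 0.22),
    ("Kimi K2.6 (open)", 54, 0.70),
]


def cheapest_clearing(min_index):
    """Return the cheapest row whose index is >= min_index, or None if none qualifies."""
    eligible = [row for row in MODELS if row[1] >= min_index]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row[2])


if __name__ == "__main__":
    for bar in (55, 57, 60, 65):
        print(bar, cheapest_clearing(bar))
```

With these rows, a bar of 55 selects MiniMax-M3, 57 selects Qwen3.7 Max, 60 selects Opus 4.8, and only a bar of 65 forces Fable 5. In practice the bar should come from your own evaluation set rather than a composite index.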

Human preference vs benchmarks: LMArena (blind A/B voting) and the Intelligence Index measure different things — one captures what people like, the other what models can do. The Claude and Gemini families trade the top of LMArena's text board, and those rankings shift week to week. When the two leaderboards disagree, that gap usually means a model is either over- or under-tuned for chat style, not that one source is "wrong." This is exactly why Mindber's scoring methodology keeps capability and preference as separate axes rather than collapsing them into one number.

2) Coding

This is the category with the most misleading numbers on the internet, so read carefully.

| # | Model | SWE-bench Verified | Price /1M (in/out) |
|---|---|---|---|
| 1 | Claude Fable 5 | 95.0% | $10 / $50 |
| 2 | Claude Opus 4.8 | 88.6% | $5 / $25 |
| 3 | GPT-5.5 | 82.6% | $5 / $30 |
| 4 | Claude Opus 4.7 | ~82% | $5 / $25 |
| 5 | MiniMax-M3 (open) | 80.5% | $0.30 / $1.20 |
| 6 | Gemini 3.5 Flash | 78.8% | $1.31 blended |

Source: vals.ai SWE-bench Verified, June 2026. (Reported Opus 4.7 scores vary 82–88% across harnesses — see caveat.)
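Because input and output tokens are priced differently, the in/out column is easier to compare once it is turned into a per-task cost. The sketch below does that arithmetic with the list prices from the table; the task size (20,000 input tokens, 4,000 output tokens) is an assumption chosen purely for illustration, and the function name `task_cost` is hypothetical. It ignores caching, retries and any volume discounts.

```python
# List prices from the coding table, in USD per 1M tokens: (input, output).
PRICES = {
    "Claude Fable 5": (10.00, 50.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "MiniMax-M3 (open)": (0.30, 1.20),
}


def task_cost(model, input_tokens, output_tokens):
    """USD cost of one task at list price, with no caching or retries."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Assumed task shape: 20k tokens of context in, 4k tokens of patch out.
    for model in PRICES:
        print(f"{model}: ${task_cost(model, 20_000, 4_000):.4f}")
```

For that assumed task shape, Fable 5 comes to about $0.40 per task, Opus 4.8 about $0.20, GPT-5.5 about $0.22 and MiniMax-M3 about $0.011. Output-heavy tasks widen the gap, because the output rate is four to six times the input rate on every row.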

⚠️ The reality check most leaderboards won't give you

SWE-bench Verified is partly saturated and partly memorised. OpenAI's own audit found that frontier models can reproduce verbatim "gold" patches for some tasks — the 500 Python issues leaked into training data before the benchmark was widely published. OpenAI stopped reporting Verified scores and now points to SWE-bench Pro instead.

On Scale AI's standardised SEAL leaderboard (identical scaffolding for every model), the numbers collapse:

  • Best public standardised score: ~59.1% (GPT-5.4 xHigh)
  • Private commercial set: no model exceeds ~47.1%
  • Typical drop moving from Verified → Pro: 15–35 points

So when you see "95% on SWE-bench," translate it to: "saturated benchmark, real-world success rate is roughly half that on unseen, harder code." Use Pro / standardised numbers for procurement decisions, and Verified only for rough relative ranking. The deeper lesson is one Mindber's verification methodology leans on hard: a headline benchmark number is a starting hypothesis, not a purchase decision.
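As a rough aid for that translation, the two rules of thumb quoted above (a 15–35 point drop from Verified to Pro, and "roughly half" on unseen code) can be written out as arithmetic. This is a discounting sketch under those stated assumptions, not a prediction of any model's real score; the function names are hypothetical.

```python
def pro_range(verified_pct, drop_low=15.0, drop_high=35.0):
    """Range implied by the quoted 15-35 point Verified -> Pro drop."""
    return (verified_pct - drop_high, verified_pct - drop_low)


def rough_real_world(verified_pct):
    """The 'roughly half' heuristic for unseen, harder code."""
    return verified_pct / 2


if __name__ == "__main__":
    headline = 95.0  # Claude Fable 5 on SWE-bench Verified
    print(pro_range(headline))         # (60.0, 80.0)
    print(rough_real_world(headline))  # 47.5
```

The halved figure (47.5) lands close to the ~47.1% ceiling reported on the private commercial set, while the point-drop range sits above the best public standardised score (~59.1%), which is a reminder that these heuristics only bracket the answer.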

3) Agents & tool use

For agentic work, the harness matters as much as the model. The same model scores differently inside Codex CLI, Claude Code, or a custom scaffold, so agent leaderboards rank agent + model pairs, not models alone.

Terminal-Bench 2.1 (operate a real computer via terminal — compile code, set up servers, run data workflows):

| # | Agent + model | Score |
|---|---|---|
| 1 | Codex CLI on GPT-5.5 | 83.4% |
| 2 | Claude Code on Opus 4.8 | 78.9% |
| 3 | Gemini CLI on Gemini 3.1 Pro | 70.7% (±2.9) |

Source: tbench.ai, June 2026.

Customer-service / structured tool use (τ²-bench): a different picture entirely — GLM-family models (e.g. GLM-4.7-Flash at 98.8%) top the retail/airline tool-calling tasks. A model that wins terminal automation can lose at multi-turn customer-service tool use. Pick your agent by the task you actually run, not by a single board — and if you're unsure which models even belong on your shortlist, start from the AI tools directory filtered to your use case.

4) Image generation

The image race has split into clear lanes — there is no overall #1, only a best-per-lane.

  • Best all-rounder / character consistency: Nano Banana 2 (Gemini 3.1 Flash Image). Native 4K, keeps faces and styles stable across edits — ideal for serial content (mascots, storyboards, campaigns). Premium at ~$0.13–0.24/image.
  • Best text & typography: GPT Image (1.5 / 2). A "thinking" latent space that reasons through spatial instructions — the one model you can trust to spell a headline correctly. Consistently top-rated on Arena.ai for prompt adherence.
  • Best value / high-volume: Seedream 5.0 (ByteDance). Production-grade 4K at ~$0.026–0.032/image — built for e-commerce catalogues and content calendars.
  • Best for logos & posters: Ideogram v3.
  • Best for brand/style locking & open weights: Flux 2 Pro (dev/pro/max tiers).
  • Best for non-English prompts: Qwen Image (strong on Chinese, Arabic, Spanish).
  • Fastest: Z-Image Turbo (~1 second per image).

For Southeast Asian / multilingual creators: Qwen Image and Seedream handle Chinese and mixed-script prompts more reliably than Western-tuned models, and Seedream's per-image economics make batch product shots realistic on a small budget. You can browse the image-generation field, with Mindber scores and live pricing, in the discovery directory.

5) Video generation

The big story is a departure: Sora 2 is being shut down (web/app 26 April 2026; API 24 September 2026). If you're on it, plan your migration now. Here's the field that remains:

  • Best cinematic quality + native audio: Veo 3.1 (Google). The only model generating 48kHz synchronised dialogue — not just sound effects. Best photorealism on human subjects and natural light. ~$0.15–1.20 per 5-second clip by tier.
  • Best value: Kling 3.0 (Kuaishou). Native 4K, 60fps, multilingual lip-sync, ~$0.10/second — the iteration workhorse.
  • Hottest image-to-video: Seedance 2.0 (ByteDance). Strong stylised motion and short-form vertical content.
  • New frontier contender: HappyHorse-1.0 (Alibaba). Joint audio-video, 7-language lip-sync, climbing the Artificial Analysis video board; live on fal.ai.
  • Best creative control: Runway Gen-4.5. Motion brushes, scene consistency and a real timeline editor — it lost the leaderboard lead but still wins for directed, multi-shot work.
  • Best HDR: Luma Ray3.14 (native 16-bit HDR).

Note: video arena scores live on different scales (LMArena text-to-video vs Artificial Analysis), so cross-board number comparisons are unreliable. Treat these as lane leaders, not a single ranked ladder.

6) Best value & open-weight (the bootstrap lane)

If you're shipping a product and watching margins, this is the most important table in this report. Open weights are now frontier-adjacent at a fraction of the cost:

| Model | Index | Price /1M | Why pick it |
|---|---|---|---|
| Gemini 3.1 Pro | 57 | $1.74 | Best closed frontier value |
| Qwen3.7 Max | 57 | $1.43 | Frontier reasoning, 1M context, strong multilingual |
| MiniMax-M3 (open) | 55 | $0.22 | Near-frontier, open weights, 1M context |
| Kimi K2.6 (open) | 54 | $0.70 | Strong open reasoning |
| DeepSeek V4 Pro (open) | 52 | $0.18 | Cheapest credible workhorse; cache hits drop input further |
| GLM-5.1 (open) | 51 | $0.90 | Strong tool use / agentic |

Source: Artificial Analysis, June 2026.

The routing play: the cost-optimal setup isn't one model — it's a router. Pin ~80% of traffic to a cheap workhorse (DeepSeek V4 / MiniMax-M3 / a small Gemini Flash) and reserve a frontier model (Opus 4.8 / Fable 5) for the hard 20%. Done right, this beats any single-model subscription on both cost and quality. The economics of that split — and why the rate card is only a fraction of the real bill — are worked through end-to-end in The True Cost of AI Tools 2026.
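The cost side of the routing play is simple weighted arithmetic. The sketch below assumes the 80/20 split is measured in tokens (not requests) and uses the blended prices quoted in this report; the function name `routed_cost` is hypothetical. It says nothing about quality, which depends on how well the router identifies the hard slice.

```python
def routed_cost(cheap_price, frontier_price, cheap_share=0.80):
    """Blended USD per 1M tokens when cheap_share of tokens go to the cheap model."""
    return cheap_share * cheap_price + (1 - cheap_share) * frontier_price


if __name__ == "__main__":
    deepseek_v4_pro = 0.18  # blended $/1M, from the value table
    opus_4_8 = 3.85         # blended $/1M, from the Intelligence Index table
    fable_5 = 7.70          # blended $/1M, from the Intelligence Index table

    mixed = routed_cost(deepseek_v4_pro, opus_4_8)
    print(f"80% DeepSeek V4 Pro + 20% Opus 4.8: ${mixed:.3f} per 1M")
    print(f"Saving vs all-Opus 4.8: {1 - mixed / opus_4_8:.0%}")
    print(f"80% DeepSeek V4 Pro + 20% Fable 5: ${routed_cost(deepseek_v4_pro, fable_5):.3f} per 1M")
```

Under these assumptions the DeepSeek/Opus split comes to roughly $0.91 per 1M tokens, about a quarter of sending everything to Opus 4.8. Note that the frontier share dominates the bill, so the saving is most sensitive to how small the "hard" slice really is.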

7) Speed (for real-time & long agent chains)

When latency compounds across many sequential steps, throughput becomes the deciding metric:

  • Mercury 2 (Inception, diffusion LLM) — ~889 tokens/sec
  • Granite 4.0 H Small (IBM) — ~524 t/s
  • Step 3.7 Flash — ~385 t/s
  • gpt-oss-120b (high) — ~338 t/s
  • Gemini 3.1 Flash-Lite — ~326 t/s

Source: Artificial Analysis median output speed, June 2026. For chat UX, anything over ~150 t/s feels instant; speed matters most for agentic loops and batch jobs, where every extra second is multiplied by the number of sequential steps in the chain.
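To see how output speed compounds across a chain, the sketch below multiplies steps by tokens per step and divides by throughput. The chain shape (30 sequential steps of 500 output tokens each) is an assumption for illustration, the function name `chain_seconds` is hypothetical, and the estimate covers generation time only: it leaves out time to first token, input processing and tool execution.

```python
# Median output speeds quoted above (Artificial Analysis, June 2026), in tokens/sec.
SPEEDS = {
    "Mercury 2": 889,
    "Gemini 3.1 Flash-Lite": 326,
    "150 t/s chat threshold": 150,
}


def chain_seconds(steps, output_tokens_per_step, tokens_per_second):
    """Generation time for a chain of sequential steps, ignoring all other latency."""
    return steps * output_tokens_per_step / tokens_per_second


if __name__ == "__main__":
    # Assumed agent chain: 30 sequential steps, 500 output tokens per step.
    for name, tps in SPEEDS.items():
        print(f"{name}: {chain_seconds(30, 500, tps):.1f} s")
```

For that assumed chain, a speed that feels instant in chat (150 t/s) takes 100 seconds of pure generation, against roughly 46 seconds at ~326 t/s and roughly 17 seconds at ~889 t/s.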

How to actually pick a model

Stop optimising for the #1 row. Match the model to the job:

  • Hardest reasoning, money no object → Claude Fable 5 or Opus 4.8.
  • Best quality per dollar at the frontier → Gemini 3.1 Pro or Qwen3.7 Max.
  • Self-hosting / data residency / lowest cost → MiniMax-M3, DeepSeek V4, or Qwen3.7 Max.
  • Coding inside an agent → GPT-5.5 via Codex, or Opus 4.8 via Claude Code.
  • Image — general → Nano Banana 2; text in image → GPT Image; high volume → Seedream 5.
  • Video — cinematic + audio → Veo 3.1; value/iteration → Kling 3.0.
  • Real-time / high throughput → Mercury 2 or a Flash-tier model.

The decision grid below is the same logic in a form you can hand to a buyer:

The buyer's decision grid

Hardest reasoning (quality over cost)

  • Claude Fable 5 (index 65) or Opus 4.8 (61)
  • Worth it for the hardest 5–10% of tasks
  • Route easy work elsewhere; don't default here

Best value at frontier (quality per dollar)

  • Gemini 3.1 Pro ($1.74) or Qwen3.7 Max ($1.43)
  • Index 57, within ~8 points of the top
  • The correct default for most production traffic

Lowest cost / self-host (margins or data residency)

  • MiniMax-M3 ($0.22), DeepSeek V4 ($0.18)
  • Open weights, 1M context, self-hostable
  • Cache hits drop the input rate further

Coding inside an agent (harness matters as much as model)

  • GPT-5.5 via Codex tops Terminal-Bench 2.1
  • Opus 4.8 via Claude Code is close behind
  • Rank agent+model pairs, not models alone

Image & video (best-per-lane, no overall #1)

  • Image: Nano Banana 2 / GPT Image / Seedream 5
  • Video: Veo 3.1 (audio) or Kling 3.0 (value)
  • Sora 2 API closes 24 Sep 2026, so migrate

Real-time / high throughput (latency compounds in agent loops)

  • Mercury 2 (~889 t/s) or a Flash-tier model
  • >150 t/s already feels instant in chat
  • Speed is decisive for batch + multi-step chains

FAQ

What is the best AI model right now (June 2026)?

For raw capability, Claude Fable 5 leads the Artificial Analysis Intelligence Index (65). But "best" depends on the task: GPT-5.5 leads agentic coding, Gemini 3.1 Pro is the best value, and open models like MiniMax-M3 are best for cost-sensitive deployment. The live Mindber view is on the Model Arena board.

Is Claude better than GPT-5.5?

On the composite Intelligence Index, Claude Fable 5 (65) and Opus 4.8 (61) sit above GPT-5.5 (60). On agentic coding (Terminal-Bench 2.1), GPT-5.5 via Codex (83.4%) currently edges Opus 4.8 via Claude Code (78.9%). They're close enough that workflow fit and price usually decide — the Opus 4.8 cost calculator helps with the money side.

What is the best free or open-source AI model?

MiniMax-M3 (Intelligence Index 55) is the strongest near-frontier open-weight model, followed by Kimi K2.6 (54) and DeepSeek V4 Pro (52). All are self-hostable and dramatically cheaper than closed frontier models.

What is the cheapest good AI model?

DeepSeek V4 Pro ($0.18 blended /1M tokens, index 52) and MiniMax-M3 ($0.22, index 55) offer frontier-adjacent quality at roughly one-tenth the price of top closed models.

What is the best AI model for coding?

By model: Claude Fable 5 / Opus 4.8 lead SWE-bench Verified. By coding agent: GPT-5.5 (Codex) tops Terminal-Bench 2.1. Note SWE-bench Verified is partly saturated — check SWE-bench Pro for real-world signal.

Why are SWE-bench scores so high — are they real?

Treat 90%+ SWE-bench Verified scores with caution. The benchmark has known training-data contamination; OpenAI stopped reporting it. On Scale's standardised SEAL leaderboard the best public score is ~59%, and no model exceeds ~47% on the private set. Real-world coding success is roughly half the Verified headline.

What is the best AI image generator in 2026?

Nano Banana 2 for general use and character consistency, GPT Image for text/typography, and Seedream 5.0 for high-volume, cost-sensitive production.

What is the best AI video generator now that Sora is gone?

Veo 3.1 for cinematic quality with native synchronised audio, and Kling 3.0 for the best value (~$0.10/second). Sora 2's API shuts down 24 September 2026.

How often is this leaderboard updated?

Monthly. This is the June 2026 edition; the next refresh lands mid-July 2026. Between editions, the Model Arena board and What's New feed track launches as they land.

Method & sources

We don't run our own private benchmarks or invent scores. This leaderboard aggregates published results from independent sources and attributes every figure to its origin and date — that transparency is the point, and it is the same standard our scoring methodology holds every product page to.

  • Capability / price / speed: Artificial Analysis Intelligence Index (381 models), June 2026.
  • Coding: vals.ai (SWE-bench Verified) and Scale AI SEAL (SWE-bench Pro, standardised scaffolding), June 2026.
  • Agents: tbench.ai (Terminal-Bench 2.1) and τ²-bench, June 2026.
  • Human preference: LMArena (blind A/B voting), June 2026.
  • Vendor pricing & specs: Anthropic, OpenAI and Google Gemini pricing pages, June 2026.

Prices are blended/illustrative and change frequently — confirm against each provider's live pricing before committing spend. Some research-preview models (e.g. Mythos-tier previews) appear on leaderboards but are not generally available; we rank the publicly usable field. For the full picture of what a model actually costs once retries, output asymmetry, and idle seats are counted, read The True Cost of AI Tools 2026.

Spotted an error or a new release we missed? That's the fastest way to improve a leaderboard — tell us.

Explore more on Mindber: the live Model Arena ranking · What's New · the weekly LLM rankings · the full AI tools directory · all our guides.

Related on Mindber

The True Cost of AI Tools in 2026: Sticker vs Reality

Why the real cost of an AI tool runs ~8x the rate card — a fully-sourced TCO model with the seven hidden costs.

Opus 4.8 Cost Calculator: When It Beats Sonnet & GPT-5.5

Break-even workloads, smart-routing savings, and per-model cache rates for the current frontier models.

Claude Fable 5: What It Is, How to Use It, and the Prompts That Exploit It

Anthropic's first public Mythos-class model — pricing, safeguards, benchmarks, access, and copy-paste prompts.

Legal notice

This publication constitutes editorial commentary on publicly available information and does not constitute financial, legal, investment, or professional advice. Product names, trademarks, and registered trademarks referenced herein are the property of their respective owners; their appearance does not imply endorsement or affiliation. Mindber's analysis reflects editorial judgment based on public signals and is subject to change without notice. Scores are not buy, sell, or hold recommendations. No commercial relationship exists between Mindber and the vendors evaluated unless separately disclosed in writing. This publication is governed by the laws of Malaysia. Any dispute arising from or in connection with this publication shall be submitted to the exclusive jurisdiction of the courts of Malaysia.

AI-generated · This report was generated using AI language models trained on publicly available data. It reflects editorial analysis at the time of generation and is not the result of hands-on product testing, independent verification by a human analyst, or a commercial endorsement. All scores, assessments, and claims are derived from signals indexed by Mindber at generation time and are subject to change without notice. Mindber and its operators make no warranty of accuracy, completeness, or fitness for any commercial decision-making purpose. This report is for informational purposes only.

Mindber Research

Mindber editorial — AI model tracking.

Aggregates published benchmark results (Artificial Analysis, vals.ai, Scale AI SEAL, tbench.ai, LMArena) and attributes every figure to its source and date.

Related articles

  • Claude Fable 5: What It Is, How to Use It, and the Prompts That Exploit It (Jun 9 · 13 min)
  • Claude Fable 5 Suspended by US Government Order (Jun 13 · 12 min)
  • The True Cost of AI Tools in 2026: Sticker vs Reality (Jun 7 · 12 min)
Sign In
Skip to main content
BlogBest AI Models, Ranked — June 2026 Leaderboard for Text, Coding, Agents, Image & Video (With Sources)

Best AI Models, Ranked — June 2026 Leaderboard for Text, Coding, Agents, Image & Video (With Sources)

guideUpdated June 15, 202616 min read

The best AI model in June 2026 depends on the job: Claude Fable 5 leads capability, GPT-5.5 coding agents, Gemini 3.1 Pro value. Ranked, with sources.

#best AI model 2026#AI model leaderboard#best LLM 2026#best AI model for coding#Claude Fable 5#GPT-5.5#Gemini 3.1 Pro#open-weight models#AI models
Best AI Models, Ranked — June 2026 Leaderboard for Text, Coding, Agents, Image & Video (With Sources) — The best AI model in June 2026 depends on the job: Claude Fable 5 leads capability, GPT-5.5 coding agents, Gemini 3.1 Pro value. Ranked, with sources.

Last verified: 2026-06-15 (SGT/MYT) · Next update: mid-July 2026. Mindber Data Drop v2026.06. Every figure is attributed to its published source and date — see Method & sources below. Prices are blended/illustrative and change frequently; confirm against each provider's live pricing before committing spend.

By Mindber Research · AI model tracking. Figures checked against the cited leaderboards on 2026-06-15.

How we assessed this: AI-assisted editorial analysis that aggregates published results from independent leaderboards (Artificial Analysis, vals.ai, Scale AI SEAL, tbench.ai, τ²-bench, LMArena) and vendor pricing pages, as of June 2026. Mindber did not run its own private benchmarks and this is not hands-on product testing. Every number is attributed to its origin and date; any figure we could not confirm against a live source was dropped, not guessed.

The short answer: there is no single "best AI model" in June 2026 — there is a best model per job, per budget. Right now Claude Fable 5 tops raw capability, GPT-5.5 leads coding agents, Gemini 3.1 Pro is the best frontier value, open-weight models (MiniMax-M3, DeepSeek V4, Qwen3.7 Max) close most of the gap for a fraction of the price, Nano Banana 2 and GPT Image lead image, and Veo 3.1 / Kling 3.0 lead video now that Sora 2 has been retired.

Below is the full breakdown — and, more importantly, the part most leaderboards skip: which numbers are real and which to ignore. For the live Mindber view of the same field, see the Model Arena board and the weekly LLM rankings; to compare two models head-to-head on price and capability, use the compare tool.

Three numbers that frame June 2026

65
Claude Fable 5 — top of the Artificial Analysis Intelligence Index, ~8 points clear of the value tier
Artificial Analysis, June 2026
83.4%
Codex CLI on GPT-5.5 — the agentic-coding lead on Terminal-Bench 2.1, ahead of Claude Code on Opus 4.8 (78.9%)
tbench.ai, June 2026
~$0.18
DeepSeek V4 Pro blended per 1M tokens — frontier-adjacent quality at roughly one-tenth the price of the top closed models
Artificial Analysis, June 2026

TL;DR — best model by category (June 2026)

JobTop pickBest value alternativeThe number that matters
Text & reasoningClaude Fable 5Gemini 3.1 Pro / Qwen3.7 MaxAA Intelligence Index 65 vs 57
Coding (model)Claude Fable 5 / Opus 4.8DeepSeek V4 / MiniMax-M3SWE-bench Verified — but read the caveat
Coding agent (tool)GPT-5.5 (Codex CLI)Claude Opus 4.8 (Claude Code)Terminal-Bench 2.1: 83.4% vs 78.9%
General agent / tool useGPT-5.5GLM-5 family (customer-service tasks)Benchmark-dependent — no universal winner
ImageNano Banana 2Seedream 5.0 (volume)Human-preference Arena + per-image cost
VideoVeo 3.1 (cinematic + audio)Kling 3.0 (~$0.10/sec)Sora 2 is being shut down — migrate off it
Cheapest at frontier qualityDeepSeek V4 ProMiniMax-M3~$0.18–0.22 blended /1M tokens
Fastest outputMercury 2Gemini 3.1 Flash-Lite~889 t/s vs ~326 t/s

Capability figures: Artificial Analysis Intelligence Index, June 2026 (381 models). Coding: vals.ai SWE-bench Verified + Scale AI SEAL. Agents: tbench.ai Terminal-Bench 2.1. We attribute every number to its source and date — see Method below.

What changed this month

The frontier moved again in late May and early June:

  • Claude Fable 5 went GA on 9 June 2026 ($10 / $50 per 1M tokens, 1M-token context). It debuted at #1 on the Artificial Analysis Intelligence Index (65) and top of SWE-bench Verified (95.0%). We unpack access, safeguards and prompts in the Claude Fable 5 guide, and you can see Anthropic's own framing in its announcement.
  • Claude Opus 4.8 shipped 28 May 2026 ($5 / $25). It posts 88.6% on SWE-bench Verified and 74.6% on Terminal-Bench 2.1 — the strongest price-to-capability point in the Claude line. See its scorecard and the Opus 4.8 cost calculator for break-even math.
  • GPT-5.5 (23 April 2026) is OpenAI's default everyday model, with a reported ~60% drop in hallucinations versus GPT-5.4. It currently leads agentic coding via Codex; current rates are on the OpenAI pricing page.
  • Sora 2 is being retired. OpenAI shut the Sora web/app on 26 April 2026; the API shuts down 24 September 2026. Do not start new video pipelines on it.
  • Open weights nearly caught up. DeepSeek V4, MiniMax-M3 and Qwen3.7 Max now sit within ~0.2 points of Gemini 3.1 Pro on SWE-bench Verified — at roughly one-tenth the token price.

The headline takeaway: the top of the table is now a plateau, not a gap. The interesting decisions in 2026 are about cost, speed, and fit — not about chasing the #1 row.

The frontier is a plateau, not a ladder

For most real workloads, the top five reasoning models are interchangeable on quality — they separate on price, latency, and how well they fit your harness. Chasing the #1 row is the most common way teams overpay. Pick the cheapest model that clears your task's quality bar, and reserve the frontier tier for the genuinely hard slice.

1) Text & reasoning

The cleanest single capability number is the Artificial Analysis Intelligence Index — a composite of GPQA Diamond, MMLU-Pro, AIME, LiveCodeBench and several other benchmarks, normalised to one score.

#ModelCreatorIntelligence IndexBlended price /1MContext
1Claude Fable 5 (max effort)Anthropic65$7.701M
2Claude Opus 4.8 (max)Anthropic61$3.851M
3GPT-5.5 (xhigh)OpenAI60$4.35922k
4GPT-5.5 (high)OpenAI59$4.35922k
5Gemini 3.1 Pro PreviewGoogle57$1.741M
5Qwen3.7 MaxAlibaba57$1.431M
5Claude Opus 4.7 (max)Anthropic57$3.851M
8Gemini 3.5 FlashGoogle55$1.311M
8MiniMax-M3 (open)MiniMax55$0.221M
10Kimi K2.6 (open)Moonshot54$0.70256k

Source: Artificial Analysis Intelligence Index, June 2026.

Read it like this: the top five are separated by ~8 points across a broad reasoning suite — close enough that for most real workloads they're interchangeable on quality. Where they separate hard is price. Gemini 3.1 Pro delivers index-57 reasoning at $1.74; Qwen3.7 Max matches it at $1.43; MiniMax-M3 lands index-55 at $0.22. Paying Fable-5 prices ($7.70 blended) only makes sense for the genuinely hardest 5–10% of tasks. If your spend is dominated by a high volume of medium-difficulty calls, the value tier is not a compromise — it's the correct default, and you can sanity-check the trade on the Mindber rankings.

Human preference vs benchmarks: LMArena (blind A/B voting) and the Intelligence Index measure different things — one captures what people like, the other what models can do. The Claude and Gemini families trade the top of LMArena's text board, and those rankings shift week to week. When the two leaderboards disagree, that gap usually means a model is either over- or under-tuned for chat style, not that one source is "wrong." This is exactly why Mindber's scoring methodology keeps capability and preference as separate axes rather than collapsing them into one number.

2) Coding

This is the category with the most misleading numbers on the internet, so read carefully.

| # | Model | SWE-bench Verified | Price /1M (in/out) |
|---|---|---|---|
| 1 | Claude Fable 5 | 95.0% | $10 / $50 |
| 2 | Claude Opus 4.8 | 88.6% | $5 / $25 |
| 3 | GPT-5.5 | 82.6% | $5 / $30 |
| 4 | Claude Opus 4.7 | ~82% | $5 / $25 |
| 5 | MiniMax-M3 (open) | 80.5% | $0.30 / $1.20 |
| 6 | Gemini 3.5 Flash | 78.8% | $1.31 blended |

Source: vals.ai SWE-bench Verified, June 2026. (Reported Opus 4.7 scores vary 82–88% across harnesses — see caveat.)

⚠️ The reality check most leaderboards won't give you

SWE-bench Verified is partly saturated and partly memorised. OpenAI's own audit found that frontier models can reproduce verbatim "gold" patches for some tasks — the 500 Python issues leaked into training data before the benchmark was widely published. OpenAI stopped reporting Verified scores and now points to SWE-bench Pro instead.

On Scale AI's standardised SEAL leaderboard (identical scaffolding for every model), the numbers collapse:

  • Best public standardised score: ~59.1% (GPT-5.4 xHigh)
  • Private commercial set: no model exceeds ~47.1%
  • Typical drop moving from Verified → Pro: 15–35 points

So when you see "95% on SWE-bench," translate it to: "saturated benchmark, real-world success rate is roughly half that on unseen, harder code." Use Pro / standardised numbers for procurement decisions, and Verified only for rough relative ranking. The deeper lesson is one Mindber's verification methodology leans on hard: a headline benchmark number is a starting hypothesis, not a purchase decision.

3) Agents & tool use

For agentic work, the harness matters as much as the model. The same model scores differently inside Codex CLI, Claude Code, or a custom scaffold, so agent leaderboards rank agent + model pairs, not models alone.

Terminal-Bench 2.1 (operate a real computer via terminal — compile code, set up servers, run data workflows):

| # | Agent + model | Score |
|---|---|---|
| 1 | Codex CLI on GPT-5.5 | 83.4% |
| 2 | Claude Code on Opus 4.8 | 78.9% |
| 3 | Gemini CLI on Gemini 3.1 Pro | 70.7% (±2.9) |

Source: tbench.ai, June 2026.

Customer-service / structured tool use (τ²-bench): a different picture entirely — GLM-family models (e.g. GLM-4.7-Flash at 98.8%) top the retail/airline tool-calling tasks. A model that wins terminal automation can lose at multi-turn customer-service tool use. Pick your agent by the task you actually run, not by a single board — and if you're unsure which models even belong on your shortlist, start from the AI tools directory filtered to your use case.

4) Image generation

The image race has split into clear lanes — there is no overall #1, only a best-per-lane.

  • Best all-rounder / character consistency: Nano Banana 2 (Gemini 3.1 Flash Image). Native 4K, keeps faces and styles stable across edits — ideal for serial content (mascots, storyboards, campaigns). Premium at ~$0.13–0.24/image.
  • Best text & typography: GPT Image (1.5 / 2). A "thinking" latent space that reasons through spatial instructions — the one model you can trust to spell a headline correctly. Consistently top-rated on Arena.ai for prompt adherence.
  • Best value / high-volume: Seedream 5.0 (ByteDance). Production-grade 4K at ~$0.026–0.032/image — built for e-commerce catalogues and content calendars.
  • Best for logos & posters: Ideogram v3.
  • Best for brand/style locking & open weights: Flux 2 Pro (dev/pro/max tiers).
  • Best for non-English prompts: Qwen Image (strong on Chinese, Arabic, Spanish).
  • Fastest: Z-Image Turbo (~1 second per image).

For Southeast Asian / multilingual creators: Qwen Image and Seedream handle Chinese and mixed-script prompts more reliably than Western-tuned models, and Seedream's per-image economics make batch product shots realistic on a small budget. You can browse the image-generation field, with Mindber scores and live pricing, in the discovery directory.
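To put the per-image economics in numbers, this small sketch multiplies the price ranges quoted above by a batch size. The 1,000-image batch is a hypothetical workload and the helper name is illustrative; real bills also depend on resolution tier and retries.

```python
def batch_cost_range(n_images: int, low_per_image: float,
                     high_per_image: float) -> tuple[float, float]:
    """Return the (low, high) USD cost of generating n_images at a per-image price range."""
    return n_images * low_per_image, n_images * high_per_image


# Per-image price ranges (USD) as quoted in the list above.
PRICE_RANGES = {
    "Seedream 5.0": (0.026, 0.032),
    "Nano Banana 2": (0.13, 0.24),
}

for model, (low, high) in PRICE_RANGES.items():
    low_cost, high_cost = batch_cost_range(1000, low, high)
    print(f"{model}: ${low_cost:.0f} to ${high_cost:.0f} per 1,000 images")
```

At these list prices a 1,000-image catalogue run costs about $26 to $32 on Seedream 5.0 versus $130 to $240 on Nano Banana 2.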

5) Video generation

The big story is a departure: Sora 2 is being shut down (the web/app closed on 26 April 2026; the API closes on 24 September 2026). If you're still on the API, plan your migration now. Here's the field that remains:

  • Best cinematic quality + native audio: Veo 3.1 (Google). The only model generating 48kHz synchronised dialogue — not just sound effects. Best photorealism on human subjects and natural light. ~$0.15–1.20 per 5-second clip by tier.
  • Best value: Kling 3.0 (Kuaishou). Native 4K, 60fps, multilingual lip-sync, ~$0.10/second — the iteration workhorse.
  • Hottest image-to-video: Seedance 2.0 (ByteDance). Strong stylised motion and short-form vertical content.
  • New frontier contender: HappyHorse-1.0 (Alibaba). Joint audio-video, 7-language lip-sync, climbing the Artificial Analysis video board; live on fal.ai.
  • Best creative control: Runway Gen-4.5. Motion brushes, scene consistency and a real timeline editor — it lost the leaderboard lead but still wins for directed, multi-shot work.
  • Best HDR: Luma Ray3.14 (native 16-bit HDR).

Note: video arena scores live on different scales (LMArena text-to-video vs Artificial Analysis), so cross-board number comparisons are unreliable. Treat these as lane leaders, not a single ranked ladder.

6) Best value & open-weight (the bootstrap lane)

If you're shipping a product and watching margins, this is the most important table in this report. Open weights are now frontier-adjacent at a fraction of the cost:

| Model | Index | Price /1M | Why pick it |
|---|---|---|---|
| Gemini 3.1 Pro | 57 | $1.74 | Best closed frontier value |
| Qwen3.7 Max | 57 | $1.43 | Frontier reasoning, 1M context, strong multilingual |
| MiniMax-M3 (open) | 55 | $0.22 | Near-frontier, open weights, 1M context |
| Kimi K2.6 (open) | 54 | $0.70 | Strong open reasoning |
| DeepSeek V4 Pro (open) | 52 | $0.18 | Cheapest credible workhorse; cache hits drop input further |
| GLM-5.1 (open) | 51 | $0.90 | Strong tool use / agentic |

Source: Artificial Analysis, June 2026.

The routing play: the cost-optimal setup isn't one model; it's a router. Pin ~80% of traffic to a cheap workhorse (DeepSeek V4 / MiniMax-M3 / a small Gemini Flash) and reserve a frontier model (Opus 4.8 / Fable 5) for the hard 20%. Done right, this undercuts an all-frontier setup on cost while keeping frontier quality on the tasks that need it. The economics of that split, and why the rate card is only a fraction of the real bill, are worked through end-to-end in The True Cost of AI Tools 2026.
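As a rough worked example of the 80/20 split, the sketch below computes the traffic-weighted price per 1M tokens. It assumes easy and hard calls consume similar token volumes and uses the blended prices quoted in this report (DeepSeek V4 Pro at $0.18, Opus 4.8 at $3.85); the function name is hypothetical, and the result ignores retries, caching, and routing overhead.

```python
def routed_price_per_1m(cheap_price: float, frontier_price: float,
                        frontier_share: float) -> float:
    """Traffic-weighted blended price (USD per 1M tokens) for a two-tier router."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return (1.0 - frontier_share) * cheap_price + frontier_share * frontier_price


DEEPSEEK_V4_PRO = 0.18  # blended USD per 1M tokens, from the value table
OPUS_4_8 = 3.85         # blended USD per 1M tokens, from the text & reasoning table

routed = routed_price_per_1m(DEEPSEEK_V4_PRO, OPUS_4_8, frontier_share=0.20)
saving = 1.0 - routed / OPUS_4_8
print(f"routed: ${routed:.3f} per 1M tokens, {saving:.0%} below all-Opus")
```

Under those assumptions the router lands near $0.91 per 1M tokens, about 76% below sending everything to Opus 4.8. The frontier tier still accounts for most of that bill ($0.77 of the $0.91), so the share of genuinely hard traffic is the number to measure first.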

7) Speed (for real-time & long agent chains)

When latency compounds across many sequential steps, throughput becomes the deciding metric:

  • Mercury 2 (Inception, diffusion LLM) — ~889 tokens/sec
  • Granite 4.0 H Small (IBM) — ~524 t/s
  • Step 3.7 Flash — ~385 t/s
  • gpt-oss-120b (high) — ~338 t/s
  • Gemini 3.1 Flash-Lite — ~326 t/s

Source: Artificial Analysis median output speed, June 2026. For chat UX, anything over ~150 t/s feels instant; speed matters most for agentic loops and batch jobs, where every extra second is multiplied by the number of sequential steps in the chain.
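A back-of-the-envelope sketch of how throughput compounds over a sequential chain. The 40-step chain and 500 output tokens per step are hypothetical numbers chosen for illustration, and the model counts generation time only (no time-to-first-token, tool execution, or network latency).

```python
def chain_generation_seconds(steps: int, output_tokens_per_step: int,
                             tokens_per_second: float) -> float:
    """Total output-generation time for a chain of strictly sequential steps."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return steps * output_tokens_per_step / tokens_per_second


# Median output speeds (tokens/sec) from the list above, plus the ~150 t/s chat threshold.
SPEEDS = [
    ("Mercury 2", 889),
    ("Gemini 3.1 Flash-Lite", 326),
    ("150 t/s baseline", 150),
]

for name, tps in SPEEDS:
    secs = chain_generation_seconds(steps=40, output_tokens_per_step=500,
                                    tokens_per_second=tps)
    print(f"{name}: {secs:.0f} s")
```

For this hypothetical chain the totals come to roughly 22 s, 61 s, and 133 s: a speed that feels instant in a single chat turn still adds up to more than two minutes of pure generation time across 40 steps.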

How to actually pick a model

Stop optimising for the #1 row. Match the model to the job:

  • Hardest reasoning, money no object → Claude Fable 5 or Opus 4.8.
  • Best quality per dollar at the frontier → Gemini 3.1 Pro or Qwen3.7 Max.
  • Self-hosting / data residency / lowest cost → MiniMax-M3, DeepSeek V4, or Qwen3.7 Max.
  • Coding inside an agent → GPT-5.5 via Codex, or Opus 4.8 via Claude Code.
  • Image — general → Nano Banana 2; text in image → GPT Image; high volume → Seedream 5.
  • Video — cinematic + audio → Veo 3.1; value/iteration → Kling 3.0.
  • Real-time / high throughput → Mercury 2 or a Flash-tier model.

The decision grid below is the same logic in a form you can hand to a buyer:

The buyer's decision grid

Hardest reasoning: quality over cost

  • Claude Fable 5 (index 65) or Opus 4.8 (61)
  • Worth it for the hardest 5–10% of tasks
  • Route easy work elsewhere; don't default here

Best value at frontier: quality per dollar

  • Gemini 3.1 Pro ($1.74) or Qwen3.7 Max ($1.43)
  • Index 57, within ~8 points of the top
  • The correct default for most production traffic

Lowest cost / self-host: margins or data residency

  • MiniMax-M3 ($0.22), DeepSeek V4 ($0.18)
  • Open weights, 1M context, self-hostable
  • Cache hits drop the input rate further

Coding inside an agent: harness matters as much as model

  • GPT-5.5 via Codex tops Terminal-Bench 2.1
  • Opus 4.8 via Claude Code is close behind
  • Rank agent+model pairs, not models alone

Image & video: best-per-lane, no overall #1

  • Image: Nano Banana 2 / GPT Image / Seedream 5
  • Video: Veo 3.1 (audio) or Kling 3.0 (value)
  • Sora 2 API closes 24 Sep 2026, so migrate

Real-time / high throughput: latency compounds in agent loops

  • Mercury 2 (~889 t/s) or a Flash-tier model
  • >150 t/s already feels instant in chat
  • Speed is decisive for batch + multi-step chains

FAQ

What is the best AI model right now (June 2026)?

For raw capability, Claude Fable 5 leads the Artificial Analysis Intelligence Index (65). But "best" depends on the task: GPT-5.5 leads agentic coding, Gemini 3.1 Pro is the best value, and open models like MiniMax-M3 are best for cost-sensitive deployment. The live Mindber view is on the Model Arena board.

Is Claude better than GPT-5.5?

On the composite Intelligence Index, Claude Fable 5 (65) and Opus 4.8 (61) sit above GPT-5.5 (60). On agentic coding (Terminal-Bench 2.1), GPT-5.5 via Codex (83.4%) currently edges Opus 4.8 via Claude Code (78.9%). They're close enough that workflow fit and price usually decide — the Opus 4.8 cost calculator helps with the money side.

What is the best free or open-source AI model?

MiniMax-M3 (Intelligence Index 55) is the strongest near-frontier open-weight model, followed by Kimi K2.6 (54) and DeepSeek V4 Pro (52). All are self-hostable and dramatically cheaper than closed frontier models.

What is the cheapest good AI model?

DeepSeek V4 Pro ($0.18 blended /1M tokens, index 52) and MiniMax-M3 ($0.22, index 55) offer frontier-adjacent quality at roughly 2–6% of the blended price of the top closed models (Opus 4.8 at $3.85, Fable 5 at $7.70).

What is the best AI model for coding?

By model: Claude Fable 5 / Opus 4.8 lead SWE-bench Verified. By coding agent: GPT-5.5 (Codex) tops Terminal-Bench 2.1. Note SWE-bench Verified is partly saturated — check SWE-bench Pro for real-world signal.

Why are SWE-bench scores so high — are they real?

Treat 90%+ SWE-bench Verified scores with caution. The benchmark has known training-data contamination; OpenAI stopped reporting it. On Scale's standardised SEAL leaderboard the best public score is ~59%, and no model exceeds ~47% on the private set. Real-world coding success is roughly half the Verified headline.

What is the best AI image generator in 2026?

Nano Banana 2 for general use and character consistency, GPT Image for text/typography, and Seedream 5.0 for high-volume, cost-sensitive production.

What is the best AI video generator now that Sora is gone?

Veo 3.1 for cinematic quality with native synchronised audio, and Kling 3.0 for the best value (~$0.10/second). Sora 2's API shuts down 24 September 2026.

How often is this leaderboard updated?

Monthly. This is the June 2026 edition; the next refresh lands mid-July 2026. Between editions, the Model Arena board and What's New feed track launches as they land.

Method & sources

We don't run our own private benchmarks or invent scores. This leaderboard aggregates published results from independent sources and attributes every figure to its origin and date — that transparency is the point, and it is the same standard our scoring methodology holds every product page to.

  • Capability / price / speed: Artificial Analysis Intelligence Index (381 models), June 2026.
  • Coding: vals.ai (SWE-bench Verified) and Scale AI SEAL (SWE-bench Pro, standardised scaffolding), June 2026.
  • Agents: tbench.ai (Terminal-Bench 2.1) and τ²-bench, June 2026.
  • Human preference: LMArena (blind A/B voting), June 2026.
  • Vendor pricing & specs: Anthropic, OpenAI and Google Gemini pricing pages, June 2026.

Prices are blended/illustrative and change frequently — confirm against each provider's live pricing before committing spend. Some research-preview models (e.g. Mythos-tier previews) appear on leaderboards but are not generally available; we rank the publicly usable field. For the full picture of what a model actually costs once retries, output asymmetry, and idle seats are counted, read The True Cost of AI Tools 2026.

Spotted an error or a new release we missed? Tell us: reader reports are the fastest way to improve a leaderboard.

Explore more on Mindber: the live Model Arena ranking · What's New · the weekly LLM rankings · the full AI tools directory · all our guides.

Related on Mindber

The True Cost of AI Tools in 2026: Sticker vs Reality

Why the real cost of an AI tool runs ~8x the rate card — a fully-sourced TCO model with the seven hidden costs.

Opus 4.8 Cost Calculator: When It Beats Sonnet & GPT-5.5

Break-even workloads, smart-routing savings, and per-model cache rates for the current frontier models.

Claude Fable 5: What It Is, How to Use It, and the Prompts That Exploit It

Anthropic's first public Mythos-class model — pricing, safeguards, benchmarks, access, and copy-paste prompts.

Legal notice

This publication constitutes editorial commentary on publicly available information and does not constitute financial, legal, investment, or professional advice. Product names, trademarks, and registered trademarks referenced herein are the property of their respective owners; their appearance does not imply endorsement or affiliation. Mindber's analysis reflects editorial judgment based on public signals and is subject to change without notice. Scores are not buy, sell, or hold recommendations. No commercial relationship exists between Mindber and the vendors evaluated unless separately disclosed in writing. This publication is governed by the laws of Malaysia. Any dispute arising from or in connection with this publication shall be submitted to the exclusive jurisdiction of the courts of Malaysia.

AI-generated · This report was generated using AI language models trained on publicly available data. It reflects editorial analysis at the time of generation and is not the result of hands-on product testing, independent verification by a human analyst, or a commercial endorsement. All scores, assessments, and claims are derived from signals indexed by Mindber at generation time and are subject to change without notice. Mindber and its operators make no warranty of accuracy, completeness, or fitness for any commercial decision-making purpose. This report is for informational purposes only.

Mindber Research

Mindber editorial — AI model tracking.

Aggregates published benchmark results (Artificial Analysis, vals.ai, Scale AI SEAL, tbench.ai, LMArena) and attributes every figure to its source and date.
