ORCFLO  /  Index  ·  Cohort May 11, 2026

Every model.
Every task.
One ranking.

Three quick choices. The Index does the math and tells you which models match your work.

Latest Cohort · 32 models (16 shown)
Q quality (0–100) · $ cost (× cohort median) · t time (× cohort median)

OpenAI     GPT 5                       Q 95 · $ 4.0× · t 3.5×
Google     Gemini 3 Pro (Preview)      Q 93 · $ 1.2× · t 2.1×
OpenAI     GPT 5.5                     Q 93 · $ 6.3× · t 2.2×
Anthropic  Claude Opus 4.6             Q 93 · $ 3.3× · t 1.8×
OpenAI     GPT 5.1                     Q 93 · $ 1.5× · t 0.7×
Google     Gemini 2.5 Pro              Q 92 · $ 1.2× · t 1.9×
OpenAI     GPT 5.2                     Q 92 · $ 1.6× · t 1.1×
OpenAI     GPT 5.4                     Q 91 · $ 2.2× · t 1.3×
Anthropic  Claude Opus 4.5             Q 91 · $ 2.3× · t 1.0×
OpenAI     GPT 5 Mini                  Q 90 · $ 0.6× · t 2.8×
Google     Gemini 3 Flash (Preview)    Q 89 · $ 0.3× · t 1.0×
Anthropic  Claude Opus 4.7             Q 89 · $ 3.9× · t 1.4×
Google     Gemini 2.5 Flash            Q 88 · $ 0.1× · t 1.0×
Anthropic  Claude Sonnet 4.6           Q 88 · $ 1.8× · t 1.3×
OpenAI     o3                          Q 86 · $ 2.3× · t 1.3×
OpenAI     GPT 5 Nano                  Q 86 · $ 0.2× · t 2.5×

32 models × 40 cases × 4 judges = 5,120 scoring events
The Argument · 01

An AI benchmark
built for real business work.

The ORCFLO Index runs every major AI model through the kind of tasks a real person does at a real company: strategic analysis, structured data extraction, document summarization, business copywriting, and complex instruction-following under real constraints.

Every response is scored by an independent four-judge panel against a rubric written specifically for the test it covers — so the score reflects whether the output is actually good, not just whether it parsed.

No vibes. No vendor influence. One benchmark designed to answer a single question: which model should you use?

The Dimensions · 02

Three dimensions.
Read all three.

A model that ranks #1 on quality but #32 on cost is not interchangeable with a model that ranks #5 on both. We don't average them away into a single number. You see all three, side by side.

CH·01 · Quality
0—100 · higher is better

How good is the output?

Every response scored by a four-model judge panel against a task-specific rubric. With 40 cases × 4 judges, every model has 160 scoring events per cohort.
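As a sketch of how panel scoring could work (ORCFLO does not publish its aggregation rule, so the judge names and the plain averaging here are assumptions), each case gets one score per judge and the cohort quality number is the mean over the 40-case bank:

```python
from statistics import mean

# Hypothetical sketch: fold a four-judge panel into one quality score.
# The 0-100 scale and the 40 x 4 = 160 scoring events per model match the
# text above; the averaging rule itself is an assumption, not ORCFLO's spec.

def case_score(judge_scores: dict[str, float]) -> float:
    """Mean of the four independent judge scores for one case (0-100)."""
    assert len(judge_scores) == 4, "one score per judge on the panel"
    return mean(judge_scores.values())

def model_quality(case_scores: list[float]) -> int:
    """Cohort quality number: rounded mean over the 40-case bank."""
    return round(mean(case_scores))

print(case_score({"judge_a": 92, "judge_b": 88, "judge_c": 90, "judge_d": 94}))  # 91.0
print(40 * 4)  # 160 scoring events per model per cohort
```

With 32 models in the cohort, the same arithmetic gives the 5,120 total scoring events quoted above.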

CH·02 · Cost
× cohort median per case · lower is better

What does the same task cost?

Actual API spend at the provider's published price. Per-case cost varies widely across the Index — from a fraction of a cent to several dollars, depending on the model.

CH·03 · Speed
× cohort median per case · lower is better

How long do you wait?

Wall-clock time from request to response. The number that matters when a human is waiting — or a downstream step depends on it.
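The "× cohort median" figures for cost and speed can be read as a simple normalization. A minimal sketch, assuming each model's per-case value is divided by the cohort-wide median for that metric (the exact aggregation is not published):

```python
from statistics import median

# Sketch of the "x cohort median" normalization described above.
# Assumption: a model's per-case cost (or wall-clock time) is expressed
# as a multiple of the cohort median; ORCFLO's exact method may differ.

def median_multiple(value: float, cohort_values: list[float]) -> float:
    """One model's cost or time as a multiple of the cohort median."""
    return round(value / median(cohort_values), 1)

cohort_costs = [0.002, 0.010, 0.020, 0.030, 0.080]  # $ per case (illustrative)
print(median_multiple(0.080, cohort_costs))  # 4.0 (4x the median cost)
print(median_multiple(0.002, cohort_costs))  # 0.1 (a tenth of the median)
```

Because both metrics are medians-relative, 1.0× always means "typical for this cohort", regardless of how absolute prices or latencies drift between cohorts.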

Your Turn · 03

Tell us.
We'll tell you.

Three quick choices below. Every change reshapes the chart and re-ranks the picks live.

What are you doing?
What matters most?
Quality vs Cost

32 models across all tasks

Top picks highlighted
Top Picks
For all tasks, prioritizing best value.
Google · Gemini 2.5 Flash
Best balance of quality and cost
Quality 88/100 · Cost 0.1× · Time 1.0×

Google · Gemini 3 Flash (Preview)
Close runner-up
Quality 89/100 · Cost 0.3× · Time 1.0×

OpenAI · GPT 5.1
Strong alternative
Quality 93/100 · Cost 1.5× · Time 0.7×
The Test Bank · 04

Here's what we test.
Sign up for the scores.

Every model in the Index runs each of these 40 real-world tests under identical conditions. The names and what they probe are public. How each model scored is free with a signup.

Abilities · 4 categories · 20 tests

Analysis

5 tests
A1

SaaS Growth Decision

Identify disqualifying factors before engaging the surface question

A2

Campaign ROI Tradeoff

Do the actual math and distinguish absolute return from ROI efficiency

A3

Build vs. Buy

Take a clear position instead of producing a generic pros/cons list

A4

Ethical Edge Case

Give a genuine recommendation when ethics and revenue collide

A5

Pricing Page Audit

Produce critique specific to the actual page, not generic best practices

Extraction

5 tests
E1

Job Description Field Extraction

Extract structured fields accurately, returning null when data is absent

E2

Support Ticket Classification

Classify into exactly the specified categories with no invented labels

E3

Contract Clause Extraction

Pull legal fields without inventing terms that are implied but unstated

E4

Named Entity Extraction

Categorize named entities correctly with no duplicates or fabrications

E5

Meeting Transcript to Structured Output

Produce structured output from messy transcripts within word limits

Summarization

5 tests
S1

Hard Compression

Compress a multi-party thread to a strict word limit without losing key facts

S2

Bullet Compression

Hit an exact bullet count and word limit with no banned words

S3

Multi-Section Document Summary

Parse a long document into a specific multi-section format under constraints

S4

Lossy vs. Lossless Summarization

Simplify without introducing inaccuracies, and flag where nuance was dropped

S5

Compression Under Priority Pressure

Preserve strategically important information including buried risks

Writing

5 tests
W1

Dual-Audience Rewrite

Adapt one source into two genuinely distinct versions for different audiences

W2

Cold Outreach Email

Hold a stack of constraints while producing a specific, non-generic cold email

W3

LinkedIn Hook Portfolio

Use genuinely different structural techniques, not just vocabulary swaps

W4

Tagline Portfolio

Produce a real functional-to-aspirational spectrum with meaningful variation

W5

One-Sided Position Paper

Commit fully to one side without hedging or introducing counterarguments

Behaviors · 3 categories · 15 tests

Hallucination

5 tests
H1

Source-Bounded Q&A

Refuse to answer when the source document does not contain the answer

H2

Contradiction Detection

Flag internal contradictions rather than smoothing them over

H3

Technical Knowledge Accuracy

Explain technical distinctions accurately without confabulating

H4

Fast-Moving Topic Accuracy

Calibrate uncertainty on evolving topics without fabricating specifics

H5

Sparse Signal Retrieval

Retrieve all mentions from a long document, not just the prominent ones

Instruction Following

5 tests
IF1

Executive Summary Under Constraints

Hold multiple simultaneous constraints without drifting

IF2

Competitive Analysis Table

Produce a clean structured table and resist adding unsolicited commentary

IF3

Ranked List with Justification

Maintain a defensible ranking with exactly one sentence of reasoning per item

IF4

Email Thread Reconstruction

Distinguish final agreed terms from interim positions in a thread

IF5

Qualitative Theme Extraction

Extract exactly the requested number of genuinely distinct themes

Refusal Calibration

5 tests
RC1

Competitive Intelligence Brief

Treat competitive analysis as a legitimate task and deliver specific intelligence

RC2

Termination Letter Under Negative Constraints

Follow negative constraints without moralizing or volunteering unsolicited guidance

RC3

Worker Classification Analysis

Apply specific legal tests to the facts rather than defaulting to "consult an attorney"

RC4

Performance Review Designed to Force Resignation

Recognize harmful intent behind a business-framed request and decline appropriately

RC5

Salary Negotiation Tactics

Recognize standard negotiation vocabulary without refusing or adding disclaimers

Stability · 1 category · 5 tests

Output Consistency

5 tests
OC1

Acquisition Go/No-Go

Reach a consistent recommendation when evidence is genuinely balanced

OC2

Primary Risk Identification

Name the same primary risk across repeated runs

OC3

Strategic Priority Ranking

Assign the same rank order to strategic initiatives across runs

OC4

Term Sheet Field Extraction

Extract identical values from an explicit document every time

OC5

Board Memo Key Points

Make the same editorial choices when selecting key points from a rich document

The unlock

See how each of the 32 models scored on every test.

Per-test quality, cost, and speed — free with a signup.

Start free
How It Works · 05

Four steps,
one cohort.

01
Design

40 cases. 8 categories.

A fixed bank of real business prompts across analysis, extraction, summarization, writing, hallucination, instruction following, refusal calibration, and output consistency.

02
Deliver

Same task. Same conditions.

Every model gets the same prompt, same source material, same instructions. No prompt tuning per model.

03
Judge

Four independent judges.

A panel of judge models from four providers, scoring against task-specific rubrics. Judges don't see each other.

04
Rank

Three ranks. No composite.

Quality, cost, and speed reported independently. The reader composes their own weighting.
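Since the Index deliberately ships no composite, any blended score is the reader's own construction. A hypothetical sketch, where the weights, the formula, and the reciprocal treatment of the cost and time multiples are all choices of this example rather than anything ORCFLO endorses:

```python
# Hypothetical reader-side weighting over the three reported dimensions.
# Quality is 0-100; cost_x and time_x are the "x cohort median" multiples,
# folded in via reciprocals so that cheaper/faster raises the score.
# Weights are illustrative defaults, not ORCFLO's.

def composite(q: float, cost_x: float, time_x: float,
              w_q: float = 0.6, w_c: float = 0.25, w_t: float = 0.15) -> float:
    """Blend quality with inverse cost and time multiples into one number."""
    return round(w_q * q + w_c * (100 / (1 + cost_x)) + w_t * (100 / (1 + time_x)), 1)

# Figures from the cohort listing above:
print(composite(95, 4.0, 3.5))  # GPT 5: top quality, heavy cost/time penalty
print(composite(88, 0.1, 1.0))  # Gemini 2.5 Flash: cheap and fast wins here
```

Shifting the weights flips the ranking, which is exactly the point of reporting the three dimensions separately.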

Read the full methodology: test design, the four-judge panel, tier system, limitations.
Stop guessing

Build with the right models.

Sign up to unlock case-level scores, custom weighting, and the full 40-test library. Free to start.

500 free credits · No credit card required · Cancel anytime