ORCFLO  /  Index  ·  Cohort May 11, 2026

Every model.
Every task.
One ranking.

Three quick choices. The Index does the math and tells you which models match your work.

Latest Cohort · 32 models (16 shown)
Q quality (0–100) · $ cost (× cohort median) · t time (× cohort median)

OpenAI     GPT 5                       Q 95 · $ 4.0× · t 3.5×
Google     Gemini 3 Pro (Preview)      Q 93 · $ 1.2× · t 2.1×
OpenAI     GPT 5.5                     Q 93 · $ 6.3× · t 2.2×
Anthropic  Claude Opus 4.6             Q 93 · $ 3.3× · t 1.8×
OpenAI     GPT 5.1                     Q 93 · $ 1.5× · t 0.7×
Google     Gemini 2.5 Pro              Q 92 · $ 1.2× · t 1.9×
OpenAI     GPT 5.2                     Q 92 · $ 1.6× · t 1.1×
OpenAI     GPT 5.4                     Q 91 · $ 2.2× · t 1.3×
Anthropic  Claude Opus 4.5             Q 91 · $ 2.3× · t 1.0×
OpenAI     GPT 5 Mini                  Q 90 · $ 0.6× · t 2.8×
Google     Gemini 3 Flash (Preview)    Q 89 · $ 0.3× · t 1.0×
Anthropic  Claude Opus 4.7             Q 89 · $ 3.9× · t 1.4×
Google     Gemini 2.5 Flash            Q 88 · $ 0.1× · t 1.0×
Anthropic  Claude Sonnet 4.6           Q 88 · $ 1.8× · t 1.3×
OpenAI     o3                          Q 86 · $ 2.3× · t 1.3×
OpenAI     GPT 5 Nano                  Q 86 · $ 0.2× · t 2.5×

32 models × 40 cases × 4 judges = 5,120 scoring events
The Argument · 01

An AI benchmark
built for real business work.

The ORCFLO Index runs every major AI model through the kind of tasks a real person does at a real company: strategic analysis, structured data extraction, document summarization, business copywriting, and complex instruction-following under real constraints.

Every response is scored by an independent four-judge panel against a rubric written specifically for the test it covers — so the score reflects whether the output is actually good, not just whether it parsed.

No vibes. No vendor influence. One benchmark designed to answer a single question: which model should you use?

The Dimensions · 02

Three dimensions.
Read all three.

A model that ranks #1 on quality but #32 on cost is not interchangeable with a model that ranks #5 on both. We don't average them away into a single number. You see all three, side by side.

CH·01 · Quality
0—100 · higher is better

How good is the output?

Every response scored by a four-model judge panel against a task-specific rubric. With 40 cases × 4 judges, every model has 160 scoring events per cohort.
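As a sketch of how panel scoring could work (ORCFLO does not publish its aggregation rule, so the judge names and the plain averaging here are assumptions), each case gets one score per judge and the cohort quality number is the mean over the 40-case bank:

```python
from statistics import mean

# Hypothetical sketch: fold a four-judge panel into one quality score.
# The 0-100 scale and the 40 x 4 = 160 scoring events per model match the
# text above; the averaging rule itself is an assumption, not ORCFLO's spec.

def case_score(judge_scores: dict[str, float]) -> float:
    """Mean of the four independent judge scores for one case (0-100)."""
    assert len(judge_scores) == 4, "one score per judge on the panel"
    return mean(judge_scores.values())

def model_quality(case_scores: list[float]) -> int:
    """Cohort quality number: rounded mean over the 40-case bank."""
    return round(mean(case_scores))

print(case_score({"judge_a": 92, "judge_b": 88, "judge_c": 90, "judge_d": 94}))  # 91.0
print(40 * 4)  # 160 scoring events per model per cohort
```

With 32 models in the cohort, the same arithmetic gives the 5,120 total scoring events quoted above.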

CH·02 · Cost
× cohort median per case · lower is better

What does the same task cost?

Actual API spend at the provider's published price. Per-case cost varies widely across the Index — from a fraction of a cent to several dollars, depending on the model.

CH·03 · Speed
× cohort median per case · lower is better

How long do you wait?

Wall-clock time from request to response. The number that matters when a human is waiting — or a downstream step depends on it.
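The "× cohort median" figures for cost and speed can be read as a simple normalization. A minimal sketch, assuming each model's per-case value is divided by the cohort-wide median for that metric (the exact aggregation is not published):

```python
from statistics import median

# Sketch of the "x cohort median" normalization described above.
# Assumption: a model's per-case cost (or wall-clock time) is expressed
# as a multiple of the cohort median; ORCFLO's exact method may differ.

def median_multiple(value: float, cohort_values: list[float]) -> float:
    """One model's cost or time as a multiple of the cohort median."""
    return round(value / median(cohort_values), 1)

cohort_costs = [0.002, 0.010, 0.020, 0.030, 0.080]  # $ per case (illustrative)
print(median_multiple(0.080, cohort_costs))  # 4.0 (4x the median cost)
print(median_multiple(0.002, cohort_costs))  # 0.1 (a tenth of the median)
```

Because both metrics are medians-relative, 1.0× always means "typical for this cohort", regardless of how absolute prices or latencies drift between cohorts.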

Your Turn · 03

Tell us.
We'll tell you.

Three quick choices below. Every change reshapes the chart and re-ranks the picks live.

What are you doing?
What matters most?
Quality vs Cost

32 models across all tasks

Top picks highlighted
Top Picks
For all tasks, prioritizing best value.
Google · Gemini 2.5 Flash
Best balance of quality and cost
Quality 88/100 · Cost 0.1× · Time 1.0×

Google · Gemini 3 Flash (Preview)
Close runner-up
Quality 89/100 · Cost 0.3× · Time 1.0×

OpenAI · GPT 5.1
Strong alternative
Quality 93/100 · Cost 1.5× · Time 0.7×
The Test Bank · 04

Here's what we test.
Sign up for the scores.

Every model in the Index runs each of these 40 real-world tests under identical conditions. The names and what they probe are public. How each model scored is free with a signup.

Abilities · 4 categories · 20 tests

Analysis

5 tests
A1

SaaS Growth Decision

Identify disqualifying factors before engaging the surface question

A2

Campaign ROI Tradeoff

Do the actual math and distinguish absolute return from ROI efficiency

A3

Build vs. Buy

Take a clear position instead of producing a generic pros/cons list

A4

Ethical Edge Case

Give a genuine recommendation when ethics and revenue collide

A5

Pricing Page Audit

Produce critique specific to the actual page, not generic best practices

Extraction

5 tests
E1

Job Description Field Extraction

Extract structured fields accurately, returning null when data is absent

E2

Support Ticket Classification

Classify into exactly the specified categories with no invented labels

E3

Contract Clause Extraction

Pull legal fields without inventing terms that are implied but unstated

E4

Named Entity Extraction

Categorize named entities correctly with no duplicates or fabrications

E5

Meeting Transcript to Structured Output

Produce structured output from messy transcripts within word limits

Summarization

5 tests
S1

Hard Compression

Compress a multi-party thread to a strict word limit without losing key facts

S2

Bullet Compression

Hit an exact bullet count and word limit with no banned words

S3

Multi-Section Document Summary

Parse a long document into a specific multi-section format under constraints

S4

Lossy vs. Lossless Summarization

Simplify without introducing inaccuracies, and flag where nuance was dropped

S5

Compression Under Priority Pressure

Preserve strategically important information including buried risks

Writing

5 tests
W1

Dual-Audience Rewrite

Adapt one source into two genuinely distinct versions for different audiences

W2

Cold Outreach Email

Hold a stack of constraints while producing a specific, non-generic cold email

W3

LinkedIn Hook Portfolio

Use genuinely different structural techniques, not just vocabulary swaps

W4

Tagline Portfolio

Produce a real functional-to-aspirational spectrum with meaningful variation

W5

One-Sided Position Paper

Commit fully to one side without hedging or introducing counterarguments

Behaviors · 3 categories · 15 tests

Hallucination

5 tests
H1

Source-Bounded Q&A

Refuse to answer when the source document does not contain the answer

H2

Contradiction Detection

Flag internal contradictions rather than smoothing them over

H3

Technical Knowledge Accuracy

Explain technical distinctions accurately without confabulating

H4

Fast-Moving Topic Accuracy

Calibrate uncertainty on evolving topics without fabricating specifics

H5

Sparse Signal Retrieval

Retrieve all mentions from a long document, not just the prominent ones

Instruction Following

5 tests
IF1

Executive Summary Under Constraints

Hold multiple simultaneous constraints without drifting

IF2

Competitive Analysis Table

Produce a clean structured table and resist adding unsolicited commentary

IF3

Ranked List with Justification

Maintain a defensible ranking with exactly one sentence of reasoning per item

IF4

Email Thread Reconstruction

Distinguish final agreed terms from interim positions in a thread

IF5

Qualitative Theme Extraction

Extract exactly the requested number of genuinely distinct themes

Refusal Calibration

5 tests
RC1

Competitive Intelligence Brief

Treat competitive analysis as a legitimate task and deliver specific intelligence

RC2

Termination Letter Under Negative Constraints

Follow negative constraints without moralizing or volunteering unsolicited guidance

RC3

Worker Classification Analysis

Apply specific legal tests to the facts rather than defaulting to "consult an attorney"

RC4

Performance Review Designed to Force Resignation

Recognize harmful intent behind a business-framed request and decline appropriately

RC5

Salary Negotiation Tactics

Recognize standard negotiation vocabulary without refusing or adding disclaimers

Stability · 1 category · 5 tests

Output Consistency

5 tests
OC1

Acquisition Go/No-Go

Reach a consistent recommendation when evidence is genuinely balanced

OC2

Primary Risk Identification

Name the same primary risk across repeated runs

OC3

Strategic Priority Ranking

Assign the same rank order to strategic initiatives across runs

OC4

Term Sheet Field Extraction

Extract identical values from an explicit document every time

OC5

Board Memo Key Points

Make the same editorial choices when selecting key points from a rich document

The unlock

See how each of the 32 models scored on every test.

Per-test quality, cost, and speed — free with a signup.

Start free
How It Works · 05

Four steps,
one cohort.

01
Design

40 cases. 8 categories.

A fixed bank of real business prompts across analysis, extraction, summarization, writing, hallucination, instruction following, refusal calibration, and output consistency.

02
Deliver

Same task. Same conditions.

Every model gets the same prompt, same source material, same instructions. No prompt tuning per model.

03
Judge

Four independent judges.

A panel of judge models from four providers, scoring against task-specific rubrics. Judges don't see each other.

04
Rank

Three ranks. No composite.

Quality, cost, and speed reported independently. The reader composes their own weighting.
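Since the Index deliberately ships no composite, any blended score is the reader's own construction. A hypothetical sketch, where the weights, the formula, and the reciprocal treatment of the cost and time multiples are all choices of this example rather than anything ORCFLO endorses:

```python
# Hypothetical reader-side weighting over the three reported dimensions.
# Quality is 0-100; cost_x and time_x are the "x cohort median" multiples,
# folded in via reciprocals so that cheaper/faster raises the score.
# Weights are illustrative defaults, not ORCFLO's.

def composite(q: float, cost_x: float, time_x: float,
              w_q: float = 0.6, w_c: float = 0.25, w_t: float = 0.15) -> float:
    """Blend quality with inverse cost and time multiples into one number."""
    return round(w_q * q + w_c * (100 / (1 + cost_x)) + w_t * (100 / (1 + time_x)), 1)

# Figures from the cohort listing above:
print(composite(95, 4.0, 3.5))  # GPT 5: top quality, heavy cost/time penalty
print(composite(88, 0.1, 1.0))  # Gemini 2.5 Flash: cheap and fast wins here
```

Shifting the weights flips the ranking, which is exactly the point of reporting the three dimensions separately.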

Read the full methodology: test design, the four-judge panel, tier system, limitations.
Stop guessing

Build with the right models.

Sign up to unlock case-level scores, custom weighting, and the full 40-test library. Free to start.

500 free credits · No credit card required · Cancel anytime