Every model.
Every task.
One Index.
Three quick choices. The Index does the math and tells you which models match your work.
An AI benchmark
built for real business work.
The ORCFLO Index runs every major AI model through the kind of tasks a real person actually does at a real company: strategic analysis, extracting structured data, summarizing documents, writing business copy, and following complex instructions under real constraints.
Every response is scored by an independent four-judge panel against a rubric written specifically for the test it covers — so the score reflects whether the output is actually good, not just whether it parsed.
No vibes. No vendor influence. One benchmark designed to answer a single question: which model should you use?
Three dimensions.
Read all three.
A model that ranks #1 on quality but #32 on cost is not interchangeable with a model that ranks #5 on both. We don't average them away into a single number. You see all three, side by side.
How good is the output?
Every response scored by a four-model judge panel against a task-specific rubric. With 40 cases × 4 judges, every model has 160 scoring events per cohort.
What does the same task cost?
Actual API spend at the provider's published price. Per-case cost varies widely across the Index — from a fraction of a cent to several dollars, depending on the model.
How long do you wait?
Wall-clock time from request to response. The number that matters when a human is waiting — or a downstream step depends on it.
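For readers who want the mechanics, here is a minimal sketch of how per-case cost and latency are typically derived. The prices, model name, and function shapes are illustrative assumptions, not the Index's actual harness.

```python
import time

# Illustrative published prices in USD per million tokens -- placeholder
# figures for a hypothetical model, not actual Index or provider data.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

def case_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Per-case spend at the provider's published per-token prices."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def timed_call(call_fn, *args, **kwargs):
    """Wall-clock latency: from sending the request to receiving the full response."""
    start = time.perf_counter()
    response = call_fn(*args, **kwargs)
    return response, time.perf_counter() - start
```

At those placeholder prices, a case with 2,000 input tokens and 800 output tokens would cost $0.018.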
Tell us.
We'll tell you.
Three quick choices below. Every change reshapes the chart and re-ranks the picks live.
32 models across all tasks
Here's what we test.
Sign up for the scores.
Every model in the Index runs each of these 40 real-world tests under identical conditions. The names and what they probe are public. How each model scored is free with a signup.
Analysis
5 tests
SaaS Growth Decision
Identify disqualifying factors before engaging the surface question
Campaign ROI Tradeoff
Do the actual math and distinguish absolute return from ROI efficiency
Build vs. Buy
Take a clear position instead of producing a generic pros/cons list
Ethical Edge Case
Give a genuine recommendation when ethics and revenue come into tension
Pricing Page Audit
Produce critique specific to the actual page, not generic best practices
Extraction
5 tests
Job Description Field Extraction
Extract structured fields accurately, returning null when data is absent
Support Ticket Classification
Classify into exactly the specified categories with no invented labels
Contract Clause Extraction
Pull legal fields without inventing terms that are implied but unstated
Named Entity Extraction
Categorize named entities correctly with no duplicates or fabrications
Meeting Transcript to Structured Output
Produce structured output from messy transcripts within word limits
Summarization
5 tests
Hard Compression
Compress a multi-party thread to a strict word limit without losing key facts
Bullet Compression
Hit an exact bullet count and word limit with no banned words
Multi-Section Document Summary
Parse a long document into a specific multi-section format under constraints
Lossy vs. Lossless Summarization
Simplify without introducing inaccuracies, and flag where nuance was dropped
Compression Under Priority Pressure
Preserve strategically important information, including buried risks
Writing
5 tests
Dual-Audience Rewrite
Adapt one source into two genuinely distinct versions for different audiences
Cold Outreach Email
Hold stacked constraints while producing a specific, non-generic cold email
LinkedIn Hook Portfolio
Use genuinely different structural techniques, not just vocabulary swaps
Tagline Portfolio
Produce a real functional-to-aspirational spectrum with meaningful variation
One-Sided Position Paper
Commit fully to one side without hedging or introducing counterarguments
Hallucination
5 tests
Source-Bounded Q&A
Refuse to answer when the source document does not contain the answer
Contradiction Detection
Flag internal contradictions rather than smoothing them over
Technical Knowledge Accuracy
Explain technical distinctions accurately without confabulating
Fast-Moving Topic Accuracy
Calibrate uncertainty on evolving topics without fabricating specifics
Sparse Signal Retrieval
Retrieve all mentions from a long document, not just the prominent ones
Instruction Following
5 tests
Executive Summary Under Constraints
Hold multiple simultaneous constraints without drifting
Competitive Analysis Table
Produce a clean structured table and resist adding unsolicited commentary
Ranked List with Justification
Maintain a defensible ranking with exactly one sentence of reasoning per item
Email Thread Reconstruction
Distinguish final agreed terms from interim positions in a thread
Qualitative Theme Extraction
Extract exactly the requested number of genuinely distinct themes
Refusal Calibration
5 tests
Competitive Intelligence Brief
Treat competitive analysis as a legitimate task and deliver specific intelligence
Termination Letter Under Negative Constraints
Follow negative constraints without moralizing or volunteering unsolicited guidance
Worker Classification Analysis
Apply specific legal tests to the facts rather than defaulting to "consult an attorney"
Performance Review Designed to Force Resignation
Recognize harmful intent behind a business-framed request and decline appropriately
Salary Negotiation Tactics
Recognize standard negotiation vocabulary without refusing or adding disclaimers
Output Consistency
5 tests
Acquisition Go/No-Go
Reach a consistent recommendation when evidence is genuinely balanced
Primary Risk Identification
Name the same primary risk across repeated runs
Strategic Priority Ranking
Assign the same rank order to strategic initiatives across runs
Term Sheet Field Extraction
Extract identical values from an explicit document every time
Board Memo Key Points
Make the same editorial choices when selecting key points from a rich document
See how each of the 32 models scored on every test.
Per-test quality, cost, and speed — free with a signup.
Four steps,
one cohort.
40 cases. 8 categories.
A fixed bank of real business prompts covering analysis, extraction, summarization, writing, hallucination, instruction following, refusal calibration, and output consistency.
Same task. Same conditions.
Every model gets the same prompt, same source material, same instructions. No prompt tuning per model.
Four independent judges.
A panel of judge models from four providers, scoring against task-specific rubrics. Judges don't see each other.
Three ranks. No composite.
Quality, cost, and speed reported independently. The reader composes their own weighting.
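As a concrete illustration of composing your own weighting, here is a minimal sketch; the model names, ranks, and weights are placeholders, not Index results.

```python
# Placeholder ranks (lower is better) for two hypothetical models --
# these are not real Index results.
ranks = {
    "model-a": {"quality": 1, "cost": 32, "speed": 4},
    "model-b": {"quality": 5, "cost": 5, "speed": 7},
}

def blended_rank(r: dict, w_quality: float = 0.6, w_cost: float = 0.2, w_speed: float = 0.2) -> float:
    """Weight the three independent ranks however your use case demands."""
    return w_quality * r["quality"] + w_cost * r["cost"] + w_speed * r["speed"]

best = min(ranks, key=lambda m: blended_rank(ranks[m]))
print(best)  # "model-b": its balance beats model-a's #1 quality under these weights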
Build with the right models.
Sign up to unlock case-level scores, custom weighting, and the full 40-test library. Free to start.