Legend Interaction: Single-click to toggle individual models. Double-click to toggle all models from the same provider.
Note: Deprecated models and older revisions are hidden by default. To view all models, go to the "Controls" tab and adjust the filters.
Model | Max Ctx | 128k (%) | 1M (%) | AUC @128k (%) | AUC @1M (%) ▼ | Model AUC (%) | Runs | Input Cost ($) | Output Cost ($) |
---|---|---|---|---|---|---|---|---|---|
gemini-2.5-flash-05-20 | 1,048,576 | #2 93.5% | #1 68.1% | #3 91.5% | #1 78.3% | #3 78.3% | 6,320 | #14 $361.6285 | WARN $6.8816 |
gemini-2.5-pro-06-05 | 1,048,576 | #4 84.4% | #3 63.9% | #4 89.6% | #2 77.5% | #4 77.5% | 6,320 | #17 $3,013.5705 | WARN $146.4724 |
gemini-2.5-flash-05-20 | 1,048,576 | #5 82.9% | #2 64.9% | #6 81.7% | #3 70.2% | #6 70.2% | 6,320 | #14 $361.6285 | WARN $0.8077 |
gemini-2.5-flash-preview-09-2025 | 1,048,576 | #7 78.9% | #4 59.6% | #5 85.6% | #4 67.9% | #7 67.9% | 6,112 | #12 $303.5261 | #13 $7.7013 |
gemini-2.5-flash-preview-09-2025 | 1,048,576 | #8 73.1% | #5 56.5% | #8 78.9% | #5 64.6% | #8 64.6% | 6,120 | #13 $303.6170 | #6 $0.7813 |
gemini-pro-1.5-002 OLD | 2,000,000 | #6 79.7% | #6 47.3% | #12 72.8% | #6 54.8% | #18 54.8% | 6,400 | #18 $3,198.4487 | #15 $25.6526 |
gpt-4.1 | 1,047,576 | #16 55.5% | #7 42.1% | #19 61.6% | #7 53.2% | #20 53.2% | 6,384 | #16 $2,468.4629 | #14 $19.7168 |
gemini-flash-1.5-002 OLD | 1,000,000 | #21 52.0% | #8 34.9% | #21 57.4% | #8 44.6% | #29 44.6% | 6,208 | #11 $162.7636 | #10 $1.5166 |
gpt-4.1-mini | 1,047,576 | #26 47.2% | #9 32.5% | #32 49.4% | #9 43.6% | #30 43.6% | 6,384 | #15 $493.6926 | #11 $3.9214 |
llama-4-maverick | 1,048,576 | #23 50.3% | #10 30.1% | #28 52.7% | #10 39.9% | #36 39.9% | 6,024 | #10 $136.3894 | #9 $1.4152 |
gemini-2.5-flash-lite-preview-06-17 | 1,048,576 | #28 46.7% | #13 17.4% | #27 52.8% | #11 35.2% | #42 35.2% | 6,320 | #8 $120.5428 | WARN $10.2502 |
gemini-2.5-flash-lite-preview-09-2025 | 1,048,576 | #33 39.6% | #12 23.8% | #34 47.9% | #12 34.3% | #44 34.3% | 6,112 | #5 $101.0891 | #12 $3.9277 |
gemini-2.5-flash-lite-preview-06-17 | 1,048,576 | #25 48.4% | #15 15.9% | #33 49.0% | #13 32.6% | #45 32.6% | 6,320 | #8 $120.5428 | WARN $0.1329 |
gemini-2.0-flash-001 | 1,048,576 | #12 60.1% | #17 13.3% | #22 56.0% | #14 32.1% | #46 32.1% | 6,224 | #7 $110.0136 | #8 $0.9748 |
gemini-2.5-flash-lite-preview-09-2025 | 1,048,576 | #42 34.0% | #11 23.8% | #44 40.5% | #15 31.7% | #47 31.7% | 6,120 | #6 $101.2057 | #1 $0.1163 |
minimax-01 | 1,000,192 | #9 71.9% | N/A | #14 71.6% | #16 31.4% | #10 63.2% | 5,224 | inc $90.9933 | inc $2.0702 |
minimax-m1 | 1,000,000 | #10 68.5% | N/A | #10 75.4% | #17 30.4% | #15 61.3% | 4,920 | inc $467.2028 | inc $20.8827 |
grok-4-fast | 2,000,000 | #27 46.8% | #20 11.5% | #17 65.1% | #18 26.6% | #53 26.6% | 6,400 | FREE $0.0000 | FREE $0.0000 |
gpt-4.1-nano | 1,047,576 | #37 36.4% | #18 13.0% | #41 42.6% | #19 24.6% | #55 24.6% | 6,384 | #9 $123.4231 | #7 $0.8169 |
gemini-2.0-flash-lite-001 | 1,048,576 | #34 38.6% | #14 17.0% | #47 38.4% | #20 24.2% | #56 24.2% | 6,224 | #4 $82.5102 | #4 $0.5253 |
gpt-5 | 400,000 | #1 95.0% | N/A | #1 96.7% | #21 22.9% | #1 93.8% | 4,800 | inc $386.1476 | inc $83.3761 |
gpt-5-mini | 400,000 | #3 87.1% | N/A | #2 92.6% | #22 20.6% | #2 84.4% | 4,800 | inc $77.2295 | inc $12.7560 |
llama-4-scout | 1,048,576 | #45 33.1% | #16 13.4% | #48 38.1% | #23 20.5% | #59 20.5% | 5,702 | #2 $55.0137 | #5 $0.6192 |
gemini-flash-1.5-8b-001 OLD | 1,000,000 | #48 27.0% | #19 12.1% | #53 30.0% | #24 17.8% | #61 17.8% | 6,208 | #3 $81.4626 | #3 $0.4052 |
grok-4-07-09 | 256,000 | #20 52.8% | N/A | #7 79.6% | #25 15.6% | #9 63.9% | 4,356 | inc $1,280.7056 | inc $174.9869 |
claude-sonnet-4 | 1,000,000 | #19 52.8% | N/A | #15 71.2% | #26 15.1% | #12 62.0% | 4,176 | inc $1,188.5016 | inc $96.0376 |
claude-sonnet-4 | 1,000,000 | #11 62.3% | N/A | #16 68.5% | #27 15.1% | #13 62.0% | 4,176 | inc $1,188.9738 | inc $42.6061 |
o4-mini | 200,000 | #15 57.4% | N/A | #9 76.0% | #28 15.0% | #14 61.3% | 4,448 | inc $247.9571 | inc $37.2230 |
o3 | 200,000 | #17 55.4% | N/A | #13 72.7% | #29 14.1% | #17 57.9% | 4,448 | inc $450.8311 | inc $51.7601 |
claude-3-haiku OLD | 200,000 | #14 58.6% | N/A | #26 52.9% | #30 12.3% | #21 50.5% | 4,176 | inc $49.5406 | inc $2.2391 |
claude-3.7-sonnet | 200,000 | #22 51.1% | N/A | #23 55.9% | #31 12.1% | #23 49.6% | 4,176 | inc $594.4813 | inc $27.8910 |
claude-3.7-sonnet | 200,000 | #18 52.9% | N/A | #24 55.5% | #32 11.7% | #24 47.8% | 4,176 | inc $594.5065 | inc $33.5946 |
claude-3.5-sonnet OLD | 200,000 | #30 46.1% | N/A | #29 51.2% | #33 10.9% | #28 44.8% | 4,176 | inc $594.4836 | inc $27.3120 |
claude-3.5-haiku | 200,000 | #29 46.2% | N/A | #31 50.0% | #34 10.3% | #32 42.0% | 4,120 | inc $155.2502 | inc $7.3357 |
gpt-5-nano | 400,000 | #35 38.3% | N/A | #38 44.2% | #35 9.6% | #37 39.4% | 4,800 | inc $15.4459 | inc $5.1686 |
o3-mini | 200,000 | #40 34.7% | N/A | #39 43.8% | #36 8.7% | #41 35.8% | 4,456 | inc $248.4745 | inc $47.7402 |
claude-opus-4:curated unranked | 200,000 | 63.2% | N/A | 73.1% | 8.6% | 73.1% | 2,797 | inc $1,688.0723 | inc $190.7885 |
grok-3-beta | 131,072 | #13 60.0% | N/A | #11 73.0% | #37 8.6% | #5 73.0% | 4,000 | inc $436.6788 | inc $23.1833 |
claude-opus-4 | 200,000 | #49 25.9% | N/A | #18 63.2% | #38 7.5% | #11 63.2% | 3,269 | inc $2,465.6002 | inc $190.7885 |
deepseek-r1-0528 | 163,840 | #31 44.5% | N/A | #20 61.0% | #39 7.2% | #16 61.0% | 3,736 | inc $59.7154 | inc $7.0488 |
grok-3-mini-beta | 131,072 | #36 37.0% | N/A | #25 54.8% | #40 6.5% | #19 54.8% | 3,976 | inc $44.2934 | inc $2.8246 |
qwen-turbo | 1,000,000 | #59 14.2% | #21 4.2% | #59 24.5% | #41 6.4% | #66 6.4% | 5,984 | #1 $44.3817 | #2 $0.1932 |
grok-3-mini-beta:high | 131,072 | #39 34.8% | N/A | #30 50.3% | #42 5.9% | #22 50.3% | 4,000 | inc $44.4349 | inc $4.8529 |
mistral-small-3.1-24b-instruct | 131,072 | #24 48.7% | N/A | #35 47.7% | #43 5.6% | #25 47.7% | 3,728 | inc $5.0020 | inc $0.2265 |
kimi-k2 | 131,072 | #32 39.9% | N/A | #36 47.3% | #44 5.6% | #26 47.3% | 3,736 | inc $66.0055 | inc $4.3029 |
qwen3-14b | 40,960 | #41 34.5% | N/A | #37 44.9% | #45 5.3% | #27 44.9% | 3,416 | inc $3.6445 | inc $0.4289 |
mistral-medium-3 | 131,072 | #44 33.2% | N/A | #40 43.0% | #46 5.1% | #31 43.0% | 3,936 | inc $59.9857 | inc $3.1361 |
qwq-32b | 131,072 | #58 14.3% | N/A | #42 41.7% | #47 4.9% | #33 41.7% | 3,688 | inc $17.5038 | inc $1.3619 |
deepseek-r1-0120 | 163,840 | #53 18.0% | N/A | #43 40.6% | #48 4.8% | #34 40.6% | 3,952 | inc $58.2088 | inc $8.4610 |
gemma-3-12b-it | 131,072 | #38 35.0% | N/A | #45 39.9% | #49 4.7% | #35 39.9% | 3,664 | inc $5.9463 | inc $0.1337 |
qwen3-30b-a3b | 131,072 | #46 29.7% | N/A | #46 39.1% | #50 4.6% | #38 39.1% | 3,416 | inc $7.2671 | inc $1.0059 |
llama-3.3-70b-instruct | 131,072 | #55 17.1% | N/A | #49 37.9% | #51 4.5% | #39 37.9% | 3,672 | inc $4.6160 | inc $0.1701 |
qwen3-32b | 131,072 | #47 29.3% | N/A | #50 36.5% | #52 4.3% | #40 36.5% | 3,416 | inc $12.7466 | inc $1.8151 |
deepseek-chat-v3-0324 | 163,840 | #43 33.8% | N/A | #51 34.9% | #53 4.1% | #43 34.9% | 3,960 | inc $36.4327 | inc $1.4411 |
gpt-oss-120b | 131,072 | #52 20.7% | N/A | #52 30.3% | #54 3.6% | #48 30.3% | 3,728 | inc $5.9578 | inc $1.3461 |
qwen3-235b-a22b | 131,072 | #57 14.9% | N/A | #54 29.6% | #55 3.5% | #49 29.6% | 3,416 | inc $20.0448 | inc $3.0603 |
command-r7b-12-2024 | 128,000 | #54 17.3% | N/A | #55 28.7% | #56 3.4% | #50 28.7% | 3,704 | inc $4.5529 | inc $0.2174 |
gemma-3-27b-it | 131,072 | #50 25.1% | N/A | #56 28.2% | #57 3.3% | #51 28.2% | 3,712 | inc $10.8761 | inc $0.2575 |
qwen3-8b | 128,000 | #51 21.3% | N/A | #57 27.4% | #58 3.2% | #52 27.4% | 3,416 | inc $3.1890 | inc $0.5640 |
gpt-oss-20b | 131,072 | #56 15.3% | N/A | #58 25.7% | #59 3.0% | #54 25.7% | 3,728 | inc $3.5777 | inc $1.1133 |
mistral-large-2411 | 131,072 | #66 0.0% | N/A | #60 24.1% | #60 2.8% | #57 24.1% | 3,824 | inc $304.2297 | inc $9.4407 |
nova-lite-v1 | 300,000 | #63 6.0% | N/A | #63 17.9% | #61 2.7% | #64 11.1% | 4,608 | inc $25.0529 | inc $0.2342 |
ministral-8b OLD | 131,072 | #62 7.2% | N/A | #61 22.8% | #62 2.7% | #58 22.8% | 3,712 | inc $12.2571 | inc $0.1293 |
qwen3-4b:free | 40,960 | #60 11.7% | N/A | #62 18.0% | #63 2.1% | #60 18.0% | 3,416 | inc FREE $0.0000 | inc FREE $0.0000 |
gemma-3-4b-it | 131,072 | #61 8.1% | N/A | #64 14.5% | #64 1.7% | #62 14.5% | 3,688 | inc $4.8146 | inc $0.0686 |
ministral-3b OLD | 32,768 | #64 5.4% | N/A | #65 13.9% | #65 1.6% | #63 13.9% | 3,728 | inc $5.0020 | inc $0.0360 |
nova-micro-v1 | 128,000 | #65 4.0% | N/A | #66 9.7% | #66 1.1% | #65 9.7% | 3,448 | inc $4.5526 | inc $0.0811 |
router unranked | 131,072 | 2.2% | N/A | 5.8% | 0.7% | 5.8% | 3,728 | inc $120.4235 | inc $2.7164 |
mistral-nemo OLD | 131,072 | #66 0.0% | N/A | #67 1.5% | #67 0.2% | #67 1.5% | 3,760 | inc $2.6120 | inc $0.0215 |
Table Interaction: Click headers to sort.
(Model header): Reload data. (Model cell): View model performance chart. (Bin header): View Cost/Score plot. (Score cell): View test details.
Notes & Definitions:
- N/A indicates no results for this model/bin, potentially due to context window limits.
- AUC (Area Under Curve) normalized to 100% reflects overall performance across bins, weighted by bin width. Model AUC is normalized only for context bins within the model's max context length.
- Badge definitions:
  - #1 #2 #3: Top 3 models for that metric (higher score / lower cost is better).
  - unranked: Unranked due to known issues.
  - WARN: Cost Inaccuracy reported.
  - inc: Incomplete cost data (potentially underestimated cost, excluded from cost rank).
  - FREE: Free model or cost data unavailable (excluded from cost rank).
  - OLD: Old model (released >1yr ago).
Benchmark Details & Methodology
1. What is the OpenAI-MRCR benchmark?
OpenAI MRCR tests a Large Language Model's (LLM) ability to handle complex conversational history. Key aspects include:
- Core Task: Finding and distinguishing between multiple identical pieces of information ("needles") hidden within a long conversation ("haystack").
- Setup: Inspired by Google DeepMind's MRCR eval (arxiv:2409.12640v2), this version inserts 2, 4, or 8 identical requests (e.g., "write a poem about tapirs") alongside distractor requests. Needles/distractors are generated by GPT-4o to blend in.
- Challenge: The model must retrieve a specific instance (e.g., the 2nd poem) based on its order, requiring careful tracking of the conversation. It must also prepend a specific random code (hash) to its answer.
- Data Source: The benchmark data and detailed methodology are described on Hugging Face (openai/mrcr).
- Dashboard Scope: This dashboard visualizes results directly from that published dataset and does not currently run new evaluations.
2. Which is harder: 2-needle or 8-needle tests?
Generally, 8-needle tests are more challenging for language models. A common failure mode with 8 needles, compared to 2 needles, is that the model might identify multiple similar "needles" (pieces of information) within the long context but select the incorrect one out of the eight possibilities.
To understand the setup for each test case:
- Context: The input context is populated with various writing mediums (e.g., poems, letters). Each medium focuses on several topic categories (e.g., "tapir," "chair").
- Needle Placement: There are 2, 4, or 8 instances of each combination of writing medium and topic (e.g., "write a poem about tapirs," followed by the poem) scattered throughout the context.
- The Question: The model is then asked a specific question, like "Return the fifth poem about tapirs and prepend the code XXX to it."
- Scoring: The model's response is graded from 0-100% based on its similarity to the expected answer.
So, an "8-needle" test means that when the model is quizzed, there are eight distinct instances of the same writing medium and topic, and the task is to retrieve the nth specific one found in the context.
While 8-needle tests are typically harder and can be considered a more robust measure of this specific capability, the importance of 2-needle versus 8-needle performance can depend on the specific use case. This is why results for different needle counts are often presented.
3. How is the score calculated?
The score measures how accurately the model retrieves the correct instance of the requested needle. The process involves:
- Comparison Method: The model's answer is compared to the expected answer using the SequenceMatcher ratio from Python's `difflib` library.
- Hash Requirement: The model *must* include a specific random code (hash identifier) at the start of its answer. This hash is removed before the comparison.
- Failure Condition: If the required hash is missing or incorrect, the score is automatically 0.
- Result: The similarity ratio (0.0 to 1.0) from the comparison is presented as a percentage (0-100%) in the table.
See FAQ #7 for details about how Area Under Curve (AUC) scores summarize performance across different context lengths.
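The scoring steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual code: the function name `grade` and its parameter names are assumptions, and it assumes the expected answer also begins with the hash, so the hash is stripped from both strings before comparison.

```python
from difflib import SequenceMatcher


def grade(response: str, expected: str, required_hash: str) -> float:
    """Return a similarity ratio in [0.0, 1.0]; 0.0 if the hash prefix is missing.

    Names are illustrative; assumes `expected` also starts with the hash.
    """
    # Failure condition: the answer must begin with the required hash.
    if not response.startswith(required_hash):
        return 0.0
    # Remove the hash from both strings before comparing (Python 3.9+).
    response_body = response.removeprefix(required_hash)
    expected_body = expected.removeprefix(required_hash)
    # SequenceMatcher.ratio() returns a float between 0.0 and 1.0.
    return SequenceMatcher(None, response_body, expected_body).ratio()


# Multiply by 100 to get the percentage shown in the table.
score_percent = 100 * grade("abc123 A poem", "abc123 A poem", "abc123")
```

An answer that reproduces the expected text but omits the hash still scores 0, which is why the hash requirement acts as a hard gate rather than a small penalty.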
4. What do the context length "bins" (e.g., 128k, 1M) mean?
The "bins" group test runs based on the total length of the text involved (prompt + expected answer). Here's how they work:
- Measurement: Length is measured in tokens using the `o200k_base` tokenizer.
- Grouping: Tests are grouped into bins based on their total token count. For example, the "128k" bin includes tests with >65,536 and <=131,072 tokens.
- Score Display: The score shown for a bin (e.g., "128k (%)") is the average score from the 100 test samples conducted within that bin's length range.
- Boundaries: The specific bin boundaries are [4k, 8k], (8k, 16k], (16k, 32k], (32k, 64k], (64k, 128k], (128k, 256k], (256k, 512k], (512k, 1M], where k = 1,024 tokens (so 64k = 65,536 and 1M = 1,048,576).
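As a rough sketch of the grouping rule, the function below maps a total token count to a bin label. It takes the token count as given (counting tokens with `o200k_base` is out of scope here). The names `bin_label`, `BIN_UPPER_BOUNDS`, and `BIN_LABELS` are hypothetical, and returning `None` for out-of-range counts is an assumption.

```python
from bisect import bisect_left
from typing import Optional

# Inclusive upper bound of each bin, in tokens, with the matching labels.
BIN_UPPER_BOUNDS = [8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]
BIN_LABELS = ["8k", "16k", "32k", "64k", "128k", "256k", "512k", "1M"]
MIN_TOKENS = 4096  # the first bin is closed on the left: [4k, 8k]


def bin_label(total_tokens: int) -> Optional[str]:
    """Return the bin label for a prompt + expected-answer token count."""
    # Assumption: counts outside [4096, 1048576] belong to no bin.
    if total_tokens < MIN_TOKENS or total_tokens > BIN_UPPER_BOUNDS[-1]:
        return None
    # bisect_left finds the first upper bound >= total_tokens,
    # which matches the half-open (lower, upper] intervals.
    return BIN_LABELS[bisect_left(BIN_UPPER_BOUNDS, total_tokens)]
```

For example, a test with exactly 65,536 tokens lands in the "64k" bin, while one with 65,537 tokens lands in the "128k" bin.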
5. How is this different from Fiction.livebench?
Both test how well models handle long texts, but focus on different skills. OpenAI-MRCR tests the model's ability to pinpoint and distinguish between identical pieces of information based on their order in a conversation (using synthetic data). Fiction.livebench (fiction.live/...) tests narrative understanding – how well models follow plots, characters, and consistency within complex stories, using quizzes based on actual fiction excerpts.
6. How does the benchmark design reduce the risk of models succeeding due to training data contamination?
The benchmark incorporates several design features to minimize the chance that models succeed simply by having seen similar data during training:
- Synthetic & Unique Data: Each test uses a specially generated, long conversation. While topics might overlap with training data, the specific sequence of turns and the placement of "needles" are unique to the benchmark run.
- Instance Specificity: The core task isn't just retrieving information (e.g., a poem about tapirs) but retrieving a specific instance (e.g., the 2nd poem requested) based on conversational order. Simple memorization of poems is insufficient.
- Required Random Hash: Models must prepend a specific random code (hash) to their answer. This code is generated for the test run and cannot be predicted from general training data. Failure to include the correct hash results in a score of 0.
These elements combined make it highly unlikely that a model can achieve a high score purely by recalling memorized training data, as success requires understanding the specific conversational context, instance order, and adhering to the random hash requirement. (Source: OpenAI MRCR Dataset Card)
Note: This dashboard evaluates models against the dataset published by OpenAI (data collected up to April 11, 2024). Models released or significantly updated after this date might have been trained on this specific benchmark data, potentially affecting their results.
Understanding the Results
7. What are the different AUC scores and how are they calculated?
AUC (Area Under Curve) gives a single score summarizing performance across different context lengths (bins). Think of it like an average grade across tests of increasing difficulty (longer contexts). It's calculated by plotting the average score for each bin against the maximum context length of that bin and measuring the area under the resulting line/curve. This area is then normalized to a percentage (0-100%).
- AUC @128k: AUC calculated using results only up to the 128k token bin (tests with up to 131,072 tokens).
- AUC @1M: AUC calculated using results across all bins up to the 1M token bin (tests with up to 1,048,576 tokens).
- Model AUC: AUC calculated only over the range of bins the specific model actually completed successfully. This provides a fairer comparison if a model couldn't handle the longest contexts.
Technical Note: Calculation uses the Trapezoidal Rule on a linear scale of context lengths. See example below for details.
Show Calculation Example
Example (AUC @1M for google/gemini-2.5-flash-05-20:thinking, 2-Needle Top Model):
- Data Points (Bin, Score%): (8k, 98.2), (16k, 94.6), (32k, 91.5), (64k, 88.7), (128k, 93.5), (256k, 83.3), (512k, 76.0), (1M, 68.1)
- Calculate Trapezoid Areas (these appear to use unrounded bin scores, so they differ slightly from the products of the rounded scores listed above):
  - 8k-16k: (16384 - 8192) * (98.2 + 94.6)/2 = 789,782.55
  - 16k-32k: (32768 - 16384) * (94.6 + 91.5)/2 = 1,524,358.43
  - 32k-64k: (65536 - 32768) * (91.5 + 88.7)/2 = 2,952,396.58
  - 64k-128k: (131072 - 65536) * (88.7 + 93.5)/2 = 5,972,783.7
  - 128k-256k: (262144 - 131072) * (93.5 + 83.3)/2 = 11,586,747.3
  - 256k-512k: (524288 - 262144) * (83.3 + 76.0)/2 = 20,871,405.85
  - 512k-1M: (1048576 - 524288) * (76.0 + 68.1)/2 = 37,757,386.49
- Sum Areas: 81,454,860.9
- Normalize: The total width for AUC @1M is (1M bin - 8k bin) = 1048576 - 8192 = 1,040,384. Normalized AUC = (Total Area / Total Width) = 81,454,860.9 / 1,040,384 ≈ 78.2931
- Result: AUC @1M ≈ 78.3%
(Note: The example above dynamically uses data for the 2-needle benchmark from the model currently ranked #1 by AUC @1M for illustrative purposes. The same calculation method applies to all models.)
AUC @128k uses the same method but only sums areas up to the 128k point and normalizes by the width from the first bin to 128k. Model AUC normalizes by the width of the range actually tested by the model.
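The calculation described in this section can be reproduced with a short Python sketch. The function name `normalized_auc` is an assumption, and the inputs are the rounded bin scores from the example above, so the result matches the published figure only after rounding (about 78.3 for AUC @1M).

```python
from typing import Sequence, Tuple


def normalized_auc(points: Sequence[Tuple[int, float]]) -> float:
    """Trapezoidal-rule AUC over (context_length, score%) points,
    normalized by the total width so the result stays on a 0-100 scale."""
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    area = 0.0
    # Sum the trapezoid between each pair of adjacent bins (linear x-axis).
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area / (pts[-1][0] - pts[0][0])


# Rounded scores from the example; bin positions are the bins' upper bounds.
points = [
    (8192, 98.2), (16384, 94.6), (32768, 91.5), (65536, 88.7),
    (131072, 93.5), (262144, 83.3), (524288, 76.0), (1048576, 68.1),
]
auc_1m = normalized_auc(points)
# AUC @128k: same method, restricted to bins up to 131,072 tokens.
auc_128k = normalized_auc([p for p in points if p[0] <= 131072])
```

Because the x-axis is linear, the widest bins dominate: the 512k-1M trapezoid alone contributes roughly half of the total width, so long-context scores weigh heavily in AUC @1M.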
8. How is the cost calculated?
Cost estimates shown are calculated only for successful test runs (where the model returned a response) and are based on the following:
- Pricing Source: Public pricing data reported by OpenRouter for each model, specifically using the rate from the cheapest provider listed for that model at its maximum context length.
- Token Counting Method: Cost estimates use the input (prompt) and output (completion) token counts reported by the API provider for each successful run.
- Note: This method may differ from the tokenization used for grouping results. Test results are assigned to context length bins based on total tokens (prompt + expected completion) calculated with the `o200k_base` tokenizer, as opposed to the provider-reported counts used for cost estimation.
- Input Cost: Estimates the cost based on the total tokens for the input (prompt) reported by the API. When generating multiple candidates (n>1), the prompt is typically sent once, but the exact cost impact can vary by API provider.
- Output Cost: Estimates the cost of all output tokens generated by the model, including any reasoning tokens as well as the final response. It sums the output token costs across all generated candidates (e.g., all 8 runs per test).
Important Note on Actual vs. Estimated Cost:
- The benchmark execution aimed to utilize cost-saving measures such as batch processing, caching, and available discounts/credits when possible.
- Consequently, the actual costs incurred during the original benchmark run may have been lower than the estimates presented here.
To ensure a fair and standardized comparison across all models for users, this dashboard displays estimated costs. These are calculated using publicly available on-demand prices from OpenRouter (based on the cheapest provider listed for the model's max context length), rather than the potentially variable actual costs from the benchmark execution.
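A minimal sketch of the cost estimate described above follows. Everything named here is hypothetical (`RunUsage`, `estimate_costs`, and the per-million-token price parameters), and it assumes the prompt is billed once per test regardless of how many candidates are generated, which, as noted above, can vary by provider.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class RunUsage:
    """Provider-reported token usage for one successful test (illustrative)."""
    prompt_tokens: int            # input tokens; assumed billed once per test
    completion_tokens: List[int]  # output tokens, one entry per candidate


def estimate_costs(
    runs: Iterable[RunUsage],
    input_price_per_mtok: float,   # assumed unit: USD per 1M input tokens
    output_price_per_mtok: float,  # assumed unit: USD per 1M output tokens
) -> Tuple[float, float]:
    """Return (input_cost, output_cost) in USD over all successful tests."""
    runs = list(runs)
    input_tokens = sum(r.prompt_tokens for r in runs)
    # Output cost sums tokens across every candidate, reasoning included.
    output_tokens = sum(sum(r.completion_tokens) for r in runs)
    return (
        input_tokens * input_price_per_mtok / 1_000_000,
        output_tokens * output_price_per_mtok / 1_000_000,
    )
```

Failed tests contribute nothing to either total, which is why a model that could not complete the longer bins may show a lower cost than it would have incurred on a full run.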
9. What does the "Runs" column represent?
This shows the total number of potential answers generated by the model across all successful test runs included in the summary. Each answer corresponds to one candidate response (or 'run'), typically generated using the 'n' parameter in the API call to get multiple outputs per input.
The total reflects the generation workload: (Number of Successful Tests) × (Candidates Generated Per Test).
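As a quick check of that formula, the sketch below assumes 8 context bins with 100 test samples each (from FAQ #4) and 8 candidates per test (the example figure in FAQ #8). The function name `total_runs` is hypothetical.

```python
def total_runs(successful_tests: int, candidates_per_test: int) -> int:
    """Runs = (number of successful tests) x (candidates generated per test)."""
    return successful_tests * candidates_per_test


# Assumed setup: 8 bins x 100 samples per bin, 8 candidates per test.
full_coverage = total_runs(8 * 100, 8)
# A model limited to the five bins up to 128k would top out lower.
up_to_128k = total_runs(5 * 100, 8)
```

Under these assumptions a model that completes every bin reaches 6,400 runs, which matches the largest values in the Runs column; smaller values reflect failed tests or bins beyond the model's context window.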
10. What do the badges (#1, #2, #3, inc) in the table mean?
- #1 #2 #3 These badges highlight the top 3 models for that specific column's metric. Higher scores are better for performance (%), while lower costs ($) are better.
- unranked The "unranked" badge indicates that a model is not included in rankings for specific reasons, such as known issues with its performance. Hover over the badge on a model for the specific reason.
- WARN This badge appears in a cost column if the API has a known token-reporting inaccuracy. The cost is an estimate based on potentially incorrect data and is excluded from ranking. Hover over the badge for details.
- inc The "inc" (incomplete) badge appears in cost columns if the model didn't successfully complete tests for all context length bins (up to 1M tokens). The cost shown might be lower than it would be if all tests had passed. Models marked inc are not included in the cost rankings to ensure fair comparison.
- FREE This badge appears in cost columns for models where cost data is zero or unavailable (e.g., free models). These are also excluded from cost rankings.
- deprec This badge indicates the model may be deprecated or no longer actively supported/available via common APIs (like OpenRouter). Results might be stale.
- new This badge indicates results for this model were updated recently (within the last 7 days). This badge takes precedence over the deprec badge.
- OLD This badge indicates the model was released more than one year ago. It is not shown if the new badge is present.
Dashboard Features
11. How do I interact with the table headers?
The main results table headers provide several actions:
- Sorting: Click any column header label (e.g., `AUC @1M (%)`) to sort the entire table by that column's values. Click again to reverse the sort direction.
- Refresh Data: Click the refresh icon in the 'Model' column header to reload the latest results for the currently selected needle count.
- Model Performance Chart: Click the timeline icon next to a model name in the 'Model' column to view a chart of that model's performance across all available needle counts.
- Cost/Score Plot: Click the stats icon in the header of any context bin column (e.g., `128k (%)`) to view a Cost vs. Score scatter plot for models within that specific bin.
12. How can I see a Cost vs. Score chart for a specific context bin?
In the main results table, click the stats icon in the header of any context bin column (e.g., `128k (%)`). This opens a scatter plot showing the total cost vs. average score for all currently selected models specifically within that context length bin.
13. How can I view the detailed results for a specific test run?
In the main results table, hover over any individual score cell (the percentage value) and click the document icon that appears. This opens a modal where you can browse through the individual test runs for that model/bin combination, view the expected answer, and see the actual responses generated by the model along with their scores.
14. Can I customize the table columns?
Yes! Go to the "Controls" tab (next to the "Leaderboard" tab). There you can find options to show/hide specific columns, like all the individual context length bins or the pricing information.
15. How are the initially visible traces (lines/bars) on the main chart chosen?
When viewing the main performance chart:
- If 5 or fewer models are selected in the "Controls" tab, all their performance traces (lines/bars) will be visible on the chart by default.
- If more than 5 models are selected, the chart initially shows only the traces for the top 5 performing non-deprecated models (ranked by their overall `AUC @1M (%)` score).
- All other selected models are listed in the legend but their traces are hidden initially. Click a model's name in the legend to display its performance trace.
- You can toggle the visibility of any model's performance trace by clicking its name in the legend.
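The default-visibility rule above can be sketched as follows. This is an illustration of the described behavior only, not the dashboard's code: `ModelSummary`, `initially_visible`, and `MAX_INITIAL_TRACES` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

MAX_INITIAL_TRACES = 5  # threshold described in the rule above


@dataclass
class ModelSummary:
    """Minimal per-model record (illustrative fields only)."""
    name: str
    auc_1m: float          # overall AUC @1M (%)
    deprecated: bool = False


def initially_visible(selected: List[ModelSummary]) -> List[str]:
    """Return the names of models whose traces are shown by default."""
    # With 5 or fewer selected models, every trace is visible.
    if len(selected) <= MAX_INITIAL_TRACES:
        return [m.name for m in selected]
    # Otherwise show the top 5 non-deprecated models by AUC @1M.
    candidates = [m for m in selected if not m.deprecated]
    ranked = sorted(candidates, key=lambda m: m.auc_1m, reverse=True)
    return [m.name for m in ranked[:MAX_INITIAL_TRACES]]
```

Models not returned by this function would still appear in the legend, with their traces hidden until clicked.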
16. How are the initially visible points on the Cost vs. Score chart chosen?
When viewing the Cost vs. Score chart for a specific bin or AUC metric:
- The chart initially displays data points only for the top 10 performing models among those currently selected. The ranking is based on the primary score metric being plotted: either the average score (%) within that specific context bin, or the relevant AUC score (e.g., `AUC @1M (%)`, `Model AUC (%)`) if viewing an AUC plot.
- All other selected models are listed in the legend but their data points are hidden initially. Click a model's name in the legend to display its data point.
- You can toggle the visibility of any model's data point by clicking its name in the legend.
Data, Code & Contact
17. Where can I find the benchmark data and code?
The dataset details and evaluation methods are described on Hugging Face: openai/mrcr. OpenAI also discussed results in their GPT-4.1 blog post.
18. Can you evaluate more models?
We plan to add more models over time. Feel free to suggest specific models you'd like to see evaluated!
19. How can I contact you?
For questions, suggestions, or issues, please reach out on Twitter: @DillonUzar.