Legend Interaction: Single-click to toggle individual models. Double-click to toggle all models from the same provider.
Model | Max Ctx | 128k (%) | 1M (%) | AUC @128k (%) | AUC @1M (%) ▼ | Model AUC (%) | Runs | Total Prompt Cost ($) | Total Compl Cost ($) |
---|---|---|---|---|---|---|---|---|---|
google/gemini-2.5-pro-preview-03-25 new | 1,048,576 | #1 91.6% | #1 64.1% | #1 95.3% | #1 79.0% | #1 79.0% | 6,320 | $3,013.3419 | $94.0818 |
google/gemini-2.5-flash-preview:thinking new | 1,048,576 | #3 87.0% | #3 59.8% | #2 89.9% | #2 73.8% | #2 73.8% | 6,320 | $180.8089 | $22.9385 |
google/gemini-2.5-flash-preview new | 1,048,576 | 70.1% | 54.6% | 76.2% | #3 62.0% | 62.0% | 6,320 | $180.8089 | $1.6501 |
openai/gpt-4.1 new | 1,047,576 | 64.0% | 56.7% | 72.7% | 60.6% | 60.6% | 6,400 | $2,457.5815 | $19.8782 |
google/gemini-2.0-flash-thinking-exp:free deprec | N/A | #2 88.2% | #2 60.8% | #3 88.1% | 54.3% | 54.3% | 6,320 | FREE $0.0000 | FREE $0.0000 |
openai/gpt-4.1-mini new | 1,047,576 | 60.2% | 40.1% | 64.6% | 52.8% | 52.8% | 6,376 | $482.5873 | $3.9791 |
google/gemini-flash-1.5 new | 1,000,000 | 65.9% | 46.9% | 63.7% | 51.8% | 51.8% | 6,312 | $178.8531 | $1.5116 |
meta-llama/llama-4-maverick new | 1,048,576 | 48.2% | 36.9% | 60.8% | 45.8% | 45.8% | 5,896 | inc $135.3332 | inc $1.7877 |
google/gemini-flash-1.5-8b new | 1,000,000 | 47.2% | 34.7% | 51.0% | 40.7% | 40.7% | 6,312 | #2 $89.4265 | #1 $0.6485 |
minimax/minimax-01 new | 1,000,192 | 80.9% | N/A | 78.0% | 33.1% | 66.4% | 5,154 | inc $83.3570 | inc $2.1948 |
openai/gpt-4.1-nano new | 1,047,576 | 45.3% | 15.1% | 47.1% | 29.1% | 29.1% | 6,392 | $122.8554 | #3 $0.9058 |
meta-llama/llama-4-scout new | 1,048,576 | 38.7% | 15.7% | 44.2% | 28.8% | 28.8% | 5,808 | inc $58.3773 | inc $0.6918 |
google/gemini-2.0-flash-lite-001 new | 1,048,576 | 37.6% | 13.4% | 41.5% | 23.4% | 23.4% | 6,296 | #1 $88.2331 | #2 $0.7300 |
google/gemini-2.0-flash-001 new | 1,000,000 | 35.5% | 11.1% | 48.7% | 23.3% | 23.3% | 6,296 | #3 $117.6441 | $3.8355 |
openai/o3 new | 200,000 | 70.3% | N/A | 86.3% | 17.5% | 70.8% | 4,464 | inc $2,241.4283 | inc $275.1521 |
openai/o4-mini new | 200,000 | 53.2% | N/A | 69.8% | 14.0% | 56.7% | 4,440 | inc $243.2707 | inc $36.1279 |
anthropic/claude-3.7-sonnet:thinking new | 200,000 | 52.9% | N/A | 63.5% | 13.3% | 53.8% | 522 | inc $73.7138 | inc $11.4784 |
x-ai/grok-3-mini-beta new | 131,072 | 36.1% | N/A | 60.2% | 13.2% | 53.2% | 3,960 | inc $344.0692 | inc $0.7631 |
anthropic/claude-3.7-sonnet new | 200,000 | 48.8% | N/A | 58.2% | 12.6% | 50.8% | 522 | inc $73.6699 | inc $3.4940 |
openai/o3-mini new | 200,000 | 38.8% | N/A | 51.7% | 10.9% | 44.0% | 4,472 | inc $246.0180 | inc $43.2641 |
x-ai/grok-3-beta new | 131,072 | 60.0% | N/A | 73.8% | 9.0% | #3 73.8% | 4,000 | inc $436.6788 | inc $23.1833 |
x-ai/grok-3-mini-beta:high new | 131,072 | 33.9% | N/A | 58.4% | 7.1% | 58.4% | 3,832 | inc $39.1908 | inc $0.7337 |
Table Interaction: Click column headers to sort. Click the stats icon in bin headers for a Cost/Score plot. Hover over score cells and click for detailed test results.
Disclaimers:
- N/A indicates no results found for this model/bin, potentially due to context window limits.
- AUC (Area Under Curve) normalized to 100% reflects overall performance across bins, weighted by bin width. Model AUC is normalized only for context bins within the model's tested range.
- Cost columns marked (inc) indicate the model had missing results in some context bins, potentially underestimating true cost. Ranking is omitted for these entries.
Some technical terms have a dotted underline. Hover over them for a brief explanation.
Benchmark Details & Methodology
1. What is the OpenAI-MRCR benchmark?
OpenAI MRCR tests a Large Language Model's (LLM) ability to handle complex conversational history. Key aspects include:
- Core Task: Finding and distinguishing between multiple identical pieces of information ("needles") hidden within a long conversation ("haystack").
- Setup: Inspired by Google DeepMind's MRCR eval (arxiv:2409.12640v2), this version inserts 2, 4, or 8 identical requests (e.g., "write a poem about tapirs") alongside distractor requests. Needles/distractors are generated by GPT-4o to blend in.
- Challenge: The model must retrieve a specific instance (e.g., the 2nd poem) based on its order, requiring careful tracking of the conversation. It must also prepend a specific random code (hash) to its answer (see the illustrative sketch after this list).
- Data Source: The benchmark data and detailed methodology are described on Hugging Face (openai/mrcr).
- Dashboard Scope: This dashboard visualizes results directly from that published dataset and does not currently run new evaluations.
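To make the setup concrete, here is a minimal illustrative sketch of the shape of one test conversation. The message contents, the two-needle layout, and the `x7kq2` code are invented for this example; the real conversations are far longer, use 2, 4, or 8 needles, and are generated by GPT-4o (see the dataset card for the exact format).
```python
# Purely illustrative: the shape of an MRCR-style test, not the real dataset schema.
haystack = [
    {"role": "user", "content": "write a poem about tapirs"},       # needle 1
    {"role": "assistant", "content": "<poem #1>"},
    {"role": "user", "content": "write a blog post about rocks"},   # distractor
    {"role": "assistant", "content": "<blog post>"},
    {"role": "user", "content": "write a poem about tapirs"},       # needle 2
    {"role": "assistant", "content": "<poem #2>"},
    # ...many more distractor turns to pad the context...
]

# The final user turn asks for a specific instance and gives the random code to prepend.
final_turn = {
    "role": "user",
    "content": "Prepend 'x7kq2' to your answer. Then return the 2nd poem about tapirs.",
}
```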
2. How is the score calculated?
The score measures how accurately the model retrieves the correct instance of the requested needle. The process involves:
- Comparison Method: The model's answer is compared to the expected answer using the SequenceMatcher ratio from Python's `difflib` library (a code sketch follows this list).
- Hash Requirement: The model *must* include a specific random code (hash identifier) at the start of its answer. This hash is removed before the comparison.
- Failure Condition: If the required hash is missing or incorrect, the score is automatically 0.
- Result: The similarity ratio (0.0 to 1.0) from the comparison is presented as a percentage (0-100%) in the table.
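The grading rule above can be sketched in a few lines of Python. This is a minimal illustration, not the official grader; consult the dataset card for the exact implementation and its edge cases.
```python
from difflib import SequenceMatcher

def grade(response: str, expected_answer: str, prepend_hash: str) -> float:
    """Minimal sketch of the grading rule: hash check, then similarity ratio."""
    # Failure condition: a missing or incorrect hash scores 0.
    if not response.startswith(prepend_hash):
        return 0.0
    # Remove the hash before comparing the remaining text.
    response_body = response[len(prepend_hash):]
    answer_body = (
        expected_answer[len(prepend_hash):]
        if expected_answer.startswith(prepend_hash)
        else expected_answer
    )
    # Similarity ratio in [0.0, 1.0]; shown as a percentage (0-100%) in the table.
    return SequenceMatcher(None, response_body, answer_body).ratio()
```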
3. What do the context length "bins" (e.g., 128k, 1M) mean?
The "bins" group test runs based on the total length of the text involved (prompt + expected answer). Here's how they work:
- Measurement: Length is measured in tokens using the `o200k_base` tokenizer (see the sketch after this list).
- Grouping: Tests are grouped into bins based on their total token count. For example, the "128k" bin includes tests with >65,536 and <=131,072 tokens.
- Score Display: The score shown for a bin (e.g., "128k (%)") is the average score from the 100 test samples conducted within that bin's length range.
- Boundaries: The specific bin boundaries are [4k, 8k], (8k, 16k], (16k, 32k], (32k, 65k], (65k, 128k], (128k, 256k], (256k, 512k], (512k, 1M].
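As a rough sketch of how a test could be assigned to a bin under the boundaries above (the dashboard's exact binning code isn't shown here; `assign_bin` and the label strings are illustrative):
```python
import bisect

import tiktoken  # assumed installed: pip install tiktoken

ENC = tiktoken.get_encoding("o200k_base")

# Upper edges (in tokens) of the bins listed above; each bin is labeled by its upper edge.
BIN_UPPER_EDGES = [8_192, 16_384, 32_768, 65_536, 131_072, 262_144, 524_288, 1_048_576]
BIN_LABELS = ["8k", "16k", "32k", "65k", "128k", "256k", "512k", "1M"]

def assign_bin(prompt: str, expected_answer: str) -> str:
    """Count prompt + expected-answer tokens and map the test to a bin label."""
    n_tokens = len(ENC.encode(prompt)) + len(ENC.encode(expected_answer))
    # With (lower, upper] bins, the first upper edge >= n_tokens identifies the bin.
    idx = bisect.bisect_left(BIN_UPPER_EDGES, n_tokens)
    return BIN_LABELS[idx] if idx < len(BIN_LABELS) else "over 1M"
```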
4. How is this different from Fiction.livebench?
Both test how well models handle long texts, but focus on different skills. OpenAI-MRCR tests the model's ability to pinpoint and distinguish between identical pieces of information based on their order in a conversation (using synthetic data). Fiction.livebench (fiction.live/...) tests narrative understanding – how well models follow plots, characters, and consistency within complex stories, using quizzes based on actual fiction excerpts.
5. How does the benchmark design reduce the risk of models succeeding due to training data contamination?
The benchmark incorporates several design features to minimize the chance that models succeed simply by having seen similar data during training:
- Synthetic & Unique Data: Each test uses a specially generated, long conversation. While topics might overlap with training data, the specific sequence of turns and the placement of "needles" are unique to the benchmark run.
- Instance Specificity: The core task isn't just retrieving information (e.g., a poem about tapirs) but retrieving a specific instance (e.g., the 2nd poem requested) based on conversational order. Simple memorization of poems is insufficient.
- Required Random Hash: Models must prepend a specific random code (hash) to their answer. This code is generated for the test run and cannot be predicted from general training data. Failure to include the correct hash results in a score of 0.
These elements combined make it highly unlikely that a model can achieve a high score purely by recalling memorized training data, as success requires understanding the specific conversational context, instance order, and adhering to the random hash requirement. (Source: OpenAI MRCR Dataset Card)
Note: This dashboard evaluates models against the dataset published by OpenAI (data collected up to April 11, 2024). Models released or significantly updated after this date might have been trained on this specific benchmark data, potentially affecting their results.
Understanding the Results
6. What are the different AUC scores and how are they calculated?
AUC (Area Under Curve) gives a single score summarizing performance across different context lengths (bins). Think of it like an average grade across tests of increasing difficulty (longer contexts). It's calculated by plotting the average score for each bin against the maximum context length of that bin and measuring the area under the resulting line/curve. This area is then normalized to a percentage (0-100%).
- AUC @128k: AUC calculated using results only up to the 128k token bin (tests with up to 131,072 tokens).
- AUC @1M: AUC calculated using results across all bins up to the 1M token bin (tests with up to 1,048,576 tokens).
- Model AUC: AUC calculated only over the range of bins the specific model actually completed successfully. This provides a fairer comparison if a model couldn't handle the longest contexts.
All AUC scores are normalized to a maximum of 100%.
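As a rough illustration of that calculation (the dashboard's exact implementation isn't published here, so treat this as a sketch based on the description above):
```python
def normalized_auc(bin_upper_edges: list[int], avg_scores_pct: list[float]) -> float:
    """Trapezoidal area under the score-vs-context-length curve,
    normalized so that a 100% score in every bin gives an AUC of 100%."""
    xs, ys = bin_upper_edges, avg_scores_pct
    area = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0 for i in range(len(xs) - 1))
    perfect_area = 100.0 * (xs[-1] - xs[0])  # a flat 100% line over the same range
    return 100.0 * area / perfect_area

# "AUC @128k" would pass only the edges/scores up to the 128k bin,
# while "Model AUC" would pass only the bins the model actually completed.
```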
7. How is the cost calculated?
Costs are estimates based on the public API pricing for each model and the number of tokens reported by the API for each test run (typically using the `o200k_base` tokenizer).
- Total Prompt Cost: Estimates the cost of sending the input text (prompt) to the model. It assumes a separate API call for each candidate response generated, multiplying the base prompt cost by the number of candidates.
- Total Compl Cost: Estimates the cost of the text generated by the model (completion). It sums the completion token costs across all candidates generated for all successful runs.
Note: Actual costs might differ due to factors like batching requests, different pricing tiers, or API provider adjustments not captured here.
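A rough sketch of that estimate is below. The per-token prices and the run-record field names are placeholders for illustration, not the dashboard's actual inputs:
```python
# Placeholder prices in $ per 1M tokens -- not any model's real pricing.
PROMPT_PRICE_PER_M = 2.00
COMPLETION_PRICE_PER_M = 8.00

def estimate_costs(runs: list[dict]) -> tuple[float, float]:
    """Sum estimated prompt and completion costs over successful runs.

    Each run record is assumed (hypothetically) to carry `prompt_tokens`,
    `completion_tokens` (summed over its candidates), and `n_candidates`.
    """
    # Prompt cost assumes one API call per candidate, so the prompt is billed once per candidate.
    prompt_cost = sum(
        r["prompt_tokens"] * r["n_candidates"] * PROMPT_PRICE_PER_M / 1_000_000 for r in runs
    )
    completion_cost = sum(
        r["completion_tokens"] * COMPLETION_PRICE_PER_M / 1_000_000 for r in runs
    )
    return prompt_cost, completion_cost
```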
8. What does the "Runs" column represent?
This shows the total number of candidate responses (potential answers) generated by the model across all the successful test runs included in the summary. It reflects the total generation workload: (Number of Successful Tests) × (Candidates Generated Per Test).
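For instance (illustrative numbers only): a model that completed 100 tests in each of 8 bins and generated 8 candidate responses per test would show 8 × 100 × 8 = 6,400 runs; fewer runs generally means some tests did not complete.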
9. What do the badges (#1, #2, #3, inc) in the table mean?
- #1 #2 #3 These badges highlight the top 3 models for that specific column's metric. Higher scores are better for performance (%), while lower costs ($) are better.
- inc The "inc" (incomplete) badge appears in cost columns if the model didn't successfully complete tests for all context length bins (up to 1M tokens). The cost shown might be lower than it would be if all tests had passed. Models marked inc are not included in the cost rankings to ensure fair comparison.
- FREE This badge appears in cost columns for models where cost data is zero or unavailable (e.g., free models). These are also excluded from cost rankings.
- deprec This badge indicates the model may be deprecated or no longer actively supported/available via common APIs (like OpenRouter). Results might be stale.
- new This badge indicates results for this model were updated recently (within the last 7 days). This badge takes precedence over the deprec badge.
Dashboard Features
10. How can I see a Cost vs. Score chart for a specific context bin?
In the main results table, click the stats icon in the header of any context bin column (e.g., `128k (%)`). This opens a scatter plot showing the total cost vs. average score for all currently selected models specifically within that context length bin.
11. How can I view the detailed results for a specific test run?
In the main results table, hover over any individual score cell (the percentage value) and click the document icon that appears. This opens a modal where you can browse through the individual test runs for that model/bin combination, view the expected answer, and see the actual responses generated by the model along with their scores.
12. Can I customize the table columns?
Yes! Go to the "Controls" tab (next to the "Leaderboard" tab). There you can find options to show/hide specific columns, like all the individual context length bins or the pricing information.
Data, Code & Contact
13. Where can I find the benchmark data and code?
The dataset details and evaluation methods are described on Hugging Face: openai/mrcr. OpenAI also discussed results in their GPT-4.1 blog post.
14. Can you evaluate more models?
We plan to add more models over time. Feel free to suggest specific models you'd like to see evaluated!
15. How can I contact you?
For questions, suggestions, or issues, please reach out on Twitter: @DillonUzar.