Context Arena

Legend Interaction: Single-click to toggle individual models. Double-click to toggle all models from the same provider.

| Model | Max Ctx | 128k (%) | 1M (%) | AUC @128k (%) | AUC @1M (%) | Model AUC (%) | Runs | Total Prompt Cost ($) | Total Compl Cost ($) |
|---|---|---|---|---|---|---|---|---|---|
| google/gemini-2.5-pro-preview-03-25 new | 1,048,576 | #1 91.6% | #1 64.1% | #1 95.3% | #1 79.0% | #1 79.0% | 6,320 | $3,013.3419 | $94.0818 |
| google/gemini-2.5-flash-preview:thinking new | 1,048,576 | #3 87.0% | #3 59.8% | #2 89.9% | #2 73.8% | #2 73.8% | 6,320 | $180.8089 | $22.9385 |
| google/gemini-2.5-flash-preview new | 1,048,576 | 70.1% | 54.6% | 76.2% | #3 62.0% | 62.0% | 6,320 | $180.8089 | $1.6501 |
| openai/gpt-4.1 new | 1,047,576 | 64.0% | 56.7% | 72.7% | 60.6% | 60.6% | 6,400 | $2,457.5815 | $19.8782 |
| google/gemini-2.0-flash-thinking-exp:free deprec | N/A | #2 88.2% | #2 60.8% | #3 88.1% | 54.3% | 54.3% | 6,320 | FREE $0.0000 | FREE $0.0000 |
| openai/gpt-4.1-mini new | 1,047,576 | 60.2% | 40.1% | 64.6% | 52.8% | 52.8% | 6,376 | $482.5873 | $3.9791 |
| google/gemini-flash-1.5 new | 1,000,000 | 65.9% | 46.9% | 63.7% | 51.8% | 51.8% | 6,312 | $178.8531 | $1.5116 |
| meta-llama/llama-4-maverick new | 1,048,576 | 48.2% | 36.9% | 60.8% | 45.8% | 45.8% | 5,896 | inc $135.3332 | inc $1.7877 |
| google/gemini-flash-1.5-8b new | 1,000,000 | 47.2% | 34.7% | 51.0% | 40.7% | 40.7% | 6,312 | #2 $89.4265 | #1 $0.6485 |
| minimax/minimax-01 new | 1,000,192 | 80.9% | N/A | 78.0% | 33.1% | 66.4% | 5,154 | inc $83.3570 | inc $2.1948 |
| openai/gpt-4.1-nano new | 1,047,576 | 45.3% | 15.1% | 47.1% | 29.1% | 29.1% | 6,392 | $122.8554 | #3 $0.9058 |
| meta-llama/llama-4-scout new | 1,048,576 | 38.7% | 15.7% | 44.2% | 28.8% | 28.8% | 5,808 | inc $58.3773 | inc $0.6918 |
| google/gemini-2.0-flash-lite-001 new | 1,048,576 | 37.6% | 13.4% | 41.5% | 23.4% | 23.4% | 6,296 | #1 $88.2331 | #2 $0.7300 |
| google/gemini-2.0-flash-001 new | 1,000,000 | 35.5% | 11.1% | 48.7% | 23.3% | 23.3% | 6,296 | #3 $117.6441 | $3.8355 |
| openai/o3 new | 200,000 | 70.3% | N/A | 86.3% | 17.5% | 70.8% | 4,464 | inc $2,241.4283 | inc $275.1521 |
| openai/o4-mini new | 200,000 | 53.2% | N/A | 69.8% | 14.0% | 56.7% | 4,440 | inc $243.2707 | inc $36.1279 |
| anthropic/claude-3.7-sonnet:thinking new | 200,000 | 52.9% | N/A | 63.5% | 13.3% | 53.8% | 522 | inc $73.7138 | inc $11.4784 |
| x-ai/grok-3-mini-beta new | 131,072 | 36.1% | N/A | 60.2% | 13.2% | 53.2% | 3,960 | inc $344.0692 | inc $0.7631 |
| anthropic/claude-3.7-sonnet new | 200,000 | 48.8% | N/A | 58.2% | 12.6% | 50.8% | 522 | inc $73.6699 | inc $3.4940 |
| openai/o3-mini new | 200,000 | 38.8% | N/A | 51.7% | 10.9% | 44.0% | 4,472 | inc $246.0180 | inc $43.2641 |
| x-ai/grok-3-beta new | 131,072 | 60.0% | N/A | 73.8% | 9.0% | #3 73.8% | 4,000 | inc $436.6788 | inc $23.1833 |
| x-ai/grok-3-mini-beta:high new | 131,072 | 33.9% | N/A | 58.4% | 7.1% | 58.4% | 3,832 | inc $39.1908 | inc $0.7337 |

Table Interaction: Click column headers to sort. Click the stats icon (query_stats) in a bin header for the Cost vs. Score plot. Hover over a score cell and click the document icon (article) for detailed test results.

Disclaimers:

  1. N/A indicates no results found for this model/bin, potentially due to context window limits.
  2. AUC (Area Under Curve) normalized to 100% reflects overall performance across bins, weighted by bin width. Model AUC is normalized only for context bins within the model's tested range.
  3. Cost columns marked (inc) indicate the model had missing results in some context bins, potentially underestimating true cost. Ranking is omitted for these entries.


Benchmark Details & Methodology

1. What is the OpenAI-MRCR benchmark?

OpenAI MRCR tests a Large Language Model's (LLM) ability to handle complex conversational history. Key aspects include:

  • Core Task: Finding and distinguishing between multiple identical pieces of information ("needles") hidden within a long conversation ("haystack").
  • Setup: Inspired by Google DeepMind's MRCR eval (arXiv:2409.12640v2), this version inserts 2, 4, or 8 identical requests (e.g., "write a poem about tapirs") alongside distractor requests. Needles and distractors are generated by GPT-4o so that they blend into the conversation.
  • Challenge: The model must retrieve a specific instance (e.g., the 2nd poem) based on its order, requiring careful tracking of the conversation. It must also prepend a specific random code (hash) to its answer.
  • Data Source: The benchmark data and detailed methodology are described on Hugging Face (openai/mrcr).
  • Dashboard Scope: This dashboard visualizes results directly from that published dataset and does not currently run new evaluations.
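
To make the setup concrete, here is a purely illustrative sketch of what one test case looks like. The field names and contents below are invented for illustration and are not the dataset's actual schema; see the openai/mrcr dataset card for the real format.

```python
# Purely illustrative sketch of an MRCR-style test case
# (invented field names, not the actual openai/mrcr schema).
example_case = {
    "conversation": [
        {"role": "user", "content": "write a poem about tapirs"},            # needle 1
        {"role": "assistant", "content": "Tapirs tread the twilight..."},
        {"role": "user", "content": "write a story about otters"},           # distractor
        {"role": "assistant", "content": "By the cold river, an otter..."},
        {"role": "user", "content": "write a poem about tapirs"},            # needle 2
        {"role": "assistant", "content": "Snouted wanderers of the night..."},
        # ... many more distractor turns pad the conversation to the bin's length ...
        {"role": "user", "content": "Prepend a3f9c2 to your answer. "
                                    "Return the 2nd poem about tapirs."},
    ],
    # The correct answer is the 2nd tapir poem, with the random hash prefixed.
    "expected_answer": "a3f9c2 Snouted wanderers of the night...",
}
```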
2. How is the score calculated?

The score measures how accurately the model retrieves the correct instance of the requested needle. The process involves:

  • Comparison Method: The model's answer is compared to the expected answer using the SequenceMatcher ratio from Python's `difflib` library.
  • Hash Requirement: The model *must* include a specific random code (hash identifier) at the start of its answer. This hash is removed before the comparison.
  • Failure Condition: If the required hash is missing or incorrect, the score is automatically 0.
  • Result: The similarity ratio (0.0 to 1.0) from the comparison is presented as a percentage (0-100%) in the table.

(Source: OpenAI MRCR Dataset Card)
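
A minimal Python sketch of this scoring rule, assuming the expected answer is stored without the hash prefix (the official grader described on the dataset card may differ in details such as whitespace handling):

```python
from difflib import SequenceMatcher

def grade(response: str, expected_answer: str, random_hash: str) -> float:
    """Score one response following the rule above (sketch, not the official grader)."""
    # Missing or incorrect hash prefix -> automatic score of 0.
    if not response.startswith(random_hash):
        return 0.0
    # Remove the hash, then compare the remainder to the expected answer.
    stripped = response[len(random_hash):].strip()
    return SequenceMatcher(None, stripped, expected_answer).ratio()  # 0.0 to 1.0

# e.g. grade("a3f9c2 Snouted wanderers...", "Snouted wanderers...", "a3f9c2") -> ~1.0
```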

3. What do the context length "bins" (e.g., 128k, 1M) mean?

The "bins" group test runs based on the total length of the text involved (prompt + expected answer). Here's how they work:

  • Measurement: Length is measured in tokens using the `o200k_base` tokenizer.
  • Grouping: Tests are grouped into bins based on their total token count. For example, the "128k" bin includes tests with >65,536 and <=131,072 tokens.
  • Score Display: The score shown for a bin (e.g., "128k (%)") is the average score from the 100 test samples conducted within that bin's length range.
  • Boundaries: The specific bin boundaries are [4k, 8k], (8k, 16k], (16k, 32k], (32k, 65k], (65k, 128k], (128k, 256k], (256k, 512k], (512k, 1M].

(Source: OpenAI MRCR Dataset Card)
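
As a sketch of how a test lands in a bin, the snippet below counts tokens with tiktoken's `o200k_base` encoding and compares the total against the bin upper edges listed above (the dashboard's own binning code may differ in minor details):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# Upper edges (in tokens) of the bins listed above; the smallest bin starts at 4k.
BINS = [("8k", 8_192), ("16k", 16_384), ("32k", 32_768), ("65k", 65_536),
        ("128k", 131_072), ("256k", 262_144), ("512k", 524_288), ("1M", 1_048_576)]

def bin_label(prompt: str, expected_answer: str) -> str:
    """Assign a test to a bin by total token count (prompt + expected answer)."""
    total_tokens = len(enc.encode(prompt)) + len(enc.encode(expected_answer))
    for label, upper in BINS:
        if total_tokens <= upper:
            return label
    raise ValueError("total length exceeds the 1M bin")
```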

4. How is this different from Fiction.livebench?

Both test how well models handle long texts, but focus on different skills. OpenAI-MRCR tests the model's ability to pinpoint and distinguish between identical pieces of information based on their order in a conversation (using synthetic data). Fiction.livebench (fiction.live/...) tests narrative understanding – how well models follow plots, characters, and consistency within complex stories, using quizzes based on actual fiction excerpts.

5. How does the benchmark design reduce the risk of models succeeding due to training data contamination?

The benchmark incorporates several design features to minimize the chance that models succeed simply by having seen similar data during training:

  • Synthetic & Unique Data: Each test uses a specially generated, long conversation. While topics might overlap with training data, the specific sequence of turns and the placement of "needles" are unique to the benchmark run.
  • Instance Specificity: The core task isn't just retrieving information (e.g., a poem about tapirs) but retrieving a specific instance (e.g., the 2nd poem requested) based on conversational order. Simple memorization of poems is insufficient.
  • Required Random Hash: Models must prepend a specific random code (hash) to their answer. This code is generated for the test run and cannot be predicted from general training data. Failure to include the correct hash results in a score of 0.

These elements combined make it highly unlikely that a model can achieve a high score purely by recalling memorized training data, as success requires understanding the specific conversational context, instance order, and adhering to the random hash requirement. (Source: OpenAI MRCR Dataset Card)

Note: This dashboard evaluates models against the dataset published by OpenAI (data collected up to April 11, 2024). Models released or significantly updated after this date might have been trained on this specific benchmark data, potentially affecting their results.

Understanding the Results

6. What are the different AUC scores and how are they calculated?

AUC (Area Under Curve) gives a single score summarizing performance across different context lengths (bins). Think of it like an average grade across tests of increasing difficulty (longer contexts). It's calculated by plotting the average score for each bin against the maximum context length of that bin and measuring the area under the resulting line/curve. This area is then normalized to a percentage (0-100%).

  • AUC @128k: AUC calculated using results only up to the 128k token bin (tests with up to 131,072 tokens).
  • AUC @1M: AUC calculated using results across all bins up to the 1M token bin (tests with up to 1,048,576 tokens).
  • Model AUC: AUC calculated only over the range of bins the specific model actually completed successfully. This provides a fairer comparison if a model couldn't handle the longest contexts.

All AUC scores are normalized to a maximum of 100%.
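
As an illustration, a normalized AUC of this kind can be computed with a trapezoidal rule, as in the sketch below; the dashboard's exact implementation (e.g., how the leftmost bin edge is treated) may differ.

```python
import numpy as np

def normalized_auc(bin_max_tokens, avg_scores):
    """Normalized area under the score-vs-context-length curve (sketch).

    bin_max_tokens: max context length of each bin, e.g. [8_192, ..., 1_048_576]
    avg_scores:     average score (0-100) for each of those bins
    Returns a value in 0-100, where 100 means a perfect score in every bin.
    """
    x = np.asarray(bin_max_tokens, dtype=float)
    y = np.asarray(avg_scores, dtype=float)
    area = np.trapz(y, x)                            # area under the model's curve
    max_area = np.trapz(np.full_like(y, 100.0), x)   # area of a perfect model
    return 100.0 * area / max_area

# "AUC @128k" / "AUC @1M" use only the bins up to 131,072 / 1,048,576 tokens;
# "Model AUC" uses only the bins the model actually completed.
```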

7. How is the cost calculated?

Costs are estimates based on the public API pricing for each model and the number of tokens reported by the API for each test run (typically using the `o200k_base` tokenizer).

  • Total Prompt Cost: Estimates the cost of sending the input text (prompt) to the model. It assumes a separate API call for each candidate response generated, multiplying the base prompt cost by the number of candidates.
  • Total Compl Cost: Estimates the cost of the text generated by the model (completion). It sums the completion token costs across all candidates generated for all successful runs.

Note: Actual costs might differ due to factors like batching requests, different pricing tiers, or API provider adjustments not captured here.
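
The sketch below shows the kind of arithmetic described above, with placeholder prices in dollars per million tokens; actual per-model prices and the dashboard's exact accounting may differ.

```python
def estimate_test_cost(prompt_tokens: int,
                       completion_tokens_per_candidate: list[int],
                       prompt_price_per_mtok: float,
                       completion_price_per_mtok: float) -> tuple[float, float]:
    """Rough (prompt_cost, completion_cost) estimate for one test run.

    Prompt cost assumes a separate API call per candidate, so the prompt is
    billed once per candidate; completion cost sums tokens over all candidates.
    Prices are placeholders in $ per million tokens.
    """
    n_candidates = len(completion_tokens_per_candidate)
    prompt_cost = prompt_tokens * n_candidates * prompt_price_per_mtok / 1_000_000
    completion_cost = (sum(completion_tokens_per_candidate)
                       * completion_price_per_mtok / 1_000_000)
    return prompt_cost, completion_cost
```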

8. What does the "Runs" column represent?

This shows the total number of candidate responses (potential answers) generated by the model across all the successful test runs included in the summary. It reflects the total generation workload: (Number of Successful Tests) × (Candidates Generated Per Test).
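
For example, a model that completed 800 tests with 8 candidate responses generated per test would show 6,400 runs (the exact number of tests and candidates varies by model).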

9. What do the badges (#1, #2, #3, inc, FREE, deprec, new) in the table mean?

  • #1 #2 #3 These badges highlight the top 3 models for that column's metric: higher is better for score columns (%), lower is better for cost columns ($).
  • inc The "inc" (incomplete) badge appears in cost columns if the model didn't successfully complete tests for all context length bins (up to 1M tokens). The cost shown might be lower than it would be if all tests had passed. Models marked inc are not included in the cost rankings to ensure fair comparison.
  • FREE This badge appears in cost columns for models where cost data is zero or unavailable (e.g., free models). These are also excluded from cost rankings.
  • deprec This badge indicates the model may be deprecated or no longer actively supported/available via common APIs (like OpenRouter). Results might be stale.
  • new This badge indicates results for this model were updated recently (within the last 7 days). This badge takes precedence over the deprec badge.

Dashboard Features

10. How can I see a Cost vs. Score chart for a specific context bin?

In the main results table, click the stats icon (query_stats) in the header of any context bin column (e.g., `128k (%)`). This opens a scatter plot showing the total cost vs. average score for all currently selected models specifically within that context length bin.

11. How can I view the detailed results for a specific test run?

In the main results table, hover over any individual score cell (the percentage value) and click the document icon (article) that appears. This opens a modal where you can browse through the individual test runs for that model/bin combination, view the expected answer, and see the actual responses generated by the model along with their scores.

12. Can I customize the table columns?

Yes! Go to the "Controls" tab (next to the "Leaderboard" tab). There you can find options to show/hide specific columns, like all the individual context length bins or the pricing information.

Data, Code & Contact

13. Where can I find the benchmark data and code?

The dataset details and evaluation methods are described on Hugging Face: openai/mrcr. OpenAI also discussed results in their GPT-4.1 blog post.

14. Can you evaluate more models?

We plan to add more models over time. Feel free to suggest specific models you'd like to see evaluated!

15. How can I contact you?

For questions, suggestions, or issues, please reach out on Twitter: @DillonUzar.