Context Arena

Legend Interaction: Single-click to toggle individual models. Double-click to toggle all models from the same provider.

| Model | Max Ctx | 8k (%) | 16k (%) | 32k (%) | 64k (%) | 128k (%) | 256k (%) | 512k (%) | 1M (%) | AUC @128k (%) | AUC @1M (%) | Model AUC (%) | Runs | Total Prompt Cost ($) | Total Compl Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/gemini-2.5-pro-preview-03-25 | 1,048,576 | #3 98.7% | #3 94.3% | #3 89.0% | #2 88.9% | #3 84.1% | #2 81.1% | #1 72.7% | #1 60.1% | #3 88.5% | #1 73.7% | #1 73.7% | 6,320 | #12 $3,013.5705 | #14 $95.5129 |
| google/gemini-2.5-pro-preview | 1,048,576 | #2 99.5% | #1 97.1% | #1 94.4% | #1 93.9% | #2 84.3% | #1 83.7% | #3 71.0% | #3 54.1% | #1 91.9% | #2 72.5% | #3 72.5% | 6,320 | #12 $3,013.5705 | #13 $26.5851 |
| google/gemini-2.5-flash-preview:thinking | 1,048,576 | #4 98.3% | #2 95.4% | #2 90.2% | #3 87.4% | #1 87.5% | #3 76.6% | #2 71.6% | #2 57.9% | #2 89.2% | #3 72.2% | #4 72.2% | 6,320 | #9 $180.8142 | #12 $23.7270 |
| google/gemini-2.5-flash-preview | 1,048,576 | #8 91.7% | #7 84.9% | #7 77.1% | #5 76.2% | #4 78.1% | #4 69.9% | #4 59.1% | #4 53.5% | #4 78.3% | #4 63.2% | #5 63.2% | 6,320 | #9 $180.8142 | #9 $1.5496 |
| openai/gpt-4.1 | 1,047,576 | #10 87.6% | #13 68.0% | #9 70.4% | #10 57.5% | #10 55.5% | #6 57.1% | #5 55.6% | #5 42.1% | #9 61.6% | #5 53.2% | #10 53.2% | 6,384 | #11 $2,468.4629 | #11 $19.7168 |
| google/gemini-flash-1.5 | 1,000,000 | #28 58.6% | #19 59.3% | #20 54.9% | #9 61.6% | #13 52.0% | #9 44.8% | #8 45.8% | #6 34.9% | #10 57.4% | #6 44.6% | #18 44.6% | 6,208 | #8 $162.7636 | #8 $1.5166 |
| openai/gpt-4.1-mini | 1,047,576 | #18 73.4% | #22 57.2% | #15 58.2% | #21 42.6% | #17 47.2% | #8 45.1% | #7 47.7% | #7 32.5% | #20 49.4% | #7 43.6% | #19 43.6% | 6,384 | #10 $493.6926 | #10 $3.9214 |
| meta-llama/llama-4-maverick | 1,048,576 | #16 80.4% | #17 60.0% | #17 56.4% | #16 48.4% | #15 50.3% | #7 48.6% | #9 36.2% | #8 30.1% | #16 52.7% | #8 39.9% | #24 39.9% | 6,024 | #7 $154.5747 | #7 $1.4152 |
| google/gemini-2.0-flash-001 | 1,048,576 | #23 65.6% | #18 59.9% | #14 59.6% | #13 49.6% | #6 60.1% | #12 37.2% | #10 29.9% | #11 13.3% | #11 56.0% | #9 32.1% | #29 32.1% | 6,224 | #5 $110.0136 | #6 $0.9748 |
| minimax/minimax-01 | 1,000,192 | #13 84.3% | #8 78.5% | #8 74.5% | #8 67.0% | #5 71.9% | #5 61.4% | #6 53.8% | N/A | #8 71.6% | #10 31.4% | #6 63.2% | 5,224 | inc $90.9933 | inc $2.0702 |
| openai/gpt-4.1-nano | 1,047,576 | #36 47.0% | #30 48.5% | #28 49.8% | #23 41.3% | #22 36.4% | #20 20.7% | #11 26.7% | #12 13.0% | #25 42.6% | #11 24.6% | #33 24.6% | 6,384 | #6 $123.4231 | #5 $0.8169 |
| google/gemini-2.0-flash-lite-001 | 1,048,576 | #32 56.1% | #29 48.6% | #30 46.6% | #31 30.3% | #20 38.6% | #16 29.1% | #12 19.7% | #9 17.0% | #29 38.4% | #12 24.2% | #34 24.2% | 6,224 | #4 $82.5102 | #3 $0.5253 |
| meta-llama/llama-4-scout | 1,048,576 | #24 64.0% | #27 53.9% | #31 44.6% | #29 31.9% | #29 33.1% | #21 20.7% | #13 17.6% | #10 13.4% | #30 38.1% | #13 20.5% | #37 20.5% | 5,702 | #2 $55.0137 | #4 $0.6192 |
| google/gemini-flash-1.5-8b | 1,000,000 | #39 43.6% | #36 42.8% | #39 29.5% | #34 27.9% | #32 27.0% | #22 16.2% | #14 17.0% | #13 12.1% | #33 30.0% | #14 17.8% | #39 17.8% | 6,208 | #3 $81.4626 | #2 $0.4052 |
| openai/o4-mini | 200,000 | #6 96.4% | #6 88.9% | #4 86.6% | #4 78.2% | #9 57.4% | #11 37.8% | N/A | N/A | #5 76.0% | #15 15.0% | #7 61.3% | 4,448 | inc $247.9571 | inc $37.2230 |
| openai/o3 | 200,000 | #5 97.5% | #4 91.4% | #6 85.4% | #6 71.2% | #11 55.4% | #14 32.8% | N/A | N/A | #7 72.7% | #16 14.1% | #8 57.9% | 4,448 | inc $2,254.1553 | inc $258.8007 |
| anthropic/claude-3-haiku | 200,000 | #22 67.5% | #26 54.3% | #21 54.5% | #18 46.7% | #8 58.6% | #10 38.0% | N/A | N/A | #15 52.9% | #17 12.3% | #11 50.5% | 4,176 | inc $49.5406 | inc $2.2391 |
| anthropic/claude-3.7-sonnet | 200,000 | #11 85.5% | #14 67.6% | #11 62.4% | #12 50.4% | #14 51.1% | #13 36.5% | N/A | N/A | #12 55.9% | #18 12.1% | #13 49.6% | 4,176 | inc $594.4813 | inc $27.8910 |
| anthropic/claude-3.7-sonnet:thinking | 200,000 | #12 85.1% | #16 65.4% | #12 62.3% | #15 49.0% | #12 52.9% | #17 28.4% | N/A | N/A | #13 55.5% | #19 11.7% | #14 47.8% | 4,176 | inc $594.5065 | inc $33.5946 |
| anthropic/claude-3.5-sonnet | 200,000 | #9 89.7% | #15 66.6% | #19 54.9% | #19 45.8% | #19 46.1% | #15 31.5% | N/A | N/A | #17 51.2% | #20 10.9% | #17 44.8% | 4,176 | inc $594.4836 | inc $27.3120 |
| anthropic/claude-3.5-haiku | 200,000 | #21 67.8% | #25 55.2% | #23 51.5% | #14 49.1% | #18 46.2% | #18 22.9% | N/A | N/A | #19 50.0% | #21 10.3% | #21 42.0% | 4,120 | inc $155.2502 | inc $7.3357 |
| openai/o3-mini | 200,000 | #19 72.6% | #23 56.0% | #22 52.6% | #24 40.1% | #25 34.7% | #19 21.9% | N/A | N/A | #23 43.8% | #22 8.7% | #27 35.8% | 4,456 | inc $248.4745 | inc $47.7402 |
| x-ai/grok-3-beta | 131,072 | #1 99.5% | #5 90.0% | #5 85.9% | #7 68.7% | #7 60.0% | N/A | N/A | N/A | #6 73.0% | #23 8.6% | #2 73.0% | 4,000 | inc $436.6788 | inc $23.1833 |
| x-ai/grok-3-mini-beta | 131,072 | #15 82.8% | #9 76.9% | #10 67.7% | #11 52.5% | #21 37.0% | N/A | N/A | N/A | #14 54.8% | #24 6.5% | #9 54.8% | 3,976 | inc $44.2934 | inc $2.8246 |
| qwen/qwen-turbo | 1,000,000 | #37 46.0% | #39 35.5% | #38 30.8% | #35 23.6% | #37 14.2% | #24 1.8% | #15 3.3% | #14 4.2% | #37 24.5% | #25 6.4% | #44 6.4% | 5,984 | #1 $44.3817 | #1 $0.1932 |
| x-ai/grok-3-mini-beta:high | 131,072 | #17 75.8% | #12 69.8% | #13 61.3% | #17 48.0% | #24 34.8% | N/A | N/A | N/A | #18 50.3% | #26 5.9% | #12 50.3% | 4,000 | inc $44.4349 | inc $4.8529 |
| mistralai/mistral-small-3.1-24b-instruct | 131,072 | #27 60.0% | #21 58.7% | #16 57.2% | #25 38.5% | #16 48.7% | N/A | N/A | N/A | #21 47.7% | #27 5.6% | #15 47.7% | 3,728 | inc $6.2525 | inc $0.2265 |
| qwen/qwen3-14b | 128,000 | #26 62.6% | #20 58.8% | #26 50.9% | #20 43.9% | #26 34.5% | N/A | N/A | N/A | #22 44.9% | #28 5.3% | #16 44.9% | 3,416 | inc $6.3779 | inc $0.8426 |
| mistralai/mistral-medium-3 | 131,072 | #33 55.7% | #28 53.1% | #27 50.2% | #22 42.5% | #28 33.2% | N/A | N/A | N/A | #24 43.0% | #29 5.1% | #20 43.0% | 3,936 | inc $59.9857 | inc $3.1361 |
| deepseek/deepseek-r1 | 163,840 | #7 93.7% | #10 75.0% | #18 55.8% | #27 34.9% | #35 18.0% | N/A | N/A | N/A | #26 40.6% | #30 4.8% | #22 40.6% | 3,952 | inc $72.7609 | inc $9.2225 |
| google/gemma-3-12b-it | 131,072 | #35 49.4% | #32 48.3% | #25 51.2% | #28 34.6% | #23 35.0% | N/A | N/A | N/A | #27 39.9% | #31 4.7% | #23 39.9% | 3,664 | inc $5.9463 | inc $0.1337 |
| qwen/qwen3-30b-a3b | 128,000 | #34 55.2% | #33 46.0% | #29 47.4% | #26 38.2% | #30 29.7% | N/A | N/A | N/A | #28 39.1% | #32 4.6% | #25 39.1% | 3,416 | inc $9.0839 | inc $1.6166 |
| qwen/qwen3-32b | 128,000 | #20 70.4% | #24 55.4% | #33 43.5% | #30 30.3% | #31 29.3% | N/A | N/A | N/A | #31 36.5% | #33 4.3% | #26 36.5% | 3,416 | inc $9.1047 | inc $1.4330 |
| deepseek/deepseek-chat-v3-0324 | 163,840 | #29 58.4% | #34 44.3% | #34 38.8% | #32 29.4% | #27 33.8% | N/A | N/A | N/A | #32 34.9% | #34 4.1% | #28 34.9% | 3,960 | inc $43.7192 | inc $1.2682 |
| qwen/qwen3-235b-a22b | 128,000 | #14 83.7% | #11 70.5% | #35 36.8% | #37 21.0% | #36 14.9% | N/A | N/A | N/A | #34 29.6% | #35 3.5% | #30 29.6% | 3,416 | inc $18.2226 | inc $2.7821 |
| google/gemma-3-27b-it | 131,072 | #44 30.9% | #43 25.3% | #37 33.6% | #33 28.0% | #33 25.1% | N/A | N/A | N/A | #35 28.2% | #36 3.3% | #31 28.2% | 3,712 | inc $12.0846 | inc $0.3219 |
| qwen/qwen3-8b | 128,000 | #31 56.5% | #37 40.4% | #36 34.4% | #36 22.3% | #34 21.3% | N/A | N/A | N/A | #36 27.4% | #37 3.2% | #32 27.4% | 3,416 | inc $3.1890 | inc $0.5640 |
| mistralai/mistral-large-2411 | 131,072 | #25 63.8% | #31 48.3% | #24 51.3% | #38 17.2% | #44 0.0% | N/A | N/A | N/A | #38 24.1% | #38 2.8% | #35 24.1% | 3,824 | inc $304.2297 | inc $9.4407 |
| amazon/nova-lite-v1 | 300,000 | #30 57.4% | #38 37.2% | #40 24.9% | #41 14.2% | #41 6.0% | #23 3.4% | N/A | N/A | #41 17.9% | #39 2.7% | #42 11.1% | 4,608 | inc $25.0529 | inc $0.2342 |
| mistralai/ministral-8b | 128,000 | #40 40.6% | #35 42.8% | #32 44.6% | #39 15.8% | #40 7.2% | N/A | N/A | N/A | #39 22.8% | #40 2.7% | #36 22.8% | 3,712 | inc $12.2571 | inc $0.1293 |
| qwen/qwen3-4b:free | 128,000 | #38 45.5% | #41 35.2% | #43 17.9% | #40 15.7% | #38 11.7% | N/A | N/A | N/A | #40 18.0% | #41 2.1% | #38 18.0% | 3,416 | inc $0.0000 | inc $0.0000 |
| google/gemma-3-4b-it | 131,072 | #43 35.6% | #42 27.8% | #42 20.5% | #42 10.6% | #39 8.1% | N/A | N/A | N/A | #42 14.5% | #42 1.7% | #40 14.5% | 3,688 | inc $2.4073 | inc $0.0343 |
| mistralai/ministral-3b | 131,072 | #42 35.8% | #40 35.3% | #41 24.6% | #44 6.9% | #42 5.4% | N/A | N/A | N/A | #43 13.9% | #43 1.6% | #41 13.9% | 3,728 | inc $5.0020 | inc $0.0360 |
| amazon/nova-micro-v1 | 128,000 | #41 39.7% | #44 19.2% | #44 12.4% | #43 7.2% | #43 4.0% | N/A | N/A | N/A | #44 9.7% | #44 1.1% | #43 9.7% | 3,448 | inc $4.5526 | inc $0.0811 |

Table Interaction: Click column headers to sort. Refresh icon (refresh, Model header): reload data. Timeline icon (timeline, Model cell): view that model's performance chart. Stats icon (query_stats, bin header): view the Cost/Score plot. Document icon (article, score cell): view test details.

Notes & Definitions:

  1. N/A indicates no results for this model/bin, potentially due to context window limits.
  2. AUC (Area Under Curve), normalized to a 0-100% scale, summarizes overall performance across bins, weighted by bin width. Model AUC is normalized only over the context bins within the model's max context length.
  3. Badge definitions:
    • #1 #2 #3: Top 3 models for that metric (higher score / lower cost is better).
    • inc: Incomplete cost data (potentially underestimated cost, excluded from cost rank).
    • FREE: Free model or cost data unavailable (excluded from cost rank).

Benchmark Details & Methodology

1. What is the OpenAI-MRCR benchmark?

OpenAI MRCR tests a Large Language Model's (LLM) ability to handle complex conversational history. Key aspects include:

  • Core Task: Finding and distinguishing between multiple identical pieces of information ("needles") hidden within a long conversation ("haystack").
  • Setup: Inspired by Google DeepMind's MRCR eval (arxiv:2409.12640v2), this version inserts 2, 4, or 8 identical requests (e.g., "write a poem about tapirs") alongside distractor requests. Needles/distractors are generated by GPT-4o to blend in.
  • Challenge: The model must retrieve a specific instance (e.g., the 2nd poem) based on its order, requiring careful tracking of the conversation. It must also prepend a specific random code (hash) to its answer.
  • Data Source: The benchmark data and detailed methodology are described on Hugging Face (openai/mrcr).
  • Dashboard Scope: This dashboard visualizes results directly from that published dataset and does not currently run new evaluations.
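
To make the task concrete, here is a minimal, purely illustrative sketch of what one MRCR-style test item could look like as a chat transcript. The turn contents, needle placement, and field names below are assumptions for illustration only, not the published dataset's actual schema.

```python
# Illustrative 2-needle item: two identical "poem about tapirs" requests are
# buried among distractor requests, and the final instruction asks for the
# 2nd instance, prefixed with a random hash.
conversation = [
    {"role": "user", "content": "Write a blog post about rock climbing."},
    {"role": "assistant", "content": "...blog post..."},
    {"role": "user", "content": "Write a poem about tapirs."},      # needle 1
    {"role": "assistant", "content": "...first tapir poem..."},
    {"role": "user", "content": "Write a riddle about clocks."},    # distractor
    {"role": "assistant", "content": "...riddle..."},
    {"role": "user", "content": "Write a poem about tapirs."},      # needle 2
    {"role": "assistant", "content": "...second tapir poem..."},
    # ...many more turns to pad the conversation to the target length...
    {"role": "user", "content": "Prepend a1b2c3 to the 2nd poem about tapirs."},
]
# A correct answer begins with the hash: "a1b2c3 ...second tapir poem..."
```
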
2. How is the score calculated?

The score measures how accurately the model retrieves the correct instance of the requested needle. The process involves:

  • Comparison Method: The model's answer is compared to the expected answer using the SequenceMatcher ratio from Python's `difflib` library.
  • Hash Requirement: The model *must* include a specific random code (hash identifier) at the start of its answer. This hash is removed before the comparison.
  • Failure Condition: If the required hash is missing or incorrect, the score is automatically 0.
  • Result: The similarity ratio (0.0 to 1.0) from the comparison is presented as a percentage (0-100%) in the table.
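
A minimal sketch of this scoring rule in Python, assuming the response and expected answer are plain strings (it mirrors the description above, not the benchmark's exact implementation):

```python
from difflib import SequenceMatcher

def grade(response: str, expected_answer: str, required_hash: str) -> float:
    """Score one model answer following the rules above (illustrative sketch)."""
    # Failure condition: the answer must start with the required random hash.
    if not response.startswith(required_hash):
        return 0.0
    # The hash is removed before the comparison.
    stripped = response[len(required_hash):].strip()
    # SequenceMatcher ratio is in [0.0, 1.0]; the table displays it as 0-100%.
    return SequenceMatcher(None, stripped, expected_answer).ratio()
```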

See FAQ #6 for details about how Area Under Curve (AUC) scores summarize performance across different context lengths.

(Source: OpenAI MRCR Dataset Card)

3. What do the context length "bins" (e.g., 128k, 1M) mean?

The "bins" group test runs based on the total length of the text involved (prompt + expected answer). Here's how they work:

  • Measurement: Length is measured in tokens using the `o200k_base` tokenizer.
  • Grouping: Tests are grouped into bins based on their total token count. For example, the "128k" bin includes tests with >65,536 and <=131,072 tokens.
  • Score Display: The score shown for a bin (e.g., "128k (%)") is the average score from the 100 test samples conducted within that bin's length range.
  • Boundaries: The specific bin boundaries are [4k, 8k], (8k, 16k], (16k, 32k], (32k, 65k], (65k, 128k], (128k, 256k], (256k, 512k], (512k, 1M].
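
A small sketch of how a test could be assigned to a bin under these rules. It assumes the `tiktoken` package for the `o200k_base` tokenizer and treats the prompt as a single string; the bin edges come directly from the list above.

```python
import bisect
import tiktoken  # assumed tooling; provides the o200k_base encoding

# Upper edges of the bins: [4k, 8k], (8k, 16k], ..., (512k, 1M]
BIN_EDGES = [8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]
BIN_LABELS = ["8k", "16k", "32k", "64k", "128k", "256k", "512k", "1M"]

def assign_bin(prompt_text: str, expected_answer: str) -> str:
    enc = tiktoken.get_encoding("o200k_base")
    total = len(enc.encode(prompt_text)) + len(enc.encode(expected_answer))
    # First edge >= total picks the smallest bin whose upper bound fits the test,
    # e.g. 100,000 tokens (> 65,536 and <= 131,072) lands in the "128k" bin.
    return BIN_LABELS[bisect.bisect_left(BIN_EDGES, total)]
```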

(Source: OpenAI MRCR Dataset Card)

4. How is this different from Fiction.livebench?

Both test how well models handle long texts, but focus on different skills. OpenAI-MRCR tests the model's ability to pinpoint and distinguish between identical pieces of information based on their order in a conversation (using synthetic data). Fiction.livebench (fiction.live/...) tests narrative understanding – how well models follow plots, characters, and consistency within complex stories, using quizzes based on actual fiction excerpts.

5. How does the benchmark design reduce the risk of models succeeding due to training data contamination?

The benchmark incorporates several design features to minimize the chance that models succeed simply by having seen similar data during training:

  • Synthetic & Unique Data: Each test uses a specially generated, long conversation. While topics might overlap with training data, the specific sequence of turns and the placement of "needles" are unique to the benchmark run.
  • Instance Specificity: The core task isn't just retrieving information (e.g., a poem about tapirs) but retrieving a specific instance (e.g., the 2nd poem requested) based on conversational order. Simple memorization of poems is insufficient.
  • Required Random Hash: Models must prepend a specific random code (hash) to their answer. This code is generated for the test run and cannot be predicted from general training data. Failure to include the correct hash results in a score of 0.

These elements combined make it highly unlikely that a model can achieve a high score purely by recalling memorized training data, as success requires understanding the specific conversational context, instance order, and adhering to the random hash requirement. (Source: OpenAI MRCR Dataset Card)

Note: This dashboard evaluates models against the dataset published by OpenAI (data collected up to April 11, 2024). Models released or significantly updated after this date might have been trained on this specific benchmark data, potentially affecting their results.

Understanding the Results

6. What are the different AUC scores and how are they calculated?

AUC (Area Under Curve) gives a single score summarizing performance across different context lengths (bins). Think of it like an average grade across tests of increasing difficulty (longer contexts). It's calculated by plotting the average score for each bin against the maximum context length of that bin and measuring the area under the resulting line/curve. This area is then normalized to a percentage (0-100%).

  • AUC @128k: AUC calculated using results only up to the 128k token bin (tests with up to 131,072 tokens).
  • AUC @1M: AUC calculated using results across all bins up to the 1M token bin (tests with up to 1,048,576 tokens).
  • Model AUC: AUC calculated only over the range of bins the specific model actually completed successfully. This provides a fairer comparison if a model couldn't handle the longest contexts.

Technical Note: Calculation uses the Trapezoidal Rule on a linear scale of context lengths. See example below for details.

Example (AUC @1M for google/gemini-2.5-pro-preview-03-25 - 2-Needle Top Model):

  1. Data Points (Bin, Score%): (8k, 98.7), (16k, 94.3), (32k, 89.0), (64k, 88.9), (128k, 84.1), (256k, 81.1), (512k, 72.7), (1M, 60.1)
  2. Calculate Trapezoid Areas:
      - 8k-16k: (16384 - 8192) * (98.7 + 94.3)/2 = 790,474.02
      - 16k-32k: (32768 - 16384) * (94.3 + 89.0)/2 = 1,501,538.21
      - 32k-64k: (65536 - 32768) * (89.0 + 88.9)/2 = 2,914,845.6
      - 64k-128k: (131072 - 65536) * (88.9 + 84.1)/2 = 5,671,930.43
      - 128k-256k: (262144 - 131072) * (84.1 + 81.1)/2 = 10,827,079.54
      - 256k-512k: (524288 - 262144) * (81.1 + 72.7)/2 = 20,150,769.71
      - 512k-1M: (1048576 - 524288) * (72.7 + 60.1)/2 = 34,819,395.63
  3. Sum Areas: 76,676,033.16
  4. Normalize: The total width for AUC @1M is (1M bin - 8k bin) = 1048576 - 8192 = 1,040,384.
    Normalized AUC = (Total Area / Total Width) = 76,676,033.16 / 1,040,384 ≈ 73.6997
  5. Result: AUC @1M ≈ 73.7%

(Note: The example above uses 2-needle benchmark data from the model currently ranked #1 by AUC @1M, for illustration. The same calculation method applies to all models.)

AUC @128k uses the same method but only sums areas up to the 128k point and normalizes by the width from the first bin to 128k. Model AUC normalizes by the width of the range actually tested by the model.
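
The same calculation can be written compactly. The sketch below uses the rounded bin scores from the worked example, so it reproduces the published values to roughly one decimal place:

```python
def normalized_auc(points: list[tuple[int, float]]) -> float:
    """Trapezoidal-rule AUC over (bin upper bound in tokens, average score %)
    pairs, divided by the total width so the result is back on a 0-100% scale."""
    pts = sorted(points)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return area / (pts[-1][0] - pts[0][0])

# Rounded 2-needle scores for google/gemini-2.5-pro-preview-03-25 (example above):
scores = [(8192, 98.7), (16384, 94.3), (32768, 89.0), (65536, 88.9),
          (131072, 84.1), (262144, 81.1), (524288, 72.7), (1048576, 60.1)]
print(round(normalized_auc(scores), 1))      # ~73.7 -> AUC @1M
print(round(normalized_auc(scores[:5]), 1))  # ~88.5 -> AUC @128k (bins up to 128k)
```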

7. How is the cost calculated?

Cost estimates shown are calculated only for successful test runs (where the model returned a response) and are based on the following:

  • Pricing Source: Public pricing data reported by OpenRouter for each model, specifically using the rate from the cheapest provider listed for that model at its maximum context length.
  • Token Counting Method: Cost estimates use the input (prompt) and output (completion) token counts reported by the API provider for each successful run.
    • Note: This method may differ from the tokenization used for grouping results. Test results are assigned to context length bins based on total tokens (prompt + expected completion) calculated with the `o200k_base` tokenizer, as opposed to the provider-reported counts used for cost estimation.
  • Total Prompt Cost: Estimates the cost based on the total prompt tokens reported by the API. When generating multiple candidates (n>1), the prompt is typically sent once, but the exact cost impact can vary by API provider.
  • Total Compl Cost: Estimates the cost of the text generated by the model (completion). It sums the completion token costs across all generated candidates (e.g., all 8 runs per test).
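
As a rough sketch of how such an estimate could be computed from provider-reported token counts and per-million-token prices (the field names and price units here are assumptions for illustration, not the dashboard's actual schema):

```python
def estimate_costs(runs, prompt_price_per_mtok: float, completion_price_per_mtok: float):
    """Estimate total prompt and completion cost over successful runs.

    Each item in `runs` is assumed to look like:
      {"prompt_tokens": 120000, "completion_tokens_per_candidate": [850, 910, ...]}
    with one completion token count per generated candidate (e.g. n=8).
    """
    prompt_tokens = sum(r["prompt_tokens"] for r in runs)
    completion_tokens = sum(sum(r["completion_tokens_per_candidate"]) for r in runs)
    total_prompt_cost = prompt_tokens * prompt_price_per_mtok / 1_000_000
    total_completion_cost = completion_tokens * completion_price_per_mtok / 1_000_000
    return total_prompt_cost, total_completion_cost
```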

Important Note on Actual vs. Estimated Cost:

  • The benchmark execution aimed to utilize cost-saving measures such as batch processing, caching, and available discounts/credits when possible.
  • Consequently, the actual costs incurred during the original benchmark run may have been lower than the estimates presented here.

To ensure a fair and standardized comparison across all models for users, this dashboard displays estimated costs. These are calculated using publicly available on-demand prices from OpenRouter (based on the cheapest provider listed for the model's max context length), rather than the potentially variable actual costs from the benchmark execution.

8. What does the "Runs" column represent?

This shows the total number of potential answers generated by the model across all successful test runs included in the summary. Each answer corresponds to one candidate response (or 'run'), typically generated using the 'n' parameter in the API call to get multiple outputs per input.

The total reflects the generation workload: (Number of Successful Tests) × (Candidates Generated Per Test).
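
For example, a model that completed 790 tests with 8 candidates generated per test would show 790 × 8 = 6,320 in the Runs column.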

9. What do the badges (#1, #2, #3, inc) in the table mean?
  • #1 #2 #3 These badges highlight the top 3 models for that specific column's metric. Higher scores are better for performance (%), while lower costs ($) are better.
  • inc The "inc" (incomplete) badge appears in cost columns if the model didn't successfully complete tests for all context length bins (up to 1M tokens). The cost shown might be lower than it would be if all tests had passed. Models marked inc are not included in the cost rankings to ensure fair comparison.
  • FREE This badge appears in cost columns for models where cost data is zero or unavailable (e.g., free models). These are also excluded from cost rankings.
  • deprec This badge indicates the model may be deprecated or no longer actively supported/available via common APIs (like OpenRouter). Results might be stale.
  • new This badge indicates results for this model were updated recently (within the last 7 days). This badge takes precedence over the deprec badge.

Dashboard Features

10. How do I interact with the table headers?

The main results table headers provide several actions:

  • Sorting: Click any column header label (e.g., `AUC @1M (%)`) to sort the entire table by that column's values. Click again to reverse the sort direction.
  • Refresh Data: Click the refresh icon (refresh) in the 'Model' column header to reload the latest results for the currently selected needle count.
  • Model Performance Chart: Click the timeline icon (timeline) next to a model name in the 'Model' column to view a chart of that model's performance across all available needle counts.
  • Cost/Score Plot: Click the stats icon (query_stats) in the header of any context bin column (e.g., `128k (%)`) to view a Cost vs. Score scatter plot for models within that specific bin.
11. How can I see a Cost vs. Score chart for a specific context bin?

In the main results table, click the stats icon (query_stats) in the header of any context bin column (e.g., `128k (%)`). This opens a scatter plot showing the total cost vs. average score for all currently selected models specifically within that context length bin.

12. How can I view the detailed results for a specific test run?

In the main results table, hover over any individual score cell (the percentage value) and click the document icon (article) that appears. This opens a modal where you can browse through the individual test runs for that model/bin combination, view the expected answer, and see the actual responses generated by the model along with their scores.

13. Can I customize the table columns?

Yes! Go to the "Controls" tab (next to the "Leaderboard" tab). There you can find options to show/hide specific columns, like all the individual context length bins or the pricing information.

14. How are the initially visible traces (lines/bars) on the main chart chosen?

When viewing the main performance chart:

  • If 5 or fewer models are selected in the "Controls" tab, all their performance traces (lines/bars) will be visible on the chart by default.
  • If more than 5 models are selected, the chart initially shows only the traces for the top 5 performing non-deprecated models (ranked by their overall `AUC @1M (%)` score).
  • All other selected models are listed in the legend but their traces are hidden initially. Click a model's name in the legend to display its performance trace.
  • You can toggle the visibility of any model's performance trace by clicking its name in the legend.
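
A rough sketch of that default-visibility rule (illustrative only; `models` is assumed to be a list of records with `deprecated` and `auc_1m` fields, which is not necessarily the dashboard's actual data model):

```python
def default_visible_traces(models, limit=5):
    """Show every selected model if there are few; otherwise show only the
    top `limit` non-deprecated models ranked by their AUC @1M score."""
    if len(models) <= limit:
        return list(models)
    eligible = [m for m in models if not m["deprecated"]]
    return sorted(eligible, key=lambda m: m["auc_1m"], reverse=True)[:limit]
```
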
15. How are the initially visible points on the Cost vs. Score chart chosen?

When viewing the Cost vs. Score chart for a specific bin or AUC metric:

  • The chart initially displays data points only for the top 10 performing models among those currently selected. The ranking is based on the primary score metric being plotted: either the average score (%) within that specific context bin, or the relevant AUC score (e.g., `AUC @1M (%)`, `Model AUC (%)`) if viewing an AUC plot.
  • All other selected models are listed in the legend but their data points are hidden initially. Click a model's name in the legend to display its data point.
  • You can toggle the visibility of any model's data point by clicking its name in the legend.

Data, Code & Contact

16. Where can I find the benchmark data and code?

The dataset details and evaluation methods are described on Hugging Face: openai/mrcr. OpenAI also discussed results in their GPT-4.1 blog post.

17. Can you evaluate more models?

We plan to add more models over time. Feel free to suggest specific models you'd like to see evaluated!

18. How can I contact you?

For questions, suggestions, or issues, please reach out on Twitter: @DillonUzar.