Why Our AI Search Benchmark Methodology Lives in a Public GitHub Repo

Published: March 3, 2026 · 7 min read
#ai-search #benchmark #transparency #open-source #aisearcharena

Date: March 2, 2026 · Author: Jamie Watters · Project: AI Search Arena (aisearcharena.com)


Most benchmark platforms tell you what they measured. Very few let you see how.

Today I moved the entire scoring methodology for AI Search Arena — every dimension, every weight, every evaluation prompt, every model configuration — into a public GitHub repository. Here's why that matters, and why I think more benchmarks should do the same.

The Problem with Black-Box Benchmarks

If you've ever looked at a software benchmark and wondered "but how did they actually score this?", you're not alone. The AI search optimization space is still young enough that there's no agreed standard for measuring tool quality. Most benchmarks are editorial — someone uses the tool, writes a review, assigns stars.

AI Search Arena takes a different approach. We use a panel of 6 frontier AI models to evaluate 51 scoring dimensions across every tool, then synthesize those evaluations through median aggregation with confidence scoring. It's systematic, reproducible, and now — fully transparent.
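The aggregation step can be sketched in a few lines. This is an illustrative assumption, not the platform's actual code: the exact confidence formula isn't spelled out in the post, so this version derives confidence from how tightly the panel agrees around the median.

```python
from statistics import median

def aggregate_dimension(scores):
    """Aggregate one dimension's scores from the model panel.

    `scores` holds the raw 0-10 scores, one per panel model. Returns
    (median score, confidence), where confidence shrinks as the panel's
    spread around the median grows. The spread-based confidence scaling
    here is a hypothetical stand-in for the real formula.
    """
    m = median(scores)
    spread = sum(abs(s - m) for s in scores) / len(scores)
    confidence = max(0.0, 1.0 - spread / 5.0)  # maps 0-10 spread onto 0..1
    return m, confidence

# Six panel scores for one dimension: median 7, fairly tight agreement.
score, conf = aggregate_dimension([7, 8, 7, 6, 7, 9])
```

A median (rather than a mean) keeps a single outlier model from dragging a dimension's score, which is presumably why the panel uses it.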

What Lives in the Public Repo

The GEO Benchmark Framework contains:

dimensions.yaml — All 51 scoring dimensions with their weights, categories, and 2-3 evaluation prompts each (128 prompts total). When our AI models evaluate a tool on "Schema Markup Support" or "AI Citation Frequency", these are the exact prompts they see.
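To make the shape concrete, here is what one parsed entry might look like. The field names, category, and prompt wording below are assumptions for illustration, not the repo's actual schema; only the 0.04 weight and the 2-3-prompts-per-dimension range come from the post.

```python
# Hypothetical parsed dimension entry -- field names and prompt text are
# assumed, not copied from the real dimensions.yaml.
dimension = {
    "id": "ai-citation-frequency",
    "name": "AI Citation Frequency",
    "category": "visibility",   # assumed category label
    "weight": 0.04,             # the weight referenced later in this post
    "prompts": [
        "How often do AI search engines cite content optimized with this tool?",
        "Does the tool measure or report citation frequency directly?",
    ],
}

# Each dimension carries 2-3 prompts; across 51 dimensions that yields
# the 128 prompts the post mentions. Weights across all 51 dimensions
# should sum to 1.0 so the composite stays on the raw 0-10 scale.
assert 2 <= len(dimension["prompts"]) <= 3
```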

models.yaml — The 6 AI models that form our consensus panel: GPT-5.2, Claude Sonnet 4.6, Gemini 3 Flash, Grok 4.1 Fast, DeepSeek V3.2, and Mistral Large 3. We deliberately chose models from different providers (US, EU, China), different architectures (dense, MoE), and both open and closed source — so no single vendor's biases dominate.

tracks/geo-platform.yaml — The track definition that maps all 51 dimensions to the GEO Platform evaluation track.

Everything is versioned. Dimension weights change? That's a pull request. New evaluation prompts? Visible in the commit history. Model panel updated? There's a CHANGELOG entry explaining why.

Framework-as-Data: Why YAML, Not Code

One architectural decision I'm particularly happy with: the methodology is pure data, not code.

When I want to adjust the weight of "AI Citation Frequency" from 0.04 to 0.05, I edit a YAML file and run a loader script. No code changes. No deployment. No risk of introducing bugs. The application reads the framework from the repo and syncs it to the database with a single command.
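A loader of this kind can be very small. The sketch below is an assumption about how such a sync might work, not the actual script: it takes dimensions already parsed from YAML (parsing omitted to stay dependency-free) and upserts them into a database, so re-running it after a weight edit updates rows in place with no code change.

```python
import sqlite3

def sync_dimensions(conn, dimensions):
    """Upsert framework dimensions into the application database.

    `dimensions` is a list of dicts parsed from dimensions.yaml.
    Running the sync again after editing a weight updates the existing
    row -- the framework stays pure data, the app just re-reads it.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dimensions "
        "(id TEXT PRIMARY KEY, name TEXT, weight REAL)"
    )
    conn.executemany(
        "INSERT INTO dimensions (id, name, weight) "
        "VALUES (:id, :name, :weight) "
        "ON CONFLICT(id) DO UPDATE SET "
        "name = excluded.name, weight = excluded.weight",
        dimensions,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
# Initial sync, then a second sync after bumping the weight 0.04 -> 0.05.
sync_dimensions(conn, [{"id": "ai-citation-frequency",
                        "name": "AI Citation Frequency", "weight": 0.04}])
sync_dimensions(conn, [{"id": "ai-citation-frequency",
                        "name": "AI Citation Frequency", "weight": 0.05}])
weight = conn.execute("SELECT weight FROM dimensions").fetchone()[0]
```

The upsert is what makes the "edit YAML, run one command" workflow safe to repeat: the sync is idempotent.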

This separation means:

  • Vendors can audit the methodology without reading application source code
  • Changes are trackable — every weight adjustment, prompt refinement, or model swap is a git commit
  • Reproducibility is guaranteed — each benchmark cycle is pinned to a specific git SHA, so you can see exactly what methodology was used
  • Community contributions are possible — if someone thinks a dimension weight is unfair, they can open a PR and make their case publicly
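The reproducibility bullet deserves a concrete illustration. In the real pipeline the pin is a git commit SHA; the sketch below shows the same idea without a repo, using a content hash over the framework files (file names and contents here are hypothetical).

```python
import hashlib

def framework_fingerprint(files):
    """Deterministic fingerprint of framework content.

    `files` maps filename -> file text. Sorting makes the result
    independent of iteration order, so identical content always
    yields the identical pin -- the property a git SHA provides.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode())
        digest.update(files[name].encode())
    return digest.hexdigest()

# Hypothetical framework snapshot.
files = {"dimensions.yaml": "weight: 0.04\n", "models.yaml": "panel: 6\n"}
pin = framework_fingerprint(files)
```

Any one-character change to any file produces a different pin, which is exactly why pinning a benchmark cycle to a SHA lets anyone verify which methodology version produced a score.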

The Model Panel Story

Today's model panel is a good example of transparent iteration. The original v1.0 panel used older models (GPT-4o, Gemini 2.0 Flash, Llama 3.1 405B). I ran a cross-analysis across 7 different LLMs asking each one to recommend the ideal evaluation panel for an AI search benchmark. The result was the v1.3 panel — 6 frontier models optimized for provider diversity, architectural diversity, and stability.

Then I verified every model identifier against OpenRouter's actual API. Two didn't exist yet (Gemini 3 Pro stable, Grok 4), so I substituted the closest available alternatives. That verification — including the specific alternatives chosen and why — is documented in the CHANGELOG.

After running the first test cycle, I discovered that two models (Claude Opus and Gemini Pro) accounted for 85% of the evaluation cost. For rubric-based scoring with clear evaluation criteria, the difference between Opus and Sonnet is minimal, so I swapped both to cost-optimized alternatives. Again — documented, versioned, transparent.
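The arithmetic behind that decision is worth seeing. The per-call prices below are invented for illustration (real pricing differs); they just show how two premium models can dominate a six-model panel's cost.

```python
# Hypothetical per-call costs (USD) for a six-model panel -- these
# numbers are assumptions, not real provider pricing.
per_call_cost = {
    "claude-opus": 0.050, "gemini-pro": 0.035,
    "gpt": 0.006, "grok": 0.004, "deepseek": 0.003, "mistral": 0.002,
}
total = sum(per_call_cost.values())
premium = per_call_cost["claude-opus"] + per_call_cost["gemini-pro"]
share = premium / total  # fraction of panel cost from the two premium models
```

Under these assumed prices, the two premium models are 85% of the spend while contributing only two of six votes to a median, which is why swapping them barely moves the scores.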

What This Means for Vendors

If your tool is being evaluated by AI Search Arena, you can:

  1. Read every evaluation prompt your tool will be scored against
  2. See the exact weights that determine your composite score
  3. Verify the model panel — which AI models evaluate you and why those were chosen
  4. Track changes — subscribe to the repo and get notified when methodology updates happen
  5. Propose changes — if you think a dimension is weighted unfairly, open an issue

This is the level of transparency I'd want if my product were being benchmarked. So that's the level we're building to.

First Benchmark Cycle: Running Now

As I write this, the first full benchmark cycle is running — 32 tools being evaluated across 51 dimensions by 6 AI models. That's 9,792 API calls, roughly 5 hours of evaluation time.
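The call count checks out arithmetically, assuming one API call per (tool, dimension, model) triple, which is what the stated total implies:

```python
tools, dimensions, models = 32, 51, 6
api_calls = tools * dimensions * models   # one call per (tool, dimension, model)

# Wall-clock budget: 5 hours for the whole cycle. The per-call figure
# below assumes fully sequential execution, which is an assumption --
# with concurrency, individual calls can take longer.
seconds_per_call = 5 * 3600 / api_calls
```

At roughly 1.8 seconds per call if run serially, the 5-hour figure is plausible for rubric-style evaluations.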

The test run scored AI Search Mastery (our own suite) at 6.3/10. We're benchmarking our own products alongside 28 competitors. Same prompts, same models, same methodology. If our tools score poorly on certain dimensions, that's public information — and motivation to improve.

Results will be published at aisearcharena.com once the cycle completes.

The Bigger Picture

I think the AI search optimization space needs credible, independent benchmarking. The tools in this space are genuinely useful — but comparing them is nearly impossible without a systematic framework.

Making the methodology public isn't just about trust. It's about building something that's actually useful for the industry. If our scoring dimensions are wrong, I want to know. If our model panel is biased, someone should be able to prove it. And if another benchmark wants to use our framework as a starting point, the CC BY 4.0 license lets them.

Open methodology. Verifiable results. That's the standard we're setting.


Building AI Search Arena in public. Follow along at jamiewatters.work.
