Agent Testing Network

50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.

Methodology

50 benchmark agents across 10 domains
499 standardized queries (simple → complex)
13,632 total benchmark tests run
01 Agents: 50 agents using the Claude, GPT-4, Gemini, DeepSeek, and Llama model families
02 Queries: 499 standardized queries across 10 domains, graded by complexity
03 Evaluation: LLM-judged relevance, freshness, and completeness (0–5 scale)
04 Frequency: every 2 hours, automated via cron
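
The three judged dimensions map naturally onto a small record type. What follows is a minimal sketch, assuming each dimension is scored independently on the 0-5 scale and that an overall score is the unweighted mean of the three. The page does not say whether or how the dimensions are combined, and the names `JudgedResult` and `overall` are hypothetical.

```python
from dataclasses import dataclass

SCALE_MAX = 5.0  # scores are on a 0-5 scale, per the methodology


@dataclass(frozen=True)
class JudgedResult:
    """One tool's judged scores for one query (hypothetical structure)."""
    tool: str
    relevance: float
    freshness: float
    completeness: float

    def __post_init__(self):
        # Reject scores that fall outside the published 0-5 range.
        for name in ("relevance", "freshness", "completeness"):
            value = getattr(self, name)
            if not 0.0 <= value <= SCALE_MAX:
                raise ValueError(f"{name}={value} is outside 0-{SCALE_MAX}")

    def overall(self) -> float:
        # Assumption: unweighted mean of the three dimensions.
        return (self.relevance + self.freshness + self.completeness) / 3


example = JudgedResult("Polygon.io", 5.0, 5.0, 3.0)
print(round(example.overall(), 2))  # prints 4.33
```

For the schedule, a standard cron expression for "every 2 hours" is `0 */2 * * *`, which fires at minute 0 of every second hour. The actual crontab entry used by the network is not shown on the page.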

Benchmarks by Domain

Recent Batch Comparisons

Batch 1509dff6 | finance_data | META
3 tools | Jun 11, 03:30 PM

Tool                      Relevance  Freshness  Completeness  Latency  Results
Polygon.io                5.0/5      5.0/5      3.0/5         211 ms   1
Alpha Vantage             5.0/5      5.0/5      4.0/5         142 ms   100
Financial Modeling Prep   0.0/5      0.0/5      0.0/5         261 ms   0

Batch 229d168a | multilingual | latest multilingual changes affecting local-search (6)
2 tools | Jun 11, 03:00 PM

Tool                      Relevance  Freshness  Completeness  Latency  Results
Exa Search                4.0/5      5.0/5      3.0/5         515 ms   10
Tavily                    0.0/5      0.0/5      0.0/5         233 ms   0

Batch aa762af1 | general | general regulations impacting shopping (4)
2 tools | Jun 11, 02:30 PM

Tool                      Relevance  Freshness  Completeness  Latency  Results
Exa Search                4.0/5      3.0/5      3.0/5         435 ms   9
Tavily                    0.0/5      0.0/5      0.0/5         236 ms   0

Batch d8886d82 | science | compare top tools for grants workflows (3)
3 tools | Jun 11, 02:00 PM

Tool                      Relevance  Freshness  Completeness  Latency  Results
Exa Search                4.0/5      5.0/5      3.0/5         261 ms   10
Jina AI                   0.0/5      0.0/5      0.0/5         154 ms   1
Tavily                    0.0/5      0.0/5      0.0/5         261 ms   0

Batch cee2e91d | news | latest news changes affecting breaking (1)
2 tools | Jun 11, 01:30 PM

Tool                      Relevance  Freshness  Completeness  Latency  Results
Exa Search                4.0/5      2.0/5      3.0/5         475 ms   7
Tavily                    0.0/5      0.0/5      0.0/5         217 ms   0
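
A batch comparison can also be processed programmatically. The sketch below copies the rows of batch 1509dff6 (finance_data, query "META") and orders the tools best-first. The ranking rule (mean of the three scores, ties broken by lower latency) is an illustrative assumption rather than the site's own ranking, and `Row` and `rank` are hypothetical names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    tool: str
    relevance: float
    freshness: float
    completeness: float
    latency_ms: int
    results: int


# Rows copied from batch 1509dff6 (finance_data, query "META").
batch = [
    Row("Polygon.io", 5.0, 5.0, 3.0, 211, 1),
    Row("Alpha Vantage", 5.0, 5.0, 4.0, 142, 100),
    Row("Financial Modeling Prep", 0.0, 0.0, 0.0, 261, 0),
]


def rank(rows):
    """Sort best-first by mean score, breaking ties by lower latency.

    Assumption: this rule is illustrative, not the site's own ranking.
    """
    def key(r):
        mean = (r.relevance + r.freshness + r.completeness) / 3
        return (-mean, r.latency_ms)
    return sorted(rows, key=key)


for row in rank(batch):
    print(row.tool, row.latency_ms, row.results)
```

Under this rule Alpha Vantage ranks first for the batch, followed by Polygon.io and then Financial Modeling Prep, which returned no results.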

Recent Tests

Benchmarks by Task

Reproduce These Tests

All benchmark configurations are public. Your agent can download query sets and reproduce any test with its own infrastructure.
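
Reproducing a test amounts to replaying a query set against your own tool and recording the same measurements shown in the batch tables. The following is a minimal sketch under stated assumptions: the query-set format (a JSON array of objects with `id`, `domain`, and `query` fields) is hypothetical because the page does not document it, and `run_query_set` and `stub_search` are illustrative names. Scoring by an LLM judge is left out.

```python
import json
import time
from typing import Callable


def run_query_set(query_set_json: str, search: Callable[[str], list]) -> list:
    """Run each query through `search`, recording latency and result count.

    Assumption: the query set is a JSON array of objects with
    "id", "domain", and "query" keys (hypothetical format).
    """
    records = []
    for q in json.loads(query_set_json):
        start = time.perf_counter()
        results = search(q["query"])
        latency_ms = (time.perf_counter() - start) * 1000
        records.append({
            "id": q["id"],
            "domain": q["domain"],
            "latency_ms": round(latency_ms),
            "results": len(results),
        })
    return records


# Hypothetical one-query set; the query text "META" appears on the page,
# the id "q-001" is made up for illustration.
sample = '[{"id": "q-001", "domain": "finance_data", "query": "META"}]'


def stub_search(query: str) -> list:
    # Stand-in for a real API call made from your own infrastructure.
    return [f"result for {query}"]


print(run_query_set(sample, stub_search))
```

Replacing `stub_search` with a call to the API under test yields per-query latency and result counts that can be set against the published numbers.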