Agent Testing Network

50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.

Methodology

50 benchmark agents across 10 domains
499 standardized queries (simple → complex)
8,006 total benchmark tests run
01 Agents: 50 agents using Claude, GPT-4, Gemini, DeepSeek, and Llama model families
02 Queries: 499 standardized queries across 10 domains, graded by complexity
03 Evaluation: LLM-judged relevance, freshness, and completeness (0–5 scale)
04 Frequency: every 2 hours, automated via cron
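
As an illustration of the evaluation step (03), the sketch below models one judged result and rejects scores outside the 0–5 scale. The record layout, the `JudgedResult` name, and the cron expression are assumptions made for this example, not the network's published schema or schedule; the sample values come from the Exa Search row in batch 8712b321.

```python
from dataclasses import dataclass

SCORE_MIN, SCORE_MAX = 0.0, 5.0  # the 0-5 scale named in step 03

# Assumed crontab schedule for "every 2 hours" (minute 0 of every second hour).
# The actual crontab line used by the network is not shown in this page.
ASSUMED_CRON = "0 */2 * * *"


@dataclass(frozen=True)
class JudgedResult:
    """One tool's scores for one query (hypothetical record layout)."""
    tool: str
    relevance: float
    freshness: float
    completeness: float
    latency_ms: int
    results: int

    def __post_init__(self) -> None:
        # Reject any LLM-judged score that falls outside the 0-5 scale.
        for name in ("relevance", "freshness", "completeness"):
            value = getattr(self, name)
            if not SCORE_MIN <= value <= SCORE_MAX:
                raise ValueError(f"{name}={value} is outside the 0-5 scale")


# Values taken from the Exa Search row of batch 8712b321.
row = JudgedResult("Exa Search", 4.0, 5.0, 3.0, latency_ms=287, results=10)
```

Validating at construction time keeps a malformed judge response from silently entering the aggregate numbers.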

Benchmarks by Domain

Recent Batch Comparisons

8712b321 | devtools | devtools regulations impacting ecosystem (4)
4 tools | Apr 27, 02:30 PM
Tool | Relevance | Freshness | Completeness | Latency | Results
Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 287 ms | 10
Tavily | 3.0/5 | 3.0/5 | 2.0/5 | 2653 ms | 10
Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 347 ms | 0
Jina AI | 0.0/5 | 0.0/5 | 0.0/5 | 66 ms | 1

b2e438c9 | e-commerce | compare top tools for seo workflows (3)
2 tools | Apr 27, 02:00 PM
Tool | Relevance | Freshness | Completeness | Latency | Results
Tavily | 4.0/5 | 4.0/5 | 3.0/5 | 1570 ms | 10
Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 708 ms | 0

44151b78 | healthcare | latest healthcare changes affecting clinical (1)
3 tools | Apr 27, 01:30 PM
Tool | Relevance | Freshness | Completeness | Latency | Results
Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 482 ms | 10
Tavily | 4.0/5 | 4.0/5 | 3.0/5 | 1831 ms | 10
Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 336 ms | 0

99b16e94 | legal | find primary sources about compliance in legal (20)
2 tools | Apr 27, 01:00 PM
Tool | Relevance | Freshness | Completeness | Latency | Results
Exa Search | 4.0/5 | 3.0/5 | 3.0/5 | 446 ms | 10
Tavily | 4.0/5 | 3.0/5 | 3.0/5 | 2107 ms | 10

d62939ba | finance | compare top tools for earnings workflows (18)
2 tools | Apr 27, 12:30 PM
Tool | Relevance | Freshness | Completeness | Latency | Results
Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 458 ms | 10
Tavily | 3.0/5 | 4.0/5 | 2.0/5 | 1564 ms | 10
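
One way to read a batch at a glance is to collapse the three judged scores into a single number per tool. The sketch below does this for batch 8712b321, using an unweighted mean with latency as a tie-breaker. That ranking rule is an assumption for illustration only; the page does not state how, or whether, the network combines the three scores.

```python
from statistics import mean

# Scores copied from batch 8712b321:
# tool -> ((relevance, freshness, completeness), latency in ms)
batch = {
    "Exa Search": ((4.0, 5.0, 3.0), 287),
    "Tavily": ((3.0, 3.0, 2.0), 2653),
    "Firecrawl": ((0.0, 0.0, 0.0), 347),
    "Jina AI": ((0.0, 0.0, 0.0), 66),
}


def rank_tools(rows):
    """Sort tools by mean score (highest first), then by lower latency."""
    return sorted(rows, key=lambda tool: (-mean(rows[tool][0]), rows[tool][1]))


ranking = rank_tools(batch)
print(ranking)
# ['Exa Search', 'Tavily', 'Jina AI', 'Firecrawl']
```

A weighted mean, or a rule that drops tools returning zero results, would be an equally reasonable choice depending on what the agent needs.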

Reproduce These Tests

All benchmark configurations are public. Your agent can download query sets and reproduce any test with its own infrastructure.
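
A reproduction run could be sketched roughly as follows. The query-set layout (`id`, `domain`, `text` fields), the `run_queries` helper, and the `stub_search` placeholder are all hypothetical; the published configuration format may differ, and the stub would need to be replaced with a real client for the API under test.

```python
import json
import time
from typing import Callable


def run_queries(queries: list[dict], search: Callable[[str], list[str]]) -> list[dict]:
    """Run each query through `search`, recording latency and result count."""
    records = []
    for query in queries:
        start = time.perf_counter()
        hits = search(query["text"])
        latency_ms = (time.perf_counter() - start) * 1000
        records.append(
            {"id": query["id"], "latency_ms": round(latency_ms), "results": len(hits)}
        )
    return records


# Hypothetical query-set layout; a downloaded file would be loaded the same way.
query_set = json.loads(
    '[{"id": "q1", "domain": "legal",'
    ' "text": "find primary sources about compliance in legal"}]'
)


def stub_search(text: str) -> list[str]:
    """Placeholder for a real API client (swap in your own)."""
    return [f"result for {text}"]


records = run_queries(query_set, stub_search)
```

Each record captures the latency and result-count columns shown in the batch comparisons; relevance, freshness, and completeness would still need a separate judging step.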