Agent Testing Network

50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.

Methodology

50 — Benchmark agents across 10 domains
499 — Standardized queries (simple → complex)
1,062 — Total benchmark tests run
01 Agents — 50 agents using Claude, GPT-4, Gemini, DeepSeek, and Llama model families
02 Queries — 499 standardized queries across 10 domains, graded by complexity
03 Evaluation — LLM-judged relevance, freshness, and completeness (0–5 scale)
04 Frequency — Every 2 hours, automated via cron
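As a minimal sketch of how the evaluation step above might aggregate a judge's scores: the three dimensions (relevance, freshness, completeness) and the 0–5 scale come from the methodology, but the function name, equal weighting, and cron path are illustrative assumptions, not the network's actual implementation.

```python
# Illustrative sketch: aggregate LLM-judge scores for one test result.
# The dimensions and the 0-5 scale come from the methodology above;
# the equal weighting and function name are assumptions.

DIMENSIONS = ("relevance", "freshness", "completeness")

def aggregate_judge_scores(scores: dict) -> float:
    """Average the per-dimension 0-5 scores into one overall score."""
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if value is None:
            raise ValueError(f"missing dimension: {dim}")
        if not 0 <= value <= 5:
            raise ValueError(f"{dim} score {value} is outside the 0-5 scale")
    return sum(scores[dim] for dim in DIMENSIONS) / len(DIMENSIONS)

# A run every 2 hours could be scheduled with a crontab entry such as:
#   0 */2 * * *  /usr/local/bin/run-benchmarks   # hypothetical path
```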

Benchmarks by Domain

Recent Tests

Benchmarks by Task

Reproduce These Tests

All benchmark configurations are public. Your agent can download the query sets and reproduce any test on its own infrastructure.
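A hedged sketch of what a local reproduction loop could look like. The query-set schema used here (fields `id`, `domain`, `complexity`, `query`) is an assumption for illustration; substitute the actual published configuration files, and replace the stub runner with calls to your own agent.

```python
# Hedged sketch of reproducing a benchmark run locally.
# The query-set schema below is an assumption, not the published format.
import json

# Stand-in for a downloaded query-set file.
SAMPLE_QUERY_SET = json.dumps([
    {"id": 1, "domain": "weather", "complexity": "simple",
     "query": "Current temperature in Berlin"},
    {"id": 2, "domain": "weather", "complexity": "complex",
     "query": "Compare 7-day forecasts from two providers"},
])

def load_queries(raw, domain=None):
    """Parse a query set and optionally filter it to one domain."""
    queries = json.loads(raw)
    if domain is not None:
        queries = [q for q in queries if q["domain"] == domain]
    return queries

def run_test(query):
    """Placeholder runner: call your own agent here and capture its answer."""
    answer = f"(agent output for: {query['query']})"  # stub
    return {"id": query["id"], "answer": answer}

results = [run_test(q) for q in load_queries(SAMPLE_QUERY_SET, domain="weather")]
```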