Agent Testing Network
50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.
Methodology
50 benchmark agents across 10 domains
499 standardized queries (simple → complex)
1,062 total benchmark tests run
Agents — 50 benchmark agents drawn from the Claude, GPT-4, Gemini, DeepSeek, and Llama model families
Queries — 499 standardized queries across 10 domains, graded by complexity
Evaluation — LLM-judged relevance, freshness, and completeness (0–5 scale); see the sketch after this list
Frequency — Every 2 hours, automated via cron
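For concreteness, here is a minimal sketch of how a single benchmark test could run under this methodology. It assumes a `run_query` callable for the API under test and an LLM `judge` callable that returns a 0–5 integer per criterion; all names and field shapes below are illustrative, not the network's actual interfaces.

```python
# Minimal sketch of one benchmark test. `run_query` and `judge` are
# assumed callables standing in for the real infrastructure.
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable

CRITERIA = ("relevance", "freshness", "completeness")

@dataclass
class TestResult:
    provider: str
    query: str
    score: float      # mean of per-criterion 0-5 judgments, e.g. 4.0/5
    latency_ms: int

def run_test(provider: str, query: str,
             run_query: Callable[[str, str], str],
             judge: Callable[[str, str, str], int]) -> TestResult:
    """Execute one query against one provider and grade the response."""
    start = time.monotonic()
    response = run_query(provider, query)        # call the API under test
    latency_ms = int((time.monotonic() - start) * 1000)
    # One 0-5 judgment per criterion, averaged into the reported score.
    scores = [judge(query, response, c) for c in CRITERIA]
    return TestResult(provider, query, round(mean(scores), 1), latency_ms)
```

The two-hour cadence corresponds to a crontab entry such as `0 */2 * * * run-benchmarks` (command name hypothetical).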
Recent Tests
🔬 Jina AI · Docker Desktop alternatives free · 4.0/5 · 25,193 ms
🔬 Firecrawl · Docker Desktop alternatives free · 5.0/5 · 3,444 ms
🔬 Exa Search · Docker Desktop alternatives free · 5.0/5 · 244 ms
🔬 Jina AI · dark matter detection experiments direct indirect latest results · 5.0/5 · 25,184 ms
🔬 Exa Search · dark matter detection experiments direct indirect latest results · 4.0/5 · 235 ms
🔬 Jina AI · metamaterials applications acoustic optical cloaking real-world deployment · 4.0/5 · 10,710 ms
🔬 Firecrawl · metamaterials applications acoustic optical cloaking real-world deployment · 4.0/5 · 7,624 ms
🔬 Exa Search · metamaterials applications acoustic optical cloaking real-world deployment · 4.0/5 · 196 ms
Reproduce These Tests
All benchmark configurations are public. Your agent can download query sets and reproduce any test with its own infrastructure.
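As a sketch of that workflow, the snippet below fetches a query set and iterates over it. The URL and record fields are assumptions standing in for the published benchmark configuration; substitute the real values before running.

```python
# Sketch of pulling the public query set for local reproduction.
# The URL and record fields are assumptions, not the published schema.
import json
import urllib.request

QUERY_SET_URL = "https://example.com/benchmarks/queries.json"  # hypothetical

def load_query_set(url: str = QUERY_SET_URL) -> list[dict]:
    """Download the standardized queries as a list of JSON records."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for record in load_query_set():
        # Assumed record shape: domain, complexity grade, and query text.
        print(record["domain"], record["complexity"], record["text"])
```

Each downloaded query can then be fed through your own equivalent of the test loop shown under Methodology, with scores graded by whichever judge model your infrastructure provides.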