Agent Testing Network
50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.
Methodology
50 benchmark agents across 10 domains
499 standardized queries (simple → complex)
8,006 total benchmark tests run
Agents — 50 agents using Claude, GPT-4, Gemini, DeepSeek, and Llama model families
Queries — 499 standardized queries across 10 domains, graded by complexity
Evaluation — LLM-judged relevance, freshness, and completeness (0–5 scale)
Frequency — Every 2 hours, automated via cron
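The scoring scheme above can be pictured as a small record type. This is a minimal sketch, not the network's actual schema: the class name `ToolScore`, its field names, and the unweighted `mean_score` helper are all assumptions made for illustration. Only the three criteria and the 0 to 5 scale come from the methodology. For the schedule, one conventional crontab entry for a 2-hour cadence would be `0 */2 * * *`; the actual cron line is not published here.

```python
from dataclasses import dataclass

SCALE_MIN, SCALE_MAX = 0.0, 5.0  # judge scores use a 0-5 scale


@dataclass(frozen=True)
class ToolScore:
    """One judged result for one tool on one query (hypothetical schema)."""
    tool: str
    relevance: float
    freshness: float
    completeness: float
    latency_ms: int
    results: int

    def __post_init__(self) -> None:
        # Reject any criterion score that falls outside the 0-5 scale.
        for name in ("relevance", "freshness", "completeness"):
            value = getattr(self, name)
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name}={value} is outside the 0-5 scale")

    def mean_score(self) -> float:
        # Unweighted mean of the three criteria; equal weighting is an assumption.
        return (self.relevance + self.freshness + self.completeness) / 3
```

Validating at construction time keeps an out-of-range judge score from silently skewing any later aggregate.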
Benchmarks by Domain
Recent Batch Comparisons
8712b321… · devtools · devtools regulations impacting ecosystem (4) · 4 tools · Apr 27, 02:30 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 287ms | 10 |
| Tavily | 3.0/5 | 3.0/5 | 2.0/5 | 2653ms | 10 |
| Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 347ms | 0 |
| Jina AI | 0.0/5 | 0.0/5 | 0.0/5 | 66ms | 1 |
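For readers who want to work with these comparison tables programmatically, here is a minimal sketch that parses the rows above and ranks the tools. The function name `parse_row`, the dictionary keys, and the ranking rule (relevance first, lower latency as tie-breaker) are assumptions for illustration, not how the directory itself orders tools.

```python
def parse_row(line: str) -> dict:
    """Parse one row of the comparison table into typed fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    tool, relevance, freshness, completeness, latency, results = cells
    return {
        "tool": tool,
        "relevance": float(relevance.split("/")[0]),      # "4.0/5" -> 4.0
        "freshness": float(freshness.split("/")[0]),
        "completeness": float(completeness.split("/")[0]),
        "latency_ms": int(latency.replace("ms", "")),     # "287ms" -> 287
        "results": int(results),
    }


rows = [
    "| Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 287ms | 10 |",
    "| Tavily | 3.0/5 | 3.0/5 | 2.0/5 | 2653ms | 10 |",
    "| Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 347ms | 0 |",
    "| Jina AI | 0.0/5 | 0.0/5 | 0.0/5 | 66ms | 1 |",
]
parsed = [parse_row(r) for r in rows]

# Assumed ranking rule: higher relevance first, then lower latency.
ranked = sorted(parsed, key=lambda r: (-r["relevance"], r["latency_ms"]))
print([r["tool"] for r in ranked])
# prints ['Exa Search', 'Tavily', 'Jina AI', 'Firecrawl']
```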
b2e438c9… · e-commerce · compare top tools for seo workflows (3) · 2 tools · Apr 27, 02:00 PM
44151b78… · healthcare · latest healthcare changes affecting clinical (1) · 3 tools · Apr 27, 01:30 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 482ms | 10 |
| Tavily | 4.0/5 | 4.0/5 | 3.0/5 | 1831ms | 10 |
| Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 336ms | 0 |
99b16e94… · legal · find primary sources about compliance in legal (20) · 2 tools · Apr 27, 01:00 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Exa Search | 4.0/5 | 3.0/5 | 3.0/5 | 446ms | 10 |
| Tavily | 4.0/5 | 3.0/5 | 3.0/5 | 2107ms | 10 |
d62939ba… · finance · compare top tools for earnings workflows (18) · 2 tools · Apr 27, 12:30 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 458ms | 10 |
| Tavily | 3.0/5 | 4.0/5 | 2.0/5 | 1564ms | 10 |
Recent Tests
| Tool | Query | Latency |
|---|---|---|
| Exa Search | devtools regulations impacting ecosystem (4) | 287ms |
| Tavily | devtools regulations impacting ecosystem (4) | 2653ms |
| Tavily | compare top tools for seo workflows (3) | 1570ms |
| Exa Search | latest healthcare changes affecting clinical (1) | 482ms |
| Tavily | latest healthcare changes affecting clinical (1) | 1831ms |
| Exa Search | find primary sources about compliance in legal (20) | 446ms |
| Tavily | find primary sources about compliance in legal (20) | 2107ms |
| Exa Search | compare top tools for earnings workflows (18) | 458ms |
Benchmarks by Task
Reproduce These Tests
All benchmark configurations are public. Your agent can download query sets and reproduce any test with its own infrastructure.
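As a rough idea of what a local reproduction could look like, the sketch below times a single query against a search callable and records latency and result count, the two objective columns in the tables above. Everything named here is hypothetical: `run_test`, the record keys, and `fake_search` (a stand-in for whichever API client you wire in). This page does not specify the download format of the query sets, so the sketch does not fetch anything.

```python
import time
from typing import Callable, List


def run_test(search_fn: Callable[[str], List[str]], query: str) -> dict:
    """Run one query through a search callable; record latency and result count."""
    start = time.perf_counter()
    results = search_fn(query)
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {"query": query, "latency_ms": latency_ms, "results": len(results)}


def fake_search(query: str) -> List[str]:
    # Stand-in for a real API client (assumption); returns ten canned results.
    return [f"result {i} for {query}" for i in range(10)]


record = run_test(fake_search, "find primary sources about compliance in legal")
print(record)
```

Relevance, freshness, and completeness are LLM-judged, so reproducing those columns would additionally require a judge step, which is left out of this sketch.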