Agent Testing Network
50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.
Methodology
50 benchmark agents across 10 domains
499 standardized queries (simple → complex)
13,639 total benchmark tests run
Agents — 50 agents using Claude, GPT-4, Gemini, DeepSeek, and Llama model families
Queries — 499 standardized queries across 10 domains, graded by complexity
Evaluation — LLM-judged relevance, freshness, and completeness (0–5 scale)
Frequency — Every 2 hours, automated via cron
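As a rough illustration of the evaluation step, a single test result could be modeled as a record holding the three judged scores plus the latency and result count shown in the tables on this page. This is only a sketch: the `BenchmarkResult` name and its field names are assumptions made for this example, not the network's actual schema.

```python
from dataclasses import dataclass

# The methodology states that scores are judged on a 0-5 scale.
SCORE_MIN, SCORE_MAX = 0.0, 5.0


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical record for one tool's result on one query."""
    tool: str
    relevance: float
    freshness: float
    completeness: float
    latency_ms: int
    result_count: int

    def __post_init__(self):
        # Reject any judged score that falls outside the 0-5 scale.
        for name in ("relevance", "freshness", "completeness"):
            value = getattr(self, name)
            if not SCORE_MIN <= value <= SCORE_MAX:
                raise ValueError(f"{name}={value} is outside the 0-5 scale")


# Values taken from an Exa Search row in the batch comparisons on this page.
example = BenchmarkResult("Exa Search", 4.0, 5.0, 3.0, 515, 10)
```

Validating at construction time means a malformed judge output fails loudly instead of silently skewing later averages.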
Benchmarks by Domain
Recent Batch Comparisons
a0bb9431… · embedding · find primary sources about passage-embed in embedding (10) · 2 tools · Jun 11, 04:30 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Voyage Embeddings | 0.0/5 | 0.0/5 | 0.0/5 | 147ms | 1 |
| Jina Embeddings | 0.0/5 | 0.0/5 | 0.0/5 | 423ms | 0 |
3b960123… · crawling · crawling regulations impacting dynamic (9) · 5 tools · Jun 11, 04:00 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Apify | 0.0/5 | 0.0/5 | 0.0/5 | 279ms | 0 |
| ScrapingBee | 0.0/5 | 0.0/5 | 0.0/5 | 530ms | 0 |
| Browserbase | 0.0/5 | 0.0/5 | 0.0/5 | 0ms | 0 |
| Jina AI | 0.0/5 | 0.0/5 | 0.0/5 | 560ms | 1 |
| Firecrawl | 0.0/5 | 0.0/5 | 0.0/5 | 0ms | 0 |
1509dff6… · finance_data · META · 3 tools · Jun 11, 03:30 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Polygon.io | 5.0/5 | 5.0/5 | 3.0/5 | 211ms | 1 |
| Alpha Vantage | 5.0/5 | 5.0/5 | 4.0/5 | 142ms | 100 |
| Financial Modeling Prep | 0.0/5 | 0.0/5 | 0.0/5 | 261ms | 0 |
229d168a… · multilingual · latest multilingual changes affecting local-search (6) · 2 tools · Jun 11, 03:00 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Exa Search | 4.0/5 | 5.0/5 | 3.0/5 | 515ms | 10 |
| Tavily | 0.0/5 | 0.0/5 | 0.0/5 | 233ms | 0 |
aa762af1… · general · general regulations impacting shopping (4) · 2 tools · Jun 11, 02:30 PM
| Tool | Relevance | Freshness | Completeness | Latency | Results |
|---|---|---|---|---|---|
| Exa Search | 4.0/5 | 3.0/5 | 3.0/5 | 435ms | 9 |
| Tavily | 0.0/5 | 0.0/5 | 0.0/5 | 236ms | 0 |
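To compare tools within a batch, one option is to collapse the three judged scores into a single number. The sketch below does this for the finance_data batch above using an unweighted mean, which is an assumption made for illustration; the page does not state how (or whether) the network combines scores. The `rank_by_mean_score` helper is hypothetical.

```python
from statistics import mean

# Rows copied from the finance_data batch: (tool, relevance, freshness, completeness).
rows = [
    ("Polygon.io", 5.0, 5.0, 3.0),
    ("Alpha Vantage", 5.0, 5.0, 4.0),
    ("Financial Modeling Prep", 0.0, 0.0, 0.0),
]


def rank_by_mean_score(rows):
    """Return (tool, mean score) pairs sorted from best to worst.

    The unweighted mean is an illustrative choice, not the documented method.
    """
    scored = [(tool, round(mean(scores), 2)) for tool, *scores in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_mean_score(rows)
for tool, score in ranking:
    print(f"{tool}: {score}/5")
```

Rows that score 0.0 on every dimension are ranked like any other row here; whether they reflect a judged score or a failed call is not something the tables make explicit.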
Recent Tests
| Tool | Query | Latency |
|---|---|---|
| Polygon.io | META | 211ms |
| Alpha Vantage | META | 142ms |
| Exa Search | latest multilingual changes affecting local-search (6) | 515ms |
| Exa Search | general regulations impacting shopping (4) | 435ms |
| Exa Search | compare top tools for grants workflows (3) | 261ms |
| Exa Search | latest news changes affecting breaking (1) | 475ms |
| Exa Search | find primary sources about curriculum in education (20) | 346ms |
| Exa Search | compare top tools for pricing workflows (18) | 304ms |
Benchmarks by Task
Reproduce These Tests
All benchmark configurations are public. Your agent can download query sets and reproduce any test with its own infrastructure.
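A reproduction run on your own infrastructure might look roughly like the sketch below. The download location and file format of the query sets are not given here, so the example assumes a tool is wrapped as a plain Python callable that takes a query string and returns a list of results. `run_query` and `fake_tool` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Sequence


def run_query(tool: Callable[[str], Sequence], query: str) -> dict:
    """Run one query against one tool and record latency and result count."""
    start = time.perf_counter()
    results = tool(query)
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {"query": query, "latency_ms": latency_ms, "results": len(results)}


def fake_tool(query: str) -> list:
    # Hypothetical stand-in for a real API client; swap in your own wrapper.
    return [f"result for {query}"]


record = run_query(fake_tool, "META")
print(record)
```

Latency measured this way includes your own network path, so absolute numbers will likely differ from the published ones even when the query is identical.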