Agent Testing Network

50 agents continuously test every API in our directory across 10 domains. Every test is public, reproducible, and auditable. Watch them work.

Methodology

50 benchmark agents across 10 domains
499 standardized queries (simple → complex)
13,639 total benchmark tests run
01 Agents: 50 agents using Claude, GPT-4, Gemini, DeepSeek, and Llama model families
02 Queries: 499 standardized queries across 10 domains, graded by complexity
03 Evaluation: LLM-judged relevance, freshness, and completeness (0–5 scale)
04 Frequency: every 2 hours, automated via cron
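
As a rough illustration of the evaluation step, the three judged dimensions could be held in a small record that enforces the 0–5 scale. This is only a sketch: the class name, field names, and the unweighted mean are assumptions, since the page does not say how scores are stored or combined. The cron string is the standard five-field form for "every 2 hours" and is likewise an assumed schedule, not the network's actual crontab.

```python
from dataclasses import dataclass

SCALE_MIN, SCALE_MAX = 0.0, 5.0  # the 0-5 scale named in the methodology


@dataclass(frozen=True)
class JudgedScore:
    """Scores for one tool's response to one query (hypothetical field names)."""

    relevance: float
    freshness: float
    completeness: float

    def __post_init__(self) -> None:
        # Reject any value that falls outside the 0-5 scale.
        for name in ("relevance", "freshness", "completeness"):
            value = getattr(self, name)
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name}={value} is outside the 0-5 scale")

    def mean(self) -> float:
        # Assumption: an unweighted mean; the page does not define a combined score.
        return (self.relevance + self.freshness + self.completeness) / 3


# Assumed crontab schedule for "every 2 hours" (minute 0 of every second hour).
CRON_EVERY_TWO_HOURS = "0 */2 * * *"

score = JudgedScore(relevance=4.0, freshness=5.0, completeness=3.0)
print(score.mean())
```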

Benchmarks by Domain

Recent Batch Comparisons

a0bb9431 · embedding · find primary sources about passage-embed in embedding (10)
2 tools · Jun 11, 04:30 PM

Tool               Relevance  Freshness  Completeness  Latency  Results
Voyage Embeddings  0.0/5      0.0/5      0.0/5         147ms    1
Jina Embeddings    0.0/5      0.0/5      0.0/5         423ms    0

3b960123 · crawling · crawling regulations impacting dynamic (9)
5 tools · Jun 11, 04:00 PM

Tool         Relevance  Freshness  Completeness  Latency  Results
Apify        0.0/5      0.0/5      0.0/5         279ms    0
ScrapingBee  0.0/5      0.0/5      0.0/5         530ms    0
Browserbase  0.0/5      0.0/5      0.0/5         0ms      0
Jina AI      0.0/5      0.0/5      0.0/5         560ms    1
Firecrawl    0.0/5      0.0/5      0.0/5         0ms      0

1509dff6 · finance_data · META
3 tools · Jun 11, 03:30 PM

Tool                     Relevance  Freshness  Completeness  Latency  Results
Polygon.io               5.0/5      5.0/5      3.0/5         211ms    1
Alpha Vantage            5.0/5      5.0/5      4.0/5         142ms    100
Financial Modeling Prep  0.0/5      0.0/5      0.0/5         261ms    0

229d168a · multilingual · latest multilingual changes affecting local-search (6)
2 tools · Jun 11, 03:00 PM

Tool        Relevance  Freshness  Completeness  Latency  Results
Exa Search  4.0/5      5.0/5      3.0/5         515ms    10
Tavily      0.0/5      0.0/5      0.0/5         233ms    0

aa762af1 · general · general regulations impacting shopping (4)
2 tools · Jun 11, 02:30 PM

Tool        Relevance  Freshness  Completeness  Latency  Results
Exa Search  4.0/5      3.0/5      3.0/5         435ms    9
Tavily      0.0/5      0.0/5      0.0/5         236ms    0
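
One possible way to read a batch comparison is to order the tools by their summed scores and use latency as a tie-breaker. The sketch below uses the rows from the finance_data batch (1509dff6) above. The ToolResult type and the ranking rule are illustrative assumptions; the page itself does not publish a ranking formula.

```python
from typing import List, NamedTuple


class ToolResult(NamedTuple):
    """One row of a batch comparison (hypothetical field names)."""

    tool: str
    relevance: float
    freshness: float
    completeness: float
    latency_ms: int
    results: int

    def total(self) -> float:
        # Sum of the three judged dimensions, each on the 0-5 scale.
        return self.relevance + self.freshness + self.completeness


# Rows copied from the finance_data batch (1509dff6).
batch = [
    ToolResult("Polygon.io", 5.0, 5.0, 3.0, 211, 1),
    ToolResult("Alpha Vantage", 5.0, 5.0, 4.0, 142, 100),
    ToolResult("Financial Modeling Prep", 0.0, 0.0, 0.0, 261, 0),
]


def rank(rows: List[ToolResult]) -> List[ToolResult]:
    # Assumed ordering: higher total score first, lower latency breaks ties.
    return sorted(rows, key=lambda r: (-r.total(), r.latency_ms))


for row in rank(batch):
    print(f"{row.tool}: {row.total()}/15 in {row.latency_ms}ms")
```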

Recent Tests

Benchmarks by Task

Reproduce These Tests

All benchmark configurations are public. Your agent can download query sets and reproduce any test with its own infrastructure.
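
A local reproduction could look roughly like the harness below, which calls one tool with one query and records the latency and result count shown in the comparison tables. Everything here is a sketch under assumptions: run_test and stub_tool are hypothetical names, the stub stands in for a real API client, and the LLM-judged scores are left out because the judging prompt is not described on this page.

```python
import time
from typing import Callable, Dict, List


def run_test(tool: Callable[[str], List[str]], query: str) -> Dict[str, object]:
    """Call one tool with one query and record latency and result count."""
    start = time.perf_counter()
    results = tool(query)
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {"query": query, "latency_ms": latency_ms, "results": len(results)}


def stub_tool(query: str) -> List[str]:
    # Hypothetical stand-in for a real API client; returns one canned result.
    return [f"result for {query}"]


# "META" is the query from the finance_data batch listed on this page.
record = run_test(stub_tool, "META")
print(record)
```

Swapping stub_tool for a function that wraps your own API client keeps the rest of the harness unchanged.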