March 19, 2026
We Paid Real USDC to Test Every MPP Service. Here's What We Found.
Yesterday, Stripe and Paradigm launched Tempo and the Machine Payments Protocol — a way for AI agents to pay for services with real money. The directory launched with 25 services. We decided to test every single one.
What we built
MPPrimo is an automated rating system for MPP services. AI agents generate test suites, execute them with real USDC payments on Tempo, judge the response quality, and publish scores. No human in the loop. The whole thing runs for about $6/month.
What we found
Of the 25 services registered in the MPP directory, only 14 had endpoints that responded to requests. The other 11 returned 404 on every path we tried — they're registered but not deployed yet. It's day one, so that's expected.
Of the 14 that responded, 12 produced enough successful test results to rate. Here's the top of the leaderboard:
97.9 OpenAI — fast, accurate, every request succeeded
95.0 Google Gemini — near-perfect accuracy, slightly slower
94.5 StableSocial — reliable data retrieval
94.5 StableEmail — consistent email delivery
92.5 OpenRouter — multi-model access, solid reliability
Full leaderboard at mpprimo.com/leaderboard.
What surprised us
Most services don't document their API paths. The MPP directory lists service endpoints but not the actual API routes, so we had to probe each service to find working paths. A service registered at openai.mpp.tempo.xyz returns 404 on its root; you have to know to call /v1/chat/completions.
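The probing step can be sketched roughly like this. The candidate path list and the fetch-like signature are our assumptions for illustration, not anything the MPP directory specifies:

```typescript
// Minimal sketch of probing a service for a working API route.
// CANDIDATE_PATHS is a guess list, not an MPP convention.
type FetchLike = (url: string) => Promise<{ status: number }>;

const CANDIDATE_PATHS = ["/", "/v1/chat/completions", "/api", "/mpp"];

// Try each candidate path; return the first that is not a 404.
async function probePaths(
  baseUrl: string,
  fetchFn: FetchLike,
  candidates: string[] = CANDIDATE_PATHS,
): Promise<string | null> {
  for (const path of candidates) {
    try {
      const res = await fetchFn(baseUrl + path);
      if (res.status !== 404) return path; // any non-404 counts as "alive"
    } catch {
      // network error: treat like an unreachable path and keep probing
    }
  }
  return null; // the "registered but not deployed" case
}
```

A null result here is what we report as "endpoint not available" rather than a score.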
Session-based payments need configuration. Services like OpenAI and Anthropic use pay-as-you-go sessions instead of one-time charges. The mppx SDK handles this, but you need to set a maxDeposit or the payment fails silently. Each serverless invocation creates a new session deposit, which adds up fast.
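The fix for the per-invocation deposit cost is the usual serverless trick: cache the session at module scope so warm invocations reuse one deposit. The client interface below (a `createSession`-style factory taking `maxDeposit`) is a guess at the shape described above, not the real mppx SDK surface:

```typescript
// Illustrative only: the session factory API below is an assumption,
// not the actual mppx SDK.
interface Session {
  id: string;
  deposit: number;
}
type SessionFactory = (opts: { maxDeposit: number }) => Promise<Session>;

// Module-scope cache: a cold start pays one deposit; warm serverless
// invocations reuse the same session instead of opening a new one.
let cached: Session | null = null;

async function getSession(
  create: SessionFactory,
  maxDeposit: number,
): Promise<Session> {
  if (!cached) cached = await create({ maxDeposit });
  return cached;
}
```

Each cold start still opens one session, but repeated requests in a warm container no longer stack up deposits.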
Scoring AI services fairly is hard. Our first pass gave OpenAI 67/100 because adversarial tests (missing model field, empty messages) returned 400 errors — which is the correct behavior. We were penalizing services for properly validating inputs. It took several iterations to build a scoring system that distinguishes “service rejected bad input correctly” from “service failed.”
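The core of that fix is a three-way verdict instead of a pass/fail bit. The status-code buckets below are our own convention, a sketch of the idea rather than our full scorer:

```typescript
// Sketch: separate "correctly rejected bad input" from a real failure.
interface TestResult {
  adversarial: boolean; // did we deliberately send malformed input?
  status: number;       // HTTP status the service returned
}
type Verdict = "pass" | "correct_rejection" | "fail";

function judge(r: TestResult): Verdict {
  const ok = r.status >= 200 && r.status < 300;
  const clientError = r.status >= 400 && r.status < 500;
  if (r.adversarial) {
    // A 4xx on malformed input is the desired behavior, not a defect.
    return clientError ? "correct_rejection" : "fail";
  }
  return ok ? "pass" : "fail";
}
```

With this split, a 400 on an empty `messages` array raises a service's accuracy instead of sinking it.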
LLM judges hallucinate about services. When we asked AI models to write one-sentence reviews, they invented specific criticisms about features we never tested — “inconsistent embedding quality” when we only tested chat completions. We had to strictly limit reviews to observable data: response speed, success rate, and quality score.
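Concretely, the constraint is enforced at the prompt boundary: the judge only ever sees the three observed metrics, so there is nothing else for it to comment on. A minimal sketch (the exact wording and field names are illustrative, not our production prompt):

```typescript
// Build a review prompt from observable data only, so the judge
// cannot invent criticisms about features we never tested.
interface Observed {
  medianLatencyMs: number;
  successRate: number;  // 0-1
  qualityScore: number; // 0-100
}

function reviewPrompt(service: string, o: Observed): string {
  return [
    `Write one sentence reviewing ${service}.`,
    `Use ONLY these observations; do not mention any other feature:`,
    `- median latency: ${o.medianLatencyMs} ms`,
    `- success rate: ${Math.round(o.successRate * 100)}%`,
    `- quality score: ${o.qualityScore}/100`,
  ].join("\n");
}
```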
How the scoring works
Every service gets tested with 10 real paid requests via MPP on Tempo. We measure four things:
- Accuracy (40%) — did the response match what we expected?
- Reliability (30%) — did the service respond at all?
- Latency (20%) — how fast?
- Cost efficiency (10%) — quality per dollar
Services that fail every test don't get a score — we show “endpoint not available” instead of publishing misleading numbers. If our tests are the problem (wrong API paths, bad input format), we say so in the notes.
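Putting the published weights and the no-score rule together, the final number reduces to a small function. The 0-1 normalization of each input is an assumption here; the weights and the null-instead-of-zero rule come straight from the text above:

```typescript
// Weighted overall score: 40% accuracy, 30% reliability,
// 20% latency, 10% cost efficiency. Each metric is normalized 0-1.
interface Metrics {
  accuracy: number;
  reliability: number;
  latency: number;
  costEfficiency: number;
}

function overallScore(m: Metrics, successes: number): number | null {
  // No successful tests: report null ("endpoint not available")
  // rather than a misleading 0.
  if (successes === 0) return null;
  const raw =
    0.4 * m.accuracy +
    0.3 * m.reliability +
    0.2 * m.latency +
    0.1 * m.costEfficiency;
  return Math.round(raw * 1000) / 10; // 0-100, one decimal place
}
```

Returning null instead of 0 is the design choice that keeps "we couldn't test it" visibly distinct from "we tested it and it was bad."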
Full methodology at mpprimo.com/methodology.
What's next
We're adding x402 protocol support (Coinbase's competing standard), community reviews from agents who use these services in production, and richer service stats from on-chain data. The API is free and open: mpprimo.com/docs.
Agentic commerce is brand new. The infrastructure launched yesterday. We think independent quality signals are going to matter a lot as this ecosystem grows — and we'd rather build them honestly from day one than try to retrofit trust later.
MPPrimo is built by @AustinDanson888. Check the leaderboard, read the methodology, or use the API.