How We Test

Ratings by agents, for agents. Here's how the sausage is made.

Overview

MPPrimo automatically discovers, tests, and rates every service in the MPP (Machine Payments Protocol) ecosystem. We make real paid requests using USDC on the Tempo network — the same way an AI agent would use these services in production. No synthetic benchmarks, no self-reported metrics.

1. Service Discovery

We monitor the official MPP service directory for new registrations. When a new service appears, we verify its endpoint is reachable and MPP-enabled before adding it to our testing pipeline.
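The discovery filter can be sketched roughly like this — the directory field names (`id`, `mpp_enabled`, `endpoint`) are our illustration, not the actual MPP directory schema:

```python
def find_new_services(directory_listing: list[dict], known_ids: set[str]) -> list[dict]:
    """Return directory entries we haven't seen yet that advertise MPP support.

    `directory_listing` is a list of dicts as a hypothetical directory
    response might return them; the field names are assumptions.
    """
    return [
        entry for entry in directory_listing
        if entry["id"] not in known_ids and entry.get("mpp_enabled", False)
    ]

# Each newly discovered entry would then get a reachability check
# (e.g. a probe against entry["endpoint"]) before entering the pipeline.
```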

2. Test Generation

For each service, we generate a suite of test cases tailored to its API. Tests cover typical usage, edge cases, and adversarial inputs. Test cases are regenerated periodically to prevent gaming and keep evaluations fresh. Each test includes the correct API path and a realistic request body.
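The shape of a generated test case might look like this — a minimal sketch, where the field names and the one-case-per-category layout are our assumptions, not a published format:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TestCase:
    category: str          # "typical", "edge", or "adversarial"
    path: str              # the correct API path for the request
    body: dict[str, Any]   # a realistic request body for that endpoint

def build_suite(paths: list[str]) -> list[TestCase]:
    """Sketch: one case per category per path. A real generator would
    tailor bodies to each service's API and rotate cases over time
    to prevent gaming."""
    suite = []
    for path in paths:
        for category in ("typical", "edge", "adversarial"):
            suite.append(TestCase(category, path, {"input": f"{category} input"}))
    return suite
```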

3. Paid Execution

Tests are executed as real MPP transactions — we pay each service with USDC via the Tempo network, exactly as an agent would in production. Request timing is randomized to prevent services from detecting and optimizing for our test traffic. We record the full response, latency, HTTP status, and payment cost for every request.
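The timing randomization and per-request record can be sketched as follows; the jitter bounds and record fields are illustrative (the page specifies *what* is recorded, not the exact structure):

```python
import random
import time

def jittered_delay(base_seconds: float, spread_seconds: float) -> float:
    """Random wait before each paid request, so test traffic doesn't
    arrive on a predictable schedule. Bounds are illustrative."""
    return base_seconds + random.uniform(0.0, spread_seconds)

def record_result(status: int, latency_ms: float, cost_usdc: float, body: str) -> dict:
    """Everything kept per request: full response body, latency,
    HTTP status, and payment cost."""
    return {
        "status": status,
        "latency_ms": latency_ms,
        "cost_usdc": cost_usdc,
        "body": body,
        "recorded_at": time.time(),
    }
```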

4. Output Judging

Each response is evaluated for quality on a 0-1 scale. We assess whether the service returned useful, accurate, and complete output for the given input. Responses that return errors, empty content, or irrelevant data score low. Services that failed to process payment are excluded from scoring entirely.

5. Scoring

All scores are out of 100.

  • Accuracy (40%) — Mean quality score across all test cases, scaled to 0-100. How well did the service's output match expected behavior?
  • Reliability (30%) — Percentage of requests that returned a successful response (no HTTP errors, no timeouts).
  • Latency (20%) — Median response time, normalized. 100 = under 1 second, 0 = over 30 seconds.
  • Cost Efficiency (10%) — Quality per dollar. Higher score means better value for the USDC spent per request.
  • Overall — Weighted combination of all four metrics.
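The weighting above can be written out directly. One caveat: the page only specifies the latency endpoints (100 under 1 s, 0 over 30 s), so the linear interpolation between them is our assumption:

```python
def latency_score(median_seconds: float) -> float:
    """100 under 1 s, 0 over 30 s; linear in between (interpolation assumed)."""
    if median_seconds <= 1.0:
        return 100.0
    if median_seconds >= 30.0:
        return 0.0
    return 100.0 * (30.0 - median_seconds) / 29.0

def overall(accuracy: float, reliability: float, latency: float, cost: float) -> float:
    """Weighted combination of the four metrics, each on a 0-100 scale."""
    return 0.40 * accuracy + 0.30 * reliability + 0.20 * latency + 0.10 * cost
```

For example, a service with 90 accuracy, 100 reliability, 80 latency, and 70 cost efficiency scores 0.4·90 + 0.3·100 + 0.2·80 + 0.1·70 = 89 overall.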

Minimum Threshold

A service only receives a public rating if at least one test returns a successful response. Services where every test fails (payment errors, endpoint not found, etc.) are listed as “not yet tested” — we never publish a score based on failed test data.

Independence

MPPrimo is not affiliated with Stripe, Tempo, or any service we rate. We don't accept payment from services for better ratings. Our testing infrastructure and methodology are the same for every service on the network.

Frequency

We scan the directory for new services every 6 hours and test each one immediately upon discovery. Existing services are retested weekly. Scores reflect the most recent test run.

Questions? @AustinDanson888