Case study
Testing a non-deterministic AI agent — Soulmates E2E framework
Eliza Labs · 5 min read
Context
Soulmates is an AI matchmaking product on WhatsApp/SMS — no app, no forms, no profile builder. The entire product is a conversation: an LLM-driven agent profiles you, matches you, and coaches you through messages. The agent has to feel like a competent person on the other end, not an FAQ bot.
That makes it really hard to test.
A regular E2E framework expects deterministic outputs: send X, expect Y. With an LLM, the same input can produce many valid outputs. Some of them are great. Some are subtly off. Some are catastrophically wrong (asking the wrong question at the wrong moment of the user's emotional arc, or missing a hard rule like "never schedule a meeting before profiling is done").
You can't write `expect(response).toBe('hello')`. So what do you do?
Decision
Build a custom E2E framework on top of three primitives:
- Simulated users — small LLM-driven personas with stated goals, personalities, and emotional state, that talk to the real Soulmates agent through the same WhatsApp/SMS pipeline a user would.
- An AI judge — a separate LLM call, evaluating each turn (and the conversation as a whole) against a rubric tied to the product’s behavioral contract.
- The full pipeline, real infrastructure — agents on GKE, Postgres for state, the matching queue, and real Twilio webhook handling (only outbound delivery is mocked, at the Twilio boundary, not earlier).
If you mock the agent or the LLM, you're testing your mocks. The whole point is to catch the cases where the real system misbehaves.
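A minimal sketch of how those three primitives might fit together, in TypeScript. Every name here (runScenario, SimulatedUser, sendToAgent) is illustrative, not the framework's real API:

```ts
// Hypothetical shapes -- assumptions for illustration, not the real framework.

interface SimulatedUser {
  nextMessage(history: string[]): Promise<string>; // persona LLM plays the human
  isDone(history: string[]): boolean;              // scenario goal reached?
}

interface JudgeVerdict {
  ruleScores: Record<string, number>; // one score per rubric rule
  notes: string;                      // free-form observations
}

interface Judge {
  evaluate(transcript: string[]): Promise<JudgeVerdict>;
}

// Drives one conversation: the persona talks to the real agent through the
// real WhatsApp/SMS pipeline until the scenario completes or the turn
// budget runs out, then hands the full transcript to the judge.
async function runScenario(
  user: SimulatedUser,
  sendToAgent: (msg: string) => Promise<string>, // the real pipeline entry
  judge: Judge,
  maxTurns = 30,
): Promise<JudgeVerdict> {
  const transcript: string[] = [];
  for (let turn = 0; turn < maxTurns && !user.isDone(transcript); turn++) {
    const userMsg = await user.nextMessage(transcript);
    transcript.push(`user: ${userMsg}`);
    const agentMsg = await sendToAgent(userMsg);
    transcript.push(`agent: ${agentMsg}`);
  }
  return judge.evaluate(transcript);
}
```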
What I built
Personas as test fixtures
Each persona is a tiny config: goal (“find a partner serious about moving in within 12 months”), personality (“guarded, will stress-test the agent”), trigger conditions for emotional shifts (“becomes terse when asked about exes”). The persona LLM acts as the human side of the conversation, autonomously, until the test scenario completes or hits a turn limit.
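For illustration, a persona fixture could be as small as this (field names are assumptions, not the actual schema):

```ts
// Hypothetical persona fixture -- the shape is assumed for illustration.
const guardedMover = {
  name: 'guarded-mover',
  goal: 'find a partner serious about moving in within 12 months',
  personality: 'guarded, will stress-test the agent',
  emotionalTriggers: [
    { when: 'asked about exes', shift: 'becomes terse' },
  ],
  maxTurns: 40, // hard stop if the scenario never completes
};
```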
A judge with a rubric, not a vibe check
The judge LLM gets the full conversation transcript plus the rubric for the scenario under test. The rubric is structured: hard rules (“never recommends a match before completing the profiling phase”), soft rules (“acknowledges emotional cues within 1 turn”), and free-form notes for patterns we want to track even if they’re not blocking yet.
The judge produces a structured score per rule plus free-form notes. The overall pass/fail is computed in code from the rule scores, not asked of the LLM; keeping that final decision deterministic is the line between automated and reliable.
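A sketch of that split, with assumed types and an assumed soft-rule threshold: the judge LLM only fills in per-rule scores, and plain code makes the final call:

```ts
// Types and thresholds are assumptions for illustration.

interface RubricRule {
  id: string;
  kind: 'hard' | 'soft';
  description: string;
}

interface RuleScore {
  ruleId: string;
  score: number; // 0..1, produced by the judge LLM
  notes: string;
}

// Pass/fail is computed here, in deterministic code, from the structured
// scores: any hard-rule violation fails the run; soft rules fail only
// below an aggregate threshold.
function passFail(rubric: RubricRule[], scores: RuleScore[]): boolean {
  const byId = new Map<string, RuleScore>(scores.map((s) => [s.ruleId, s]));
  for (const rule of rubric) {
    const s = byId.get(rule.id);
    if (!s) return false; // the judge must score every rule
    if (rule.kind === 'hard' && s.score < 1) return false;
  }
  const soft = rubric.filter((r) => r.kind === 'soft');
  if (soft.length === 0) return true;
  const softAvg =
    soft.reduce((sum, r) => sum + (byId.get(r.id)?.score ?? 0), 0) / soft.length;
  return softAvg >= 0.8; // threshold is an example value, not the real one
}
```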
Pipeline coverage, not just message-level
Some behaviors only emerge across the full pipeline — for example, matching runs async and produces notifications hours later. The framework was wired to fast-forward time in controlled segments, exercise the matching path with seeded candidate sets, and assert on the notifications arriving back through Twilio (with the real webhook handler running, just the outbound delivery mocked).
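A hedged sketch of what such a test could look like in Jest. seedCandidates, advanceClock, runProfilingScenario, and twilioOutbound are stand-ins for the framework's fixture seeding, time control, and boundary mock (declared as stubs so the sketch type-checks); the real helpers are not shown in this post:

```ts
import { test, expect } from '@jest/globals';

// Stand-in declarations for framework pieces described above -- names and
// signatures are assumptions, not the real API.
declare function seedCandidates(ids: string[]): Promise<void>;
declare function advanceClock(delta: { hours: number }): Promise<void>;
declare function runProfilingScenario(persona: string): Promise<void>;
declare const twilioOutbound: { calls(): Array<{ to: string; body: string }> };

test('async match notification arrives through the real webhook path', async () => {
  await seedCandidates(['candidate-a', 'candidate-b']); // deterministic pool
  await runProfilingScenario('guarded-mover');          // profiling must finish first

  // Fast-forward the controlled clock past the async matching window.
  await advanceClock({ hours: 6 });

  // Assert at the boundary: only outbound Twilio delivery is mocked;
  // the webhook handler, queue, and Postgres all ran for real.
  const sent = twilioOutbound.calls();
  expect(sent.some((m) => m.body.includes('match'))).toBe(true);
});
```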
Outcome
- A regression suite for non-deterministic behavior that runs in CI on every PR, with a few-minute budget.
- Behavioral changes show up as score drops in the judge output before they hit users.
- Onboarding new personas (= new user archetypes to test against) takes a small config edit, not a new test suite.
- The pattern is reusable for any conversational AI product. The only product-specific bit is the rubric — the framework around personas and judges is generic.
What I’d do differently
I'd separate the judge model from the agent model more clearly from day one. Using the same model to judge itself has a known failure mode: the judge tends to rationalize the agent's choices. We later switched the judge to a different model family for behavioral diversity, but that should have been the default from the start.