Book a scoping call

OARUX // ground-truth agent evals

Propelling the human-agent experience.

Know your agent still earns real users' trust on every release. We discover the behavioral axes that predict trust, calibrate a judge against real human-coded behavior, and deploy it into your own observability pipeline.

judge: pass // agreement: drift // regression: fail // signal: flagged
The alignment gap

The hard part isn't running evals. It's knowing they measure what real users value.

A mature AI team already runs golden sets, LLM-judge graders, red-teaming, and tracing. That stack is competent and necessary. There is one input it can't generate from inside its own walls: criteria grounded in how real users actually behave.

01

The in-house judge is an echo chamber

An LLM-as-a-judge can only grade against the criteria written for it, and in-house those criteria inherit the team's own model of “good.” An agent can clear the internal judge while real users quietly churn. The loop optimizes a proxy that was never checked against real user behavior.

02

A score, never the why

Benchmarks and thumbs-up/down feedback tell you a session scored low. They do not tell you why. The signals that are easy to capture (latency, token cost, semantic similarity) are not the friction axes that actually move retention.

03

Likert is noise to an engineer

Where human signal does get collected, it is usually a coarse rating. The difference between a 3 and a 4 on a 5-point trust scale gives an engineer nothing to change, and a one-off study goes stale the moment the next model swap ships.

The OARUX pipeline

Open, axial, rubric, deploy.

Inductive grounded theory applied to AI evals. The sequence is load-bearing: reliability is proven before anything becomes a metric, and the metric is calibrated before anything deploys. Each gate is a hard stop.

  1. Step 1 Inductive discovery

    Open coding

    Unscripted lab sessions with 50 to 100 participants pursuing their own real goals. Two or more expert raters tag emergent friction line by line, grounded in observed behavior rather than a pre-built dimension list.

    GATE // Target: Fleiss κ ≥ 0.80 inter-rater reliability before anything proceeds.

  2. Step 2 Derive the axes

    Axial coding

    Cluster the reliable codes into the category-specific behavioral axes that recur for this task type, then test which axes are associated with abandonment. The axes are discovered from the data, not imported from a fixed 8-to-12-dimension template.

    GATE // Axes named behaviorally, statistically supported, reported with their limits.

  3. Step 3 Binary, never Likert

    Rubric design

    Each axis becomes a binary or structured check an engineer can act on. Calibrate the LLM judge against a human-coded golden set, using a confusion matrix to see where it disagrees with expert labels, and iterate until its agreement meets the target.

    GATE // Target: judge F1 ≥ 0.88 against the held-out golden set.

  4. Step 4 The judge that runs forever

    Deploy

    Install the calibrated judge into your Langfuse instance (the primary target, self-hostable in your VPC) or Arize Phoenix, auto-score live traces, and alert when scores drop. Then we hand back the golden set and the runbook, and leave.

    GATE // Live scores reproduce golden-set agreement; every commit and model swap is re-checked.

These targets are the bars every engagement clears before a number ships. The numbers you receive are reported as measured, never rounded up to the target.
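
The Step 1 gate can be checked in a few lines. The sketch below computes Fleiss' κ from per-item rater labels and compares it with the 0.80 target stated above. The function name `fleiss_kappa`, the label strings, and the toy data are illustrative assumptions, not OARUX tooling or study data.

```python
from collections import Counter

KAPPA_TARGET = 0.80  # Step 1 gate from the pipeline description


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must be labeled by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number of raters (>= 2)")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_chance = sum(s ** 2 for s in shares)
    if p_chance == 1.0:
        raise ValueError("only one category was used; kappa is undefined")

    return (p_observed - p_chance) / (1 - p_chance)


# Toy data (hypothetical): 4 transcript lines, 3 raters, two codes.
toy = [
    ["friction", "friction", "friction"],
    ["friction", "friction", "none"],
    ["none", "none", "none"],
    ["none", "none", "friction"],
]
kappa = fleiss_kappa(toy)
passes_gate = kappa >= KAPPA_TARGET
print(round(kappa, 3), passes_gate)  # 0.333 False
```

On the toy data the raters split on two of four lines, so κ lands well under the target and the gate stays closed.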

Human-in-the-loop

Real users produce the ground truth.

Every axis and every rubric traces back to behavior observed from real participants in moderated labs, coded line by line by two or more expert raters. Those people are the instrument your judge is calibrated against.

Research participants in a moderated usability lab: P-01, P-02, P-03, P-04, P-05, P-06

Real humans in the loop // ground-truth reference outcomes // multi-turn interaction quality

Reference case study // Track A

The method, executed end to end on a production agent.

Track A is the OARUX pipeline run end to end on a live, internal production agent. Inter-rater reliability was measured first, and the judge was then calibrated against a human-coded golden set. The deliverable was that judge, ready to run on live traces, not a readout deck.

0.738 Fleiss κ

Inter-rater reliability on the human coding. “Substantial” agreement on the Landis and Koch scale (0.61 to 0.80), reported as delivered, not rounded up to the 0.80 target.

0.92 TPR

True-positive rate. When the golden set says a behavior is present, the calibrated judge catches it 92 percent of the time.

0.86 TNR

True-negative rate. When the behavior is absent, the judge correctly stays silent 86 percent of the time.

Confusion matrix (illustrative)

                  Golden set: present   Golden set: absent
Judge: present    92                    14
Judge: absent     8                     86

Constructed by applying the two delivered rates to round bases of 100 present and 100 absent samples, to show the shape of the calibration. The cell counts are not the delivered counts and should not be cited as such. The auditable facts are the two rates and the κ.
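
To show how the rates relate to the cells, here is a minimal sketch that recomputes TPR and TNR from the illustrative counts and derives precision and F1 from the same cells. The function name `judge_metrics` is an assumption. The F1 it prints is a property of the 100/100 illustrative bases only, not a delivered figure, because precision and F1 shift with the real balance of present and absent samples.

```python
def judge_metrics(tp, fp, fn, tn):
    """Agreement metrics for a binary judge scored against human labels."""
    tpr = tp / (tp + fn)        # recall on samples where the behavior is present
    tnr = tn / (tn + fp)        # specificity on samples where it is absent
    precision = tp / (tp + fp)  # share of judge "present" calls that are right
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "tnr": tnr, "precision": precision, "f1": f1}


# Illustrative cells from the matrix above (not the delivered counts).
m = judge_metrics(tp=92, fp=14, fn=8, tn=86)
print({k: round(v, 3) for k, v in m.items()})
# {'tpr': 0.92, 'tnr': 0.86, 'precision': 0.868, 'f1': 0.893}
```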

Honest scope: these establish judge-to-human agreement, a calibration result, not a business-outcome claim. Where OARUX links rubric metrics to retention or satisfaction, those links are reported as association. Only a controlled A/B test on a deployed change can establish that improving a metric drives an outcome.

What we do

One lifecycle. Five ways in.

Five offerings along a single arc: Discover, Design, Validate, Operate, Benchmark. Land on a sharp, self-contained piece and grow it into a standing eval program. Every engagement hands back deployable infrastructure with stated reliability, not a deck.

  1. Discover

    Signal

    Map where real demand and friction live in your domain, using public and community data competitors aren't watching, so the roadmap is grounded before the first dollar is spent.

  2. Design

    Participatory Agent Design

    Real target users shape the agent on a working prototype, so its behavior is right while it is still cheap to change, instead of the gaps surfacing after launch.

  3. Operate

    Ongoing Evals

    Auto-score live traces against your calibrated judge and catch regressions before users feel them, as the model, prompts, and user base shift.

  4. Benchmark

    ROAB Benchmark

    The Reference-Outcome Agent Benchmark (ROAB) answers “is v2 actually better for real users?” and “are we ahead of the competitor?” on task completion, with real users and ground-truth reference outcomes, not vibes.

Most teams start with the keystone or a self-contained benchmark, then grow into the retained program. We will scope the right entry point with you.
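
As a rough picture of what the Operate offering's regression catching could look like, the sketch below keeps a rolling pass rate over binary judge verdicts and flags a drop against a baseline. It is plain Python and calls no Langfuse or Arize Phoenix API. The class name `PassRateMonitor`, the window size, and the drop threshold are all hypothetical choices, not OARUX defaults.

```python
from collections import deque


class PassRateMonitor:
    """Rolling pass rate over the most recent binary judge verdicts."""

    def __init__(self, baseline, window=50, max_drop=0.10):
        self.baseline = baseline    # pass rate measured at calibration time
        self.window = window        # number of recent traces to consider
        self.max_drop = max_drop    # tolerated absolute drop before alerting
        self.scores = deque(maxlen=window)

    def observe(self, passed):
        """Record one verdict; return True if the rolling rate has regressed."""
        self.scores.append(1 if passed else 0)
        if len(self.scores) < self.window:
            return False  # not enough traces yet to judge a trend
        rate = sum(self.scores) / len(self.scores)
        return (self.baseline - rate) > self.max_drop


# Hypothetical stream: ten passing traces, then three failures in a row.
monitor = PassRateMonitor(baseline=0.9, window=10, max_drop=0.10)
alerts = [monitor.observe(v) for v in [True] * 10 + [False] * 3]
print(alerts[-1])  # True: rolling rate fell to 0.7 against a 0.9 baseline
```

Because each rubric axis is a binary check, one monitor per axis would point an alert at the specific behavior that regressed rather than at an aggregate score.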

Get started

See where your evals and your real users diverge.

Book a scoping call. We'll look at your agent, your current eval stack, and the decision you're trying to make, then come back with a scoped plan and a fixed quote, whether that's a benchmark study, a full human-subject eval, or an ongoing retainer. You leave with a plan, not a deck.

We'll only use this to get in touch about your enquiry. No third-party trackers.

or email [email protected]