Book a scoping call

OARUX // agent UX research, end to end

Propelling the human-agent experience.

Ship an agent real users actually trust, across its whole lifecycle: discover real demand, design with real users, validate behavior against ground truth, then operate and benchmark every release. Human-Subject Evals is the keystone, the calibrated judge that becomes your standing eval asset.

The trust gap

The hard part isn't shipping an agent. It's knowing real users will trust it, release after release.

A trustworthy agent is a research problem at every stage: which demand is real, what users do with a working prototype, whether behavior holds up against ground truth, and whether it stays honest once it's live. Your team already runs golden sets, judges, and tracing, a competent and necessary stack. The one input it can't generate from inside its own walls is criteria grounded in how real users actually behave. That's the sharp end, and it's where we work.

01

The in-house judge is an echo chamber

An LLM-as-a-judge can only grade against the criteria written for it, and in-house those criteria inherit the team's own model of “good.” An agent can clear the internal judge while real users quietly churn. The loop optimizes a proxy that was never checked against real user behavior.

02

A score, never the why

Benchmarks and thumbs-feedback tell you a session scored low. They do not tell you why. The signals that are easy to capture (latency, token cost, semantic similarity) are not the friction axes that actually move retention.

03

Likert is noise to an engineer

Where human signal does get collected, it is usually a coarse rating. The difference between a 3 of 5 and a 4 of 5 trust score gives an engineer nothing to change, and a one-off study goes stale the moment the next model swap ships.

What we do

One lifecycle. Five ways in.

Five offerings along a single arc: Discover, Design, Validate, Operate, Benchmark. Land on a sharp, self-contained piece and grow it into a standing eval program. Every engagement hands back deployable infrastructure with stated reliability, not a deck.

  1. Discover

    Signal

    Map where real demand and friction live in your domain, using public and community data competitors aren't watching, so the roadmap is grounded before the first dollar is spent.

  2. Design

    Participatory Agent Design

    Real target users shape the agent on a working prototype, so its behavior is right while it is still cheap to change, instead of the gaps surfacing after launch.

  3. Operate

    Ongoing Evals

    Auto-score live traces against your calibrated judge and catch regressions before users feel them, as the model, prompts, and user base shift.

  4. Benchmark

    ROAB Benchmark

    The Reference-Outcome Agent Benchmark (ROAB) answers “is v2 actually better for real users?” and “are we ahead of the competitor?” on task completion, with real users and ground-truth reference outcomes, not vibes.
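The regression-catching idea in the Operate offering above can be sketched in a few lines. This is a minimal illustration, not OARUX's implementation: the function name `pass_rate_drop`, the one-sided two-proportion z-test, and the 1.645 threshold are all assumptions chosen for the example, and loading scored traces from an observability tool is left out entirely.

```python
import math

def pass_rate_drop(baseline, current, z_threshold=1.645):
    """Compare one binary rubric check across two windows of scored traces.

    baseline, current: lists of booleans (True = the check passed).
    Returns (drop, z, alert) from a one-sided two-proportion z-test.
    The threshold is an assumed default, not a documented setting.
    """
    n1, n2 = len(baseline), len(current)
    if n1 == 0 or n2 == 0:
        raise ValueError("both windows need at least one scored trace")
    p1, p2 = sum(baseline) / n1, sum(current) / n2
    pooled = (sum(baseline) + sum(current)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se if se > 0 else 0.0
    return p1 - p2, z, z > z_threshold

# Hypothetical windows: 180 of 200 passes before a release, 150 of 200 after.
before = [True] * 180 + [False] * 20
after = [True] * 150 + [False] * 50
drop, z, alert = pass_rate_drop(before, after)
print(f"drop={drop:.2f} z={z:.2f} alert={alert}")
```

A binary check makes this comparison straightforward: each trace either passes or fails, so a release can be judged on a change in pass rate rather than on a shift in an average rating.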

Most teams start with the keystone or a self-contained benchmark, then grow into the retained program. We will scope the right entry point with you.

Inside the keystone

Open, axial, rubric, deploy.

Human-Subject Evals is the keystone of the lifecycle, and this is how it works: inductive grounded theory applied to AI evals. The sequence is load-bearing: reliability is proven before anything becomes a metric, and the metric is calibrated before anything deploys. Each gate is a hard stop.

  1. Step 1: Inductive discovery

    Open coding

    Unconstrained labs with 50 to 100 participants on their real goals, no script. Two or more expert raters tag emergent friction line by line, grounded in observed behavior rather than a pre-built dimension list.

    GATE: target Fleiss κ ≥ 0.80 inter-rater reliability before anything proceeds.

  2. Step 2: Derive the axes

    Axial coding

    Cluster the reliable codes into the category-specific behavioral axes that recur for this task type, then test which axes associate with abandonment. Discovered from the data, not imported from a fixed 8 to 12 dimension template.

    GATE: Axes named behaviorally, statistically supported, reported with their limits.

  3. Step 3: Binary, never Likert

    Rubric design

    Each axis becomes a binary or structured check an engineer can act on. Calibrate the LLM-judge against a human-coded golden set via a confusion matrix until it agrees with expert labels.

    GATE: target judge F1 ≥ 0.88 against the held-out golden set.

  4. Step 4: The judge that runs forever

    Deploy

    Install the calibrated judge into your Langfuse instance (the primary option, self-hostable in your VPC) or Arize Phoenix, auto-score live traces, and alert on drops. Then we hand back the golden set and the runbook, and leave.

    GATE: Live scores reproduce golden-set agreement; every commit and model swap is re-checked.

These targets are the bars every engagement clears before a number ships. The numbers you receive are reported as measured, never rounded up to the target.
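To make the gate sequence concrete, here is a minimal sketch of the Step 3 check: score one binary rubric check against golden labels, compute F1, and stop if it falls short of the target. The helper names (`confusion_counts`, `f1`, `gate`) and the toy labels are hypothetical, introduced only for illustration; the 0.88 target is the one stated in the steps above.

```python
F1_TARGET = 0.88  # Step 3 target stated above

def confusion_counts(golden, judged):
    """golden, judged: equal-length lists of booleans for one binary check."""
    if len(golden) != len(judged):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(golden, judged))
    tp = sum(g and j for g, j in pairs)
    fp = sum((not g) and j for g, j in pairs)
    fn = sum(g and (not j) for g, j in pairs)
    tn = sum((not g) and (not j) for g, j in pairs)
    return tp, fp, fn, tn

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in count form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def gate(name, measured, target):
    """Hard stop: raise instead of letting a below-target number proceed."""
    if measured < target:
        raise RuntimeError(f"{name} gate failed: {measured:.3f} < {target:.2f}")
    return measured

# Toy labels (hypothetical): expert golden set vs. LLM-judge output.
golden = [True, True, True, False, False, True, False, True]
judged = [True, True, False, False, True, True, False, True]
tp, fp, fn, tn = confusion_counts(golden, judged)
try:
    gate("judge F1", f1(tp, fp, fn), F1_TARGET)
except RuntimeError as err:
    print(err)  # the toy judge falls short, so calibration would continue
```

Raising an exception rather than logging a warning mirrors the "hard stop" wording: nothing downstream runs until the measured value meets the target.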

Human-in-the-loop

Real users produce the ground truth.

Every axis and every rubric traces back to behavior observed from real participants in moderated labs, coded line by line by two or more expert raters. Those people are the instrument your judge is calibrated against.

[Photos: research participants in a moderated usability lab, P-01 to P-06]

Real humans in the loop // ground-truth reference outcomes // multi-turn interaction quality

Reference case study // Track A

The method, executed end to end on a production agent.

Track A is the OARUX pipeline run end to end on a live, internal production agent. Inter-rater reliability was measured first, and the judge was then calibrated against a human-coded golden set. The deliverable was that judge, ready to run on live traces, not a readout deck.

0.738 Fleiss κ

Inter-rater reliability on the human coding. “Substantial” agreement, reported as delivered, not rounded up to the 0.80 target.
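For readers who want to see what a Fleiss κ figure measures, here is a minimal plain-Python sketch of the statistic: observed agreement between raters, corrected for the agreement expected by chance. The function name, the label strings, and the toy data are assumptions for illustration, not the Track A data or OARUX tooling.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of items, each a list of category labels (one per rater).

    Every item must carry the same number of ratings, at least two.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing / (n_raters * (n_raters - 1))

    observed = observed_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)

# Toy data (hypothetical): two raters tagging four transcript lines.
example = [
    ["friction", "friction"],
    ["none", "none"],
    ["friction", "none"],
    ["friction", "friction"],
]
print(round(fleiss_kappa(example), 3))  # 0.467
```

In the toy data the raters agree on three of four lines (75 percent), yet κ is only about 0.47, because much of that agreement would be expected by chance given how often "friction" is used.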

0.92 TPR

True-positive rate. When the golden set says a behavior is present, the calibrated judge catches it 92 percent of the time.

0.86 TNR

True-negative rate. When the behavior is absent, the judge correctly stays silent 86 percent of the time.

Confusion matrix (illustrative)

                    Golden set: present   Golden set: absent
  Judge: present            92                    14
  Judge: absent              8                    86

Constructed by applying the two delivered rates to round bases of 100 present and 100 absent samples, to show the shape of the calibration. The cell counts are not the delivered counts and should not be cited as such. The auditable facts are the two rates and the κ.
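The construction described above can be reproduced with a short sketch, which also shows why the cell counts should not be cited: the same two rates yield different counts, and a different F1, as soon as the assumed bases change. The function names and the 100 / 900 split are hypothetical, and none of the values computed here are delivered results.

```python
def illustrative_matrix(tpr, tnr, n_present=100, n_absent=100):
    """Apply a true-positive and true-negative rate to assumed base sizes."""
    tp = round(tpr * n_present)
    tn = round(tnr * n_absent)
    return {"tp": tp, "fn": n_present - tp, "fp": n_absent - tn, "tn": tn}

def f1_score(m):
    """F1 in count form; unlike TPR and TNR, it depends on the base sizes."""
    return 2 * m["tp"] / (2 * m["tp"] + m["fp"] + m["fn"])

# The round bases used in the illustrative matrix above.
balanced = illustrative_matrix(0.92, 0.86)

# A hypothetical base where the behavior is rare (100 present, 900 absent).
skewed = illustrative_matrix(0.92, 0.86, n_present=100, n_absent=900)

print(balanced)
print(skewed)
print(f1_score(balanced) > f1_score(skewed))
```

The two rates stay fixed across both bases while the false-positive count grows with the number of absent samples, which is why the rates, not the cells, are the figures to audit.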

Honest scope: these establish judge-to-human agreement, a calibration result, not a business-outcome claim. Where OARUX links rubric metrics to retention or satisfaction, those links are reported as association. Only a controlled A/B test on a deployed change can establish that improving a metric drives an outcome.

The founder

Twenty-five years of UX research, rebuilt for the agent era.

Trust has been the thread of Andy Hay's work for twenty-five years — from an MSc thesis at University College London that became a patented trusted-computing interface, designed at HP Labs' Trusted Platforms Group, to OARUX, where the same question of when to trust the machine is now the premise. Across those years he has built research teams, products, and the systems behind them, with a research practice spanning AI/ML, cloud, and data and analytics — the exact surfaces agent teams build on today. OARUX points that rigor at something your engineering team can deploy.

Classic UX research ends in a deck. OARUX ends in a calibrated judge running in your observability layer.

Earlier, as a UX Research Lead at Microsoft (2002–2008), he was named on a product patent for Windows Server Update Services and earned the company's Gold Star award for the work. He went on to co-found User Research International (URI), scaling it from two people to more than a hundred over seventeen years before exiting in 2025. URI's engagements included foundational research for Microsoft Azure and Google Cloud Platform, and reached seven of the world's ten largest technology companies — the same product, engineering, and data teams OARUX is built for today. Beyond client work he built a 76,000-participant research panel and took Panel Pro, a SaaS platform, from concept to production: research delivered as working systems. His MSc in Ergonomics & HCI is from UCL, his BSc in Psychology from Goldsmiths, University of London.

Data handling: Your data stays yours. The calibrated judge and its scoring run inside your own observability layer, self-hostable in your VPC on Langfuse or Arize Phoenix. Engagements are NDA-friendly, and this site runs no third-party trackers.

Get started

See where your evals and your real users diverge.

Book a scoping call. We'll look at your agent, your current eval stack, and the decision you're trying to make, then come back with a scoped plan and a fixed quote, whether that's a benchmark study, a full human-subject eval, or an ongoing retainer. You leave with a plan, not a deck.

We'll only use this to get in touch about your enquiry. No third-party trackers.

or email [email protected]