
What 40 real shoppers and their AI doubles revealed about how we evaluate agentic systems

Jasmehr Bhatia
April 13, 2026

From The Frontier Series, Episode 3. Dr. Dakuo Wang (Northeastern University / Stanford) in conversation with Viviana Márquez, Developer Relations Engineer at Prolific. Recorded live at HumanX, San Francisco.


 

Agentic AI systems are being deployed faster than the field has tools to evaluate them. Amazon's Rufus, Google's AI shopping mode, ChatGPT's integrated shopping features: these are live systems interacting with millions of real customers every day, refining recommendations in real time, shaping purchase decisions at scale. Teams building these systems can ship five to ten new features per day. A proper traditional evaluation cycle (recruiting participants, designing the study, running a dry run, collecting data, writing a report) takes two to three months.

The arithmetic doesn't work. By the time a product team finishes evaluating version one, they're already running version three.

Agentic AI has also fundamentally changed what evaluation needs to measure. In the old paradigm, you could run a model against a benchmark and get a number. But agentic systems are non-deterministic: ask the same question twice and you get two different responses. They're deeply personalized: what counts as a good outcome for one user may be entirely wrong for another. And the outcome alone doesn't tell you whether the experience was good. Two users could navigate an identical shopping task through completely different paths, end up with the same product, and have radically different feelings about whether the system served them well.


 

Dr. Dakuo Wang presented a potential answer on Episode 3 of the Frontier Series, Prolific's research podcast featuring conversations with researchers at the frontier of AI. He joined host Viviana Márquez, Developer Relations Engineer at Prolific, live from HumanX in San Francisco. Wang is an Associate Professor at Northeastern University's Khoury College of Computer Sciences and the College of Arts, Media and Design, where he leads the Northeastern Human-Centered AI Lab. He is also a Visiting Associate Professor at Stanford University, and previously a Senior Staff Member at IBM Research and Principal Investigator at the MIT-IBM Watson AI Lab. His research sits at the intersection of human-computer interaction and AI, with a focus on how humans and AI systems can collaborate effectively.

The question Wang brought to the episode: can you build a digital replica of a human user, feed it into an evaluation framework, and use it to stress-test an agentic AI system before any real users ever see it?

 

What a digital twin actually is: how it differs from an agentic tool

The term "digital twin" gets used loosely, so Wang is precise about what he means. The distinction he draws matters more than the terminology.

Most people who encounter agentic AI tools today are interacting with systems designed to complete tasks as efficiently as possible. Claude Code, Cursor, Copilot: these tools are trained and instructed to take the shortest path to the right answer, eliminate unnecessary steps, stay within the boundaries of the task. If you tell one of these systems to find a monitor under $250, it will find the best monitor under $250, directly and accurately, without deviation.

A digital twin is trying to do something fundamentally different. Wang describes it this way: a digital twin is an LLM prompted to behave like a specific human being, not like an ideal agent. And real human beings don't optimize.

"If you want to buy Adidas running shoes," Wang explains, "you might still check Nike first. Just to see what the trending design is, what the technology looks like. That's not a wasted step. That's how humans build confidence in decisions." An agentic tool trained for efficiency would eliminate that Nike browsing as irrelevant to the final outcome. A digital twin should replicate it, because those exploratory steps are exactly what separates human decision-making from machine task completion.

This distinction matters for evaluation because the thing being evaluated is a system designed for humans. If your evaluation tool behaves nothing like a human, its feedback is directionally useful at best. You'll learn whether the system can complete tasks. You won't learn whether it serves people.

 

The study: 40 participants, their digital twins, and what the data showed

To test how closely a digital twin can actually mirror a real person, Wang and colleagues recruited 40 participants via Prolific, selected for demographic diversity across age, gender, education, income, and region, and asked them to complete shopping tasks using Amazon Rufus, the conversational AI shopping assistant Amazon integrated into its platform in 2024.

The study was designed to capture four dimensions of each participant's behavior: their persona (demographics, personality traits measured by the Big Five Inventory, self-written paragraphs describing their daily routine and shopping habits), their action traces (every click, scroll, typed input, and page navigation logged via a custom Chrome extension), their in-the-moment reasoning (pop-up prompts asked participants to explain specific clicks as they made them), and their post-session evaluation of the experience.
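One way to picture the four dimensions is as a single record per participant. The sketch below is illustrative only: the class names, field names, and event kinds are assumptions made for this example, not the schema the study actually used.

```python
from dataclasses import dataclass, field


@dataclass
class ActionEvent:
    """One logged browser event. All field names here are hypothetical."""
    kind: str            # e.g. "click", "scroll", "input", "navigate"
    target: str          # the element or page the event touched
    rationale: str = ""  # in-the-moment explanation, if a pop-up asked for one


@dataclass
class ParticipantRecord:
    """Bundles the four dimensions described above for one participant."""
    participant_id: str
    persona: dict  # demographics, Big Five scores, self-written paragraphs
    action_trace: list[ActionEvent] = field(default_factory=list)
    post_session_ratings: dict = field(default_factory=dict)

    def rationales(self) -> list[str]:
        """Return only the explanations participants actually gave."""
        return [e.rationale for e in self.action_trace if e.rationale]


record = ParticipantRecord(
    participant_id="p01",
    persona={"age_band": "25-34", "shopping_habits": "compares many options"},
    action_trace=[
        ActionEvent("click", "product-card-3", "wanted to compare prices"),
        ActionEvent("scroll", "results-page"),
    ],
)
print(record.rationales())
```

Keeping the reasoning attached to the individual event, rather than in a separate list, mirrors the way the pop-up prompts were tied to specific clicks.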

Each participant completed two tasks spanning both utilitarian and hedonic shopping: finding a monitor or chair for a home office, and finding a wedding outfit or hiking jacket within a budget. After each session, a digital twin was created for the same participant using all four dimensions of their persona, and sent to complete the identical tasks via the same Amazon Rufus interface.

The results split cleanly into what matched and what didn't.

On structural measures, the alignment was encouraging. Human participants and their digital twins showed no significant difference in total conversation turns. Both groups started with similarly structured opening messages: the cosine similarity between human and agent first messages averaged 0.49. Both groups completed the shopping tasks and added products to their carts at comparable rates, with agents matching human buy-or-not decisions at an F1 score of 0.9.
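For readers less familiar with the two metrics named here, the following is a minimal sketch of how each can be computed. It makes two assumptions that are not from the study: the cosine similarity is taken over plain word counts (the paper may well use text embeddings instead), and the sample decisions and messages are made up for illustration.

```python
import math
from collections import Counter


def f1_score(human: list[bool], agent: list[bool]) -> float:
    """F1 of agent buy-or-not decisions, treating the human decision as truth."""
    pairs = list(zip(human, agent))
    tp = sum(h and a for h, a in pairs)        # both bought
    fp = sum((not h) and a for h, a in pairs)  # agent bought, human did not
    fn = sum(h and (not a) for h, a in pairs)  # human bought, agent did not
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over bag-of-words counts (a stand-in for embeddings)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up decisions for four human-agent pairs.
human_bought = [True, True, False, True]
agent_bought = [True, False, False, True]
print(round(f1_score(human_bought, agent_bought), 2))  # 0.8

# Made-up opening messages.
print(cosine_similarity("looking for a monitor for my home office",
                        "find a monitor under 250 dollars"))
```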

On everything beneath the surface, they diverged almost entirely.

The normalized edit distance between human and agent action trajectories was 0.89, meaning the sequences of actions humans took through the shopping session shared only about 11% overlap with the paths their digital twins followed. The product overlap was even starker: only 1.3% of human-agent pairs chose the exact same item. One participant chose the LG 27UP850K 27-inch 4K display; their digital twin chose the MSI PRO MP273U. Same task, same budget, same assistant, same persona data. Different people, different products.
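To make the trajectory metric concrete, here is a minimal sketch of a normalized edit distance over action sequences. It assumes a standard Levenshtein distance divided by the length of the longer sequence; the paper's exact normalization and action vocabulary may differ, and the action labels below are invented for the example.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: list[str], b: list[str]) -> float:
    """Scale to [0, 1] by the longer sequence (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Invented trajectories: the human wanders, the agent goes straight to the cart.
human = ["search", "click", "back", "click", "add_to_cart"]
agent = ["search", "click", "add_to_cart"]
print(normalized_edit_distance(human, agent))  # 0.4
```

Under this definition a value of 0 means identical trajectories and 1 means nothing lines up, so the reported 0.89 sits close to the "nothing lines up" end of the scale.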

 

The submissive agent problem

The behavioral data makes clear why the divergence happened, and Wang names it directly: current agents are too obedient.

Human participants in the study engaged in genuinely exploratory behavior. They misclicked. They browsed products outside their stated budget. They clicked into items they had no intention of buying, just to look. They asked vague, long opening questions averaging 14.8 words that often didn't even specify the product category clearly. Their total number of interactions across a session averaged 27, spread across clicks, inputs, and back-navigation.

Digital twins started concisely, with first messages averaging 7.8 words that stated the product type and budget constraint immediately, then explored more broadly. They clicked significantly more recommended items (1.9 vs. 1.2 for humans) and asked six times as many related follow-up questions (0.78 vs. 0.13 for humans). Their cumulative message length was nearly double that of humans. But their exploration was systematic rather than genuinely human: breadth-first, methodical, never stepping outside the task boundaries.

No agent ever misclicked. No agent browsed the $400 monitor when they'd been given a $250 budget. No agent indulged curiosity. And none of them reported frustration.

That last point matters more than it might appear. Human participants surfaced friction that agents simply didn't register. Real participants described moments of genuine annoyance: one noted that Rufus misidentified their gender while searching for a green-themed wedding outfit and kept recommending menswear. Others pointed to the reduced visibility of options compared to manual browsing. Several expressed skepticism about whether Rufus's recommendations were genuinely neutral or quietly promotional. On the post-session UX survey, human participants rated satisfaction with their final product choice significantly higher than agents did, and expressed significantly more ambivalence about future use, with some welcoming AI shopping assistance and others wary of manipulation.

Agents, by contrast, reported almost uniformly positive experiences. Their feedback consistently praised Rufus for efficiency and organization, framing the interaction in functional terms and rarely noting any limitation. Wang describes this as agents operating in a mode of compliance rather than genuine simulation: they follow instructions to completion without the curiosity, cognitive shortcuts, and emotional responses that shape how real people interact with systems.

 

Why this matters for how agentic AI gets built

The gap between agent and human evaluations has direct consequences for product teams building conversational AI systems.

If you evaluate your system using only digital twins, you'll get fast, cheap, scalable feedback on whether it can complete tasks correctly. You'll catch obvious bugs and major usability failures before they reach real users. That's genuinely valuable, and Wang is clear that digital twins serve an important purpose in early-stage evaluation. A traditional user study takes two to three months from study design to written report. A digital twin evaluation can compress that cycle to three to five days.
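The difference in cycle time is easier to appreciate as rough arithmetic. The sketch below uses the figures quoted in this article (two to three months, three to five days, five to ten features per day) and adds two simplifying assumptions of its own: a month is 30 days, and the team ships at the same rate every calendar day.

```python
DAYS_PER_MONTH = 30  # assumption, for back-of-the-envelope purposes only

traditional_days = (2 * DAYS_PER_MONTH, 3 * DAYS_PER_MONTH)  # two to three months
twin_days = (3, 5)                                           # three to five days
features_per_day = (5, 10)                                   # shipping rate


def features_shipped(days_range: tuple[int, int],
                     rate_range: tuple[int, int]) -> tuple[int, int]:
    """Low and high estimates of features shipped during one evaluation cycle."""
    return (days_range[0] * rate_range[0], days_range[1] * rate_range[1])


def speedup(slow: tuple[int, int], fast: tuple[int, int]) -> tuple[float, float]:
    """Most conservative and most optimistic ratio of cycle lengths."""
    return (slow[0] / fast[1], slow[1] / fast[0])


print(features_shipped(traditional_days, features_per_day))  # (300, 900)
print(features_shipped(twin_days, features_per_day))         # (15, 50)
print(speedup(traditional_days, twin_days))                  # (12.0, 30.0)
```

On these assumptions, a traditional study finishes hundreds of features behind the product it set out to evaluate, while a twin-based pass finishes tens of features behind.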

But you won't catch the gender misidentification. You won't hear that users feel like they're losing visibility into options they'd have found through manual search. You won't surface the skepticism about sponsored recommendations. The agent that experiences no friction with your system is not a user. It's a proxy for a user, and a proxy that systematically underreports the emotional, trust-related, and subjective dimensions of experience that determine whether real people keep coming back.

Wang advocates for a hybrid evaluation approach, and the sequencing matters. Digital twins go first: build fast, run simulations, eliminate the obvious failures, and iterate quickly. When you have a version that's genuinely close to ready, you bring in real participants, because the signal agents can't provide (satisfaction, frustration, trust, the complex subjective judgment about whether a system is actually worth using) is the signal that determines product quality in the real world.

Prolific's role sits explicitly in that second stage. Wang recruited the study's 40 participants through Prolific specifically because of the platform's ability to target precise demographic compositions. The study required a balanced distribution across age, gender, education level, and income in a way that other crowdsourcing approaches couldn't reliably deliver. The human ground truth that makes digital twin evaluation meaningful depends on the quality of the human data used to build the twins and calibrate their outputs. Poorly screened participants produce digital twins that don't represent anyone in particular. Representative, well-characterized participants produce twins worth trusting.

The broader implication Wang draws is pointed: keeping humans in the evaluation loop is not just a methodological preference, it's an ethical one. Agentic AI systems make consequential decisions about what products people see, what information they're given, and what choices they're guided toward. Evaluating those systems using agents that never feel frustrated, never report distrust, and never notice that recommendations might not be neutral is a way of designing for an idealized user that doesn't exist. The real users, the ones who misclick, who browse off-budget, who get their gender misread by the system, who wonder whether the AI is working for them or for the retailer, are precisely the ones whose experience matters most.


 

Dr. Dakuo Wang joined Viviana Márquez for Episode 3 of the Frontier Series, recorded live at HumanX in San Francisco. Watch the full conversation and read the paper, "LLM Agent Meets Agentic AI: Can LLM Agents Simulate Customers to Evaluate Agentic-AI-based Shopping Assistants?"