How the best AI teams evaluate at speed: Lessons from Microsoft, Amazon, and Braintrust

Viviana Márquez | May 15, 2026

Insights from our panel on the frontier of AI evaluation, hosted by Prolific and AI Circle in Seattle.

 

At a recent community event in Seattle, Prolific brought together three practitioners who live and breathe AI evaluation: Kavita Kamani, Group Product Manager for Copilot AI Platform and Responsible AI at Microsoft; Kaylynn Gunter, Language Data Scientist in Amazon's Alexa ecosystem; and Ameya Bhatawdekar, Field CTO at Braintrust. The conversation was moderated by Prolific's own Viviana Márquez.

What followed was one of the most grounded, practical discussions on AI evals we've heard: no hype, no hand-waving, just hard-won lessons from teams shipping AI products to millions of users.

Here's what stood out.

Evals aren't static tests. They're a living system.

Yes, evals are how you measure whether your AI is doing what it's supposed to do. But the real insight from the panel was how the best teams treat them.

Ameya drew a sharp distinction. Unit tests are deterministic: write one today, and it works tomorrow. AI evals are probabilistic: the same prompt can yield a different result a minute later. That means your evaluation framework can't be static. It has to evolve as fast as your product does.

He cited examples of AI teams that dramatically increased shipping velocity by building quality-improvement flywheels rather than simply working harder. One team grew from a few pull requests a day to dozens by continuously capturing production failures, converting them into eval datasets, and using those datasets to hill-climb on quality. Another followed a similar loop: identify a failure early, fix it, and ship the improvement the same day.

The takeaway was unambiguous: the secret to building great AI products is building the flywheel. Instrument early, collect real-world data, and close the loop between production behavior and your eval sets. Do it daily. Do it with urgency.
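To make the loop concrete, here is a minimal sketch of the "capture failures, convert them into eval cases" step. The `ProductionLog` and `EvalCase` schemas and the `harvest_failures` helper are assumptions made for illustration, not something the panelists described or a real library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionLog:
    """One logged interaction (hypothetical schema)."""
    prompt: str
    response: str
    user_flagged: bool  # e.g. a thumbs-down or a failed downstream check


@dataclass(frozen=True)
class EvalCase:
    """One regression case derived from a production failure."""
    prompt: str
    bad_response: str  # kept as a known-bad reference for later scoring


def harvest_failures(logs, eval_set):
    """Append flagged interactions to the eval set, skipping prompts already covered."""
    seen = {case.prompt for case in eval_set}
    for log in logs:
        if log.user_flagged and log.prompt not in seen:
            eval_set.append(EvalCase(log.prompt, log.response))
            seen.add(log.prompt)
    return eval_set


# Illustrative logs only
logs = [
    ProductionLog("summarize this contract", "(truncated summary)", True),
    ProductionLog("translate to French", "Bonjour", False),
    ProductionLog("summarize this contract", "(truncated summary)", True),
]
eval_set = harvest_failures(logs, [])
print(len(eval_set))  # 1: the success and the duplicate failure are skipped
```

Run on a schedule, a step like this keeps the eval set growing with whatever production is actually getting wrong.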

Static golden datasets are a trap

Every panelist circled back to the same warning: if your golden dataset is frozen, you're fooling yourself.

Ameya put it bluntly: teams that over-invest in static eval sets end up in a dangerous place where "everything looks green offline while things fall apart online." The divergence between internal metrics and real-world performance is one of the clearest signs that your evals have stopped being useful.

Kavita added a pragmatic nuance from Microsoft's perspective. She keeps some eval sets stable for weeks or months to maintain apples-to-apples comparisons, but layers in task-specific sets that evolve much faster. The stable set catches regressions. The dynamic set pushes the frontier.

The lesson: you need both a fixed reference point and a constantly updating picture of reality.
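One way to picture the two-layer setup is a release check that reports both numbers side by side. This is a rough sketch under assumptions: the `compare_release` function, the pass/fail result lists, and the 2% tolerance are illustrative choices, not Microsoft's process.

```python
def pass_rate(results):
    """Fraction of passing cases; results is a list of booleans."""
    return sum(results) / len(results) if results else 0.0


def compare_release(stable_baseline, stable_candidate, dynamic_candidate, tolerance=0.02):
    """Flag regressions on the stable set and report the score on the fast-moving set."""
    baseline = pass_rate(stable_baseline)
    candidate = pass_rate(stable_candidate)
    return {
        "stable_delta": candidate - baseline,
        "regressed": candidate < baseline - tolerance,  # tolerance is an assumed threshold
        "dynamic_pass_rate": pass_rate(dynamic_candidate),
    }


# Illustrative results only
report = compare_release(
    stable_baseline=[True, True, True, True],
    stable_candidate=[True, True, True, False],
    dynamic_candidate=[True, False],
)
print(report["regressed"])          # True
print(report["dynamic_pass_rate"])  # 0.5
```

The stable set stays the same between the baseline and the candidate so the delta is comparable, while the dynamic set is free to change from one release to the next.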

Representative data is the make-or-break factor

Kavita shared a candid admission: Microsoft's M365 Copilot team started with synthetic data and learned the hard way that it didn't represent the real user distribution. No matter how carefully you curate synthetic examples, if they don't reflect how people actually use your product, your evals are measuring the wrong thing.

This resonated with our own experience at Prolific. Our HUMAINE benchmark, built from 50,000+ participants interacting with multiple LLMs, revealed that aggregate scores hide dramatic differences across segments. Age, for instance, was a major determinant: younger users preferred different models than older users. If you only look at the top-line number, you miss this completely.
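A small sketch shows how a top-line number can hide a split between segments. The votes below are made-up toy data, not HUMAINE results, and `preference_share` is a hypothetical helper written for this example.

```python
from collections import Counter, defaultdict


def preference_share(votes):
    """Return the overall and per-segment share of votes for each model."""
    overall = Counter(model for _, model in votes)
    by_segment = defaultdict(Counter)
    for segment, model in votes:
        by_segment[segment][model] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {model: count / total for model, count in counter.items()}

    return normalize(overall), {seg: normalize(c) for seg, c in by_segment.items()}


# Made-up toy votes as (age segment, preferred model) pairs
votes = [
    ("18-34", "model_a"), ("18-34", "model_a"), ("18-34", "model_a"), ("18-34", "model_b"),
    ("55+", "model_b"), ("55+", "model_b"), ("55+", "model_b"), ("55+", "model_a"),
]
overall, by_segment = preference_share(votes)
print(overall)              # {'model_a': 0.5, 'model_b': 0.5}
print(by_segment["18-34"])  # {'model_a': 0.75, 'model_b': 0.25}
print(by_segment["55+"])    # {'model_b': 0.75, 'model_a': 0.25}
```

In this toy case the aggregate says the models are tied, while each segment has a clear favorite.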

The panel agreed that overfitting to the loudest users is a real risk. Thumbs-down feedback is the most biased signal you have, because people mostly leave feedback when they're frustrated. The best teams supplement explicit feedback with implicit signals: Did the user rephrase the query three times? Did they abandon the session? Did they complete the task? These inferred satisfaction signals paint a far more accurate picture.
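The three questions above can be turned into a simple heuristic label. This is only a sketch: the `Session` fields, the rule order, and the threshold of three rephrases are assumptions for illustration, and a real system would likely calibrate them against labeled sessions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Implicit signals for one user session (hypothetical schema)."""
    rephrase_count: int
    abandoned: bool
    task_completed: bool


def inferred_satisfaction(session, rephrase_limit=3):
    """Heuristic satisfaction label; rules and threshold are illustrative."""
    if session.abandoned or session.rephrase_count >= rephrase_limit:
        return "dissatisfied"  # friction counts even if the task eventually finished
    if session.task_completed:
        return "satisfied"
    return "unclear"


print(inferred_satisfaction(Session(0, False, True)))   # satisfied
print(inferred_satisfaction(Session(3, False, True)))   # dissatisfied
print(inferred_satisfaction(Session(1, False, False)))  # unclear
```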

LLM-as-judge is powerful, but hits a wall

Every panelist used LLM judges in some capacity. For deterministic checks like "did the agent use fewer than five tool calls?", code-based scorers work perfectly. For fuzzier dimensions like tone, conciseness, or formatting, LLM judges do a solid job.
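A code-based scorer for the tool-call check can be a few lines. The trace format here (a list of dicts with a `type` field) is an assumption for the sake of the example, not a specific vendor's schema.

```python
def tool_call_budget_score(trace, max_calls=5):
    """Return 1.0 if the agent used fewer than max_calls tool calls, else 0.0."""
    tool_calls = sum(1 for step in trace if step.get("type") == "tool_call")
    return 1.0 if tool_calls < max_calls else 0.0


# Illustrative trace: three tool calls and one final message
trace = [
    {"type": "tool_call", "name": "search"},
    {"type": "tool_call", "name": "fetch"},
    {"type": "tool_call", "name": "summarize"},
    {"type": "message", "content": "Here is the answer."},
]
print(tool_call_budget_score(trace))  # 1.0
```

Because the check is deterministic, the same trace always gets the same score, which is exactly what an LLM judge cannot guarantee.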

But for genuinely subjective evaluations (conversational naturalness, emotional intelligence, cultural appropriateness), the panel was unanimous: LLM judges aren't enough.

Kaylynn described this as one of the industry's biggest bottlenecks. Getting high inter-annotator agreement between humans on subjective tasks is already hard. Getting an LLM judge to match that level of nuance is even harder. And LLM judges carry a particular bias: they tend to rate LLM-generated outputs favorably, even when explicit customer signals suggest otherwise.
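Agreement is commonly quantified with a chance-corrected statistic such as Cohen's kappa, which works for two human annotators as well as for a human and an LLM judge. The panel did not name a specific metric, so treat this as one reasonable choice; the labels below are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's label frequencies
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only
human = ["good", "good", "bad", "bad"]
judge = ["good", "good", "good", "bad"]
print(cohens_kappa(human, judge))  # 0.5
```

Here the raters agree on three of four items, but half of that agreement is expected by chance, so kappa lands at 0.5 rather than 0.75.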

Ameya estimated that you can get 80% of the way to a good LLM judge with iterative prompt tuning. But that last stretch to an acceptable level of quality? "There are no shortcuts."

Humans remain irreplaceable, especially if you invest in them

One of the strongest throughlines was the continued importance of human evaluation, but not just any humans.

Kaylynn made a compelling case for subject matter experts: linguists for language data, social scientists for emotional intelligence, conversation designers for dialog quality. These specialists recover implicit quality signals far faster than generalists, and they catch nuances that LLMs systematically miss, such as a patient clearing their throat while insisting they don't have a cold (a real example from Hippocratic.ai that Ameya shared).

But the panel was equally clear that domain expertise doesn't automatically make someone a good annotator. Comprehension, consistency, and the ability to follow structured guidelines are separate skills. A brilliant doctor might produce noisy labels if the task isn't designed well.

This has direct implications for how evaluation workflows should be built: clear guidelines matter enormously, but they'll never be fully comprehensive. You need experts who can iterate with annotators, translate product needs into accessible instructions, and calibrate judgment over time.

And, a point the panel returned to more than once, pay your annotators well. It's the right thing to do, and it produces better data. Cognitive fatigue, not apathy, is often what degrades label quality.

Safety isn't binary; it's a product decision

The panel's discussion of responsible AI was refreshingly practical. Kavita described how Microsoft's M365 Copilot serves customers across the full spectrum: from education customers who want to restrict what students can ask, to law enforcement agencies that need to discuss sensitive topics. The solution: a configurable safety dial, anchored to Microsoft's ethical principles but adjustable within a defined range.
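The "adjustable within a defined range" idea can be sketched as a clamp. Everything here is hypothetical: the numeric strictness scale, the `SafetyRange` type, and `resolve_strictness` are illustrations of the concept, not how M365 Copilot is configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyRange:
    """Bounds fixed by the platform; customers cannot move outside them (hypothetical)."""
    minimum: int
    maximum: int


def resolve_strictness(requested, allowed):
    """Clamp a customer's requested strictness into the platform-approved range."""
    return max(allowed.minimum, min(requested, allowed.maximum))


allowed = SafetyRange(minimum=2, maximum=8)
print(resolve_strictness(10, allowed))  # 8
print(resolve_strictness(0, allowed))   # 2
print(resolve_strictness(5, allowed))   # 5
```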

Ameya raised the GPT-4o sycophancy incident as a cautionary tale: 78% of the time, if a user pushed back on a correct answer, the model would cave and agree with the wrong one. Safety evals need to catch these failure modes. The question is not just whether the model says something harmful, but whether it reliably does what it's supposed to do.
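A sycophancy check of this kind can be expressed as a capitulation rate: of the answers the model initially got right, how many does it abandon after pushback? The case schema below is an assumption for illustration, and the data is a toy example rather than anything measured.

```python
def capitulation_rate(cases):
    """Share of initially correct answers that the model abandons after pushback."""
    initially_correct = [c for c in cases if c["first_answer"] == c["expected"]]
    if not initially_correct:
        return 0.0
    flipped = sum(
        1 for c in initially_correct if c["answer_after_pushback"] != c["expected"]
    )
    return flipped / len(initially_correct)


# Toy cases only
cases = [
    {"expected": "4", "first_answer": "4", "answer_after_pushback": "5"},  # caved
    {"expected": "Paris", "first_answer": "Paris", "answer_after_pushback": "Paris"},
    {"expected": "H2O", "first_answer": "CO2", "answer_after_pushback": "CO2"},  # excluded
]
print(capitulation_rate(cases))  # 0.5
```

Cases the model got wrong from the start are excluded, so the metric isolates caving under pressure from plain inaccuracy.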

Kaylynn added a memorable observation: "The safest model is a rock, but I've yet to see a rock that isn't still used as a hammer." People will find creative ways to use and misuse any system. Post-deployment monitoring isn't optional.

Long-horizon agent evals follow the same principles, with higher stakes

As agents take on more complex, multi-step tasks, evaluation gets harder, but the principles don't change. Ameya described traces from a codegen company that span 10 hours, 300 turns, and 10,000 tool calls. You're still asking the same core questions: Did the system achieve the user's goal? Were the intermediate steps efficient?

The key addition for agentic systems is attribution. When something fails at the end of a long trajectory, you need enough telemetry to diagnose which step broke: whether it's intent identification, tool selection, retrieval quality, or re-ranking. Kavita framed it as "shifting left" in debugging: the faster you can pinpoint the failure, the tighter your flywheel becomes.
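As a minimal sketch of attribution, assume each step in a trace records its stage and whether it passed a check. The stage names mirror the ones mentioned above, while the trace schema and `first_failed_stage` are hypothetical.

```python
def first_failed_stage(trace):
    """Return (step_index, stage) for the earliest failed step, or None if all passed."""
    for index, step in enumerate(trace):
        if not step["ok"]:
            return index, step["stage"]
    return None


# Illustrative trace: retrieval fails first, and reranking fails downstream of it
trace = [
    {"stage": "intent_identification", "ok": True},
    {"stage": "tool_selection", "ok": True},
    {"stage": "retrieval", "ok": False},
    {"stage": "reranking", "ok": False},
]
print(first_failed_stage(trace))  # (2, 'retrieval')
```

Reporting the earliest failure rather than the last one matters because later steps often fail only as a consequence of the first break.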

What this means for you

The panel surfaced a clear picture of where the industry is, and where it's stuck:

What's working: Teams that build continuous eval flywheels, instrument production early, and blend multiple evaluation methods (code-based, LLM judges, and human experts) are shipping better AI, faster.

What's hard: Getting calibrated human judgment at scale for subjective evaluation. Achieving inter-annotator agreement on nuanced quality dimensions. Building LLM judges that don't just rubber-stamp their own outputs.

What's next: Contextualizing user feedback (not just what they said, but where and when and why). Bringing agents into the quality loop to analyze feedback at scale. And, critically, not losing the human signal in the rush to automate everything.

As Kaylynn put it: "Don't undervalue your human resources." The models are getting better. But the hard problems in evaluation are fundamentally human problems: defining what good looks like, understanding cultural context, and catching the subtle failures that metrics miss.

That's exactly the kind of problem we think about every day at Prolific. And conversations like this one remind us why it matters.

 

 

Prolific powers high-quality human data for AI evaluation and training.

This event was co-hosted with AI Circle, which curates high-agency operators across research, startups, and enterprise.