Optimise agents for real human preference

Close the gap between your metrics and what humans prefer.

Prolific is the human eval layer for agentic AI

Google
Hugging Face
Ai2
Stanford

Use cases

For the moments metrics can't solve

Autoresearch optimises a single value. Prolific tells you whether humans agree.

Metric plateau eval
Your metric stopped moving. Run a pairwise human eval to find out if you've genuinely converged or if you're optimising the wrong thing.
Model selection
The benchmark can't separate your top models. Human judgment reveals what metrics can't and tells you which candidate to ship.
Specialist review
Shipping into specific industries? Reach experts, including clinicians, lawyers and accountants, when specialist judgment is the only valid signal.
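
As a rough illustration of the pairwise human eval mentioned under "Metric plateau eval", the sketch below checks whether a preference split between two models is distinguishable from a coin flip. It is a minimal example under stated assumptions, not Prolific's analysis method: the function names, the exact sign test, and the 0.05 threshold are all choices made for this illustration, and the vote counts in the usage line are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial (sign) test against a 50/50 split.

    Ties are assumed to have been dropped before counting.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarise(wins_a: int, wins_b: int, alpha: float = 0.05) -> dict:
    """Report model A's win rate and whether the split differs from chance."""
    n = wins_a + wins_b
    p = sign_test_p_value(wins_a, wins_b)
    return {
        "win_rate_a": wins_a / n if n else 0.5,
        "p_value": p,
        "humans_see_difference": p < alpha,
    }


# Hypothetical counts: 65 participants preferred model A, 35 preferred model B.
result = summarise(65, 35)
```

If `humans_see_difference` is false while the metric has also stopped moving, that is at least consistent with genuine convergence. If humans clearly prefer the model the metric ranks lower, the metric itself may be the thing to revisit.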

Research

"Autonomous agents optimise what they can measure. They need humans to verify they're measuring the right thing."

Nora Petrova: Research Scientist, Prolific

We ran the experiment ourselves: 50 autonomous agent-led training runs, followed by a human eval with 300 Prolific participants. The metric said one model won. 300 humans picked the other.

Why Prolific

Build with the right human insight

The human layer your training loop needs.

01
Trusted participant network
200,000+ identity-verified humans, filterable by 300+ attributes. General population plus credentialed specialists.
Explore our participants
02
Self-serve with speed
Studies launch in minutes and return first data in hours. No procurement cycle, no waiting on a researcher to set up a study.
Get started
03
Choose your access
Access the same network via UI, CLI, or REST API. Integrate human preference directly into your training pipeline.
Read API docs
04
Reproducible across runs
Stable cohort hashes mean the same participant specification resolves identically across model versions.
Learn more
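
To make the idea behind "Reproducible across runs" concrete, here is one hypothetical way a participant specification could be turned into a stable identifier. This page does not describe how Prolific computes cohort hashes, so everything below is an assumption for illustration: the `cohort_hash` function, the canonical-JSON approach, the use of SHA-256, and the example attribute names.

```python
import hashlib
import json


def cohort_hash(spec: dict) -> str:
    """Hash a participant specification so that equal specs give equal hashes.

    Assumption: the spec is JSON-serialisable. Keys are sorted and whitespace
    is removed so that key order and formatting do not change the result.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute names, written in two different key orders.
spec_v1 = {"country": "GB", "min_age": 25, "profession": "clinician"}
spec_v2 = {"profession": "clinician", "min_age": 25, "country": "GB"}

same_cohort = cohort_hash(spec_v1) == cohort_hash(spec_v2)
```

Recording such a hash next to each model version's eval results would let you confirm that two runs used the same specification. On its own it does not guarantee the same individual participants were sampled.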

Read the research

From our own experiment to peer-reviewed evidence - the case for human preference eval in agentic workflows.

When does autoresearch need a human?
We ran Karpathy's autoresearch framework on a real DPO task - 50 autonomous experiments, then a human eval with 300 Prolific participants.
Read the experiment
Building breakthrough AI faster
Ai2 reduced human data collection from weeks to hours with Prolific, building state-of-the-art multimodal AI models faster without sacrificing quality.
Read more
Unpacking human preference for LLMs - The HUMAINE framework
Across 20,000+ participants and 22 demographic groups, technical benchmarks and human experience tell different stories. Peer-reviewed evidence that the gap is real, measurable and consistent.
Read the paper

Start collecting human evaluation data for your next release.

FAQ

End-to-end Agentic AI Evaluation FAQ