Optimise agents for real human preference

Close the gap between your metrics and what humans prefer.

Talk to the team

Get started for free

Prolific is the human eval layer for agentic AI

Use cases

For the moments metrics can't solve

Autoresearch optimises a single value. Prolific tells you whether humans agree.

Metric plateau eval

Your metric stopped moving. Run a pairwise human eval to find out if you've genuinely converged or if you're optimising the wrong thing.
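One way to read the result of a pairwise human eval is an exact two-sided binomial test against "no preference". The sketch below is illustrative only: the vote labels `candidate` and `baseline` and both function names are assumptions, not part of any Prolific export format, and ties are assumed to have been removed beforehand.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided binomial p-value against p = 0.5 (no preference)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    # Under p = 0.5 the distribution is symmetric, so double the larger tail.
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarise_pairwise(votes: list[str]) -> dict:
    """Summarise votes labelled 'candidate' or 'baseline' (hypothetical labels)."""
    if not votes or any(v not in ("candidate", "baseline") for v in votes):
        raise ValueError("votes must be a non-empty list of 'candidate'/'baseline'")
    n = len(votes)
    wins = sum(1 for v in votes if v == "candidate")
    return {
        "n": n,
        "candidate_win_rate": wins / n,
        "p_value": binomial_two_sided_p(wins, n),
    }


if __name__ == "__main__":
    # Made-up votes, purely to show the call shape.
    print(summarise_pairwise(["candidate"] * 8 + ["baseline"] * 2))
```

A win rate near 0.5 with a large p-value is consistent with genuine convergence; a clear human preference while the metric stays flat suggests the metric is missing something.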

Model selection

The benchmark can't separate your top models. Real human insight reveals what metrics can't, and tells you which candidate to actually ship.

Specialist review

Shipping into specific industries? Reach Experts, including clinicians, lawyers and accountants, when specialist judgement is the only valid signal.

Research

"Autonomous agents optimise what they can measure. They need humans to verify they're measuring the right thing."

Nora Petrova: Research Scientist, Prolific

We ran the experiment ourselves: 50 autonomous agent-led training runs, followed by a human eval with 300 Prolific participants. The metric said one model won. 300 humans picked the other.

Read the experiment

Why Prolific

Build with the right human insight

The human layer your training loop needs.

Trusted participant network

200,000+ identity-verified humans, filterable by 300+ attributes. General population plus credentialed specialists.

Explore our participants

Self-serve with speed

Studies launch in minutes and return first data in hours. No procurement cycle, no waiting on a researcher to set up a study.

Get started

Choose your access

Access the same network via UI, CLI, or REST API. Integrate human preference directly into your training pipeline.

Read API docs

Reproducible across runs

Stable cohort hashes mean the same participant specification resolves identically across model versions.
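To illustrate the general idea of a stable cohort hash, here is a minimal sketch that hashes a canonical serialisation of a participant specification. It is an assumption about how such a hash could work, not a description of Prolific's mechanism; the function name and the spec fields are hypothetical.

```python
import hashlib
import json


def cohort_hash(spec: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of the spec."""
    # sort_keys makes the digest independent of dict key order (nested dicts too);
    # fixed separators remove whitespace differences.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical filter names, for illustration only.
    spec_v1 = {"country": "GB", "age": {"min": 25, "max": 40}}
    spec_v2 = {"age": {"max": 40, "min": 25}, "country": "GB"}
    print(cohort_hash(spec_v1) == cohort_hash(spec_v2))
```

List values stay order-sensitive in this sketch, so a real scheme would need to decide whether to sort them before hashing.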

Learn more

Read the research

From our own experiment to peer-reviewed evidence - the case for human preference eval in agentic workflows.

Talk to an expert

When does autoresearch need a human?

We ran Karpathy's autoresearch framework on a real DPO task - 50 autonomous experiments, then a human eval with 300 Prolific participants.


Building breakthrough AI faster

Ai2 reduced human data collection from weeks to hours with Prolific, building state-of-the-art multimodal AI models faster without sacrificing quality.

Unpacking human preference for LLMs - The HUMAINE framework

Across 20,000+ participants and 22 demographic groups, technical benchmarks and human experience tell different stories. Peer-reviewed evidence that the gap is real, measurable and consistent.

Read the paper

Start collecting human evaluation data for your next release.


FAQ

End-to-end Agentic AI Evaluation FAQ

How quickly can we start collecting evaluation data?

Our platform is designed for immediate deployment. Self-serve video and preference projects launch in minutes, with results arriving within hours. Managed teleoperation or safety projects depend on scope, hardware integration, and evaluator specialisation requirements.

How much work will my team need to do versus what Prolific handles?

With our self-serve platform, you control the process. We provide infrastructure and participants. You design tasks - video review, preference, teleop, or survey - in your evaluation tool or our AI Task Builder, set criteria, and analyse results. With managed services, we handle everything from participant sourcing to quality assurance. You define requirements and get verified results.

How does Prolific ensure evaluation data quality for AI evaluations?

We combine participant verification, specialised qualification tests, credentials checks, performance tracking, and automated quality controls to maintain a high-quality pool. For physical AI evaluations, we recommend AI Taskers or Domain Experts when you need robotics, autonomy, mechanical, or safety expertise for your tasks.

How does Prolific compare to traditional data vendors?

Traditional vendors rely on large hired annotation or teleop teams, with little transparency into evaluator profiles and selection criteria. Prolific gives you direct access to verified evaluators through self-serve or managed options - the quality assurance of managed services, the transparency and control of direct access, and faster turnaround times.

Roles that get the most from Prolific on agentic AI.

Post-training / RLHF lead running DPO, reward-model, or preference-data collection on multi-turn agent trajectories.
Applied AI lead on a coding, clinical, or legal agent needing domain-expert review at scale.
Eval lead running multi-turn or human-user-simulation benchmarks, needing a representative user distribution beyond internal testers.
Agent safety lead red-teaming for prompt injection, tool misuse, and policy violation pre-release.

Do you support RLHF, DPO, or reward model training data?

Yes. We support trajectory-level preference pairs, Likert ratings, free-text rationale, and structured process labels, all available programmatically via API in formats matched to DPO, reward-model, and RLAIF training pipelines. Use self-serve for fast iteration and managed services for calibrated production programmes.
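When several raters judge the same pair, their votes usually need aggregating before they fit a DPO pipeline. The sketch below does this by majority vote and emits the `prompt` / `chosen` / `rejected` layout that many DPO trainers accept. The `Judgement` schema, its field names and the `min_margin` parameter are assumptions for illustration, not a Prolific export format.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """One rater's verdict on one pair (hypothetical schema)."""
    pair_id: str
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b"


def to_dpo_records(judgements: list[Judgement], min_margin: int = 1) -> list[dict]:
    """Majority-vote each pair; drop pairs whose vote margin is below min_margin."""
    votes: dict[str, Counter] = {}
    pairs: dict[str, Judgement] = {}
    for j in judgements:
        if j.preferred not in ("a", "b"):
            raise ValueError(f"unexpected preference: {j.preferred!r}")
        pairs.setdefault(j.pair_id, j)
        votes.setdefault(j.pair_id, Counter())[j.preferred] += 1

    records = []
    for pair_id, counts in votes.items():
        margin = counts["a"] - counts["b"]
        if abs(margin) < min_margin:
            continue  # tie or too close to call
        p = pairs[pair_id]
        if margin > 0:
            chosen, rejected = p.response_a, p.response_b
        else:
            chosen, rejected = p.response_b, p.response_a
        records.append({"prompt": p.prompt, "chosen": chosen, "rejected": rejected})
    return records


if __name__ == "__main__":
    demo = [
        Judgement("p1", "Book a flight", "trajectory A", "trajectory B", "a"),
        Judgement("p1", "Book a flight", "trajectory A", "trajectory B", "a"),
        Judgement("p1", "Book a flight", "trajectory A", "trajectory B", "b"),
    ]
    for record in to_dpo_records(demo):
        print(json.dumps(record))  # one JSON object per line (JSONL)
```

Raising `min_margin` trades data volume for cleaner labels, which may matter more for reward-model training than for quick iteration.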

How do you handle long trajectories and multi-turn evaluation?

Our task tooling and AI Task Builder are designed for whole-trajectory review — full tool-call sequences, conversation histories, and intermediate reasoning — with structured step-level and outcome-level judgement. Process reward models and long-horizon DPO are first-class use cases, not afterthoughts.

Do you provide domain experts when the task requires them?

Yes. Domain Experts include licensed clinicians, engineers, lawyers, finance specialists, and researchers. For tasks that mix expertise with population scale — for example, clinical agent deployment requiring both specialist judgement and patient-population acceptance — use both pools in the same programme.