Human feedback for agentic AI training and evaluation.

200,000+ identity-verified humans, filterable by 300+ demographic, professional, behavioural, and credential attributes. General population through Domain Experts, all credentialed and verified. One pool, one API, one audit trail.

Prolific is the human eval layer for agentic AI

Google
Hugging Face
Ai2
Stanford
Why Prolific

What Prolific gives an agentic AI team.

A participant pool that spans user to specialist.
Filter 200,000+ identity-verified humans by 300+ attributes: demographic, professional, behavioural, and credential. The pool runs from general population through Domain Experts across software engineering, clinical, legal, finance, and research, covering a range of specialisms and seniority levels. Cohorts can be re-recruited across runs for longitudinal comparison between model versions.
Preference and feedback at trajectory level.
Pairwise preferences, Likert ratings, and step-level rationales collected over complete agent trajectories: tool calls, intermediate state, multi-turn exchanges, and outcome. This captures the process failures that outcome-only benchmarks miss. Suitable for RLHF, DPO, and reward-model training pipelines.
Turnaround that fits a training loop.
Self-serve studies launch in minutes and return first data in hours. Managed red-team and preference programmes deliver in days. No long procurement cycle.
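As a rough illustration of trajectory-level preference data, the sketch below models a pairwise judgement over two complete agent trajectories and converts it into the prompt/chosen/rejected layout commonly used for DPO training. Every name here (`Step`, `Trajectory`, `PairwiseJudgement`, `to_dpo_pairs`) is a hypothetical schema invented for this example; it is not Prolific's export format or API.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Step:
    """One step of an agent trajectory (hypothetical schema)."""
    action: str       # e.g. a tool call or a message to the user
    observation: str  # e.g. the tool output or the user's reply


@dataclass
class Trajectory:
    trajectory_id: str
    task_prompt: str
    steps: list[Step]
    outcome: str

    def render(self) -> str:
        """Flatten the full trajectory, not just the outcome, into text."""
        lines = [
            f"[{i}] {s.action} -> {s.observation}"
            for i, s in enumerate(self.steps, start=1)
        ]
        lines.append(f"outcome: {self.outcome}")
        return "\n".join(lines)


@dataclass
class PairwiseJudgement:
    rater_id: str  # assumed to be an anonymised participant identifier
    left: Trajectory
    right: Trajectory
    preferred: Literal["left", "right", "tie"]
    rationale: str = ""


def to_dpo_pairs(judgements: list[PairwiseJudgement]) -> list[dict[str, str]]:
    """Turn pairwise judgements into prompt/chosen/rejected records.

    Ties carry no preference signal and are skipped.
    """
    pairs = []
    for j in judgements:
        if j.preferred == "tie":
            continue
        if j.left.task_prompt != j.right.task_prompt:
            raise ValueError("compared trajectories must share a task prompt")
        if j.preferred == "left":
            chosen, rejected = j.left, j.right
        else:
            chosen, rejected = j.right, j.left
        pairs.append(
            {
                "prompt": chosen.task_prompt,
                "chosen": chosen.render(),
                "rejected": rejected.render(),
            }
        )
    return pairs
```

Rendering the whole trajectory, rather than only the final answer, is what lets a rater's preference reflect process as well as outcome. A real pipeline would likely keep the rationale and rater metadata alongside each pair.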
Research · HUMAINE

"Outcome benchmarks miss process failure. Single-turn preference data can't see multi-step reward hacking. The signal post-training needs is trajectory-level judgement — from the right humans."

From HUMAINE: a framework for representative human evaluation of large models.

What you can run on Prolific.

Four workflows across one human data network.

01
Trajectory preference data
Pairwise preferences, Likert ratings, and step-level rationale over complete agent trajectories. Anonymised participant data accessible via UI, CLI, or API; suitable for RLHF, DPO, and reward-model training.
02
Credentialed specialist review
Domain Experts across software engineering, clinical, legal, and finance review trajectories where domain judgement is required: code review, clinical reasoning, citation accuracy, and regulatory compliance.
03
Human user simulation
Participants role-play end users in multi-turn, tool-using evaluations. Cohorts are demographically pre-specified and reproducible across runs, with filter provenance preserved.
04
Red-team & adversarial evaluation
Coverage includes prompt injection, tool hijacking, jailbreaks, and policy probing. Participant provenance is preserved for downstream safety cases.
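One way step-level ratings can surface process failures that an outcome check misses is sketched below: it averages Likert scores per step and flags low-rated steps only in trajectories that nominally succeeded. The input shapes, the function name `flag_process_failures`, the 1 to 5 scale, and the 2.5 threshold are all assumptions made for this example, not a Prolific data format.

```python
from collections import defaultdict
from statistics import mean


def flag_process_failures(ratings, outcomes, threshold=2.5):
    """Flag poorly rated steps inside trajectories whose outcome succeeded.

    ratings:  iterable of (trajectory_id, step_index, likert_score),
              with likert_score on an assumed 1-5 scale.
    outcomes: dict mapping trajectory_id -> True if the task outcome passed.
    Returns a dict mapping trajectory_id -> sorted list of flagged step indices.
    """
    scores_by_step = defaultdict(list)
    for trajectory_id, step_index, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"Likert score out of range: {score}")
        scores_by_step[(trajectory_id, step_index)].append(score)

    flagged = defaultdict(list)
    for (trajectory_id, step_index), scores in scores_by_step.items():
        # Only successful trajectories are of interest here: a failed one
        # would already be caught by an outcome-only benchmark.
        if outcomes.get(trajectory_id) and mean(scores) < threshold:
            flagged[trajectory_id].append(step_index)

    return {tid: sorted(steps) for tid, steps in flagged.items()}
```

Averaging is the simplest aggregation; with few raters per step, a median or an agreement check may be a more robust choice.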

How fast-moving AI teams use Prolific

Trusted by AI/ML developers, researchers, and leading organisations across industries.

Unpacking human preference for LLMs: the HUMAINE framework
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism.
Building breakthrough AI faster
Ai2 reduced human data collection from weeks to hours with Prolific, building state-of-the-art multimodal AI models faster without sacrificing quality.
Gemini 3 Pro: Frontier safety framework
The frontier safety framework report for Google’s latest model.

Start collecting human evaluation data for your next release.

FAQ

End-to-end Agentic AI Evaluation FAQ