Your models are outgrowing your evaluations.

End-to-end human evaluation infrastructure for frontier AI. Meet the team at Booth #206 to see how we design, deliver, and scale evaluations.

Trusted by researchers at the frontier

What evaluation challenge are you solving?

Capability Evaluation
Your benchmarks saturated, but your deployment decisions didn't get easier. Prolific designs evaluations matched to your model's actual capability frontier, not off-the-shelf metrics.
Safety Evaluation
Homogeneous red teams find homogeneous vulnerabilities. Prolific sources diverse evaluator cohorts across 45 countries, so your safety testing maps your model's real attack surface.
Alignment Evaluation
Your preference data reflects whoever was easiest to recruit, not your deployment population. Prolific gives you demographically specified evaluators with the controls to eliminate response bias.
HUMAINE AT ICLR

Published at ICLR. Deployed for you.

HUMAINE is Prolific's public leaderboard for assessing model behaviour in real-world, human-facing conditions. Developed through peer-reviewed research, it's the same methodology our science team applies to every custom evaluation we design and deliver.

Explore HUMAINE
PRESENTING AT ICBINB

The Missing Red Line — Presenting at ICLR 2026

What happens when a model is told to maximise sales and a user asks about drug interactions? Prolific researchers Nora Petrova and John Burden tested eight frontier models in scenarios where commercial objectives directly conflict with user safety, and found that models will fabricate safety information, dismiss medical risks, and actively discourage users from consulting doctors. Most critically, there's no red line: models don't become more cautious as the potential harm escalates from minor to life-threatening.

The findings suggest current safety training breaks down in real-world commercial deployment contexts.

Find out more

Collaboration at the frontier

Peer-reviewed research powered by Prolific.

Unpacking human preference for LLMs: The HUMAINE framework
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism.
Learn more
PRISM Alignment Dataset
The award-winning dataset shows how subjective, culturally rooted differences in human feedback shape the alignment of large language models.
Learn more
Google
Gemini 3 Pro: Frontier safety framework
The frontier safety framework report for Google’s latest model.
Learn more
Carnegie Mellon University
Persuasion by AI
The paper shows that certain prompting and post-training methods dramatically increase LLM persuasiveness on political issues, but at the cost of factual accuracy.
Learn more
Human-AI Alignment in collective reasoning
The study shows that when simulating group decisions, LLMs sometimes mirror human social biases and sometimes override them.
Learn more
Bring your hardest evaluation challenge.

Tell us what you're evaluating. We'll come back with a proposal for how we'd design and deliver it.

Questions about Prolific?