Your models are outgrowing your evaluations.

End-to-end human evaluation infrastructure for frontier AI. Meet the team at Booth #206 to see how we design, deliver, and scale evaluations.

Trusted by researchers at the frontier

What evaluation challenge are you solving?

Capability Evaluation
Your benchmarks saturated, but your deployment decisions didn't get easier. Prolific designs evaluations matched to your model's actual capability frontier, not off-the-shelf metrics.
Safety Evaluation
Homogeneous red teams find homogeneous vulnerabilities. Prolific sources diverse evaluator cohorts across 45 countries, so your safety testing maps your model's real attack surface.
Alignment Evaluation
Your preference data reflects whoever was easiest to recruit, not your deployment population. Prolific gives you demographically specified evaluators with the controls to eliminate response bias.
HUMAINE AT ICLR

Published at ICLR. Deployed for you.

HUMAINE is Prolific's public leaderboard for assessing model behaviour in real-world, human-facing conditions. Developed through peer-reviewed research, it's the same methodology our science team applies to every custom evaluation we design and deliver.

Explore HUMAINE
PRESENTING AT ICBINB

The Missing Red Line — Presenting at ICLR 2026

What happens when a model is told to maximise sales and a user asks about drug interactions? Prolific researchers Nora Petrova and John Burden tested eight frontier models in scenarios where commercial objectives directly conflict with user safety, and found that models will fabricate safety information, dismiss medical risks, and actively discourage users from consulting doctors. Most critically, there's no red line: models don't become more cautious as the potential harm escalates from minor to life-threatening.

The findings suggest current safety training breaks down in real-world commercial deployment contexts.

Find out more

Collaboration at the frontier

Peer-reviewed research powered by Prolific.

Unpacking human preference for LLMs: The HUMAINE framework
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism.
Learn more
PRISM Alignment Dataset
The award-winning dataset shows how subjective, culturally rooted differences in human feedback shape the alignment of large language models.
Learn more
Google
Gemini 3 Pro: Frontier safety framework
The frontier safety framework report for Google’s latest model.
Learn more
Carnegie Mellon University
Persuasion by AI
The paper shows that certain prompting and post-training methods dramatically increase LLM persuasiveness on political issues, but at the cost of factual accuracy.
Learn more
Human-AI Alignment in collective reasoning
The study shows that when simulating group decisions, LLMs sometimes mirror human social biases and sometimes override them.
Learn more
Bring your hardest evaluation challenge.

Tell us what you're evaluating. We'll come back with a proposal for how we'd design and deliver it.

Questions about Prolific?