Case Studies

How a leading AI company accelerated factuality testing by 10x with Prolific

George Denison | September 8, 2025

A leading AI company needed a way to test powerful language models for factual accuracy - in hours, with thousands of evaluators. Prolific's Managed Services made it possible. Together, they built bespoke evaluation infrastructure and turned a major bottleneck into a fast, repeatable process that sped up development by 10x.

Challenge: Fast, precise factuality testing at scale

Research scientists at the company were developing language models that grew more powerful by the day, approaching human-level performance on complex reasoning tasks.

As the models grew more sophisticated, they presented a familiar challenge: they could generate information that sounded highly plausible but was factually wrong. How do you ensure generated content is accurate and grounded in reliable sources?

The stakes couldn't be higher. Soon, these models would serve millions of users - people looking for everything from medical advice to financial support. Any error pattern missed during development could be repeated millions of times in production.

To catch these nuanced errors, the researchers needed to evaluate thousands of model outputs every day before each new iteration. But standard approaches weren't fast enough for their ambitious development timeline.

The researchers faced three main challenges:

  • Tight turnaround times: The team needed evaluation results within 24 to 48 hours to keep their development pipeline flowing.
  • High precision: Each evaluation task took 20 to 60 minutes of focused attention. Evaluators had to spot nuanced factual errors that automated checks missed.
  • Consistency at scale: Thousands of evaluators needed to apply the same standards of rigor.

Solution: A tailored evaluation infrastructure

The demands of this project were unique, so Prolific's Managed Services team worked with the company to build evaluation infrastructure tailored to its needs.


The right expertise

First, they tapped into Prolific’s pool of thousands of AI Taskers. These specialist participants had already shown strong performance in Prolific's own factuality assessments.

These weren’t general contributors learning on the job. They brought a deep understanding of factual evaluation from day one, which helped the AI team hit the ground running.


Quality calibration

The services team used weekly feedback cycles to raise quality even further, sharing performance insights directly with AI Taskers. They also delivered targeted training programmes aligned to the task types and project goals.

This created a continuous improvement loop: data quality rose week on week, keeping pace with the models' growing complexity.


Integration

Prolific's API-first platform let the researchers connect it directly to their own data collection tooling, making it much easier to push fresh tasks into the evaluation pipeline and deploy them. Time-to-data dropped to hours, enabling rapid iteration.
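The case study doesn't detail the integration itself, but the workflow it describes - programmatically publishing new evaluation batches to participants - can be sketched against Prolific's public study API. The sketch below is illustrative only: the token handling, URLs, field values, and completion-code setup are assumptions for this example, not the team's real configuration, and the exact schema should be checked against current API documentation.

```python
import os

import requests

# Illustrative sketch only: create and publish one evaluation batch as a study.
# Endpoint paths follow Prolific's public study API; every field value below
# is a placeholder, not the configuration used in this project.
API_TOKEN = os.environ["PROLIFIC_API_TOKEN"]
BASE_URL = "https://api.prolific.com/api/v1"
HEADERS = {"Authorization": f"Token {API_TOKEN}"}

draft_study = {
    "name": "Factuality evaluation - batch 42",          # hypothetical batch name
    "description": "Rate model outputs for factual accuracy.",
    "external_study_url": "https://eval.example.com/task?PROLIFIC_PID={{%PROLIFIC_PID%}}",
    "estimated_completion_time": 40,   # minutes; tasks in this project ran 20-60
    "reward": 800,                     # payment per submission, in cents/pence
    "total_available_places": 1000,    # evaluators needed for this batch
    "completion_codes": [
        {"code": "EVAL-DONE", "code_type": "COMPLETED", "actions": []}
    ],
}

# Create the study as a draft...
resp = requests.post(f"{BASE_URL}/studies/", json=draft_study, headers=HEADERS)
resp.raise_for_status()
study_id = resp.json()["id"]

# ...then publish it so eligible participants can start submitting.
requests.post(
    f"{BASE_URL}/studies/{study_id}/transition/",
    json={"action": "PUBLISH"},
    headers=HEADERS,
).raise_for_status()

print(f"Published study {study_id}")
```

In a pipeline like the one described, a script along these lines would generate and publish many such batches automatically as new model outputs arrive, then poll for completed submissions to feed back into training.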

Running thousands of evaluations every day across 1,000+ AI Taskers took more than software alone. Prolific's infrastructure and experienced services team provided the human oversight to make it work.

Real-time monitoring dashboards, dedicated support, and always-on communication channels kept AI Taskers engaged - and turnaround times on track.

Results: From bottleneck to breakthrough

Over the course of three months, the team collected 180,000 high-quality evaluations. These surfaced subtle factual errors that had consistently slipped past automated checks and revealed systematic error patterns. With this insight, the researchers could refine their models and improve the factual accuracy of responses to user queries.

  • Up to 10x faster iteration cycles: What used to take days or weeks can now be completed in 24 to 48 hours, so the team can iterate much faster.
  • Comprehensive coverage: With thousands of evaluations per day, the team can assess models across hundreds of factual claims.

The researchers now have reliable, on-demand access to skilled human evaluation and can maintain a fast development cycle. What had once been a bottleneck became a scalable, repeatable process, enabling quicker product iterations and more confident deployment decisions.

Need rapid evaluation from real people to improve your AI models? Prolific makes it easy to access the expertise and specialized AI Taskers you need to evaluate sophisticated models - rigorously and at speed. Find out more.