Case Studies

How a leading AI company used Prolific to create 43 novel math problems

George Denison | September 29, 2025

A leading AI company needed to test the mathematical reasoning of their frontier AI models. Prolific enabled them to source PhD-level experts who created 43 fully original math problems that could challenge state-of-the-art AI.

Challenge: How do you create original problems for genuine reasoning?

Training AI models to truly reason through complex math requires solving problems they've never seen before. But finding such problems proved nearly impossible. Most training data is contaminated with problems - and solutions - that can already be found online.

Models trained on this data didn't develop genuine problem-solving strategies. Instead, they learned to recognize and memorize patterns from data already available on the internet.

The company needed math problems that would:

  • Push models beyond pattern matching into true reasoning
  • Match the complexity of Humanity's Last Exam (HLE) and American Invitational Mathematics Examination (AIME) problems
  • Be completely original, so models couldn't train on answers that were already on the internet
  • Include detailed solution paths to train chain-of-thought reasoning

Finding the right talent presented more hurdles. The project required PhD-level mathematicians who could create novel problems. But expertise alone wasn't enough. Multiple experts had to agree on subjective criteria like "sufficient difficulty" and "elegant solution paths." Getting consensus among academics with strongly held opinions proved challenging.

Solution: Expert recruitment and a 7-stage validation pipeline

Prolific designed a validation pipeline that balanced scale with rigorous quality control. Rather than choosing between human expertise and automated efficiency, the solution used both.

The process began with careful recruitment and verification of mathematicians with PhD-level expertise. To encourage quality over quantity, Prolific implemented a progressive unlocking system. Contributors started with permission to submit one problem, and could submit more only after proving their ability to deliver quality work.
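
As a simple illustration, a progressive-unlock rule might look like the sketch below. The function name, quota values, and acceptance criterion are assumptions made for the example, not Prolific's actual rules.

```python
def submission_quota(accepted_count: int, rejected_count: int) -> int:
    """Hypothetical progressive-unlock rule for contributors.

    Everyone starts with a single submission slot; more slots open up only
    after submissions have been accepted as quality work.
    """
    if accepted_count == 0:
        return 1  # new contributors may submit one problem
    if rejected_count > accepted_count:
        return 1  # a poor acceptance ratio keeps the quota at the minimum
    return 1 + 2 * accepted_count  # each accepted problem unlocks more slots
```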

The validation pipeline had seven distinct stages:

  1. Initial submission screen: Submissions were screened for completeness and format (presence of question, solution, answer).
  2. Duplicate question detection: Questions were compared against thousands of problems from known math benchmarks. Submissions with high similarity scores were automatically rejected.
  3. Web search check: A web search identified any evidence of pre-existing solutions online. Questions with publicly available solutions were excluded.
  4. LLM solvability test: Questions were tested against multiple frontier models; any question that two or more models could solve was rejected.
  5. LLM authorship detection: An analysis of linguistic patterns assessed whether submissions were likely AI-generated.
  6. Peer review and quality ranking: The contributors themselves reviewed top submissions. LLM-assisted scoring surfaced strong entries.
  7. Final expert review: Trusted reviewers assessed borderline cases to ensure only original, challenging, and well-structured problems remained.

This created a funnel that filtered submissions from the initial automated checks to a final review by human experts. Automated similarity detection caught problems that already existed online, while LLM solvability testing across models found problems that current AI could already solve.
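
For illustration, here is a minimal sketch of how the automated portion of such a funnel (stages 1 to 4) might be composed. All names and thresholds, such as `similarity_to_known_benchmarks`, `found_online`, `count_models_that_solve`, and the 0.9 cutoff, are hypothetical placeholders rather than the client's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    question: str
    solution: str
    answer: str

def passes_automated_checks(sub: Submission,
                            similarity_to_known_benchmarks,
                            found_online,
                            count_models_that_solve,
                            similarity_cutoff: float = 0.9,
                            max_models_allowed: int = 1) -> bool:
    """Return True if a submission survives the automated stages.

    The callables passed in stand for services the pipeline would rely on:
    a similarity scorer over known benchmark problems, a web-search check,
    and a harness that counts how many frontier models solve the question.
    """
    # Stage 1: completeness check - question, solution, and answer must be present.
    if not (sub.question.strip() and sub.solution.strip() and sub.answer.strip()):
        return False
    # Stage 2: reject near-duplicates of known benchmark problems.
    if similarity_to_known_benchmarks(sub.question) >= similarity_cutoff:
        return False
    # Stage 3: reject questions whose solutions already exist online.
    if found_online(sub.question):
        return False
    # Stage 4: reject questions that two or more frontier models can solve.
    if count_models_that_solve(sub.question) > max_models_allowed:
        return False
    return True
```

Submissions that pass these automated gates would then move on to the human stages: peer review, quality ranking, and final expert review.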

Human expertise handled the nuanced judgments that automation couldn't. Three peer reviewers evaluated each submission, assessing originality, difficulty, clarity, and solution elegance. This system ensured that problems met the high standards needed for training frontier models.

Prolific's Managed Services team handled the operation end-to-end, from recruiting participants to delivering validated data. By removing all operational burden, they enabled the client to focus on their core AI development work.

The results

The pilot project delivered 43 novel, verified mathematics problems. Through rigorous validation, each one was confirmed to be 100% human-authored.

By combining advanced tools with hands-on expert management, Prolific built a scalable process for producing original content. This pilot shows that with the right approach, you can create quality training data that pushes AI models to reason, not just memorize.

Need genuine, verified experts to challenge and improve your AI models? Prolific makes it easy to access the expertise you need to evaluate sophisticated models - rigorously and at speed. Find out more.

 

Client anonymity maintained at their request.