Case Studies

How a leading AI company used Prolific to create 43 novel math problems

George Denison | September 29, 2025

A leading AI company needed to test the mathematical reasoning of their frontier AI models. Prolific enabled them to source PhD-level experts who created 43 fully original math problems that could challenge state-of-the-art AI.

Challenge: How do you create original problems for genuine reasoning?

Training AI models to truly reason through complex math requires solving problems they've never seen before. But finding such problems proved nearly impossible. Most training data is contaminated with problems - and solutions - that can already be found online.

Models trained on this data didn't develop genuine problem-solving strategies. Instead, they learned to recognize and memorize patterns from data already available on the internet.

The company needed math problems that would:

  • Push models beyond pattern matching into true reasoning
  • Match the complexity of Humanity's Last Exam (HLE) and American Invitational Mathematics Examination (AIME) problems
  • Be completely original, so models couldn't train on answers that were already on the internet
  • Include detailed solution paths to train chain-of-thought reasoning

Finding the right talent presented more hurdles. The project required PhD-level mathematicians who could create novel problems. But expertise alone wasn't enough. Multiple experts had to agree on subjective criteria like "sufficient difficulty" and "elegant solution paths." Getting consensus among academics with strongly held opinions proved challenging.

Solution: Expert recruitment and a 7-stage validation pipeline

Prolific designed a validation pipeline that balanced scale with rigorous quality control. Rather than choosing between human expertise and automated efficiency, the solution used both.

The process began with careful recruitment and verification of mathematicians with PhD-level expertise. To encourage quality over quantity, Prolific implemented a progressive unlocking system. Contributors started with permission to submit one problem, and could submit more only after proving their ability to deliver quality work.
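
As a simple illustration, a progressive-unlock rule might look like the sketch below. The function name, quota values, and acceptance criterion are assumptions made for the example, not Prolific's actual rules.

```python
def submission_quota(accepted_count: int, rejected_count: int) -> int:
    """Hypothetical progressive-unlock rule for contributors.

    Everyone starts with a single submission slot; more slots open up only
    after submissions have been accepted as quality work.
    """
    if accepted_count == 0:
        return 1  # new contributors may submit one problem
    if rejected_count > accepted_count:
        return 1  # a poor acceptance ratio keeps the quota at the minimum
    return 1 + 2 * accepted_count  # each accepted problem unlocks more slots
```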

The validation pipeline had seven distinct stages:

  1. Initial submission screen: Submissions were screened for completeness and format (presence of question, solution, answer).
  2. Duplicate question detection: Questions were compared against thousands of problems from known math benchmarks. Submissions with high similarity scores were automatically rejected.
  3. Web search check: A web search identified any evidence of pre-existing solutions online. Questions with publicly available solutions were excluded.
  4. LLM solvability test: Questions were tested against multiple frontier models; any question that two or more models could solve was rejected.
  5. LLM authorship detection: An analysis of linguistic patterns assessed whether submissions were likely AI-generated.
  6. Peer review and quality ranking: The contributors themselves reviewed top submissions. LLM-assisted scoring surfaced strong entries.
  7. Final expert review: Trusted reviewers assessed borderline cases to ensure only original, challenging, and well-structured problems remained.

This created a funnel that filtered submissions from the initial automated checks to a final review by human experts. Automated similarity detection caught problems that already existed online, while LLM solvability testing across models found problems that current AI could already solve.
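
For illustration, here is a minimal sketch of how the automated portion of such a funnel (stages 1 to 4) might be composed. All names and thresholds, such as `similarity_to_known_benchmarks`, `found_online`, `count_models_that_solve`, and the 0.9 cutoff, are hypothetical placeholders rather than the client's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    question: str
    solution: str
    answer: str

def passes_automated_checks(sub: Submission,
                            similarity_to_known_benchmarks,
                            found_online,
                            count_models_that_solve,
                            similarity_cutoff: float = 0.9,
                            max_models_allowed: int = 1) -> bool:
    """Return True if a submission survives the automated stages.

    The callables passed in stand for services the pipeline would rely on:
    a similarity scorer over known benchmark problems, a web-search check,
    and a harness that counts how many frontier models solve the question.
    """
    # Stage 1: completeness check - question, solution, and answer must be present.
    if not (sub.question.strip() and sub.solution.strip() and sub.answer.strip()):
        return False
    # Stage 2: reject near-duplicates of known benchmark problems.
    if similarity_to_known_benchmarks(sub.question) >= similarity_cutoff:
        return False
    # Stage 3: reject questions whose solutions already exist online.
    if found_online(sub.question):
        return False
    # Stage 4: reject questions that two or more frontier models can solve.
    if count_models_that_solve(sub.question) > max_models_allowed:
        return False
    return True
```

Submissions that pass these automated gates would then move on to the human stages: peer review, quality ranking, and final expert review.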

Human expertise handled the nuanced judgments that automation couldn't. Three peer reviewers evaluated each submission, assessing originality, difficulty, clarity, and solution elegance. This system ensured that problems met the high standards needed for training frontier models.

Prolific's Managed Services team handled the operation end-to-end, from recruiting participants to delivering validated data. By removing all operational burden, they enabled the client to focus on their core AI development work.

The results

The pilot project delivered 43 novel, verified mathematics problems. Through rigorous validation, each one was confirmed to be 100% human-authored.

By combining advanced tools with hands-on expert management, Prolific built a scalable process for producing original content. This pilot shows that with the right approach, you can create quality training data that pushes AI models to reason, not just memorize.

Need genuine, verified experts to challenge and improve your AI models? Prolific makes it easy to access the expertise you need to evaluate sophisticated models - rigorously and at speed. Find out more.

 

Client anonymity maintained at their request.