Case Studies

How Columbia Business School created the most extensive public dataset for digital twin research

Jasmehr Bhatia
November 17, 2025

Columbia Business School recruited over 2,000 participants to test how closely AI-generated demographic 'digital twins' reflect real human populations. Prolific's exceptional retention rates made it possible to run 19 studies with the same participant pool, creating the most comprehensive evaluation of digital twin accuracy to date.

The challenge: Testing whether AI-generated digital twins can replicate human behavior

Dr. Olivier Toubia, Glaubinger Professor at Columbia Business School, and his team of 22 co-authors wanted to answer a pressing question in AI research: Can large language models create accurate "digital twins" of real people?

Digital twins are AI-generated personas built from individual-level data that goes beyond simple demographics. Unlike the generic synthetic data commonly used in research, they are trained on detailed individual profiles to simulate how specific people might respond to surveys and studies.

The promise was compelling. If digital twins could accurately predict human responses, they could offer researchers a faster, more cost-effective alternative to traditional data collection. Some studies[1][2] were already claiming 85% to 90% accuracy rates.

But Toubia and his colleagues were skeptical - accuracy claims like these were difficult to compare or verify without a common standard. They wanted to create a transparent, shared benchmark that would let researchers rigorously evaluate different ways of creating synthetic data.

The team faced three major challenges:

Recruiting a representative sample at scale: They needed over 2,000 US participants to answer more than 500 questions covering demographics, personality, cognitive profiles, and economic preferences.

Maintaining engagement across multiple waves: To build accurate digital twins and then test them, participants would need to complete four separate waves over several weeks.

Testing across diverse research domains: After creating the twins, the team planned to run 19 different studies spanning political opinions, privacy concerns, pricing, and news consumption to see where digital twins performed well and where they fell short.

Traditional panel providers couldn't deliver the combination of scale, retention, and data quality the research demanded.

The solution: Prolific's platform delivered high retention and quality data across 19 studies

Toubia chose Prolific because it's widely recognized as a dependable source for high-quality research participants. The platform's reputation for reliability and its global pool of 200k+ participants made it the natural choice for this ambitious multi-wave study.

Seamless multi-wave recruitment with 82% retention

The research required participants to return multiple times over several weeks. Toubia initially worried about significant dropout between waves, wondering if starting with 2,500 people might leave only 1,000 by the end.

Those concerns proved unfounded. Starting with 2,500 participants in wave one, the team retained 2,058 through all four waves - an 82% retention rate. Over 1,700 then participated in at least one of the 19 follow-up studies, and the team planned to run further studies with the same engaged participant base.
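
For teams planning a similar multi-wave design, the retention figure above is simply the share of wave-one participants who completed every subsequent wave. Here is a minimal sketch in Python, using made-up participant IDs rather than the study's actual data:

```python
# Minimal sketch of tracking retention across survey waves.
# The participant ID sets below are illustrative placeholders,
# not the study's data.
wave_ids = {
    "wave_1": {"p001", "p002", "p003", "p004"},
    "wave_2": {"p001", "p002", "p004"},
    "wave_3": {"p001", "p002", "p004"},
    "wave_4": {"p001", "p002", "p004"},
}

# Participants retained through every wave are those present in all ID sets.
retained = set.intersection(*wave_ids.values())
retention_rate = len(retained) / len(wave_ids["wave_1"])

print(f"Retained {len(retained)} of {len(wave_ids['wave_1'])} participants "
      f"({retention_rate:.0%})")

# With the figures reported here: 2,058 of 2,500, i.e. 2058 / 2500 ≈ 82%.
```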

Prolific's platform features made the complex multi-wave design manageable:

Bonus incentives for completion: Participants who completed all four waves received a bonus, encouraging continued engagement.

Workspace collaboration: With Prolific's workspace feature, co-authors could fund their own studies while maintaining consistency across the research program.

Easy participant re-contact: The Prolific platform made it simple to invite the same participants back for subsequent waves and studies.

Quality data from engaged participants

The Prolific experience was remarkably smooth. While the team faced technical challenges on the survey platform side and in feeding data to the large language models, Prolific consistently delivered quality participants who provided thoughtful responses. 

Because the research compared digital twins to genuine human behavior, the reliability and authenticity of Prolific’s participant responses were essential. Each participant represented the ‘ground truth’ against which AI models were measured. High-quality, human-generated data ensured that any differences in outcomes reflected the models’ performance, not inconsistencies in the human sample.

The results: Comprehensive insights into digital twin accuracy and limitations

Over several months, the team collected more than 180,000 responses across the initial profiling waves and the 19 follow-up studies, creating the most extensive publicly available dataset for evaluating digital twins.

Key findings on digital twin performance

The research revealed both the promise and limitations of current digital twin technology:

~75% accuracy overall: Digital twins achieved approximately 75% accuracy in predicting exact human responses when aggregated across all outcomes tested in the 19 studies (an illustrative calculation of this kind of aggregate follows this list).

Better at showing relative differences than absolute accuracy: Digital twins could distinguish between different types of people—for instance, identifying that Person A would likely respond differently than Person B. But they didn't improve the ability to predict Person A's exact answer or calculate accurate population-level statistics compared to basic demographic personas.

Domain-specific performance: Twins performed better in some research areas than others. They showed promise in social contexts but struggled in political domains and when evaluating content or performing annotation-type tasks.

Demographic and education bias: Digital twins were more accurate for participants with higher education levels, higher incomes, and moderate political views. This mirrors existing biases in social science research and raises important questions about the representativeness of AI-generated data.
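
To make the first finding above concrete, here is a small sketch of how an aggregate exact-match accuracy could be computed by pooling twin predictions and human responses across outcomes. The column names and values are hypothetical, not drawn from the study's data.

```python
import pandas as pd

# Illustrative sketch: exact-match accuracy of twin predictions against
# human responses, pooled across outcomes. All values are hypothetical.
data = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "outcome":     ["q1", "q2", "q1", "q2", "q1", "q2"],
    "human":       [3, "agree", 5, "disagree", 3, "agree"],
    "twin":        [3, "agree", 4, "disagree", 3, "neutral"],
})

# A prediction counts as correct only if it matches the human response exactly.
data["correct"] = data["human"] == data["twin"]

# Aggregate accuracy across all (participant, outcome) pairs ...
overall = data["correct"].mean()

# ... and per outcome, to see where twins do better or worse.
per_outcome = data.groupby("outcome")["correct"].mean()

print(f"Overall exact-match accuracy: {overall:.0%}")
print(per_outcome)
```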

A public dataset advancing the field

The team made their complete dataset and code publicly available on Hugging Face. "We shared this data on Hugging Face a few months ago," Toubia explained. "And we're already seeing quite a bit of adoption. Academics are definitely interested in the data."

The dataset's breadth makes it valuable even beyond synthetic data research. Researchers can explore correlations between hundreds of different measures spanning personality, cognition, economics, and behavior.
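
Because the dataset is hosted on Hugging Face, it can typically be pulled with the datasets Python library. The repository ID below is a placeholder, not the study's actual identifier, so substitute the one the team published.

```python
# Sketch of loading a Hugging Face dataset with the `datasets` library.
# "org/twin-benchmark" is a placeholder, not the study's repository ID.
from datasets import load_dataset

ds = load_dataset("org/twin-benchmark")  # downloads and caches the data locally

# Inspect the available splits and columns before starting an analysis.
print(ds)
print(ds["train"].column_names)  # assumes a "train" split exists
```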

Setting a rigorous standard for synthetic data research

This research provides a transparent benchmark for an increasingly important question: When should researchers use real humans, and when might synthetic data suffice?

The findings suggest that while digital twins show promise and should be researched further, they're not ready to replace real human participants for most research applications. "We would not advise companies to deploy this technology without testing in their specific use case," Toubia cautioned.

The research also highlights something Prolific has long advocated: you need real, engaged humans to get accurate data. The retained participant pool from Prolific made it possible to benchmark synthetic approaches against genuine human responses across diverse research questions.

The result is a landmark study that advances our understanding of AI's capabilities and limitations while demonstrating what's possible when researchers have access to a reliable, engaged participant pool.

Need to recruit participants for longitudinal research or benchmark synthetic data against real human responses? Prolific's platform makes it easy to maintain engagement across multiple waves while delivering the data quality your research demands.

Get started for free →