Time after time: How Prolific helped validate an AI temporal reasoning benchmark

Simon Banks
June 23, 2025

Challenge

Understanding how people use and interpret temporal expressions like "recently," "just," or "a long time ago" is deceptively challenging for AI models. Researchers from Bielefeld University and Honda Research Institute Europe set out to create a benchmark, TRAVELER, to test how well large language models (LLMs) resolve different types of temporal references: explicit dates, implicit relative timings, and especially vague temporal phrases. Vague references pose unique difficulties because they rely on subjective human perception rather than concrete definitions.

To address these challenges, the researchers needed:

  • Real human judgments to pin down what vague temporal expressions actually mean.

  • Reliable, high-quality data from participants capable of giving nuanced responses.

  • A practical method for converting subjective human input into clear benchmarks that AI models could be tested against.

Solution

The researchers used Prolific to gather clear, thoughtful feedback from real people about how they understand vague time-related phrases.

Human-validated benchmark creation

Using Prolific, the researchers ran detailed surveys asking participants to rate how appropriately phrases like "just," "recently," and "a long time ago" described events occurring at specific intervals in the past. For instance, participants evaluated statements such as:

 "Tom chatted with a friend in his living room one day ago. Statement: Tom chatted with a friend recently."

The human feedback turned these otherwise ambiguous phrases into precise, human-derived probability distributions, which served as the ground truth for testing AI models.
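To make this concrete, here is a minimal sketch of how per-interval survey judgments could be aggregated into such a probability distribution. The record format, phrase names, and intervals are illustrative assumptions, not the authors' actual pipeline:

```python
from collections import defaultdict

# Hypothetical survey records: (phrase, days since the event,
# whether the participant judged the phrase appropriate).
responses = [
    ("recently", 1, True),
    ("recently", 1, True),
    ("recently", 30, False),
    ("a long time ago", 365, True),
    # ... one record per participant judgment
]

def build_ground_truth(records):
    """For each (phrase, interval) pair, compute the fraction of
    participants who accepted the phrase -- a human-derived
    probability that the phrase fits an event that old."""
    counts = defaultdict(lambda: [0, 0])  # (phrase, days) -> [accepted, total]
    for phrase, days, accepted in records:
        counts[(phrase, days)][0] += int(accepted)
        counts[(phrase, days)][1] += 1
    return {key: acc / total for key, (acc, total) in counts.items()}

ground_truth = build_ground_truth(responses)
print(ground_truth[("recently", 1)])  # 1.0: everyone accepted "recently" at one day
```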

Robust, high-quality participant data

Prolific made it easy to recruit attentive participants who gave thoughtful, detailed feedback. Because the surveys were clear and well-structured, the responses genuinely captured how real people interpret vague expressions.

Effective integration into benchmark testing

The researchers took the data gathered through Prolific and plugged it straight into their evaluation. People's real-world judgments shaped the scoring system, creating a clear way to see how well AI models handled vague temporal expressions.
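As a rough illustration of how human judgments can drive scoring (the paper's exact scoring rule may differ), a model's yes/no answer about a vague phrase can be credited by how closely it matches the human acceptance probability:

```python
def score_vague_answer(model_says_yes: bool, human_prob: float) -> float:
    """Credit a yes/no model answer by the share of humans who agreed.
    Answering 'yes' when 80% of participants accepted the phrase earns
    0.8 credit; answering 'no' in the same case earns only 0.2."""
    return human_prob if model_says_yes else 1.0 - human_prob

# Illustrative numbers, not figures from the paper:
print(score_vague_answer(True, 0.95))   # 0.95 -> aligned with human consensus
print(score_vague_answer(False, 0.95))  # ~0.05 -> contradicts human consensus
```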

Execution

The researchers developed TRAVELER, a synthetic benchmark dataset comprising 3,300 questions covering explicit, implicit, and vague temporal expressions. While explicit and implicit references could be evaluated automatically, vague references were scored against the probabilistic standards established through the Prolific surveys. Four state-of-the-art LLMs were then evaluated on the benchmark.
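Putting the pieces together, an evaluation loop along these lines could route explicit and implicit questions to exact-match checking and vague questions to the human-derived probabilities. The field names and model interface are assumptions for illustration, not the paper's harness:

```python
def evaluate(model, questions, ground_truth):
    """Score a model over mixed question types: exact match for
    explicit/implicit references, probabilistic credit (from the
    survey-derived distributions) for vague ones."""
    total = 0.0
    for q in questions:
        answer = model(q["prompt"])  # hypothetical model call returning a string
        if q["type"] in ("explicit", "implicit"):
            total += float(answer == q["gold_answer"])
        else:  # vague reference: credit by human acceptance probability
            p = ground_truth[(q["phrase"], q["days_ago"])]
            total += p if answer == "yes" else 1.0 - p
    return total / len(questions)

# Usage sketch with a trivial stand-in model:
questions = [
    {"type": "explicit", "prompt": "What happened on May 3?", "gold_answer": "a walk"},
    {"type": "vague", "prompt": "Did Tom chat recently?", "phrase": "recently", "days_ago": 1},
]
print(evaluate(lambda prompt: "yes", questions, {("recently", 1): 0.95}))
```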

Results

The benchmark revealed several key insights:

  • AI model accuracy declined significantly as temporal references moved from explicit dates to implicit and vague expressions. For example, the best-performing model's accuracy dropped from 92% on explicit references to 74% on implicit references and just 45% on vague references.

  • Performance degraded further when models had to reason over longer sequences of events.

  • The probabilistic approach, validated by human survey data from Prolific, clearly exposed the gaps in how current models handle subjective temporal reasoning.

Conclusion

Researchers from Bielefeld University and Honda Research Institute Europe successfully built a robust, human-validated benchmark for temporal reasoning, enabled by Prolific’s reliable and high-quality participant data. Such innovative use of human feedback demonstrates how Prolific can effectively bridge the gap between subjective human perception and measurable AI performance.

For research teams looking to create benchmarks requiring nuanced human judgment, Prolific provides a pathway to accurate, human-validated insights.

Citation

Kenneweg, S., Deigmöller, J., Cimiano, P., & Eggert, J. (2025). TRAVELER: A Benchmark for Evaluating Temporal Reasoning across Vague, Implicit, and Explicit References. arXiv preprint arXiv:2505.01325.

https://arxiv.org/pdf/2505.01325

Research institutions: Bielefeld University, Honda Research Institute Europe.
