How Ai2 built a new benchmark for robotics videos with Prolific’s expert annotators

George Denison | June 23, 2026

The Allen Institute for AI (Ai2) is building some of the most advanced open vision-language models in the world. To do this, they need something no model can generate on its own: high-quality perception data from real human annotators.

With Prolific, they can now collect this data at scale in just a couple of days. This has enabled them to iterate and improve their Molmo2 models and create the first real-world benchmark for robotics videos.

The challenge: Getting ground-truth data from real people at scale

Ai2 is a nonprofit research institute with a founding commitment to openness. Where most AI labs release a finished model, Ai2 publishes the weights, the data, and the entire pipeline so that anyone in the research community can reproduce their work from scratch.

But that commitment goes beyond just releasing the pipeline. It shapes where the data comes from. Many open-weight models rely heavily on synthetic data distilled from proprietary VLMs. This means the community gets a working model, but not the foundational knowledge to build one themselves, which defeats the purpose of Ai2’s open approach. That’s why all of its training data is constructed so the full process can be reproduced and audited from the ground up.

For Jae Sung Park, a Research Scientist on Ai2's Perceptual Reasoning and Interaction Research (PRIOR) team, this meant that if a model was going to understand the visual world, humans had to teach it.

"Models are only as good as the data behind them," Jae Sung explains. "For video understanding and robotics, you simply can't get ground-truth annotations automatically. There's a layer of human perception that is still irreplaceable."

Closing the perception gap

Jae Sung's research sits at the frontier of multimodal grounded reasoning. This involves teaching AI systems to anchor their understanding of an image, or of the frames in a video, to what is actually there. The challenge is more profound than it might appear.

"Coding agents have exploded in capability," Jae Sung notes. "But hand the same model an image and ask something as simple as 'how many objects are there?' and you can't trust it to give accurate responses the way you trust it on code. Current models often can't count up to fifty objects correctly, something most humans do easily."

This is Moravec's paradox in practice. High-level abstract reasoning costs a model relatively little, while low-level perception has been the real bottleneck. Closing that gap requires training data that no web crawl can provide, data from human annotators who have carefully examined images and videos and precisely described what they saw.

Accessing quality annotators for challenging tasks

Jae Sung’s colleagues had used Amazon Mechanical Turk for annotation work on earlier projects, but the experience highlighted real limitations. Worker quality varied, and the platform didn't make it easy to give annotators targeted feedback or iterate on the study design in real time.

When using MTurk to collect annotation data for Action Atlas, a benchmark for video action recognition, the team had to invest significant effort in their own quality checks to verify that the annotations met the benchmark's requirements. "People were careless, and it was hard to enforce the quality standards we required," Jae Sung recalls.

For Ai2's more demanding tasks, they needed something better.

The solution: A scalable pipeline for quality human data

Ai2’s Molmo project, its first flagship open-source vision-language model, needed large volumes of pointing and captioning data from real humans. The team turned to Prolific to collect this data, and the results changed how they thought about annotation at scale.

After Molmo's open-sourced dataset was released, almost every subsequent model trained on it showed improved counting performance. Ai2 had developed a scalable pipeline for the kind of grounded perception data the field lacked.

Dedicated video annotators for object tracking 

Ai2's Molmo2 project, a best paper award candidate at CVPR 2026, extended the model to video understanding.

Jae Sung's focus on this project was object tracking, which involved teaching the model to correctly point to and follow objects as they moved across frames. Annotators were given a video, asked to select objects by clicking on them, write text labels, and track those objects frame by frame, marking when something became occluded or left the scene entirely.
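
To make the shape of this task concrete, here is a minimal sketch of how one annotated object might be recorded. It is an illustration only: the names `Visibility`, `FrameMark`, and `ObjectTrack` are hypothetical and are not taken from Ai2's tooling or from the Molmo2 data format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Visibility(Enum):
    """Hypothetical per-frame states an annotator can assign."""
    VISIBLE = "visible"
    OCCLUDED = "occluded"
    OUT_OF_SCENE = "out_of_scene"


@dataclass
class FrameMark:
    visibility: Visibility
    # Click position in pixels (x, y); None when the object is not visible.
    point: Optional[Tuple[float, float]] = None


@dataclass
class ObjectTrack:
    label: str  # free-text label written by the annotator
    frames: Dict[int, FrameMark] = field(default_factory=dict)

    def mark(self, frame_idx: int, visibility: Visibility,
             point: Optional[Tuple[float, float]] = None) -> None:
        """Record the annotator's decision for one frame."""
        if visibility is Visibility.VISIBLE and point is None:
            raise ValueError("a visible object needs a click position")
        if visibility is not Visibility.VISIBLE:
            point = None  # no position is stored for hidden objects
        self.frames[frame_idx] = FrameMark(visibility, point)

    def visible_frames(self) -> List[int]:
        """Frame indices where the object was marked as visible."""
        return sorted(i for i, m in self.frames.items()
                      if m.visibility is Visibility.VISIBLE)


# Assumed example: a mug that is briefly hidden and then leaves the scene.
track = ObjectTrack(label="red mug")
track.mark(0, Visibility.VISIBLE, (120.0, 88.5))
track.mark(1, Visibility.OCCLUDED)
track.mark(2, Visibility.VISIBLE, (131.0, 90.0))
track.mark(3, Visibility.OUT_OF_SCENE)
print(track.visible_frames())  # [0, 2]
```

Keeping "occluded" and "out of scene" as separate states mirrors the distinction annotators were asked to make, rather than collapsing both into a single "missing" flag.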

While the team had built its own custom group of participants from previous projects, Jae Sung used Prolific's dedicated video annotator pool for this work. The difference was immediately apparent. "The response quality was noticeably higher on average," Jae Sung says. "Having that initial set of high-quality workers is a tangible differentiator."

Point tracking for real-world robotics 

Building on the object-tracking work in Molmo2, Jae Sung turned to robotics, one of the fastest-moving areas of AI research, where a specific kind of annotation was needed: point tracking.

Unlike object tracking, which follows an entire object, point tracking locks onto a single precise location, such as a specific point on a cup handle or an exact joint on a robot arm, and follows it through every frame of a video. This level of granularity matters in robotics because predicted point trajectories have become a key means of guiding robot policies.
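
As a rough illustration of what a point track looks like as data, the sketch below stores one tracked location per frame and derives its frame-to-frame motion. The helper names, the example coordinates, and the clip dimensions are all assumptions chosen for the example, not details from Ai2's pipeline.

```python
from typing import List, Optional, Tuple

# One (x, y) pixel position per frame; None means the point is not visible.
Point = Optional[Tuple[float, float]]


def displacements(track: List[Point]) -> List[Optional[Tuple[float, float]]]:
    """Frame-to-frame motion (dx, dy) of one tracked point.

    A step is None when either of its two frames has no visible point.
    """
    steps: List[Optional[Tuple[float, float]]] = []
    for prev, curr in zip(track, track[1:]):
        if prev is None or curr is None:
            steps.append(None)
        else:
            steps.append((curr[0] - prev[0], curr[1] - prev[1]))
    return steps


def values_per_clip(num_points: int, num_frames: int) -> int:
    """How many numbers are needed to store (x, y) for every point and frame."""
    return num_points * num_frames * 2


# Hypothetical track of a point on a cup handle over four frames.
handle = [(100.0, 50.0), (103.0, 52.0), None, (110.0, 55.0)]
print(displacements(handle))  # [(3.0, 2.0), None, None]

# Assumed clip: 32 tracked points over 100 frames, compared with every
# pixel value of 100 RGB frames at 640x480.
sparse = values_per_clip(num_points=32, num_frames=100)
dense = 640 * 480 * 3 * 100
print(sparse, dense)  # 6400 92160000
```

The size comparison is only meant to show why a handful of tracked points is a much smaller motion description than the raw video it was taken from.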

"Instead of predicting every pixel in a video, you track a sparse set of defined points," Jae Sung explains. "This gives you a compact, data-efficient motion representation that a policy can actually plan against."

Despite the growing use of point tracks in robot learning, there was no benchmark for evaluating point trackers themselves on real manipulation videos. Existing benchmarks used simulation data, where the ground truth is generated automatically. But simulation-trained models often fail when they encounter real-world video, with its natural lighting, varied objects, and unpredictable motion. The sim-to-real gap has been increasingly mitigated in point tracking, but it remains a persistent challenge when generalizing to robotics.

To address it, Jae Sung and Vincent Shao, a student collaborator at the University of Washington, built RoboTrack, a benchmark for point tracking in real-world robotics using video-based annotations sourced from Prolific.

Why point tracking requires high-quality participants 

Point tracking is not a task you can hand to casual workers. Annotators had to identify a specific point on a physical surface and track it with pixel-level accuracy across an entire video sequence. A drift of even a few pixels would corrupt the annotation entirely.
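
One simple way to put a number on drift is to compare two tracks of the same point frame by frame and count how often they agree within a pixel tolerance. The sketch below does that. It is not RoboTrack's published metric or Ai2's quality check; the function names, the example tracks, and the 2-pixel tolerance are assumptions for illustration.

```python
import math
from typing import List, Optional, Tuple

# One (x, y) pixel position per frame; None means the point is not visible.
Point = Optional[Tuple[float, float]]


def pixel_errors(reference: List[Point], candidate: List[Point]) -> List[float]:
    """Euclidean distance per frame, skipping frames where either track
    has no visible point."""
    if len(reference) != len(candidate):
        raise ValueError("tracks must cover the same frames")
    errors = []
    for ref, cand in zip(reference, candidate):
        if ref is None or cand is None:
            continue
        errors.append(math.hypot(cand[0] - ref[0], cand[1] - ref[1]))
    return errors


def fraction_within(errors: List[float], tolerance_px: float) -> float:
    """Share of compared frames whose error is at most tolerance_px."""
    if not errors:
        raise ValueError("no frames where both tracks are visible")
    return sum(e <= tolerance_px for e in errors) / len(errors)


# Hypothetical reference track and a second track of the same point.
reference = [(10.0, 10.0), (12.0, 10.0), None, (20.0, 14.0)]
candidate = [(10.0, 10.0), (15.0, 14.0), (18.0, 12.0), (20.0, 15.0)]

errors = pixel_errors(reference, candidate)
print(errors)  # [0.0, 5.0, 1.0]
print(fraction_within(errors, tolerance_px=2.0))  # 2 of 3 frames agree
```

In this example a single frame that is 5 pixels off is enough to pull agreement down to two frames in three, which is why small drifts matter so much for this kind of annotation.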

"You simply cannot throw this task at anyone," Jae Sung says. "It requires complex spatial reasoning and sustained attention to detail. The more demanding the task, the more you need a platform that can deliver consistently high-quality workers."

Ai2's reputation with Prolific workers, built across multiple projects, also made a difference. Experienced annotators who had worked on earlier Molmo studies already understood the level of precision the team expected.

"That trust accumulates," Jae Sung explains. "You need a sustained track record of projects to build that level of relationship with a worker community. It gave us the confidence to attempt something as ambitious as a real-world point tracking benchmark in the first place."

Building a community around quality 

For each project, the team created a dedicated Discord server, now standard practice at Ai2 whenever a new annotation project begins.

The Discord server became a live quality-control layer. Experienced workers answered recurring questions, helped troubleshoot technical issues, and gave feedback to newcomers. Annotators could review each other's work, score it with comments, and even push back on quality assessments with reasoned explanations, creating a feedback loop that raised standards over time.

"Direct communication at scale is hard," Jae Sung notes. "Having a community where workers can support each other makes a real difference to throughput and quality."

The results: Research the whole AI community can build on

A benchmark that exposed the sim-to-real gap 

Ai2’s benchmark demonstrated something the robotics research community had long suspected but lacked data to prove. Point-tracking models that perform well on simulation benchmarks struggle significantly on real-world video.

The benchmark provided the community with a reliable evaluation tool that simulation-based datasets could not, and the annotation quality from Prolific workers was central to that. "Qualitatively, when we reviewed the point tracks, they were excellent," Jae Sung says.

A track record of research impact 

Projects that have used Prolific-sourced data have had a strong reception in the research community:

  • The Molmo2 open dataset is a best paper award candidate at CVPR 2026.
  • VideoNet, a benchmark for specialized action recognition featuring sports, hobbies, and rare activities, received a spotlight at CVPR 2026. Given just a handful of examples of an obscure action, Prolific workers could reliably discriminate it from others, a real-world demonstration of few-shot visual learning in humans.

Speed without compromise 

Sourcing annotations through traditional channels for tasks this complex would have taken weeks and introduced quality issues that could have invalidated the research.

With Prolific, Jae Sung's team can collect high-quality annotations in days. This enables them to iterate quickly: deploy a study, review results, refine the task design, and run again, all within tight timeframes.

What's next?

Looking ahead, Jae Sung wants to explore more models with domain expertise, trained on ground-truth data that only qualified human annotators can provide.

If you're developing AI models that require high-quality human annotation, evaluation, or alignment, Prolific can help you move faster without compromising on data quality. Learn more about Prolific for AI.