
How Prolific used Argilla to create an RLHF dataset on social reasoning

January 12, 2024

Reinforcement Learning from Human Feedback (RLHF) datasets are critical for helping AI models learn how humans behave. But when we looked for RLHF datasets that capture social reasoning earlier in 2023, they were hard to find.

So, we decided to develop and release our own open-source dataset that the community can use to fine-tune models. Integrating Argilla as a data collection tool, we ran a two-part study on Prolific to capture this data and released it on HuggingFace.

Here’s how we did it.

The goal

We went into this project seeking answers to some key questions:

  • Can we use labels provided by Prolific participants for an RLHF dataset?
  • Do the models that are fine-tuned on this dataset perform better on social reasoning tasks?
  • What does a seamless integration between Prolific and Argilla look like?

With this in mind, we set two goals:

  1. Release a high-quality RLHF dataset, labelled by real people, ethically.
  2. Define the steps to develop this dataset on the Prolific platform - and create a guide to help other people replicate it.

The studies

We needed to design and run two studies on Prolific, tapping into our pool of 120k global taskers, and collect data using Argilla. We also had to validate, clean, and prepare the data, ready to release on HuggingFace.

Designing the studies

Firstly, we created a set of questions for the studies. These questions were designed to help us understand different aspects of human behaviour in social situations and environments. Topics covered everything from ethics and moral judgement to social responsibility and communication skills.

In both studies, the taskers we sampled fell into two groups:

  • Group 1: Taskers from all demographic groups who had a 100% approval rate and had completed 250 or more studies on Prolific.
  • Group 2: Taskers that were shortlisted for AI studies.

In the second study, we filtered out the taskers that took part in the first study.

Study 1: Writing responses

In the first study, we asked around 400 Prolific taskers to give written responses to our questions. Taskers were shown a question and wrote their response in a text box. For each question, we gathered four responses.

We gave them instructions and guidelines that outlined what we expected and the key principles to keep in mind when writing their answers. For example, we asked for their responses to be respectful, honest, and authentic, and to present views in neutral language.

In total, we asked taskers 1,000 questions. Some of these included:

  • When interacting with someone who is upset, how do you approach the situation?
  • How do you decide when to share a personal story during a conversation?
  • How would you explain empathy to a young child?

We also asked the participants to rate the quality of the questions on a scale of 1 to 5. The mean quality score was 4.08 and the median score was 4.

Study 2: Rating responses

In the second study, we asked taskers to rate the human-written responses we collected from the first study. Much like the first study, we gave instructions and guidelines that outlined what we expected.

The ratings study used a pairwise comparison method. We showed taskers two responses at a time and asked them to rate the responses on a scale of 1-8.

We collected three responses per question, so we could resolve disagreements between participants.

Collecting the data

With the studies complete, the next step was to collect the data using Argilla, which we deployed on HuggingFace Spaces.

Integrating Prolific and Argilla

To integrate Prolific and Argilla, we used Appsmith, which served as a user portal. When a tasker signed up for the study, they were redirected to Appsmith and received their login details for Argilla.

After taskers finished, we collected the data from their workspaces, made a dataset for each study, and processed both sets to get an average rating for each pair of responses.
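As a rough illustration of the averaging step, the sketch below groups individual ratings by response pair and takes the mean. It is not the code used in the study. The record layout (the `pair_id`, `tasker`, and `rating` keys) is hypothetical, and it assumes each tasker gives one rating on the 1-8 scale per pair.

```python
from collections import defaultdict
from statistics import mean


def average_pair_ratings(records):
    """Return the mean rating for each response pair.

    `records` is an iterable of dicts with the hypothetical keys
    'pair_id' and 'rating' (one rating per tasker per pair).
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["pair_id"]].append(rec["rating"])
    return {pair_id: mean(ratings) for pair_id, ratings in grouped.items()}


# Made-up example data: three taskers rate the first pair, one rates the second.
example = [
    {"pair_id": "q1-a-b", "tasker": "t1", "rating": 2},
    {"pair_id": "q1-a-b", "tasker": "t2", "rating": 3},
    {"pair_id": "q1-a-b", "tasker": "t3", "rating": 4},
    {"pair_id": "q2-a-b", "tasker": "t1", "rating": 7},
]
averages = average_pair_ratings(example)
```

Averaging is only one way to combine several ratings. A median or a majority vote would be reasonable alternatives if outlier ratings are a concern.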

Validating the data

Every dataset was carefully validated. We checked each workspace to ensure that we collected the right number of responses and ratings.
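A count check of this kind could look something like the sketch below. It is an illustration under assumptions rather than the validation code used in the study: the `question_id` key and the function name are hypothetical, and the expected count of four comes from the four responses gathered per question in the first study.

```python
from collections import Counter


def find_count_mismatches(items, key, expected, all_keys=()):
    """Return {key_value: actual_count} for every group whose size
    differs from `expected`.

    `all_keys` optionally lists every key that should be present, so
    that groups with no items at all are reported with a count of 0.
    """
    counts = Counter(item[key] for item in items)
    for k in all_keys:
        counts.setdefault(k, 0)
    return {k: n for k, n in counts.items() if n != expected}


# Made-up example: q1 has four responses, q2 has three, q3 has none.
responses = (
    [{"question_id": "q1", "text": "..."}] * 4
    + [{"question_id": "q2", "text": "..."}] * 3
)
mismatches = find_count_mismatches(
    responses, key="question_id", expected=4, all_keys=["q1", "q2", "q3"]
)
```

The same function could be pointed at the ratings data by changing `key` and `expected`.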

We passed the responses through a hate speech classifier model. This detected some responses which contained biased and/or harmful language. We left these responses in the dataset, so that models can learn to differentiate between biased/harmful and unbiased/harmless language.
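The "flag but keep" approach can be sketched as follows. This is a minimal illustration, not the study's pipeline: the article does not name the classifier, so `classify` here is any callable that returns a score between 0 and 1, and the toy keyword scorer, the `<slur>` placeholder token, and the 0.5 threshold are all assumptions made for the example.

```python
def flag_harmful(responses, classify, threshold=0.5):
    """Annotate every response with a harm score and a flag.

    No response is dropped: flagged items stay in the output so the
    dataset still contains both harmful and harmless examples.
    """
    annotated = []
    for text in responses:
        score = classify(text)
        annotated.append(
            {"text": text, "harm_score": score, "flagged": score >= threshold}
        )
    return annotated


def toy_classifier(text):
    """Stand-in scorer for the example only; a real classifier model
    would replace this function."""
    return 1.0 if "<slur>" in text else 0.0


sample = ["I would listen first and stay calm.", "They are all <slur>."]
annotated = flag_harmful(sample, toy_classifier)
```

Keeping the score alongside the flag lets anyone using the data choose a different threshold later.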

With the data gathered and validated, we published the dataset on HuggingFace for the community to use.

Get high-quality, human-powered AI data - at scale

Prolific makes ethical research into next-gen AI possible with our global pool of 120k+ taskers.

To find out more about how we conducted these studies using Argilla, including our guidelines for taskers and the code snippets we used, check out our full methodology.

Want to run AI tasks with your own integrations? Sign up today and get started in 15 minutes.
