How to improve your data quality in online research
Online data collection has revolutionized research. But it does present some new risks when it comes to data quality.
In a lab environment, we can meet our participants face-to-face. We can observe them while they complete the study. Online, things get a bit trickier. How do we know if our participants are paying attention? Are they doing the study properly? Are they who they say they are?
In this article, we’ll explore how to improve data quality in online research by busting bots, cracking down on cheaters, and stopping slackers from spoiling your data. And we’ll explain how Prolific prevents these bad actors from getting into our pool in the first place.
Malicious participants fall roughly into four groups: bots, liars, cheats, and slackers.
Bots are software designed to complete online surveys with very little human intervention. You can often spot them by their random or very low-effort/nonsensical free-text responses. Thankfully, because their answers clearly aren’t human, there are several methods for detecting bots in your data (see Dupuis et al., 2018).
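Methods for detecting bots vary, but as a rough illustration, a few cheap heuristics can flag the most obvious machine-generated or low-effort free-text responses for manual review. The thresholds below are assumptions to tune against your own data, not established cutoffs:

```python
import re

def flag_low_effort(response: str, min_words: int = 3) -> bool:
    """Return True if a free-text response looks bot-like or low-effort.

    Heuristics (a rough sketch, not a substitute for manual review):
    - too short to be a meaningful answer
    - mostly non-alphabetic characters (keyboard mashing, pasted noise)
    - one filler token repeated over and over ("good good good good")
    """
    words = re.findall(r"[a-zA-Z']+", response.lower())
    if len(words) < min_words:
        return True
    # Mostly non-letter characters suggests mashing or pasted junk.
    letters = sum(c.isalpha() for c in response)
    if letters / max(len(response), 1) < 0.5:
        return True
    # Very low vocabulary diversity suggests a repeated filler token.
    if len(set(words)) / len(words) < 0.3:
        return True
    return False
```

Flagged responses shouldn’t be rejected automatically; treat the flag as a prompt to read the response yourself.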
Unfortunately, bad-acting humans can prove trickier to detect…
Liars (or to use the technical term: malingerers) submit false pre-screening information. They do this to access as many studies as possible and maximize how much they can earn.
How badly can this mess up your data? That depends on a few factors.
Cheats are participants who deliberately submit false information within your study itself.
Cheats don’t always mean to be dishonest. Some are confused about what data you’re trying to collect, or fear that your study’s rewards are tied to their performance. Thinking they’ll only be paid if they score 100% on a test, they google the correct answers.
Others might think you only want a certain kind of response (e.g., always giving very positive, enthusiastic answers), or use aids like pen and paper to perform artificially better than they otherwise would.
The final kind of cheat is the participant who doesn’t take your survey seriously, completing it with friends or while drunk.
To clarify: Liars provide false demographic information to gain access to your study. Cheats provide false information within the study itself. A participant can be both a liar and a cheat, but their effects on data quality are different.
The fourth group are slackers. These people aren’t paying attention and are generally not interested in maximizing their earnings. They don’t feel motivated to give you any genuine data for the price you’re paying.
Slackers are a broad group: anyone from participants who don’t read instructions properly to participants who do your study while watching TV. They may input random answers, gibberish, or low-effort free text.
Slackers aren’t always dishonest. Some may just consider the survey reward too low to be worth their full attention.
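One common slacker pattern in grid-style questions is “straight-lining”: picking the same option for every item. A minimal screening sketch; any cutoff you apply is your own judgment call (and worth preregistering):

```python
def straightlining_score(likert_answers: list[int]) -> float:
    """Fraction of consecutive Likert items given the identical answer.

    A score near 1.0 means the participant picked the same option down
    an entire grid -- a common slacker pattern. The threshold is up to
    you and should be preregistered; 0.9 is illustrative only.
    """
    if len(likert_answers) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(likert_answers, likert_answers[1:]))
    return repeats / (len(likert_answers) - 1)
```

Note that a genuinely consistent participant can legitimately score high on a short grid, so use this alongside other signals rather than on its own.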
These groups do overlap. A liar can use bots, slackers can cheat, etc. Most bad actors don’t care how they earn rewards, so long as they’re maximizing their income!
So, what can you do about it?
We’ve banned our fair share of malicious accounts and learned a thing or two along the way. The tips below aren’t exhaustive but will give you some practical advice for designing your study and screening your data that will boost your confidence in the responses you collect.
We also constantly analyze the answer sets of our participants to spot unusual combinations, impossible answers, and other tell-tale signs of malingering.
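You can run similar consistency checks on your own screener data. The field names and rules below are hypothetical examples for illustration, not Prolific’s actual checks; adapt them to whatever your screener collects:

```python
def impossible_combinations(profile: dict) -> list[str]:
    """Flag internally contradictory pre-screening answers.

    The fields and rules here are illustrative assumptions only --
    substitute the variables and logic from your own screener.
    """
    flags = []
    age = profile.get("age")
    if age is not None:
        # Assumes a minimum driving age of ~16 for illustration.
        if profile.get("years_driving", 0) > max(age - 16, 0):
            flags.append("more years driving than legally possible")
        if profile.get("years_in_current_job", 0) > max(age - 14, 0):
            flags.append("implausible job tenure for reported age")
    if profile.get("smoker") == "never" and profile.get("cigarettes_per_day", 0) > 0:
        flags.append("non-smoker reporting daily cigarettes")
    return flags
```

A handful of rules like these won’t catch careful liars, but they cheaply surface the careless ones.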
As simple as it sounds, it’s been suggested that you add a free-text question at the end of your study: “Did you cheat?”
No matter how you clean your data, we strongly recommend that you preregister your data-screening criteria. This will increase reviewer confidence that you haven’t p-hacked.
If this seems overwhelming, don’t worry! We’re doing a lot of work on our side to improve the quality of the pool and ensure researchers connect with honest, attentive, and reliable participants on Prolific.
Firstly, we confirm that a participant is who – and where – they say they are. For every person who enters our pool, we verify their:
A participant doesn’t start getting studies until we’ve verified the first three things on this list.
We also check for some more technical things, like:
We maintain a strict list of trusted ISPs. Some ISPs carry a high risk of VPN or proxy usage, which lets people browse the internet anonymously, so we don’t allow them on Prolific.
It’s critical that our participants are who they say they are.
Once we verify a participant, we invite them to complete a test study. Here, we check them for attention and comprehension. The study requires them to write a short story about superheroes.
If the story is meaningful and makes sense, they can do more studies.
When you approve a submission, the participant gets paid and you get the quality data you need. Happy days!
If you reject a submission, however, that rejection is recorded on the participant’s account. Too many rejections, and they can’t take part in any more studies.
If you’re finding the data from some participants to be unusable, or suspect something fishy is going on with duplicate accounts, you can report them to us. You can do this with our in-app reporting function, or via our support request form.
We then review the account and decide if a ban is suitable. Of course, feedback is a two-way street. Participants can also report researchers if they feel they’re not being treated fairly.
These measures ensure you get high-quality data from a panel of the most honest, attentive, and reliable participants.
A critical factor in determining data quality is the study’s reward. A recent study of Mechanical Turk participants concluded that fair pay and realistic completion times had a large impact on the quality of data they provided.
On Prolific, that trust goes both ways, and properly rewarding participants for their time is a large part of it. We enforce a minimum hourly reward of £6/$8. But depending on the effort required by your study, this might not be enough to foster high engagement and good data quality. Consider whether your reward matches the effort your study demands.
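As a quick sanity check before launching, you can estimate the hourly rate your reward implies. The £6/hour floor comes from the minimum stated above; the higher default target is an assumption you should adjust to your study’s difficulty:

```python
def effective_hourly_rate(reward_gbp: float, median_minutes: float) -> float:
    """Reward per hour implied by the median completion time."""
    return reward_gbp * 60.0 / median_minutes

# Prolific's stated floor is £6/hour; the target below is an
# illustrative assumption, not an official recommendation.
MINIMUM_GBP_PER_HOUR = 6.0

def reward_is_fair(reward_gbp: float, median_minutes: float,
                   target_gbp_per_hour: float = 9.0) -> bool:
    """True if the implied hourly rate clears both the floor and your target."""
    rate = effective_hourly_rate(reward_gbp, median_minutes)
    return rate >= max(MINIMUM_GBP_PER_HOUR, target_gbp_per_hour)
```

For example, a £1.50 reward for a study with a 10-minute median completion time implies £9/hour. Re-run the check against your pilot’s actual median time, since studies often take longer than researchers expect.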
Ultimately, participants are responsible for the quality of the data they provide. But you as the researcher need to set them up to do their best.
Firstly, talk to us. We’ll always ban participants using bots or lying in their pre-screeners. You should reject submissions where you believe this to have occurred and send us any evidence you’ve gathered. Data quality is our top priority, so please reach out to us if you have any concerns, queries, or suggestions.
In cases of cheating or slacking, we ask that you give participants some initial leeway. If they’ve clearly made some effort or attempted to engage with the task for a significant period but their data isn’t good enough, then consider approving them, but excluding them from your analysis. If the participant has clearly made little effort, failed multiple attention checks, or has lied their way into your study, then rejection is appropriate. Please read our article on valid and invalid rejection reasons for more guidance.
You can also learn more about how to improve data quality in research in The Complete Guide to Improving Data Quality in Online Surveys.