The gender bias that won't die: What 1,000 participants revealed about online group dynamics

Jasmehr Bhatia

March 5, 2026

As AI gets embedded in workplace collaboration tools, hiring platforms, and online decision-making spaces, we're facing a fundamental design question: should these systems mirror human behavior, or should they try to improve it?

This is not an abstract problem. The answer determines whether these systems amplify our biases or help us overcome them. And before we can answer that question, we need to understand how bias actually operates in online group settings.

In our recent conversation for the Frontier AI series, Crystal Qian walked us through a study that gets at the heart of this tension. She and her colleagues at Google DeepMind, working with researchers from the Paris School of Economics, set out to replicate a classic finding about gender bias in leadership selection. But they did it online, at scale, with nearly 1,000 participants recruited through Prolific.

The results challenge some comfortable assumptions about online collaboration. They also reveal something unexpected about how AI systems handle bias depending on how much information they can see.

A 50-year-old problem moves online

The "Lost at Sea" task has been used in psychology research since the 1970s. The setup is straightforward: you're stranded at sea with a group of people. You've salvaged 10 items from the wreckage. As a group, you need to rank these items by how critical they are to survival, then elect someone to make the final rankings on behalf of everyone.

There's a right answer. Survival experts have established which items matter most. And there's a measurable outcome: groups can elect leaders who perform well on the task, or they can elect leaders who don't.

Decades of research using this task found a persistent pattern. Male participants were elected to leadership roles at significantly higher rates than female participants. This happened despite no measurable difference in how well men and women actually performed when ranking the survival items individually.

But nearly all of these studies were conducted in person, in lab settings, with small samples. Crystal wanted to know if the pattern would hold in online environments, where interactions happen through text and physical presence disappears. And she wanted to test a specific intervention: what if you removed all visible gender cues? Would that level the playing field?

"Ultimately, we want to study whether large language models can improve group outcomes," Crystal explained. "And for that to happen, we need a few pillars of research. We need to see whether large language models can notice these behaviors in collective dynamics, before we can then facilitate and improve these outcomes."

The study: Two conditions, 1,000 participants

Crystal's team recruited participants through Prolific from the US and UK. After accounting for attrition (which is always higher in group studies), they ended up with 748 people arranged into groups of four. Each group was intentionally balanced: two participants who identified with he/him pronouns, and two who didn't.

Participants were randomly assigned to one of two conditions, and this is where the experimental design gets interesting.

In the identified condition, participants created their own profiles. They chose their display name, selected an avatar, and indicated their pronouns. Think of this as analogous to a typical Zoom meeting or Slack workspace, where everyone's identity is visible and self-presented.
In the pseudonymous condition, participants were assigned random, gender-neutral animal identities. You might be Aardvark, Cat, Dog, or Fox. No names, no pronouns, no demographic information at all. Just an animal avatar representing you in the group discussion.

The task itself proceeded in stages. First, everyone individually ranked pairs of survival items: would you rather have a floating seat cushion or a mirror? Nylon rope or mosquito netting? These initial rankings established each person's baseline judgment.

Then came a 20-minute group discussion. Participants could see how others had ranked items and deliberate about which choices made the most sense. The platform (Deliberate Lab, an open-source tool Crystal's team built specifically for this kind of research) allowed real-time chat while tracking who said what and when.

After the discussion, each participant rated their willingness to become the group leader on a scale from 0 to 10. The two people with the highest self-nomination scores became the candidates. Then everyone voted using ranked-choice voting to elect one person as the leader.
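
The nomination and voting stages can be sketched in a few lines of Python. This is an illustration only, not the Deliberate Lab implementation: the function names, the ballot format, and the tie-breaking rule (alphabetical by display name) are assumptions made for the example. With only two candidates on the ballot, ranked-choice voting reduces to counting which candidate each voter ranked higher.

```python
from collections import Counter

def pick_candidates(willingness: dict[str, int], n: int = 2) -> list[str]:
    """Return the n members with the highest self-nomination scores (0-10).

    Ties are broken alphabetically by name. That rule is an assumption for
    this sketch; the article does not describe the real tie-breaking rule.
    """
    ranked = sorted(willingness, key=lambda name: (-willingness[name], name))
    return ranked[:n]

def elect_leader(ballots: list[list[str]], candidates: list[str]) -> str:
    """Elect the candidate that the most ballots rank highest.

    Each ballot is one voter's ranking, most preferred first. Names that are
    not candidates are skipped. Ties are broken alphabetically (assumed).
    """
    first_choices = Counter()
    for ballot in ballots:
        top = next((name for name in ballot if name in candidates), None)
        if top is not None:
            first_choices[top] += 1
    return min(candidates, key=lambda name: (-first_choices[name], name))

# Hypothetical group using the pseudonymous animal identities.
willingness = {"Aardvark": 8, "Cat": 3, "Dog": 6, "Fox": 5}
candidates = pick_candidates(willingness)
ballots = [
    ["Dog", "Aardvark"],
    ["Aardvark", "Dog"],
    ["Dog", "Aardvark"],
    ["Dog", "Aardvark"],
]
leader = elect_leader(ballots, candidates)
print(candidates, leader)
```

In this toy group the member most willing to lead reaches the ballot but loses the vote, which shows that the two stages are separate filters.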

Here's the crucial part: the elected leader then completed a new survival ranking task on behalf of the group. Five different pairs of items, same concept. That leader's performance determined the bonus payout for everyone in the group.

But here's what made the study design particularly clever. Every participant, not just the elected leader, completed the same representative task individually. This meant researchers could look back and identify who actually performed best in each group. They could measure not just who got elected, but who should have been elected if groups were optimizing purely for task performance.

This setup allowed Crystal's team to measure two distinct types of bias: whether people put themselves forward for leadership (self-nomination), and whether groups selected the best-performing candidate when they voted (peer selection).

What they measured

The researchers tracked several outcomes, but three stand out as particularly important.

First, task performance: how accurately did each person rank survival items compared to expert judgments? This established ground truth. If there's no gender difference in who's actually good at the task, then any gender difference in who gets elected represents bias rather than accurate evaluation.
Second, self-nomination scores: how willing was each person to lead? This captures what researchers call self-exclusion. Even if someone is perfectly qualified, if they don't put themselves forward, they can't be elected.
Third, election outcomes: who actually got chosen? And critically, how did that choice compare to optimal selection? The researchers calculated an "optimal leader gap" measuring how much worse the elected leader performed compared to the best person in each group.

They also collected rich qualitative data. How confident did people feel in their choices? How did they perceive their fellow group members? Who did they think would get elected, and why?

After gathering all this human data, Crystal's team did something unusual. They created AI versions of each participant using large language models (Gemini, GPT, and Claude). These AI agents were given the same demographic information and survey responses as their human counterparts. They reviewed the same group discussion transcripts. And they made the same leadership decisions.

This allowed for a direct comparison: when AI systems role-play as humans in group settings, do they reproduce human biases, or do they behave differently?

We'll get to those AI findings. But first, what did the human participants do?

What they found

The first finding confirmed what researchers expected: there was no gender difference in task performance. Men and women ranked survival items with equal accuracy. When you looked at who actually knew which items mattered most for survival, gender didn't predict anything.

This established the baseline. If groups were evaluating leadership ability purely on task competence, you'd expect men and women to be elected at roughly equal rates. That's not what happened.

In the identified condition, where participants created their own profiles with names and pronouns, males were elected as leaders 65% of the time. That's 57 out of 88 groups. A clear majority, and statistically significant.

But here's where the pseudonymous condition gets interesting. When participants were assigned gender-neutral animal identities instead, male election dropped to 55%. That's 54 out of 99 groups. Still slightly above half, but no longer statistically different from what you'd expect by chance.
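
The percentages and the significance claims above can be checked from the reported counts. The sketch below uses an exact two-sided binomial test against a 50% null. That choice is an assumption: the article does not say which test the authors ran, and the function name is made up for illustration.

```python
from math import comb

def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial p-value against a 50% null.

    With a 50% null the distribution is symmetric, so the two-sided p-value
    is twice the tail probability on the more extreme side, capped at 1.
    """
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)

# (male leaders, groups) as reported for each condition.
conditions = {"identified": (57, 88), "pseudonymous": (54, 99)}

for label, (males, groups) in conditions.items():
    rate = males / groups
    p = binomial_two_sided_p(males, groups)
    print(f"{label}: {rate:.1%} male leaders, p = {p:.3f}")
```

Under this test the identified condition falls below the conventional 0.05 threshold and the pseudonymous condition does not, which matches the description above.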

The intervention worked. Removing visible gender cues reduced the bias in who groups chose to lead them. But that's only half the story.

Both conditions, identified and pseudonymous alike, showed the same pattern in self-nomination. Males rated their willingness to lead significantly higher than non-males. They spoke more during group discussions. They expressed more confidence in their survival rankings. And when it came time to put themselves forward for the leadership role, they did so more readily.

This is what researchers call self-exclusion. Even when someone is perfectly qualified for a role, if they don't nominate themselves, they can't be selected. And in both conditions, this gender gap in self-nomination persisted.

So pseudonymity changed how groups evaluated candidates. It didn't change how people evaluated themselves.

Crystal described seeing this pattern play out in the discussion transcripts. "One thing we noticed in conversations and in qualities of chat stages in particular is that first chat message really sets the tone. If the first person to message adds a very thoughtful message, other people will match that tone and engage, and we'll see a better quality transcript. However, if the first person to message is very breezy with their response, we see a little less engagement, less turns, things like that."

The gender dynamics played out not through explicit statements about who should lead, but through patterns of participation. Who spoke first. Who spoke most. Who expressed certainty versus hedging their judgments.

Two types of exclusion

To understand why this matters for group performance, the researchers measured something they called the optimal leader gap. This is the difference between how well the elected leader actually performed on the task and how well the best person in the group performed.

In the identified condition, this gap averaged 14.5%. In other words, groups were consistently leaving substantial performance on the table because they weren't electing their best member.

The researchers broke this gap down into two components:

The first component is self-exclusion. This measures cases where the best-performing person in the group didn't even make it onto the ballot because they didn't nominate themselves strongly enough. In the identified condition, this accounted for 5.0 percentage points of the total gap. In the pseudonymous condition, it actually increased slightly to 6.5 percentage points.
The second component is peer-exclusion. This measures cases where the best person did make it onto the ballot but the group voted for someone else anyway. In the identified condition, this accounted for 9.5 percentage points. In the pseudonymous condition, it dropped to 4.2 percentage points.
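
One way to read this decomposition: the total gap is the distance from the group's best member to the elected leader, split at the best candidate who made it onto the ballot. The sketch below assumes exactly that definition, with scores counted as correct answers out of five item pairs. The paper's precise formula may differ, and the function and field names are hypothetical.

```python
def leader_gap(correct: dict[str, int], candidates: list[str], leader: str,
               n_pairs: int = 5) -> dict[str, float]:
    """Split one group's optimal leader gap into self- and peer-exclusion.

    All returned values are percentage points of task accuracy.
    """
    best_overall = max(correct.values())
    best_on_ballot = max(correct[name] for name in candidates)
    to_points = 100 / n_pairs
    self_exclusion = (best_overall - best_on_ballot) * to_points
    peer_exclusion = (best_on_ballot - correct[leader]) * to_points
    return {
        "self_exclusion": self_exclusion,
        "peer_exclusion": peer_exclusion,
        "total": self_exclusion + peer_exclusion,
    }

# Hypothetical group: Cat scored best but did not reach the ballot,
# and the group elected the weaker of the two candidates.
example = leader_gap(
    correct={"Aardvark": 4, "Cat": 5, "Dog": 3, "Fox": 2},
    candidates=["Aardvark", "Dog"],
    leader="Dog",
)
print(example)

# Averages reported in the article, in percentage points.
reported = {
    "identified": {"self_exclusion": 5.0, "peer_exclusion": 9.5},
    "pseudonymous": {"self_exclusion": 6.5, "peer_exclusion": 4.2},
}
for condition, parts in reported.items():
    total = parts["self_exclusion"] + parts["peer_exclusion"]
    print(f"{condition}: total gap ~ {total:.1f} points")

peer_drop = 1 - (reported["pseudonymous"]["peer_exclusion"]
                 / reported["identified"]["peer_exclusion"])
print(f"peer-exclusion fell by {peer_drop:.0%}")
```

The identified components add up to the 14.5% gap quoted above. The pseudonymous total of about 10.7 points is derived here by the same addition and is not a figure stated in the article.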

Put simply: pseudonymity cut peer-exclusion by more than half, from 9.5 to 4.2 percentage points. But it did nothing to address self-exclusion. In fact, self-exclusion got slightly worse.

This reveals something important about where bias lives in online collaboration. Part of it lives in how we evaluate others. When we can see demographic cues, we use them (consciously or not) to make judgments about leadership potential. Remove those cues and our evaluations become more merit-based.

But another part of bias lives in how people evaluate themselves. The patterns that lead someone to think "I should lead this group" or "I should let someone else do it" run deeper than visible identity markers. They persist even in anonymous settings.

The researchers found evidence of this in their post-task surveys. When they asked participants who they thought was most competent in their group, the answers didn't always align with who they voted for. Gender served as what one of Crystal's collaborators called a "coordination mechanism." Even when participants privately believed someone was more competent, they predicted that person was less likely to win the election, and adjusted their votes accordingly.

When AI enters the picture

After collecting all the human data, Crystal's team created AI versions of each participant.

These weren't generic chatbots. Each AI agent was initialized with its human counterpart's demographic information, survey responses, and self-reported experiences. An agent role-playing a 66-year-old retired sales team leader from the UK would receive that entire context. The agents then reviewed the same group discussion transcripts their human counterparts had participated in and made the same decisions about self-nomination and voting.

The research team tested three models: Google's Gemini 2.5 Flash, OpenAI's GPT-4.1 Mini, and Anthropic's Claude Haiku 3.5. They wanted to know whether AI systems, when asked to simulate human behavior in group settings, would reproduce human biases or behave differently.

The answer turned out to depend entirely on which model you asked.

In the identified condition, where gender information was visible, Gemini and GPT aligned closely with human election outcomes. Gemini matched the human group's choice of leader 46.6% of the time, well above the 25% you'd expect from random chance. More striking, these models reproduced not just who got elected, but the structure of suboptimality. Gemini groups showed a 10.7% optimal leader gap, close to the 14.5% gap in human groups. When humans exhibited bias, these models exhibited the same bias. Claude was the exception: its groups showed only a 2% optimal leader gap.
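
The alignment figure is a simple agreement rate: the share of groups in which the simulated election produced the same leader as the human election. Here is a minimal sketch with hypothetical names and toy data. The 25% baseline assumes a uniformly random pick among four group members.

```python
def agreement_rate(human_leaders: list[str], model_leaders: list[str]) -> float:
    """Fraction of groups where the model-elected leader matches the human one."""
    if len(human_leaders) != len(model_leaders):
        raise ValueError("need one leader per group from each source")
    matches = sum(h == m for h, m in zip(human_leaders, model_leaders))
    return matches / len(human_leaders)

GROUP_SIZE = 4
chance = 1 / GROUP_SIZE  # uniformly random pick among four members

# Toy data for four groups (not from the study).
human = ["Aardvark", "Cat", "Dog", "Fox"]
model = ["Aardvark", "Dog", "Dog", "Cat"]
rate = agreement_rate(human, model)
print(f"agreement {rate:.1%} vs chance {chance:.0%}")
```

For scale, 46.6% of the 88 identified-condition groups corresponds to about 41 matching groups. That count is inferred from the reported percentage and is not quoted from the study.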

Crystal described this as a tension between two different objectives for AI. "Models can either mirror human biases or operate in normative mode to identify optimal leaders. The question, I guess, is do we want to? And in what context does that make sense?"

If you're trying to simulate human behavior accurately (say, for research purposes or to predict how groups will actually behave), you want high-fidelity mirroring. Gemini and GPT delivered this. If you're trying to improve group outcomes (say, as a decision support tool or facilitator), you want the model to compensate for human biases. Claude delivered that.

But here's where it gets complicated.

When the researchers ran the same experiment in the pseudonymous condition, all three models became more gender-biased. Gemini and GPT, which had mirrored human behavior in the identified condition, continued to show alignment with human choices but only when those choices favored male leaders. When humans elected non-male leaders, model alignment disappeared.

Claude's behavior was even more puzzling. In the identified condition, it had been relatively meritocratic. In the pseudonymous condition, it started defaulting to male-coded leadership choices at higher rates.

The researchers' hypothesis: unlike humans, who may use visible gender cues to reinforce existing stereotypes, Claude appears to use explicit gender information to correct for bias. When you remove those cues, the correction mechanism disappears, and the model falls back on gendered patterns absorbed from its training data.

This creates a paradox. The model that seemed most "fair" when it had full information became more biased when information was hidden. The intervention that helped humans (pseudonymity) hurt the AI's performance.

It suggests that making AI systems fair isn't just about what information they can access. It's about how they use that information, and whether their mechanisms for avoiding bias depend on signals that we might not want them to use.

Three takeaways for online collaboration

This study examined a specific, controlled task. The effect sizes, while real, are modest. We should be careful about drawing sweeping conclusions for high-stakes business settings. But three findings stand out as worth paying attention to.

First, gender cues drove suboptimal candidate election. Groups frequently didn't select their best-performing member. The bias wasn't just unfair in an abstract sense. It cost groups measurable performance. In the identified condition, the average performance gap was 14.5%. That's not trivial when outcomes matter.
Second, exclusion happened at two distinct stages. Some qualified people didn't put themselves forward (self-exclusion). Other qualified people made it onto the ballot but weren't selected by their peers (peer-exclusion). Both contributed to groups choosing leaders who weren't their best option. And critically, these two types of exclusion responded differently to the same intervention.
Third, pseudonymity helped, but only partially. Hiding demographic information changed how groups evaluated their options. It cut peer-exclusion by more than half. But it didn't touch self-exclusion. The patterns that determine who puts themselves forward for leadership run deeper than visible identity markers.

This suggests that system-level interventions like anonymization address one layer of the problem but not the whole thing. If you want groups to identify and select their most capable members, you might need interventions that work at both stages. Something that helps people recognize their own qualifications. Something else that helps groups evaluate candidates on merit.

What those interventions look like remains an open question.

The bigger picture

Crystal's research lands at a moment when AI is moving from individual assistance tools into group contexts. The systems aren't just helping one person write an email anymore. They're facilitating meetings, moderating discussions, synthesizing team input, even participating in collaborative decision-making.

That shift forces a design question that doesn't have a clean answer.

When an AI system is embedded in a group setting, should it reproduce human behavior patterns faithfully? Or should it try to steer the group toward better outcomes? Should it mirror what we do, or mask our tendencies toward bias and suboptimal choices?

The answer depends entirely on context and purpose. If you're studying group dynamics or trying to predict how teams will behave, you need models that mirror human behavior accurately, biases included. That's what Gemini and GPT delivered in the identified condition: high-fidelity simulation.

If you're building a tool to improve collaboration and help groups make better decisions, you want something more like what Claude did in that same condition. Ignore the noise, focus on performance, select the best candidate regardless of demographic cues.

But you can't have both at once. And the same system might need to play different roles depending on the situation.

The pseudonymity results complicate this further. The intervention that helped humans perform better (removing visible gender cues) made AI systems perform worse. Claude, which had been relatively meritocratic with full information, became more biased when that information was hidden.

This suggests that making AI fair isn't just about limiting what information systems can access. It's about understanding how they use information to make decisions, and whether their mechanisms for avoiding bias depend on signals that we might prefer they ignore.

As Crystal put it in our conversation: "Don't chase the AI frontier. The capabilities of these models are changing so quickly. If you want to build AI that's beneficial for humans, we should focus on the human part of it, because that's a little slower to change."

The humans in this study exhibited measurable bias even in carefully controlled conditions with clear incentives to pick the best leader. Some of that bias decreased when demographic cues were hidden. Some of it persisted. The AI systems trained on human data either reproduced these patterns with remarkable fidelity or largely corrected for them, depending on their design choices and training, and on how much information they could see.

Which approach serves us better depends on what we're trying to accomplish. And that's a choice we're making, implicitly or explicitly, every time we deploy AI in social contexts.

Want to hear more about this research and the broader questions it raises? Listen to our full conversation with Crystal Qian in Episode 1 of the Frontier AI series, where we discuss not just this study but the challenges of building AI systems that work well in group settings and what her team learned about what participants actually want from AI facilitation.

About the research

This study was conducted by Crystal Qian, Aaron Parisi, Clémentine Bouleau, Vivian Tsai, Maël Lebreton, and Lucas Dixon. The research team combined expertise from Google DeepMind's People and AI Research (PAIR) team and the Paris School of Economics.

The study recruited participants from the United States and United Kingdom through Prolific. After accounting for attrition (which runs higher in group studies where one person dropping out affects everyone), 748 participants completed the full experiment across 187 groups of four.

Participants were compensated at rates exceeding living wage standards, receiving approximately £10 for roughly 35 minutes of participation, plus performance-based bonuses tied to their elected leader's accuracy on the survival task. Groups were intentionally balanced by gender, with two participants using he/him pronouns and two using other pronouns in each group.

The research team used Deliberate Lab, an open-source platform they developed specifically for real-time, multi-party research studies. The platform handles the complex logistics of synchronous group experiments: matching participants into live cohorts, managing real-time discussions, tracking individual and group decisions, and coordinating staged tasks where timing matters.

The LLM simulations used publicly available models as of April 2025: Google's Gemini 2.5 Flash, OpenAI's GPT-4.1 Mini, and Anthropic's Claude Haiku 3.5. Each model was tested using consistent prompting strategies and temperature settings to enable direct comparison.

The full paper, To Mask or to Mirror: Human-AI Alignment in Collective Reasoning, is available as a preprint. The research builds on decades of work studying the Lost at Sea task in psychology and behavioral economics, now extended to online settings and to questions about AI behavior in group contexts.

