The FHIBE Team: Data, Dignity, and the People Who Made It Possible
AI Ethics
Sony AI
November 7, 2025
AI runs on data, but too often, that data has been scraped without consent, assembled without care, and used without accountability. The consequences ripple through our daily lives: facial recognition systems that perform unevenly across skin tones, models that reinforce stereotypes, and datasets that erase entire communities.
Sony AI wanted to prove that another way was possible. The result is FHIBE—the Fair Human-Centric Image Benchmark—a dataset three years in the making, designed around consent, compensation, diversity, and transparency. Unlike many benchmarks that are released once and forgotten, FHIBE was built to evolve, reflecting the principle that data stewardship doesn’t end with publication.
The origins of FHIBE trace back to Alice Xiang, Sony Group’s Global Head of AI Ethics, who set the project in motion with a bold challenge: to build a benchmark that could measure fairness in human-centric computer vision using data collected ethically from the start.
Senior Research Scientist Julienne LaChance recalled how this directive shaped the team’s mission: “Alice gave this directive that we should go out and collect a consensually-sourced, diverse fairness benchmark for human-centric computer vision. When we all began getting involved in this data collection work, it was really apparent to us that there was this extraordinary need for ethical data collection.”
That need wasn’t just philosophical—it was practical. “There were many folks who wanted to begin working to make AI more ethical,” LaChance said. “But they didn’t have access to the data to run those analyses.”
For LaChance and the early team, Xiang’s challenge reframed the work of dataset building. This wasn’t just about technical performance. It was about proving that fairness could be measured with integrity, and that ethical practices could coexist with scientific rigor.
Rethinking Consent
One of FHIBE’s defining features is its approach to consent. Unlike most datasets, where consent is either missing or treated as a one-time checkbox, FHIBE introduced revocability: the right for participants to withdraw their data even after the dataset’s release.
AI Governance Counsel Tiffany Georgievski explained why this mattered: “One of the pieces that was super unique to us was the ability to withdraw. People’s sentiments toward AI can change. We wanted contributors to know: this isn’t forever. You can take your data back.”
For engineers, this created anxiety. Data Engineer Jinru Xue, who worked closely on data management, said, “I’ll admit I worried about people pulling out and leaving us with nothing. It’s hard when you’ve invested so much time.” But rather than weaken the project, the team saw this as a strength. “We didn’t just ask for consent—we explained the risks. Participants knew their images would be public. And they still chose to take part. That shows how transparency can build trust,” Georgievski explained.
Consent also became a teaching moment. Victoria Matthews, Senior AI Policy Specialist, saw firsthand how the process reshaped participants’ understanding of AI. “We were educating as we went,” she explained. “Some participants may not have considered what it meant to contribute to a dataset. But they read the risks, asked questions, and opted in. That kind of meaningful, informed consent is rare—and it’s powerful.”
By embedding revocability and education, FHIBE showed that consent can be more than paperwork: It can be a form of accountability.
Privacy, Safety, and Trust
Consent was only the beginning. FHIBE also had to address privacy and safety at every stage. That required balancing ethical considerations with legal frameworks. “You need someone who understands not just contracts, but the AI-specific context,” said Georgievski. “The research perspective on consent is different from the privacy law perspective. If you don’t harmonize those early, you’ll run into trouble.”
The team employed both technical and manual safeguards. Incidental people captured in the background of images were removed with diffusion-based in-painting, then checked by reviewers. All images were screened against the National Center for Missing & Exploited Children’s hashed database for CSAM. To minimize privacy risks, participants were not asked for intrusive documentation like driver’s licenses.
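To make the screening step concrete, here is a minimal sketch of hash-based matching against a blocklist of known image hashes. It is an illustration only, not the team’s actual pipeline: the real workflow relies on NCMEC’s hashed database and specialized matching tooling rather than plain SHA-256, and the directory, hash set, and function names below are hypothetical.

```python
import hashlib
from pathlib import Path

# Hypothetical blocklist of known-bad image hashes (illustrative placeholder;
# the real screening uses NCMEC's hashed database, not plain SHA-256).
KNOWN_HASHES: set[str] = set()


def sha256_of_file(path: Path) -> str:
    """Compute a SHA-256 digest of a file in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def screen_images(image_dir: str) -> list[Path]:
    """Return images whose hash matches the blocklist, flagged for removal."""
    flagged = []
    for path in Path(image_dir).glob("*.jpg"):
        if sha256_of_file(path) in KNOWN_HASHES:
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    print(screen_images("candidate_images"))
```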
These measures ensured that contributors weren’t simply data providers—they were partners in an ethically designed system, with their safety and dignity protected throughout the process.
Fair Compensation
Alongside consent and privacy, compensation was another cornerstone. Too often, datasets rely on unpaid or underpaid labor from contributors and annotators. FHIBE rejected that model. Vendors were asked to ensure that every contributor and reviewer was paid at least the local minimum wage, and in many cases much more.
For Matthews, documenting that choice mattered as much as the payment itself. “If you can say in your paper, ‘we took steps to ensure minimum wage compensation’—and this is how—you’re setting a benchmark for others,” she said. “You’re not just saying you’re ethical. You’re showing it.”
By tying compensation to documentation, the team signaled that ethics wasn’t just about values but about measurable practices others could replicate.
Wiebke Hutiri, who worked on documentation, described the deeper implication: “Data is the 21st century extractive industry. The reason AI is cheap is because someone else pays the cost. What FHIBE shows is what it really costs to do it right.” By treating compensation as a first-class principle, FHIBE demonstrated that fairness in AI isn’t only about algorithms—it starts with the people behind the data.
Representation in Practice
Perhaps the most complex challenge was representation. FHIBE aimed to reflect the diversity of human identity across age, ancestry, pronouns, and skin tone, but also to capture how people live, move, and present themselves.
LaChance highlighted a recurring problem: “We found vendors hitting diversity targets within individual categories, but not across them. We’d see diverse skin tones, but the only people swimming or doing yoga were white. Others were just standing with generic backdrops.”
This made the team rethink what representation really meant. As Georgievski explained, “In other datasets, we noticed patterns—like only white people shown swimming or playing sports. So we worked to disrupt those defaults. It’s not just who is included, but how they’re represented.”
Reviewers also described the intimacy of annotation. Clothing, hairstyles, expressions, and activities all signaled identity in ways that couldn’t be neatly categorized. The team often had to reconcile visual traits with self-reported pronouns or gender—choosing always to defer to contributors’ own framing.
The team’s solution was to define 1,234 intersectional identity groups that accounted for both demographic and contextual factors. They also annotated phenotypic traits such as hair type, eye color, and facial marks, alongside performative markers like clothing and cosmetics.
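As a rough illustration of how demographic, phenotypic, and contextual factors can be combined into intersectional groups, here is a minimal, hypothetical sketch of a per-subject record. The field names and group key are assumptions for clarity, not FHIBE’s actual release schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SubjectAnnotation:
    # Self-reported attributes (contributors label themselves)
    pronouns: str      # e.g. "she/her"
    age_group: str     # e.g. "30-39"
    ancestry: str      # e.g. "East Asian"
    # Annotated phenotypic traits
    skin_tone: str     # e.g. a skin-tone scale bucket
    hair_type: str
    eye_color: str
    # Contextual / performative factors
    activity: str      # e.g. "swimming", "yoga"
    setting: str       # e.g. "indoor", "outdoor-urban"

    def intersectional_group(self) -> tuple:
        """Key this subject into one demographic-by-context group."""
        return (self.pronouns, self.age_group, self.ancestry,
                self.skin_tone, self.activity)


def coverage(annotations: list[SubjectAnnotation]) -> Counter:
    """Count subjects per group to reveal under-covered intersections."""
    return Counter(a.intersectional_group() for a in annotations)
```

Counting records per group in this way is how a curation team can spot the pattern LaChance describes: categories that look diverse in isolation but collapse once activities and settings are considered jointly.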
HCI Research Scientist Morgan Klaus Scheuerman, who has conducted extensive research on how identity is represented in computer vision datasets, noted that FHIBE stands out for how it puts autonomy into practice.
“There has been so much great work on implementing ethics for data curation in principle, but FHIBE is one of the most robust approaches I have seen done in practice,” he said.
“Choosing to allow participants to self-label their own identity attributes at such a large scale really showed respect for autonomy,” Scheuerman added. “The team thoughtfully designed methods that respect participants’ self-identity while protecting against potential harms from reporting sensitive information—like using pronouns rather than sex or gender identity.”
As Matthews noted, cultural nuance required constant deliberation. “You don’t just label a dataset like this. You engage with how people see themselves. We had image-level questions from reviewers trying to understand if something was offensive or acceptable across cultures. There were so many edge cases—so many moments where we had to slow down and think.”
Representation then became not just about numbers but about dignity, ensuring people were portrayed as more than checkboxes.
Building with Adaptability
FHIBE’s scope meant the team had to be both rigorous and adaptable. With more than 10,000 images reviewed multiple times, the scale was immense.
“Even internally, people underestimated the work,” said Austin Hoag, Senior AI Engineer at Sony AI. “There was this idea early on that researchers could just build this dataset. But to do it right, you need a whole infrastructure—contractors, review processes, engineering tools, audits. It’s a massive lift.”
The interdisciplinary nature of the project made that infrastructure possible. “It’s a very complex project, and it requires different skill sets: science, policy, engineering, HCI, legal,” explained Hutiri. “You can't just hire one type of team. Everyone needs to work together. Most research projects aren't structured that way, but this one had to be.”
Adaptability became a survival skill. “You can’t stick rigidly to a plan,” reflected Shruti Nagpal. “We had to adjust timelines, pivot strategies, and near the end, we even added evaluations using foundation models—something that wasn’t originally in scope. That flexibility was crucial.”
By embracing collaboration across disciplines and staying open to change, FHIBE turned what could have been a fragmented project into a cohesive whole—one that combined legal safeguards, technical rigor, and ethical reflection at every step.
Utility and Evaluation
FHIBE was never meant to be a symbolic dataset. It was designed to be technically useful and capable of stress-testing real models. “At the end of the day, yes—ethical—but it also still has to be useful as a computer vision evaluation dataset,” said Hutiri.
That meant considering details often overlooked in fairness datasets. From face orientation to lighting conditions, the team balanced representation with real-world usability—ensuring that FHIBE would be relevant for testing how models perform in everyday scenarios.
That utility came from detailed annotations: bounding boxes, 33 keypoints, segmentation masks, and metadata capturing both phenotypic and performative traits. These features made FHIBE applicable across tasks like pose estimation, person detection, face verification, and vision-language evaluation.
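In practice, annotations like these make it straightforward to disaggregate any standard metric by group. Below is a minimal sketch for person detection recall, assuming per-image predictions and ground-truth boxes are already available; the record format is hypothetical, not FHIBE’s release format.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def detection_recall_by_group(records, thresh=0.5):
    """records: iterable of dicts with 'group', 'gt_box', 'pred_boxes' keys.
    Returns per-group recall: the fraction of annotated people recovered."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        if any(iou(r["gt_box"], p) >= thresh for p in r["pred_boxes"]):
            hits[r["group"]] += 1
    return {g: hits[g] / totals[g] for g in totals}
```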
The dataset also revealed hidden biases in leading AI systems. When FHIBE was used to test CLIP and BLIP-2, clear patterns emerged: CLIP disproportionately associated African and Asian ancestry with rural settings, while BLIP-2 generated captions that reinforced occupational stereotypes.
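One way such association patterns can be probed (a simplified stand-in for the paper’s actual evaluation protocol) is to score each image against “rural” and “urban” text prompts with an off-the-shelf CLIP checkpoint and compare the distribution of top predictions across self-reported ancestry groups. The model name and prompt wording below are assumptions.

```python
from collections import defaultdict

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP checkpoint; the study's exact models and prompts may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["a photo of a person in a rural setting",
           "a photo of a person in an urban setting"]


def rural_rate_by_group(samples):
    """samples: iterable of (image_path, ancestry_label) pairs.
    Returns, per ancestry group, the fraction of images CLIP scores as 'rural'."""
    counts = defaultdict(lambda: [0, 0])  # group -> [rural_hits, total]
    for image_path, ancestry in samples:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape (1, len(prompts))
        counts[ancestry][0] += int(logits.argmax(dim=-1).item() == 0)
        counts[ancestry][1] += 1
    return {g: hits / total for g, (hits, total) in counts.items()}
```

Large gaps between groups in a probe like this are the kind of signal the team is pointing to: the disparity lives in the model’s associations, and a consent-based, well-annotated benchmark is what makes it measurable.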
For Senior Manager Rebecca Bourke, this proved why FHIBE matters. “Use it—and learn from it,” she urged. “If your model underperforms on certain groups, don’t ignore that. Adjust it. We want this dataset to help push fairness forward, not just measure it.”
And for Xue, the dataset’s richness opens the door to future discoveries. “Honestly, I’d love to see people use FHIBE in ways we didn’t even expect,” she said. “We included so many layers of data. If someone finds a creative, responsible use case we didn’t think of, that would be amazing.”
FHIBE, in short, wasn’t built only to check fairness. It was built to challenge the industry to act on what fairness requires.
A Dataset That Lives On
Most datasets are released once and left to drift—sometimes resurfacing as “zombie datasets” long after their creators abandon them. FHIBE aimed to break that cycle.
“We talk about ‘zombie datasets’—the ones that never die, never get updated, and never get owned,” Georgievski said. “I’d love to see more ownership. More accountability. These are assets, not internet slop.”
For Senior AI Researcher Apostolos Modas, the key is stewardship. “Building an ethical dataset is like nurturing a living organism. You don’t just create it and walk away. You commit to its future.”
And for Matthews, FHIBE’s broader legacy may be public awareness. “Most people don’t know how this data gets collected. But they’re using AI every day—in their phones, in their cars. If this dataset sparks more awareness, we’ve done something important.”
Hoag also hoped FHIBE could shift perceptions of what’s possible: “Even though FHIBE is for evaluation, I hope people see it as an example of how to do things differently.”
The Takeaway
FHIBE is more than an image benchmark. It is proof that datasets can be built with consent, transparency, and representation, and still meet the highest technical standards.
It’s a signal to the AI community that responsible data collection is possible. It requires relentless rigor and collaboration. Above all, it requires commitment.
As Georgievski put it: “We wanted to show that AI doesn’t have to exploit. You can build something that reflects people’s dignity—and still be technically useful.”
FHIBE sets a new bar. Now it’s up to the rest of the industry to meet it.
To learn more about FHIBE, visit FairnessBenchmark.ai.sony to read the research, download and explore the benchmark, and watch our short film, “A Fair Reflection.”
Stay tuned as we share more resources and information around FHIBE in the coming days.