How Data Workers Shape Datasets: The Role of Positionality in Data Collection and Annotation for Computer Vision

Morgan Klaus Scheuerman

Allison Woodruff

Jed R. Brubaker

ACM SIGCHI

2025

Abstract

Data workers play a key role in the big data industry. Clients hire data workers to collect and annotate data with human identity concepts, like demographic categories or clothing items. Often, such workers are treated as computational—they are expected to quickly and objectively conduct their work, with the goal of having huge, unbiased datasets for training models. Computer vision is especially interested in fair and impartial data due to biases and unethical practices in the field. However, far from impartial, data workers imbue computer vision data with "biases" beyond correct versus incorrect answers. Data workers embed their own specific positional perspectives about identity concepts in both collection and annotation processes. Through interviews and ethnographic observations of data workers (freelance and business process outsourcing (BPO) employees), we show how worker positionality influences decisions during data work. We also show the unintended outcomes, like social biases, that occur when positionality is not explicitly attended to in client instructions. We discuss how employing a lens of positionality in data work reveals the gulfs between data worker perspectives and client expectations, which are colored by a web of positional actors beyond isolated data workers. We propose positional (il)legibility as an approach to data work that embraces the reality of positionality in classification practices and addresses the failures of positivist bias mitigation practices.

Related Publications

Responsibly Training Foundation Models: Actualizing Ethical Principles for Curating Large-Scale Training Datasets in the Era …

ACM SIGCHI, 2025
Morgan Klaus Scheuerman, Dora Zhao*, Jerone T. A. Andrews, Abeba Birhane, Q. Vera Liao*, Georgia Panagiotidou*, Pooja Chitre*, Kathleen Pine, Shawn Walker*, Jieyu Zhao*, Alice Xiang

AI technologies have become ubiquitous, influencing domains from healthcare to finance and permeating our daily lives. Concerns about the values underlying the creation and use of datasets to develop AI technologies are growing. Current dataset practices often disregard crit…

Transphobia is in the Eye of the Prompter: Trans-Centered Perspectives on Large Language Models

CVPR, 2025
Morgan Klaus Scheuerman, Katy Weathington, Adrian Petterson, Dylan Thomas Doyle, Dipto Das, Michael Ann DeVito, Jed R. Brubaker

Large language models (LLMs) are the new hot trend being rapidly integrated into products and services—often, in chatbots. LLM-powered chatbots are expected to respond to any number of topics, including topics central to gender identity. In light of rising anti-trans discour…

A Taxonomy of Challenges to Curating Fair Datasets

NeurIPS, 2024
Dora Zhao*, Morgan Klaus Scheuerman, Pooja Chitre*, Jerone Andrews, Georgia Panagiotidou*, Shawn Walker*, Kathleen H. Pine*, Alice Xiang

Despite extensive efforts to create fairer machine learning (ML) datasets, there remains a limited understanding of the practical aspects of dataset curation. Drawing from interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade…
