How Data Workers Shape Datasets: The Role of Positionality in Data Collection and Annotation for Computer Vision
Morgan Klaus Scheuerman
Allison Woodruff
Jed R. Brubaker
ACM SIGCHI
2025
Abstract
Data workers play a key role in the big data industry. Clients hire data workers to collect and annotate data with human identity concepts, like demographic categories or clothing items. Often, such workers are treated as computational—they are expected to quickly and objectively conduct their work, with the goal of having huge, unbiased datasets for training models. Computer vision is especially interested in fair and impartial data due to biases and unethical practices in the field. However, far from impartial, data workers imbue computer vision data with "biases" beyond correct versus incorrect answers. Data workers embed their own specific positional perspectives about identity concepts in both collection and annotation processes. Through interviews and ethnographic observations of data workers (freelance and business process outsourcing (BPO) employees), we show how worker positionality influences decisions during data work. We also show the unintended outcomes, like social biases, that occur when positionality is not explicitly attended to in client instructions. We discuss how employing a lens of positionality in data work reveals the gulfs between data worker perspectives and client expectations, which are colored by a web of positional actors beyond isolated data workers. We propose positional (il)legibility as an approach to data work that embraces the reality of positionality in classification practices and addresses the failures of positivist bias mitigation practices.