Analysis: Long hours and low wages: The human labour powering AI’s development

Data labellers across the world are keeping AI on track, but struggling to make ends meet themselves. (Pexels/Google DeepMind)

By Ben Taylor

November 16, 2023

The Finnish tech firm Metroc recently began using prison labour to train a large language model to improve artificial intelligence (AI) technology. For 1.54 euros an hour, prisoners answer simple questions about snippets of text in a process known as data labelling.

Data labelling is often outsourced to labour markets in the Global South where companies can find workers who are fluent in English and willing to work for low wages.

Due to the lack of Finnish speakers in these countries, however, Metroc has tapped into a local source of cheap labour. Were it not for the prison labour program, Metroc would likely be hard-pressed to find Finns willing to take data-labelling jobs that pay a fraction of the average salary in Finland.

These cost-cutting strategies not only highlight the significant amount of human labour still required to fine-tune AI, but they also raise important questions about the long-term sustainability of such business models and practices.

AI’s labour problem

The ethical ambiguity of AI built on prison labour is part of a larger story about the human cost behind AI’s significant growth in recent years. One issue that has become more evident over the past year is the question of labour.

Leading AI firms are not denying their use of outsourced and low-wage labour to do work like data labelling. However, the hype around tools like OpenAI’s ChatGPT has drawn attention away from this aspect of the technology’s development.

As researchers, including myself, are trying to understand the perceptions and use of AI in higher education, the ethical problems associated with current AI models continue to pile up. These include the biases that AI is prone to reproducing, the environmental impact of AI data centres, and privacy and security concerns.

Current practices of outsourcing data labelling work expose an uneven global distribution of AI’s costs and benefits, with few proposed solutions.

The implications of this situation are twofold.

First, the massive amount of human labour that is still required to shape the “intelligence” of AI tools should give users pause when evaluating the outputs of these tools.

Second, until AI firms take serious steps to address their exploitative labour practices, users and institutions may want to reconsider the so-called values or benefits of AI tools.

What is data labelling?

The “intelligence” component of AI still requires significant human input to develop its data processing capabilities. Popular chatbots like ChatGPT are pre-trained (hence the P in GPT). A critical phase in the training process consists of supervised learning.

During supervised learning, AI models learn how to generate outputs from data sets that are labelled by humans. Data labellers, like the Finnish prisoners, perform different tasks. For example, labellers might need to confirm whether an image contains a certain feature or to flag offensive language.
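
To make the idea concrete, here is a minimal sketch of supervised learning from human-labelled text. It is an illustration only, not the method used by Metroc or any AI firm: the example snippets, the "flag" and "ok" labels, and the simple word-counting classifier (naive Bayes with add-one smoothing) are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
import math

# Hypothetical human-labelled examples: each snippet is paired with
# the judgement a data labeller assigned to it.
LABELLED = [
    ("you are a stupid idiot", "flag"),
    ("what a stupid thing to say", "flag"),
    ("thank you for the helpful answer", "ok"),
    ("the weather is lovely today", "ok"),
]

def train(examples):
    """Count how often each word appears under each human-assigned label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label whose training snippets best match the words in text."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # Start from how common the label is, then add evidence word by word.
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(LABELLED)
print(predict(model, "that is a stupid answer"))
print(predict(model, "lovely weather thank you"))
```

The model has no notion of what is offensive beyond the labels people supplied, which is why the quality and quantity of human labelling shape what it produces.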

In addition to improving accuracy, data labelling is necessary to improve the “safety” of AI systems. Safety is defined according to the goals and principles of each AI firm. A “safe” model for one company might mean avoiding the risk of copyright infringement. For another, it might entail minimizing false information or biased content and stereotypes.

For most popular models, safety means that the model should not generate content based on prejudiced ideologies. This is partly achieved through a properly labelled training data set.

Tech companies rely on low-wage labour around the world to develop the programs that power their AI systems. (Shutterstock)

Who are data labellers?

The job of combing through thousands of potentially graphic images and snippets of text has fallen on data labellers largely concentrated in the Global South.

In early 2023, TIME magazine reported on OpenAI’s contract with Sama, a data labelling firm based in San Francisco. The report revealed that employees at a Kenyan satellite office were paid as little as US$1.32 per hour to read text that “appeared to have been pulled from the darkest recesses of the internet.”

WIRED also investigated the global economic realities of data labellers in South America and East Asia, some of whom worked more than 18 hours per day to earn less than their country’s minimum wage.

The Washington Post has taken a close look at Scale AI, which employs at least 10,000 workers in the Philippines. The newspaper revealed the San Francisco-based company “paid workers at extremely low rates, routinely delayed or withheld payments and provided few channels for workers to seek recourse.”

The data labelling industry and its required workforce are set to expand drastically in the coming years. Consumers who increasingly use AI systems need to know how they are built as well as the harm and inequities being perpetuated.

Transparency needed

From prisoners to gig workers, the potential for exploitation is real for all entwined in big AI’s thirst for data to fuel bigger (and possibly more unpredictable) models.

As institutions and individuals are swept up by the momentum of AI and all of its promises, the public tends to pay less attention to ethical aspects of the technology’s development.

Researchers at Stanford University recently launched a website showcasing their Foundation Model Transparency Index. The index provides metrics on measures of transparency for the most widely used AI models. These metrics range from how transparent companies are about where they source their data to how clear they are on the potential risks of their models.

Ten AI models were examined against criteria that include how transparent the companies operating them are about their labour practices. The index shows that tech companies have much work to do to improve transparency.

AI is a growing part of our increasingly digital lives. That is why we must remain critical of a set of technologies that, unchecked and unexamined, may cause more problems than they solve and deepen divides in the world rather than eliminate them.

Ben Lee Taylor, Postdoctoral Fellow in Research on Teaching and Learning, McMaster University

This article is republished from The Conversation under a Creative Commons license. Read the original article.

We believe in the free flow of information. This work is licensed under a Creative Commons Attribution-No Derivs 2.5 Canada (CC BY-ND 2.5 CA), so you can republish our articles for free, online or in print.

All republished articles must be attributed in the following way and contain links to both the site and original article: “This article was first published on Brighter World. Read the original article.”
