The hidden cost of scraped data

Many large AI datasets have historically been assembled through web scraping – in other words, the automated collection of images, text or other content from across the internet.
However, Alice points out that this often takes place “without the consent, awareness or compensation of the individuals whose data is included”.
She adds: “This raises important ethical concerns, but it also creates technical risks.”
Those technical risks include misrepresentation, embedded cultural stereotypes and structural biases that become baked into a model long before it ever reaches a user. As AI is deployed in high-stakes settings, from healthcare to law enforcement, the danger of these inherited flaws grows more acute.
Alice notes: “Models trained on scraped data may reinforce discriminatory patterns, amplify privacy violations or produce errors that disproportionately affect marginalised groups.”