Thousands of images of sexually abused children scraped from the internet are part of a widely used database for training artificial intelligence image generators, according to a report that warns AI applications can use the offending photos to create realistic-looking fake child exploitation images that can be sold.
The report, released today by the Stanford Internet Observatory (SIO), says removal of the source images is now underway: researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P).
The investigation found the worrisome images in LAION-5B, the biggest repository of images used by AI developers for training, which contains billions of images scraped from a wide array of sources, including mainstream social media websites and popular adult video sites.
According to the Associated Press, LAION, the nonprofit Large-scale Artificial Intelligence Open Network, said in a statement that it “has a zero tolerance policy for illegal content and in an abundance of caution” has taken down the datasets until the offending images can be deleted.
The SIO study of LAION-5B was primarily conducted using hashing tools such as Microsoft’s PhotoDNA, which match a fingerprint of an image to databases maintained by nonprofits that receive and process reports of online child sexual exploitation and abuse. Researchers did not view abuse content, and matches were reported to NCMEC and confirmed by C3P where possible.
There are methods to minimize child sexual abuse material (CSAM) in datasets used to train AI models, the SIO said in a statement, but open datasets are difficult to clean, and their distribution is hard to stop, because no central authority hosts the actual data.
The report outlines safety recommendations for collecting datasets, training models, and hosting models trained on scraped datasets. Images collected in future datasets should be checked against known lists of CSAM by using detection tools such as Microsoft’s PhotoDNA or partnering with child safety organizations such as NCMEC and C3P.
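To illustrate that recommendation, here is a minimal sketch of screening newly collected images against a known hash list. PhotoDNA and the industry hash sets it matches against are not publicly available, so the example assumes an open perceptual hash (pHash from the ImageHash library), a hypothetical local blocklist file, and an illustrative match threshold; a production pipeline would instead use PhotoDNA or a comparable service run in partnership with a child safety organization.

```python
# Illustrative sketch only: PhotoDNA and industry CSAM hash sets are not public,
# so an open perceptual hash (pHash) stands in to show the shape of the check.
from pathlib import Path

import imagehash          # pip install ImageHash
from PIL import Image

HAMMING_THRESHOLD = 8     # hypothetical tolerance for near-duplicate matches


def load_blocklist(path: str) -> list[imagehash.ImageHash]:
    """Load known-bad perceptual hashes, one hex string per line."""
    with open(path) as f:
        return [imagehash.hex_to_hash(line.strip()) for line in f if line.strip()]


def is_flagged(image_path: Path, blocklist: list[imagehash.ImageHash]) -> bool:
    """True if the image falls within the Hamming threshold of any known hash."""
    candidate = imagehash.phash(Image.open(image_path))
    return any(candidate - known <= HAMMING_THRESHOLD for known in blocklist)


def filter_collected_images(image_dir: str, blocklist_path: str) -> list[Path]:
    """Keep only images that do not match the blocklist; flagged files would be
    reported to an organization such as NCMEC or C3P, not redistributed."""
    blocklist = load_blocklist(blocklist_path)
    return [p for p in Path(image_dir).glob("*.jpg") if not is_flagged(p, blocklist)]
```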
According to the report, the LAION‐5B dataset is derived from a broad cross‐section of the web and has been used to train various visual generative machine learning models. The dataset was built by taking a snapshot of the Common Crawl repository, downloading the images referenced in the HTML, reading the “alt” attributes of those images, and using CLIP interrogation to discard images that did not sufficiently match their captions. The developers of LAION‐5B did attempt to classify whether content was sexually explicit and to detect some degree of underage explicit content.
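The CLIP filtering step the report describes can be sketched as follows, assuming the openly available openai/clip-vit-base-patch32 checkpoint from Hugging Face's transformers library and an illustrative similarity cutoff of 0.28; LAION's actual CLIP variant and threshold are documented in its own papers, so treat this only as the general shape of the caption-matching check.

```python
# Sketch of CLIP-based caption matching; model choice and threshold are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
SIMILARITY_THRESHOLD = 0.28   # illustrative cutoff for keeping an image/alt-text pair

model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def caption_matches(image_path: str, alt_text: str) -> bool:
    """True if an image and its HTML alt text are similar enough to keep."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Cosine similarity between L2-normalized embeddings
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item() >= SIMILARITY_THRESHOLD
```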
The report notes that version 1.5 of Stable Diffusion, one of the most popular AI image-generating models, was also trained on a wide array of this content, both explicit and otherwise. LAION datasets have also been used to train other models, the report says, such as Google’s Imagen, which was trained on a combination of internal datasets and the previous-generation LAION‐400M.
“Notably,” the report says, “during an audit of the LAION‐400M, Imagen’s developers found ‘a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes,’ and deemed it unfit for public use.”
Despite its best efforts to find all CSAM in LAION-5B, the SIO says its results are a “significant undercount” due to the incompleteness of industry hash sets, attrition of live hosted content, lack of access to the original LAION reference image sets, and the limited accuracy of “unsafe” content classifiers.
Web-scale datasets are highly problematic for a number of reasons, even with attempts at safety filtering, says the report. Ideally, such datasets should be restricted to research settings only, with more curated and well‐sourced datasets used for publicly distributed AI models.