Adventures in 21st-century privacy

Artist finds private medical record photos in popular AI training data set

LAION scraped medical photos for AI research use. Who's responsible for taking them down?

Benj Edwards
Censored medical images found in the LAION-5B data set used to train AI. The black bars and distortion have been added. Credit: Ars Technica

Late last week, a California-based AI artist who goes by the name Lapine discovered that private medical record photos taken by her doctor in 2013 had been referenced in the LAION-5B image set, a scrape of publicly available images on the web. AI researchers download a subset of that data to train AI image synthesis models such as Stable Diffusion and Google Imagen.

Lapine discovered her medical photos on a site called Have I Been Trained, which lets artists see if their work is in the LAION-5B data set. Instead of doing a text search on the site, Lapine uploaded a recent photo of herself using the site's reverse image search feature. She was surprised to discover a set of two before-and-after medical photos of her face, which had only been authorized for private use by her doctor, as reflected in an authorization form Lapine tweeted and also provided to Ars.

Lapine has a genetic condition called Dyskeratosis Congenita. "It affects everything from my skin to my bones and teeth," Lapine told Ars Technica in an interview. "In 2013, I underwent a small set of procedures to restore facial contours after having been through so many rounds of mouth and jaw surgeries. These pictures are from my last set of procedures with this surgeon."

The surgeon who possessed the medical photos died of cancer in 2018, according to Lapine, and she suspects that they somehow left his practice's custody after that. "It’s the digital equivalent of receiving stolen property," says Lapine. "Someone stole the image from my deceased doctor’s files and it ended up somewhere online, and then it was scraped into this dataset."

Lapine prefers to conceal her identity for medical privacy reasons. With records and photos provided by Lapine, Ars confirmed that there are medical images of her referenced in the LAION data set. During our search for Lapine's photos, we also discovered thousands of similar patient medical record photos in the data set, each of which may have a similarly questionable ethical or legal status. Many of them have likely already been integrated into popular image synthesis models that companies like Midjourney and Stability AI offer as commercial services.

This does not mean that anyone can suddenly create an AI version of Lapine's face (as the technology stands at the moment)—and her name is not linked to the photos—but it bothers her that private medical images have been baked into a product without any form of consent or recourse to remove them. "It’s bad enough to have a photo leaked, but now it’s part of a product," says Lapine. "And this goes for anyone’s photos, medical record or not. And the future abuse potential is really high."

Who watches the watchers?

LAION describes itself as a nonprofit organization with members worldwide, "aiming to make large-scale machine learning models, datasets and related code available to the general public." Its data can be used in various projects, from facial recognition to computer vision to image synthesis.

For example, after an AI training process, some of the images in the LAION data set become the basis of Stable Diffusion's amazing ability to generate images from text descriptions. Since LAION is a set of URLs pointing to images on the web, LAION does not host the images themselves. Instead, LAION says that researchers must download the images from various locations when they want to use them in a project.
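To make that distinction concrete, here is a minimal, hypothetical Python sketch of what working with a LAION-style data set involves. The data set itself supplies only URL-and-caption pairs (the file name and column names below are illustrative assumptions, not LAION's actual schema), and anyone who wants the pictures fetches them from whatever third-party servers still host them.

```python
import csv
import requests

# Hypothetical extract of a LAION-style data set: rows of (image URL, caption).
# The real data set ships as large metadata files; this CSV is purely illustrative.
with open("laion_subset_example.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))  # expects columns: "url", "caption"

for row in rows[:10]:
    try:
        # The image bytes come from the original hosting site, not from LAION --
        # which is why LAION says it does not host any of the images itself.
        resp = requests.get(row["url"], timeout=10)
        resp.raise_for_status()
        print(f"Downloaded {len(resp.content)} bytes for caption: {row['caption'][:60]}")
    except requests.RequestException as err:
        print(f"Skipping {row['url']}: {err}")
```

In other words, the data set is a map to the images rather than the images themselves, which is central to how LAION frames its responsibility.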

The LAION data set is replete with potentially sensitive images collected from the Internet, such as these, which are now being integrated into commercial machine learning products. Black bars have been added by Ars for privacy purposes. Credit: Ars Technica

Under these conditions, responsibility for a particular image's inclusion in the LAION set becomes a game of pass the buck. A friend of Lapine's posed an open question on the #safety-and-privacy channel of LAION's Discord server last Friday, asking how to remove Lapine's images from the set. LAION engineer Romain Beaumont replied, "The best way to remove an image from the Internet is to ask for the hosting website to stop hosting it. We are not hosting any of these images."

In the US, scraping publicly available data from the Internet appears to be legal, as the results of a 2019 court case affirm. Is it mostly the deceased doctor's fault, then? Or the fault of the site that hosts Lapine's images on the web without authorization?

Ars contacted LAION for comment on these questions but did not receive a response by press time. LAION's website does provide a form where European citizens can request that information about them be removed from its database to comply with the EU's GDPR laws, but only if a photo of a person is associated with a name in the image's metadata. Thanks to services such as PimEyes, however, it has become trivial to associate someone's face with a name through other means.

Ultimately, Lapine understands how the chain of custody over her private images failed but still would like to see her images removed from the LAION data set. "I would like to have a way for anyone to ask to have their image removed from the data set without sacrificing personal information. Just because they scraped it from the web doesn’t mean it was supposed to be public information, or even on the web at all."

A selection of images generated by Stable Diffusion. Knowledge of how to render them came from scraped images on the web, thanks to the LAION image data set. Credit: Lexica

On the LAION Discord, when asked if there was a way to connect with Stability AI and remove the images from their copy of the data set, Beaumont replied, "If you wish to ask people not to use some URLs from the web, one way is to create such a list and distribute it ... then Stable Diffusion trainers or other trainers could decide to blacklist these samples."
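Beaumont's suggestion amounts to publishing a do-not-train list that individual trainers would have to filter against before building a model. Purely as an illustration of that idea (the file names and column layout below are assumptions, not an existing LAION or Stability AI tool), such filtering might look something like this:

```python
import csv

# Hypothetical blocklist filtering before training: drop any data-set row whose
# URL appears on a published do-not-train list. File names are illustrative.
with open("do_not_train_urls.txt", encoding="utf-8") as f:
    blocked = {line.strip() for line in f if line.strip()}

with open("laion_subset_example.csv", newline="", encoding="utf-8") as src, \
     open("laion_subset_filtered.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = dropped = 0
    for row in reader:
        if row["url"] in blocked:
            dropped += 1
            continue
        writer.writerow(row)
        kept += 1

print(f"Kept {kept} rows, dropped {dropped} blocklisted rows")
```

Notably, this approach only works if every trainer voluntarily applies the list; it does nothing about copies of the data set, or models, that already exist.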

Ars Technica asked Stability AI CEO Emad Mostaque about Lapine's case. In the past, Mostaque has claimed that his company's and LAION's open approach is more ethical than that of closed-model companies like OpenAI, which do not release a public list of their data sources. Mostaque said that he couldn't speak for LAION, but he pointed to a recent MIT Technology Review article about privacy issues surrounding the GPT-3 language model. "It’s an interesting analogy, as we have no idea what those datasets are. Plenty of folk I know have found confidential data in output."

When we asked if it's possible to "un-train" Stable Diffusion to remove certain images from its AI image generation model, Mostaque's reply suggested that it is possible, but he says that current Stable Diffusion tech does not spit out exact copies of the data it learns from. "You can do that already in fine-tuning similar to doing a negative weight," he says, describing a way to de-emphasize learned data in the Stable Diffusion weights file, which is the summation of what an AI model has learned. "It’s not like a database or even GPT-3 in recall and memorization though." He did not, however, outline any plan for removing knowledge of particular images in response to requests from people like Lapine.
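Mostaque did not explain what "a negative weight" means in practice, and nothing below should be read as Stability AI's actual procedure. Purely to illustrate the general idea of de-emphasizing specific training examples after the fact, here is a toy PyTorch sketch that fine-tunes a model by ascending, rather than descending, the loss on unwanted samples, a crude form of machine unlearning and a stand-in for whatever technique Stability AI might actually use:

```python
import torch
from torch import nn

# Toy stand-in for a trained model; a real diffusion model is vastly larger.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Hypothetical "unwanted" examples the model should stop reproducing.
unwanted_inputs = torch.randn(8, 16)
unwanted_targets = torch.randn(8, 16)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(unwanted_inputs), unwanted_targets)
    # Negate the loss so the update *increases* error on these samples,
    # nudging the weights away from what was learned from them.
    (-loss).backward()
    optimizer.step()
```

Whether anything like this can reliably scrub a specific person's photos from a model as large as Stable Diffusion, without degrading it elsewhere, remains an open question.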

Lapine's discovery comes as the artist community grapples with a related issue about AI image models using their work for training without their consent. Lapine has gathered a following on Twitter for creating AI-powered art. Does this change how she feels about it? "Yes. I am very wary of generating photorealistic images of people," she replied. "I stick to animal subjects, statues, and art that fits under a style or period rather than a specific artist." But the experience has spooked her somewhat. "I haven’t touched AI art since finding my post-op pic in the LAION dataset."

It's becoming more evident by the day that these AI-powered creative tools probably represent an inevitable technological advancement that can push productivity to new levels. But as a society, we still have difficult questions to ask ourselves. Is it ethical to expect people who uploaded an image to the Internet a decade ago—or had it uploaded illegally for them—to accept that their data is now being used to train the future of AI without recourse? And will it ever matter if the answer is no?

