TIL a huge number of AI training images came from one site without clear permission

Read a report from the University of Amsterdam. They found that LAION-5B, a massive dataset, used over 5 billion images pulled from Common Crawl. Many were personal photos from Flickr, taken without asking the photographers. Makes you wonder who really owns the data behind these models. Has anyone else seen stats on where their training data actually comes from?

3 comments

taylor.reese 22d ago

That Common Crawl scrape is a huge mess. I had to check my own portfolio after reading about the Getty case reed.skyler mentioned. Found a few of my old Flickr shots with a dataset audit tool. The best you can do right now is run your URLs through haveibeentrained.com to see what's been scraped.

reed.skyler 22d ago

Yeah, I saw a piece about how Getty Images is suing over this exact thing!

william864 22d ago

Wait, they used billions of photos without even asking?