By now, we all know that artificial intelligence (AI) can sometimes struggle with the “intelligence” part. Hallucinations and drops in capability were so common last year that news outlets and scientific journals were publishing articles about generative AI producing nonsense or (worse) outright lies. The issue clearly hasn’t been fixed, as there are now plenty of people making videos and blogging about AI’s problem with the word “strawberry.”
But why? Has ChatGPT not been doing its homework? Is Microsoft Copilot rotting its brain with reality TV? As with a lot of problems in life, the issue could be a lack of diversity…
The AI Echo Chamber
A quick and generalized primer: Most generative AI tools are built on large language models (LLMs)—statistical models of natural language trained on enormous amounts of data—so the AI can “think” like a person when answering a prompt or writing computing code. To build these models, developers comb a huge range of sources to populate the training data so the AI can pull from a lot of different places. The issue is that much of this material is taken indiscriminately, without credit or proper sourcing to prove it is factual (or compensation for the original artist when images are used, but that’s another blog). This means equal weight is given to the statements that “the sky is blue due to sunlight hitting Earth’s atmosphere and scattering blue light along the human visual spectrum” and “the sky is blue because Beyoncé and the Illuminati chose the color as a heraldic symbol of her daughter, Blue Ivy”.
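To make that “equal weight” point concrete, here is a deliberately tiny sketch (in Python, with made-up example sentences) of the word-frequency counting that underlies statistical language models. Real LLMs are vastly more sophisticated, but the core issue is the same: if the scraped corpus contains the scientific explanation and the conspiracy theory once each, the counting step learns them with equal probability and never checks which one is true.

```python
from collections import Counter, defaultdict

# A toy "scraped corpus": one factual statement, one nonsense statement.
corpus = [
    "the sky is blue because sunlight scatters in the atmosphere",
    "the sky is blue because beyonce and the illuminati chose the color",
]

# Count which word follows which (a bigram model -- the simplest
# possible stand-in for the statistical machinery inside an LLM).
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        next_word_counts[current][nxt] += 1

# After "because", both continuations come out equally probable:
# nothing in the counting tells the model which source was factual.
counts = next_word_counts["because"]
total = sum(counts.values())
for word, count in counts.items():
    print(f"P({word!r} | 'because') = {count / total:.2f}")
```

Run it and both continuations of “because” come out at a probability of 0.50 each.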
And now, AI is starting to pull data generated by other AI models and incorporate it into LLM training. Because there is no real regulation around how AI is used, many news sites, academic journals, and other websites are publishing content created with generative AI. That content is then fed into tools like ChatGPT and Microsoft Copilot, which give any hallucinations and wrong information the same weight as data that is factually correct.
As discussed in a recent New York Times article, researchers found that when generative AI was deliberately trained on its own output, the content became more and more unhinged and nonsensical. Asked to complete the prompt “To cook a turkey for Thanksgiving, you…”, the AI trained on its own responses ominously told researchers that they needed to “know what you are going to do with your life if you don’t know what you are going to do with your life if you don’t…” in an endless loop à la Jack Torrance. The same held true for visual prompts: an image generator given handwritten numbers took only a few generations before the once-crisp numerals devolved into blurry, Bigfoot-like smears. Researchers at the University of Oxford have labeled this “model collapse,” while many of us commonfolk online just call it “AI slop.”
Source: Bhatia, Aatish. “When A.I.’s Output Is a Threat to A.I. Itself.” New York Times, August 2024.
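The research described above involved real image and text models, but the feedback loop behind model collapse can be illustrated with a toy simulation (this is just an illustration of the statistics, not the researchers’ actual method): repeatedly fit a simple model to samples drawn from the previous model, and watch the spread of the data quietly shrink until everything looks the same.

```python
import random
import statistics

random.seed(0)

# Generation 0: "human" data with plenty of variety.
mean, stdev = 0.0, 1.0
sample_size = 20      # how much data each new model is trained on
generations = 200

for gen in range(generations):
    # Train on the previous generation's output: fit a simple model
    # (here, just a mean and standard deviation) to sampled data...
    data = [random.gauss(mean, stdev) for _ in range(sample_size)]
    mean = statistics.fmean(data)
    stdev = statistics.stdev(data)
    # ...then the next generation samples from that fitted model.
    if gen % 50 == 0 or gen == generations - 1:
        print(f"generation {gen:3d}: spread (stdev) = {stdev:.4f}")

# The spread drifts toward zero: rare, diverse examples in the tails
# disappear first, and eventually every sample looks about the same.
```

The rare examples in the tails of the distribution are the first to vanish, which is the same dynamic that turns a varied set of faces into near-identical ones.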
Part of the problem with this AI ouroboros is that content starts to lose whatever diversity it once had because of biases within the LLM. We’ve all heard by now that certain words flag a piece of writing as AI content because they are rare in everyday use but prevalent in AI-generated text. The same favoritism that surfaces words like “delved” and “endeavor” also applies to images. In the same NYT article referenced above, researchers who asked for pictures of people’s faces found that a model initially trained on a range of ages, races, and sexes eventually weeded out those features until it produced only white men and women with dark hair and eyes, somewhere between 20 and 30 years old. While people fitting that description do exist in the real world, they are hardly the only ones…
This lack of diversity becomes just as big an issue as LLMs trained on factually incorrect data. Eventually, even if generative AI can produce something that isn’t an ominous warning or an H.R. Giger-esque nightmare of too many fingers or eyes in the wrong places, all of our written content is going to sound like it was written by middle schoolers who just discovered a thesaurus, while every image features white couples walking their golden retrievers. We are (as a collective) stunting our own output when the whole point of using generative AI is to help us produce what we can’t do ourselves.
What Are We Going to Do?
The problem is that the human-created data sources AI is trained on are starting to disappear, making it much more likely that generative AI will be trained on itself. Many websites are opting out of scraping or restricting their content so it can’t be used to train models. This shrinking pool of content is also driving up the cost of training AI systems: it takes a lot of energy, storage space, and innovation from researchers and developers to overcome these challenges and the growing amount of slop appearing in data sets.
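Much of that opting out happens through a site’s robots.txt file: several AI vendors publish the user-agent names their training crawlers identify themselves with (GPTBot for OpenAI, Google-Extended for Google’s AI training, CCBot for Common Crawl), and site owners can disallow them. The Python sketch below shows the idea using the standard library’s robots.txt parser; the agent names were accurate at the time of writing, but always check each vendor’s current documentation.

```python
from urllib import robotparser

# A minimal robots.txt that lets ordinary crawlers in but blocks
# known AI-training crawlers.
robots_txt = """
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/strawberry")
    print(f"{agent:16s} allowed to crawl? {allowed}")
```

Of course, this only works if the crawler actually honors robots.txt, which is a courtesy rather than an enforcement mechanism.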
To combat this, there are a few tactics we can take:
- Watermarks, while not necessarily a deterrent, can help us identify when generative AI is pulling from other AI-created content. That lets data managers keep track of it or even remove it from LLM training data (a rough sketch of that filtering step follows this list).
- Creating free-use or paid content libraries can help establish data sources that we know contain images and written works made by real artists and authors. This does require involving a variety of content creators and styles, or we’re just curating our own version of AI slop.
- Establishing regulations on how LLMs scrape content could offer some safeguards around how the data is used. Likewise, setting parameters on sourcing data or (at the least) requiring websites to disclose when they use generative AI could keep LLM trainers from pulling content that’s already been through the system.
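As for the watermark idea in the first bullet, here is a rough, hypothetical sketch of what the filtering step might look like for a data manager assembling a training set. The `provenance` field and the `is_ai_generated()` check are stand-ins for whatever watermark or content-credential scheme ends up being adopted; nothing here reflects a specific product’s API.

```python
from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str
    provenance: str  # hypothetical label: "human", "ai", or "unknown"

def is_ai_generated(doc: Document) -> bool:
    # Stand-in for a real watermark or content-credential check.
    return doc.provenance == "ai"

def filter_training_set(docs: list[Document]) -> list[Document]:
    # Keep only documents we have no reason to believe are AI-generated,
    # so the next model isn't trained on the last model's output.
    kept = [d for d in docs if not is_ai_generated(d)]
    print(f"kept {len(kept)} of {len(docs)} documents")
    return kept

corpus = [
    Document("https://example.com/recipe", "Roast the turkey at 325°F...", "human"),
    Document("https://example.com/slop", "know what you are going to do...", "ai"),
    Document("https://example.com/blog", "The sky is blue because...", "unknown"),
]
training_set = filter_training_set(corpus)
```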
Unfortunately, these changes are going to be hard to implement. There’s already a cultural mindset that if something is online, it’s up for grabs and probably true (since media literacy is on the decline). The early adopters of generative AI have already established a model where any and all content is taken and used without credit or sources, so the idea that they will start paying to do so now is pretty unlikely. At the risk of catastrophizing, it will probably take an actual model collapse before companies come up with a proper solution. Hopefully, by then, it won’t be too late.
Browse through our Industry Reports Page (latest reports only). Log in to the InfoCenter to view research on generative AI, large language models, and data mining through our Artificial Intelligence Advisory Service. If you’re not a subscriber, contact us for more info by clicking here.