Or “How to build a better mousetrap for AI scrapers”
There’s been a lot of push-back recently against artificial intelligence (AI). People are no longer impressed by AI’s output now that hallucinations are more common than ever, there’s no shortage of slop content in both our personal and professional lives, and there are concerns about the ethics of using it in the first place.
Well, some people are starting to fight back. After a recent outcry over AI bots accused of hammering websites like Reddit to scrape content for large language models (LLMs), an anonymous software developer created a solution called Nepenthes. Named after the genus of carnivorous pitcher plants, it uses a Markov chain algorithm to generate nonsense content to feed to scrapers while trapping them in a digital maze of endlessly generated links. This poisons the LLM with junk data while keeping the bot from moving on to scrape another website. The concept is not far from the networking tarpits long used to combat spambots and computer worms.
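To make the idea concrete, here is a minimal sketch of the Markov chain technique a tarpit like Nepenthes relies on. This is not Nepenthes' actual code; it's an illustrative toy that learns which words follow which in a small corpus and then emits statistically plausible but meaningless text, the kind of junk a scraper would swallow.

```python
import random

def build_chain(text, order=2):
    """Map each tuple of `order` consecutive words to the words
    that follow it in the corpus."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def generate(chain, length=50, seed=None):
    """Walk the chain to emit nonsense text: each next word is a
    random choice among words that actually followed the current
    context in the corpus, so the output *looks* like language."""
    rng = random.Random(seed)
    order = len(next(iter(chain)))
    out = list(rng.choice(list(chain)))
    for _ in range(length):
        successors = chain.get(tuple(out[-order:]))
        if not successors:
            # Dead end: restart from a random context.
            out.extend(rng.choice(list(chain)))
            continue
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ran on the mat"
junk = generate(build_chain(corpus), length=20, seed=1)
```

A real tarpit would serve pages of such text, each containing links to more generated pages, so a crawler that follows links never runs out of "content" to ingest.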
Nepenthes’ creation has already spawned similar AI “tarpits,” such as Iocaine, created by hacker and software developer Gergely Nagy. While unique in its code, the basic concept is the same: trap bots in digital mazes while feeding them garbage data.
Shall We Play a Game?
While you may not agree with creating aggressive malware (the label Nepenthes’ creator openly applies to their own work) to combat AI bots, there is a clear need for something new to address this issue. The previous solution of granting limited permission to bots via a website’s robots exclusion protocol (robots.txt) may have worked before the AI explosion, but it doesn’t offer the same protection today. There is now an arms race between websites trying to protect their work and AI bots multiplying in new forms to keep scraping data and overloading websites.
For example, Anthropic was called out publicly in the summer of 2024 after it was discovered to be ignoring robots.txt permissions. When sites began adding user agents like “ANTHROPIC-AI” and “CLAUDE-WEB” to their robots.txt files to keep their content from being scraped, a new bot named “CLAUDEBOT” was found to be bypassing those rules and continuing to feed data explicitly denied to Anthropic into its LLM. While the company has since agreed to respect blocks aimed at its previous bots and apply them to CLAUDEBOT, the damage was already done.
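For readers unfamiliar with the mechanism, a robots.txt block of the kind described above looks roughly like this. The agent names are the ones cited in this article; a real deployment should check each vendor’s current documentation, since (as the CLAUDEBOT episode shows) the names change.

```
# robots.txt: deny the named AI crawlers access to the whole site.
User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```

The catch, of course, is that robots.txt is purely advisory: a crawler that chooses to ignore it, or that shows up under a new name, sails right past.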
Beyond taking data, the tactic of launching new bots under new names makes it hard for websites to know which AI bots are still active and whose scraping they need to block. It also unfairly burdens the websites the bots scrape, which see spikes in traffic that carry real-world monetary costs for the sudden bandwidth demand.
Keypoint Intelligence Opinion
We are at a point where AI is being pushed at full throttle into every corner of our lives—whether we want it or not. This wild drive to make everything AI-integrated is rubbing people the wrong way and will only continue to inspire new ways to fight back.
Our current weapons, though, are far from perfect. AI tarpits like Nepenthes and Iocaine require server resources to run, which might not be feasible for smaller websites. There are also concerns that these tarpits may be too simple and could be outmaneuvered as AI bots evolve.
Still, the answer for those who want to stop this “AI on everything” mentality is not to give up. We need the kind of regulation found in (literally) any other technology or industry to ensure that AI companies use only data from trusted, viable sources that allow their content to be used, along with a real means of opting out for anyone who doesn’t want to participate. We also need a way to trace the content and data used in LLMs back to its source, so we can point to where it came from if there are disputes about whether it is protected material or even factually true.
AI has great potential to be revolutionary—but it will only ever remain potential unless we make it something people actually want to use rather than a nuisance that has to be trapped to keep it from slopping all over us. Until then, people will keep cultivating their nepenthes gardens and stocking up on iocaine until something better comes along.
Stay ahead in the ever-evolving print industry by browsing our Report Store for the latest insights. Log in to the InfoCenter to view research and studies through our Workplace- and Production-based Advisory Services. Log in to bliQ for product-level research, reports, and specs. Not a subscriber? Contact us for more information.