OpenAI’s web scraping GPTBot is under attack – here’s why

Published Aug 9th, 2023 7:09PM EDT
OpenAI announced GPT-4 on March 14.
Image: OpenAI

There is already a ton of controversy surrounding AI, especially with the use of ChatGPT in papers, articles, and elsewhere. However, OpenAI (the company that developed the ChatGPT chatbot) is kicking up even more controversy with a new GPTBot that scrapes the internet, learning from the content published on the world wide web.

OpenAI likely knew the kind of controversy this would cause, because it released the GPTBot without much fanfare or even an official announcement, though there is a support page for the bot that walks you through many of the details. Based on what it has shared, the bot appears to be designed as a web crawler, scraping content that can be used to improve the company’s language models.

So what’s the big deal? Why are so many people upset about this, and why are websites like The Verge scrambling to block the bot from scraping their content? Well, much of it comes down to the age-old question of consent. A lot of the content being shared on websites, especially blogs and things of that nature, is original content in some way.

ChatGPT homepage Image source: Stanislav Kogiku/SOPA Images/LightRocket via Getty Images

Someone has put their time and effort into writing or creating that content, and for many, the fact that a bot can just come by, scrape that information and knowledge, and learn from it without any consent being involved is a huge problem. Additionally, AI is still very young and tends to reproduce the information it finds on the web as if it were its own, which amounts to plagiarism, something that’s already rampant throughout the web without AI getting involved.

The other big problem is privacy. Because this bot is scraping the internet, it’s also scraping up information like usernames, emails, and other details that may have been shared in public places. That means that information could inadvertently end up somewhere it shouldn’t be, especially with the current copy/paste problems in AI models like the one powering ChatGPT. We’ve already seen some privacy investigations into ChatGPT cropping up.

Luckily, OpenAI has enabled websites to block the GPTBot very easily, and that’s what many have done. But other bots do similar things, and there aren’t easy ways to block them. Blocking GPTBot also does nothing about the thousands (possibly millions) of aggregating sites that rip off content daily. So the fight against GPTBot is simply joining an already impossible battle that content creators and website owners are fighting.
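
For site owners wondering what that block looks like in practice, it comes down to a couple of lines in a site's robots.txt file that name the crawler's user-agent token, GPTBot, and disallow it. Below is a minimal sketch that uses Python's standard library to check how such a rule would be read by a crawler that honors robots.txt. The example.com URL, the "SomeOtherBot" name, and the is_allowed helper are made up for illustration, and a rule like this only works if the bot chooses to respect it.

```python
from urllib.robotparser import RobotFileParser

# Rules a site could place in its robots.txt to turn GPTBot away entirely.
# "GPTBot" is the crawler's user-agent token; "/" covers the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/some-post"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))        # False: blocked
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))  # True: no rule applies
```

The second check shows the limit of this approach: a rule aimed at GPTBot says nothing about any other crawler, so each bot has to be named (or all bots blocked with a wildcard) for the rule to apply.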

We’ll likely see lawsuits concerning this, especially if OpenAI continues development on the GPTBot and pushes it harder as a tool for its language models to learn from. These concerns are only amplified by the plethora of worries around AI already, as there are very few laws governing the advancement of AI systems and how they use data to learn and evolve.

Josh Hawkins has been writing for over a decade, covering science, gaming, and tech culture. He is also a top-rated product reviewer with experience in extensively researched product comparisons, headphones, and gaming devices.

Whenever he isn’t busy writing about tech or gadgets, he can usually be found enjoying a new world in a video game, or tinkering with something on his computer.