Why the Internet Needs Crawl Neutrality

Today, a single company, Google, controls nearly all of the world’s access to information on the Internet. Its monopoly in search means that for billions of people, the gateway to knowledge, to products, and to their exploration of the web rests in the hands of a single company. Most agree that this lack of competition in search is bad for individuals, communities, and democracy.

Unbeknownst to many, one of the biggest barriers to competition in search is the lack of crawl neutrality. The only way to build an independent search engine and compete fairly with Big Tech is to first crawl the Internet effectively and efficiently. However, the web is an actively hostile environment for upstart search engine crawlers, with most websites allowing only Google’s crawler and discriminating against other search engine crawlers like Neeva’s.

This critically important, yet often overlooked, issue goes a long way toward preventing emerging search engines like Neeva from offering users genuine alternatives, further reducing competition in search. Similar to net neutrality, we need crawl neutrality today. Without a change in policy and behavior, search competitors will continue to fight with one hand tied behind their backs.

Let’s start at the beginning. Building a comprehensive web index is a prerequisite for competition in search. In other words, the first step in building the Neeva search engine is to “download the Internet” through Neeva’s crawler, called Neevabot.

This is where the trouble begins. For the most part, websites allow unfettered access only to Google’s and Bing’s crawlers while discriminating against other crawlers like Neeva’s. These sites either disallow everything else in their robots.txt files or (more commonly) say nothing in robots.txt but return errors instead of content to other crawlers. The intent may be to screen out malicious actors, but the consequence is to throw the baby out with the bathwater. And you can’t provide search results if you can’t crawl the web.
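
To make the pattern concrete, here is a hypothetical robots.txt of the kind described above. The user-agent names are real crawler identifiers, but the file itself is an illustration, not taken from any particular site:

    # Hypothetical robots.txt that admits only the incumbents' crawlers
    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    # Everyone else, Neevabot included, is shut out of the entire site
    User-agent: *
    Disallow: /

A crawler that honors this file never sees the site’s content, no matter how well behaved it is.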

This forces startups to spend an inordinate amount of time and resources finding workarounds. For example, Neeva implements a policy of “crawling a site as long as the robots.txt file allows GoogleBot and does not specifically prohibit Neevabot”. Even after a workaround like this, parts of the web with useful search results remain inaccessible to many search engines.
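
A minimal sketch of that policy check, using Python’s standard-library robots.txt parser, might look like the following. The function name and structure are illustrative assumptions, not Neeva’s actual code:

    import urllib.request
    from urllib import robotparser

    def can_crawl(robots_txt_url: str, page_url: str) -> bool:
        # Fetch the raw robots.txt so we can both parse it and check
        # whether Neevabot is named explicitly.
        raw = urllib.request.urlopen(robots_txt_url).read().decode("utf-8", "ignore")

        parser = robotparser.RobotFileParser()
        parser.parse(raw.splitlines())

        # The site must grant Googlebot access to the page...
        if not parser.can_fetch("Googlebot", page_url):
            return False

        # ...and must not single out Neevabot for exclusion.
        if "neevabot" in raw.lower() and not parser.can_fetch("Neevabot", page_url):
            return False

        return True

Against the hypothetical robots.txt above, this check would allow the whole site, since Googlebot is admitted and Neevabot is never named.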

As a second example, many websites allow a non-Google crawler via robots.txt yet block it in other ways, either by returning various kinds of errors (503, 429, …) or by rate limiting it. Crawling these sites requires deploying workarounds such as “obfuscate by crawling from a bank of proxy IP addresses that rotate periodically”. Legitimate search engines like Neeva are loath to deploy adversarial workarounds like this.
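
For contrast, here is a minimal sketch of what a well-behaved crawler does when it hits those responses: it backs off and, if the throttling persists, simply leaves the page uncrawled. It assumes the third-party requests library, and the names and retry limits are illustrative:

    import time
    import requests

    USER_AGENT = "Neevabot"  # identify honestly rather than masquerading as a browser

    def polite_fetch(url: str, max_attempts: int = 3) -> str | None:
        delay = 5.0
        for _ in range(max_attempts):
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            if resp.status_code in (429, 503):
                # Honor the server's Retry-After hint when it is a number of
                # seconds; otherwise fall back to exponential backoff.
                retry_after = resp.headers.get("Retry-After", "")
                time.sleep(float(retry_after) if retry_after.isdigit() else delay)
                delay *= 2
                continue
            if resp.ok:
                return resp.text
            return None  # other errors: give up on this page
        return None  # persistently throttled; the page stays uncrawled

A site that answers every such request with a 429 or 503 is, for a crawler like this, effectively closed, which is exactly the problem described above.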

These roadblocks are often aimed at malicious bots, but they have the effect of stifling legitimate search competition. At Neeva, we have put a lot of effort into building a crawler that respects rate limits and crawls at the minimum rate needed to build a great search engine. Google, meanwhile, has carte blanche. It crawls 50 billion web pages per day. It visits every page on the web once every three days and taxes network bandwidth on every website. That is the tax of the Internet monopolist.
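
To be concrete about what respecting rate limits means in practice, here is a minimal per-host throttle of the sort a polite crawler schedules its requests through. The class and the ten-second default are illustrative assumptions, not Neeva’s actual parameters:

    import time
    from collections import defaultdict

    class PerHostThrottle:
        """Allow at most one request to a given host every min_delay seconds."""

        def __init__(self, min_delay: float = 10.0):
            self.min_delay = min_delay
            self.last_hit = defaultdict(float)  # host -> monotonic time of last request

        def wait_for(self, host: str) -> None:
            # Sleep just long enough to keep requests to this host spaced out.
            elapsed = time.monotonic() - self.last_hit[host]
            if elapsed < self.min_delay:
                time.sleep(self.min_delay - elapsed)
            self.last_hit[host] = time.monotonic()

A crawler calls wait_for(host) before every fetch, so no single site sees more than a trickle of traffic, however large the overall crawl grows.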

For the lucky bots among us, a band of well-meaning supporters, webmasters, and publishers can help get your bot whitelisted. Thanks to them, Neeva’s crawl now runs at hundreds of millions of pages per day, on track to soon reach billions of pages per day. Even so, this still requires identifying the right people at those companies to talk to, emailing and cold calling, and hoping for goodwill from webmasters on webmaster aliases that are generally ignored. It is an interim fix that does not scale.

Getting permission to crawl shouldn’t depend on who you know. There should be a level playing field for everyone who participates and plays by the rules. Google is a search monopoly, and websites and webmasters face an impossible choice: let Google crawl them, or fail to appear prominently in Google results. As a result, Google’s search monopoly causes the Internet as a whole to reinforce the monopoly by giving Googlebot preferential access.

The Internet should not be allowed to discriminate between search engine crawlers based on who they are. Neeva’s crawler is capable of crawling the web at Google’s speed and depth. There are no technical limits, only anti-competitive market forces that make fair competition harder. And if it is too much extra work for webmasters to distinguish bad bots that slow down their websites from legitimate search engines, then those with carte blanche, like Googlebot, should be required to share their data with responsible actors.

Regulators and policymakers need to step in if they care about competition in search. The market needs crawl neutrality, similar to net neutrality.

Vivek Raghunathan is co-founder of Neeva, a private ad-free search engine. Asim Shankar is Neeva’s Chief Technology Officer.
