Without search engines, the internet would be one big pile of mush. Content left, right and center, but nothing tangible to point you in the correct direction. Google made itself incalculably valuable when it became the cartographer of the internet, and we just can’t thank them enough.
In the years since Google’s domination there have been other search engines: Yahoo!, MSN, Ask Jeeves…the list goes on and on. There are still plenty of other search engines out there right now. Google has the clear lion’s share, with roughly 64% of all internet traffic going its way, but that number has declined over the last couple of years.
This decline is due to the rise of other search engines in the mainstream, niche engines, and engines outside the U.S., all of which account for the rest of the world’s traffic.
The big three in the U.S. are Google, Bing, and Yahoo!. There are many others, and countries like China have their own market-dominant search engines (Baidu, for instance).
On the internet, search is still the key metric, and it is being battled over fiercely. Why?
What’s In the Search Engine
This may be obvious, but it’s worth stating: search engines are all about the content they present. The only reason people use the search engine is to find content they are looking for. It’s the Help Desk in the airport of the internet and without it you’d never find your gate.
Keep in mind that none of the found information is owned by the search engine. They just found it for you.
The search engine itself has no significant information…you’re shaking your head. I know. Google has become a resource of information in and of itself for some things: maps, pictures of houses from the street, strange doodle-esque games that appear and disappear, physical objects, etc. However, you’d be surprised how much content they use that is not actually theirs.
This is true for all engines.
You use a search engine to find information, not because they have it themselves.
Why You Scrape the Search Engine
Consider now why one would scrape a search engine. Scrape is an ugly word for crawl, suck, draw out of, or harvest (all of which are ugly words in and of themselves). To scrape a search engine is to harvest all the data on it.
You obviously can’t harvest all the data on Google, so you need to scrape for specific information at given intervals. This is essentially what you do when you’re after Big Data, using Scrapebox and a bunch of proxies.
Scraping search engines is an age-old tradition, at least as old as the internet itself. Because the search engines have categorized the data so well, a dialed-in scrape can turn up millions of results for keywords, URLs, and other metrics in a few hours.
You can then compile this data for research, resale, or any number of purposes.
How You Scrape the Search Engine
The why is simple, the how…a little less simple. But you’re here, on a proxy website, trying to find the easiest engine to scrape, so you probably have a clue.
In general it goes like this: download a scraper application like Scrapebox, load it up with proxies (free or paid), set your parameters for the scrape, and hit the “Go!” button.
That’s the simple version; it’s worth breaking it down a bit more.
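For those who would rather see the moving parts than a GUI, here is a minimal sketch of that flow in plain Python. The endpoint URLs and the `first`/`start` paging parameters are my own assumptions for illustration; a tool like Scrapebox handles engine-specific URL formats and result parsing for you.

```python
import time
import urllib.parse

# Assumed endpoints and paging parameter names, for illustration only.
SEARCH_ENDPOINTS = {
    "bing": "https://www.bing.com/search",
    "google": "https://www.google.com/search",
}

def build_search_url(engine, query, page=0, per_page=10):
    """Build a paginated search URL for the given engine."""
    offset_param = "first" if engine == "bing" else "start"
    params = {"q": query, offset_param: page * per_page}
    return SEARCH_ENDPOINTS[engine] + "?" + urllib.parse.urlencode(params)

def scrape(queries, engine="bing", pages=3, delay=10):
    """Yield the URLs a scraper would fetch, pausing between pages."""
    for query in queries:
        for page in range(pages):
            yield build_search_url(engine, query, page)
            # A real run would fetch this URL through a proxy here and
            # parse the result links out of the returned HTML.
            time.sleep(delay)
```

The `delay` between requests is exactly the "timeout" parameter discussed below; real scrapers expose it as a setting.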
Proxies for Scraping
The proxies part of this is essential. The issue with scraping search engines is that they don’t want you to do it. In essence you are churning through their information as quickly as possible to harvest data in an automated fashion, but they want you to browse like a normal human being.
There are a number of reasons search engines don’t want you to scrape. Google, the big dog, feels that it could slow down websites’ responsiveness, but we all know they just don’t want people to access all their data. So it goes.
Proxies come in here because they hide your original IP address and can be rotated easily. They need to be rotated because the IP address is the indicator a search engine uses to recognize a scraper. You don’t want it to be your actual IP address, because then you’d get in trouble with your ISP. If a proxy IP address eventually gets blocked, you can simply switch it out for another one.
Proxies are necessary. Everyone who scrapes uses them. Rotating proxies are the best, and give the best (and most consistent) results.
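A simple way to rotate is to cycle through a pool and skip any address that has been banned. This `ProxyRotator` class is a hypothetical sketch, not part of any particular tool; rotating proxy services do this for you server-side.

```python
import itertools

class ProxyRotator:
    """Cycle through a pool of proxy addresses, skipping banned ones."""

    def __init__(self, proxies):
        self.banned = set()
        self.size = len(proxies)
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next usable proxy in round-robin order."""
        for _ in range(self.size):
            proxy = next(self._cycle)
            if proxy not in self.banned:
                return proxy
        raise RuntimeError("all proxies in the pool are banned")

    def ban(self, proxy):
        """Retire a proxy once the search engine starts blocking it."""
        self.banned.add(proxy)
```

Each request then calls `next_proxy()`, and any proxy that starts returning captchas or errors gets passed to `ban()`.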
Parameters for the Scrape
This subject is a big one, and one I won’t get into significantly in this article. However, it’s important to realize that after you download the software and upload the proxies, you’ll need to adjust the parameters of the scrape.
This ties directly into why certain search engines are easier to scrape than others. When using software you’ll need to be mindful of two things: threads and timeouts.
The more threads you have, the more open connections to the search engine and the faster your scrape. This may sound great, but it also leaves your proxy IP very vulnerable to getting banned or blocked.
Limit your threads to reduce risk of getting blocked or banned.
The shorter your timeouts the more results you’ll get. But, just like threads, you have to be careful. A timeout is how long the software waits after one request before starting the next; a short timeout would be 1-10 seconds, a long one 60 seconds.
When you set it to short timeouts the software will ping the search engine every single second (or every 10 seconds, etc.). You don’t want to do this, as it will raise red flags.
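To make the two knobs concrete, here is a hedged Python sketch: `MAX_THREADS` caps the number of open connections, and a per-worker pause plays the role of the timeout. The `fetch` function is a stand-in for a real HTTP request through a proxy; the specific numbers are illustrative, not recommendations from any engine.

```python
from concurrent.futures import ThreadPoolExecutor
import time

MAX_THREADS = 2       # fewer open connections means lower ban risk
TIMEOUT_SECONDS = 10  # idle time between requests from each worker

def fetch(url):
    # Stand-in for a real HTTP request routed through a proxy; a real
    # fetch would block here waiting on the search engine's response.
    return url

def paced_fetch(url, pause=TIMEOUT_SECONDS):
    result = fetch(url)
    time.sleep(pause)  # the "timeout": wait before the next request
    return result

def run(urls, threads=MAX_THREADS, pause=TIMEOUT_SECONDS):
    # max_workers caps concurrent connections; map() keeps input order
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda u: paced_fetch(u, pause), urls))
```

Raising `threads` or shrinking `pause` speeds up the scrape, at the cost of looking less and less like a human.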
Which Search Engine is Easiest to Scrape?
I won’t get into all the search engines out there — that’s too many. But I will give you a breakdown of the big three in the U.S., and some others that are commonly used.
One thing to remember is that all of these search engines are private companies. They don’t release “best of scraping” guides for users, and they certainly don’t post what their rules are. Scraping is a continual trial and error process, so please take my recommendations with a grain of salt.
If you’ve scraped before you’ve likely scraped Google. It is the head cartographer and can, with the right methods, yield the most fruitful scrapes around. I’ll get into more of the terminology in the example for Google, and then go into the other search engines.
Scraping Google
I would classify Google as very difficult to scrape. Being top dog means Google has the largest reputation to defend, and it, in general, doesn’t want scrapers sniffing around.
It can’t stop the process; people scrape Google every hour of the day. But it can put up stringent defenses that stop people from scraping excessively.
Bot Detection and Captchas
Google (and other search engines) flag an IP by determining whether it is a bot or not. Bot is synonymous with crawler, scraper, harvester, etc. Bot is a nice term, though, because it implies the specific process that offends Google.
Google and other engines want humans to search the web, not bots. So, if your bot doesn’t act like a human, you will get booted.
This is called bot detection, and Google has great methods of detecting your bots.
When it does detect a bot it will throw up captchas initially. These are those annoying guessing games that try to tell if you’re human. They will most often stump your proxy IP and software, thereby stopping your scrape.
If you continue a new scrape with that IP, which Google has now flagged, it will likely get banned from Google, and then blacklisted.
Banned means you won’t be able to use it on Google; you’ll just get an error message. Blacklisted means the IP itself will go on a big list of “no’s!” that Google carries around in its wallet.
Your proxy provider will likely get upset if you get too many of their proxies blacklisted, so it’s best to stop scraping with that proxy IP before this happens.
The reality is that most of these search engines have a threshold. Google has a low one. I can’t typically scrape more than a few pages of Google — five at most — until I get my first captcha. Once that happens I reduce threads and increase timeout, and then go on until I get another captcha. After that I rotate my proxies.
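That back-off routine (halve the threads, lengthen the timeout, rotate when you bottom out) can be captured in a tiny helper. The specific floors and ceilings here are my own assumptions from trial and error, not anything Google publishes.

```python
def adjust_on_captcha(settings):
    """Back off after a captcha: halve the threads and double the
    timeout. Once both knobs are already at their limits, signal
    that it's time to rotate to a fresh proxy instead.

    Assumed limits: minimum 1 thread, maximum 60-second timeout.
    """
    threads = max(1, settings["threads"] // 2)
    timeout = min(60, settings["timeout"] * 2)
    rotate = threads == settings["threads"] and timeout == settings["timeout"]
    return {"threads": threads, "timeout": timeout, "rotate_proxy": rotate}
```

Run it each time a captcha appears; once it reports `rotate_proxy`, retire the flagged IP before it gets blacklisted.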
Scraping Yahoo!
Yahoo! is easier to scrape than Google, but still not very easy. And, because it’s used less often than Google and other engines, applications don’t always have the best system for scraping it.
You can try, but be sure to do so cautiously if you’re worried about your proxies. Set threads to low and timeouts high, and build up from there.
See if your application can handle it, and what kind of results you get. Yahoo! has a lower threshold than Google, but not necessarily one that allows you easy access.
Scraping Bing
Of the big three search engines in the U.S., Bing is the easiest to scrape. For whatever reason they don’t seem to care as much. For example, in one recent scrape I harvested 7 million URLs from Bing in a couple hours. Yes, that’s a lot.
For comparison, the same scrape on Google only let me get a few thousand URLs.
If you want to scrape happily and forever, use Bing.
It’s not entirely clear why this is the case, and we’ll never know. One theory is that Bing doesn’t want to block any visitors, because blocking reduces overall page views, which means fewer impressions on ads. Impressions from scrapers typically don’t add up to much revenue, but the search engine might be opening the floodgates to compete.
Scraping Dogpile, DuckDuckGo, Ask.com
These are roughly the same as Yahoo!: not as difficult as Google, but not nearly as easy as Bing.
Many search engines dislike scrapers by nature, and put robust measures in place to keep the number of scrapes low. These lesser-known but fairly powerful engines have thresholds that will kick you off soon enough. I don’t scrape them as often as Google, Yahoo!, or Bing, but when I do I typically grab tens of thousands of results before getting the boot.
Use Rotating Proxies
To be clear, the above scenarios and numbers hold when I use premium rotating proxies. When you scrape search engines, and you’re serious about it, I only recommend rotating proxies. They are much less of a hassle, and throw up flags far less often than free, datacenter, or shared proxies do.
Which search engine is the easiest to scrape? Bing. By far. Trial and error over the years has made this a consistent fact for me.
I do encourage you to try all of them, though, and see what results you get. Make sure to control your threads and timeouts, and don’t scrape overly hard or in super robotic fashion.
I also recommend tailoring scraping settings (like retry rates) when you start to see captchas to maximize your yield of data. It’s important to avoid blacklisting proxies as much as possible. It ensures optimal performance for scraping, plus an optimal experience for you and for your provider.
Good luck, and stay sharp.