CAPTCHAs and location-based limitations are the simplest way to protect a website from data scraping, but such protection is relatively easy to bypass. CAPTCHA-solving services receive the puzzle from the scraper, pass it to a human worker who recognizes it, and return the answer to the scraper.
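As a rough illustration of how a scraper hands a puzzle off to such a service, the Python sketch below submits a CAPTCHA image over HTTP and polls for the answer. The endpoints, field names, and response format are all placeholders for a hypothetical service; real solving services each define their own API.

```python
import time
import requests

# Hypothetical solving-service endpoints -- placeholders, not a real API.
SUBMIT_URL = "https://captcha-solver.example.com/submit"
RESULT_URL = "https://captcha-solver.example.com/result"
API_KEY = "your-api-key"

def solve_captcha(image_path: str, timeout: int = 120) -> str:
    """Upload a CAPTCHA image and poll until a worker returns the text."""
    with open(image_path, "rb") as f:
        resp = requests.post(
            SUBMIT_URL,
            data={"key": API_KEY},
            files={"captcha": f},
        )
    resp.raise_for_status()
    task_id = resp.json()["task_id"]

    # Poll for the solution; human workers usually answer within a minute.
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = requests.get(RESULT_URL, params={"key": API_KEY, "id": task_id})
        result.raise_for_status()
        payload = result.json()
        if payload["status"] == "ready":
            return payload["answer"]
        time.sleep(5)
    raise TimeoutError("CAPTCHA was not solved in time")
```

The scraper then submits the returned text in the target site's CAPTCHA form field exactly as a human visitor would.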
Proxies help to avoid location-based IP blocking (read more about that on our blog).
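As a minimal sketch of the idea (the proxy addresses below are placeholders; in practice they come from a proxy provider), each request is routed through a different IP from a pool:

```python
import random
import requests

# Placeholder proxy pool -- rotating across it spreads traffic over many IPs,
# so no single address accumulates enough requests to get blocked.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch(url: str) -> requests.Response:
    """Fetch a URL through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

page = fetch("https://example.com/listings")
print(page.status_code)
```

Choosing proxies located in the permitted region also sidesteps geo-blocking, since the target site only ever sees the proxy's IP.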
DataOx has also faced more exotic restrictions: one “complex” Canadian website limited its access hours like a brick-and-mortar shop.
There is also a category of websites whose owners order special protection from third-party vendors (such as Imperva or Akamai). These services impose heavy restrictions on the amount of data you can extract from a website and use special algorithms that “understand” that a web scraper is not a human but an automated process.
Good examples of such web sources are LinkedIn, Glassdoor, and British Airways.
It’s very difficult to extract large volumes of information from such sites. At DataOx, we develop workarounds using human-behavior imitation, proxy-rotation mechanisms, and “careful” scraping, as sketched below.
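As a minimal illustration of “careful” scraping with human-behavior imitation (the URLs and user-agent strings are placeholders, and real projects combine this with the proxy rotation shown earlier), the idea is to randomize request timing and browser fingerprints so the traffic pattern looks less machine-like:

```python
import random
import time
import requests

# Placeholder user agents -- rotating them makes consecutive requests
# look like they come from different browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/124.0",
]

def careful_get(url: str) -> requests.Response:
    """Fetch a page with a random user agent and a human-like pause."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # Pause 3-10 seconds, the way a person skims a page before moving on.
    time.sleep(random.uniform(3, 10))
    return requests.get(url, headers=headers, timeout=30)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    resp = careful_get(url)
    print(url, resp.status_code)
```

Randomized, unhurried pacing keeps the request rate below the thresholds such protection services watch for, at the cost of slower extraction.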