Complex Website Scraping
Scraping “complex” web sources is a challenge. Consult DataOx to learn how they can be scraped and what workarounds we develop to deal with large, protected, “weak,” or old sites.
Ask us to scrape the website and receive a free data sample in XLSX, CSV, JSON, or Google Sheets within 3 days.
Scraping is our field of expertise: we have completed more than 800 scraping projects, including ones on protected resources.
What Does “Complex Website” Mean?
We divide “complex” websites into three categories:
- Large web sources
- Protected sites
- “Weak” or old websites
Large Web Sources
Websites with a lot of web pages (usually more than one thousand) are considered large. Big e-commerce sites or social networks fall into this category. If you want to scrape data from this kind of website, you will probably face the following difficulties:
Different HTML code in the same types of web pages
This happens because large sites, as a rule, grow gradually and contain as many old pages as new ones. Different programmers may have worked on them at different times. To scrape the same field (e.g., “service name”), we have to develop different parsers for particular pages. This requires a solid data quality assurance process that finds incorrect data and checks the HTML code of each web page.
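As a minimal sketch of the "different parsers for the same field" idea, the snippet below tries several layout-specific parsers in turn. The two HTML layouts, the patterns, and the function names are hypothetical and stand in for whatever markup a real site uses.

```python
import re
from typing import Callable, List, Optional

# Hypothetical layouts: the same "service name" field marked up in two ways
# on pages built at different times. The patterns are illustrative only.
def parse_new_layout(html: str) -> Optional[str]:
    match = re.search(r'<h1 class="service-name">(.*?)</h1>', html, re.S)
    return match.group(1).strip() if match else None

def parse_old_layout(html: str) -> Optional[str]:
    match = re.search(r'<td id="svcName">(.*?)</td>', html, re.S)
    return match.group(1).strip() if match else None

PARSERS: List[Callable[[str], Optional[str]]] = [parse_new_layout, parse_old_layout]

def extract_service_name(html: str) -> Optional[str]:
    """Try each layout-specific parser in order; return the first non-empty hit."""
    for parser in PARSERS:
        value = parser(html)
        if value:
            return value
    return None  # No parser matched: the page should be flagged for QA review.
```

A `None` result is the signal for the quality assurance step: it marks a page whose HTML does not match any known layout.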
Pagination and lazy loading
To scrape large websites, web crawlers have to go from one page to another. If a site has pagination (as most e-commerce websites do), you need to account for sorting and other related nuances. Lazy loading, where items load as you scroll down a page, can cause a lot of headaches, and many duplicates may occur in such cases.
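One way to keep duplicates out while paging could look like the sketch below. It assumes each item carries a unique "id" field and that `fetch_page` is a caller-supplied function returning the items of one page; both are assumptions for illustration, not part of any specific site or library.

```python
from typing import Callable, Dict, List, Set

def crawl_pages(fetch_page: Callable[[int], List[Dict]], max_pages: int = 1000) -> List[Dict]:
    """Walk numbered pages and keep each item once, keyed by its "id" field."""
    seen_ids: Set[str] = set()
    items: List[Dict] = []
    for page_number in range(1, max_pages + 1):
        added = 0
        for item in fetch_page(page_number):
            if item["id"] in seen_ids:
                continue  # Already collected, e.g. the sort order shifted between pages.
            seen_ids.add(item["id"])
            items.append(item)
            added += 1
        if added == 0:
            break  # Empty page or only repeats: assume the listing is exhausted.
    return items
```

Stopping on a page with no new items is a heuristic; a site that repeats a full page in the middle of a listing would need a different stop condition.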
Protected Sites
Captcha and location limitations are the simplest ways to protect a website from data scraping, but such protection is relatively easy to bypass. Captcha-solving services send the puzzle to a human, who solves it and returns the result to the scraper. Proxies help to avoid location-based IP blocking (read more about that in our blog).
DataOx has even faced an exotic restriction on a “complex” Canadian website that limited its hours of operation, like a brick-and-mortar shop.
There is also a category of websites whose owners order special protection from third-party vendors (like Imperva or Akamai). Such services heavily restrict the amount of data you can extract from the website. They use special algorithms that “understand” that the web scraper is not a human but an automated process. Good examples of such web sources are LinkedIn, Glassdoor, British Airways, and others. Generally, sites that are scraped very often tend to have very advanced anti-scraping mechanisms. Such sites should be scraped respectfully, but that does not guarantee you won’t be banned. There are tools and tricks that reduce this risk: human-like crawling, user agent rotation, and proxy rotation.
Human-like Crawling
The best way to make your scraping bots look like real users is to use a real browser. Besides, try not to crawl in a predictable and repetitive way. If you scrape content sequentially, the bot protection mechanisms of the target site are more likely to identify you. That’s why the order of requests should be randomized as much as possible, as should the delays between them. It’s also advisable not to chain all the requests in one large sequence. Keep in mind that the more unpredictable your scraper is, the more human-like it appears and the more likely it is to operate successfully.
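The randomized order and delays described above can be sketched roughly as follows. The function name, the delay bounds, and the injected `fetch`, `sleep`, and `rng` parameters are all assumptions chosen for illustration; suitable delay values depend on the target site.

```python
import random
import time
from typing import Callable, Dict, Iterable, Optional

def crawl_in_random_order(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 2.0,   # Assumed bounds, in seconds; tune per site.
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Fetch URLs in shuffled order, pausing a random interval after each one."""
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)  # Avoid walking the site in a predictable sequence.
    results: Dict[str, str] = {}
    for url in shuffled:
        results[url] = fetch(url)
        sleep(rng.uniform(min_delay, max_delay))  # Irregular gap between requests.
    return results
```

Passing `sleep` and `rng` as parameters keeps the sketch easy to test without real waiting; in production the defaults would be used.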
User Agent Rotation
The main trick in parsing a website that does not want to be parsed, and is eager to prevent theft of its content, is to write a script that is hard to identify and to randomize its user agents.
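A minimal sketch of user agent rotation with Python's standard library might look like this. The user agent strings are sample values and `build_request` is a hypothetical helper name; a real project would maintain a larger, up-to-date pool.

```python
import random
import urllib.request
from typing import Optional

# Sample user agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def build_request(url: str, rng: Optional[random.Random] = None) -> urllib.request.Request:
    """Build a request whose User-Agent header is picked at random from the pool."""
    rng = rng or random.Random()
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen`; no network call is made until then.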
Proxy Rotation
Multiple requests from the same IP address are the main sign that a site is being scraped. A proxy rotation service can help you eliminate this issue by changing the IP address with every request you make. Depending on the project tasks and budget, either datacenter or residential IPs can be chosen.
It’s very difficult to extract large volumes of information from complex sites, but we know how to do it. At DataOx, we develop a lot of workarounds using human behavior imitation, proxy rotation mechanisms, and “careful” scraping.
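To illustrate the proxy rotation idea, the sketch below cycles through a small proxy pool and builds a fresh opener for each request. The proxy addresses are placeholders from a documentation-only IP range, and `opener_with_next_proxy` is a hypothetical helper; a commercial rotation service would normally supply the pool.

```python
import itertools
import urllib.request
from typing import Tuple

# Placeholder proxy addresses (documentation IP range), for illustration only.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

def opener_with_next_proxy() -> Tuple[str, urllib.request.OpenerDirector]:
    """Return the next proxy in the pool and an opener that routes through it."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)
```

Calling `opener.open(url)` on the returned opener would send that one request through the selected proxy, so consecutive requests leave from different IP addresses.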
Dynamic data is another category of information that is difficult to scrape, since you have to deal with information that changes very often. Yet monitoring dynamic data streams allows better insights and faster actions based on up-to-the-moment information, so the time between cause and effect can be significantly reduced. Continuous extraction of dynamic data with search engine bots and other tools can produce a comprehensive, high-volume database. However, such data is often a time-sensitive asset, so it’s vital to process it, analyze it, and act on it quickly. That’s why you need not only reliable and suitable storage for the scraped data, but also effective tools and mechanisms for putting it to use.
Weak and Old Web Sources
Some websites fall into the “complex” category even though they are easy to scrape, because they are just as easy to break. These web sources were usually made a very long time ago or were not built for a lot of visitors. The major problem with scraping such websites is the risk of bringing the website down because of its weak servers. For instance, if the website’s capacity is 1,000 simultaneous visits and our scrapers push it to 1,001, the website will go down. To avoid that, we had to develop a mechanism that examines a website before we scrape it.
To summarize, “complex” websites can limit the amount of data you can get from them in a particular period. At DataOx, we haven’t faced any protected or large web source that we can’t scrape, but each one requires a custom approach.
Just as important, we know that redundant data can be a real problem, particularly when you extract at a large scale. So, we take care to deliver cleaned and accurate data to our clients, removing redundant, duplicate, irrelevant, or faulty information from the datasets we provide.
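For the weak-server risk described in this section, one simple safeguard is to cap how many requests run at the same time. The sketch below is only an illustration of that idea, not the DataOx mechanism; `fetch_all_gently`, the caller-supplied `fetch` function, and the default limit are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def fetch_all_gently(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_concurrent: int = 2,  # Assumed limit; keep it well below the site's capacity.
) -> List[str]:
    """Fetch every URL while never running more than max_concurrent requests at once."""
    # The pool size is the cap: at most max_concurrent fetches are in flight.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))
```

A lower limit makes the crawl slower but keeps the load on a fragile server small.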
DataOx can act as a data delivery service, in which case you simply receive clean, accurate, and up-to-date data, sent once or on a schedule. Alternatively, our scraping experts can help you develop a custom solution for scraping complex websites. Just schedule a free consultation.
Publishing date: Sun Apr 23 2023
Last update date: Wed Apr 19 2023