Two Kinds of Website Information
Let’s talk about how web data can be valuable for your business. We classify all website data into two big categories: internal and external website data. Internal data is all the information that websites contain: text, pictures, videos, documents, and other files. It is also called website content. Other internal data could be URLs, HTML code, and metadata. External data is information about a website from other sources: statistical information (e.g., from archive.org), traffic data, rankings, and more.
In this article, we will look at internal data, how to extract it, and what value it can bring to your business.
Text data is text content: articles, comments, posts, descriptions of goods and services, prices, contacts, and much more. This data is commonly scraped and then transformed for further use. Text takes relatively little storage, so it is cheaper and easier to scrape at large scale than media files.
But as a rule, text data needs to be processed: parsed, cleansed, transformed, and checked for quality assurance.
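The processing steps mentioned above (parse, cleanse, check) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than a production pipeline: the names `TextExtractor`, `clean_text`, and `is_valid_record`, as well as the `min_length` threshold, are our own assumptions chosen for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def clean_text(raw_html):
    """Parse: strip tags. Cleanse: collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()


def is_valid_record(text, min_length=20):
    """Quality check (hypothetical rule): reject empty or very short records."""
    return len(text) >= min_length


if __name__ == "__main__":
    sample = "<p>Tom &amp; Jerry   Toys</p><script>var x = 1;</script><p>In stock</p>"
    print(clean_text(sample))
```

Real projects usually add steps specific to the target site, such as deduplication or price normalization, on top of a skeleton like this.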
Pictures and video content
Scraping media files consumes more resources than scraping text: proxies, services, and storage all cost more. You can read more about scraping images and video files in our data types articles.
We have done many projects that require document scraping, mostly from government websites: legal information, statutes, and statistical information, for example. A lot of valuable business information can also be collected from the US Securities and Exchange Commission website.
US government websites publish documents in many different formats, so as a rule such documents need to be cleansed after scraping.
We know from experience that the older the website, the more difficult it is to scrape.
Metadata, URLs, and sitemap
This type of data can be scraped and is valuable for SEO tasks. Search engines also use this kind of website data. Need content from your old website moved to a new one? We can scrape every URL and all content from your old website to build a new one without missing any information.
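As a rough illustration of what collecting this kind of data involves, the sketch below lists the URLs in a sitemap and pulls the title, meta tags, and links from a single page. It uses only the Python standard library and assumes the HTML and sitemap XML have already been downloaded. The function and class names (`urls_from_sitemap`, `MetaExtractor`, `extract_page_metadata`) are hypothetical, chosen for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> value found in a sitemap (or sitemap index)."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


class MetaExtractor(HTMLParser):
    """Collects the page title, named <meta> tags, and <a href> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(page_html, base_url):
    """Return title, meta tags, and absolute link URLs for one page."""
    parser = MetaExtractor()
    parser.feed(page_html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        # Resolve relative links against the page URL.
        "links": [urljoin(base_url, href) for href in parser.links],
    }
```

For a site migration, the list returned by `urls_from_sitemap` could serve as the checklist of pages to fetch, with `extract_page_metadata` applied to each one.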
We get a few requests from our clients to crawl the web broadly, find websites that contain specific information, and then scrape it. For instance, we can identify sites built on WordPress or another content management system (CMS) from the site’s HTML code. We can also search for a particular topic or keyword mentioned in a forum or article, such as file names or even people’s last names, or any other target information.
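To make this concrete, here is a minimal sketch of both ideas: guessing whether a page was produced by WordPress from common traces in its HTML, and counting keyword mentions in scraped text. The markers used (a `generator` meta tag and `/wp-content/` or `/wp-includes/` paths) are widely seen heuristics, not a guaranteed test, since sites can hide or remove them. The function names are hypothetical.

```python
import re

# Heuristic traces that WordPress sites commonly leave in their HTML.
# Assumes the name attribute precedes content in the generator meta tag.
WORDPRESS_MARKERS = (
    re.compile(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress', re.I),
    re.compile(r"/wp-(?:content|includes)/", re.I),
)


def looks_like_wordpress(page_html):
    """Return True if any known WordPress marker appears in the HTML."""
    return any(marker.search(page_html) for marker in WORDPRESS_MARKERS)


def find_keywords(text, keywords):
    """Count case-insensitive, whole-word occurrences of each keyword."""
    counts = {}
    for keyword in keywords:
        # Lookarounds keep "Smith" from matching inside "Smithson".
        pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.I)
        counts[keyword] = len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    html_sample = '<link href="https://example.com/wp-content/themes/x/style.css">'
    print(looks_like_wordpress(html_sample))
    print(find_keywords("Smith met smith; Smithson did not.", ["Smith"]))
```

The same pattern extends to other CMSs by adding their own markers to the list.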
Another quite common request is to scrape comments and reviews about a particular good, service, or event.