Introduction to Web Scraping – Day 0
What is Web Scraping?
The word scraping, in this context, means collecting something from the web. That raises two questions: what do we collect, and how do we collect it? The answer to the first is data. The answer to the second is trickier, because there are many ways to get data.
Normally we get data from a database, a data file, or a similar source. But what if we need a large amount of data that is only available online? We could copy and paste it by hand, but that wastes time, so we turn to web scraping.
Web Scraping is also called web harvesting or web data extraction, and is sometimes loosely grouped with data mining. It is the process of building an agent that can automatically download, parse, extract and organize useful information from the web.
In this series of blogs, we will walk through the main concepts of Web Scraping and try to make you comfortable with them.
Web Scraping grew out of screen scraping, which was used to integrate non-web-based applications or native Windows applications. Screen scraping was in use before the World Wide Web became widespread, but it could not scale as the WWW expanded. This made it necessary to automate the approach, and the resulting technique came to be called Web Scraping.
The uses of web scraping are almost as varied as the uses of the World Wide Web itself. Web scrapers can be built for tasks such as ordering food online, scanning online shopping websites for you, or buying a match ticket the moment it becomes available. Some common uses are:
- E-commerce Websites – Web scrapers can collect data, especially the price of a specific product, from various e-commerce websites for comparison.
- Content Aggregators – Web Scraping is widely used by content aggregators, such as news aggregators, to provide up-to-date data to their users.
- Search Engine Optimization (SEO) – Web Scraping is widely used by SEO tools like SEMRush, Majestic etc. to tell businesses how they rank for the search keywords that matter to them.
- Data for Machine Learning Projects – Machine learning projects often rely on Web Scraping to gather their data.
A Web Scraper consists of the following components –
- Web Crawler Module – An essential component of a web scraper, the web crawler module navigates the target website by making HTTP or HTTPS requests to its URLs. The crawler downloads the unstructured data (HTML content) and passes it to the extractor.
- Extractor – The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called a parser module and uses different parsing techniques, such as DOM parsing.
- Data Transformation and Cleaning Module – The extracted data is usually not ready for use as it is. It must first pass through a cleaning module. Methods such as string manipulation or regular expressions can be used for this purpose.
- Storage Module – After extracting and cleaning the data, we need to store it according to our requirements. The storage module outputs the data in a standard format, such as JSON or CSV, or stores it in a database.
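To make the four components a little more concrete, here is a minimal sketch of how they might fit together, using only Python's standard library. Everything specific in it is an assumption made for illustration: the function and class names (fetch_html, ProductExtractor, clean, to_csv, scrape), the sample HTML, and the "name" and "price" CSS classes are all hypothetical. A real scraper would more likely use dedicated libraries, and the crawler step is defined here but not called, so the example runs without network access.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Crawler step: download the raw HTML of one page (defined, not called below)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ProductExtractor(HTMLParser):
    """Extractor step: collect text from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per product found so far
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class
            if css_class == "name":
                # Assumption: each product starts with its name span.
                self.rows.append({"name": "", "price": ""})

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def clean(rows):
    """Cleaning step: trim whitespace and convert the price text to a float."""
    cleaned = []
    for row in rows:
        match = re.search(r"\d+(?:\.\d+)?", row["price"].replace(",", ""))
        if match:  # rows without a readable price are dropped
            cleaned.append({"name": row["name"].strip(), "price": float(match.group())})
    return cleaned


def to_csv(rows):
    """Storage step: serialize the cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def scrape(html):
    """Run the extractor and cleaning steps on already-downloaded HTML."""
    extractor = ProductExtractor()
    extractor.feed(html)
    extractor.close()
    return clean(extractor.rows)


# Hypothetical page content standing in for what the crawler would download.
SAMPLE_HTML = """
<ul>
  <li><span class="name"> Blue Mug </span><span class="price">$ 1,249.50</span></li>
  <li><span class="name">Red Pen</span><span class="price">$3</span></li>
</ul>
"""

if __name__ == "__main__":
    print(to_csv(scrape(SAMPLE_HTML)))
```

To run this against a live page, you would pass the result of fetch_html(url) to scrape() instead of SAMPLE_HTML, after adjusting the tag and class names to match that page's markup.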
In the next blog, we will show you how to prepare your system for Web Scraping, i.e., how to install the modules and libraries it needs. After that, we will move on to scraping a real website.