Installing modules for Web Scraping – Day 1

Published by BrighterBees on

Getting Started

To install the modules for web scraping, we need an IDE or a similar environment, such as Python IDLE, a code editor (VS Code, Sublime Text or Atom), PyCharm or Jupyter Notebook. Python IDLE is commonly used, but it is not as popular as Jupyter Notebook. You can use Jupyter in two ways: install it on your OS or use the browser version. We will prefer the browser version because it saves disk space.

  • Python IDLE – Every Python installation comes with an Integrated Development and Learning Environment, which you’ll see shortened to IDLE or even IDE. These are a class of applications that help you write code more efficiently. While there are many IDEs for you to choose from, Python IDLE is very bare-bones, which makes it the perfect tool for a beginning programmer. Python IDLE comes included in Python installations on Windows and Mac. If you’re a Linux user, then you should be able to find and download Python IDLE using your package manager. Once you’ve installed it, you can then use Python IDLE as an interactive interpreter or as a file editor.
  • Jupyter Notebook – The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more.

You can get the download links for the software above here –

popular IDEs

 

Modules for Web Scraping

So what are modules? A module is a file containing Python statements and definitions for different tasks. To begin with, we will use only the requests and beautifulsoup modules for web scraping.
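
To make the idea of a module as a file concrete, here is a minimal sketch. It uses string, a module from the Python standard library picked only for illustration (it is not needed for web scraping), and shows that importing a module gives access to the definitions stored in its file.

```python
import string  # a standard-library module, chosen only for illustration

# Every imported module has a name and, for modules written in Python,
# a path to the file it was loaded from.
print("Module name:", string.__name__)
print("Loaded from:", string.__file__)

# Definitions inside the module are reached with dot notation.
print(string.ascii_lowercase)            # a constant defined in the module
print(string.capwords("web scraping"))   # a function defined in the module
```

The requests and beautifulsoup modules work the same way once they are installed: you import them and then call the functions and classes they define.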

  • requests – The requests module allows you to send HTTP requests using Python. The HTTP request returns a Response object with all the response data (content, encoding, status, etc.).
  • beautifulsoup – Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
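
As a small illustration of the malformed-markup point, the sketch below parses a made-up HTML string in which the <b> tag is never closed. The string broken_html is hypothetical and exists only for this example. It assumes bs4 is already installed and uses the same "html.parser" parser as the rest of this post.

```python
from bs4 import BeautifulSoup

# Hypothetical "tag soup": the <b> tag is opened but never closed.
broken_html = "<div><p>Hello, <b>web scraping</p></div>"

# Beautiful Soup still builds a parse tree from the broken markup.
soup = BeautifulSoup(broken_html, "html.parser")

# Elements in the tree can be reached by tag name.
print(soup.p.get_text())  # text of the first <p> element
print(soup.b.get_text())  # text of the first <b> element
```

Even though the markup is broken, both the paragraph and the bold element can be read from the parse tree.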

By now you are familiar with IDLE, code editors and modules, so we can move on to the practical work. In the first method, we will create a virtual environment, install the modules and work in it. In the second method, we will use Jupyter Notebook, which often comes with these modules pre-installed, so we only need to write code. Let's begin with the first one.

Method One –

Note: Python (which includes IDLE) should be installed on your system.

We will start by creating a virtual environment. This step is optional, so you can skip it.

Step : 1 Open your CMD (Command Prompt), create a directory named web-scraping and move into it.

>>> mkdir web-scraping

>>> cd web-scraping

Step : 2 Installing virtual environment module

>>> pip install virtualenv

Step : 3 Making virtual environment

>>> virtualenv myenv

Step : 4 Activating the environment

>>> myenv\Scripts\activate
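
If you are not sure whether the activation in Step 4 worked, a quick check from Python can help. This is a minimal sketch, assuming a recent virtualenv release (or the built-in venv module), where sys.prefix points at the environment and sys.base_prefix at the base Python installation. The function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment():
    """Return True when this interpreter runs inside a virtual environment."""
    # Inside an environment, sys.prefix is the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

When the environment is active, the interpreter path should point inside the myenv directory.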

Step : 5 Installing requests module

>>> pip install requests

Step : 6 Installing beautifulsoup module

>>> pip install bs4
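
To confirm that Steps 5 and 6 succeeded, you can ask Python which versions were installed. The sketch below assumes Python 3.8 or newer, where importlib.metadata is part of the standard library. Note that the beautifulsoup module is published on PyPI as beautifulsoup4, and the bs4 package installed in Step 6 is a thin wrapper that pulls it in. The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as they appear on PyPI.
for name in ("requests", "beautifulsoup4"):
    found = installed_version(name)
    print(name, "->", found if found else "not installed")
```

If either line reports "not installed", check that the virtual environment is active and run the pip command again.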

Step : 7 Create a Python file named code.py, open it in a code editor such as VS Code or Sublime Text, and write the following code

>>> import requests

>>> from bs4 import BeautifulSoup

>>> url = 'https://pythonprogramming.net/introduction-scraping-parsing-beautiful-soup-tutorial/'
>>> print(requests.get(url).content[:100])

# The code above prints the first 100 bytes of the raw page content, including whitespace, symbols and numbers.

>>> raw_content = requests.get(url).content

>>> soup = BeautifulSoup(raw_content, "html.parser")

>>> print(soup)

# bs4 parses the raw content into a parse tree; printing soup shows the HTML of the page.
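
The script in Step 7 requests the page twice and never checks whether the request succeeded. Here is a slightly more defensive sketch of the same idea. The helper name fetch_soup and the 10-second timeout are assumptions made for this example, not part of the original script.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page once and return it as a BeautifulSoup parse tree."""
    # The timeout stops the call from waiting forever on a slow server.
    response = requests.get(url, timeout=timeout)

    # Raise an exception on HTTP error statuses (4xx or 5xx).
    response.raise_for_status()

    # The Response object carries the status and encoding mentioned earlier.
    print("Status:", response.status_code, "| Encoding:", response.encoding)

    return BeautifulSoup(response.content, "html.parser")
```

Calling fetch_soup(url) with the url from Step 7 should give the same soup object as before, but it downloads the page only once and raises an exception instead of silently parsing an error page.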

Method Two

Open your installed Jupyter Notebook, or the browser version, and write the code

Step: 1 Import the requests module

>>> import requests

Step: 2 Import the BeautifulSoup module

>>> from bs4 import BeautifulSoup

Step : 3 Put the url of the website

>>> url = 'https://pythonprogramming.net/introduction-scraping-parsing-beautiful-soup-tutorial/'

Step : 4 Finally, get the HTML content of the website

>>> print(requests.get(url).content[:100])

>>> raw_content = requests.get(url).content

>>> soup = BeautifulSoup(raw_content, "html.parser")

>>> print(soup)
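
Printing the whole soup is rarely the end goal. As a minimal sketch of a possible next step, the snippet below pulls the page title and the link targets out of a parse tree. To stay runnable without a network connection, it parses a small made-up HTML string (sample_html is hypothetical) instead of the page from Step 3. The same calls should work on the soup built in Step 4.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, used here in place of a downloaded page.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Web scraping</h1>
    <a href="/day-1">Day 1</a>
    <a href="/day-2">Day 2</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is the first <title> element in the parse tree.
title = soup.title.get_text()

# find_all returns every matching element; each tag's attributes
# can be read like dictionary keys.
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```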

 

 

In this blog you have learned about web scraping and the libraries needed for it. Subscribe to get more content like this. If you have any problem regarding the blog, comment below.

If you want to know more about Web Scraping, click here.

