Beautiful Soup - Installation



Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

The Beautiful Soup package is not part of Python's standard library, so it must be installed separately. Before installing the latest version, let us create a virtual environment, which is the approach Python recommends for project-specific packages.

A virtual environment gives a project its own isolated copy of the Python environment, so packages installed for that project do not affect the system-wide setup.

We shall use the venv module from Python's standard library to create the virtual environment. pip is included by default with Python 3.4 and later.

Use the following command to create a virtual environment on Windows:

C:\Users\user>python -m venv myenv

On Ubuntu Linux, update the APT package index and install the venv package, if required, before creating the virtual environment:

mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~ $ sudo apt install python3-venv

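Then use the following command to create the virtual environment. Run it as your normal user rather than with sudo, so that the environment's files belong to you and pip can write to them later:

mvl@GNVBGL3:~ $ python3 -m venv myenv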
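
The same environment can also be created from Python itself, which may be convenient in a setup script. The following is a minimal sketch that assumes the standard library's venv.create function; the helper name create_env is hypothetical and used only for illustration.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return its path.

    create_env is an illustrative helper, not part of any library.
    with_pip=True asks venv to bootstrap pip into the new environment,
    which is also what "python -m venv" does by default.
    """
    env_path = Path(env_dir)
    venv.create(env_path, with_pip=with_pip)
    return env_path


# Example call (creates ./myenv relative to the current directory):
# create_env("myenv")
```

Creating the environment this way does not activate it; activation is still done from the shell.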

Next, activate the virtual environment. On Windows, use the following commands:

C:\Users\user>cd myenv
C:\Users\user\myenv>Scripts\activate
(myenv) C:\Users\user\myenv>

On Ubuntu Linux, use the following commands to activate the virtual environment:

mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$

The name of the virtual environment appears in parentheses at the start of the prompt. Now that the environment is active, we can install Beautiful Soup in it:

(myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 
143.0/143.0 KB 325.2 kB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1 

Note that 4.12.2 was the latest version of beautifulsoup4 when this output was captured. The 4.12 series runs on Python 3 only and declares support for Python 3.6 or later.

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball, unpack it, and install it with setup.py from the unpacked directory:

(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install 

To check that Beautiful Soup is properly installed, enter the following commands in the Python interactive shell:

>>> import bs4
>>> bs4.__version__
'4.12.2'

If the installation was not successful, the import statement fails with a ModuleNotFoundError.
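
The same check can be scripted without importing bs4, so that a missing package is reported instead of raising an error. This is a rough sketch using only the standard library (importlib.metadata, available from Python 3.8); the helper names in_virtualenv and installed_version are hypothetical.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print("Virtual environment active:", in_virtualenv())
for dist in ("beautifulsoup4", "soupsieve"):
    print(dist, "->", installed_version(dist) or "not installed")
```

Note that the lookup uses the distribution name given to pip (beautifulsoup4), not the import name (bs4).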

You will also need to install the requests library, an HTTP library for Python:

pip3 install requests

Installing a Parser

By default, Beautiful Soup uses the HTML parser included in Python's standard library. It also supports several third-party Python parsers, such as lxml and html5lib.

To install the lxml or html5lib parser, use the corresponding command:

pip3 install lxml
pip3 install html5lib
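
To see which of these parsers are usable in the current environment, one option is to probe for the underlying modules. The sketch below uses only the standard library; available_parsers is a hypothetical helper, and the names it returns are the parser strings passed to the BeautifulSoup constructor.

```python
from importlib.util import find_spec


def available_parsers():
    """List the Beautiful Soup parser names whose backing library is importable."""
    # html.parser ships with Python, so it is always available.
    parsers = ["html.parser"]
    # The lxml package provides both the HTML and the XML parser.
    if find_spec("lxml") is not None:
        parsers += ["lxml", "lxml-xml"]
    if find_spec("html5lib") is not None:
        parsers.append("html5lib")
    return parsers


print(available_parsers())
```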

Each parser has its advantages and disadvantages, as shown below:

Parser: Python's html.parser

Usage − BeautifulSoup(markup, "html.parser")

Advantages

  • Batteries included
  • Decent speed
  • Lenient (As of Python 3.2)

Disadvantages

  • Not as fast as lxml, less lenient than html5lib.

Parser: lxml's HTML parser

Usage − BeautifulSoup(markup, "lxml")

Advantages

  • Very fast
  • Lenient

Disadvantages

  • External C dependency

Parser: lxml's XML parser

Usage − BeautifulSoup(markup, "lxml-xml")

Or BeautifulSoup(markup, "xml")

Advantages

  • Very fast
  • The only currently supported XML parser

Disadvantages

  • External C dependency

Parser: html5lib

Usage − BeautifulSoup(markup, "html5lib")

Advantages

  • Extremely lenient
  • Parses pages the same way a web browser does
  • Creates valid HTML5

Disadvantages

  • Very slow
  • External Python dependency
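
The difference in leniency is easiest to see by feeding the same invalid markup to each parser. The following is a minimal sketch, assuming beautifulsoup4 is installed; parse_with_each is a hypothetical helper, the sample markup is illustrative, and parsers that are not installed are skipped.

```python
def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: repaired_markup} for every parser that is installed."""
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        # Beautiful Soup itself is missing, so nothing can be parsed.
        return {}

    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The library behind this parser name is not installed; skip it.
            continue
        results[name] = str(soup)
    return results


# "<a></p>" is invalid HTML: the closing </p> has no matching opening tag.
for name, output in parse_with_each("<a></p>").items():
    print(f"{name:12} {output}")
```

Each parser repairs the stray closing tag in its own way, so code that depends on the exact tree shape should name its parser explicitly rather than rely on whichever one happens to be installed.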