Beautiful Soup - Specifying the Parser



An HTML document is parsed into an object of the BeautifulSoup class. The constructor has one required argument, the markup, which is either an HTML string or a file object opened on the HTML file. All other arguments are optional; the most important of them is features.

BeautifulSoup(markup, features)

Here markup is an HTML string or file object. The features parameter specifies the parser to be used. It may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or the type of markup to be parsed ("html", "html5", or "xml").
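As a minimal sketch of the two accepted forms of markup, the snippet below passes the same HTML once as a string and once as a file-like object. The sample HTML and the use of io.StringIO to stand in for an open file are assumptions made for illustration.

```python
import io

from bs4 import BeautifulSoup

# Sample markup (hypothetical, for illustration only)
html = "<html><body><h1>Title</h1></body></html>"

# markup given as a string
from_string = BeautifulSoup(html, "html.parser")

# markup given as a file-like object; Beautiful Soup reads its contents
from_file = BeautifulSoup(io.StringIO(html), "html.parser")

print(from_string.h1.string)
print(str(from_string) == str(from_file))
```

Both calls should produce the same tree, since the only difference is how the markup reaches the constructor.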

If the features argument is not given, BeautifulSoup chooses the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.
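The following sketch shows one way to see which parser was picked when features is omitted. It assumes that the tree builder's NAME attribute identifies the parser in use, and it silences the warning that Beautiful Soup emits when it has to guess. The sample markup is hypothetical.

```python
import warnings

from bs4 import BeautifulSoup

markup = "<p>Hello</p>"  # hypothetical sample markup

# Without features, Beautiful Soup guesses a parser and warns about it;
# the warning is suppressed here only to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    guessed = BeautifulSoup(markup)

# Naming the parser explicitly gives the same behavior on every machine.
explicit = BeautifulSoup(markup, "html.parser")

print("guessed :", guessed.builder.NAME)
print("explicit:", explicit.builder.NAME)
```

The guessed name depends on which libraries are installed, which is why naming the parser explicitly is usually the safer choice for code that runs in more than one environment.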

You can specify one of the following −

The type of markup you want to parse. Currently supported types are "html", "xml", and "html5".

The name of the parser library to be used. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).

To install lxml or html5lib parser, use the command −

pip3 install lxml
pip3 install html5lib
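If a requested parser is not installed, the constructor raises bs4.FeatureNotFound. The sketch below uses that to fall back through a preference list. The helper name make_soup and its preferred parameter are hypothetical, introduced only for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return a soup built with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hi</p>")
print(soup.builder.NAME, soup.p.string)
```

Because "html.parser" ships with Python, keeping it last in the list means the loop always finds a usable parser.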

These parsers have their advantages and disadvantages as shown below −

Parser: Python's html.parser

Usage − BeautifulSoup(markup, "html.parser")

Advantages

  • Batteries included
  • Decent speed
  • Lenient (As of Python 3.2)

Disadvantages

  • Not as fast as lxml, less lenient than html5lib.

Parser: lxml's HTML parser

Usage − BeautifulSoup(markup, "lxml")

Advantages

  • Very fast
  • Lenient

Disadvantages

  • External C dependency

Parser: lxml's XML parser

Usage − BeautifulSoup(markup, "lxml-xml")

Or BeautifulSoup(markup, "xml")

Advantages

  • Very fast
  • The only currently supported XML parser

Disadvantages

  • External C dependency

Parser: html5lib

Usage − BeautifulSoup(markup, "html5lib")

Advantages

  • Extremely lenient
  • Parses pages the same way a web browser does
  • Creates valid HTML5

Disadvantages

  • Very slow
  • External Python dependency
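The speed claims above can be checked roughly on your own machine. The following is a small timing sketch, not a rigorous benchmark: the generated sample document, the repeat count of 20, and the timings dictionary are all assumptions chosen for illustration, and parsers that are not installed are skipped.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample document: a list with 200 items
markup = "<ul>" + "".join(f"<li>item {i}</li>" for i in range(200)) + "</ul>"

timings = {}
for name in ("html.parser", "lxml", "html5lib"):
    try:
        BeautifulSoup(markup, name)  # probe: is this parser available?
    except FeatureNotFound:
        continue
    # Total seconds needed to parse the document 20 times
    timings[name] = timeit.timeit(lambda n=name: BeautifulSoup(markup, n), number=20)

# Print the fastest parser first
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:12s} {seconds:.4f} s")
```

Absolute numbers vary with hardware, library versions, and the document being parsed, so treat the output only as a relative comparison.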

Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b /></a>", "html.parser")
print(soup)

Output

<a><b></b></a>

An empty <b /> tag is not valid HTML. Hence the parser turns it into a <b></b> tag pair.

The same document is now parsed as XML (this requires lxml to be installed). Note that the empty <b /> tag is kept as an empty-element tag, serialized as <b/>, and that the document is given an XML declaration.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b /></a>", "xml")
print(soup)

Output

<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>
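Returning to the HTML case, the expansion of <b /> into <b></b> applies to ordinary tags, while HTML void elements such as <br> are handled differently. The sketch below, which uses a hypothetical sample string, parses both kinds of tag with the built-in parser so the difference can be seen side by side.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: <br /> is an HTML void element, <b /> is not
soup = BeautifulSoup("<p>one<br />two<b /></p>", "html.parser")

print(soup)

# is_empty_element is true only for tags that may be written without a closing tag
print(soup.br.is_empty_element, soup.b.is_empty_element)
```

The void <br> element should keep its self-closing form, while <b> is written out as an opening and closing pair.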

For a well-formed HTML document, all HTML parsers produce similar parse trees, though one parser will be faster than another.

However, if the HTML document is not well formed, different parsers give different results. See how the results differ when the invalid document "<a></p>" is parsed with each parser −

lxml parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "lxml")
print(soup)

Output

<html><body><a></a></body></html>

Note that the dangling </p> tag is simply ignored.

html5lib parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "html5lib")
print(soup)

Output

<html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, the html5lib parser pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.

Built-in html parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)

Output

<a></a>

This parser also ignores the closing </p> tag. However, it makes no attempt to create a well-formed HTML document: it adds neither a <body> tag nor an <html> tag.

Because the document "<a></p>" is invalid, no single result is strictly the "correct" one. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim to being "correct".
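The three results can also be collected in a single run. The sketch below loops over the parser names and records how each one serializes the broken document; the results dictionary is an illustrative name, and parsers that are not installed are reported instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # the invalid document used in the examples above

results = {}
for name in ("lxml", "html5lib", "html.parser"):
    try:
        results[name] = str(BeautifulSoup(broken, name))
    except FeatureNotFound:
        results[name] = None  # parser library is not installed

for name, rendered in results.items():
    print(f"{name:12s} {rendered if rendered is not None else '(not installed)'}")
```

Running a comparison like this on a sample of your own input is a quick way to decide which parser's repair strategy suits your data.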
