Beautiful Soup - Output Formatting



If the HTML string passed to the BeautifulSoup constructor contains HTML entities, they are converted to Unicode characters.

An HTML entity is a string that begins with an ampersand (&) and ends with a semicolon (;). Entities are used to display reserved characters (which would otherwise be interpreted as HTML code), as well as characters that are not easy to type on a keyboard. Some examples of HTML entities are −

Character  Description         Entity name  Entity number
<          less than           &lt;         &#60;
>          greater than        &gt;         &#62;
&          ampersand           &amp;        &#38;
"          double quote        &quot;       &#34;
'          single quote        &apos;       &#39;
“          left double quote   &ldquo;      &#8220;
”          right double quote  &rdquo;      &#8221;
£          pound               &pound;      &#163;
¥          yen                 &yen;        &#165;
€          euro                &euro;       &#8364;
©          copyright           &copy;       &#169;

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These are turned into "&amp;", "&lt;", and "&gt;".

All other entities stay converted: they appear in the output as Unicode characters, not as entities.
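
The default escaping can be sketched as follows. This is a minimal illustration, assuming the html.parser backend; the sample markup and the variable names text and markup are made up for this example.

```python
from bs4 import BeautifulSoup

# The entities in the input are decoded to plain characters while parsing.
soup = BeautifulSoup("<p>AT&amp;T &lt;rocks&gt;</p>", "html.parser")
text = soup.p.string   # the decoded text node: AT&T <rocks>

# On output, the ampersand and angle brackets in the text are escaped again,
# so the result is still valid markup.
markup = str(soup)     # <p>AT&amp;T &lt;rocks&gt;</p>

print(text)
print(markup)
```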

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
print(str(soup))

Output

Hello “World!”

If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
print(soup.encode())

Output

b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'

To change this behavior, pass a value for the formatter argument of the prettify() method (the encode() and decode() methods accept the same argument). The possible values for the formatter are −

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.

formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.

formatter="html5" − This is similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".

formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

Example

from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print("minimal: ")
print(soup.prettify(formatter="minimal"))
print("html: ")
print(soup.prettify(formatter="html"))
print("None: ")
print(soup.prettify(formatter=None))

Output

minimal:
<p>
 Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
</p>

html:
<p>
 Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
</p>

None:
<p>
 Il a dit <<Sacré bleu!>>
</p>
</p>
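
The difference between formatter="html" and formatter="html5" shows up on void tags. The following is a small sketch, assuming a Beautiful Soup release recent enough to know the "html5" formatter; the variable names as_html and as_html5 are illustrative only.

```python
from bs4 import BeautifulSoup

# Parse a single void tag and grab the resulting Tag object.
br = BeautifulSoup("<br>", "html.parser").br

# decode() returns the markup as a str and accepts the same formatter argument.
as_html = br.decode(formatter="html")    # keeps the closing slash: <br/>
as_html5 = br.decode(formatter="html5")  # omits the closing slash: <br>

print(as_html)
print(as_html5)
```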

In addition, the Beautiful Soup library provides formatter classes. You can pass an instance of either class as the formatter argument to the prettify() method.

HTMLFormatter class − Used to customize the formatting rules for HTML documents.

XMLFormatter class − Used to customize the formatting rules for XML documents.
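
As a rough illustration of a formatter object, the sketch below builds an HTMLFormatter around a custom substitution function. The uppercase function is a hypothetical helper written for this example, not part of the library. Because it replaces the default entity substitution, the angle brackets in the text are not escaped.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

def uppercase(text):
    # Hypothetical substitution function: called for each string
    # (and attribute value) in the document on output.
    return text.upper()

soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>", "html.parser")

# The function takes the place of the built-in entity substitution.
formatter = HTMLFormatter(entity_substitution=uppercase)
result = soup.prettify(formatter=formatter)
print(result)
```

Since the custom function does no escaping of its own, the output here is not valid HTML; a real formatter would usually escape the string as well as transform it.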
