Beautiful Soup - Remove Empty Tags



In HTML, most elements have an opening and a closing tag with content in between, such as <b> and </b> or <h1> and </h1>. Some elements, called void (or self-closing) elements, have no closing tag and no text content, for example <img>, <br> and <input>. However, while composing HTML, empty elements such as <p></p> may be inserted inadvertently. Such empty tags can be removed with the help of Beautiful Soup library functions.

Removing a tag that has no text between its opening and closing tags is easy. Call the extract() method on a tag if the length of its inner text is 0.

for tag in tags:
   if (len(tag.get_text(strip=True)) == 0):
      tag.extract()

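However, this would also remove tags such as <hr>, <img> and <input>. These are all void (self-closing) elements, which never contain text. You would not want to remove tags that have one or more attributes, even if there is no text associated with them. So, before removing a tag, you'll have to check both that it has no attributes and that get_text() returns an empty string. Note that a void tag without any attributes, such as a bare <hr>, still passes this check and is removed.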
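The checks described above can be sketched as a small helper. This is only an illustration under stated assumptions: it uses the html.parser backend, and the VOID_TAGS set and the remove_empty_tags() function are hypothetical names introduced here, not part of Beautiful Soup. The helper additionally skips void elements by name, so a bare <hr> is kept.

```python
from bs4 import BeautifulSoup

# Hypothetical set of void element names (not provided by Beautiful Soup).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

def remove_empty_tags(soup):
    """Remove tags that have no text and no attributes, keeping void tags."""
    for tag in soup.find_all():
        if tag.name in VOID_TAGS:
            continue  # void elements are empty by design
        if tag.attrs:
            continue  # attributes carry information, so keep the tag
        if not tag.get_text(strip=True):
            tag.extract()
    return soup

# Illustrative input, assumed for this sketch only.
sample = '<div><p>Text</p><hr><b></b><p></p><img src="a.png"></div>'
cleaned = remove_empty_tags(BeautifulSoup(sample, "html.parser"))
print(cleaned)
```

One limitation of this sketch: a container without text or attributes is removed together with its children, so a <p> that wraps only an <img> would be dropped as a whole.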

In the following example, the HTML string contains both empty textual tags and void tags. The code retains the tags that have attributes and removes the ones that have neither text nor attributes.

Example

html = '''
<html>
   <body>
      <p>Paragraph</p>
      <embed type="image/jpg" src="Python logo.jpg" width="300" height="200">
      <hr>
      <b></b>
      <p>
      <a href="#">Link</a>
      <ul>
      <li>One</li>
      </ul>
      <input type="text" id="fname" name="fname">
      <img src="img_orange_flowers.jpg" alt="Flowers">
   </body>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()

for tag in tags:
   # remove a tag only if it has no text and no attributes
   if len(tag.get_text(strip=True)) == 0:
      if len(tag.attrs) == 0:
         tag.extract()
print(soup)

Output

<html>
<body>
<p>Paragraph</p>
<embed height="200" src="Python logo.jpg" type="image/jpg" width="300"/>

<p>
<a href="#">Link</a>
<ul>
<li>One</li>
</ul>
<input id="fname" name="fname" type="text"/>
<img alt="Flowers" src="img_orange_flowers.jpg"/>
</p>
</body>
</html>

Note that the original HTML has a <p> tag without its closing </p>. The parser automatically inserts the closing tag. The position of the inserted closing tag may differ if you change the parser to lxml or html5lib.
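The automatic closing can be observed on a smaller document. The following is a minimal sketch that uses only html.parser, since lxml and html5lib are optional installs and may place the closing tag elsewhere. The snippet string is an illustrative assumption, not part of the example above.

```python
from bs4 import BeautifulSoup

# Illustrative snippet: the <p> tag is never closed explicitly.
snippet = '<body><p>Intro<a href="#">Link</a></body>'

soup = BeautifulSoup(snippet, "html.parser")

# html.parser closes the open <p> when it reaches </body>,
# so the link ends up inside the paragraph.
print(soup)
print(soup.a.parent.name)
```

Printing the parent of the <a> tag is a quick way to check where a given parser decided to close the paragraph.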
