Beautiful Soup - Remove HTML Tags



In this chapter, let us see how we can remove all tags from an HTML document. HTML is a markup language made up of predefined tags. A tag marks the text associated with it so that the browser renders that text according to the tag's predefined meaning. For example, the word Hello marked with the <b> tag, as in <b>Hello</b>, is rendered in boldface by the browser.

If we want to extract the raw text between the tags in an HTML document, we can use either of two methods in the Beautiful Soup library: get_text() or extract().

The get_text() method collects all the text in the document and returns it as a single string. The original document tree is not changed.

In the example below, get_text() returns the text of the document with all the HTML tags removed.

Example

html = '''
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
 DJs flock by when MTV ax quiz prog.
 Junk MTV quiz graced by fox whelps.
 Bawds jog, flick quartz, vex nymphs.

Note that the soup object in the above example still contains the parsed tree of the HTML document.
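By default, get_text() joins the text fragments exactly as they appear in the markup, including leading spaces and line breaks. The short sketch below, which uses a shortened two-paragraph document made up for illustration, shows how the separator and strip arguments of get_text() can tidy the result, and confirms that the tree is left intact.

```python
from bs4 import BeautifulSoup

# A shortened sample document, assumed here only for illustration
html = "<html><body><p> The quick, brown fox. </p><p> DJs flock by. </p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Default call: text fragments are concatenated with their whitespace intact
raw = soup.get_text()

# strip=True trims each fragment; separator is placed between fragments
clean = soup.get_text(separator="\n", strip=True)

print(repr(raw))
print(clean)

# The parsed tree is unchanged: both <p> tags are still present
print(len(soup.find_all("p")))
```

Passing strip=True is usually convenient when the markup is indented, because the indentation would otherwise end up in the returned string.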

Another approach is to collect the string enclosed in each Tag object before extracting the tag from the soup object. Some tags have no single string of their own: tag.string is None for a tag that has more than one child, such as <html> or <body> in the document below. So, we concatenate the strings from all the other tags to obtain the plain text of the HTML document.

The following program demonstrates this approach.

Example

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
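# find_all() with no arguments returns every tag in the document
tags = soup.find_all()

string = ''
for tag in tags:
   # tag.string is None for container tags such as <html> and <body>
   if tag.string is not None:
      string = string + tag.string + '\n'
   # Detach the tag from the tree once its text has been collected
   tag.extract()
print("Document text after removing tags:")
print(string)
print("Document:")
print(soup)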

Output

Document text after removing tags:
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Document:

The clear() method removes the contents of a tag object but doesn't return them. Similarly, the decompose() method destroys the tag along with all its child elements. Hence, these methods are not suitable for retrieving the plain text from an HTML document.
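The difference between the three methods can be seen from their return values. The following is a minimal sketch on a small document made up for illustration; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

# A small sample document, assumed here only for illustration
soup = BeautifulSoup("<div><p>one</p><p>two</p><p>three</p></div>", "html.parser")
first, second, third = soup.find_all("p")

# extract() detaches the tag from the tree and returns it
extracted = first.extract()

# clear() empties the tag in place and returns nothing
cleared = second.clear()

# decompose() destroys the tag and its children and returns nothing
destroyed = third.decompose()

print(extracted.string)    # text is still reachable through the returned tag
print(cleared, destroyed)  # both are None
print(soup)                # only the emptied <p> remains inside <div>
```

Only extract() hands back an object from which the text can still be read after the tag has left the tree.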
