Beautiful Soup - Scrape Nested Tags



The tags in an HTML document are arranged hierarchically and can be nested to multiple levels. For example, the <head> and <body> tags are nested inside the <html> tag. Similarly, one or more <li> tags may be inside a <ul> tag. In this chapter, we shall find out how to scrape a tag that has one or more child tags nested in it.

Let us consider the following HTML document −

<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>

In this case, the two <div> tags and the <p> tag each have one or more child elements nested inside, whereas the <img> and <b> tags do not have any child tags.

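The findChildren() method, an older name for find_all(), returns a ResultSet of all the tags nested under a tag. By default the search is recursive, so the result includes tags at every depth and not only the direct children. If a tag doesn't have any tags nested in it, the ResultSet is empty and prints as [].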
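Because an empty ResultSet is falsy in Python, the same check can pick out the tags that have nothing nested in them. The following is a minimal sketch rather than a definitive recipe: it calls find_all(), of which findChildren() is an alias, and the variable name leaf_names is chosen here only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() with no arguments returns every nested tag;
# an empty ResultSet evaluates to False, so these tags have no child tags.
leaf_names = [tag.name for tag in soup.find_all() if not tag.find_all()]
print(leaf_names)   # ['b', 'img']
```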

Taking this as a cue, the following code finds out the tags under each tag in the document tree and displays the list.

Example

from bs4 import BeautifulSoup

html = """
   <div id="outer">
      <div id="inner">
         <p>Hello<b>World</b></p>
         <img src='logo.jpg'>
      </div>
   </div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() with no arguments visits every tag in the document tree
for tag in soup.find_all():
   print("Tag: {} attributes: {}".format(tag.name, tag.attrs))
   # findChildren() lists all tags nested inside the current tag
   print("Child tags: ", tag.findChildren())
   print()

Output

Tag: div attributes: {'id': 'outer'}
Child tags:  [<div id="inner">
<p>Hello<b>World</b></p>
<img src="logo.jpg"/>
</div>, <p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: div attributes: {'id': 'inner'}
Child tags:  [<p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: p attributes: {}
Child tags:  [<b>World</b>]

Tag: b attributes: {}
Child tags:  []

Tag: img attributes: {'src': 'logo.jpg'}
Child tags:  []
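
The lists above include tags at every depth, so the outer <div> also reports <p>, <b> and <img>. If only the immediate children are wanted, find_all() accepts recursive=False. The following is a minimal sketch, assuming the same HTML document as above; the variable names outer_children and inner_children are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

outer = soup.find('div', id='outer')
inner = soup.find('div', id='inner')

# recursive=False limits the search to tags directly under the given tag
outer_children = [tag.name for tag in outer.find_all(recursive=False)]
inner_children = [tag.name for tag in inner.find_all(recursive=False)]

print(outer_children)   # ['div']
print(inner_children)   # ['p', 'img']
```

Here <b> no longer appears under the inner <div>, because it is a child of <p> and not of the <div> itself.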