Beautiful Soup - Remove Child Elements



An HTML document is a hierarchical arrangement of tags, where a tag may contain one or more other tags nested at several levels. How do we remove the child elements of a given tag? With BeautifulSoup, this is straightforward.

The BeautifulSoup library provides two main methods for removing a tag: decompose() and extract(). The difference is that extract() returns the element that was removed, whereas decompose() simply destroys it.
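The difference between the two methods can be seen in a minimal sketch. The HTML string and the variable names below are made up for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the two methods
html = "<div><p>first</p><span>second</span></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.p.extract()
print(removed)   # <p>first</p>
print(soup)      # <div><span>second</span></div>

# decompose() removes the tag and destroys it; nothing is returned
result = soup.span.decompose()
print(result)    # None
print(soup)      # <div></div>
```

Use extract() when the removed element is still needed, for example to insert it somewhere else, and decompose() when it can be discarded.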

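Hence, to remove the child elements of a given Tag object, call its findChildren() method and then call extract() or decompose() on each element returned. Note that findChildren() returns all the tags nested inside the given tag, at every level, unless recursive=False is passed.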
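As a rough sketch of this approach, the snippet below removes only the direct children of one tag. The markup is hypothetical, and recursive=False is passed on the assumption that only the immediate children are of interest.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html = "<div><ul><li>a</li><li>b</li></ul><p>keep</p></div>"
soup = BeautifulSoup(html, "html.parser")

target = soup.find("ul")

# recursive=False restricts the search to the direct children of <ul>
for child in target.findChildren(recursive=False):
    child.decompose()

print(soup)   # <div><ul></ul><p>keep</p></div>
```

The <p> tag is untouched because it is a sibling of <ul>, not one of its children.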

Consider the following code segment −

soup = BeautifulSoup(fp, "html.parser")
soup.decompose()   # called on the soup itself, not on a child tag
print(soup)

This destroys the entire soup object, that is, the parsed tree of the whole document. Obviously, that is not what we want.

Now consider the following code −

soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()
for tag in tags:
   for t in tag.findChildren():
      t.extract() 

In the document tree, <html> is the first tag and every other tag is nested inside it. Hence the first iteration of the outer loop already removes all the other tags, leaving only <html> and </html>.
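This behaviour can be checked with a small, self-contained sketch. The document below is a made-up stand-in for the parsed file.

```python
from bs4 import BeautifulSoup

# Hypothetical document standing in for the parsed file
html = "<html><body><h2>Title</h2><p>Text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all():
    # In the first pass tag is <html>, so every other tag is extracted here
    for t in tag.findChildren():
        t.extract()

print(soup)   # <html></html>
```

The later iterations find nothing left to remove, because the remaining tags in the list have already lost their children.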

This approach is more useful when we want to remove the children of a specific tag. For example, you may want to remove the header cells of an HTML table.

The following HTML script has a table whose first <tr> element contains the headers, marked by <th> tags.

<html>
   <body>
      <h2>Beautiful Soup - Remove Child Elements</h2>
      <table border="1">
         <tr class='header'>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

We can use the following Python code to remove all the child elements of the <tr> tag that holds the <th> cells.

Example

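from bs4 import BeautifulSoup

# Parse the HTML file shown above
with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

# Select the header row by its class attribute
tags = soup.find_all('tr', {'class':'header'})

for tag in tags:
   # Detach every tag nested inside the header row
   for t in tag.findChildren():
      t.extract()

print(soup)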

Output

<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class="header">

</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

It can be seen that the <th> elements have been removed from the parsed tree, while the now empty <tr class="header"> element itself remains.
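If the empty header row should go as well, one possible variation is sketched below. It uses the clear() method of a Tag to empty it in one call, and then decompose() on the <tr> itself. The shortened table string is an assumption made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table used above
html = ('<table><tr class="header"><th>Name</th><th>Age</th></tr>'
        '<tr><td>Ravi</td><td>23</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

header = soup.find("tr", class_="header")

# clear() removes everything inside the tag but keeps the tag itself
header.clear()
after_clear = str(soup)
print(after_clear)   # <table><tr class="header"></tr><tr><td>Ravi</td><td>23</td></tr></table>

# decompose() on the row removes the <tr> element as well
header.decompose()
print(soup)          # <table><tr><td>Ravi</td><td>23</td></tr></table>
```

Unlike the loop over findChildren(), clear() also removes any text placed directly inside the tag.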
