Beautiful Soup - Modifying the Tree



One of the powerful features of the Beautiful Soup library is the ability to manipulate the parsed HTML or XML document and modify its contents.

The Beautiful Soup library provides methods to perform the following operations −

  • Add contents or a new tag to an existing tag of the document

  • Insert contents before or after an existing tag or string

  • Clear the contents of an already existing tag

  • Modify the contents of a tag element

Add content

You can add to the content of an existing tag by using the append() method of a Tag object. It works like the append() method of a Python list.

In the following example, the HTML markup has a <p> tag. With append(), additional text is added at the end of its contents.

Example

from bs4 import BeautifulSoup

markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print (soup)
tag = soup.p

tag.append(" World")
print (soup) 

Output

<p>Hello</p>
<p>Hello World</p>

With the append() method, you can also add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method, and then pass it to the append() method.

Example

from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print (soup.prettify()) 

Output

<b>
   Hello
   <i>
      World
   </i>
</b>
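The new_tag() method also accepts keyword arguments, which become attributes of the new tag. The following is a minimal sketch; the tag name, link text and URL are placeholders chosen only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Hello</b>', 'html.parser')

# Keyword arguments passed to new_tag() become attributes of the tag.
# The URL below is a placeholder, not taken from the tutorial.
link = soup.new_tag('a', href='http://example.com')
link.string = 'link'

# Append the new <a> tag after the existing text of <b>.
soup.b.append(link)

result = str(soup)
print(result)
```

Printing str(soup) instead of soup.prettify() shows the document without added line breaks and indentation.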

If you have to add a string to the document, you can append a NavigableString object.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
new_string = NavigableString(" World")
tag.append(new_string)
print (soup.prettify())

Output

<b>
   Hello
   World
</b>
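A plain Python string passed to append() is stored as a NavigableString as well, and the soup object offers new_string() as another way to create one. The sketch below compares the two approaches; the sample strings are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Hello</b>', 'html.parser')
tag = soup.b

# A plain str is converted to a NavigableString when it is appended.
tag.append(' World')

# new_string() creates a NavigableString from the soup object.
tag.append(soup.new_string('!'))

# Every child of <b> is now a NavigableString.
kinds = [type(child).__name__ for child in tag.contents]
print(kinds)
print(soup)
```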

From Beautiful Soup version 4.7 onwards, the extend() method has been added to the Tag class. It adds all the elements of a list to the tag, in order.

Example

from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
vals = ['World.', 'Welcome to ', 'TutorialsPoint']
tag.extend(vals)
print (soup.prettify())

Output

<b>
   Hello
   World.
   Welcome to
   TutorialsPoint
</b>
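Each element passed to extend() becomes a separate child of the tag. A short sketch that inspects the contents list afterwards (the strings are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Hello</b>', 'html.parser')
tag = soup.b

# Each list element is added as its own child, in order.
tag.extend([' World.', ' Welcome'])

contents = list(tag.contents)
text = tag.get_text()
print(contents)
print(text)

# With more than one child, tag.string is None; use get_text() instead.
print(tag.string)
```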

Insert Contents

Instead of adding a new element at the end, you can use the insert() method to add an element at a given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves like insert() on a Python list object.

In the following example, a new string is added to the <b> tag at position 1, that is, after the existing string at position 0. The output shows the modified document.

Example

from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert(1, "Tutorial ")
print (soup.prettify())

Output

<b>
   Excellent
   Tutorial
</b>
<u>
   from TutorialsPoint
</u>
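Positions are counted from zero, as in a Python list, so position 0 places the new element before the existing contents. A minimal sketch with an illustrative string:

```python
from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

# Position 0 inserts the string before the existing child of <b>.
tag.insert(0, 'Most ')

contents = list(tag.contents)
print(contents)
print(soup)
```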

Beautiful Soup also has insert_before() and insert_after() methods. They insert a tag or a string immediately before or after a given element. The following code adds the string "Python Tutorial" after the <b> tag.

Example

from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert_after("Python Tutorial")
print (soup.prettify())

Output

<b>
   Excellent
</b>
Python Tutorial
<u>
   from TutorialsPoint
</u>

Continuing the same example, the insert_before() method is used below to add the text "Here is an " before the <b> tag.

tag.insert_before("Here is an ")
print (soup.prettify())

Output

Here is an
<b>
   Excellent
</b>
Python Tutorial
<u>
   from TutorialsPoint
</u>
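Both methods accept Tag objects as well as strings. The following sketch inserts a new <i> tag after <b> and a string before it; the tag name and the text are chosen only for illustration.

```python
from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')

# Build a new <i> tag and place it right after <b>.
new_tag = soup.new_tag('i')
new_tag.string = 'tutorial '
soup.b.insert_after(new_tag)

# Place a plain string right before <b>.
soup.b.insert_before('An ')

result = str(soup)
print(result)
```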

Clear the Contents

Beautiful Soup provides more than one way to remove the contents of an element from the document tree. Each of these methods behaves differently.

The clear() method is the most straightforward. It simply removes the contents of the specified Tag element. The following example shows its usage.

Example

from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('u')

tag.clear()
print (soup.prettify())

Output

<b>
   Excellent
</b>
<u>
</u>

It can be seen that the clear() method removes the contents, keeping the tag intact.
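Keeping the tag intact also means that its attributes survive. A small sketch, assuming an id attribute added here purely for illustration:

```python
from bs4 import BeautifulSoup

# The id attribute is an illustrative addition to the tutorial's markup.
soup = BeautifulSoup('<u id="source">from TutorialsPoint</u>', 'html.parser')
tag = soup.u

# clear() removes the children but leaves the tag and its attributes.
tag.clear()

result = str(soup)
print(result)
print(tag.contents)
```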

In the next example, we parse the following HTML document and call the clear() method on all tags.

<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>

Here is the Python code that uses the clear() method:

Example

from bs4 import BeautifulSoup

# Open the file in a with block so that it is closed after parsing
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   tag.clear()
print (soup.prettify())

Output

<html>
</html>
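In the example above, find_all() returns the <html> tag first, so clearing it already removes everything beneath it, which is why only the empty <html> tag remains. To empty only certain tags, pass a tag name to find_all(). The sketch below uses a shortened inline document (an assumption made so that the block runs without index.html).

```python
from bs4 import BeautifulSoup

# Shortened inline stand-in for index.html, used only for illustration.
markup = '<body><p>The quick, brown fox.</p><p>DJs flock by.</p></body>'
soup = BeautifulSoup(markup, 'html.parser')

# Clear only the <p> tags; <body> and the <p> tags themselves remain.
for p in soup.find_all('p'):
    p.clear()

result = str(soup)
print(result)
```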

The extract() method removes either a tag or a string from the document tree, and returns the object that was removed.

Example

from bs4 import BeautifulSoup

# Open the file in a with block so that it is closed after parsing
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   obj = tag.extract()
   print ("Extracted:",obj)

print (soup)

Output

Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>

You can extract either a tag or a string. The following example shows a tag being extracted.

Example

html = '''
   <ol id="HR">
   <li>Rani</li>
   <li>Ankita</li>
   </ol>
'''
from bs4 import BeautifulSoup


soup = BeautifulSoup(html, 'html.parser')
obj=soup.find('ol')
obj.find_next().extract()
print (soup)

Output

<ol id="HR">
   <li>Ankita</li>
</ol>
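Because extract() returns the removed object, it can be kept and attached again elsewhere. A minimal sketch that moves the first <li> element to the end of the list, using the same names without the whitespace:

```python
from bs4 import BeautifulSoup

html = '<ol id="HR"><li>Rani</li><li>Ankita</li></ol>'
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches the first <li> and returns it.
removed = soup.li.extract()

# A detached element no longer has a parent.
detached = removed.parent is None

# Attach the same element again at the end of the list.
soup.ol.append(removed)

result = str(soup)
print(detached)
print(result)
```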

Change the extract() statement as follows to remove only the inner text of the first <li> element. The empty <li> tag stays in the document.

Example

obj.find_next().string.extract()

Output

<ol id="HR">
   <li></li>
   <li>Ankita</li>
</ol>

Another method, decompose(), removes a tag from the tree and then completely destroys it and its contents −

Example

html = '''
   <ol id="HR">
      <li>Rani</li>
      <li>Ankita</li>
   </ol>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag1=soup.find('ol')
tag2 = soup.find('li')
tag2.decompose()
print (soup)
print (tag2.decomposed)

Output

<ol id="HR">

<li>Ankita</li>
</ol>
True

The decomposed property is True if an element has been decomposed, and False otherwise.
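The difference between extract() and decompose() can be seen by checking the decomposed property after each call. This sketch assumes a Beautiful Soup release recent enough to provide that property.

```python
from bs4 import BeautifulSoup

html = '<ol id="HR"><li>Rani</li><li>Ankita</li></ol>'
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('li')

# extract() only detaches the element; it can still be used.
first.extract()

# decompose() detaches the element and destroys it.
second.decompose()

flags = (first.decomposed, second.decomposed)
result = str(soup)
print(flags)
print(result)
```

Use extract() when the removed element is still needed, and decompose() when it is to be discarded.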

Modify the Contents

The replace_with() method replaces an element in the tree with another tag or string.

Like a Python string, a NavigableString is immutable and can't be modified in place. However, you can use replace_with() to replace the inner string of a tag with another.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print (tag.string)

Output

OnLine Tutorials Library
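The replace_with() method works on tags as well, and it returns the element that was replaced. A minimal sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>Python</b></p>', 'html.parser')

# Build the replacement tag.
new_tag = soup.new_tag('i')
new_tag.string = 'World'

# Replace <b> with <i>; the replaced <b> tag is returned.
old_tag = soup.b.replace_with(new_tag)

result = str(soup)
print(result)
print(old_tag)
```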

Here is another example to show the use of replace_with(). Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a method such as replace_with().

Example

from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print (obj2)

Output

<html><body><book><title>Python</title></book></body></html>

The wrap() method wraps an element in the tag you specify. It returns the new wrapper.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello Python</p>", 'html.parser')
tag = soup.p
newtag = soup.new_tag('b')
tag.string.wrap(newtag)

print (soup)

Output

<p><b>Hello Python</b></p>
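The wrap() method can be called on a tag as well as on a string. The sketch below wraps the whole <p> tag in a <div>; the choice of <div> is only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello Python</p>', 'html.parser')

# Wrap the whole <p> tag; wrap() returns the new wrapper tag.
wrapper = soup.p.wrap(soup.new_tag('div'))

result = str(soup)
print(wrapper.name)
print(result)
```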

On the other hand, the unwrap() method replaces a tag with whatever's inside that tag. It's good for stripping out markup.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>Python</b></p>", 'html.parser')
tag = soup.p
tag.b.unwrap()

print (soup)

Output

<p>Hello Python</p>
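To strip several kinds of markup at once, unwrap() can be called on every tag returned by find_all(). A minimal sketch, assuming the <b> and <i> tags are the ones to be removed:

```python
from bs4 import BeautifulSoup

markup = '<p>Hello <b>Python</b> and <i>Beautiful Soup</i></p>'
soup = BeautifulSoup(markup, 'html.parser')

# Replace each <b> and <i> tag with its own contents.
for tag in soup.find_all(['b', 'i']):
    tag.unwrap()

result = str(soup)
print(result)
```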