Beautiful Soup - Remove all Scripts



The <script> tag is one of the commonly used tags in HTML. It embeds a client-side script, such as JavaScript code, in an HTML document. In this chapter, we will use Beautiful Soup to remove script tags from an HTML document.

The <script> tag has a corresponding </script> closing tag. You can either reference an external JavaScript file through the src attribute of the opening tag, or place JavaScript code inline between the two tags.

To include an external JavaScript file, the syntax used is −

<head>
   <script src="javascript.js"></script>
</head>

You can then invoke the functions defined in this file from inside HTML.

Instead of referring to an external file, you can put JavaScript code directly in the HTML, between the <script> and </script> tags. If it is placed in the <head> section of the HTML document, the functionality is available throughout the document tree. If it is placed anywhere in the <body> section, the JavaScript functions are available from that point on.

<body>
   <p>Hello World</p>
   <script>
      alert("Hello World")
   </script>
</body>
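
As a rough illustration, the two forms can be told apart in a parsed document by checking whether a script tag carries a src attribute. The sketch below assumes the sample markup shown in it; the variable names external and inline are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup (made up for illustration): one external and one inline script
html = '''
<html>
   <head>
      <script src="javascript.js"></script>
   </head>
   <body>
      <p>Hello World</p>
      <script>
         alert("Hello World")
      </script>
   </body>
</html>
'''

soup = BeautifulSoup(html, "html.parser")

# src=True matches only script tags that have a src attribute (external files)
external = soup.find_all("script", src=True)

# Script tags without a src attribute hold their code inline
inline = [tag for tag in soup.find_all("script") if not tag.has_attr("src")]

print(len(external), len(inline))
print(external[0]["src"])
```

The same filters can be used to remove only one kind of script, for example only the tags that load external files.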

Removing all script tags with Beautiful Soup is easy. You collect the list of all script tags from the parsed tree and extract them one by one.

Example

html = '''
<html>
   <head>
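      <script src="javascript.js"></script>
   </head>
   <body>
      <p>Hello World</p>
      <script>
      alert("Hello World")
      </script>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every script tag; extract() detaches each one from the tree
for tag in soup.find_all('script'):
   tag.extract()

print(soup)

Output

<html>
   <head>

   </head>
   <body>
      <p>Hello World</p>

   </body>
</html>

Only the script tags are removed. The whitespace around them stays in the tree, which is why blank lines remain where the scripts were.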

You can also use the decompose() method instead of extract(). The difference is that extract() returns the tag that was removed, whereas decompose() simply destroys it and returns nothing. For more concise code, you may also use list comprehension syntax to remove the script tags from the soup object, as follows −

[tag.decompose() for tag in soup.find_all('script')]
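
The difference between the two methods can be seen in a minimal sketch such as the one below. It assumes a small one-line document made up for this example; the names removed and result are illustrative only.

```python
from bs4 import BeautifulSoup

# Small sample document (made up for illustration)
html = '<body><p>Hello World</p><script>alert("Hello World")</script></body>'

# extract() detaches the tag from the tree and returns it
soup = BeautifulSoup(html, "html.parser")
removed = soup.script.extract()
print(removed.string)   # the detached tag still holds its content
print(soup)

# decompose() removes the tag and destroys it; the call returns None
soup = BeautifulSoup(html, "html.parser")
result = soup.script.decompose()
print(result)
print(soup)
```

Use extract() when the removed tag is needed later, for example to inspect or log the script, and decompose() when it can simply be discarded. Note that the list comprehension form builds a list of None values that is thrown away, so a plain for loop does the same job without that extra list.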