Beautiful Soup - Remove all Scripts

One of the most frequently used tags in HTML is the <script> tag. It lets you embed a client-side script, such as JavaScript code, in HTML. In this chapter, we will use Beautiful Soup to remove script tags from an HTML document.

The <script> tag has a corresponding </script> tag. A script element either refers to an external JavaScript file through its src attribute, or contains JavaScript code inline, between the two tags, in the HTML document itself.

To include an external JavaScript file, the syntax used is −

   <script src="javascript.js"></script>

You can then invoke the functions defined in this file from inside HTML.

Instead of referring to an external file, you can put JavaScript code inside the HTML, between the <script> and </script> tags. If it is put inside the <head> section of the HTML document, then the functionality is available throughout the document tree. On the other hand, if it is put anywhere in the <body> section, the JavaScript functions are available from that point on.

   <p>Hello World</p>
   <script>
      alert("Hello World")
   </script>
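
From Beautiful Soup's side, the two forms differ only in whether the tag carries a src attribute. The following is a minimal sketch of telling them apart; the markup and the variable names external and inline are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one external script and one inline script
html = '''
<script src="javascript.js"></script>
<p>Hello World</p>
<script>alert("Hello World")</script>
'''

soup = BeautifulSoup(html, "html.parser")

external = []   # values of the src attribute
inline = []     # JavaScript source found between the tags

for tag in soup.find_all('script'):
    if tag.get('src'):
        external.append(tag['src'])
    else:
        inline.append(tag.string)

print(external)
print(inline)
```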

Removing all script tags with Beautiful Soup is easy. You collect the list of all script tags from the parsed tree with find_all() and extract them one by one.


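from bs4 import BeautifulSoup

html = '''
   <script src="javascript.js"></script>
   <p>Hello World</p>
   <script>
      alert("Hello World")
   </script>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect every <script> tag and remove each one from the parsed tree
for tag in soup.find_all('script'):
   tag.extract()

print(soup)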




You can also use the decompose() method instead of extract(). The difference is that extract() returns the tag that was removed, whereas decompose() simply destroys it. For more concise code, you may also use a list comprehension to remove the script tags from the soup object, as follows −

[tag.decompose() for tag in soup.find_all('script')]
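
The difference between the two methods can be seen in a small sketch. It assumes markup similar to the example above; the variable names removed and result are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<script src="javascript.js"></script><p>Hello World</p><script>alert("Hello World")</script>'

soup = BeautifulSoup(html, "html.parser")

# extract() detaches the first script tag and hands it back to the caller
removed = soup.find('script').extract()
print(removed.get('src'))

# decompose() destroys the remaining script tag and returns None
result = soup.find('script').decompose()
print(result)

# Only the paragraph is left in the tree
print(soup)
```

Use extract() when you still need the removed tag, for example to log its src attribute, and decompose() when the tag can simply be discarded.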