Beautiful Soup - Remove all Styles



This chapter explains how to remove all styles from a HTML document. Cascaded style sheets (CSS) are used to control the appearance of different aspects of a HTML document. It includes styling the rendering of text with a specific font, color, alignment, spacing etc. CSS is applied to HTML tags in different ways.

One is to define different styles in a CSS file and include in the HTML script with the <link> tag in the <head> section in the document. For example,

Example

<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
   . . .
   . . .
   </body>
</html>

The different tags in the body part of the HTML script will use the definitions in mystyle.css file

Another approach is to define the style configuration inside the <head> part of the HTML document itself. Tags in the body part will be rendered by using the definitions provided internally.

Example of internal styling −

<html>
<head>
   <style>
      p {
         text-align: center;
         color: red;
      } 
   </style>
</head>
   <body>
      <p>para1.</p>
      <p id="para1">para2</p>
      <p>para3</p>
   </body>
</html>

In either cases, to remove the styles programmatically, simple remove the head tag from the soup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.head.extract()

Third approach is to define the styles inline by including style attribute in the tag itself. The style attribute may contain one or more style attribute definitions such as color, size etc. For example

<body>
   <h1 style="color:blue;text-align:center;">This is a heading</h1>
   <p style="color:red;">This is a paragraph.</p>
</body>

To remove such inline styles from a HTML document, you need to check if attrs dictionary of a tag object has style key defined in it, and if yes delete the same.

tags=soup.find_all()
for tag in tags:
   if tag.has_attr('style'):
      del tag.attrs['style']
print (soup)

The following code removes the inline styles as well as removes the head tag itself, so that the resultant HTML tree will not have any styles left.

html = '''
<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
      <h1 style="color:blue;text-align:center;">This is a heading</h1>
      <p style="color:red;">This is a paragraph.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.head.extract()

tags=soup.find_all()
for tag in tags:
   if tag.has_attr('style'):
      del tag.attrs['style']
print (soup.prettify())

Output

<html>
 <body>
  <h1>
   This is a heading
  </h1>
  <p>
   This is a paragraph.
  </p>
 </body>
</html>
Advertisements