Beautiful Soup - Get all HTML Tags



Tags in HTML are like keywords in a traditional programming language such as Python or Java. Each tag has a predefined behaviour that determines how the browser renders its content. With Beautiful Soup, it is possible to collect all the tags in a given HTML document.

The simplest way to obtain a list of tags is to parse the web page into a soup object and call the find_all() method without any argument. It returns a ResultSet, a list-like object that contains every Tag object in the document.

Let us extract the list of all tags in Google's homepage.

Example

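from bs4 import BeautifulSoup
import requests

# Download the page; its markup (and hence the output) may change over time
url = "https://www.google.com/"
req = requests.get(url)

# Parse the response body with Python's built-in HTML parser
soup = BeautifulSoup(req.content, "html.parser")

# find_all() with no arguments matches every tag in the document
tags = soup.find_all()
print([tag.name for tag in tags])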

Output

['html', 'head', 'meta', 'meta', 'title', 'script', 'style', 'style', 'script', 'body', 'script', 'div', 'div', 'nobr', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'u', 'div', 'nobr', 'span', 'span', 'span', 'a', 'a', 'a', 'div', 'div', 'center', 'br', 'div', 'img', 'br', 'br', 'form', 'table', 'tr', 'td', 'td', 'input', 'input', 'input', 'input', 'input', 'div', 'input', 'br', 'span', 'span', 'input', 'span', 'span', 'input', 'script', 'input', 'td', 'a', 'input', 'script', 'div', 'div', 'br', 'div', 'style', 'div', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'span', 'div', 'div', 'a', 'a', 'a', 'a', 'p', 'a', 'a', 'script', 'script', 'script']

Naturally, a given tag may appear in this list more than once. To obtain only the unique tag names (avoiding duplicates), build a set of the names instead of a list.

Change the print statement in the above code to the following:

Example

print({tag.name for tag in tags})

Output

{'body', 'head', 'p', 'a', 'meta', 'tr', 'nobr', 'script', 'br', 'img', 'b', 'form', 'center', 'span', 'div', 'input', 'u', 'title', 'style', 'td', 'table', 'html'}
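
If the number of occurrences of each tag matters as well, one option is to pass the tag names to collections.Counter from the standard library. The sketch below is only an illustration: it uses a small inline HTML string (hypothetical sample markup, not Google's homepage) so that it runs without network access, and the names sample_html, sample_soup and tag_counts are made up for this example.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup, used here only for illustration
sample_html = """
<html><head><title>Demo</title></head>
<body>
<div><a href="/one">One</a> <a href="/two">Two</a></div>
<div><p>Some text</p></div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Count how many times each tag name occurs in the document
tag_counts = Counter(tag.name for tag in sample_soup.find_all())

# most_common() lists the tag names from most to least frequent
for name, count in tag_counts.most_common():
    print(name, count)
```

The keys of tag_counts are the same unique names that the set comprehension produces, so the Counter can stand in for the set when the frequencies are also of interest.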

To list only the tags that have text associated with them, check the string property of each tag and print it if it is not None.

tags = soup.find_all()
for tag in tags:
    if tag.string is not None:
        print(tag.name, tag.string)
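
Note that the string property is None not only for empty tags but also for any tag that has more than one child, so such a loop skips tags whose text is mixed with nested tags. The following minimal sketch illustrates this on a small inline HTML string; the markup and the names sample_html, sample_soup and with_text are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here only for illustration
sample_html = (
    '<div><p>First <b>bold</b> word</p>'
    '<p>Second</p><img src="logo.png"></div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the tags whose string property is set
with_text = [
    (tag.name, str(tag.string))
    for tag in sample_soup.find_all()
    if tag.string is not None
]
print(with_text)

# The first <p> has three children (text, <b>, text), so its string is None;
# get_text() joins the text of all its descendants instead
first_p_text = sample_soup.p.get_text()
print(first_p_text)
```

Only the <b> tag and the second <p> tag pass the check here. When the full text of a tag with nested children is needed, get_text() is the usual alternative.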

There may be some singleton (void) tags that have no text but carry one or more attributes, as in the <img> tag. Such tags can be listed with a loop that keeps the tags whose contents list is empty and whose attrs dictionary is not.
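
A loop that lists such tags can be sketched as follows. This is one possible approach rather than the only one; the sample markup and the names sample_html, sample_soup and attribute_only are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here only for illustration
sample_html = '''
<p>Logo: <img src="logo.png" alt="Logo"></p>
<br>
<input type="text" name="q">
<p>Plain paragraph</p>
'''
sample_soup = BeautifulSoup(sample_html, "html.parser")

attribute_only = []
for tag in sample_soup.find_all():
    # Keep tags with no children at all (hence no text)
    # but with at least one attribute
    if not tag.contents and tag.attrs:
        attribute_only.append((tag.name, tag.attrs))
        print(tag.name, tag.attrs)
```

The <br> tag is skipped because it has no attributes, and both <p> tags are skipped because they contain text.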

In the following code, the HTML string is not a complete HTML document, in the sense that the <html> and <body> tags are not given. However, the html5lib and lxml parsers add these tags on their own while building the document tree (html5lib also adds a <head> tag). Hence, when we extract the tag list, the additional tags also appear.

Example

html = '''
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
<p>This is another paragraph</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")

tags = soup.find_all()
print({tag.name for tag in tags})

Output

{'head', 'html', 'p', 'h1', 'body'}
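
For comparison, Python's built-in html.parser does not add the missing tags. The sketch below parses the same fragment with it; the names fragment, fragment_soup and fragment_tags are chosen only for this example.

```python
from bs4 import BeautifulSoup

fragment = '''
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
<p>This is another paragraph</p>
'''

# html.parser keeps the fragment as it is and does not wrap it
# in <html>, <head> or <body>
fragment_soup = BeautifulSoup(fragment, "html.parser")

fragment_tags = {tag.name for tag in fragment_soup.find_all()}
print(fragment_tags)
```

Here the set contains only h1 and p, so the choice of parser affects which tags are reported for an incomplete document.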