Beautiful Soup - Find all Headings



In this chapter, we shall explore how to find all heading elements in a HTML document with BeautifulSoup. HTML defines six heading styles from H1 to H6, each with decreasing font size. Suitable tags are used for different page sections, such as main heading, heading for section, topic etc. Let us use the find_all() method in two different ways to extract all the heading elements in a HTML document.

We shall use the following HTML script (saved as index.html) in the code examples in this chapter −

<html>
   <head>
      <title>BeautifulSoup - Scraping Headings</title>
   </head>
   <body>
      <h2>Scraping Headings</h2>
      <b>The quick, brown fox jumps over a lazy dog.</b>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
      <ul>
         <li>Junk MTV quiz graced by fox whelps.</li>
         <li>Bawds jog, flick quartz, vex nymphs.</li>
      </ul>
   </body>
</html>

Example 1

In this approach, we collect all the tags in the parsed tree, and check if the name of each tag is found in a list of all heading tags.

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

headings = ['h1','h2','h3', 'h4', 'h5', 'h6']
tags = soup.find_all()
heads = [(tag.name, tag.contents[0]) for tag in tags if tag.name in headings]
print (heads)

Here, headings is a list of all heading styles h1 to h6. If the name of a tag is any of these, the tag and its contents are collected in a lists named heads.

Output

[('h2', 'Scraping Headings'), ('h3', 'Paragraph Heading'), ('h3', 'List heading')]

Example 2

You can pass a regex expression to the find_all() method. Take a look at the following regex.

re.compile('^h[1-6]$')

This regex finds all tags that start with h, have a digit after the h, and then end after the digit. Let use this as an argument to find_all() method in the code below −

from bs4 import BeautifulSoup
import re

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

tags = soup.find_all(re.compile('^h[1-6]$'))
print (tags)

Output

[<h2>Scraping Headings</h2>, <h3>Paragraph Heading</h3>, <h3>List heading</h3>]
Advertisements