Beautiful Soup - Scraping Paragraphs from HTML



One of the most frequently used tags in an HTML document is the <p> tag, which marks a paragraph of text. With Beautiful Soup, you can easily extract paragraphs from the parsed document tree. In this chapter, we shall discuss the following ways of scraping paragraphs with the Beautiful Soup library.

  • Scraping HTML paragraphs with the <p> tag

  • Scraping HTML paragraphs with the find_all() method

  • Scraping HTML paragraphs with the select() method

We shall use the following HTML document, saved as index.html, for these exercises −

<html>
   <head>
      <title>BeautifulSoup - Scraping Paragraph</title>
   </head>
   <body>
      <p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
      <h2>Hello</h2>
      <p>DJs flock by when MTV ax quiz prog.</p>
      
      <p>Junk MTV quiz graced by fox whelps.</p>
      
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>

Scraping by <p> tag

The easiest way to search a parse tree is to look up a tag by its name. The expression soup.p refers to the first <p> tag in the parsed document.

para = soup.p

To fetch the subsequent <p> tags, call find_next('p') in a loop until it returns None, which means that no further <p> tag follows. The following program displays the prettified output of all the paragraph tags.

Example

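from bs4 import BeautifulSoup

# Parse the sample document; the with block closes the file afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# soup.p is the first <p> tag in the document
para = soup.p
print(para.prettify())

# Follow find_next('p') until no further <p> tag is found
while True:
   p = para.find_next('p')
   if p is None:
      break
   print(p.prettify())
   para = p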

Output

<p id="para1">
 The quick, brown fox jumps over a lazy dog.
</p>

<p>
 DJs flock by when MTV ax quiz prog.
</p>

<p>
 Junk MTV quiz graced by fox whelps.
</p>

<p>
 Bawds jog, flick quartz, vex nymphs.
</p>
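
As an alternative to the explicit loop, the paragraphs that follow the first one can be collected in a single call. The sketch below assumes the sample document is supplied as an inline string (the html_doc variable is only illustrative and stands in for index.html) and uses find_all_next(), which returns every matching tag that comes after the given element:

```python
from bs4 import BeautifulSoup

# Assumption: the sample document as an inline string instead of index.html
html_doc = """
<html>
   <body>
      <p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
      <h2>Hello</h2>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.p                     # the first <p> tag
rest = first.find_all_next('p')    # every <p> tag after the first one

# Collect the text of the first paragraph followed by the remaining ones
texts = [p.get_text() for p in [first] + list(rest)]

for text in texts:
   print(text)
```

Here get_text() is used instead of prettify(), so only the paragraph text is printed, without the surrounding tags.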

Using find_all() method

The find_all() method is more comprehensive. You can pass various types of filters, such as a tag name, attributes or a string, to this method. In this case, we want to fetch every <p> tag.

In the following code, the find_all() method returns a list of all the <p> elements in the document.

Example

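from bs4 import BeautifulSoup

# Parse the sample document; the with block closes the file afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# find_all('p') returns every <p> tag in the document
paras = soup.find_all('p')
for para in paras:
   print(para.prettify())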

Output

<p id="para1">
 The quick, brown fox jumps over a lazy dog.
</p>

<p>
 DJs flock by when MTV ax quiz prog.
</p>

<p>
 Junk MTV quiz graced by fox whelps.
</p>

<p>
 Bawds jog, flick quartz, vex nymphs.
</p>

We can use another approach to find all <p> tags. To begin with, obtain a list of all the tags by calling find_all() without arguments, then keep only those whose Tag.name equals 'p'.

Example

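from bs4 import BeautifulSoup

# Parse the sample document; the with block closes the file afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# find_all() without arguments returns every tag in the document
tags = soup.find_all()

# Keep the contents (list of children) of each tag named 'p'
paras = [tag.contents for tag in tags if tag.name == 'p']
print(paras)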

The find_all() method also has an attrs parameter. It is useful when you want to extract <p> tags with specific attributes. For example, in the given document, the first <p> element has id='para1'. To fetch it, modify the find_all() call as −

paras = soup.find_all('p', attrs={'id':'para1'})
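
A minimal sketch of the attribute filter, run on an inline copy of the first two paragraphs (html_doc is an illustrative variable, not part of the original example). It also shows the keyword form id='para1', which Beautiful Soup treats as an attribute filter in the same way as the attrs dictionary:

```python
from bs4 import BeautifulSoup

# Assumption: two paragraphs of the sample document as an inline string
html_doc = """
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Filter with the attrs dictionary
by_attrs = soup.find_all('p', attrs={'id': 'para1'})

# Equivalent filter written as a keyword argument
by_keyword = soup.find_all('p', id='para1')

for para in by_attrs:
   print(para.get_text())
```

Both calls return a list, even when only one tag matches, so the result is still iterated or indexed.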

Using select() method

The select() method is primarily used to find elements with a CSS selector. A bare tag name is itself a valid CSS selector, so you can pass 'p' to select() to get all the <p> tags. The select_one() method is also available; it returns only the first element that matches the selector.

Example

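from bs4 import BeautifulSoup

# Parse the sample document; the with block closes the file afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# 'p' is a CSS type selector that matches every <p> tag
paras = soup.select('p')
print(paras)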

Output

[
<p id="para1">The quick, brown fox jumps over a lazy dog.</p>, 
<p>DJs flock by when MTV ax quiz prog.</p>, 
<p>Junk MTV quiz graced by fox whelps.</p>, 
<p>Bawds jog, flick quartz, vex nymphs.</p>
]
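
The select_one() method can be tried in the same way. The following sketch assumes an inline copy of part of the sample document (html_doc is an illustrative variable) and shows that select_one() returns a single tag rather than a list, or None when nothing matches:

```python
from bs4 import BeautifulSoup

# Assumption: part of the sample document as an inline string
html_doc = """
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<h2>Hello</h2>
<p>DJs flock by when MTV ax quiz prog.</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# First element matching the selector, returned as a single Tag
first = soup.select_one('p')
print(first.get_text())

# No <table> tag exists in this snippet, so the result is None
missing = soup.select_one('table')
print(missing is None)
```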

To pick out the <p> tags with a certain id, use a for loop as follows −

Example

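from bs4 import BeautifulSoup

# Parse the sample document; the with block closes the file afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tags = soup.select('p')
for tag in tags:
   # tag.get('id') returns None when the tag has no id attribute
   if tag.get('id') == 'para1':
      print(tag.contents)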

Output

['The quick, brown fox jumps over a lazy dog.']
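
Since select() accepts CSS selectors, the id test can also be written in the selector itself instead of a loop. This is a sketch under the assumption that the document is given as an inline string (html_doc is an illustrative variable); it uses the standard CSS id selector p#para1 and the attribute selector p[id="para1"]:

```python
from bs4 import BeautifulSoup

# Assumption: part of the sample document as an inline string
html_doc = """
<p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# CSS id selector: <p> tags whose id is 'para1'
matches = soup.select('p#para1')

# The same condition written as a CSS attribute selector
same = soup.select('p[id="para1"]')

for tag in matches:
   print(tag.contents)
```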