Beautiful Soup - Scraping List from HTML



Web pages often present important data in the form of ordered or unordered lists. With Beautiful Soup, we can easily extract the HTML list elements and bring the data into Python objects, which can then be stored in databases for further analysis. In this chapter, we shall use the find_all() and select() methods to scrape list data from an HTML document.

The easiest way to search a parse tree is to look up a tag by its name. soup.<tag> fetches the first tag with the given name, along with its contents.

HTML provides the <ol> and <ul> tags to compose ordered and unordered lists. As with any other tag, we can fetch the contents of these tags.
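As a minimal sketch of how attribute-style access behaves, the snippet below parses a small, made-up HTML string (the ids first and second are illustrative and not part of this chapter's document). It suggests that soup.ul returns only the first <ul>, and that asking for a tag that does not exist gives None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to illustrate soup.<tag> access
html = """
<ul id="first"><li>Accounts</li><li>HR</li></ul>
<ul id="second"><li>Sales</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_ul = soup.ul        # equivalent to soup.find("ul"): first match only
first_id = first_ul["id"]
missing = soup.ol         # there is no <ol> in this snippet

print(first_id)
print(missing)
```

Because a missing tag yields None, it is safer to check the result before reading attributes or children from it.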

We shall use the following HTML document −

<html>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ol id="HR">
         <li>Rani</li>
         <li>Ankita</li>
         </ol>
      </ul>
   </body>
</html>

Scraping lists by Tag

In the above HTML document, we have a top-level <ul> list, inside which there's another <ul> tag and an <ol> tag. We first parse the document into a soup object and then retrieve the first <ul> as a Tag object through soup.ul.

Example

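from bs4 import BeautifulSoup

# Parse the HTML file; the with block closes it afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# soup.ul is the first <ul> tag in the document
lst = soup.ul
print(lst)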

Output

<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>

To get the inner ordered list instead, change the value of lst so that it points to the first <ol> element.

lst = soup.ol
print(lst)

Output

<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
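The Tag objects above still hold markup. One possible way to bring the list data into plain Python objects is sketched below. It embeds the chapter's list as a string instead of reading index.html, and the names hr_names and departments are chosen here for illustration. The department mapping assumes the layout of this particular document under html.parser, where each department <li> is immediately followed by its nested list.

```python
from bs4 import BeautifulSoup

# The chapter's list, embedded as a string so the sketch is self-contained
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Text of every <li> inside the first <ol>
hr_names = [li.get_text(strip=True) for li in soup.ol.find_all("li")]

# Map each top-level item to the items of the list that follows it
departments = {}
for li in soup.ul.find_all("li", recursive=False):
    sublist = li.find_next_sibling(["ul", "ol"])
    if sublist is not None:
        departments[li.get_text(strip=True)] = [
            item.get_text(strip=True) for item in sublist.find_all("li")
        ]

print(hr_names)
print(departments)
```

Passing recursive=False limits the search to direct children of the outer <ul>, so the employee names are not mistaken for department names.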

Using select() method

The select() method is primarily used to obtain data using a CSS selector. A bare tag name is itself a valid CSS selector, so we can pass ol to the select() method to get a list of all the <ol> elements. The select_one() method is also available. It returns only the first element that matches the selector.

Example

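from bs4 import BeautifulSoup

# Parse the HTML file; the with block closes it afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# select() always returns a list of matching tags
lst = soup.select("ol")
print(lst)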

Output

[<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>]
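The following sketch illustrates how select_one() and a more specific CSS selector might be used on the same list. The HTML is embedded as a string so that the code runs on its own, and the variable names are illustrative. The selector ul#acc > li is one reasonable choice for picking the items of the nested list by its id.

```python
from bs4 import BeautifulSoup

# The chapter's list, embedded as a string so the sketch is self-contained
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns a single Tag (or None), not a list
first_ol = soup.select_one("ol")
first_ol_id = first_ol["id"]

# Direct <li> children of the <ul> whose id is "acc"
acc_names = [li.get_text() for li in soup.select("ul#acc > li")]

print(first_ol_id)
print(acc_names)
```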

Using find_all() method

The find() and find_all() methods are more comprehensive. You can pass various types of filters, such as a tag name, attributes or a string, to these methods. In this case, we want to fetch the contents of a list tag.

In the following code, the find_all() method returns a list of all the <ul> elements in the document.

Example

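from bs4 import BeautifulSoup

# Parse the HTML file; the with block closes it afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# find_all() returns every <ul> tag, outer and nested
lst = soup.find_all("ul")
print(lst)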
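To make the result of this call easier to inspect, the sketch below collects only the id of each matched element instead of printing the full markup. The HTML is embedded as a string and the variable names are illustrative. It also shows that find_all() accepts a list of tag names, which is one way to fetch both kinds of lists at once.

```python
from bs4 import BeautifulSoup

# The chapter's list, embedded as a string so the sketch is self-contained
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Both the outer and the nested <ul> are matched, in document order
ul_ids = [ul["id"] for ul in soup.find_all("ul")]

# A list of names matches any of the given tags
list_ids = [tag["id"] for tag in soup.find_all(["ul", "ol"])]

print(ul_ids)
print(list_ids)
```

Since the outer <ul> contains the nested one, printing the full result of find_all("ul") shows the inner list twice: once inside the outer element and once as its own entry.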

We can refine the search filter by including the attrs argument. In our HTML document, the <ul> and <ol> tags have their respective id attributes. So, let us fetch the contents of the <ul> element having id="acc".

Example

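from bs4 import BeautifulSoup

# Parse the HTML file; the with block closes it afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# The second positional argument is the attrs dictionary
lst = soup.find_all("ul", {"id": "acc"})
print(lst)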

Output

[<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>]
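The attribute filter can be written in more than one way. The sketch below, which embeds the HTML as a string and uses illustrative variable names, compares the attrs dictionary with the id keyword argument, and uses find() to get the single element directly instead of a one-item list.

```python
from bs4 import BeautifulSoup

# The chapter's list, embedded as a string so the sketch is self-contained
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("ul", attrs={"id": "acc"})   # explicit attrs dictionary
by_keyword = soup.find_all("ul", id="acc")            # same filter as a keyword
single = soup.find("ul", id="acc")                    # first match only, as a Tag
no_match = soup.find("ul", id="sales")                # None when nothing matches

print(by_attrs == by_keyword)
print([li.get_text() for li in single.find_all("li")])
print(no_match)
```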

Here's another example. We collect all the <li> elements whose inner text starts with 'A'. The find_all() method takes a keyword argument string, which can be a function. The function receives the text of each <li> tag, and the tag is included in the result only if the startingwith() function returns True.

Example

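from bs4 import BeautifulSoup

# Filter function: receives the text of a tag
def startingwith(ch):
   return ch.startswith('A')

# Parse the HTML file; the with block closes it afterwards
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

# Keep only the <li> tags whose text passes the filter
lst = soup.find_all('li', string=startingwith)
print(lst)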

Output

[<li>Accounts</li>, <li>Anand</li>, <li>Ankita</li>]
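The string argument is not limited to functions. As a rough sketch, the same filter can be expressed with a compiled regular expression, and a function filter can be written defensively in case a tag has no single text string. The HTML is embedded as a string, and the helper name starts_with_a is hypothetical. The None check is a precaution assumed here, not something this document's markup requires.

```python
import re
from bs4 import BeautifulSoup

# The chapter's list, embedded as a string so the sketch is self-contained
html = """<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# A regular expression anchored at the start of the text
regex_items = soup.find_all("li", string=re.compile(r"^A"))
regex_names = [li.get_text() for li in regex_items]

# Hypothetical helper: tolerates tags whose .string is None
def starts_with_a(text):
    return text is not None and text.startswith("A")

guarded_names = [li.get_text() for li in soup.find_all("li", string=starts_with_a)]

print(regex_names)
print(guarded_names)
```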