Beautiful Soup - Navigating by Tags



Tags are among the most important elements of any HTML document. A tag may contain other tags and strings, which are called the tag's children. Beautiful Soup provides several ways to navigate and iterate over a tag's children.

The easiest way to navigate a parse tree is to refer to a tag by its name.

soup.head

The soup.head attribute returns the first <head> element of an HTML page, including everything placed between <head> and </head>.

Consider the following HTML page (saved as index.html) to be scraped:
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

The following code extracts the <head> element:

Example

from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print(soup.head)

Output

<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
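
The value returned by soup.head is a Tag object rather than plain text, so it can be inspected further. The following is a minimal sketch, assuming an inline HTML string (the html variable is illustrative and stands in for index.html):

```python
from bs4 import BeautifulSoup

# Illustrative inline page, so the sketch does not depend on index.html
html = "<html><head><title>TutorialsPoint</title></head><body><h1>Hi</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

head = soup.head
head_type = type(head).__name__   # class of the returned object
head_name = head.name             # name of the tag that was found
title_text = head.title.string    # text held by the nested <title> tag

print(head_type)
print(head_name)
print(title_text)
```

Because the result is a Tag, the same name-based navigation can be applied to it again, as head.title does here.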

soup.body

Similarly, use soup.body to return the <body> element of the HTML page.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print(soup.body)

Output

<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>

You can also extract a specific tag (such as the first <h1> tag) inside the <body> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.body.h1)

Output

<h1>Tutorialspoint Online Library</h1>
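
Chained access such as soup.body.h1 works only while every tag in the chain exists. When a tag is not found, Beautiful Soup returns None, and continuing the chain on None raises an AttributeError. A small sketch of one way to guard against this, assuming an inline copy of the page (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative inline copy of the <body> part of the page
html = """<html><body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Both tags exist, so the chain resolves to the <h1> tag
heading_text = soup.body.h1.get_text()

# There is no <table> in the page, so this lookup returns None
table = soup.body.table

# Check for None before going one level deeper
first_cell = table.td if table is not None else None

print(heading_text)
print(table)
print(first_cell)
```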

soup.p

Our HTML file contains a <p> tag. We can extract this tag by its name; soup.p returns the first <p> tag in the document.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.p)

Output

<p><b>It's all Free</b></p>
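
Accessing a tag by name returns only the first match. The sketch below uses a hypothetical page with two <p> tags (not the index.html above) to show the difference between soup.p and a search that collects every <p> tag:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two paragraphs, used only for illustration
html = "<body><p>First paragraph</p><p>Second paragraph</p></body>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                  # only the first <p> tag
all_p = soup.find_all("p")        # every <p> tag in the document

first_text = first_p.get_text()
paragraph_count = len(all_p)

print(first_text)
print(paragraph_count)
```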

Tag.contents

A Tag object may contain one or more PageElements (child tags and strings). The Tag object's contents property returns a list of its direct children.

Let us find the elements in the <head> tag of our index.html file.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.head
print(tag.contents)

Output

['\n',
<title>TutorialsPoint</title>,
'\n',
<script>
document.write("Welcome to TutorialsPoint");
</script>,
'\n']
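
As the output shows, the list contains the newline strings between tags as well as the tags themselves. If only the tags are of interest, the strings can be filtered out. A minimal sketch, assuming an inline copy of the <head> section (the html variable and the list names are illustrative):

```python
from bs4 import BeautifulSoup, Tag

# Illustrative inline copy of the <head> section of index.html
html = """<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>"""
soup = BeautifulSoup(html, "html.parser")

contents = soup.head.contents

# Keep only Tag objects, dropping the newline strings
tags_only = [element for element in contents if isinstance(element, Tag)]
tag_names = [element.name for element in tags_only]

print(len(contents))
print(tag_names)
```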

Tag.children

The structure of tags in an HTML document is hierarchical: elements are nested one inside another. For example, the top-level <html> tag includes the <head> and <body> tags, each of which may contain other tags.

The Tag object has a children property that returns an iterator over the enclosed PageElements.

To demonstrate the children property, we shall use the following HTML document (index.html). In the <body> section, a top-level <ul> list contains two department items, and each department item is followed by a nested <ul> list of employees.

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
      <li>Accounts</li>
         <ul>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul>
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python code lists all the children of the top-level <ul> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.ul
print(list(tag.children))

Output

['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

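Since the children property returns an iterator, we can use a for loop to visit each direct child in turn. The example continues with the tag variable from the previous code. The newline strings between tags are also children, so they show up as blank lines in the output.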

Example

for child in tag.children:
   print(child)

Output

<li>Accounts</li>

<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>

<li>HR</li>

<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
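
The children property yields only the direct children of a tag; the employee names inside the nested lists are one level further down. The sketch below contrasts children with the descendants property, which walks every level. It assumes an inline copy of the list markup, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup, Tag

# Illustrative inline copy of the list from index.html
html = """<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
top = soup.ul

# Direct children only, with the newline strings skipped
child_names = [child.name for child in top.children if isinstance(child, Tag)]

# Every <li> at any depth, in document order
all_items = [
    node.get_text()
    for node in top.descendants
    if isinstance(node, Tag) and node.name == "li"
]

print(child_names)
print(all_items)
```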

Tag.find_all()

This method returns a ResultSet (a list-like object) of all the tags that match the given tag name, searching everything nested below the tag it is called on.

Let us consider the following HTML page (index.html) for this:

<html>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
      <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
      <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
      <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
      <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
      <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
   </body>
</html>

The following code lists all the <a> elements:

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

result = soup.find_all("a")
print(result)

Output

[
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
   <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
   <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
]
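
Each item in the result is a Tag, so its text and attributes can be read directly. The following sketch assumes a shortened inline version of the page with two of the links; it also uses the limit argument of find_all() to stop after the first match. The variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the page with two of the links
html = """<body>
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href attribute) pairs from every <a> tag
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

# limit=1 stops the search after the first matching tag
first_only = soup.find_all("a", limit=1)

for text, href in links:
    print(text, href)
print(len(first_only))
```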