Beautiful Soup - Navigating By Tags



Tags are among the most important elements of any HTML document. A tag may contain other tags and strings (the tag's children), and Beautiful Soup provides several ways to navigate and iterate over a tag's children.

The easiest way to navigate a parse tree is to refer to a tag by its name.

soup.head

The soup.head attribute returns the first <head> element of the HTML page, including everything between <head> and </head>.

The following code extracts the <head> element:

Example - Extracting Head

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.head)

Output

<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
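As a small sketch of what this shortcut does: accessing a tag name as an attribute gives the first tag with that name, the same object that find() returns for that name, and a name that does not occur in the document gives None. The shortened markup and the variable names below are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup (not the full page used above).
html = "<html><head><title>TutorialsPoint</title></head><body><p>One</p><p>Two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access and find() return the same first matching tag.
head_by_attr = soup.head
head_by_find = soup.find("head")
same_tag = head_by_attr is head_by_find

# A tag name that is absent from the document yields None instead of an error.
missing = soup.table

print(same_tag)
print(missing)
```

Checking for None before using the result avoids an AttributeError when a page lacks the expected tag.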

soup.body

Similarly, to get the <body> element of the HTML page, use soup.body.

Example - Extracting HTML Body

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.body)

Output

<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>

You can also extract a specific tag, such as the first <h1> tag, inside the <body> tag.

Example - Extracting specific tag

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.body.h1)

Output

<h1>Tutorialspoint Online Library</h1>
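Chained access such as soup.body.h1 searches only inside the tag on the left. The sketch below, using shortened markup and illustrative variable names, shows that a tag found from the top of the document may not be found when the search starts at <body>.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup.
html = (
    "<html><head><title>TutorialsPoint</title></head>"
    "<body><h1>Tutorialspoint Online Library</h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# The first <h1> inside <body>.
heading_text = soup.body.h1.get_text()

# <title> lives in <head>, so it is found from the top of the document ...
title_anywhere = soup.title
# ... but not when the search starts at <body>.
title_in_body = soup.body.title

print(heading_text)
print(title_anywhere)
print(title_in_body)
```

Because a missing tag comes back as None, a longer chain such as soup.body.title.string would raise an AttributeError here, so it is worth checking intermediate results in longer chains.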

soup.p

Our HTML document contains a <p> tag. soup.p returns the first <p> tag in the document, together with its contents.

Example - Extract content of <p> tag

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.p)

Output

<p><b>It's all Free</b></p>
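If only the text is needed rather than the markup, get_text() can be called on the returned tag. The sketch below assumes a document with a second, made-up paragraph to show that soup.p gives only the first <p>, while find_all("p") gives all of them.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the second paragraph is invented for this sketch.
html = "<body><p><b>It's all Free</b></p><p>Second paragraph</p></body>"
soup = BeautifulSoup(html, "html.parser")

# soup.p is the first <p> only; get_text() drops the <b> markup.
first_p = soup.p
first_text = first_p.get_text()

# find_all() collects every <p> in document order.
all_texts = [p.get_text() for p in soup.find_all("p")]

print(first_text)
print(all_texts)
```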

Tag.contents

A Tag object may contain any number of PageElements (child tags and strings). Its contents property returns a list of the tag's direct children, including the whitespace strings that sit between tags.

Let us find the elements in the <head> tag of our HTML document.

Example - Finding elements of <head> tag

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.head
print(tag.contents)

Output

['\n',
<title>TutorialsPoint</title>,
'\n',
<script>
document.write("Welcome to TutorialsPoint");
</script>,
'\n']
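The '\n' entries in such a list are strings, not tags. One possible way to keep only the child tags, sketched here with shortened markup and illustrative names, is to filter the list with isinstance().

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, illustrative markup with a newline between each tag.
html = """<head>
<title>TutorialsPoint</title>
<script>document.write("Welcome to TutorialsPoint");</script>
</head>"""
soup = BeautifulSoup(html, "html.parser")

contents = soup.head.contents

# Keep only the child tags and skip the newline strings between them.
tag_names = [item.name for item in contents if isinstance(item, Tag)]

print(len(contents))
print(tag_names)
```

Since contents is an ordinary list, it can also be indexed directly, for example contents[1] for the <title> tag in this markup.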

Tag.children

The tags in an HTML document form a hierarchy, with elements nested inside one another. For example, the top-level <html> tag contains the <head> and <body> tags, each of which may contain other tags.

The Tag object has a children property that returns an iterator over the tag's direct children (PageElements).

To demonstrate the children property, we shall use the following HTML document. Its <body> section contains a top-level <ul> list, which in turn contains two nested <ul> lists, one under each department name.

The following Python code lists all the children of the top-level <ul> tag.

Example - Getting List of children elements

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
      <li>Accounts</li>
         <ul>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul>
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.ul
print(list(tag.children))

Output

['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']
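Note that the children are direct children only: the nested <li> items appear inside the nested <ul> tags, not as separate entries. The following sketch, with a compact version of the same list and illustrative variable names, contrasts direct children with a search through all nested levels. It uses the recursive=False argument of find_all() to stay on the top level.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact version of the department list used above.
html = """<ul>
<li>Accounts</li>
<ul><li>Anand</li><li>Mahesh</li></ul>
<li>HR</li>
<ul><li>Rani</li><li>Ankita</li></ul>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
top = soup.ul

# Names of the direct child tags only (whitespace strings are skipped).
direct_names = [child.name for child in top.children if isinstance(child, Tag)]

# <li> tags directly under the top-level <ul> versus at any depth.
direct_items = [li.get_text() for li in top.find_all("li", recursive=False)]
all_items = [li.get_text() for li in top.find_all("li")]

print(direct_names)
print(direct_items)
print(all_items)
```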

Since the children property returns an iterator, we can use a for loop to traverse the hierarchy. The blank lines in the output come from the newline strings between the tags.

Example - Traversing a hierarchy

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
      <li>Accounts</li>
         <ul>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul>
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

tag = soup.ul
for child in tag.children:
    print(child)

Output

<li>Accounts</li>

<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>

<li>HR</li>

<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
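A loop like this can skip the whitespace-only strings if they are not wanted. The sketch below uses a simplified, flat list and illustrative names; it also shows that the iterator returned by children is used up after one pass, unlike the contents list.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Simplified, flat list for illustration.
html = "<ul>\n<li>Accounts</li>\n<li>HR</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

children = soup.ul.children
labels = []
for child in children:
    # Skip strings that contain nothing but whitespace.
    if isinstance(child, NavigableString) and not child.strip():
        continue
    labels.append(child.get_text())

# The iterator has been consumed by the loop, so nothing is left.
leftover = list(children)

print(labels)
print(leftover)
```

To loop over the children a second time, access tag.children again or use tag.contents.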

Tag.find_all()

This method returns a ResultSet (a list-like object) containing all the tags that match the given arguments, such as a tag name.

The following code lists all the <a> tags in the document:

Example - List all elements with <a> tag

from bs4 import BeautifulSoup

html = """
<html>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
      <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
      <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
      <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
      <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
      <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">Ruby</a>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

result = soup.find_all("a")
print(result)

Output

[
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
   <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
   <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">Ruby</a>
]
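Each item in the result is a Tag, so its attributes and text can be read individually. The following sketch uses two of the links above and illustrative variable names. It also shows two optional find_all() arguments: an attribute filter (id) and limit.

```python
from bs4 import BeautifulSoup

# Two of the links from the example above.
html = """
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

# Read an attribute with dictionary-style access and the text with get_text().
hrefs = [a["href"] for a in links]
names = [a.get_text() for a in links]

# Narrow the search with an attribute filter, or cap the number of results.
python_links = soup.find_all("a", id="link3")
first_only = soup.find_all("a", limit=1)

print(names)
print(hrefs)
```

When no tag matches, find_all() returns an empty result rather than None, so the result can be looped over without a separate check.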