Beautiful Soup - Navigating By Tags
Tags are among the most important building blocks of an HTML document. A tag may contain other tags and strings, which are its children. Beautiful Soup provides several ways to navigate and iterate over a tag's children.
The easiest way to navigate the parse tree is to refer to a tag by its name.
soup.head
The soup.head attribute returns the first <head> element of the HTML page, including everything placed between <head> and </head>.
The following code extracts the <head> element.
Example - Extracting Head
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.head)
Output
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
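As a brief aside, attribute-style access such as soup.head behaves like soup.find('head'): it gives the first tag with that name, or None if the document has no such tag. The sketch below illustrates this with a shortened HTML string and a tag name (nav) chosen only for this illustration; the variable names are likewise assumptions.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical HTML used only for this illustration
html = "<html><head><title>TutorialsPoint</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Attribute access and find() return the same first matching Tag object
same_tag = soup.head is soup.find('head')
print(same_tag)

# A tag name that does not occur in the document yields None
missing = soup.nav
print(missing)
```

Because a missing tag yields None, chained access such as soup.nav.a raises an AttributeError, so it is worth checking for None before going deeper.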
soup.body
Similarly, use soup.body to return the <body> element of the HTML page.
Example - Extracting html Body
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
print (soup.body)
Output
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
You can also extract a specific tag, such as the first <h1> tag inside the <body> tag.
Example - Extracting specific tag
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.body.h1)
Output
<h1>Tutorialspoint Online Library</h1>
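Tag names can be chained to any depth, and the resulting Tag exposes its name and, when it has a single string child, that string. The following is a minimal sketch on a compact version of the same document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Compact version of the document used above
html = ("<html><body><h1>Tutorialspoint Online Library</h1>"
        "<p><b>It's all Free</b></p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

heading = soup.body.h1
heading_name = heading.name          # name of the tag
heading_text = heading.string        # the tag's only string child
bold_text = soup.body.p.b.string     # chaining through <p> to <b>

print(heading_name)
print(heading_text)
print(bold_text)
```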
soup.p
Our HTML document contains a <p> tag. soup.p returns the first <p> tag in the document, together with its contents.
Example - Extract content of <p> tag
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.p)
Output
<p><b>It's all Free</b></p>
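Note that access by tag name gives only the first match. When a document has several <p> tags, find_all() collects all of them. The sketch below uses a made-up two-paragraph fragment to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two paragraphs
html = "<body><p>First</p><p>Second</p></body>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.p                       # only the first <p>
all_paragraphs = soup.find_all('p')  # every <p> in document order
texts = [p.get_text() for p in all_paragraphs]

print(first)
print(texts)
```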
Tag.contents
A Tag object may contain one or more PageElement objects (tags and strings). The contents property of a Tag returns a list of its direct children.
Let us find the elements in the <head> tag of our HTML document.
Example - Finding elements of <head> tag
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
print (tag.contents)
Output
['\n',
<title>TutorialsPoint</title>,
'\n',
<script>
document.write("Welcome to TutorialsPoint");
</script>,
'\n']
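The '\n' entries in the list above are the whitespace strings between the tags. If only the tags are of interest, one possible approach is to filter the list by type, as in this sketch (the variable names are illustrative):

```python
from bs4 import BeautifulSoup, Tag

html = """<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>"""
soup = BeautifulSoup(html, 'html.parser')

# contents holds both Tag objects and the strings between them
all_nodes = soup.head.contents

# Keep only the Tag objects, dropping the whitespace strings
tags_only = [node for node in all_nodes if isinstance(node, Tag)]
tag_names = [tag.name for tag in tags_only]

print(len(all_nodes))
print(tag_names)
```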
Tag.children
The tags in an HTML document are arranged hierarchically, with elements nested one inside another. For example, the top-level <html> tag contains the <head> and <body> tags, and each of them may contain other tags.
The Tag object has a children property that returns an iterator over the tag's direct children (PageElement objects).
To demonstrate the children property, we shall use the following HTML document. In the <body> section, a top-level <ul> list contains two nested <ul> lists, one after each department item.
The following Python code gives a list of all the child elements of the top-level <ul> tag.
Example - Getting List of children elements
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
tag = soup.ul
print (list(tag.children))
Output
['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']
Since the children property returns an iterator, we can use a for loop to traverse its items.
Example - Traversing a hierarchy
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
tag = soup.ul
for child in tag.children:
    print(child)
Output
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
The whitespace strings between the tags are also printed, as blank lines; they are omitted here.
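The children property covers only the direct children of a tag. To reach the nested list items as well, the descendants property walks the whole subtree. The sketch below compares the two on a compact, whitespace-free version of the list; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Compact fragment without whitespace between tags
html = "<ul><li>Accounts</li><ul><li>Anand</li><li>Mahesh</li></ul></ul>"
soup = BeautifulSoup(html, 'html.parser')
top = soup.ul

# Direct children only: the first <li> and the nested <ul>
child_names = [child.name for child in top.children]

# All nested tags at any depth (strings are skipped by the type check)
descendant_tags = [node.name for node in top.descendants
                   if isinstance(node, Tag)]

print(child_names)
print(descendant_tags)
```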
Tag.find_all()
This method returns a ResultSet, a list-like object holding every tag that matches the given arguments, such as a tag name.
The following code lists all the <a> tags in the document.
Example - List all elements with <a> tag
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all("a")
print (result)
Output
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>]
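In practice, the tags returned by find_all() are usually processed further, for example to pull out the link text and the href attribute. The following sketch does this on a shortened two-link fragment and also shows the limit argument, which caps the number of results; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened fragment with two of the links from the example above
html = """
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# Pair each link's text with its href attribute
links = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]

# limit stops the search after the given number of matches
first_only = soup.find_all('a', limit=1)

for text, href in links:
    print(text, href)
print(len(first_only))
```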