Beautiful Soup - Quick Guide

Beautiful Soup - Overview

In today's world, we have tons of unstructured data/information (mostly web data) available freely. Sometimes the freely available data is easy to read and sometimes not. No matter how your data is available, web scraping is very useful tool to transform unstructured data into structured data that is easier to read and analyze. In other words, web scraping is a way to collect, organize and analyze this enormous amount of data. So let us first understand what is web-scraping.

Introduction to Beautiful Soup

The Beautiful Soup is a python library which is named after a Lewis Carroll poem of the same name in "Alice's Adventures in the Wonderland". Beautiful Soup is a python package and as the name suggests, parses the unwanted data and helps to organize and format the messy web data by fixing bad HTML and present to us in an easily-traversable XML structures.

In short, Beautiful Soup is a python package which allows us to pull data out of HTML and XML documents.

HTML tree Structure

Before we look into the functionality provided by Beautiful Soup, let us first understand the HTML tree structure.

The root element in the document tree is the html, which can have parents, children and siblings and this determines by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among nodes in your tree structure.

Let us suppose the webpage is as shown below −

Which translates to an html document as follows −

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

Which simply means, for above html document, we have a html tree structure as follows −

Beautiful Soup - web-scraping

Scraping is simply a process of extracting (from various means), copying and screening of data.

When we scrape or extract data or feeds from the web (like from web-pages or websites), it is termed as web-scraping.

So, web scraping (which is also known as web data extraction or web harvesting) is the extraction of data from web. In short, web scraping provides a way to the developers to collect and analyze data from the internet.

Why Web-scraping?

Web-scraping provides one of the great tools to automate most of the things a human does while browsing. Web-scraping is used in an enterprise in a variety of ways −

Data for Research

Smart analyst (like researcher or journalist) uses web scrapper instead of manually collecting and cleaning data from the websites.

Products, prices & popularity comparison

Currently there are couple of services which use web scrappers to collect data from numerous online sites and use it to compare products popularity and prices.

SEO Monitoring

There are numerous SEO tools such as Ahrefs, Seobility, SEMrush, etc., which are used for competitive analysis and for pulling data from your client's websites.

Search engines

There are some big IT companies whose business solely depends on web scraping.

Sales and Marketing

The data gathered through web scraping can be used by marketers to analyze different niches and competitors or by the sales specialist for selling content marketing or social media promotion services.

Why Python for Web Scraping?

Python is one of the most popular languages for web scraping as it can handle most of the web crawling related tasks very easily.

Below are some of the points on why to choose python for web scraping −

Ease of Use

As most of the developers agree that python is very easy to code. We don't have to use any curly braces "{ }" or semi-colons ";" anywhere, which makes it more readable and easy-to-use while developing web scrapers.

Huge Library Support

Python provides huge set of libraries for different requirements, so it is appropriate for web scraping as well as for data visualization, machine learning, etc.

Easily Explicable Syntax

Python is a very readable programming language as python syntax are easy to understand. Python is very expressive and code indentation helps the users to differentiate different blocks or scopes in the code.

Dynamically-typed language

Python is a dynamically-typed language, which means the data assigned to a variable tells, what type of variable it is. It saves lot of time and makes work faster.

Huge Community

Python community is huge which helps you wherever you stuck while writing code.

Beautiful Soup - Installation

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

BeautifulSoup package is not a part of Python's standard library, hence it must be installed. Before installing the latest version, let us create a virtual environment, as per Python's recommended method.

A virtual environment allows us to create an isolated working copy of python for a specific project without affecting the outside setup.

We shall use venv module in Python's standard library to create virtual environment. PIP is included by default in Python version 3.4 or later.

Use the following command to create virtual environment in Windows

C:\uses\user\>python -m venv myenv

On Ubuntu Linux, update the APT repo and install venv if required before creating virtual environment

mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~ $ sudo apt install python3-venv

Then use the following command to create a virtual environment

mvl@GNVBGL3:~ $ sudo python3 -m venv myenv

You need to activate the virtual environment. On Windows use the command

C:\uses\user\>cd myenv
C:\uses\user\myenv>scripts\activate
(myenv) C:\Users\users\user\myenv>

On Ubuntu Linux, use following command to activate the virtual environment

mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$

Name of the virtual environment appears in the parenthesis. Now that it is activated, we can now install BeautifulSoup in it.

(myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 
143.0/143.0 KB 325.2 kB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1

Note that the latest version of Beautifulsoup4 is 4.12.2 and requires Python 3.8 or later.

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install

To check if Beautifulsoup is properly install, enter following commands in Python terminal −

>>> import bs4
>>> bs4.__version__
'4.12.2'

If the installation hasn't been successful, you will get ModuleNotFoundError.

You will also need to install requests library. It is a HTTP library for Python.

pip3 install requests

Installing a Parser

By default, Beautiful Soup supports the HTML parser included in Python's standard library, however it also supports many external third party python parsers like lxml parser or html5lib parser.

To install lxml or html5lib parser, use the command:

pip3 install lxml
pip3 install html5lib

These parsers have their advantages and disadvantages as shown below −

Parser: Python's html.parser

Usage − BeautifulSoup(markup, "html.parser")

Advantages

Batteries included
Decent speed
Lenient (As of Python 3.2)

Disadvantages

Not as fast as lxml, less lenient than html5lib.

Parser: lxml's HTML parser

Usage − BeautifulSoup(markup, "lxml")

Advantages

Very fast
Lenient

Disadvantages

Parser: lxml's XML parser

Usage − BeautifulSoup(markup, "lxml-xml")

Or BeautifulSoup(markup, "xml")

Advantages

Very fast
The only currently supported XML parser

Disadvantages

External C dependency

Parser: html5lib

Usage − BeautifulSoup(markup, "html5lib")

Advantages

Extremely lenient
Parses pages the same way a web browser does
Creates valid HTML5

Disadvantages

Very slow
External Python dependency

Beautiful Soup - Souping the Page

It is time to test our Beautiful Soup package in one of the html pages (taking web page - https://www.tutorialspoint.com/index.htm, you can choose any-other web page you want) and extract some information from it.

In the below code, we are trying to extract the title from the webpage −

Example

from bs4 import BeautifulSoup
import requests


url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

soup = BeautifulSoup(req.content, "html.parser")

print(soup.title)

Output

<title>Online Courses and eBooks Library<title>

One common task is to extract all the URLs within a webpage. For that we just need to add the below line of code −

for link in soup.find_all('a'):
   print(link.get('href'))

Output

Shown below is the partial output of the above loop −

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
…
…

To parse a web page stored locally in the current working directory, obtain the file object pointing to the html file, and use it as argument to the BeautifulSoup() constructor.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup)

Output

<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>

You can also use a string that contains HTML script as constructor's argument as follows −

from bs4 import BeautifulSoup

html = '''
<html>
   <head>
      <title>Hello World</title>
   </head>
   <body>
      <h1 style="text-align:center;">Hello World</h1>
   </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup)

Beautiful Soup uses the best available parser to parse the document. It will use an HTML parser unless specified otherwise.

Beautiful Soup - Kinds of objects

When we pass a html document or string to a beautifulsoup constructor, beautifulsoup basically converts a complex html page into different python objects. Below we are going to discuss four major kinds of objects defined in bs4 package.

Tag
NavigableString
BeautifulSoup
Comments

Tag Object

A HTML tag is used to define various types of content. A tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (type(tag))

Output

<class 'bs4.element.Tag'>

Tags contain lot of attributes and methods and two important features of a tag are its name and attributes.

Name (tag.name)

Every tag contains a name and can be accessed through '.name' as suffix. tag.name will return the type of tag it is.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (tag.name)

Output

html

However, if we change the tag name, same will be reflected in the HTML markup generated by the BeautifulSoup.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print (tag)

Output

<strong><body><b class="boldest">TutorialsPoint</b></body></strong>

Attributes (tag.attrs)

A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Anything that is NOT tag, is basically an attribute and must contain a value. A dictionary of attributes and their values is returned by "attrs". You can access the attributes either through accessing the keys too.

In the example below, the string argument for Beautifulsoup() constructor contains HTML input tag. The attributes of input tag are returned by "attr".

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print (tag.attrs)

Output

{'type': 'text', 'name': 'name', 'value': 'Raju'}

We can do all kind of modifications to our tag's attributes (add/remove/modify), using dictionary operators or methods.

In the following example, the value tag is updated. The updated HTML string shows changes.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print (tag.attrs)
tag['value']='Ravi'
print (soup)

Output

<html><body><input name="name" type="text" value="Ravi"/></body></html>

We add a new id tag, and delete the value tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

tag['id']='nm'
del tag['value']
print (soup)

Output

<html><body><input id="nm" name="name" type="text"/></body></html>

Multi-valued attributes

Some of the HTML5 attributes can have multiple values. Most commonly used is the class-attribute which can have multiple CSS-values. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. The multi-valued attributes in beautiful soup are shown as list.

Example

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])

Output

css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']

However, if any attribute contains more than one value but it is not multi-valued attributes by any-version of HTML standard, beautiful soup will leave the attribute alone −

Example

from bs4 import BeautifulSoup

id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print ("id_soup.p['id']:", id_soup.p['id'])
print ("type(id_soup.p['id']):", type(id_soup.p['id']))

Output

id_soup.p['id']: body bold
type(id_soup.p['id']): <class 'str'>

NavigableString object

Usually, a certain string is placed in opening and closing tag of a certain type. The HTML engine of the browser applies the intended effect on the string while rendering the element. For example , in <b>Hello World</b>, you find a string in the middle of <b> and </b> tags so that it is rendered in bold.

The NavigableString object represents the contents of a tag. It is an object of bs4.element.NavigableString class. To access the contents, use ".string" with tag.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

print (soup.string)

print (type(soup.string))

Output

Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>

A NavigableString object is similar to a Python Unicode string. some of its features support Navigating the tree and Searching the tree. A NavigableString can be converted to a Unicode string with str() function.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
string = str(tag.string)
print (string)

Output

Hello, Tutorialspoint!

Just as a Python string, which is immutable, the NavigableString also can't be modified in place. However, use replace_with() to replace the inner string of a tag with another.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print (tag.string)

Output

OnLine Tutorials Library

BeautifulSoup object

The BeautifulSoup object represents the entire parsed object. However, it can be considered to be similar to Tag object. It is the object created when we try to scrape a web resource. Because it is similar to a Tag object, it supports the functionality required to parse and search the document tree.

Example

from bs4 import BeautifulSoup
fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

print (soup)
print (soup.name)
print ('type:',type(soup))

Output

<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>

The name property of BeautifulSoup object always returns [document].

Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a certain function such as replace_with().

Example

from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print (obj2)

Output

<html><body><book><title>Python</title></book></body></html>

Comment object

Any text written between  in HTML as well as XML document is treated as comment. BeautifulSoup can detect such commented text as a Comment object.

Example

from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))

Output

This is a comment text in HTML <class 'bs4.element.Comment'>

The Comment object is a special type of NavigableString object. The prettify() method displays the comment text with special formatting −

Example

print (soup.b.prettify())

Output

<b>
   <!--This is a comment text in HTML-->
</b>

Beautiful Soup - Inspect Data Source

In order to scrape a web page with BeautifulSoup and Python, your first step for any web scraping project should be to explore the website that you want to scrape. So, first visit the website to understand the site structure before you start extracting the information that's relevant for you.

Let us visit TutorialsPoint's Python Tutorial home page. Open https://www.tutorialspoint.com/python3/index.htm in your browser.

Use Developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed.

If using Chrome browser, open the Developer Tools from the top-right menu button (⋮) and selecting More Tools → Developer Tools.

With Developer tools, you can explore the site's document object model (DOM) to better understand your source. Select the Elements tab in developer tools. You'll see a structure with clickable HTML elements.

The Tutorial page shows the table of contents in the left sidebar. Right click on any chapter and choose Inspect option.

For the Elements tab, locate the tag that corresponds to the TOC list, as shown in the figure below −

Right click on the HTML element, copy the HTML element, and paste it in any editor.

The HTML script of the <ul>..</ul> element is now obtained.

<ul class="toc chapters">
   <li class="heading">Python 3 Basic Tutorial</li>
   <li class="current-chapter"><a href="/python3/index.htm">Python 3 - Home</a></li>
   <li><a href="/python3/python3_whatisnew.htm">What is New in Python 3</a></li>
   <li><a href="/python3/python_overview.htm">Python 3 - Overview</a></li>
   <li><a href="/python3/python_environment.htm">Python 3 - Environment Setup</a></li>
   <li><a href="/python3/python_basic_syntax.htm">Python 3 - Basic Syntax</a></li>
   <li><a href="/python3/python_variable_types.htm">Python 3 - Variable Types</a></li>
   <li><a href="/python3/python_basic_operators.htm">Python 3 - Basic Operators</a></li>
   <li><a href="/python3/python_decision_making.htm">Python 3 - Decision Making</a></li>
   <li><a href="/python3/python_loops.htm">Python 3 - Loops</a></li>
   <li><a href="/python3/python_numbers.htm">Python 3 - Numbers</a></li>
   <li><a href="/python3/python_strings.htm">Python 3 - Strings</a></li>
   <li><a href="/python3/python_lists.htm">Python 3 - Lists</a></li>
   <li><a href="/python3/python_tuples.htm">Python 3 - Tuples</a></li>
   <li><a href="/python3/python_dictionary.htm">Python 3 - Dictionary</a></li>
   <li><a href="/python3/python_date_time.htm">Python 3 - Date & Time</a></li>
   <li><a href="/python3/python_functions.htm">Python 3 - Functions</a></li>
   <li><a href="/python3/python_modules.htm">Python 3 - Modules</a></li>
   <li><a href="/python3/python_files_io.htm">Python 3 - Files I/O</a></li>
   <li><a href="/python3/python_exceptions.htm">Python 3 - Exceptions</a></li>
</ul>

We can now load this script in a BeautifulSoup object to parse the document tree.

Beautiful Soup - Scrape HTML Content

The process of extracting data from websites is called Web scraping. A web page may have urls, Email addresses, images or any other content, which we can be stored in a file or database. Searching a website manually is cumbersome process. There are different web scaping tools that automate the process.

Web scraping is is sometimes prohibited by the use of 'robots.txt' file. Some popular sites provide APIs to access their data in a structured way. Unethical web scraping may result in getting your IP blocked.

Python is widely used for web scraping. Python standard library has urllib package, which can be used to extract data from HTML pages. Since urllib module is bundled with the standard library, it need not be installed.

The urllib package is an HTTP client for python programming language. The urllib.request module is usefule when we want to open and read URLs. Other module in urllib package are −

urllib.error defines the exceptions and errors raised by the urllib.request command.
urllib.parse is used for parsing URLs.
urllib.robotparser is used for parsing robots.txt files.

Use the urlopen() function in urllib module to read the content of a web page from a website.

import urllib.request
response =  urllib.request.urlopen('http://python.org/') 
html = response.read()

You can also use the requests library for this purpose. You need to install it before using.

pip3 install requests

In the below code, the homepage of http://www.tutorialspoint.com is scraped −

from bs4 import BeautifulSoup
import requests


url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

The content obtained by either of the above two methods are then parsed with Beautiful Soup.

Beautiful Soup - Navigating by Tags

One of the important pieces of element in any piece of HTML document are tags, which may contain other tags/strings (tag's children). Beautiful Soup provides different ways to navigate and iterate over's tag's children.

Easiest way to search a parse tree is to search the tag by its name.

soup.head

The soup.head function returns the contents put inside the <head> .. </head> element of a HTML page.

Consider the following HTML page to be scraped:
<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

Following code extracts the contents of <head> element

Example

from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print(soup.head)

Output

<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>

soup.body

Similarly, to return the contents of body part of HTML page, use soup.body

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print (soup.body)

Output

<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>

You can also extract specific tag (like first <h1> tag) in the <body> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.body.h1)

Output

<h1>Tutorialspoint Online Library</h1>

soup.p

Our HTML file contains a <p> tag. We can extract the contents of this tag

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.p)

Output

<p><b>It's all Free</b></p>

Tag.contents

A Tag object may have one or more PageElements. The Tag object's contents property returns a list of all elements included in it.

Let us find the elements in <head> tag of our index.html file.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.head
print (tag.contents)

Output

['\n',
<title>TutorialsPoint</title>,
'\n',
<script>
document.write("Welcome to TutorialsPoint");
</script>,
'\n']

Tag.children

The Tag object has a children property that returns a list iterator object containing the enclosed PageElements.

To demonstrate the children property, we shall use the following HTML script (index.html). In the <body> section, there are two <ul> list elements, one nested in another. In other words, the body tag has top level list elements, and each list element has another list under it.

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
      <li>Accounts</li>
         <ul>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul>
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python code gives a list of all the children elements of top level <ul> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.ul
print (list(tag.children))

Output

['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

Since the .children property returns a list_iterator, we can use a for loop to traverse the hierarchy.

Example

for child in tag.children:
   print (child)

Output

<li>Accounts</li>

<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>

<li>HR</li>

<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>

Tag.find_all()

This method returns a result set of contents of all the tags matching with the argument tag provided.

Let us consider the following HTML page(index.html) for this −

<html>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
      <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
      <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
      <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
      <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
      <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
   </body>
</html>

The following code lists all the elements with <a> tag

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

result = soup.find_all("a")
print (result)

Output

[
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
   <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
   <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
]

Beautiful Soup - Find Elements by ID

In an HTML document, usually each element is assigned a unique ID. This enables the value of an element to be extracted by a front-end code such as JavaScript function.

With BeautifulSoup, you can find the contents of a given element by its ID. There are two methods by which this can be achieved - find() as well as find_all(), and select()

Using find() method

The find() method of BeautifulSoup object searches for first element that satisfies the given criteria as an argument.

Let us use the following HTML script (as index.html) for the purpose

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <form>
         <input type = 'text' id = 'nm' name = 'name'>
         <input type = 'text' id = 'age' name = 'age'>
         <input type = 'text' id = 'marks' name = 'marks'>
      </form>
   </body>
</html>

The following Python code finds the element with its id as nm

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.find(id = 'nm')
print (obj)

Output

<input id="nm" name="name" type="text"/>

Using find_all()

The find_all() method also accepts a filter argument. It returns a list of all the elements with the given id. In a certain HTML document, usually a single element with a particular id. Hence, using find() instead of find_all() is preferrable to search for a given id.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.find_all(id = 'nm')
print (obj)

Output

[<input id="nm" name="name" type="text"/>]

Note that the find_all() method returns a list. The find_all() method also has a limit parameter. Setting limit=1 to find_all() is equivalent to find()

obj = soup.find_all(id = 'nm', limit=1)

Using select() method

The select() method in BeautifulSoup class accepts CSS selector as an argument. The # symbol is the CSS selector for id. It followed by the value of required id is passed to select() method. It works as the find_all() method.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.select("#nm")
print (obj)

Output

[<input id="nm" name="name" type="text"/>]

Using select_one()

Like the find_all() method, the select() method also returns a list. There is also a select_one() method to return the first tag of the given argument.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.select_one("#nm")
print (obj)

Output

<input id="nm" name="name" type="text"/>

Beautiful Soup - Find Elements by Class

CSS (cascaded Style sheets) is a tool for designing the appearance of HTML elements. CSS rules control the different aspects of HTML element such as size, color, alignment etc.. Applying styles is more effective than defining HTML element attributes. You can apply styling rules to each HTML element. Instead of applying style to each element individually, CSS classes are used to apply similar styling to groups of HTML elements to achieve uniform web page appearance. In BeautifulSoup, it is possible to find tags styled with CSS class. In this chapter, we shall use the following methods to search for elements for a specified CSS class −

find_all() and find() methods
select() and select_one() methods

Class in CSS

A class in CSS is a collection of attributes specifying the different features related to appearance, such as font type, size and color, background color, alignment etc. Name of the class is prefixed with a dot (.) while declaring it.

.class {  
   css declarations;  
}

A CSS class may be defined inline, or in a separate css file which needs to be included in the HTML script. A typical example of a CSS class could be as follows −

.blue-text {
   color: blue;
   font-weight: bold;
}

You can search for HTML elements defined with a certain class style with the help of following BeautifulSoup methods.

For the purpose of this chapter, we shall use the following HTML page −

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2 class="heading">Departmentwise Employees</h2>
      <ul>
         <li class="mainmenu">Accounts</li>
         <ul>
            <li class="submenu">Anand</li>
            <li class="submenu">Mahesh</li>
         </ul>
         <li class="mainmenu">HR</li>
         <ul>
            <li class="submenu">Rani</li>
            <li class="submenu">Ankita</li>
         </ul>
      </ul>
   </body>
</html>

Using find() and find_all()

To search for elements with a certain CSS class used in a tag, use attrs property of Tag object as follows −

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.find_all(attrs={"class": "mainmenu"})
print (obj)

Output

[<li class="mainmenu">Accounts</li>, <li class="mainmenu">HR</li>]

The result is a list of all the elements with mainmenu class

To fetch the list of elements with any of the CSS classes mentioned in in attrs property, change the find_all() statement to −

obj = soup.find_all(attrs={"class": ["mainmenu", "submenu"]})

This results into a list of all the elements with any of CSS classes used above.

[
   <li class="mainmenu">Accounts</li>, 
   <li class="submenu">Anand</li>, 
   <li class="submenu">Mahesh</li>, 
   <li class="mainmenu">HR</li>, 
   <li class="submenu">Rani</li>, 
   <li class="submenu">Ankita</li>
]

Using select() and select_one()

You can also use select() method with the CSS selector as the argument. The (.) symbol followed by the name of the class is used as the CSS selector.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.select(".heading")
print (obj)

Output

[<h2 class="heading">Departmentwise Employees</h2>]

The select_one() method returns the first element found with the given class.

obj = soup.select_one(".submenu")

Beautiful Soup - Find Elements by Attribute

Both find() and find_all() methods are meant to find one or all the tags in the document as per the arguments passed to these methods. You can pass attrs parameter to these functions. The value of attrs must be a dictionary with one or more tag attributes and their values.

For the purpose of checking the behaviour of these methods, we shall use the following HTML document (index.html)

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <form>
         <input type = 'text' id = 'nm' name = 'name'>
         <input type = 'text' id = 'age' name = 'age'>
         <input type = 'text' id = 'marks' name = 'marks'>
      </form>
   </body>
</html>

Using find_all()

The following program returns a list of all the tags having input type="text" attribute.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.find_all(attrs={"type":'text'})
print (obj)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using find()

The find() method returns the first tag in the parsed document that has the given attributes.

obj = soup.find(attrs={"name":'marks'})

Using select()

The select() method can be called by passing the attributes to be compared against. The attributes must be put in a list object. It returns a list of all tags that have the given attribute.

In the following code, the select() method returns all the tags with type attribute.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.select("[type]")
print (obj)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using select_one()

The select_one() is method is similar, except that it returns the first tag satisfying the given filter.

obj = soup.select_one("[name='marks']")

Output

<input id="marks" name="marks" type="text"/>

Beautiful Soup - Searching the Tree

In this chapter, we shall discuss different methods in Beautiful Soup for navigating the HTML document tree in different directions - going up and down, sideways, and back and forth.

We shall use the following HTML string in all the examples in this chapter −

html = """
<html><head><title>TutorialsPoint</title></head>
   <body>
      <p class="title"><b>Online Tutorials Library</b></p>

      <p class="story">TutorialsPoint has an excellent collection of tutorials on:
      <a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
      <a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
      <a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
      Enhance your Programming skills.</p>

      <p class="tutorial">...</p>
"""

The name of required tag lets you navigate the parse tree. For example soup.head fetches you the <head> element −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

print (soup.head.prettify())

Output

<head>
   <title>
      TutorialsPoint
   </title>
</head>

Going down

A tag may contain strings or other tags enclosed in it. The .contents property of Tag object returns a list of all the children elements belonging to it.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.head 
print (list(tag.children))

Output

[<title>TutorialsPoint</title>]

The returned object is a list, although in this case, there is only a single child tag enclosed in head element.

.children

The .children property also returns a list of all the enclosed elements in a tag. Below, all the elements in body tag are given as a list.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.body 
print (list(tag.children))

Output

['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n', 
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>, '\n', <p class="tutorial">...</p>, '\n']

Instead of getting them as a list, you can iterate over a tag's children using the .children generator −

Example

tag = soup.body 
for child in tag.children:
   print (child)

Output

<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>


<p class="tutorial">...</p>

.descendents

The .contents and .children attributes only consider a tag's direct children. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on.

The BeautifulSoup object is at the top of hierarchy of all the tags. Hence its .descendents property includes all the elements in the HTML string.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

print (soup.descendants)

The .descendents attribute returns a generator, which can be iterated with a for loop. Here, we list out the descendents of the head tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
   print (element)

Output

<title>TutorialsPoint</title>
TutorialsPoint

The head tag contains a title tag, which in turn encloses a NavigableString object TutorialsPoint. The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. But the BeautifulSoup object only has one direct child (the <html> tag), but it has many descendants.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = list(soup.descendants)
print (len(tags))

Output

Going Up

Just as you navigate the downstream of a document with children and descendents properties, BeautifulSoup offers .parent and .parent properties to navigate the upstream of a tag

.parent

every tag and every string has a parent tag that contains it. You can access an element's parent with the parent attribute. In our example, the <head> tag is the parent of the <title> tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.title
print (tag.parent)

Output

<head><title>TutorialsPoint</title></head>

Since the title tag contains a string (NavigableString), the parent for the string is title tag itself.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.title
string = tag.string
print (string.parent)

Output

<title>TutorialsPoint</title>

.parents

You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document. In the following code, we track the parents of the first <a> tag in the example HTML string.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.a 
print (tag.string)

for parent in tag.parents:
   print (parent.name)

Output

Python
p
body
html
[document]

Sideways

The HTML tags appearing at the same indentation level are called siblings. Consider the following HTML snippet

<p>
   <b>
      Hello
   </b>
   <i>
      Python
   </i>
</p>

In the outer <p> tag, we have <b> and <i> tags at the same indent level, hence they are called siblings. BeautifulSoup makes it possible to navigate between the tags at same level.

.next_sibling and .previous_sibling

These attributes respectively return the next tag at the same level, and the previous tag at same level.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b 
print ("next:",tag1.next_sibling)

tag2 = soup.i 
print ("previous:",tag2.previous_sibling)

Output

next: <i>Python</i>
previous: <b>Hello</b>

Since the <b> tag doesn't have a sibling to its left, and <i> tag doesn't have a sibling to its right, it returns Nobe in both cases.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b 
print ("next:",tag1.previous_sibling)

tag2 = soup.i 
print ("previous:",tag2.next_sibling)

Output

next: None
previous: None

.next_siblings and .previous_siblings

If there are two or more siblings to the right or left of a tag, they can be navigated with the help of the .next_siblings and .previous_siblings attributes respectively. Both of them return generator object so that a for loop can be used to iterate.

Let us use the following HTML snippet for this purpose −

<p>
   <b>
      Excellent
   </b>
   <i>
      Python
   </i>
   <u>
      Tutorial
   </u>
</p>

Use the following code to traverse next and previous sibling tags.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.b 
print ("next siblings:")
for tag in tag1.next_siblings:
   print (tag)
print ("previous siblings:")
tag2 = soup.u 
for tag in tag2.previous_siblings:
   print (tag)

Output

next siblings:
<i>Python</i>
<u>Tutorial</u>
previous siblings:
<i>Python</i>
<b>Excellent</b>

Back and forth

In Beautiful Soup, the next_element property returns the next string or tag in the parse tree. On the other hand, the previous_element property returns the previous string or tag in the parse tree. Sometimes, the return value of next_element and previous_element attributes is similar to next_sibling and previous_sibling properties.

.next_element and .previous_element

Example

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>

<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>

<p class="tutorial">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find("a", id="link3")
print (tag.next_element)

tag = soup.find("a", id="link1")
print (tag.previous_element)

Output

PHP
TutorialsPoint has an excellent collection of tutorials on:

The next_element after <a> tag with id = "link3" is the string PHP. Similarly, the previous_element returns the string before <a> tag with id = "link1".

.next_elements and .previous_elements

These attributes of the Tag object return generator respectively of all tags and strings after and before it.

Next elements example

tag = soup.find("a", id="link1")
for element in tag.next_elements:
   print (element)

Output

Python
,

<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a>
Java
 and

<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>
PHP
;
Enhance your Programming skills.


<p class="tutorial">...</p>
...

Previous elements example

tag = soup.find("body")
for element in tag.previous_elements:
   print (element)

Output

<html><head><title>TutorialsPoint</title></head>

Beautiful Soup - Modifying the Tree

One of the powerful features of Beautiful Soup library is to be able to be able to manipulate the parsed HTML or XML document and modify its contents.

Beautiful Soup library has different functions to perform the following operations −

Add contents or a new tag to an existing tag of the document
Insert contents before or after an existing tag or string
Clear the contents of an already existing tag
Modify the contents of a tag element

Add content

You can add to the content of an existing tag by using append() method on a Tag object. It works like the append() method of Python's list object.

In the following example, the HTML script has a <p> tag. With append(), additional text is appended.

Example

from bs4 import BeautifulSoup

markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print (soup)
tag = soup.p

tag.append(" World")
print (soup)

Output

<p>Hello</p>
<p>Hello World</p>

With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with new_tag() method and then pass it to the append() method.

Example

from bs4 import BeautifulSoup, Tag

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print (soup.prettify())

Output

<b>
   Hello
   <i>
      World
   </i>
</b>

If you have to add a string to the document, you can append a NavigableString object.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
new_string = NavigableString(" World")
tag.append(new_string)
print (soup.prettify())

Output

<b>
   Hello
   World
</b>

From Beautiful Soup version 4.7 onwards, the extend() method has been added to Tag class. It adds all the elements in a list to the tag.

Example

from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
vals = ['World.', 'Welcome to ', 'TutorialsPoint']
tag.extend(vals)
print (soup.prettify())

Output

<b>
   Hello
   World.
   Welcome to
   TutorialsPoint
</b>

Insert Contents

Instead of adding a new element at the end, you can use insert() method to add an element at the given position in a the list of children of a Tag element. The insert() method in Beautiful Soup behaves similar to insert() on a Python list object.

In the following example, a new string is added to the <b> tag at position 1. The resultant parsed document shows the result.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert(1, "Tutorial ")
print (soup.prettify())

Output

<b>
   Excellent
   Tutorial
</b>
<u>
   from TutorialsPoint
</u>

Beautiful Soup also has insert_before() and insert_after() methods. Their respective purpose is to insert a tag or a string before or after a given Tag object. The following code shows that a string "Python Tutorial" is added after the <b> tag.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert_after("Python Tutorial")
print (soup.prettify())

Output

<b>
   Excellent
</b>
Python Tutorial
<u>
   from TutorialsPoint
</u>

On the other hand, insert_before() method is used below, to add "Here is an " text before the <b> tag.

tag.insert_before("Here is an ")
print (soup.prettify())

Output

Here is an
<b>
   Excellent
</b>
Python Tutorial
<u>
   from TutorialsPoint
</u>

Clear the Contents

Beautiful Soup provides more than one ways to remove contents of an element from the document tree. Each of these methods has its unique features.

The clear() method is the most straight-forward. It simply removes the contents of a specified Tag element. Following example shows its usage.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('u')

tag.clear()
print (soup.prettify())

Output

<b>
   Excellent
</b>
<u>
</u>

It can be seen that the clear() method removes the contents, keeping the tag intact.

For the following example, we parse the following HTML document and call clear() metho on all tags.

<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs./p>
   </body>
</html>

Here is the Python code using clear() method

Example

from bs4 import BeautifulSoup

fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   tag.clear()
print (soup.prettify())

Output

<html>
</html>

The extract() method removes either a tag or a string from the document tree, and returns the object that was removed.

Example

from bs4 import BeautifulSoup

fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   obj = tag.extract()
   print ("Extracted:",obj)

print (soup)

Output

Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>

You can extract either a tag or a string. The following example shows antag being extracted.

Example

html = '''
   <ol id="HR">
   <li>Rani</li>
   <li>Ankita</li>
   </ol>
'''
from bs4 import BeautifulSoup


soup = BeautifulSoup(html, 'html.parser')
obj=soup.find('ol')
obj.find_next().extract()
print (soup)

Output

<ol id="HR">
   <li>Ankita</li>
</ol>

Change the extract() statement to remove inner text of first <li> element.

Example

obj.find_next().string.extract()

Output

<ol id="HR">
   <li>Ankita</li>
</ol>

There is another method decompose() that removes a tag from the tree, then completely destroys it and its contents −

Example

html = '''
   <ol id="HR">
      <li>Rani</li>
      <li>Ankita</li>
   </ol>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag1=soup.find('ol')
tag2 = soup.find('li')
tag2.decompose()
print (soup)
print (tag2.decomposed)

Output

<ol id="HR">

<li>Ankita</li>
</ol>

The decomposed property returns True or False - whether an element has been decomposed or not.

Modify the Contents

We shall look at the replace_with() method that allows contents of a tag to be replaced.

Just as a Python string, which is immutable, the NavigableString also can't be modified in place. However, use replace_with() to replace the inner string of a tag with another.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print (tag.string)

Output

OnLine Tutorials Library

Here is another example to show the use of replace_with(). Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a certain function such as replace_with().2524

Example

from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print (obj2)

Output

<html><body><book><title>Python</title></book></body></html>

The wrap() method wraps an element in the tag you specify. It returns the new wrapper.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello Python</p>", 'html.parser')
tag = soup.p
newtag = soup.new_tag('b')
tag.string.wrap(newtag)

print (soup)

Output

<p><b>Hello Python</b></p>

On the other hand, the unwrap() method replaces a tag with whatever's inside that tag. It's good for stripping out markup.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>Python</b></p>", 'html.parser')
tag = soup.p
tag.b.unwrap()

print (soup)

Output

<p>Hello Python</p>

Beautiful Soup - Parsing a Section of a Document

Let's say you want to use Beautiful Soup look at a document's <a> tags only. Normally you would parse the tree and use find_all() method with the required tag as the argument.

soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all('a')

But that would be time consuming as well as it will take up more memory unnecessarily. Instead, you can create an object of SoupStrainer class and use it as value of parse_only argument to BeautifulSoup constructor.

A SoupStrainer tells BeautifulSoup what parts extract, and the parse tree consists of only these elements. If you narrow down your required information to a specific portion of the HTML, this will speed up your search result.

product = SoupStrainer('div',{'id': 'products_list'})
soup = BeautifulSoup(html,parse_only=product)

Above lines of code will parse only the titles from a product site, which might be inside a tag field.

Similarly, like above we can use other soupStrainer objects, to parse specific information from an HTML tag. Below are some of the examples −

Example

from bs4 import BeautifulSoup, SoupStrainer

#Only "a" tags
only_a_tags = SoupStrainer("a")

#Will parse only the below mentioned "ids".
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

#parse only where string length is less than 10
def is_short_string(string):
   return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)

The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, text, and **kwargs.

Note that this feature won't work if you're using the html5lib parser, because the whole document will be parsed in that case, no matter what. Hence, you should use either the inbuilt html.parser or lxml parser.

You can also pass a SoupStrainer into any of the methods covered in Searching the tree.

from bs4 import SoupStrainer

a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(a_tags)

Beautiful Soup - Find all Children of an Element

The structure of tags in a HTML script is hierarchical. The elements are nested one inside the other. For example, the top level <HTML> tag includes <HEAD> and <BODY> tags, each may have other tags in it. The top level element is called as parent. The elements nested inside the parent are its children. With the help of Beautiful Soup, we can find all the children elements of a parent element. In this chapter, we shall find out how to obtain the children of a HTML element.

There are two provisions in BeautifulSoup class to fetch the children elements.

The .children property
The findChildren() method

Examples in this chapter use the following HTML script (index.html)

<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul id="dept">
<li>Accounts</li>
   <ul id='acc'>
   <li>Anand</li>
   <li>Mahesh</li>
   </ul>
<li>HR</li>
   <ul id="HR">
   <li>Rani</li>
   <li>Ankita</li>
   </ul>
</ul>
</body>
</html>

Using .children property

The .children property of a Tag object returns a generator of all the child elements in a recursive manner.

The following Python code gives a list of all the children elements of top level <ul> tag. We first obtain the Tag element corresponding to the <ul> tag, and then read its .children property

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.ul
print (list(tag.children))

Output

['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

Since the .children property returns a list_iterator, we can use a for loop to traverse the hierarchy.

for child in tag.children:
   print (child)

Output

<li>Accounts</li>

<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>

<li>HR</li>

<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>

Using findChildren() method

The findChildren() method offers a more comprehensive alternative. It returns all the child elements under any top level tag.

In the index.html document, we have two nested unordered lists. The top level <ul> element has id = "dept" and the two enclosed lists are having id = "acc' and "HR' respectively.

In the following example, we first instantiate a Tag object pointing to top level <ul> element and extract the list of children under it.

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find("ul", {"id": "dept"})

children = tag.findChildren()
 
for child in children:
   print(child)

Note that the resultset includes the children under an element in a recursive fashion. Hence, in the following output, you'll find the entire inner list, followed by individual elements in it.

<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>Anand</li>
<li>Mahesh</li>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
<li>Rani</li>
<li>Ankita</li>

Let us extract the children under an inner <ul> element with id='acc'. Here is the code −

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find("ul", {"id": "acc"})

children = tag.findChildren()
 
for child in children:
	print(child)

When the above program is run, you'll obtain the <li>elements under the <ul> with id as acc.

Output

<li>Anand</li>
<li>Mahesh</li>

Thus, BeautifulSoup makes it very easy to parse the children elements under any top level HTML element.

Beautiful Soup - Find Element using CSS Selectors

In Beautiful Soup library, the select() method is an important tool for scraping the HTML/XML document. Similar to find() and the other find_*() methods, the select() method also helps in locating an element that satisfies a given criteria. However, the find*() methods search for the PageElements according to the Tag name and its attributes, the select() method searches the document tree for the given CSS selector.

Beautiful Soup also has select_one() method. Difference in select() and select_one() is that, select() returns a ResultSet of all the elements belonging to the PageElement and characterized by the CSS selector; whereas select_one() returns the first occurrence of the element satisfying the CSS selector based selection criteria.

Prior to Beautiful Soup version 4.7, the select() method used to be able to support only the common CSS selectors. With version 4.7, Beautiful Soup was integrated with Soup Sieve CSS selector library. As a result, much more selectors can now be used. In the version 4.12, a .css property has been added in addition to the existing convenience methods, select() and select_one().The parameters for select() method are as follows −

select(selector, limit, **kwargs)

selector − A string containing a CSS selector.

limit − After finding this number of results, stop looking.

kwargs − Keyword arguments to be passed.

If the limit parameter is set to 1, it becomes equivalent to select_one() method. While the select() method returns a ResultSet of Tag objects, the select_one() method returns a single Tag object.

Soup Sieve Library

Soup Sieve is a CSS selector library. It has been integrated with Beautiful Soup 4, so it is installed along with Beautiful Soup package. It provides ability to select, match, and filter he document tree tags using modern CSS selectors. Soup Sieve currently implements most of the CSS selectors from the CSS level 1 specifications up to CSS level 4, except for some that are not yet implemented.

The Soup Sieve library has different types of CSS selectors. The basic CSS selectors are −

Type selector

Matching elements is done by node name. For example −

tags = soup.select('div')

Example

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

tags = soup.select('div')
print (tags)

Output

[<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>]

Universal selector (*)

It matches elements of any type. Example −

tags = soup.select('*')

ID selector

It matches an element based on its id attribute. The symbol # denotes the ID selector. Example −

tags = soup.select("#nm")

Example

from bs4 import BeautifulSoup

html = '''
   <form>
      <input type = 'text' id = 'nm' name = 'name'>
      <input type = 'text' id = 'age' name = 'age'>
      <input type = 'text' id = 'marks' name = 'marks'>
   </form>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.select("#nm")
print (obj)

Output

[<input id="nm" name="name" type="text"/>]

Class selector

It matches an element based on the values contained in the class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −

tags = soup.select(".submenu")

Example

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

tags = soup.select('div')
print (tags)

Output

[<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>]

Attribute Selectors

The attribute selector matches an element based on its attributes.

soup.select('[attr]')

Example

from bs4 import BeautifulSoup

html = '''
   <h1>Tutorialspoint Online Library</h1>
   <p><b>It's all Free</b></p>
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a> 
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('[href]'))

Output

[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>]

Pseudo Classes

CSS specification defines a number of pseudo CSS classes. A pseudo-class is a keyword added to a selector so as to define a special state of the selected elements. It adds an effect to the existing elements. For example, :link selects a link (every <a> and <area> element with an href attribute) that has not yet been visited.

The pseudo-class selectors nth-of-type and nth-child are very widely used.

:nth-of-type()

The selector :nth-of-type() matches elements of a given type, based on their position among a group of siblings. The keywords even and odd, and will respectively select elements, from a sub-group of sibling elements.

In the following example, second element of <p> type is selected.

Example

from bs4 import BeautifulSoup

html = '''
<p id="0"></p>
<p id="1"></p>
<span id="2"></span>
<span id="3"></span>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('p:nth-of-type(2)'))

Output

[<p id="1"></p>]

:nth-child()

This selector matches elements based on their position in a group of siblings. The keywords even and odd will respectively select elements whose position is either even or odd amongst a group of siblings.

Usage

:nth-child(even)
:nth-child(odd)
:nth-child(2)

Example

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

child = tag.select_one(':nth-child(2)')
print (child)

Output

<p>Python</p>

Beautiful Soup - Find all Comments

Inserting comments in a computer code is supposed to be a good programming practice. Comments are helpful for understanding the logic of the program. They also serve as a documentation. You can put comments in a HTML as well as XML script, just as in a program written in C, Java, Python etc. BeautifulSoup API can be helpful to identify all the comments in a HTML document.

In HTML and XML, the comment text is written between  tags.

<!-- Comment Text -->

The BeutifulSoup package, whose internal name is bs4, defines Comment as an important object. The Comment object is a special type of NavigableString object. Hence, the string property of any Tag that is found between  is recognized as a Comment.

Example

from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))

Output

This is a comment text in HTML <class 'bs4.element.Comment'>

To search for all the occurrences of comment in a HTML document, we shall use find_all() method. Without any argument, find_all() returns all the elements in the parsed HTML document. You can pass a keyword argument 'string' to find_all() method. We shall assign the return value of a function iscomment() to it.

comments = soup.find_all(string=iscomment)

The iscomment() function verifies if the text in a tag is a comment object or not, with the help of isinstance() function.

def iscomment(elem):
   return isinstance(elem, Comment)

The comments variable shall store all the comment text occurrences in the given HTML document. We shall use the following index.html file in the example code −

<html>
   <head>
      <!-- Title of document -->
      <title>TutorialsPoint</title>
   </head>
   <body>
      <!-- Page heading -->
      <h2>Departmentwise Employees</h2>
      <!-- top level list-->
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <!-- first inner list -->
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul id="HR">
         <!-- second inner list -->
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python program scrapes the above HTML document, and finds all the comments in it.

Example

from bs4 import BeautifulSoup, Comment

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

def iscomment(elem):
    return isinstance(elem, Comment)

comments = soup.find_all(string=iscomment)
print (comments)

Output

[' Title of document ', ' Page heading ', ' top level list', ' first inner list ', ' second inner list ']

The above output shows a list of all comments. We can also use a for loop over the collection of comments.

Example

i=0
for comment in comments:
   i+=1
   print (i,".",comment)

Output

1 .  Title of document 
2 .  Page heading
3 .  top level list
4 .  first inner list
5 .  second inner list

In this chapter, we learned how to extract all the comment strings in a HTML document.

Beautiful Soup - Scraping List from HTML

Web pages usually contain important data in the formation in the form of ordered or unordered lists. With Beautiful Soup, we can easily extract the HTML list elements, bring the data in Python objects to store in databases for further analysis. In this chapter, we shall use find() and select() methods to scrape the list data from a HTML document.

Easiest way to search a parse tree is to search the tag by its name. soup.<tag> fetches the contents of the given tag.

HTML provides <ol> and <ul> tags to compose ordered and unordered lists. Like any other tag, we can fetch the contents of these tags.

We shall use the following HTML document −

<html>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ol id="HR">
         <li>Rani</li>
         <li>Ankita</li>
         </ol>
      </ul>
   </body>
</html>

Scraping lists by Tag

In the above HTML document, we have a top-level <ul> list, inside which there's another <ul> tag and another <ol> tag. We first parse the document in soup object and retrieve contents of first <ul> in soup.ul Tag object.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

lst=soup.ul

print (lst)

Output

<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>

Change value of lst to point to <ol> element to get the inner list.

lst=soup.ol

Output

<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>

Using select() method

The select() method is essentially used to obtain data using CSS selector. However, you can also pass a tag to it. Here, we can pass the ol tag to select() method. The select_one() method is also available. It fetches the first occurrence of the given tag.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

lst=soup.select("ol")

print (lst)

Output

[<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>]

Using find_all() method

The find() and fin_all() methods are more comprehensive. You can pass various types of filters such as tag, attributes or string etc. to these methods. In this case, we want to fetch the contents of a list tag.

In the following code, find_all() method returns a list of all elements in the <ul> tag.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

lst=soup.find_all("ul")

print (lst)

We can refine the search filter by including the attrs argument. In our HTML document, the <ul> and <ol> tags, we have specified their respective id attributes. So, let us fetch the contents of <ul> element having id="acc".

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

lst=soup.find_all("ul", {"id":"acc"})

print (lst)

Output

[<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>]

Here's another example. We collect all elements with <li> tag with the inner text starting with 'A'. The find_all() method takes a keyword argument string. It takes the value of the text if the startingwith() function returns True.

Example

from bs4 import BeautifulSoup

def startingwith(ch):
   return ch.startswith('A')

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

lst=soup.find_all('li',string=startingwith)

print (lst)

Output

[<li>Accounts</li>, <li>Anand</li>, <li>Ankita</li>]

Beautiful Soup - Scraping Paragraphs from HTML

One of the frequently appearing tags in a HTML document is the <p> tag that marks a paragraph text. With Beautiful Soup, you can easily extract paragraph from the parsed document tree. In this chapter, we shall discuss the following ways of scraping paragraphs with the help of BeautifulSoup library.

Scraping HTML paragraph with <p> tag
Scraping HTML paragraph with find_all() method
Scraping HTML paragraph with select() method

We shall use the following HTML document for these exercises −

<html>
   <head>
      <title>BeautifulSoup - Scraping Paragraph</title>
   </head>
   <body>
      <p id='para1'>The quick, brown fox jumps over a lazy dog.</p>
      <h2>Hello</h2>
      <p>DJs flock by when MTV ax quiz prog.</p>
      
      <p>Junk MTV quiz graced by fox whelps.</p>
      
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>

Scraping by <p> tag

Easiest way to search a parse tree is to search the tag by its name. Hence, the expression soup.p points towards the first <p> tag in the scouped document.

para = soup.p

To fetch all the subsequent <p> tags, you can run a loop till the soup object is exhausted of all the <p> tags. The following program displays the prettified output of all the paragraph tags.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

para = soup.p 
print (para.prettify())
while True:
   p = para.find_next('p')
   if p is None:
      break
   print (p.prettify())
   para=p

Output

<p>
 The quick, brown fox jumps over a lazy dog.
</p>

<p>
 DJs flock by when MTV ax quiz prog.
</p>

<p>
 Junk MTV quiz graced by fox whelps.
</p>

<p>
 Bawds jog, flick quartz, vex nymphs.
</p>

Using find_all() method

The find_all() methods is more comprehensive. You can pass various types of filters such as tag, attributes or string etc. to this method. In this case, we want to fetch the contents of a <p> tag.

In the following code, find_all() method returns a list of all elements in the <p> tag.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

paras = soup.find_all('p') 
for para in paras:
   print (para.prettify())

Output

<p>
 The quick, brown fox jumps over a lazy dog.
</p>

<p>
 DJs flock by when MTV ax quiz prog.
</p>

<p>
 Junk MTV quiz graced by fox whelps.
</p>

<p>
 Bawds jog, flick quartz, vex nymphs.
</p>

We can use another approach to find all <p> tags. To begin with, obtain list of all tags using find_all() and check Tag.name of each equals ='p'.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
paras = [tag.contents for tag in tags if tag.name=='p']
print (paras)

The find_all() method also has attrs parameter. It is useful when you want to extract the <p> tag with specific attributes. For example, in the given document, the first <p> element has id='para1'. To fetch it, we need to modify the tag object as −

paras = soup.find_all('p', attrs={'id':'para1'})

Using select() method

The select() method is essentially used to obtain data using CSS selector. However, you can also pass a tag to it. Here, we can pass the <p> tag to select() method. The select_one() method is also available. It fetches the first occurrence of the <p> tag.

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

paras = soup.select('p')
print (paras)

Output

[
<p>The quick, brown fox jumps over a lazy dog.</p>, 
<p>DJs flock by when MTV ax quiz prog.</p>, 
<p>Junk MTV quiz graced by fox whelps.</p>, 
<p>Bawds jog, flick quartz, vex nymphs.</p>
]

To filter out <p> tags with a certain id, use a for loop as follows −

Example

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')
tags = soup.select('p')
for tag in tags:
   if tag.has_attr('id') and tag['id']=='para1':
      print (tag.contents)

Output

['The quick, brown fox jumps over a lazy dog.']

BeautifulSoup - Scraping Link from HTML

While scraping and analysing the content from resources with a website, you are often required to extract all the links that a certain page contains. In this chapter, we shall find out how we can extract links from a HTML document.

HTML has the anchor tag <a> to insert a hyperlink. The href attribute of anchor tag lets you to establish the link. It uses the following syntax −

<a href=="web page URL">hypertext</a>

With the find_all() method we can collect all the anchor tags in a document and then print the value of href attribute of each of them.

In the example below, we extract all the links found on Google's home page. We use requests library to collect the HTML contents of https://google.com, parse it in a soup object, and then collect all <a> tags. Finally, we print href attributes.

Example

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
req = requests.get(url)

soup = BeautifulSoup(req.content, "html.parser")

tags = soup.find_all('a')
links = [tag['href'] for tag in tags]
for link in links:
   print (link)

Here's the partial output when the above program is run −

Output

https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en-IN&authuser=0
https://www.google.com/url?q=https://io.google/2023/%3Futm_source%3Dgoogle-hpp%26utm_medium%3Dembedded_marketing%26utm_campaign%3Dhpp_watch_live%26utm_content%3D&source=hpp&id=19035434&ct=3&usg=AOvVaw0qzqTkP5AEv87NM-MUDd_u&sa=X&ved=0ahUKEwiPzpjku-z-AhU1qJUCHVmqDJoQ8IcBCAU

However, a HTML document may have hyperlinks of different protocol schemes, such as mailto: protocol for link to an email ID, tel: scheme for link to a telephone number, or a link to a local file with file:// URL scheme. In such a case, if we are interested in extracting links with https:// scheme, we can do so by the following example. We have a HTML document that consists of hyperlinks of different types, out of which only ones with https:// prefix are being extracted.

html = '''
<p><a href="https://www.tutorialspoint.com">Web page link </a></p>
<p><a href="https://www.example.com">Web page link </a></p>
<p><a href="mailto:nowhere@mozilla.org">Email link</a></p>
<p><a href="tel:+4733378901">Telephone link</a></p>
'''
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('a')
links = [tag['href'] for tag in tags]
for link in links:
   if link.startswith("https"):
      print (link)

Output

https://www.tutorialspoint.com
https://www.example.com

Beautiful Soup - Get all HTML Tags

Tags in HTML are like keywords in a traditional programming language like Python or Java. Tags have a predefined behaviour according to which the its content is rendered by the browser. With Beautiful Soup, it is possible to collect all the tags in a given HTML document.

The simplest way to obtain a list of tags is to parse the web page into a soup object, and call find_all() methods without any argument. It returns a list generator, giving us a list of all the tags.

Let us extract the list of all tags in Google's homepage.

Example

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
req = requests.get(url)

soup = BeautifulSoup(req.content, "html.parser")

tags = soup.find_all()
print ([tag.name for tag in tags])

Output

['html', 'head', 'meta', 'meta', 'title', 'script', 'style', 'style', 'script', 'body', 'script', 'div', 'div', 'nobr', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'u', 'div', 'nobr', 'span', 'span', 'span', 'a', 'a', 'a', 'div', 'div', 'center', 'br', 'div', 'img', 'br', 'br', 'form', 'table', 'tr', 'td', 'td', 'input', 'input', 'input', 'input', 'input', 'div', 'input', 'br', 'span', 'span', 'input', 'span', 'span', 'input', 'script', 'input', 'td', 'a', 'input', 'script', 'div', 'div', 'br', 'div', 'style', 'div', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'span', 'div', 'div', 'a', 'a', 'a', 'a', 'p', 'a', 'a', 'script', 'script', 'script']

Naturally, you may get such a list where one certain tag may appear more than once. To obtain a list of unique tags (avoiding the duplication), construct a set from the list of tag objects.

Change the print statement in above code to

Example

print ({tag.name for tag in tags})

Output

{'body', 'head', 'p', 'a', 'meta', 'tr', 'nobr', 'script', 'br', 'img', 'b', 'form', 'center', 'span', 'div', 'input', 'u', 'title', 'style', 'td', 'table', 'html'}

To obtain tags with some text associated with them, check the string property and print if it is not None

tags = soup.find_all()
for tag in tags:
   if tag.string is not None:
      print (tag.name, tag.string)

There may be some singleton tags without text but with one or more attributes as in the <img> tag. Following loop constructs lists out such tags.

In the following code, the HTML string is not a complete HTML document in the sense that thr <html> and <body> tags are not given. But the html5lib and lxml parsers add these tags on their own while parsing the document tree. Hence, when we extract the tag list, the additional tags will also be seen.

Example

html = '''
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
<p>This is another paragraph</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")

tags = soup.find_all()
print ({tag.name for tag in tags} )

Output

{'head', 'html', 'p', 'h1', 'body'}

Beautiful Soup - Get Text Inside Tag

There are two types of tags in HTML. Many of the tags are in pairs of opening and closing counterparts. The top level <html> tag having a corresponding closing </html> tag is the main example. Others are <body> and </body>, <p> and </p>, <h1> and </h1> and many more. Other tags are self-closing tags - such as <img> and<a>. The self-closing tags don't have a text as most of the tags with opening and closing symbols (such as <b>Hello</b>). In this chapter, we shall have a look at how can we get the text part inside such tags, with the help of Beautiful Soup library.

There are more than one methods/properties available in Beautiful Soup, with which we can fetch the text associated with a tag object.

Sr.No	Methods & Description
1	text property Get all child strings of a PageElement, concatenated using a separator if specified.
2	string property Convenience property to string from a child element.
3	strings property yields string parts from all the child objects under the current PageElement.
4	stripped_strings property Same as strings property, with the linebreaks and whitespaces removed.
5	get_text() method returns all child strings of this PageElement, concatenated using a separator if specified.

Consider the following HTML document −

<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>

If we retrieve the stripped_string property of each tag in the parsed document tree, we will find that the two div tags and the p tag have two NavigableString objects, Hello and World. The <b> tag embeds world string, while <img> doesn't have a text part.

The following example fetches the text from each of the tags in the given HTML document −

Example

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all():
   print ("Tag: {} attributes: {} ".format(tag.name, tag.attrs))
   for txt in tag.stripped_strings:
      print (txt)
       
   print()

Output

Tag: div attributes: {'id': 'outer'} 
Hello
World

Tag: div attributes: {'id': 'inner'} 
Hello
World

Tag: p attributes: {} 
Hello
World

Tag: b attributes: {} 
World

Tag: img attributes: {'src': 'logo.jpg'}

Beautiful Soup - Find all Headings

In this chapter, we shall explore how to find all heading elements in a HTML document with BeautifulSoup. HTML defines six heading styles from H1 to H6, each with decreasing font size. Suitable tags are used for different page sections, such as main heading, heading for section, topic etc. Let us use the find_all() method in two different ways to extract all the heading elements in a HTML document.

We shall use the following HTML script (saved as index.html) in the code examples in this chapter −

<html>
   <head>
      <title>BeautifulSoup - Scraping Headings</title>
   </head>
   <body>
      <h2>Scraping Headings</h2>
      <b>The quick, brown fox jumps over a lazy dog.</b>
      <h3>Paragraph Heading</h3>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <h3>List heading</h3>
      <ul>
         <li>Junk MTV quiz graced by fox whelps.</li>
         <li>Bawds jog, flick quartz, vex nymphs.</li>
      </ul>
   </body>
</html>

Example 1

In this approach, we collect all the tags in the parsed tree, and check if the name of each tag is found in a list of all heading tags.

from bs4 import BeautifulSoup

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

headings = ['h1','h2','h3', 'h4', 'h5', 'h6']
tags = soup.find_all()
heads = [(tag.name, tag.contents[0]) for tag in tags if tag.name in headings]
print (heads)

Here, headings is a list of all heading styles h1 to h6. If the name of a tag is any of these, the tag and its contents are collected in a lists named heads.

Output

[('h2', 'Scraping Headings'), ('h3', 'Paragraph Heading'), ('h3', 'List heading')]

Example 2

You can pass a regex expression to the find_all() method. Take a look at the following regex.

re.compile('^h[1-6]$')

This regex finds all tags that start with h, have a digit after the h, and then end after the digit. Let use this as an argument to find_all() method in the code below −

from bs4 import BeautifulSoup
import re

fp = open('index.html')

soup = BeautifulSoup(fp, 'html.parser')

tags = soup.find_all(re.compile('^h[1-6]$'))
print (tags)

Output

[<h2>Scraping Headings</h2>, <h3>Paragraph Heading</h3>, <h3>List heading</h3>]

Beautiful Soup - Extract Title Tag

The <title> tag is used to provide a text caption to the page that appears in the browser's title bar. It is not a part of the main content of the web page. The title tag is always present inside the <head> tag.

We can extract the contents of title tag by Beautiful Soup. We parse the HTML tree and obtain the title tag object.

Example

html = '''
<html>
   <head>
      <Title>Python Libraries</title>
   </head>
   <body>
      <p Hello World</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")

title = soup.title
print (title)

Output

<title>Python Libraries</title>

In HTML, we can use title attribute with all tags. The title attribute gives additional information about an element. The information is works as a tooltip text when the mouse hovers over the element.

We can extract the text of title attribute of each tag with following code snippet −

Example

html = '''
<html>
   <body>
      <p title='parsing HTML and XML'>Beautiful Soup</p>
      <p title='HTTP library'>requests</p>
      <p title='URL handling'>urllib</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
tags = soup.find_all()
for tag in tags:
   if tag.has_attr('title'):
      print (tag.attrs['title'])

Output

parsing HTML and XML
HTTP library
URL handling

Beautiful Soup - Extract Email IDs

To Extract Email addresses from a web page is an important application a web scraping library such as BeautifulSoup. In any web page, the Email IDs usually appear in the href attribute of anchor <a> tag. The Email ID is written using mailto URL scheme. Many a times, the Email Address may be present in page content as a normal text (without any hyperlink). In this chapter, we shall use BeautifulSoup library to fetch Email IDs from HTML page, with simple techniques.

A typical usage of Email ID in href attribute is as below −

<a href = "mailto:xyz@abc.com">test link</a>

In the first example, we shall consider the following HTML document for extracting the Email IDs from the hyperlinks −

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li><a href = "mailto:sales@company.com">Sales Enquiries</a></li>
      <li><a href = "mailto:careers@company.com">Careers</a></li>
      <li><a href = "mailto:partner@company.com">Partner with us</a></li>
      </ul>
   </body>
</html>

Here's the Python code that finds the Email Ids. We collect all the <a> tags in the document, and check if the tag has href attribute. If true, the part of its value after 6th character is the email Id.

from bs4 import BeautifulSoup
import re
fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all("a")
for tag in tags:
   if tag.has_attr("href") and tag['href'][:7]=='mailto:':
      print (tag['href'][7:])

For the given HTML document, the Email IDs will be extracted as follows −

sales@company.com
careers@company.com
partner@company.com

In the second example, we assume that the Email IDs appear anywhere in the text. To extract them, we use the regex searching mechanism. Regex is a complex character pattern. Python's re module helps in processing the regex (Regular Expression) patterns. The following regex pattern is used for searching the email address −

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'

For this exercise, we shall use the following HTML document, having Email IDs in <li>tags.

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li>Sales Enquiries: sales@company.com</a></li>
      <li>Careers: careers@company.com</a></li>
      <li>Partner with us: partner@company.com</a></li>
      </ul>
   </body>
</html>

Using the email regex, we'll find the appearance of the pattern in each <li> tag string. Here is the Python code −

Example

from bs4 import BeautifulSoup
import re

def isemail(s):
   pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
   grp=re.findall(pat,s)
   return (grp)

fp = open("contact.html")
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('li')

for tag in tags:
   emails = isemail(tag.string)
   if emails:
      print (emails)

Output

['sales@company.com']
['careers@company.com']
['partner@company.com']

Using the simple techniques described above, we can use BeautifulSoup to extract Email IDs from web pages.

Beautiful Soup - Scrape Nested Tags

The arrangement of tags or elements in a HTML document is hierarchical nature. The tags are nested upto multiple levels. For example, the <head> and <body> tags are nested inside <html> tag. Similarly, one or more <li> tags may be inside a <ul> tag. In this chapter, we shall find out how to scrape a tag that has one or more children tags nested in it.

Let us consider the following HTML document −

<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>

In this case, the two <div> tags and a <p> tag has one or more child elements nested inside. Whereas, the <img> and <b> tag donot have any children tags.

The findChildren() method returns a ResultSet of all the children under a tag. So, if a tag doesn't have any children, the ResultSet will be an empty list like [].

Taking this as a cue, the following code finds out the tags under each tag in the document tree and displays the list.

Example

html = """
   <div id="outer">
      <div id="inner">
         <p>Hello<b>World</b></p>
         <img src='logo.jpg'>
      </div>
   </div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all():
   print ("Tag: {} attributes: {}".format(tag.name, tag.attrs))
   print ("Child tags: ", tag.findChildren())
   print()

Output

Tag: div attributes: {'id': 'outer'}
Child tags:  [<div id="inner">
<p>Hello<b>World</b></p>
<img src="logo.jpg"/>
</div>, <p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: div attributes: {'id': 'inner'}
Child tags:  [<p>Hello<b>World</b></p>, <b>World</b>, <img src="logo.jpg"/>]

Tag: p attributes: {}
Child tags:  [<b>World</b>]

Tag: b attributes: {}
Child tags:  []

Tag: img attributes: {'src': 'logo.jpg'}
Child tags:  []

Beautiful Soup - Parsing Tables

In addition to a textual content, a HTML document may also have a structured data in the form of HTML tables. With Beautiful Soup, we can extract the tabular data in Python objects such as list or dictionary, if required store it in databases or spreadsheets, and perform processing. In this chapter, we shall parse HTML table using Beautiful Soup.

Although Beautiful Soup doesn't any special function or method for extracting table data, we can achieve it by simple scraping techniques. Just like any table, say in SQL or spreadsheet, HTML table consists of rows and columns.

HTML has <table> tag to build a tabular structure. There are one or more nested <tr> tags one each for a row. Each row consists of <td> tags to hold the data in each cell of the row. First row usually is used for column headings, and the headings are placed in <th> tag instead of <td>

Following HTML script renders a simple table on the browser window −

<html>
   <body>
   <h2>Beautiful Soup - Parse Table</h2>
      <table border="1">
         <tr>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr class='data'>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr class='data'>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

Note that, the appearance of data rows is customized with a CSS class data, in order to distinguish it from the header row.

We shall now see how to parse the table data. First, we obtain the document tree in the BeautifulSoup object. Then collect all the column headers in a list.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser")

tbltag = soup.find('table')
headers = []
headings = tbltag.find_all('th')
for h in headings: headers.append(h.string)

The data row tags with class='data' attribute following the header row are then fetched. A dictionary object with column header as key and corresponding value in each cell is formed and appended to a list of dict objects.

rows = tbltag.find_all_next('tr', {'class':'data'})
trows=[]
for i in rows:
   row = {}
   
   data = i.find_all('td')
   n=0
   for j in data: 
      
      row[headers[n]] = j.string
      n+=1
   trows.append(row)

A list of dictionary objects is collected in trows. You can then use it for different purposes such as storing in a SQL table, saving as a JSON or pandas dataframe object.

The complete code is given below −

markup = """
<html>
	<body>
	   <p>Beautiful Soup - Parse Table</p>
		<table>
			<tr>
				<th>Name</th>
				<th>Age</th>
				<th>Marks</th>
			</tr>
			<tr class='data'>
				<td>Ravi</td>
				<td>23</td>
				<td>67</td>
			</tr>
			<tr class='data'>
				<td>Anil</td>
				<td>27</td>
				<td>84</td>
			</tr>
		</table>
	</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser")

tbltag = soup.find('table')
headers = []
headings = tbltag.find_all('th')
for h in headings: headers.append(h.string)
print (headers)

rows = tbltag.find_all_next('tr', {'class':'data'})
trows=[]
for i in rows:
   row = {}
   
   data = i.find_all('td')
   n=0
   for j in data: 
      
      row[headers[n]] = j.string
      n+=1
   trows.append(row)

print (trows)

Output

[{'Name': 'Ravi', 'Age': '23', 'Marks': '67'}, {'Name': 'Anil', 'Age': '27', 'Marks': '84'}]

Beautiful Soup - Selecting nth Child

HTML is characterized by the hierarchical order of tags. For example, the <html> tag encloses <body> tag, inside which there may be a <div> tag further may have <ul> and <li> elements nested respectively. The findChildren() method and .children property both return a ResultSet (list) of all the child tags directly under an element. By traversing the list, you can obtain the child located at a desired position, nth child.

The code below uses the children property of a <div> tag in the HTML document. Since the return type of children property is a list iterator, we shall retrieve a Python list from it. We also need to remove the whitespaces and line breaks from the iterator. Once done, we can fetch the desired child. Here the child element with index 1 of the <div> tag is displayed.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div
children = tag.children
childlist = [child for child in children if child not in ['\n', ' ']]
print (childlist[1])

Output

<p>Python</p>

To use findChildren() method instead of children property, change the statement to

children = tag.findChildren()

There will be no change in the output.

A more efficient approach toward locating nth child is with the select() method. The select() method uses CSS selectors to obtain required PageElements from the current element.

The Soup and Tag objects support CSS selectors through their .css property, which is an interface to the CSS selector API. The selector implementation is handled by the Soup Sieve package, which gets installed along with bs4 package.

The Soup Sieve package defines different types of CSS selectors, namely simple, compound and complex CSS selectors that are made up of one or more type selectors, ID selectors, class selectors. These selectors are defined in CSS language.

There are pseudo class selectors as well in Soup Sieve. A CSS pseudo-class is a keyword added to a selector that specifies a special state of the selected element(s). We shall use :nth-child pseudo class selector in this example. Since we need to select a child from <div> tag at 2^nd position, we shall pass :nthchild(2) to the select_one() method.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

child = tag.select_one(':nth-child(2)')
print (child)

Output

<p>Python</p>

We get the same result as with the findChildren() method. Note that the child numbering starts with 1 and not 0 as in case of a Python list.

Beautiful Soup - Search by text inside a Tag

Beautiful Soup provides different means to search for a certain text in the given HTML document. Here, we use the string argument of the find() method for the purpose.

In the following example, we use the find() method to search for the word 'by'

Example

html = '''
   <p> The quick, brown fox jumps over a lazy dog.</p>
   <p> DJs flock by when MTV ax quiz prog.</p>
   <p> Junk MTV quiz graced by fox whelps.</p>
   <p> Bawds jog, flick quartz, vex nymphs./p>
'''
from bs4 import BeautifulSoup, NavigableString

def search(tag):
   if 'by' in tag.text:
      return True

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', string=search)
print (tag)

Output

<p> DJs flock by when MTV ax quiz prog.</p>
You can find all occurrences of the word with find_all() method
tag = soup.find_all('p', string=search)
print (tag)

Output

[<p> DJs flock by when MTV ax quiz prog.</p>, <p> Junk MTV quiz graced by fox whelps.</p>]

There may be a situation where the required text may be somewhere in a child tag deep inside the document tree. We need to first locate a tag which has no further elements and then check whether the required text is in it.

Example

html = '''
   <p> The quick, brown fox jumps over a lazy dog.</p>
   <p> DJs flock by when MTV ax quiz prog.</p>
   <p> Junk MTV quiz graced by fox whelps.</p>
   <p> Bawds jog, flick quartz, vex nymphs./p>
'''
from bs4 import BeautifulSoup


soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(lambda tag: len(tag.find_all()) == 0 and "by" in tag.text)
for tag in tags:
   print (tag)

Output

<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>

Beautiful Soup - Remove HTML Tags

In this chapter, let us see how we can remove all tags from a HTML document. HTML is a markup language, made up of predefined tags. A tag marks a certain text associated with it so that the browser renders it as per its predefined meaning. For example, the word Hello marked with <b> tag for example <b>Hello</b), is rendered in bold face by the browser.

If we want to filter out the raw text between different tags in a HTML document, we can use any of the two methods - get_text() or extract() in Beautiful Soup library.

The get_text() method collects all the raw text part from the document and returns a string. However, the original document tree is not changed.

In the example below, the get_text() method removes all the HTML tags.

Example

html = '''
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
 DJs flock by when MTV ax quiz prog.
 Junk MTV quiz graced by fox whelps.
 Bawds jog, flick quartz, vex nymphs.

Not that the soup object in the above example still contains the parsed tree of the HTML document.

Another approach is to collect the string enclosed in a Tag object before extracting it from the soup object. In HTML, some tags don't have a string property (we can say that tag.string is None for some tags such as <html> or <body>). So, we concatenate strings from all other tags to obtain the plain text out of the HTML document.

Following program demonstrates this approach.

Example

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()

string=''
for tag in tags:
   #print (tag.name, tag.string)
   if tag.string != None:
      string=string+tag.string+'\n'
   tag.extract()
print ("Document text after removing tags:")
print (string)
print ("Document:")
print (soup)

Output

Document text after removing tags:
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Document:

The clear() method removes the inner string of a tag object but doesn't return it. Similarly the decompose() method destroys the tag as well as all its children elements. Hence, these methods are not suitable to retrieve the plain text from HTML document.

Beautiful Soup - Remove all Styles

This chapter explains how to remove all styles from a HTML document. Cascaded style sheets (CSS) are used to control the appearance of different aspects of a HTML document. It includes styling the rendering of text with a specific font, color, alignment, spacing etc. CSS is applied to HTML tags in different ways.

One is to define different styles in a CSS file and include in the HTML script with the <link> tag in the <head> section in the document. For example,

Example

<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
   . . .
   . . .
   </body>
</html>

The different tags in the body part of the HTML script will use the definitions in mystyle.css file

Another approach is to define the style configuration inside the <head> part of the HTML document itself. Tags in the body part will be rendered by using the definitions provided internally.

Example of internal styling −

<html>
<head>
   <style>
      p {
         text-align: center;
         color: red;
      } 
   </style>
</head>
   <body>
      <p>para1.</p>
      <p id="para1">para2</p>
      <p>para3</p>
   </body>
</html>

In either cases, to remove the styles programmatically, simple remove the head tag from the soup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.head.extract()

Third approach is to define the styles inline by including style attribute in the tag itself. The style attribute may contain one or more style attribute definitions such as color, size etc. For example

<body>
   <h1 style="color:blue;text-align:center;">This is a heading</h1>
   <p style="color:red;">This is a paragraph.</p>
</body>

To remove such inline styles from a HTML document, you need to check if attrs dictionary of a tag object has style key defined in it, and if yes delete the same.

tags=soup.find_all()
for tag in tags:
   if tag.has_attr('style'):
      del tag.attrs['style']
print (soup)

The following code removes the inline styles as well as removes the head tag itself, so that the resultant HTML tree will not have any styles left.

html = '''
<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
      <h1 style="color:blue;text-align:center;">This is a heading</h1>
      <p style="color:red;">This is a paragraph.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.head.extract()

tags=soup.find_all()
for tag in tags:
   if tag.has_attr('style'):
      del tag.attrs['style']
print (soup.prettify())

Output

<html>
 <body>
  <h1>
   This is a heading
  </h1>
  <p>
   This is a paragraph.
  </p>
 </body>
</html>

Beautiful Soup - Remove all Scripts

One of the often used tags in HTML is the <script> tag. It facilitates embedding a client side script such as JavaScript code in HTML. In this chapter, we will use BeautifulSoup to remove script tags from the HTML document.

The <script> tag has a corresponding </script> tag. In between the two, you may include either a reference to an external JavaScript file, or include JavaScript code inline with the HTML script itself.

To include an external Javascript file, the syntax used is −

<head>
   <script src="javascript.js"></script>
</head>

You can then invoke the functions defined in this file from inside HTML.

Instead of referring to an external file, you can put JavaScipt code inside the HTML within the <script> and </script> code. If it is put inside the <head> section of the HTML document, then the functionality is available throughout the document tree. On the other hand, if put anywhere in the <body> section, the JavaScript functions are available from that point on.

<body>
   <p>Hello World</p>
   <script>
      alert("Hello World")
   </script>
</body>

To remove all script tags with Beautiful is easy. You have to collect the list of all script tags from the parsed tree and extract them one by one.

Example

html = '''
<html>
   <head>
      <script src="javascript.js"></scrript>
   </head>
   <body>
      <p>Hello World</p>
      <script>
      alert("Hello World")
      </script>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all('script'):
   tag.extract()

print (soup)

Output

<html>
<head>

</head>
</html>

You can also use the decompose() method instead of extract(), the difference being that that the latter returns the thing that was removed, whereas the former just destroys it. For a more concise code, you may also use list comprehension syntax to achieve the soup object with script tags removed, as follows −

[tag.decompose() for tag in soup.find_all('script')]

Beautiful Soup - Remove Empty Tags

In HTML, many of the tags have an opening and closing tag. Such tags are mostly used for defining the formatting properties, such as <b> and </b>, <h1> and </h1> etc. There are some self-closing tags also which don't have a closing tag and no textual part. For example <img>, <br>, <input> etc. However, while composing HTML, tags such as <p></p> without any text may be inadvertently inserted. We need to remove such empty tags with the help of Beautiful Soup library functions.

Removing textual tags without any text between opening and closing symbols is easy. You can call extract() method on a tag if length of its inner text is 0.

for tag in tags:
   if (len(tag.get_text(strip=True)) == 0):
      tag.extract()

However, this would remove tags such as <hr>, <img>, and <input> etc. These are all self-closing or singleton tags. You would not like to close tags that have one or more attributes even if there is no text associated with it. So, you'll have to check if a tag has any attributes and the get_text() returns none.

In the following example, there are both situations where an empty textual tag and some singleton tags are present in the HTML string. The code retains the tags with attributes but removes ones without any text embedded.

Example

html ='''
<html>
   <body>
      <p>Paragraph</p>
      <embed type="image/jpg" src="Python logo.jpg" width="300" height="200">
      <hr>
      <b></b>
      <p>
      <a href="#">Link</a>
      <ul>
      <li>One</li>
      </ul>
      <input type="text" id="fname" name="fname">
      <img src="img_orange_flowers.jpg" alt="Flowers">
   </body>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tags =soup.find_all()

for tag in tags:
   if (len(tag.get_text(strip=True)) == 0): 
      if len(tag.attrs)==0:
         tag.extract()
print (soup)

Output

<html>
<body>
<p>Paragraph</p>
<embed height="200" src="Python logo.jpg" type="image/jpg" width="300"/>

<p>
<a href="#">Link</a>
<ul>
<li>One</li>
</ul>
<input id="fname" name="fname" type="text"/>
<img alt="Flowers" src="img_orange_flowers.jpg"/>
</p>
</body>
</html>

Note that the original html code has a <p> tag without its enclosing </p>. The parser automatically inserts the closing tag. The position of the closing tag may change if you change the parser to lxml or html5lib.

Beautiful Soup - Remove Child Elements

HTML document is a hierarchical arrangement of different tags, where a tag may have one or more tags nested in it at more than one level. How do we remove the child elements of a certain tag? With BeautifulSoup, it is very easy to do it.

There are two main methods in BeautifulSoup library, to remove a certain tag. The decompose() method and extract() method, the difference being that that the latter returns the thing that was removed, whereas the former just destroys it.

Hence to remove the child elements, call findChildren() method for a given Tag object, and then extract() or decompose() on each.

Consider the following code segment −

soup = BeautifulSoup(fp, "html.parser")
soup.decompose()
print (soup)

This will destroy the entire soup object itself, which is the parsed tree of the document. Obviously, we would not like to do that.

Now the following code −

soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()
for tag in tags:
   for t in tag.findChildren():
      t.extract()

In the document tree, <html> is the first tag, and all other tags are its children, hence it will remove all the tags except <html> and </html> in the first iteration of the loop itself.

More effective use of this can be done if we want to remove the children of a specific tag. For example, you may want to remove the header row of a HTML table.

The following HTML script ha a table with first <tr> element having headers marked by <th> tag.

<html>
   <body>
      <h2>Beautiful Soup - Remove Child Elements</h2>
      <table border="1">
         <tr class='header'>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

We can use the following Python code to remove all the children elements of <tr> tag with <th> cells.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('tr', {'class':'header'})

for tag in tags:
   for t in tag.findChildren():
      t.extract()

print (soup)

Output

<html>
<body>
<h2>Beautiful Soup - Parse Table</h2>
<table border="1">
<tr class="header">

</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

It can be seen that the <th> elements have been removed from the parsed tree

Beautiful Soup - find vs find_all

Beautiful Soup library includes find() as well as find_all() methods. Both methods are one of the most frequently used methods while parsing HTML or XML documents. From a particular document tree You often need to locate a PageElement of a certain tag type, or having certain attributes, or having a certain CSS style etc. These criteria are given as argument to both find() and find_all() methods. The main point of difference between the two is that while find() locates the very first child element that satisfies the criteria, find_all() method searches for all the children elements of the criteria.

The find() method is defined with following syntax −

Syntax

find(name, attrs, recursive, string, **kwargs)

The name argument specifies a filter on tag name. With attrs, a filter on tag attribute values can be set up. The recursive argument forces a recursive search if it is True. You can pass variable kwargs as dictionary of filters on attribute values.

soup.find(id = 'nm')
soup.find(attrs={"name":'marks'})

The find_all() method takes all the arguments as for the find() method, in addition there is a limit argument. It is an integer, restricting the search the specified number of occurrences of the given filter criteria. If not set, find_all() searches for the criteria among all the children under the said PageElement.

soup.find_all('input')
lst=soup.find_all('li', limit =2)

If the limit argument for find_all() method is set to 1, it virtually acts as find() method.

The return type of both the methods differs. The find() method returns either a Tag object or a NavigableString object first found. The find_all() method returns a ResultSet consisting of all the PageElements satisfying the filter criteria.

Here is an example that demonstrates the difference between find and find_all methods.

Example

from bs4 import BeautifulSoup

markup =open("index.html")

soup = BeautifulSoup(markup, 'html.parser')
ret1 = soup.find('input')
ret2 = soup.find_all ('input')
print (ret1, 'Return type of find:', type(ret1))
print (ret2)
print ('Return tyoe find_all:', type(ret2))

#set limit =1
ret3 = soup.find_all ('input', limit=1)
print ('find:', ret1)
print ('find_all:', ret3)

Output

<input id="nm" name="name" type="text"/> Return type of find: <class 'bs4.element.Tag'>
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
Return tyoe find_all: <class 'bs4.element.ResultSet'>
find: <input id="nm" name="name" type="text"/>
find_all: [<input id="nm" name="name" type="text"/>]

Beautiful Soup - Specifying the Parser

A HTML document tree is parsed into an object of BeautifulSoup class. The constructor of this class needs the mandatory argument as the HTML string or a file object pointing to the html file. The constructor has all other optional arguments, important being features.

BeautifulSoup(markup, features)

Here markup is a HTML string or file object. The features parameter specifies the parser to be used. It may be a specific parser such as "lxml", "lxml-xml", "html.parser", or "html5lib; or type of markup to be used ("html", "html5", "xml").

If the features argument is not given, BeautifulSoup chooses the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

You can specify one of the following −

The type of markup you want to parse. Beautiful Soup currently supports are "html", "xml", and "html5".

The name of the parser library to be used. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).

To install lxml or html5lib parser, use the command −

pip3 install lxml
pip3 install html5lib

These parsers have their advantages and disadvantages as shown below −

Parser: Python's html.parser

Usage − BeautifulSoup(markup, "html.parser")

Advantages

Batteries included
Decent speed
Lenient (As of Python 3.2)

Disadvantages

Not as fast as lxml, less lenient than html5lib.

Parser: lxml's HTML parser

Usage − BeautifulSoup(markup, "lxml")

Advantages

Very fast
Lenient

Disadvantages

External C dependency

Parser: lxml's XML parser

Usage − BeautifulSoup(markup, "lxml-xml")

Or BeautifulSoup(markup, "xml")

Advantages

Very fast
The only currently supported XML parser

Disadvantages

External C dependency

Parser: html5lib

Usage − BeautifulSoup(markup, "html5lib")

Advantages

Extremely lenient
Parses pages the same way a web browser does
Creates valid HTML5

Disadvantages

Very slow
External Python dependency

Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here's a short document, parsed as HTML −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b /></a>", "html.parser")
print (soup)

Output

<a><b></b></a>

An empty <b /> tag is not valid HTML. Hence the parser turns it into a <b></b> tag pair.

The same document is now parsed as XML. Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b /></a>", "xml")
print (soup)

Output

<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>

In case of a perfectly-formed HTML document, all HTML parsers result in similar parsed tree though one parser will be faster than another.

However, if HTML document is not perfect, there will be different results by different types of parsers. See how the results differ when "<a></p>" is parsed with different parsers −

lxml parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "lxml")
print (soup)

Output

<html><body><a></a></body></html>

Note that the dangling </p> tag is simply ignored.

html5lib parser

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "html5lib")
print (soup)

Output

<html><head></head><body><a><p></p></a></body></html>

The html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.

Built-in html parser

Example

Built in from bs4 import BeautifulSoup

soup = BeautifulSoup("<a></p>", "html.parser")
print (soup)

Output

<a></a>

This parser also ignores the closing </p> tag. But this parser makes no attempt to create a well-formed HTML document by adding a <body> tag, doesn't even bother to add an <html> tag.

The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way.

Beautiful Soup - Comparing Objects

As per the beautiful soup, two navigable string or tag objects are equal if they represent the same HTML/XML markup.

Now let us see the below example, where the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>Java</b>".

Example

from bs4 import BeautifulSoup
markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
b1 = soup.find('b')
b2 = b1.find_next('b')
print(b1== b2)

print(b1 is b2)

Output

True
False

In the following examples, tow NavigableString objects are compared.

Example

from bs4 import BeautifulSoup
markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
i1 = soup.find('i')
i2 = i1.find_next('i')
print(i1.string== i2.string)

print(i1.string is i2.string)

Output

True
False

Beautiful Soup - Copying Objects

To create a copy of any tag or NavigableString, use copy() function from the copy module from Python's standard library.

Example

from bs4 import BeautifulSoup
import copy

markup = "<p>Learn <b>Python, Java</b>, <i>advanced Python and advanced Java</i>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
i1 = soup.find('i')
icopy = copy.copy(i1)

print (icopy)

Output

<i>advanced Python and advanced Java</i>

Although the two copies (original and copied one) contain the same markup however, the two do not represent the same object.

print (i1 == icopy)
print (i1 is icopy)

Output

True
False

The copied object is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it.

print (icopy.parent)

Output

None

Beautiful Soup - Get Tag Position

The Tag object in Beautiful Soup possesses two useful properties that give the information about its position in the HTML document. They are −

sourceline − line number at which the tag is found

sourcepos − The starting index of the tag in the line in which it is found.

These properties are supported by the html.parser which is Python's in-built parser and html5lib parser. They are not available when you are using lmxl parser.

In the following example, a HTML string is parsed with html.parser and we find the line number and position of <p> tag in the HTML string.

Example

html = '''
<html>
   <body>
      <p>Web frameworks</p>
      <ul>
      <li>Django</li>
      <li>Flask</li>
      </ul>
      <p>GUI frameworks</p>
      <ol>
      <li>Tkinter</li>
      <li>PyQt</li>
      </ol>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')
for p in p_tags:
   print (p.sourceline, p.sourcepos, p.string)

Output

4 0 Web frameworks
9 0 GUI frameworks

For html.parser, these numbers represent the position of the initial less-than sign, which is 0 in this example. It is slightly different when html5lib parser is used.

Example

html = '''
<html>
   <body>
      <p>Web frameworks</p>
      <ul>
      <li>Django</li>
      <li>Flask</li>
      </ul>
      <p>GUI frameworks</p>
      <ol>
      <li>Tkinter</li>
      <li>PyQt</li>
      </ol>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')

li_tags = soup.find_all('li')
for l in li_tags:
   print (l.sourceline, l.sourcepos, l.string)

Output

6 3 Django
7 3 Flask
11 3 Tkinter
12 3 PyQt

When using html5lib, the sourcepos property returns the position of the final greater-than sign.

Beautiful Soup - Encoding

All HTML or XML documents are written in some specific encoding like ASCII or UTF-8. However, when you load that HTML/XML document into BeautifulSoup, it has been converted to Unicode.

Example

from bs4 import BeautifulSoup
markup = "<p>I will display £</p>"
soup = BeautifulSoup(markup, "html.parser")
print (soup.p)
print (soup.p.string)

Output

<p>I will display £</p>
I will display £

Above behavior is because BeautifulSoup internally uses the sub-library called Unicode, Dammit to detect a document's encoding and then convert it into Unicode.

However, not all the time, the Unicode, Dammit guesses correctly. As the document is searched byte-by-byte to guess the encoding, it takes lot of time. You can save some time and avoid mistakes, if you already know the encoding by passing it to the BeautifulSoup constructor as from_encoding.

Below is one example where the BeautifulSoup misidentifies, an ISO-8859-8 document as ISO-8859-7 −

Example

from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print (soup.h1)

print (soup.original_encoding)

Output

<h1>翴檛</h1>
ISO-8859-7

To resolve above issue, pass it to BeautifulSoup using from_encoding −

Example

from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print (soup.h1)

print (soup.original_encoding)

Output

<h1>םולש</h1>
iso-8859-8

Another new feature added from BeautifulSoup 4.4.0 is, exclude_encoding. It can be used, when you don't know the correct encoding but sure that Unicode, Dammit is showing wrong result.

soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])

Output encoding

The output from a BeautifulSoup is UTF-8 document, irrespective of the entered document to BeautifulSoup. Below a document, where the polish characters are there in ISO-8859-2 format.

Example

markup = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
   <HEAD>
      <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=iso-8859-2">
   </HEAD>
   <BODY>
   ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
   </BODY>
</HTML>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print (soup.prettify())

Output

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
      <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
   </head>
   <body>
      ą ć ę ł ń ó ś ź ż Ą Ć Ę Ł Ń Ó Ś Ź Ż
   </body>
</html>

In the above example, if you notice, the <meta> tag has been rewritten to reflect the generated document from BeautifulSoup is now in UTF-8 format.

If you don't want the generated output in UTF-8, you can assign the desired encoding in prettify().

print(soup.prettify("latin-1"))

Output

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="content-type"/>\n </head>\n <body>\n  ą ć ę ł ń \xf3 ś ź ż Ą Ć Ę Ł Ń \xd3 Ś Ź Ż\n </body>\n</html>\n'

In the above example, we have encoded the complete document, however you can encode, any particular element in the soup as if they were a python string −

soup.p.encode("latin-1")
soup.h1.encode("latin-1")

Output

b'<p>My first paragraph.</p>'
b'<h1>My First Heading</h1>'

Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Below is one such example −

markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b
print(tag.encode("utf-8"))

Output

b'<b>\xe2\x98\x83</b>'

If you try to encode the above in "latin-1" or "ascii", it will generate "&#9731", indicating there is no representation for that.

print (tag.encode("latin-1"))
print (tag.encode("ascii"))

Output

b'<b>☃</b>'
b'<b>☃</b>'

Unicode, Dammit

Unicode, Dammit is used mainly when the incoming document is in unknown format (mainly foreign language) and we want to encode in some known format (Unicode) and also we don't need Beautifulsoup to do all this.

Beautiful Soup - Output Formatting

If the HTML string given to BeautifulSoup constructor contains any of the HTML entities, they will be converted to Unicode characters.

An HTML entity is a string that begins with an ampersand ( & ) and ends with a semicolon ( ; ). They are used to display reserved characters (which would otherwise be interpreted as HTML code). Some of the examples of HTML entities are −

<	less than	<	<
>	greater than	>	>
&	ampersand	&	&
"	double quote	"	"
'	single quote	'	'
"	Left Double quote	“	“
"	Right double quote	”	”
£	Pound	£	£
¥	yen	¥	¥
€	euro	€	€
©	copyright	©	©

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&", "<", and ">"

For others, they'll be converted to Unicode characters.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello “World!”", 'html.parser')
print (str(soup))

Output

Hello "World!"

If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back −

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello “World!”", 'html.parser')
print (soup.encode())

Output

b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'

To change this behavior provide a value for the formatter argument to prettify() method. There are following possible values for the formatter.

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML

formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.

formatter="html5" − it's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br"

formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML

Example

from bs4 import BeautifulSoup

french = "<p>Il a dit <<Sacré bleu!>></p>"
soup = BeautifulSoup(french, 'html.parser')
print ("minimal: ")
print(soup.prettify(formatter="minimal"))
print ("html: ")
print(soup.prettify(formatter="html"))
print ("None: ")
print(soup.prettify(formatter=None))

Output

minimal: 
<p>
 Il a dit <<Sacré bleu!>>
</p>

html:
<p>
 Il a dit <<Sacré bleu!>>
</p>

None:
<p>
 Il a dit <<Sacré bleu!>>
</p>

In addition, Beautiful Soup library provides formatter classes. You can pass an object of any of these classes as argument to prettify() method.

HTMLFormatter class − Used to customize the formatting rules for HTML documents.

XMLFormatter class − Used to customize the formatting rules for XML documents.

Beautiful Soup - Pretty Printing

To display the entire parsed tree of an HTML document or the contents of a specific tag, you can use the print() function or call str() function as well.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Hello World</h1>", "lxml")
print ("Tree:",soup)
print ("h1 tag:",str(soup.h1))

Output

Tree: <html><body><h1>Hello World</h1></body></html>
h1 tag: <h1>Hello World</h1>

The str() function returns a string encoded in UTF-8.

To get a nicely formatted Unicode string, use Beautiful Soup's prettify() method. It formats the Beautiful Soup parse tree so that there each tag is on its own separate line with indentation. It allows to you to easily visualize the structure of the Beautiful Soup parse tree.

Consider the following HTML string.

<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method we can better understand its structure −

html = '''
   <p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print (soup.prettify())

Output

<html>
 <body>
  <p>
   The quick,
   <b>
    brown fox
   </b>
   jumps over a lazy dog.
  </p>
 </body>
</html>

You can call prettify() on on any of the Tag objects in the document.

print (soup.b.prettify())

Output

<b>
 brown fox
</b>

The prettify() method is for understanding the structure of the document. However, it should not be used to reformat it, as it adds whitespace (in the form of newlines), and changes the meaning of an HTML document.

He prettify() method can optionally be provided formatter argument to specify the formatting to be used.

Beautiful Soup - NavigableString Class

One of the main objects prevalent in Beautiful Soup API is the object of NavigableString class. It represents the string or text between the opening and closing counterparts of most of the HTML tags. For example, if <b>Hello</b> is the markup to be parsed, Hello is the NavigableString.

NavigableString class is subclassed from the PageElement class in bs4 package, as well as Python's built-in str class. Hence, it inherits the PageElement methods such as find_*(), insert, append, wrap,unwrap methods as well as methods from str class such as upper, lower, find, isalpha etc.

The constructor of this class takes a single argument, a str object.

Example

from bs4 import NavigableString
new_str = NavigableString('world')

You can now use this NavigableString object to perform all kinds of operations on the parsed tree, such as append, insert, find etc.

In the following example, we append the newly created NavigableString object to an existing Tab object.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
new_str = NavigableString('world')
tag.append(new_str)
print (soup)

Output

<b>Helloworld</b>

Note that the NavigableString is a PageElement, hence it can be appended to the Soup object also. Check the difference if we do so.

Example

new_str = NavigableString('world')
soup.append(new_str)
print (soup)

Output

<b>Hello</b>world

As we can see, the string appears after the <b> tag.

Beautiful Soup offers a new_string() method. Create a new NavigableString associated with this BeautifulSoup object.

Let us new_string() method to create a NavigableString object, and add it to the PageElements.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 

ns=soup.new_string(' World')
tag.append(ns)
print (tag)
soup.append(ns)
print (soup)

Output

<b>Hello World</b>
<b>Hello</b> World

We find an interesting behaviour here. The NavigableString object is added to a tag inside the tree, as well as to the soup object itself. While the tag shows the appended string, but in the soup object, the text World is appended, but it doesn't show in the tag. This is because the new_string() method creates a NavigableString associated with the Soup object.

Beautiful Soup - Convert Object to String

The Beautiful Soup API has three main types of objects. The soup object, the Tag object, and the NavigableString object. Let us find out how we can convert each of these object to string. In Python, string is a str object.

Assuming that we have a following HTML document

html = '''
<p>Hello <b>World</b></p>
'''

Let us put this string as argument for BeautifulSoup constructor. The soup object is then typecast to string object with Python's builtin str() function.

The parsed tree of this HTML string will be constructed dpending upon which parser you use. The built-in html parser doesn't add the <html> and <body> tags.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (str(soup))

Output

<p>Hello <b>World</b></p>

On the other hand, the html5lib parser constructs the tree after inserting the formal tags such as <html> and <body>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print (str(soup))

Output

<html><head></head><body><p>Hello <b>World</b></p>
</body></html>

The Tag object has a string property that returns a NavigableString object.

tag = soup.find('b')
obj = (tag.string)
print (type(obj),obj)

Output

string <class 'bs4.element.NavigableString'> World

There is also a Text property defined for Tag object. It returns the text contained in the tag, stripping off all the inner tags and attributes.

If the HTML string is −

html = '''
   <p>Hello <div id='id'>World</div></p>
'''

We try to obtain the text property of <p> tag

tag = soup.find('p')
obj = (tag.text)
print ( type(obj), obj)

Output

<class 'str'> Hello World

You can also use the get_text() method which returns a string representing the text inside the tag. The function is actually a wrapper arounf the text property as it also gets rid of inner tags and attributes, and returns a string

obj = tag.get_text()
print (type(obj),obj)

Output

<class 'str'> Hello World

Beautiful Soup - Convert HTML to Text

One of the important and a frequently required application of a web scraper such as Beautiful Soup library is to extract text from a HTML script. You may need to discard all the tags along with the attributes associated if any with each tag and separate out the raw text in the document. The get_text() method in Beautiful Soup is suitable for this purpose.

Here is a basic example demonstrating the usage of get_text() method. You get all the text from HTML document by removing all the HTML tags.

Example

html = '''
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

The get_text() method has an optional separator argument. In the following example, we specify the separator argument of get_text() method as '#'.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)

Output

#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#

The get_text() method has another argument strip, which can be True or False. Let us check the effect of strip parameter when it is set to True. By default it is False.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)

Output

The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.

Beautiful Soup - Parsing XML

BeautifulSoup can also parse a XML document. You need to pass fatures='xml' argument to Beautiful() constructor.

Assuming that we have the following books.xml in the current working directory −

Example

<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>

The following code parses the given XML file −

from bs4 import BeautifulSoup
fp = open("books.xml")
soup = BeautifulSoup(fp,  features="xml")

print (soup)
print ('type:', type(soup))

When the above code is executed, you should get the following result −

<?xml version="1.0" encoding="utf-8"?>
<books>
<book>
<title>Python</title>
<author>TutorialsPoint</author>
<price>400</price>
</book>
</books>
type: <class 'bs4.BeautifulSoup'>

XML parser Error

By default, BeautifulSoup package parses the documents as HTML, however, it is very easy-to-use and handle ill-formed XML in a very elegant manner using beautifulsoup4.

To parse the document as XML, you need to have lxml parser and you just need to pass the "xml" as the second argument to the Beautifulsoup constructor −

soup = BeautifulSoup(markup, "lxml-xml")

soup = BeautifulSoup(markup, "xml")

One common XML parsing error is −

AttributeError: 'NoneType' object has no attribute 'attrib'

This might happen in case, some element is missing or not defined while using find() or findall() function.

Beautiful Soup - Error Handling

While trying to parse HTML/XML document with Beautiful Soup, you may encounter errors, not from your script but from the structure of the snippet because the BeautifulSoup API throws an error.

By default, BeautifulSoup package parses the documents as HTML, however, it is very easy-to-use and handle ill-formed XML in a very elegant manner using beautifulsoup4.

To parse the document as XML, you need to have lxml parser and you just need to pass the "xml" as the second argument to the Beautifulsoup constructor −

soup = BeautifulSoup(markup, "lxml-xml")

soup = BeautifulSoup(markup, "xml")

One common XML parsing error is −

AttributeError: 'NoneType' object has no attribute 'attrib'

This might happen in case, some element is missing or not defined while using find() or findall() function.

Apart from the above mentioned parsing errors, you may encounter other parsing issues such as environmental issues where your script might work in one operating system but not in another operating system or may work in one virtual environment but not in another virtual environment or may not work outside the virtual environment. All these issues may be because the two environments have different parser libraries available.

It is recommended to know or check your default parser in your current working environment. You can check the current default parser available for the current working environment or else pass explicitly the required parser library as second arguments to the BeautifulSoup constructor.

As the HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. However, if you want to preserve mixed-case or uppercase tags and attributes, then it is better to parse the document as XML.

UnicodeEncodeError

Let us look into below code segment −

Example

soup = BeautifulSoup(response, "html.parser")
   print (soup)

Output

UnicodeEncodeError: 'charmap' codec can't encode character '\u011f'

Above problem may be because of two main situations. You might be trying to print out a unicode character that your console doesn't know how to display. Second, you are trying to write to a file and you pass in a Unicode character that's not supported by your default encoding.

One way to resolve above problem is to encode the response text/character before making the soup to get the desired result, as follows −

responseTxt = response.text.encode('UTF-8')
KeyError: [attr]

It is caused by accessing tag['attr'] when the tag in question doesn't define the attr attribute. Most common errors are: "KeyError: 'href'" and "KeyError: 'class'". Use tag.get('attr') if you are not sure attr is defined.

for item in soup.fetch('a'):
   try:
      if (item['href'].startswith('/') or "tutorialspoint" in item['href']):
      (...)
   except KeyError:
      pass # or some other fallback action

AttributeError

You may encounter AttributeError as follows −

AttributeError: 'list' object has no attribute 'find_all'

The above error mainly occurs because you expected find_all() return a single tag or string. However, soup.find_all returns a python list of elements.

All you need to do is to iterate through the list and catch data from those elements.

To avoid the above errors when parsing a result, that result will be bypassed to make sure that a malformed snippet isn't inserted into the databases −

except(AttributeError, KeyError) as er:
   pass

Beautiful Soup - Trouble Shooting

If you run into problems while trying to parse a HTML/XML document, it is more likely because how the parser in use is interpreting the document. To help you locate and correct the problem, Beautiful Soup API provides a dignose() utility.

The diagnose() method in Beautiful Soup is a diagnostic suite for isolating common problems. If you're facing difficulty in understanding what Beautiful Soup is doing to a document, pass the document as argument to the diagnose() function. A report showing you how different parsers handle the document, and tell you if you're missing a parser.

The diagnose() method is defined in bs4.diagnose module. Its output starts with a message as follows −

Example

diagnose(markup)

Output

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:

If it doesn't find any of these parsers, a corresponding message also appears.

I noticed that html5lib is not installed. Installing it may help.

If the HTML document fed to diagnose() method is perfectly formed, the parsed tree by any of the parsers will be identical. However if it is not properly formed, then different parser interprets differently. If you don't get the tree as you anticipate, changing the parser might help.

Sometimes, you may have chosen HTML parser for a XML document. The HTML parsers add all the HTML tags while parsing the document incorrectly. Looking at the output, you will realize the error and can help in correcting.

If Beautiful Soup raises HTMLParser.HTMLParseError, try and change the parser.

parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.

If you encounter SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = '[document]'), it is caused by running an old Python 2 version of Beautiful Soup under Python 3, without converting the code.

The ImportError with message No module named HTMLParser is because of an old Python 2 version of Beautiful Soup under Python 3.

While, ImportError: No module named html.parser - is caused by running the Python 3 version of Beautiful Soup under Python 2.

If you get ImportError: No module named BeautifulSoup - more often than not, it is because of running Beautiful Soup 3 code on a system that doesn't have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.

Finally, ImportError: No module named bs4 - is due to the fact that you are trying a Beautiful Soup 4 code on a system that doesn't have BS4 installed.

Beautiful Soup - Porting Old Code

You can make the code from earlier version of Beautiful Soup compatible with the lates version by making following change in the import statement −

Example

from BeautifulSoup import BeautifulSoup
#becomes this:

from bs4 import BeautifulSoup

If you get the ImportError "No module named BeautifulSoup", it means you're trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed. Similarly, If you get the ImportError "No module named bs4", because you're trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.

Beautiful Soup 3 used Python's SGMLParser, a module that has been removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can also use lxml or html5lib.

Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for PEP 8 compliance.

Here are a few examples −

replaceWith -> replace_with
findAll -> find_all
findNext -> find_next
findParent -> find_parent
findParents -> find_parents
findPrevious -> find_previous
getText -> get_text
nextSibling -> next_sibling
previousSibling -> previous_sibling

Beautiful Soup - contents Property

Method Description

The contents property is available with the Soup object as well as Tag object. It returns a list everything that is contained inside the object, all the immediate child elements and text nodes (i.e. Navigable String).

Syntax

Tag.contents

Return value

The contents property returns a list of child elements and strings in the Tag/Soup object,.

Example 1

Contents of a tag object −

from bs4 import BeautifulSoup

markup = '''
   <div id="Languages">
      <p>Java</p>
      <p>Python</p>
      <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.div
print (tag.contents)

Output

['\n', <p>Java</p>, '\n', <p>Python</p>, '\n', <p>C++</p>, '\n']

Example 2

Contents of the entire document −

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

print (soup.contents)

Output

['\n', <div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>, '\n']

Example 3

Note that a NavigableString object doesn't have contents property. It throws AttributeError if we try to access the same.

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p
s=tag.contents[0]
print (s.contents)

Output

Traceback (most recent call last):
  File "C:\Users\user\BeautifulSoup\2.py", line 11, in <module>
    print (s.contents)
           ^^^^^^^^^^
  File "C:\Users\user\BeautifulSoup\Lib\site-packages\bs4\element.py", line 984, in __getattr__
    raise AttributeError(
AttributeError: 'NavigableString' object has no attribute 'contents'

Beautiful Soup - children Property

Method Description

The Tag object in Beautiful Soup library has children property. It returns a generator used to iterate over the immediate child elements and text nodes (i.e. Navigable String).

Syntax

Tag.children

Return value

The property returns a generator with which you can iterate over direct children of the PageElement.

Example 1

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div
children = tag.children
for child in children:
   print (child)

Output

<p>Java</p>

<p>Python</p>

<p>C++</p>

Example 2

The soup object too bears the children property.

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

children = soup.children
for child in children:
   print (child)

Output

<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>

Example 3

In the following example, we append NavigableString objects to the <p> Tag and get the list of children.

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
soup.p.extend(['and', 'JavaScript'])
children = soup.p.children
for child in children:
    print (child)

Output

Java
and
JavaScript

Beautiful Soup - string Property

Method Description

In Beautiful Soup, the soup and Tag object has a convenience property - string property. It returns a single string within a PageElement, Soup or Tag. If this element has a single string child, then a NavigableString corresponding to it is returned. If this element has one child tag, return value is the 'string' attribute of the child tag, and if element itself is a string, (with no children), then the string property returns None.

Syntax

Tag.string

Example 1

The following code has the HTML string with a <div> tag that encloses three <p> elements. We find the string property of first <p> tag.

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p

navstr = tag.string
print (navstr, type(navstr))

nav_str = str(navstr)
print (nav_str, type(nav_str))

Output

Java <class 'bs4.element.NavigableString'>
Java <class 'str'>

The string property returns a NavigableString. It can be cast to a regular Python string with str() function

Example 2

The string property of an element with children elements inside, returns None. Check with the <div> tag.

tag = soup.div

navstr = tag.string
print (navstr)

Output

None

Beautiful Soup - strings Property

Method Description

For any PageElement having more than one children, the inner text of each can be fetched by the strings property. Unlike the string property, strings handles the case when the element contains multiple children. The strings property returns a generator object. It yields a sequence of NavigableStrings corresponding to each of the child elements.

Syntax

Tag.strings

Example 1

You can retrieve the value od strings property for soup as well as a tag object. In the following example, the soup object's stings property is checked.

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
print ([string for string in soup.strings])

Output

['\n', '\n', 'Java', ' ', 'Python', ' ', 'C++', '\n', '\n']

Note the line breaks and white spaces in the list.We can remove them with stripped_strings property.

Example 2

We now obtain a generator object returned by the strings property of <div> tag. With a loop, we print the strings.

tag = soup.div

navstrs = tag.strings
for navstr in navstrs:
   print (navstr)

Output

Java
 
Python
 
C++

Note that the line breaks and whiteapces have appeared in the output, which can be removed with stripped_strings property.

Beautiful Soup - stripped_strings Property

Method Description

The stripped_strings property of a Tag/Soup object gives the return similar to strings property, except for the fact that the extra line breaks and whitespaces are stripped off. Hence, it can be said that the stripped_strings property results in a generator of NavigableString objects of the inner elements belonging to the object in use.

Syntax

Tag.stripped_strings

Example 1

In the example below, the strings of all the elements in the document tree parsed in a BeautifulSoup object are displayed after applying the stripping.

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
print ([string for string in soup.stripped_strings])

Output

['Java', 'Python', 'C++']

Compared to the output of strings property, you can see that the line breaks and whitespaces are removed.

Example 2

Here we extract the NavigableStrings of each of the child elements under the <div> tag.

tag = soup.div

navstrs = tag.stripped_strings
for navstr in navstrs:
   print (navstr)

Output

Java
Python
C++

Beautiful Soup - descendants Property

Method Description

With the descendants property of a PageElement object in Beautiful Soup API you can traverse the list of all children under it. This property returns a generator object, with which the children elements can be retrieved in a breadth-first sequence.

While searching a tree structure, the Breadth-first traversal starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next depth level.

Syntax

tag.descendants

Return value

The descendants property returns a generator object.

Example 1

In the code below, we have a HTML document with nested unordered list tags. We scrape through the children elements parsed in breadth-first manner.

html = '''
   <ul id='outer'>
   <li class="mainmenu">Accounts</li>
      <ul>
      <li class="submenu">Anand</li>
      <li class="submenu">Mahesh</li>
      </ul>
   <li class="mainmenu">HR</li>
      <ul>
      <li class="submenu">Anil</li>
      <li class="submenu">Milind</li>
      </ul>
   </ul>
''' 
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('ul', {'id': 'outer'})
tags = soup.descendants
for desc in tags:
   print (desc)

Output

<ul id="outer">
<li class="mainmenu">Accounts</li>
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>
<li class="mainmenu">HR</li>
<ul>
<li class="submenu">Anil</li>
<li class="submenu">Milind</li>
</ul>
</ul>

<li class="mainmenu">Accounts</li>
Accounts
<ul>
<li class="submenu">Anand</li>
<li class="submenu">Mahesh</li>
</ul>

<li class="submenu">Anand</li>
Anand
<li class="submenu">Mahesh</li>
Mahesh

<li class="mainmenu">HR</li>
HR
<ul>
<li class="submenu">Anil</li>
<li class="submenu">Milind</li>
</ul>

<li class="submenu">Anil</li>
Anil
<li class="submenu">Milind</li>
Milind

Example 2

In the following example, we list out the descendants of <head> tag

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p>Hello World</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
   print (element)

Output

<title>TutorialsPoint</title>
TutorialsPoint

Beautiful Soup - parent Property

Method Description

The parent property in BeautifulSoup library returns the immediate parent element of the said PegeElement. The type of the value returned by the parents property is a Tag object. For the BeautifulSoup object, its parent is a document object

Syntax

Element.parent

Return value

The parent property returns a Tag object. For Soup object, it returns document object

Example 1

This example uses .parent property to find the immediate parent element of the first <p> tag in the example HTML string.

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <p>Hello World</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.p
print (tag.parent.name)

Output

body

Example 2

In the following example, we see that the <title> tag is enclosed inside a <head> tag. Hence, the parent property for <title> tag returns the <head> tag.

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <p>Hello World</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
print (tag.parent)

Output

<head><title>TutorialsPoint</title></head>

Example 3

The behaviour of Python's built-in HTML parser is a little different from html5lib and lxml parsers. The built-in parser doesn't try to build a perfect document out of the string provided. It doesn't add additional parent tags like body or html if they don't exist in the string. On the other hand, html5lib and lxml parsers add these tags to make the document a perfect HTML document.

html = """
<p><b>Hello World</b></p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
print (soup.p.parent.name)

soup = BeautifulSoup(html, 'html5lib')
print (soup.p.parent.name)

Output

[document]
Body

As the HTML parser doesn't add additional tags, the parent of parsed soup is document object. However, when we use html5lib, the parent tag's name property is Body.

Beautiful Soup - parents Property

Method Description

The parents property in BeautifulSoup library retrieves all the parent elements of the said PegeElement in a recursive manner. The type of the value returned by the parents property is a generator, with the help of which we can list out the parents in the down-to-up order.

Syntax

Element.parents

Return value

The parents property returns a generator object.

Example 1

This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document. In the following code, we track the parents of the first <p> tag in the example HTML string.

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p>Hello World</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.p
for element in tag.parents:
   print (element.name)

Output

body
html
[document]

Note that the parent to the BeautifulSoup object is [document].

Example 2

In the following example, we see that the <b> tag is enclosed inside a <p> tag. The two div tags above it have an id attribute. We try to print the only those elements having id attribute. The has_attr() method is used for the purpose.

html = """
<div id="outer">
<div id="inner">
<p>Hello<b>World</b></p>
</div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.b
for parent in tag.parents:
   if parent.has_attr("id"):
      print(parent["id"])

Output

inner
outer

Beautiful Soup - next_sibling Property

Method Description

The HTML tags appearing at the same indentation level are called siblings. The next_sibling property of the PageElement returns next tag at the same level, or under the same parent.

Syntax

element.next_sibling

Return type

The next_sibling property returns a PageElement, a Tag or a NavigableString object.

Example 1

The index.html wage page consists of a HTML form with three input elements each with a name attribute. In the following example, the next sibling of an input tag with name attribute as nm is located.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'name':'age'})
print (tag.find_previous())
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'nm'})
sib = tag.next_sibling
print (sib)

Output

<input id="nm" name="name" type="text"/>

Example 2

In the next example, we have a HTML document with a couple of tags inside a <p> tag. The next_sibling property returns the tag next to <b> tag in it.

from bs4 import BeautifulSoup 

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.b 
print ("next:",tag1.next_sibling)

Output

next: <i>Python</i>

Example 3

Consider the HTML string in the following document. It has two <p> tags at the same level. The next_sibling of first <p> should give the second <p> tag's contents.

html = '''
<p><b>Hello</b><i>Python</i></p>
<p>TutorialsPoint</p>
'''
soup = BeautifulSoup(html, 'html.parser')

tag1 = soup.p
print ("next:",tag1.next_sibling)

Output

next:

The blank line after the word next: is unexpected. But that's because of the \n character after the first <p> tag. Change the print statement as shown below to obtain the contents of the next_sibling

tag1 = soup.p
print ("next:",tag1.next_sibling.next_sibling)

Output

next: <p>TutorialsPoint</p>

Beautiful Soup - previous_sibling Property

Method Description

The HTML tags appearing at the same indentation level are called siblings. The previous_sibling property of the PageElement returns a previous tag (a tag appearing before the current tag) at the same level, or under the same parent. This property encapsulates the find_previous_sibling() method.

Syntax

element.previous_sibling

Return type

The previous_sibling property returns a PageElement, a Tag or a NavigableString object.

Example 1

In the following code, the HTML string consists of two adjacent tags inside a <p> tag. It shows the sibling tag for <b> tag appearing before it.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag = soup.i
sibling = tag.previous_sibling
print (sibling)

Output

<b>Hello</b>

Example 2

We are using the index.html file for parsing. The page contains a HTML form with three input elements. Which element is a previous sibling of input element with its id attribute as age? The following code shows it −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find('input', {'id':'age'})
sib = tag.previous_sibling.previous_sibling
print (sib)

Output

<input id="nm" name="name" type="text"/>

Example 3

First we find the <p> tag containing the string 'Tutorial' and then fins a tag previous to it.

html = '''
<p>Excellent</p><p>Python</p><p>Tutorial</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('p', string='Tutorial')
print (tag.previous_sibling)

Output

<p>Python</p>

Beautiful Soup - next_siblings Property

Method Description

The HTML tags appearing at the same indentation level are called siblings. The next_siblings property in Beautiful Soup returns returns a generator object used to iterate over all the subsequent tags and strings under the same parent.

Syntax

element.next_siblings

Return type

The next_siblings property returns a generator of sibling PageElements.

Example 1

In HTML form code in index.html contains three input elements. Following script uses next_siblings property to collect next siblings of an input element wit id attribute as nm

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'id':'nm'})
siblings = tag.next_siblings
print (list(siblings))

Output

['\n', <input id="age" name="age" type="text"/>, '\n', <input id="marks" name="marks" type="text"/>, '\n']

Example 2

Let us use the following HTML snippet for this purpose −

Use the following code to traverse next siblings tags.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.b 
print ("next siblings:")
for tag in tag1.next_siblings:
   print (tag)

Output

next siblings:
<i>Python</i>
<u>Tutorial</u>

Example 3

Next example shows that the <head> tag has only one next sibling in the form of body tag.

html = '''
<html>
   <head>
      <title>Hello</title>
   </head>
   <body>
      <p>Excellent</p><p>Python</p><p>Tutorial</p>
   </body>
   </head>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = soup.head.next_siblings
print ("next siblings:")
for tag in tags:
   print (tag)

Output

next siblings:

<body>
<p>Excellent</p><p>Python</p><p>Tutorial</p>
</body>

The additional lines are because of the linebreaks in the generator.

Beautiful Soup - previous_siblings Property

Method Description

The HTML tags appearing at the same indentation level are called siblings. The previous_siblings property in Beautiful Soup returns returns a generator object used to iterate over all the tags and strings before the current tag, under the same parent. This gives he similar output as find_previous_siblings() method.

Syntax

element.previous_siblings

Return type

The previous_siblings property returns a generator of sibling PageElements.

Example 1

The following example parses the given HTML string that has a few tags embedded inside the outer <p> tag. The previous siblings of the <u> tag are fetched with the help of previous_siblings property.

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.u
print ("previous siblings:")
for tag in tag1.previous_siblings:
   print (tag)

Output

previous siblings:
<i>Python</i>
<b>Excellent</b>

Example 2

In the index.html file used in the following example, there are three input elements in the HTML form. We find out what are the sibling tags previous to the one with id set as marks, and under the <form> tag.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find('input', {'id':'marks'})
sibs = tag.previous_siblings
print ("previous siblings:")
for sib in sibs:
   print (sib)

Output

previous siblings:

<input id="age" name="age" type="text"/>

<input id="nm" name="name" type="text"/>

Example 3

The top level <html> tag always has two sibling tags - head and body. Hence, the <body> tag has only one previous sibling i.e. head, as the following code shows −

html = '''
<html>
   <head>
      <title>Hello</title>
   </head>
   <body>
      <p>Excellent</p><p>Python</p><p>Tutorial</p>
   </body>
   </head>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = soup.body.previous_siblings
print ("previous siblings:")
for tag in tags:
   print (tag)

Output

previous siblings:

<head>
<title>Hello</title>
</head>

Beautiful Soup - next_element Property

Method Description

In Beautiful Soup library, the next_element property returns the Tag or NavigableString that appears immediately next to the current PageElement, even if it is out of the parent tree. There is also a next property which has similar behaviour

Syntax

Element.next_element

Return value

The next_element and next properties return a tag or a NavigableString appearing immediately next to the current tag.

Example 1

In the document tree parsed from the given HTML string, we find the next_element of the <b> tag

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
tag = soup.b 
print (tag)
nxt = tag.next_element
print ("Next:",nxt)

nxt = tag.next_element.next_element
print ("Next:",nxt)

Output

<b>Excellent</b>
Next: Excellent
Next: <p>Python</p>

The output is a little strange as the next element for <b>Excellent</b> is shown to be 'Excellent', that is because the inner string is registered as the next element. To obtain the desired result (<p>Python</p>) as the next element, fetch the next_element property of the inner NavigableString object.

Example 2

The BeautifulSoup PageElements also support next property which is analogous to next_element property

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
tag = soup.b 
print (tag)
nxt = tag.next
print ("Next:",nxt)

nxt = tag.next.next
print ("Next:",nxt)

Output

<b>Excellent</b>
Next: Excellent
Next: <p>Python</p>

Example 3

In the next example, we try to determine the element next to <body> tag. As it is followed by a line break (\n), we need to find the next element of the one next to body tag. It happens to be <h1> tag.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find('body')
nxt = tag.next_element.next
print ("Next:",nxt)

Output

Next: <h1>TutorialsPoint</h1>

Beautiful Soup - previous_element Property

Method Description

In Beautiful Soup library, the previous_element property returns the Tag or NavigableString that appears immediately prior to the current PageElement, even if it is out of the parent tree. There is also a previous property which has similar behaviour

Syntax

Element.previous_element

Return value

The previous_element and previous properties return a tag or a NavigableString appearing immediately before the current tag.

Example 1

In the document tree parsed from the given HTML string, we find the previous_element of the <p id='id1'> tag

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
tag = soup.find('p', id='id1')
print (tag)
pre = tag.previous_element
print ("Previous:",pre)

pre = tag.previous_element.previous_element
print ("Previous:",pre)

Output

<p id="id1">Tutorial</p>
Previous: Python
Previous: <p>Python</p>

The output is a little strange as the previous element for shown to be 'Python, that is because the inner string is registered as the previous element. To obtain the desired result (<p>Python</p>) as the previous element, fetch the previous_element property of the inner NavigableString object.

Example 2

The BeautifulSoup PageElements also supports previous property which is analogous to previous_element property

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
tag = soup.find('p', id='id1')
print (tag)
pre = tag.previous
print ("Previous:",pre)

pre = tag.previous.previous
print ("Previous:",pre)

Output

<p id="id1">Tutorial</p>
Previous: Python
Previous: <p>Python</p>

Example 3

In the next example, we try to determine the element next to <input> tag whose id attribute is 'age'

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html5lib')

tag = soup.find('input', id='age')
pre = tag.previous_element.previous
print ("Previous:",pre)

Output

Previous: <input id="nm" name="name" type="text"/>

Beautiful Soup - next_elements Property

Method Description

In Beautiful Soup library, the next_elements property returns a generator object containing the next strings or tags in the parse tree.

Syntax

Element.next_elements

Return value

The next_elements property returns a generator.

Example 1

The next_elements property returns tags and NavibaleStrings appearing after the <b> tag in the document string below −

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('b')

nexts = tag.next_elements
print ("Next elements:")
for next in nexts:
   print (next)

Output

Next elements:
Excellent
Python
Python
<p id="id1">Tutorial</p>
Tutorial

Example 2

All the elements appearing after the <p> tag are listed below −

from bs4 import BeautifulSoup

html = '''
   <p>
   <b>Excellent</b><i>Python</i>
   </p>
   <u>Tutorial</u>
'''
soup = BeautifulSoup(html, 'html.parser')

tag1 = soup.find('p')
print ("Next elements:")
print (list(tag1.next_elements))

Output

Next elements:
['\n', <b>Excellent</b>, 'Excellent', <i>Python</i>, 'Python', '\n', '\n', <u>Tutorial</u>, 'Tutorial', '\n']

Example 3

The elements next to the input tag present in the HTML form of index.html are listed below −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html5lib')

tag = soup.find('input')
nexts = soup.previous_elements
print ("Next elements:")
for next in nexts:
   print (next)

Output

Next elements:

<input id="age" name="age" type="text"/>

<input id="marks" name="marks" type="text"/>

Beautiful Soup - previous_elements Property

Method Description

In Beautiful Soup library, the previous_elements property returns a generator object containing the previous strings or tags in the parse tree.

Syntax

Element.previous_elements

Return value

The previous_elements property returns a generator.

Example 1

The previous_elements property returns tags and NavibaleStrings appearing before the <p> tag in the document string below −

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', id='id1')

pres = tag.previous_elements
print ("Previous elements:")
for pre in pres:
   print (pre)

Output

Previous elements:
Python
<p>Python</p>
Excellent
<b>Excellent</b>
<p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>

Example 2

All the elements appearing before the <u> tag are listed below −

from bs4 import BeautifulSoup

html = '''
<p>
<b>Excellent</b><i>Python</i>
</p>
<u>Tutorial</u>
'''
soup = BeautifulSoup(html, 'html.parser')

tag1 = soup.find('u')
print ("previous elements:")
print (list(tag1.previous_elements))

Output

previous elements:
['\n', '\n', 'Python', <i>Python</i>, 'Excellent', <b>Excellent</b>, '\n', <p>
<b>Excellent</b><i>Python</i>
</p>, '\n']

Example 3

The BeautifulSoup object itself doesn't have any previous elements −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html5lib')

tag = soup.find('input', id='marks')
pres = soup.previous_elements
print ("Previous elements:")
for pre in pres:
   print (pre.name)

Output

Previous elements:

Beautiful Soup - find() Method

Method Description

The find() method in Beautiful Soup looks for the first Element that matches the given criteria in the children of this PageElement and returns it.

Syntax

Soup.find(name, attrs, recursive, string, **kwargs)

Parameters

name − A filter on tag name.

attrs − A dictionary of filters on attribute values.

recursive − If this is True, find() a recursive search will be performed. Otherwise, only the direct children will be considered.

limit − Stop looking after specified number of occurrences have been found.

kwargs − A dictionary of filters on attribute values.

Return value

The find() method returns Tag object or a NavigableString object

Example 1

Let us use the following HTML script (as index.html) for the purpose

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <form>
      <input type = 'text' id = 'nm' name = 'name'>
      <input type = 'text' id = 'age' name = 'age'>
      <input type = 'text' id = 'marks' name = 'marks'>
      </form>
   </body>
</html>

The following Python code finds the element with its id as nm

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

obj = soup.find(id = 'nm')
print (obj)

Output

<input id="nm" name="name" type="text"/>

Example 2

The find() method returns the first tag in the parsed document that has the given attributes.

obj = soup.find(attrs={"name":'marks'})

Output

<input id="marks" name="marks" type="text"/>

Example 3

If find() can't find anything, it returns None

obj = soup.find('dummy')
print (obj)

Output

None

Beautiful Soup - find_all() Method

Method Description

The find_all() method in Beautiful Soup looks for the elements that match the given criteria in the children of this PageElement and returns a list of all elements.

Syntax

Soup.find_all(name, attrs, recursive, string, **kwargs)

Parameters

name − A filter on tag name.

attrs − A dictionary of filters on attribute values.

recursive − If this is True, find() a recursive search will be performed. Otherwise, only the direct children will be considered.

limit − Stop looking after specified number of occurrences have been found.

kwargs − A dictionary of filters on attribute values.

Return type

The find_all() method returns a ResultSet object which is a list generator.

Example 1

When we can pass in a value for name, Beautiful Soup only considers tags with certain names. Text strings will be ignored, as will tags whose names that don't match. In this example we pass title to find_all() method.

from bs4 import BeautifulSoup
html = open('index.html')
soup = BeautifulSoup(html, 'html.parser')
obj = soup.find_all('input')
print (obj)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Example 2

We shall use following HTML script in this example −

<html>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul id="dept">
      <li>Accounts</li>
         <ul id='acc'>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ol id="HR">
         <li>Rani</li>
         <li>Ankita</li>
         </ol>
      </ul>
   </body>
</html>

We can pass a string to the name argument of find_all() method. With string you can search for strings instead of tags. You can pass in a string, a regular expression, a list, a function, or the value True.

In this example, a function is passed to name argument. All the name starting with 'A' are returned by find_all() method.

from bs4 import BeautifulSoup

def startingwith(ch):
   return ch.startswith('A')

soup = BeautifulSoup(html, 'html.parser')

lst=soup.find_all(string=startingwith)

print (lst)

Output

['Accounts', 'Anand', 'Ankita']

Example 3

In this example, we pass limit=2 argument to find_all() method. The method returns first two appearances of <li> tag.

soup = BeautifulSoup(html, 'html.parser')
lst=soup.find_all('li', limit =2)

print (lst)

Output

[<li>Accounts</li>, <li>Anand</li>]

Beautiful Soup - find_parents() Method

Method Description

The find_parent() method in BeautifulSoup package finds all parents of this Element that matches the given criteria.

Syntax

find_parents( name, attrs, limit, **kwargs)

Parameters

name − A filter on tag name.

attrs − A dictionary of filters on attribute values.

limit − Stop looking after specified number of occurrences have been found.

kwargs − A dictionary of filters on attribute values.

Return Type

The find_parents() method returns a ResultSet consisting of all the parent elements in a reverse order.

Example 1

We shall use following HTML script in this example −

<html>
   <body>
   <h2>Departmentwise Employees</h2>
   <ul id="dept">
   <li>Accounts</li>
      <ul id='acc'>
      <li>Anand</li>
      <li>Mahesh</li>
      </ul>
   <li>HR</li>
      <ol id="HR">
      <li>Rani</li>
      <li>Ankita</li>
      </ol>
   </ul>
   </body>
</html>

Output

ul
body
html
[document]

Note that the name property of BeautifulSoup object always returns [document].

Example 2

In this example, the limit argument is passed to find_parents() method to restrict the parent search to two levels up.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
obj=soup.find('li')
parents=obj.find_parents(limit=2)
for parent in parents:
   print (parent.name)

Output

ul
body

Beautiful Soup - find_parent() Method

Method Description

The find_parent() method in BeautifulSoup package finds the closest parent of this PageElement that matches the given criteria.

Syntax

find_parent( name, attrs, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_parent() method returns Tag object or a NavigableString object.

Example 1

We shall use following HTML script in this example −

<html>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul id="dept">
      <li>Accounts</li>
      <ul id='acc'>
      <li>Anand</li>
      <li>Mahesh</li>
      </ul>
      <li>HR</li>
      <ol id="HR">
      <li>Rani</li>
      <li>Ankita</li>
      </ol>
      </ul>
   </body>
</html>

In the following example, we find the name of the tag that is parent to the string 'HR'.

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser')
obj=soup.find(string='HR')
print (obj.find_parent().name)

Output

li

Example 2

The <body> tag is always enclosed within the top level <html> tag. In the following example, we confirm this fact with find_parent() method −

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
obj=soup.find('body')
print (obj.find_parent().name)

Output

html

Beautiful Soup - find_next_siblings() Method

Method Description

The find_next_siblings() method is similar to next_sibling property. It finds all siblings at the same level of this PageElement that match the given criteria and appear later in the document.

Syntax

find_fnext_siblings(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than tag).
limit − Stop looking after specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Type

The find_next_siblings() method returns a list of Tag objects or a NavigableString objects.

Example 1

Let us use the following HTML snippet for this purpose −

<p>
   <b>
      Excellent
   </b>
   <i>
      Python
   </i>
   <u>
      Tutorial
   </u>
</p>

In the code below, we try to find all the siblings of <b> tag. There are two more tags at the same level in the HTML string used for scraping.

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.find('b')
print ("next siblings:")
for tag in tag1.find_next_siblings():
    print (tag)

Output

The ResultSet of find_next_siblings() is being iterated with the help of for loop.

next siblings:
<i>Python</i>
<u>Tutorial</u>

Example 2

If there are no siblings to be found after a tag, this method returns an empty list.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.find('u')
print ("next siblings:")
print (tag1.find_next_siblings())

Output

next siblings:
[]

Beautiful Soup - find_next_sibling() Method

Method Description

The find_next_sibling() method in Beautiful Soup Find the closest sibling at the same level to this PageElement that matches the given criteria and appears later in the document. This method is similar to next_sibling property.

Syntax

find_fnext_sibling(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − The string to search for (rather than tag).
kwargs − A dictionary of filters on attribute values.

Return Type

The find_next_sibling() method returns Tag object or a NavigableString object.

Example 1

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.find('b')
print ("next:",tag1.find_next_sibling())

Output

next: <i>Python</i>

Example 2

If the next node doesn't exist, the method returns None.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')

tag1 = soup.find('i')
print ("next:",tag1.find_next_sibling())

Output

next: None

Beautiful Soup - find_previous_siblings() Method

Method Description

The find_previous_siblings() method in Beautiful Soup package returns all siblings that appear earlier to this PAgeElement in the document and match the given criteria.

Syntax

find_previous_siblings(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
limit − Stop looking after finding this many results.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous_siblings() method a ResultSet of PageElements.

Example 1

Let us use the following HTML snippet for this purpose −

<p>
   <b>
      Excellent
   </b>
   <i>
      Python
   </i>
   <u>
      Tutorial
   </u>
</p>

In the code below, we try to find all the siblings of <> tag. There are two more tags at the same level in the HTML string used for scraping.

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

tag1 = soup.find('u')
print ("previous siblings:")
for tag in tag1.find_previous_siblings():
   print (tag)

Output

<i>Python</i>
<b>Excellent</b>

Example 2

The web page (index.html) has a HTML form with three input elements. We locate one with id attribute as marks and then find its previous siblings.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find('input', {'id':'marks'})
sibs = tag.find_previous_sibling()
print (sibs)

Output

[<input id="age" name="age" type="text"/>, <input id="nm" name="name" type="text"/>]

Example 3

The HTML string has two <p> tags. We find out the siblings previous to the one with id1 as its id attribute.

html = '''
<p><b>Excellent</b><p>Python</p><p id='id1'>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', id='id1')
ptags = tag.find_previous_siblings()
for ptag in ptags:
   print ("Tag: {}, Text: {}".format(ptag.name, ptag.text))

Output

Tag: p, Text: Python
Tag: b, Text: Excellent

Beautiful Soup - find_previous_sibling() Method

Method Description

The find_previous_sibling() method in Beautiful Soup returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.

Syntax

find_previous_sibling(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous_sibling() method returns a PageElement that could be a Tag or a NavigableString.

Example 1

From the HTML string used in the following example, we find out the previous sibling of <i> tag, having the tag name as 'u'

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup("<p><u>Excellent</u><b>Hello</b><i>Python</i></p>", 'html.parser')
tag = soup.i
sibling = tag.find_previous_sibling('u')
print (sibling)

Output

<u>Excellent</u>

Example 2

The web page (index.html) has a HTML form with three input elements. We locate one with id attribute as marks and then find its previous sibling that had id set to nm.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find('input', {'id':'marks'})
sib = tag.find_previous_sibling(id='nm')
print (sib)

Output

<input id="nm" name="name" type="text"/>

Example 3

In the code below, the HTML string has two <p> elements and a string inside the outer <p> tag. We use find_previous_string() method to search for the NavigableString object sibling of <p>Tutorial</p> tag.

html = '''
<p>Excellent<p>Python</p><p>Tutorial</p></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('p', string='Tutorial')
ptag = tag.find_previous_sibling(string='Excellent')
print (ptag, type(ptag))

Output

Excellent <class 'bs4.element.NavigableString'>

Beautiful Soup - find_all_next() Method

Method Description

The find_all_next() method in Beautiful Soup finds all PageElements that match the given criteria and appear after this element in the document. This method returns tags or NavigableString objects and method takes in the exact same parameters as find_all().

Syntax

find_all_next(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If this is True, find() a recursive search will be performed. Otherwise, only the direct children will be considered.
limit − Stop looking after specified number of occurrences have been found.
kwargs − A dictionary of filters on attribute values.

Return Value

This method returns a ResultSet containing PageElements (Tags or NavigableString objects).

Example 1

Using the index.html as the HTML document for this example, we first locate the <form> tag and collect all the elements after it with find_all_next() method.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.form
tags = tag.find_all_next()
print (tags)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Example 2

Here, we apply a filter to the find_all_next() method to collect all the tags subsequent to <form>, with id being nm or age.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.form
tags = tag.find_all_next(id=['nm', 'age'])
print (tags)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>]

Example 3

If we check the tags following the body tag, it includes a <h1> tag as well as <form> tag, that includes three input elements.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.body
tags = tag.find_all_next()
print (tags)

Output

<h1>TutorialsPoint</h1>
<form>
<input id="nm" name="name" type="text"/>
<input id="age" name="age" type="text"/>
<input id="marks" name="marks" type="text"/>
</form>
<input id="nm" name="name" type="text"/>
<input id="age" name="age" type="text"/>
<input id="marks" name="marks" type="text"/>

Beautiful Soup - find_next() Method

Method Description

The find_next() method in Beautiful soup finds the first PageElement that matches the given criteria and appears later in the document. returns the first tag or NavigableString that comes after the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

find_next(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

This find_next () method returns a Tag or a NavigableString

Example 1

A web page index.html with following script has been used for this example

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h1>TutorialsPoint</h1>
      <form>
         <input type = 'text' id = 'nm' name = 'name'>
         <input type = 'text' id = 'age' name = 'age'>
         <input type = 'text' id = 'marks' name = 'marks'>
      </form>
   </body>
</html>

We first locate the <form> tag and then the one next to it.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.h1
print (tag.find_next())

Output

<form>
<input id="nm" name="name" type="text"/>
<input id="age" name="age" type="text"/>
<input id="marks" name="marks" type="text"/>
</form>

Example 2

In this example, we first locate the <input> tag with its name='age' and obtain its next tag.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.find('input', {'name':'age'})
print (tag.find_next())

Output

<input id="marks" name="marks" type="text"/>

Example 3

The tag next to the <head> tag happens to be <title> tag.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')

tag = soup.head
print (tag.find_next())

Output

<title>TutorialsPoint</title>

Beautiful Soup - find_all_previous() Method

Method Description

The find_all_previous() method in Beautiful Soup look backwards in the document from this PageElement and finds all the PageElements that match the given criteria and appear before the current element. It returns a ResultsSet of PageElements that comes before the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

find_previous(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
limit − Stop looking after finding this many results.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_all_previous() method returns a ResultSet of Tag or NavigableString objects. If the limit parameter is 1, the method is equivalent to find_previous() method.

Example 1

In this example, name property of each object that appears before the first input tag is displayed.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input')
for t in tag.find_all_previous():
   print (t.name)

Output

form
h1
body
title
head
html

Example 2

In the HTML document under consideration (index.html), there are three input elements. With the following code, we print the tag names of all preceding tags before thr <input> tag with nm attribute as marks. To differentiate between the two input tags before it, we also print the attrs property. Note that the other tags don't have any attributes.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'name':'marks'})
pretags = tag.find_all_previous()
for pretag in pretags:
   print (pretag.name, pretag.attrs)

Output

input {'type': 'text', 'id': 'age', 'name': 'age'}
input {'type': 'text', 'id': 'nm', 'name': 'name'}
form {}
h1 {}
body {}
title {}
head {}
html {}

Example 3

The BeautifulSoup object stores the entire document's tree. It doesn't have any previous element, as the example below shows −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all_previous()
print (tags)

Output

[]

Beautiful Soup - find_previous() Method

Method Description

The find_previous() method in Beautiful Soup look backwards in the document from this PageElement and find the first PageElement that matches the given criteria. It returns the first tag or NavigableString that comes before the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

find_previous(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
string − A filter for a NavigableString with specific text.
kwargs − A dictionary of filters on attribute values.

Return Value

The find_previous() method returns a Tag or NavigableString object.

Example 1

In the example below, we try to find which is the previous object before the <body> tag. It happens to be <title> element.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.body
print (tag.find_previous())

Output

<title>TutorialsPoint</title>

Example 2

There are three input elements in the HTML document used in this example. The following code locates the input element with name attribute = age and looks for its previous element.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('input', {'name':'age'})
print (tag.find_previous())

Output

<input id="nm" name="name" type="text"/>

Example 3

The element before <title> happens to be <head> element.

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, 'html.parser')
tag = soup.find('title')
print (tag.find_previous())

Output

<head>
<title>TutorialsPoint</title>
</head>

Beautiful Soup - select() Method

Method Description

In Beautiful Soup library, the select() method is an important tool for scraping the HTML/XML document. Similar to find() and find_*() methods, the select() method also helps in locating an element that satisfies a given criteria. The selection of an element in the document tree is done based on the CSS selector given to it as an argument.

Syntax

select(selector, limit, **kwargs)

Parameters

selector − A string containing a CSS selector.
limit − After finding this number of results, stop looking.
kwargs − Keyword arguments to be passed.

If the limit parameter is set to 1, it becomes equivalent to select_one() method.

Return Value

The select() method returns a ResultSet of Tag objects. The select_one() method returns a single Tag object.

The Soup Sieve library has different types of CSS selectors. The basic CSS selectors are −

Type selectors match elements by node name. For example −

tags = soup.select('div')

The Universal selector (*) matches elements of any type. Example −

tags = soup.select('*')

The ID selector matches an element based on its id attribute. The symbol # denotes the ID selector. Example −

tags = soup.select("#nm")

The class selector matches an element based on the values contained in the class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −

tags = soup.select(".submenu")

Example: Type Selector

from bs4 import BeautifulSoup, NavigableString

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

tags = soup.select('div')
print (tags)

Output

[<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>]

Example: ID selector

from bs4 import BeautifulSoup

html = '''
   <form>
      <input type = 'text' id = 'nm' name = 'name'>
      <input type = 'text' id = 'age' name = 'age'>
      <input type = 'text' id = 'marks' name = 'marks'>
   </form>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.select("#nm")
print (obj)

Output

[<input id="nm" name="name" type="text"/>]

Example: class selector

html = '''
   <ul>
      <li class="mainmenu">Accounts</li>
      <ul>
         <li class="submenu">Anand</li>
         <li class="submenu">Mahesh</li>
      </ul>
      <li class="mainmenu">HR</li>
      <ul>
         <li class="submenu">Rani</li>
         <li class="submenu">Ankita</li>
      </ul>
   </ul>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tags = soup.select(".mainmenu")
print (tags)

Output

[<li class="mainmenu">Accounts</li>, <li class="mainmenu">HR</li>]

Beautiful Soup - append() Method

Method Description

The append() method in Beautiful Soup adds a given string or another tag at the end of the current Tag object's contents. The append() method works similar to the append() method of Python's list object.

Syntax

append(obj)

Parameters

obj − any PageElement, may be a string, a NavigableString object or a Tag object.

Return Type

The append() method doesn't return a new object.

Example 1

In the following example, the HTML script has a <p> tag. With append(), additional text is appended.In the following example, the HTML script has a <p> tag. With append(), additional text is appended.

from bs4 import BeautifulSoup

markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print (soup)
tag = soup.p

tag.append(" World")
print (soup)

Output

<p>Hello</p>
<p>Hello World</p>

Example 2

With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with new_tag() method and then pass it to the append() method.

from bs4 import BeautifulSoup, Tag

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print (soup.prettify())

Output

   <b>
      Hello
   <i>
      World
   </i>
</b>

Example 3

If you have to add a string to the document, you can append a NavigableString object.

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
new_string = NavigableString(" World")
tag.append(new_string)
print (soup.prettify())

Output

<b>
   Hello
   World
</b>

Beautiful Soup - extend() Method

Method Description

The extend() method in Beautiful Soup has been added to Tag class from version 4.7 onwards. It adds all the elements in a list to the tag. This method is analogous to a standard Python List's extend() method - it takes in an array of strings to append to the tag's content.

Syntax

extend(tags)

Parameters

tags − A list of srings or NavigableString objects to be appended.

Return Type

The extend() method doesn't return any new object.

Example

from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b 
vals = ['World.', 'Welcome to ', 'TutorialsPoint']
tag.extend(vals)
print (soup.prettify())

Output

<b>
   Hello
   World.
   Welcome to
   TutorialsPoint
</b>

Beautiful Soup - NavigableString() Method

Method Description

The NavigableString() method in bs4 package is the constructor method for NavigableString class. A NavigableString represents the innermost child element of a parsed document. This method casts a regular Python string to a NavigableString. Conversely, the built-in str() method coverts NavigableString object to a Unicode string.

Syntax

NavigableString(string)

Parameters

string − an object of Python's str class.

Return Value

The NavigableString() method returns a NavigableString object.

Example 1

In the code below, the HTML string contains an empty <b> tag. We add a NavigableString object in it.

html = """
<p><b></b></p>
"""
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html, 'html.parser')
navstr = NavigableString("Hello World")
soup.b.append(navstr)
print (soup)

Output

<p><b>Hello World</b></p>

Example 2

In this example, we see that two NavigableString objects are appended to an empty <b> tag. The tag responds to strings property instead of string property. It is a generator of NavigableString objects.

html = """
<p><b></b></p>
"""
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html, 'html.parser')
navstr = NavigableString("Hello")
soup.b.append(navstr)
navstr = NavigableString("World")
soup.b.append(navstr)
for s in soup.b.strings:
   print (s, type(s))

Output

Hello <class 'bs4.element.NavigableString'>
World <class 'bs4.element.NavigableString'>

Example 3

Instead of strings property, if we access the stripped_strings property of <b> tag object, we get a generator of Unicode strings i.e. str objects.

html = """
<p><b></b></p>
"""
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(html, 'html.parser')
navstr = NavigableString("Hello")
soup.b.append(navstr)
navstr = NavigableString("World")
soup.b.append(navstr)
for s in soup.b.stripped_strings:
   print (s, type(s))

Output

Hello <class 'str'>
World <class 'str'>

Beautiful Soup - new_tag() Method

The new_tag() method in Beautiful Soup library creates a new Tag object, that is associated with an existing BeautifulSoup object. You can use this factory method to append or insert the new tag into the document tree.

Syntax

new_tag(name, namespace, nsprefix, attrs, sourceline, sourcepos, **kwattrs)

Parameters

name − The name of the new Tag.
namespace − The URI of the new Tag's XML namespace, optional.
prefix − The prefix for the new Tag's XML namespace, optional.
attrs − A dictionary of this Tag's attribute values.
sourceline − The line number where this tag was found in its source document.
sourcepos − The character position within `sourceline` where this tag was found.
kwattrs − Keyword arguments for the new Tag's attribute values.

Return Value

This method returns a new Tag object.

Example 1

The following example shows the use of new_tag() method. A new tag for <a> element. The tag object is initialized with the href and string attributes and then inserted in the document tree.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Welcome to <b>online Tutorial library</b></p>', 'html.parser')
tag = soup.new_tag('a')
tag.attrs['href'] = "www.tutorialspoint.com"
tag.string = "Tutorialspoint"
soup.b.insert_before(tag)
print (soup)

Output

<p>Welcome to <a href="www.tutorialspoint.com">Tutorialspoint</a><b>online Tutorial library</b></p>

Example 2

In the following example, we have a HTML form with two input elements. We create a new input tag and append it to the form tag.

html = '''
   <form>
      <input type = 'text' id = 'nm' name = 'name'>
      <input type = 'text' id = 'age' name = 'age'>
   </form>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tag = soup.form
newtag=soup.new_tag('input', attrs={'type':'text', 'id':'marks', 'name':'marks'})
tag.append(newtag)
print (soup)

Output

<form>
<input id="nm" name="name" type="text"/>
<input id="age" name="age" type="text"/>
<input id="marks" name="marks" type="text"/></form>

Example 3

Here we have an empty <p> tag in the HTML string. A new tag is inserted in it.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p></p>', 'html.parser')
tag = soup.new_tag('b')
tag.string = "Hello World"
soup.p.insert(0,tag)
print (soup)

Output

<p><b>Hello World</b></p>

Beautiful Soup - insert() Method

Method Description

The insert() method in Beautiful Soup add an element at the given position in a the list of children of a Tag element. The insert() method in Beautiful Soup behaves similar to insert() on a Python list object.

Syntax

insert(position, child)

Parameters

position − The position at which the new PageElement should be inserted.
child − A PageElement to be inserted.

Return Type

The insert() method doesn't return any new object.

Example 1

In the following example, a new string is added to the <b> tag at position 1. The resultant parsed document shows the result.

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert(1, "Tutorial ")
print (soup.prettify())

Output

<b>
   Excellent
   Tutorial
</b>
<u>
   from TutorialsPoint
</u>

Example 2

In the following example, the insert() method is used to successively insert strings from a list to a <p> tag in HTML markup.

from bs4 import BeautifulSoup, NavigableString

markup = '<p>Excellent Tutorials from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
langs = ['Python', 'Java', 'C']
i=0
for lang in langs:
   i+=1
   tag = soup.new_tag('p')
   tag.string = lang
   soup.p.insert(i, tag)


print (soup.prettify())

Output

<p>
   Excellent Tutorials from TutorialsPoint
   <p>
   Python
   </p>
   <p>
      Java
   </p>
   <p>
      C
   </p>
</p>

Beautiful Soup - insert_before() Method

Method Description

The insert_before() method in Beautiful soup inserts tags or strings immediately before something else in the parse tree. The inserted element becomes the immediate predecessor of this one. The inserted element can be a tag or a string.

Syntax

insert_before(*args)

Parameters

args − One or more elements, may be tag or a string.

Return Value

This insert_before() method doesn't return any new object.

Example 1

The following example inserts a text "Here is an" before "Excellent in the given HTML markup string.

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent</b> Python Tutorial <u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert_before("Here is an ")
print (soup.prettify())

Output

Here is an
<b>
   Excellent
</b>
   Python Tutorial
<u>
   from TutorialsPoint
</u>

Example 2

You can also insert a tag before another tag. Take a look at this example.

from bs4 import BeautifulSoup, NavigableString

markup = '<P>Excellent <b>Tutorial</b> from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag1 = soup.new_tag('b')
tag1.string = "Python "
tag.insert_before(tag1)
print (soup.prettify())

Output

<p>
   Excellent
   <b>
      Python
   </b>
   <b>
      Tutorial
   </b>
   from TutorialsPoint
</p>

Example 3

The following code passes more than one strings to be inserted before the <b> tag.

from bs4 import BeautifulSoup

markup = '<p>There are <b>Tutorials</b> <u>from TutorialsPoint</u></p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert_before("many ", 'excellent ')
print (soup.prettify())

Output

<p>
   There are
   many
   excellent
   <b>
      Tutorials
   </b>
   <u>
      from TutorialsPoint
   </u>
</p>

Beautiful Soup - insert_after() Method

Method Description

The insert_after() method in Beautiful soup inserts tags or strings immediately after something else in the parse tree. The inserted element becomes the immediate successor of this one. The inserted element can be a tag or a string.

Syntax

insert_after(*args)

Parameters

args − One or more elements, may be tag or a string.

Return Value

This insert_after() method doesn't return any new object.

Example 1

Following code inserts a string "Python" after the first <b> tag.

from bs4 import BeautifulSoup

markup = '<p>An <b>Excellent</b> Tutorial <u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert_after("Python ")
print (soup.prettify())

Output

<p>
   An
   <b>
      Excellent
   </b>
   Python
   Tutorial
   <u>
      from TutorialsPoint
   </u>
</p>

Example 2

You can also insert a tag before another tag. Take a look at this example.

from bs4 import BeautifulSoup, NavigableString

markup = '<P>Excellent <b>Tutorial</b> from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
tag1 = soup.new_tag('b')
tag1.string = "on Python "
tag.insert_after(tag1)
print (soup.prettify())

Output

<p>
   Excellent
   <b>
      Tutorial
   </b>
   <b>
      on Python
   </b>
   from TutorialsPoint
</p>

Example 3

Multiple tags or strings can be inserted after a certain tags.

from bs4 import BeautifulSoup, NavigableString

markup = '<P>Excellent <b>Tutorials</b> from TutorialsPoint</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p
tag1 = soup.new_tag('i')
tag1.string = 'and Java'
tag.insert_after("on Python", tag1)
print (soup.prettify())

Output

<p>
   Excellent
   <b>
      Tutorials
   </b>
   from TutorialsPoint
</p>
on Python
<i>
   and Java
</i>

Beautiful Soup - clear() Method

Method Description

The clear() method in Beautiful Soup library removes the inner content of a tag, keeping the tag intact. If there are any child elements, extract() method is called on them. If decompose argument is set to True, then decompose() method is called instead of extract().

Syntax

clear(decompose=False)

Parameters

decompose − If this is True, decompose() (a more destructive method) will be called instead of extract()

Return Value

The clear() method doesn't return any object.

Example 1

As clear() method is called on the soup object that represents the entire document, all the content is removed, leaving the document blank.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.clear()
print(soup)

Output

Example 2

In the following example, we find all the <p> tags and call clear() method on each of them.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('p')
for tag in tags:
   tag.clear() 

print(soup)

Output

Contents of each <p> .. </p> will be removed, the tags will be retained.

<html>
<body>
<p></p>
<p></p>
<p></p>
<p></p>
</body>
</html>

Example 3

Here we clear the contents of <body> tags with decompose argument set to Tue.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tags = soup.find('body')
ret = tags.clear(decompose=True)

print(soup)

Output

<html>
<body></body>
</html>

Beautiful Soup - extract() Method

Method Description

The extract() method in Beautiful Soup library is used to remove a tag or a string from the document tree. The extract() method returns the object that has been removed. It is similar to how a pop() method in Python list works.

Syntax

extract(index)

Parameters

Index − The position of the element to be removed. None by default.

Return Type

The extract() method returns the element that has been removed from the document tree.

Example 1

html = '''
   <div>
      <p>Hello Python</p>
   </div>
'''
from bs4 import BeautifulSoup

soup=BeautifulSoup(html, 'html.parser')
                
tag1 = soup.find("div")
tag2 = tag1.find("p")
ret = tag2.extract()
print ('Extracted:',ret)
print ('original:',soup)

Output

Extracted: <p>Hello Python</p>
original:
<div>
</div>

Example 2

Consider the following HTML markup −

<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs./p>
   </body>
</html>

Here is the code −

from bs4 import BeautifulSoup

fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
   obj = tag.extract()
   print ("Extracted:",obj)

print (soup)

Output

Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>

Example 3

You can also use extract() method along with find_next(), find_previous() methods and next_element, previous_element properties.

html = '''
<div>
<p><b>Hello</b><b>Python</b></p>
</div>
'''
from bs4 import BeautifulSoup

soup=BeautifulSoup(html, 'html.parser')
                
tag1 = soup.find("b")
ret = tag1.next_element.extract()
print ('Extracted:',ret)
print ('original:',soup)

Output

Extracted: Hello
original:
<div>
<p><b></b><b>Python</b></p>
</div>

Beautiful Soup - decompose() Method

Method Description

The decompose() method destroys current element along with its children, thus the element is removed from the tree, wiping it out and everything beneath it. You can check whether an element has been decomposed, by the `decomposed` property. It returns True if destroyed, false otherwise.

Syntax

decompose()

Parameters

No parameters are defined for this method.

Return Type

The method doesn't return any object.

Example 1

When we call descompose() method on the BeautifulSoup object itself, the entire content will be destroyed.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.decompose()
print ("decomposed:",soup.decomposed)
print (soup)

Output

decomposed: True
document: Traceback (most recent call last):
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
TypeError: can only concatenate str (not "NoneType") to str

Since the soup object is decomposed, it returns True, however, you get TypeError as shown above.

Example 2

The code below makes use of decompose() method to remove all the occurrences of <p> tags in the HTML string used.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
p_all = soup.find_all('p')
[p.decompose() for p in p_all]

print ("document:",soup)

Output

Rest of the HTML document after removing all <p> tags will be printed.

document: 
<html>
<body>

</body>
</html>

Example 3

Here, we find the <body> tag from the HTML document tree and decompose the previous element which happens to be the <title> tag. The resultant document tree omits the <title> tag.

html = '''
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      Hello World
   </body>
</html>

'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag = soup.body
tag.find_previous().decompose()

print ("document:",soup)

Output

document: 
<html>
<head>

</head>
<body>
Hello World
</body>
</html>

Beautiful Soup - replace_with() Method

Method Description

Beautiful Soup's replace_with() method replaces a tag or string in an element with the provided tag or string.

Syntax

replace_with(tag/string)

Parameters

The method accepts a tag object or a string as argument.

Return Type

The replace_method doesn't return a new object.

Example 1

In this example, the <p> tag is replaced by <b> with the use of replace_with() method.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('p')
txt = tag1.string
tag2 = soup.new_tag('b')
tag2.string = txt
tag1.replace_with(tag2)
print (soup)

Output

<html>
<body>
<b>The quick, brown fox jumps over a lazy dog.</b>
</body>
</html>

Example 2

You can simply replace the inner text of a tag with another string by calling replace_with() method on the tag.string object.

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('p')
tag1.string.replace_with("DJs flock by when MTV ax quiz prog.")
print (soup)

Output

<html>
<body>
<p>DJs flock by when MTV ax quiz prog.</p>
</body>
</html>

Example 3

The tag object to be used for replacement can be obtained by any of the find() methods. Here, we replace the text of the tag next to <p> tag.

html = '''
<html>
   <body>
      <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('p')
tag1.find_next('b').string.replace_with('black')
print (soup)

Output

<html>
<body>
<p>The quick, <b>black</b> fox jumps over a lazy dog.</p>
</body>
</html>

Beautiful Soup - wrap() Method

Method Description

The wrap() method in Beautiful Soup encloses the element inside another element. You can wrap an existing tag element with another, or wrap the tag's string with a tag.

Syntax

wrap(tag)

Parameters

The tag to be wrapped with.

Return Type

The method returns a new wrapper with the given tag.

Example 1

In this example, the <b> tag is wrapped in <div> tag.

html = '''
<html>
   <body>
      <p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('b')
newtag = soup.new_tag('div')
tag1.wrap(newtag)
print (soup)

Output

<html>
<body>
<p>The quick, <div><b>brown</b></div> fox jumps over a lazy dog.</p>
</body>
</html>

Example 2

We wrap the string inside the <p> tag with a wrapper tag.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>tutorialspoint.com</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))

print (soup)

Output

<p><b>tutorialspoint.com</b></p>

Beautiful Soup - unwrap() Method

Method Description

The unwrap() method is the opposite of wrap() method. It It replaces a tag with whatever's inside that tag. It removes the tag from an element and returns it.

Syntax

unwrap()

Parameters

The method doesn't require any parameter.

Return Type

The unwrap() method returns the tag that has been removed.

Example 1

In the following example, the <b> tag from the html string is removed.

html = '''
<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('b')
newtag = tag1.unwrap()

print (soup)

Output

<p>The quick, brown fox jumps over a lazy dog.</p>

Example 2

The code below prints the returned value of unwrap() method.

html = '''
<p>The quick, <b>brown</b> fox jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tag1 = soup.find('b')
newtag = tag1.unwrap()

print (newtag)

Output

<b></b>

Example 3

The unwrap() method is useful for good for stripping out markup, as the following code shows −

html = '''
<html>
   <body>
      <p>The quick, brown fox jumps over a lazy dog.</p>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
#print (soup.unwrap())
for tag in soup.find_all():
   tag.unwrap()
print (soup)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Beautiful Soup - smooth() Method

Method Description

After calling a bunch of methods that modify the parse tree, you may end up with two or more NavigableString objects next to each other. The smooth() method smooths out this element's children by consolidating consecutive strings. This makes pretty-printed output look more natural following a lot of operations that modified the tree.

Syntax

smooth()

Parameters

This method has no parameters.

Return Type

This method returns the given tag after smoothing.

Example 1

html ='''<html>
<head>
    <title>TutorislsPoint/title>
</head>
<body>
Some Text
    <div></div>
    <p></p>
    <div>Some more text</div>
    <b></b>
    <i></i> # COMMENT
</body>
</html>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.find('body').sm
for item in soup.find_all():
   if not item.get_text(strip=True):
      p = item.parent
      item.replace_with('')
      p.smooth()

print (soup.prettify())

Output

<html>
   <head>
      <title>
         TutorislsPoint/title>
      </title>
   </head>
   <body>
      Some Text
      <div>
         Some more text
      </div>
      # COMMENT
   </body>
</html>

Example 2

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", 'html.parser')
soup.p.append(", World")

soup.smooth()
print (soup.p.contents)

print(soup.p.prettify())

Output

['Hello, World']
<p>
   Hello, World
</p>

Beautiful Soup - prettify() Method

Method Description

Syntax

prettify(encoding, formatter)

Parameters

encoding − The eventual encoding of the string. If this is None, a Unicode string will be returned.
A Formatter object, or a string naming one of the standard formatters.

Return Type

The prettify() method returns a Unicode string (if encoding==None) or a bytestring (otherwise).

Example 1

Consider the following HTML string.

<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>

Using the prettify() method we can better understand its structure −

html = '''
<p>The quick, <b>brown fox</b> jumps over a lazy dog.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print (soup.prettify())

Output

<html>
   <body>
      <p>
         The quick,
      <b>
         brown fox
      </b>
         jumps over a lazy dog.
      </p>
   </body>
</html>

Example 2

You can call prettify() on on any of the Tag objects in the document.

print (soup.b.prettify())

Output

<b>
   brown fox
</b>

He prettify() method can optionally be provided formatter argument to specify the formatting to be used.

There are following possible values for the formatter.

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.

formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.

formatter="html5" − it's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".

formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

Example 3

from bs4 import BeautifulSoup

french = "<p>Il a dit <<Sacré bleu!>></p>"
soup = BeautifulSoup(french, 'html.parser')
print ("minimal: ")
print(soup.prettify(formatter="minimal"))
print ("html: ")
print(soup.prettify(formatter="html"))
print ("None: ")
print(soup.prettify(formatter=None))

Output

minimal: 
<p>
 Il a dit <
 <sacré bleu!="">
  >
 </sacré>
</p>
html: 
<p>
 Il a dit <
 <sacré bleu!="">
  >
 </sacré>
</p>
None: 
<p>
 Il a dit <
 <sacré bleu!="">
  >
 </sacré>
</p>

Beautiful Soup - encode() Method

Method Description

The encode() method in Beautiful Soup renders a bytestring representation of the given PageElement and its contents.

The prettify() method, which allows to you to easily visualize the structure of the Beautiful Soup parse tree, has the encoding argument. The encode() method plays the same role as the encoding in prettify() method has.

Syntax

encode(encoding, indent_level, formatter, errors)

Parameters

encoding − The destination encoding.
indent_level − Each line of the rendering will be
indented this many levels. Used internally in recursive calls while pretty-printing.
formatter − A Formatter object, or a string naming one of the standard formatters.
errors − An error handling strategy.

Return Value

The encode() method returns a byte string representation of the tag and its contents.

Example 1

The encoding parameter is utf-8 by default. Following code shows the encoded byte string representation of the soup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello “World!”", 'html.parser')
print (soup.encode('utf-8'))

Output

b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'

Example 2

The formatter object has the following predefined values −

formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.

formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.

formatter="html5" − it's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br".

formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

In the following example, different formatter values are used as argument for encode() method.

from bs4 import BeautifulSoup

french = "<p>Il a dit <<Sacré bleu!>></p>"
soup = BeautifulSoup(french, 'html.parser')
print ("minimal: ")
print(soup.p.encode(formatter="minimal"))
print ("html: ")
print(soup.p.encode(formatter="html"))
print ("None: ")
print(soup.p.encode(formatter=None))

Output

minimal: 
b'<p>Il a dit <<Sacr\xc3\xa9 bleu!>></p>'
html:
b'<p>Il a dit <<Sacré bleu!>></p>'
None:
b'<p>Il a dit <<Sacr\xc3\xa9 bleu!>></p>'

Example 3

The following example uses Latin-1 as the encoding parameter.

markup = '''
<html>
   <head>
      <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
   </head>
   <body>
      <p>Sacr`e bleu!</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, 'lxml')
print(soup.p.encode("latin-1"))

Output

b'<p>Sacr`e bleu!</p>'

Beautiful Soup - decode() Method

Method Description

The decode() method in Beautiful Soup returns a string or Unicode representation of the parse tree as an HTML or XML document. The method decodes the bytes using the codec registered for encoding. Its function is opposite to that of encode() method. You call encode() to get a bytestring, and decode() to get Unicode. Let us study decode() method with some examples.

Syntax

decode(pretty_print, encoding, formatter, errors)

Parameters

pretty_print − If this is True, indentation will be used to make the document more readable.
encoding − The encoding of the final document. If this is None, the document will be a Unicode string.
formatter − A Formatter object, or a string naming one of the standard formatters.
errors − The error handling scheme to use for the handling of decoding errors. Values are 'strict', 'ignore' and 'replace'.

Return Value

The decode() method returns a Unicode String.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello “World!”", 'html.parser')
enc = soup.encode('utf-8')
print (enc)
dec = enc.decode()
print (dec)

Output

b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
Hello "World!"

Beautiful Soup - get_text() Method

Method Description

The get_text() method returns only the human-readable text from the entire HTML document or a given tag. All the child strings are concatenated by the given separator which is a null string by default.

Syntax

get_text(separator, strip)

Parameters

separator − The child strings will be concatenated using this parameter. By default it is "".
strip − The strings will be stripped before concatenation.

Return Type

The get_Text() method returns a string.

Example 1

In the example below, the get_text() method removes all the HTML tags.

html = '''
<html>
<body>
   <p> The quick, brown fox jumps over a lazy dog.</p>
   <p> DJs flock by when MTV ax quiz prog.</p>
   <p> Junk MTV quiz graced by fox whelps.</p>
   <p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

Output

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

Example 2

In the following example, we specify the separator argument of get_text() method as '#'.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)

Output

#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#

Example 3

Let us check the effect of strip parameter when it is set to True. By default it is False.

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)

Output

The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.

Beautiful Soup - diagnose() Method

Method Description

Syntax

diagnose(data)

Parameters

data − the document string.

Return Value

The diagnose() method prints the result of parsing the given document according all the available parsers.

Example

Let us take this simple document for our exercise −

<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>

The following code runs the diagnostics on the above HTML script −

markup = '''
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''

from bs4.diagnose import diagnose

diagnose(markup)

The diagonose() output starts with a message showing what all parsers are available −

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1

If the document to be diagnosed is a perfect HTML document, the result for all parsers is just about similar. However, in our example, there are many errors.

To begin the built-in html.parser is take up. The report will be as follows −

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
   <h1>
      Hello World
   <b>
      Welcome
   </b>
   <p>
      <b>
         Beautiful Soup
         <i>
            Tutorial
         </i>
         <p>
         </p>
      </b>
   </p>
</h1>

You can see that Python's built-in parser doesn't insert the <html> and <body> tags. The unclosed <h1> tag is provided with matching <h1> at the end.

Both the html5lib and lxml parsers complete the document by wrapping it in <html>, <head> and <body> tags.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
   <head>
   </head>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
         <p>
            <b>
               Beautiful Soup
               <i>
                  Tutorial
               </i>
            </b>
         </p>
         <p>
            <b>
            </b>
         </p>
      </h1>
   </body>
</html>

With lxml parser, note where the closing </h1> is inserted. Also the incomplete <b> tag is rectified, and the dangling </a> is removed.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
      </h1>
      <p>
         <b>
            Beautiful Soup
            <i>
               Tutorial
            </i>
         </b>
      </p>
      <p>
      </p>
   </body>
</html>

The diagnose() method parses the document as XML document also, which probably is superfluous in our case.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <P>
      <b>
         Beautiful Soup
      </b>
      <i>
         Tutorial
      </i>
   <p/>
   </P>
</h1>

Let us give the diagnose() method a XML document instead of HTML document.

<?xml version="1.0" ?>
   <books>
      <book>
         <title>Python</title>
         <author>TutorialsPoint</author>
         <price>400</price>
      </book>
   </books>

Now if we run the diagnostics, even if it's a XML, the html parsers are applied.

Trying to parse your markup with html.parser

Warning (from warnings module):
  File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

With html.parser, a warning message is displayed. With html5lib, the fist line which contains XML version information is commented and rest of the document is parsed as if it is a HTML document.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
   <head>
   </head>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml html parser doesn't insert the comment, but parses it as HTML.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml-xml parser parses the document as XML.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
   <books>
      <book>
         <title>
            Python
         </title>
         <author>
            TutorialsPoint
         </author>
         <price>
            400
         </price>
      </book>
   </books>

The diagnostics report may prove to be useful in finding errors in HTML/XML documents.

Print Page