Beautiful Soup - Find Element using CSS Selectors



In the Beautiful Soup library, the select() method is an important tool for scraping HTML/XML documents. Like find() and the other find_*() methods, select() locates elements that satisfy given criteria. The difference lies in how the criteria are expressed: the find_*() methods search for PageElements by tag name and attributes, whereas select() searches the document tree for elements that match a CSS selector.

Beautiful Soup also has a select_one() method. The difference between the two is that select() returns a ResultSet of all the elements below the calling object that match the CSS selector, whereas select_one() returns only the first matching element, or None if nothing matches.
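The difference in return values can be seen in a minimal sketch. The markup and the class name lang are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
markup = '<ul><li class="lang">Java</li><li class="lang">Python</li></ul>'
soup = BeautifulSoup(markup, 'html.parser')

all_items = soup.select('li.lang')       # every match, in document order
first_item = soup.select_one('li.lang')  # the first match only
missing = soup.select_one('li.none')     # no element has this class

print([tag.get_text() for tag in all_items])  # ['Java', 'Python']
print(first_item.get_text())                  # Java
print(missing)                                # None
```

Because select_one() can return None, code that uses its result usually needs to check for None first.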

Prior to Beautiful Soup version 4.7, the select() method supported only the most common CSS selectors. In version 4.7, Beautiful Soup was integrated with the Soup Sieve CSS selector library, so many more selectors can now be used. Version 4.12 added a .css property alongside the existing convenience methods, select() and select_one().

The parameters of the select() method are as follows −

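select(selector, namespaces=None, limit=None, **kwargs)

selector − A string containing a CSS selector.

namespaces − A dictionary mapping the namespace prefixes used in the selector to namespace URIs. By default, Beautiful Soup uses the prefixes it encountered while parsing the document.

limit − After finding this number of results, stop looking.

kwargs − Keyword arguments to be passed on to Soup Sieve's select() method.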

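If the limit parameter is set to 1, select() stops at the first match, just as select_one() does, but the return types still differ. The select() method returns a ResultSet of Tag objects (here with at most one item), whereas select_one() returns a single Tag object, or None if there is no match.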
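The effect of limit can be sketched as follows. The markup is illustrative, and limit is passed by keyword so that it cannot be mistaken for another positional parameter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
markup = '<div><p>Java</p><p>Python</p><p>C++</p></div>'
soup = BeautifulSoup(markup, 'html.parser')

limited = soup.select('p', limit=2)  # stop after two matches
single = soup.select('p', limit=1)   # still a list-like ResultSet
one = soup.select_one('p')           # a single Tag, not a list

print(limited)  # [<p>Java</p>, <p>Python</p>]
print(single)   # [<p>Java</p>]
print(one)      # <p>Java</p>
```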

Soup Sieve Library

Soup Sieve is a CSS selector library. It has been integrated with Beautiful Soup 4, so it is installed along with the Beautiful Soup package. It provides the ability to select, match, and filter the tags of the document tree using modern CSS selectors. Soup Sieve implements most of the selectors from the CSS level 1 through CSS level 4 specifications; a few are not yet implemented.

The Soup Sieve library supports several types of CSS selectors. The basic CSS selectors are −
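Soup Sieve can also be called directly. The following is a minimal sketch of its select, match, and filter functions, assuming the soupsieve package is importable (it normally is when Beautiful Soup 4.7 or later is installed). The markup and the class name new are made up for illustration.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
markup = '<div id="Languages"><p>Java</p><p class="new">Python</p><span>C++</span></div>'
soup = BeautifulSoup(markup, 'html.parser')

# select: collect the descendants of a tag that match the selector
selected = sv.select('p', soup)

# match: test whether a single tag matches the selector
is_languages_div = sv.match('div#Languages', soup.div)

# filter: keep only the matching tags from an iterable of nodes
kept = sv.filter('p.new', soup.div.contents)

print([tag.get_text() for tag in selected])  # ['Java', 'Python']
print(is_languages_div)                      # True
print(kept)                                  # [<p class="new">Python</p>]
```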

Type selector

It matches elements by their node (tag) name. For example −

tags = soup.select('div')

Example

from bs4 import BeautifulSoup

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

# Type selector: match every <div> element
tags = soup.select('div')
print(tags)

Output

[<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>]

Universal selector (*)

It matches elements of any type. Example −

tags = soup.select('*')
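As a minimal sketch with illustrative markup, the universal selector returns every element in document order. It can also be combined with other selectors, for example to pick every direct child of a given element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
markup = '<div id="Languages"><p>Java</p><span>Python</span></div>'
soup = BeautifulSoup(markup, 'html.parser')

# Every element in the document, in document order
all_names = [tag.name for tag in soup.select('*')]

# Every direct child of the element with id "Languages"
child_names = [tag.name for tag in soup.select('#Languages > *')]

print(all_names)    # ['div', 'p', 'span']
print(child_names)  # ['p', 'span']
```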

ID selector

It matches an element based on its id attribute. The symbol # denotes the ID selector. Example −

tags = soup.select("#nm")

Example

from bs4 import BeautifulSoup

html = '''
   <form>
      <input type = 'text' id = 'nm' name = 'name'>
      <input type = 'text' id = 'age' name = 'age'>
      <input type = 'text' id = 'marks' name = 'marks'>
   </form>
'''
soup = BeautifulSoup(html, 'html.parser')
# ID selector: match the element whose id is "nm"
obj = soup.select("#nm")
print(obj)

Output

[<input id="nm" name="name" type="text"/>]

Class selector

It matches an element based on the values contained in the class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −

tags = soup.select(".submenu")

Example

from bs4 import BeautifulSoup

markup = '''
   <div id="Languages">
      <p class="submenu">Java</p> <p class="submenu">Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')

# Class selector: match every element whose class attribute contains "submenu"
tags = soup.select('.submenu')
print(tags)

Output

[<p class="submenu">Java</p>, <p class="submenu">Python</p>]

Attribute Selectors

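The attribute selector matches an element based on its attributes. In its simplest form, [attr], it matches every element that has the given attribute, whatever its value.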

soup.select('[attr]')

Example

from bs4 import BeautifulSoup

html = '''
   <h1>Tutorialspoint Online Library</h1>
   <p><b>It's all Free</b></p>
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a> 
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('[href]'))

Output

[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>]
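Attribute selectors can also test the value of an attribute. The following sketch reuses the links from the example above and shows the standard CSS operators =, ^=, $= and *=. It uses html.parser rather than html5lib only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = '''
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
'''
soup = BeautifulSoup(html, 'html.parser')

def ids(selector):
    """Return the id of every element matched by the selector."""
    return [tag['id'] for tag in soup.select(selector)]

exact = ids('[id="link1"]')              # value equals "link1"
prefix = ids('a[href^="https://"]')      # value starts with "https://"
suffix = ids('a[href$="index.htm"]')     # value ends with "index.htm"
contains = ids('a[href*="java"]')        # value contains "java"

print(exact)     # ['link1']
print(prefix)    # ['link1', 'link2']
print(suffix)    # ['link2']
print(contains)  # ['link1']
```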

Pseudo Classes

The CSS specification defines a number of pseudo-classes. A pseudo-class is a keyword added to a selector to match elements in a particular state or position, beyond what the tag name and attributes alone can express. For example, :link selects a link (every <a> and <area> element with an href attribute) that has not yet been visited. Since a parsed document has no visited state, all such links match.
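A few pseudo-classes can be tried out in a short sketch. The markup is made up for illustration, and the selectors shown (:link, :last-child, :not()) are only a small sample of what Soup Sieve supports.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html = '<div><a href="/java">Java</a><a>No link</a><p>First</p><p>Last</p></div>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.select(':link')            # <a> elements that have an href
last_p = soup.select('p:last-child')    # <p> that is the last child of its parent
no_href = soup.select('a:not([href])')  # <a> elements without an href

print(links)    # [<a href="/java">Java</a>]
print(last_p)   # [<p>Last</p>]
print(no_href)  # [<a>No link</a>]
```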

The pseudo-class selectors nth-of-type and nth-child are very widely used.

:nth-of-type()

The :nth-of-type() selector matches elements of a given type based on their position among the siblings of that same type. The keywords even and odd select the elements at even and odd positions, respectively, within that group of same-type siblings.

In the following example, the second element of the <p> type is selected.

Example

from bs4 import BeautifulSoup

html = '''
<p id="0"></p>
<p id="1"></p>
<span id="2"></span>
<span id="3"></span>
'''
soup = BeautifulSoup(html, 'html5lib')
print(soup.select('p:nth-of-type(2)'))

Output

[<p id="1"></p>]
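The even and odd keywords can be sketched in the same way. The markup below is illustrative: the <span> sits between the <p> elements to show that :nth-of-type() counts only siblings of the same type.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html = '''
<div>
<p id="0"></p>
<span id="a"></span>
<p id="1"></p>
<p id="2"></p>
<p id="3"></p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Positions are counted among the <p> siblings only: 0 -> 1st, 1 -> 2nd, ...
odd_ids = [tag['id'] for tag in soup.select('p:nth-of-type(odd)')]
even_ids = [tag['id'] for tag in soup.select('p:nth-of-type(even)')]

print(odd_ids)   # ['0', '2']
print(even_ids)  # ['1', '3']
```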

:nth-child()

This selector matches elements based on their position in a group of siblings. The keywords even and odd will respectively select elements whose position is either even or odd amongst a group of siblings.

Usage

:nth-child(even)
:nth-child(odd)
:nth-child(2)

Example

from bs4 import BeautifulSoup

markup = '''
   <div id="Languages">
      <p>Java</p> <p>Python</p> <p>C++</p>
   </div>
'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.div

# First descendant of the <div> that is the second child of its parent
child = tag.select_one(':nth-child(2)')
print(child)

Output

<p>Python</p>
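The sketch below, with illustrative markup, shows the even and odd keywords and contrasts :nth-child() with :nth-of-type(). Because :nth-child() counts all element siblings, the <h2> heading shifts the positions of the <p> elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
markup = '''
<div id="Languages">
   <h2>Languages</h2>
   <p>Java</p> <p>Python</p> <p>C++</p>
</div>
'''
soup = BeautifulSoup(markup, 'html.parser')

def texts(selector):
    """Return the text of every element matched by the selector."""
    return [tag.get_text() for tag in soup.select(selector)]

# Child positions inside the <div>: h2=1, Java=2, Python=3, C++=4
print(texts('p:nth-child(even)'))  # ['Java', 'C++']
print(texts('p:nth-child(odd)'))   # ['Python']

# Second child of the parent versus second <p> among the <p> siblings
print(texts('p:nth-child(2)'))     # ['Java']
print(texts('p:nth-of-type(2)'))   # ['Python']
```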