Beautiful Soup - Get Text Inside Tag



HTML has two kinds of tags. Most come in pairs, with an opening tag and a matching closing tag: the top-level <html> tag and its closing </html> tag are the main example, and others include <body> and </body>, <p> and </p>, <h1> and </h1>, and many more. The remaining tags are self-closing (void) tags, such as <img> and <br>. A self-closing tag has no text content, whereas a paired tag usually encloses some text (as in <b>Hello</b>). In this chapter, we shall look at how to get the text inside such tags with the help of the Beautiful Soup library.

Beautiful Soup provides more than one method or property for fetching the text associated with a tag object.

Sr.No. Methods & Description
1 text property

Returns all the child strings of a PageElement, concatenated into a single string. It is equivalent to calling get_text() with no arguments.

2 string property

Convenience property that returns the single string inside a tag. It returns None when the tag has more than one child.

3 strings property

Yields the string parts from all the child objects under the current PageElement.

4 stripped_strings property

Same as the strings property, but with leading and trailing whitespace removed from each string and whitespace-only strings skipped.

5 get_text() method

Returns all child strings of this PageElement, concatenated using a separator if specified.
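To make the differences between these five accessors concrete, the following minimal sketch applies each of them to one small fragment. The markup and the variable names are chosen only for illustration and are not part of the example used later in this chapter.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this comparison.
soup = BeautifulSoup("<p> Hello <b>World</b> </p>", "html.parser")
p = soup.p

text = p.text                                    # all strings joined with no separator
joined = p.get_text(separator="|", strip=True)   # stripped strings joined with "|"
single = p.string                                # None, because <p> has several children
bold = p.b.string                                # <b> has exactly one child string
parts = list(p.strings)                          # every string, whitespace included
clean = list(p.stripped_strings)                 # whitespace trimmed, empty strings skipped

print(repr(text))    # ' Hello World '
print(joined)        # Hello|World
print(single, bold)  # None World
print(parts)         # [' Hello ', 'World', ' ']
print(clean)         # ['Hello', 'World']
```

Note that the string property is the only one that can return None, so it is safest on tags known to hold a single piece of text.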

Consider the following HTML document −

<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>

If we retrieve the stripped_strings property of each tag in the parsed document tree, we will find that the two div tags and the p tag each yield two strings, Hello and World. The <b> tag yields only the World string, while <img> has no text part.
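The claim above can be checked with a short sketch that compares the outer div with the img tag. It assumes the built-in html.parser, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# stripped_strings skips the whitespace-only strings between the tags
outer_strings = list(outer.stripped_strings)
# <img> is a void element, so it has no strings at all
img_strings = list(soup.img.stripped_strings)
# the plain strings property also yields the indentation and newlines
raw = list(outer.strings)

print(outer_strings)                   # ['Hello', 'World']
print(img_strings)                     # []
print(len(raw) > len(outer_strings))   # True
```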

The following example fetches the text from each of the tags in the given HTML document −

Example

from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all() with no arguments returns every tag in the document
for tag in soup.find_all():
   print("Tag: {} attributes: {} ".format(tag.name, tag.attrs))
   # stripped_strings yields each piece of text with surrounding whitespace removed
   for txt in tag.stripped_strings:
      print(txt)

   print()

Output

Tag: div attributes: {'id': 'outer'} 
Hello
World

Tag: div attributes: {'id': 'inner'} 
Hello
World

Tag: p attributes: {} 
Hello
World

Tag: b attributes: {} 
World

Tag: img attributes: {'src': 'logo.jpg'}
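When a single combined string per tag is more convenient than one line per string, the get_text() method can be used with its separator and strip arguments. The sketch below is one possible way to do this; text_by_tag is a hypothetical helper written for this illustration, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""

def text_by_tag(markup):
    """Hypothetical helper: return (tag name, joined text) for every tag."""
    soup = BeautifulSoup(markup, "html.parser")
    # separator=" " puts a space between pieces; strip=True trims each piece
    return [(tag.name, tag.get_text(separator=" ", strip=True))
            for tag in soup.find_all()]

for name, text in text_by_tag(html):
    print("{}: {!r}".format(name, text))
```

Without the separator argument, the text of the p tag would be joined as HelloWorld, because the two strings are adjacent in the markup.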