Beautiful Soup - Parsing a Section of a Document



Let's say you want to use Beautiful Soup look at a document's <a> tags only. Normally you would parse the tree and use find_all() method with the required tag as the argument.

soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all('a')

But that would be time consuming as well as it will take up more memory unnecessarily. Instead, you can create an object of SoupStrainer class and use it as value of parse_only argument to BeautifulSoup constructor.

A SoupStrainer tells BeautifulSoup what parts extract, and the parse tree consists of only these elements. If you narrow down your required information to a specific portion of the HTML, this will speed up your search result.

product = SoupStrainer('div',{'id': 'products_list'})
soup = BeautifulSoup(html,parse_only=product)

Above lines of code will parse only the titles from a product site, which might be inside a tag field.

Similarly, like above we can use other soupStrainer objects, to parse specific information from an HTML tag. Below are some of the examples −

Example

from bs4 import BeautifulSoup, SoupStrainer

#Only "a" tags
only_a_tags = SoupStrainer("a")

#Will parse only the below mentioned "ids".
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

#parse only where string length is less than 10
def is_short_string(string):
   return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)

The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, text, and **kwargs.

Note that this feature won't work if you're using the html5lib parser, because the whole document will be parsed in that case, no matter what. Hence, you should use either the inbuilt html.parser or lxml parser.

You can also pass a SoupStrainer into any of the methods covered in Searching the tree.

from bs4 import SoupStrainer

a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(a_tags)
Advertisements