Beautiful Soup - Scrape HTML Content



The process of extracting data from websites is called web scraping. A web page may contain URLs, email addresses, images or other content, which can be stored in a file or database. Searching a website manually is a cumbersome process. Various web scraping tools automate it.

Websites sometimes restrict web scraping through a 'robots.txt' file. Some popular sites provide APIs to access their data in a structured way. Unethical web scraping may result in your IP address being blocked.

Python is widely used for web scraping. The Python standard library includes the urllib package, which can be used to fetch HTML pages. Since urllib is bundled with the standard library, it need not be installed.

The urllib package is an HTTP client for the Python programming language. The urllib.request module is useful when we want to open and read URLs. Other modules in the urllib package are −

  • urllib.error defines the exceptions raised by the urllib.request module.

  • urllib.parse is used for parsing URLs.

  • urllib.robotparser is used for parsing robots.txt files.
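
The helper modules listed above can be tried out without any network access. The following is a minimal sketch: the robots.txt rules and the www.example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components
parts = urlparse("https://www.tutorialspoint.com/index.htm")
print(parts.scheme, parts.netloc, parts.path)

# urllib.parse: resolve a relative link against the page it was found on
absolute = urljoin("https://www.tutorialspoint.com/index.htm", "python/index.htm")
print(absolute)

# urllib.robotparser: check URLs against robots.txt rules.
# The rules are passed in directly here (hypothetical content);
# for a live site, set_url() and read() would download the real file.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rules.can_fetch("*", "https://www.example.com/index.html")
blocked = rules.can_fetch("*", "https://www.example.com/private/data.html")
print(allowed, blocked)
```

Checking can_fetch() before requesting a page is one simple way to respect the restrictions a site publishes in its robots.txt file.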

Use the urlopen() function in the urllib.request module to read the content of a web page.

import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
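
Note that read() returns bytes, and that a request can fail. As a rough sketch, the call can be wrapped in a small helper that decodes the body and handles the exceptions from urllib.error. The function name fetch_html and the 10-second timeout are illustrative choices, not part of urllib.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404
        print(f"Server returned HTTP {err.code} for {url}")
    except urllib.error.URLError as err:
        # The server could not be reached, or the URL is not usable
        print(f"Could not reach {url}: {err.reason}")
    return None
```

Calling fetch_html('http://python.org/') would then give the page as a string, or None if something went wrong. HTTPError is caught first because it is a subclass of URLError.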

You can also use the requests library for this purpose. It must be installed before use.

pip3 install requests

In the code below, the homepage of http://www.tutorialspoint.com is fetched −

from bs4 import BeautifulSoup
import requests

# Fetch the page; the HTML is then available as req.text
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

The content obtained by either of the above two methods is then parsed with Beautiful Soup.
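
To give an idea of the parsing step, here is a small sketch. It assumes the beautifulsoup4 package is installed, and it uses a hard-coded HTML string (made up for this example) in place of a downloaded page, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A small stand-in for the HTML returned by urlopen() or requests.get()
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <a href="https://www.example.com/one">First</a>
    <a href="/two">Second</a>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python; no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag
page_title = soup.title.string

# The href attribute of every <a> tag that has one
links = [a["href"] for a in soup.find_all("a", href=True)]

print(page_title)
print(links)
```

With a real page, the html string would be replaced by the response content, for example req.text when using requests.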
