Parse a website with regex and urllib in Python


Web scraping is a technique for automatically extracting data from websites so that it can be analyzed. Python's ecosystem of modules makes scraping tasks easier, and two standard-library modules commonly used for simple scraping are urllib and re (regular expressions).

urllib is a standard-library package for fetching web content, processing URLs, and sending HTTP requests. It offers a simple way to connect to web servers, open URLs, and retrieve the HTML of web pages. The built-in re module supports regular expressions, which are character sequences that define search patterns.
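As a small illustration of the two modules, the sketch below works on hard-coded strings, so it needs no network access. The sample URL and the HTML snippet are made up for this example; urlsplit() and findall() are standard-library functions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical URL, used only to show how urllib breaks a URL into parts
parts = urlsplit("https://www.example.com/docs/index.htm?lang=en")
print(parts.scheme)   # https
print(parts.netloc)   # www.example.com
print(parts.path)     # /docs/index.htm

# Hypothetical HTML snippet; the pattern captures the text inside each <b> tag
sample_html = "<p><b>first</b> and <b>second</b></p>"
bold_texts = re.findall(r"<b>(.*?)</b>", sample_html)
print(bold_texts)     # ['first', 'second']
```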

This article concentrates on using urllib and re to parse a website and retrieve specific data. We will examine two examples that rely only on these modules and show how to use regular expressions to extract specified data from a webpage's HTML content.

Let us look into both examples −

Parsing a Website With Urllib and Regex for the Title of the Website

In this example, the HTML content of a webpage is obtained with urllib, and a regular expression pattern is defined to capture the text of the <title> tag. Matching the pattern against the HTML text extracts the needed information, which makes this a quick and adaptable solution for straightforward web scraping tasks.

Algorithm

The algorithm to parse a website with regex and urllib in Python is given below −

  • Step 1 − Import the required modules, urllib.request and re.

  • Step 2 − Open the URL with the urlopen() function from urllib.request, and retrieve the HTML content.

  • Step 3 − Define the regular expression pattern for the <title> tag.

  • Step 4 − Search for all occurrences of the pattern.

  • Step 5 − Run a loop and print all the matching titles.
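Steps 1 and 2 can also be wrapped in a small helper. The following is a minimal sketch, not part of the original example: the function name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions made here. It reads the character set declared in the response's Content-Type header instead of relying on the default used by decode().

```python
import urllib.request

def fetch_html(url, timeout=10):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if not declared
        charset = response.headers.get_content_charset()
    # Assumption: fall back to UTF-8 and replace undecodable bytes
    return raw.decode(charset or "utf-8", errors="replace")
```

Calling fetch_html("https://www.tutorialspoint.com/index.htm") would return the same kind of string that retrieved_content holds in the example below. If the server cannot be reached, urlopen() raises urllib.error.URLError, which the caller may want to catch.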

Example

# Import the required modules
import urllib.request
import re

# URL of the page to fetch
link = "https://www.tutorialspoint.com/index.htm"

# Open the URL with urlopen() from urllib.request and read the response;
# the bytes are decoded as UTF-8, the default for decode()
with urllib.request.urlopen(link) as response:
   retrieved_content = response.read().decode()

# Define the regular expression pattern; (.*?) lazily captures the text
# between the opening and closing <title> tags
pattern = r"<title>(.*?)</title>"

# Search for all occurrences of the pattern
matches = re.findall(pattern, retrieved_content)

# Print each extracted title
for match in matches:
   print("Title:", match)

Output

Title: Online Courses and eBooks Library
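The pattern above matches only a lowercase <title> tag without attributes whose text sits on a single line. As a hedged variation, assuming the page may use uppercase tags, attributes, or line breaks inside the title, the flags re.IGNORECASE and re.DOTALL can be added. The function name extract_title and the sample HTML are made up for illustration.

```python
import re

# \b keeps the pattern from matching longer tag names that start with "title";
# [^>]* allows attributes; DOTALL lets "." match newlines inside the title
TITLE_PATTERN = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the first title with whitespace collapsed, or None (hypothetical helper)."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    # Collapse runs of spaces and newlines into single spaces
    return " ".join(match.group(1).split())

sample = "<html><head><TITLE>\n  Online Courses\n  and eBooks Library\n</TITLE></head></html>"
print(extract_title(sample))  # Online Courses and eBooks Library
```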

Parsing a Website With Urllib and Regex for the URLs of the Website

In this example, the HTML content of a webpage is fetched using urllib, and a specific regular expression pattern is defined using re. The pattern is designed to pull specific data items out of the HTML page, here the URLs stored in href attributes. Because the pattern can be changed to suit whatever needs to be extracted, this method is easy to customize.

Algorithm

The algorithm to parse a website with regex and urllib in Python is given below −

  • Step 1 − Import the required modules, urllib.request and re.

  • Step 2 − Open the URL with the urlopen() function from urllib.request, and retrieve the HTML content.

  • Step 3 − Define the regular expression pattern for URLs, using the href attribute of the <a> tag.

  • Step 4 − Search for all occurrences of the pattern.

  • Step 5 − Run a loop and print all the matching URLs.

Example

# Import the required modules
import urllib.request
import re

# URL of the page to fetch
link = "https://www.tutorialspoint.com/index.htm"

# Open the URL with urlopen() from urllib.request and read the response
with urllib.request.urlopen(link) as response:
   retrieved_content = response.read().decode()

# Define the regular expression pattern; single quotes delimit the string
# so that the double quotes around the href value can appear inside it
pattern = r'<a href="(.*?)">'

# Search for all occurrences of the pattern
matches = re.findall(pattern, retrieved_content)

# Print each extracted URL
for match in matches:
   print("URL:", match)

Output

Each URL matched on the page is printed on its own line, prefixed with "URL:".
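The pattern in the example matches only anchors written exactly as <a href="...">, with double quotes and href as the first and only attribute. The sketch below is one possible variation, assuming single-quoted values and extra attributes should also match, and that relative links should be made absolute with urllib.parse.urljoin. The function name extract_links and the sample HTML are hypothetical.

```python
import re
from urllib.parse import urljoin

# Group 1 captures the opening quote, group 2 the href value,
# and \1 requires the same quote character to close it
HREF_PATTERN = re.compile(r"""<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)

def extract_links(html, base_url):
    """Return absolute URLs for every quoted href in an <a> tag (hypothetical helper)."""
    return [urljoin(base_url, m.group(2)) for m in HREF_PATTERN.finditer(html)]

sample = """
<a href="/python/index.htm">Python</a>
<a class='nav' href='about.htm'>About</a>
<A HREF="https://www.example.com/">External</A>
"""
for url in extract_links(sample, "https://www.tutorialspoint.com/index.htm"):
    print("URL:", url)
```

Resolving against the page's own address turns /python/index.htm and about.htm into full URLs, while links that are already absolute pass through unchanged.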

Conclusion

In this article, we looked at two examples of parsing websites in Python using only the urllib and re modules. The first example showed how to use a regular expression to extract a webpage's title. The second example demonstrated how to use a specific regular expression pattern to extract URLs from anchor tags. These methods are a quick solution for straightforward web scraping tasks that extract simple patterns from HTML text. However, regular expressions do not understand HTML structure, so they may not be appropriate for nested or irregular markup, where a more sophisticated method is needed.
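To illustrate that last point, here is a hedged sketch of one case where a simple pattern misleads: a link inside an HTML comment. It compares a regex like the one in the second example with the standard library's html.parser.HTMLParser, which is swapped in here purely as a contrast and is not part of the approach described in this article. The class name LinkCollector and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from real <a> tags (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

sample = '<!-- <a href="old.htm">old</a> --><a href="new.htm">new</a>'

# The regex also matches the link inside the comment
regex_links = re.findall(r'<a href="(.*?)">', sample)
print(regex_links)      # ['old.htm', 'new.htm']

# The parser recognizes the comment and skips its contents
collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # ['new.htm']
```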

Updated on: 18-Oct-2023
