How to use Python regular expressions to extract a URL from an HTML link?

URL stands for Uniform Resource Locator; it identifies the location of a resource on the internet. For example, the following URLs identify the locations of the Google and Microsoft websites −

A URL consists of a domain name, a path, a port number, etc. A URL can be parsed and processed with regular expressions. To use regular expressions in Python, we import the built-in re module.


Following is an example URL −

If we parse the above URL, we can find the website name and the protocol −
Protocol: https

Regular Expressions

In Python, a regular expression is a search pattern used to find matching strings.

Python's re module provides several functions for working with regular expressions; four commonly used ones are −

  • search() − It finds the first match anywhere in the string.

  • match() − It finds a match only at the beginning of the string.

  • findall() − It finds all non-overlapping matches.

  • sub() − It replaces every substring that matches the pattern with a new string.
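
As a rough illustration of how the four functions listed above differ, the following sketch runs each of them with the same pattern. The sample text, the example.com and example.org addresses, and the simplified pattern are assumptions made for this illustration:

```python
import re

# Hypothetical sample text, not taken from the article.
text = "Visit https://example.com or http://example.org today"
# Simplified URL pattern, assumed here for brevity.
pattern = r"https?://[\w.-]+"

# search(): first match anywhere in the string (a Match object or None).
first = re.search(pattern, text)
print(first.group())  # https://example.com

# match(): succeeds only if the pattern matches at the start of the string.
at_start = re.match(pattern, text)
print(at_start)  # None, because the text starts with "Visit"

# findall(): every non-overlapping match, returned as a list.
all_urls = re.findall(pattern, text)
print(all_urls)  # ['https://example.com', 'http://example.org']

# sub(): replace every match with a new string.
masked = re.sub(pattern, "<URL>", text)
print(masked)  # Visit <URL> or <URL> today
```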

To find every occurrence of a pattern, such as a URL, in a string, we use the re.findall() function of the re module.


Following is the syntax of the re.findall() function in Python −

re.findall(regex, string)

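The above call returns all non-overlapping matches of the pattern in the string as a list of strings. If the pattern contains more than one capture group, each item in the list is a tuple of the captured groups.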
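
The shape of the list returned by re.findall() depends on how many capture groups the pattern has. A minimal sketch, assuming a hypothetical input string and simplified patterns:

```python
import re

# Hypothetical input string used only for illustration.
text = "http://example.com and https://example.org"

# No capture group: each list item is the whole match.
whole = re.findall(r"https?://[\w.-]+", text)

# One capture group: each item is the text captured by that group.
hosts = re.findall(r"https?://([\w.-]+)", text)

# Two capture groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(https?)://([\w.-]+)", text)

print(whole)  # ['http://example.com', 'https://example.org']
print(hosts)  # ['example.com', 'example.org']
print(pairs)  # [('http', 'example.com'), ('https', 'example.org')]
```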


To extract a URL, we can use the following code −

import re
text= '<p>Hello World: </p><a href="">More Courses</a><a href="">Even More Courses</a>'
# Raw string so that the backslashes reach the regex engine unchanged
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
print("Original string: ",text)
print("Urls:", urls)


Following is the output of the above program, when executed.

Original string:  <p>Hello World: </p><a href="">More Courses</a><a href="">Even More Courses</a>
Urls: ['', '']
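
The pattern above finds anything that looks like an http or https URL anywhere in the text. If only the targets of HTML links are wanted, one possible approach is to capture the value of the href attribute instead. The sketch below is an illustration under assumptions: the HTML and its example.com addresses are placeholders, and the pattern expects quoted attribute values, so it may miss unusual markup.

```python
import re

# Hypothetical HTML; the example.com addresses are placeholders.
html = (
    '<p>Hello World: </p>'
    '<a href="https://example.com/courses">More Courses</a>'
    '<a class="nav" href=\'https://example.com/more\'>Even More Courses</a>'
)

# Capture the text between the quotes of an href attribute inside an <a> tag.
href_pattern = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)

links = href_pattern.findall(html)
print(links)  # ['https://example.com/courses', 'https://example.com/more']
```

Because the pattern has a single capture group, findall() returns only the attribute values rather than the whole tags.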


The program below demonstrates how to extract the hostname and protocol from a given URL.

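import re
website = ''
# To find the protocol
object1 = re.findall(r'(\w+)://', website)
# To find the host name (the dot after www is escaped so that it matches literally)
object2 = re.findall(r'://www\.([\w\-\.]+)', website)
print(object1)
print(object2)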


Following is the output of the above program, when executed.


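
As a variation, both values can be captured in one pass with named groups. This is only a sketch: the URL below is a hypothetical placeholder, and the group names protocol and host are chosen for this illustration.

```python
import re

# Hypothetical URL, used here only as a placeholder.
website = "https://www.example.com/index.html"

# Named groups: "protocol" before ://, "host" after an optional "www." prefix.
match = re.search(r"(?P<protocol>\w+)://(?:www\.)?(?P<host>[\w.-]+)", website)

if match:
    print("Protocol:", match.group("protocol"))  # Protocol: https
    print("Hostname:", match.group("host"))      # Hostname: example.com
```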

The following program uses a more general pattern that also captures the path elements of a URL.

import re

# url
url = ''

# Finding all capture groups: protocol, host, file name and extension
result = re.findall(r'(\w+)://([\w\-\.]+)/(\w+)\.(\w+)', url)
print(result)


Following is the output of the above program, when executed.

[('http', '', 'index', 'html')]
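
The pattern above expects exactly one path element of the form name.extension and no port number. A more tolerant sketch, assuming a hypothetical URL and illustrative group names (scheme, host, port, path), could look like this:

```python
import re

# Hypothetical URL with an explicit port and a nested path.
url = "http://www.example.com:8080/docs/index.html"

url_pattern = re.compile(
    r"(?P<scheme>\w+)://"      # protocol, e.g. http
    r"(?P<host>[\w.-]+)"       # host name
    r"(?::(?P<port>\d+))?"     # optional port number
    r"(?P<path>/[^\s?#]*)?"    # optional path, up to a query or fragment
)

m = url_pattern.match(url)
parts = m.groupdict()
print(parts)
# {'scheme': 'http', 'host': 'www.example.com', 'port': '8080', 'path': '/docs/index.html'}

# Split the path into its elements, dropping empty strings.
segments = [s for s in (parts["path"] or "").split("/") if s]
print(segments)  # ['docs', 'index.html']
```

For production code, the standard library's urllib.parse module is usually a safer choice than a hand-written pattern.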

