How can BeautifulSoup be used to extract 'href' links from a website?

BeautifulSoup is a Python library used for web scraping and parsing HTML/XML documents. It provides a simple way to navigate, search, and extract data from web pages, including extracting href attributes from anchor tags.

Installation

Install BeautifulSoup and the requests library using pip:

pip install beautifulsoup4 requests

Basic Syntax for Extracting href Links

The general approach involves fetching the webpage, parsing it with BeautifulSoup, and using find_all('a') to locate anchor tags:

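from bs4 import BeautifulSoup

# html_content would normally come from requests.get(url).text
html_content = '<a href="https://example.com">Example</a>'

soup = BeautifulSoup(html_content, "html.parser")  # parse the HTML
links = soup.find_all('a')                          # every <a> tag in the document
for link in links:
    href_value = link.get('href')                   # None if the tag has no href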

Example: Extracting All href Links

Here's how to extract all href attributes from a webpage:

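from bs4 import BeautifulSoup
import requests

url = "https://httpbin.org/html"
response = requests.get(url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()               # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, "html.parser")

print("All href links:")
for link in soup.find_all('a'):
    href = link.get('href')  # None when the tag has no href attribute
    if href:
        print(href)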
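
The output depends on what the page contains when it is fetched. If the page has no anchor tags, only the header line is printed. For a page whose anchors point to the following targets, the script prints:

All href links:
https://httpbin.org
/get
/post
/forms/post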
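
Relative values such as /get cannot be requested on their own; they have to be resolved against the URL of the page they came from. The following is a minimal sketch of that step using urljoin from the standard library. It assumes the page URL from the example above and reuses the href values from the sample output; the helper name to_absolute is hypothetical.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


# Assumed inputs: the page URL and the hrefs shown in the sample output.
base_url = "https://httpbin.org/html"
hrefs = ["https://httpbin.org", "/get", "/post", "/forms/post"]

for absolute in to_absolute(base_url, hrefs):
    print(absolute)
```

Values that are already absolute pass through unchanged, so the same helper can be applied to every href without checking its form first.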

Filtering Specific Links

You can filter links based on specific criteria, such as keeping only absolute http/https links, which are typically external:

from bs4 import BeautifulSoup

# Sample HTML content for demonstration
html_content = """
<html>
    <body>
        <a href="https://example.com">External Link</a>
        <a href="/internal-page">Internal Link</a>
        <a href="mailto:test@example.com">Email Link</a>
        <a href="#section1">Anchor Link</a>
    </body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

print("External links only:")
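for link in soup.find_all('a'):
    href = link.get('href')
    # Absolute http(s) URLs are treated as external here; an absolute
    # link back to the same site would also match this check.
    if href and href.startswith(('http://', 'https://')):
        print(f"Link: {href}, Text: {link.text}")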

This prints:

External links only:
Link: https://example.com, Text: External Link
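
A prefix check cannot tell an external link from an absolute link back to the same site, and it says nothing about the links it skips. As a rough sketch of a finer-grained filter, the code below parses each href with urlparse from the standard library and labels it. The function name classify and the host name mysite.example are assumptions made for this illustration, not part of the original example.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Assumption for this sketch: the page being scraped is served from this host.
SITE_HOST = "mysite.example"


def classify(href, site_host=SITE_HOST):
    """Return a coarse link type based on the parsed href."""
    parts = urlparse(href)
    if parts.scheme in ("http", "https"):
        return "internal" if parts.netloc == site_host else "external"
    if parts.scheme:
        return parts.scheme      # non-web schemes, e.g. "mailto"
    if href.startswith("#"):
        return "fragment"        # jump target within the same page
    return "internal"            # relative path on the same site


html_content = """
<a href="https://example.com">External Link</a>
<a href="/internal-page">Internal Link</a>
<a href="mailto:test@example.com">Email Link</a>
<a href="#section1">Anchor Link</a>
"""

soup = BeautifulSoup(html_content, "html.parser")
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(f"{classify(href):9} {href}")
```

With the sample HTML, the four links are labelled external, internal, mailto, and fragment respectively.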

Extracting Links with Additional Information

You can extract the href attribute together with the link text and other attributes such as title:

from bs4 import BeautifulSoup

html_content = """
<html>
    <body>
        <a href="https://python.org" title="Python Official">Python Website</a>
        <a href="/docs" class="internal">Documentation</a>
        <a href="https://github.com">GitHub</a>
    </body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

print("Links with details:")
for link in soup.find_all('a'):
    href = link.get('href')
    text = link.text.strip()
    title = link.get('title', 'No title')
    
    if href:
        print(f"URL: {href}")
        print(f"Text: {text}")
        print(f"Title: {title}")
        print("---")

This prints:

Links with details:
URL: https://python.org
Text: Python Website
Title: Python Official
---
URL: /docs
Text: Documentation
Title: No title
---
URL: https://github.com
Text: GitHub
Title: No title
---
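
The loop above prints each link as it goes. When the results feed into further processing, it can be more convenient to collect them into a list of dictionaries. The sketch below does that and passes href=True to find_all so that anchors without an href attribute are skipped up front. The sample HTML and the variable name records are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML; the last anchor has no href on purpose.
html_content = """
<a href="https://python.org" title="Python Official">Python Website</a>
<a href="/docs" class="internal">Documentation</a>
<a name="top">No href here</a>
"""

soup = BeautifulSoup(html_content, "html.parser")

# href=True matches only <a> tags that carry an href attribute,
# so link["href"] is safe inside the comprehension.
records = [
    {
        "url": link["href"],
        "text": link.get_text(strip=True),
        "title": link.get("title"),  # None when the attribute is absent
    }
    for link in soup.find_all("a", href=True)
]

for record in records:
    print(record)
```

Storing None for a missing title, rather than a placeholder string, leaves it to the caller to decide how absent values are displayed.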

How It Works

  • requests.get(url) fetches the webpage content
  • BeautifulSoup(content, "html.parser") parses the HTML
  • soup.find_all('a') finds all anchor tags
  • link.get('href') extracts the href attribute value
  • Filter conditions can be applied to get specific types of links
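
The steps above can be sketched as two small functions, one for parsing and one for fetching, so the parsing part can be tried without network access. The function names extract_hrefs and fetch_hrefs and the 10-second timeout are choices made for this illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_hrefs(html):
    """Parse HTML and return the href value of every anchor that has one."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if href:  # skips anchors with a missing or empty href
            hrefs.append(href)
    return hrefs


def fetch_hrefs(url, timeout=10):
    """Fetch a page and return its hrefs; raises on HTTP error responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_hrefs(response.text)


# Parsing can be checked on an inline snippet without touching the network.
sample = '<p><a href="/a">A</a> <a>no href</a> <a href="https://example.com">B</a></p>'
print(extract_hrefs(sample))
```

Calling fetch_hrefs with a page URL performs the request and returns the same kind of list for a live page.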

Conclusion

BeautifulSoup makes extracting href links straightforward using find_all('a') and get('href'). You can filter results based on link types and extract additional attributes like text and titles for comprehensive link analysis.

Updated on: 2026-03-25T15:13:55+05:30
