How to use regular expressions (Regex) to filter valid emails in a Pandas series?
A regular expression is a sequence of characters that defines a search pattern. In this article, we will use a regular expression to separate valid from invalid emails in a Pandas series.
We will define a Pandas series containing several email addresses and check which ones are valid using Python's re library for regex operations.
Email Validation Regex Pattern
The regex pattern used in this article for email validation has the following components:
- ^: Anchor for the start of the string
- [a-z0-9]+: One or more lowercase letters or digits
- [\._]?: Optional dot or underscore character
- [a-z0-9]+: One or more lowercase letters or digits (together with the first group, this means the part before @ needs at least two characters)
- [@]: Required @ symbol
- \w+: One or more word characters for the domain name
- [.]: Literal dot character
- \w{2,3}: 2 to 3 word characters for the domain extension
- $: Anchor for the end of the string
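The components listed above can be written out as a commented pattern. The following is a minimal sketch using re.VERBOSE, which lets each part of the pattern carry its own comment; the names EMAIL_PATTERN, samples, and results are illustrative and do not come from the original example.

```python
import re

# The article's pattern, spread over several lines with re.VERBOSE so that
# every component can be annotated. Whitespace and comments are ignored.
EMAIL_PATTERN = re.compile(r"""
    ^               # start of the string
    [a-z0-9]+       # one or more lowercase letters or digits
    [\._]?          # optional dot or underscore
    [a-z0-9]+       # one or more lowercase letters or digits
    [@]             # required @ symbol
    \w+             # domain name: one or more word characters
    [.]             # literal dot
    \w{2,3}         # domain extension: 2 to 3 word characters
    $               # end of the string
""", re.VERBOSE)

# Illustrative inputs that probe the limits of the pattern.
samples = ["user@domain.org", "User@domain.org", "a@b.com", "test123@yahoo.co.uk"]
results = {s: bool(EMAIL_PATTERN.search(s)) for s in samples}
print(results)
```

Under this pattern, addresses with uppercase letters, a single-character local part, or a multi-part extension such as .co.uk are rejected, so it is best read as a simple teaching pattern rather than a complete email validator.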
Example
Let's create a Pandas series with email addresses and filter them using regex:
import pandas as pd
import re
# Create a series with different email addresses
emails = pd.Series(['jimmyadams123@gmail.com', 'hellowolrd.com', 'user@domain.org', 'invalid.email', 'test123@yahoo.co.uk'])
# Define regex pattern for email validation
regex_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
print("Email Validation Results:")
print("-" * 40)
for email in emails:
    if re.search(regex_pattern, email):
        print(f"{email}: Valid Email")
    else:
        print(f"{email}: Invalid Email")
Email Validation Results:
----------------------------------------
jimmyadams123@gmail.com: Valid Email
hellowolrd.com: Invalid Email
user@domain.org: Valid Email
invalid.email: Invalid Email
test123@yahoo.co.uk: Invalid Email
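If the labels are needed as data rather than printed text, the same check can be wrapped in a small function and applied with Series.map. This is a sketch under assumptions: label_email and EMAIL_RE are hypothetical names, and treating missing values as invalid is a design choice made here, not something the original example specifies.

```python
import re
import pandas as pd

# Compile once so the pattern is not re-parsed for every element.
EMAIL_RE = re.compile(r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$')

def label_email(value):
    """Return a label for one entry (hypothetical helper, not from the article)."""
    # Non-string entries such as None or NaN are labelled invalid here,
    # because passing them to re.search would raise a TypeError.
    if isinstance(value, str) and EMAIL_RE.search(value):
        return "Valid Email"
    return "Invalid Email"

emails = pd.Series(['jimmyadams123@gmail.com', 'hellowolrd.com', None, 'test123@yahoo.co.uk'])
labels = emails.map(label_email)
print(labels.tolist())
```

The result is a series aligned with the input index, so it can be attached to a DataFrame as a new column.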
Using Pandas str.contains() Method
You can also filter emails directly using Pandas string methods:
import pandas as pd
emails = pd.Series(['jimmyadams123@gmail.com', 'hellowolrd.com', 'user@domain.org', 'invalid.email'])
# Filter valid emails using regex pattern
regex_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
valid_emails = emails[emails.str.contains(regex_pattern, regex=True, na=False)]
print("Valid emails:")
print(valid_emails)
Valid emails:
0    jimmyadams123@gmail.com
2            user@domain.org
dtype: object
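A possible variation, shown here as a sketch rather than the article's method, is Series.str.fullmatch, which requires the entire string to match, so the ^ and $ anchors can be dropped. Passing flags=re.IGNORECASE additionally accepts uppercase letters. The sample address with capital letters and the names body_pattern, strict, and relaxed are assumptions added for illustration.

```python
import re
import pandas as pd

emails = pd.Series(['jimmyadams123@gmail.com', 'Jimmy.Adams@Gmail.com', 'hellowolrd.com', None])

# Same pattern without ^ and $: str.fullmatch already requires a whole-string match.
body_pattern = r'[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}'

# Case-sensitive check; na=False marks missing entries as not matching.
strict = emails.str.fullmatch(body_pattern, na=False)

# Case-insensitive check, so uppercase letters in the address are accepted.
relaxed = emails.str.fullmatch(body_pattern, flags=re.IGNORECASE, na=False)

print(emails[strict].tolist())
print(emails[relaxed].tolist())
```

Whether to ignore case depends on the data: the original pattern is lowercase only, so mixed-case input would otherwise be reported as invalid.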
Creating Boolean Mask
Generate a boolean series to identify valid emails:
import pandas as pd
emails = pd.Series(['jimmyadams123@gmail.com', 'hellowolrd.com', 'user@domain.org', 'invalid.email'])
regex_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
# Create boolean mask
is_valid = emails.str.contains(regex_pattern, regex=True, na=False)
# Create DataFrame with results
result_df = pd.DataFrame({
    'Email': emails,
    'Is_Valid': is_valid
})
print(result_df)
                     Email  Is_Valid
0  jimmyadams123@gmail.com      True
1           hellowolrd.com     False
2          user@domain.org      True
3            invalid.email     False
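Once the mask exists, it can also be used to count and split the entries. The following sketch reuses the same series and pattern; valid_count and invalid_emails are illustrative names.

```python
import pandas as pd

emails = pd.Series(['jimmyadams123@gmail.com', 'hellowolrd.com', 'user@domain.org', 'invalid.email'])
regex_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

# Boolean mask: True where the email matches the pattern.
is_valid = emails.str.contains(regex_pattern, regex=True, na=False)

# True counts as 1, so summing the mask gives the number of valid emails.
valid_count = int(is_valid.sum())

# The ~ operator inverts the mask and selects the entries that failed.
invalid_emails = emails[~is_valid].tolist()

print(valid_count)
print(invalid_emails)
```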
Conclusion
Regular expressions provide a compact way to validate email addresses in a Pandas series. Use str.contains() with a regex pattern to filter the series or build a boolean mask, or loop over the values with re.search() when each email needs its own handling.
