How to screen for outliers and deal with them?

Data points that stand out from the bulk of other data points in a dataset are known as outliers. They can distort statistical measurements and obscure underlying trends in the data, which can have a detrimental effect on data analysis, modeling, and visualization. Therefore, before beginning any analysis, it is crucial to recognize and handle outliers.

In this article, we'll explore different methods for screening outliers and various approaches to deal with them effectively.

Screening for Outliers

We must first identify outliers in order to deal with them. Here are popular techniques for detecting outliers:

Visual Inspection

Visualizing the data with graphs and plots, such as box plots, scatter plots, and histograms, is one method for finding outliers. Let's create a box plot to visualize outliers:

import matplotlib.pyplot as plt

# Create sample data with outliers
data = [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]

# Create box plot (vertical by default)
plt.figure(figsize=(8, 6))
plt.boxplot(data)
plt.title('Box Plot showing Outliers')
plt.ylabel('Values')
plt.show()
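The same data can also be viewed as a histogram, one of the other plot types mentioned above. This is a minimal sketch, not part of the original example: the bin count of 11 and the output file name are illustrative choices. The two extreme values appear as isolated bars far to the right of the main cluster.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Same sample data as the box plot example
data = [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]

fig, ax = plt.subplots(figsize=(8, 6))
# 11 equal-width bins over [10, 120] gives bins of width 10 (illustrative choice)
counts, bin_edges, _ = ax.hist(data, bins=11, edgecolor='black')
ax.set_title('Histogram showing Outliers')
ax.set_xlabel('Values')
ax.set_ylabel('Frequency')
fig.savefig('histogram.png')  # hypothetical file name; use plt.show() when working interactively
```

Empty bins between the main cluster and the rightmost bars are a quick visual hint that those values deserve a closer look.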

Z-Score Method

The z-score measures how many standard deviations a data point lies from the mean. A point whose absolute z-score exceeds 3 is commonly treated as an outlier:

import pandas as pd
import numpy as np
from scipy import stats

# Create sample dataset
data = pd.DataFrame({'value': [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]})

# Calculate z-scores
z_scores = np.abs(stats.zscore(data['value']))

# Identify outliers (z-score > 3)
outliers_mask = z_scores > 3
outliers = data[outliers_mask]

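print("Z-scores:")
# Round for display; np.asarray gives a plain NumPy array to print
print(np.round(np.asarray(z_scores), 2))
print("\nOutliers detected:")
print(outliers)

The output is:

Z-scores:
[0.65 0.6  0.54 0.52 0.49 0.43 0.38 0.32 0.24 1.82 2.37]

Outliers detected:
Empty DataFrame
Columns: [value]
Index: []

No value crosses the threshold, even though 100 and 120 are clearly extreme, because those same values inflate the mean and standard deviation used to score them. This is a known weakness of the z-score method on small samples.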
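For reference, the score can be worked out by hand for the largest value. The numbers below are computed from the sample data, assuming the population standard deviation (`ddof=0`), which is the default in `scipy.stats.zscore`:

```latex
z_i = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{372}{11} \approx 33.82, \qquad
\sigma \approx 36.40

z_{120} = \frac{120 - 33.82}{36.40} \approx 2.37 < 3
```

So 120 sits about 2.37 standard deviations above the mean and is not flagged at a threshold of 3.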

Interquartile Range (IQR) Method

The interquartile range (IQR) is the distance between the data's 25th percentile (Q1) and its 75th percentile (Q3). Any data point below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is considered an outlier:

import pandas as pd
import numpy as np

# Create sample dataset with clear outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})

# Calculate Q1, Q3, and IQR
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")

# Identify outliers
outliers = data[(data['value'] < lower_bound) | (data['value'] > upper_bound)]
print(f"\nOutliers detected:\n{outliers}")

The output is:

Q1: 7.5, Q3: 12.5, IQR: 5.0
Lower bound: 0.0, Upper bound: 20.0

Outliers detected:
   value
9     50
10    60
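Because the same bounds are needed again when treating outliers, the calculation can be wrapped in a small helper. The sketch below is one possible way to do it; the function name `iqr_bounds` and its `k` parameter are hypothetical and not part of pandas. Applied to the dataset from the z-score section, the IQR rule flags 100 and 120, which the z-score threshold did not.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Dataset from the z-score example
values = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120])

lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]

print(f"Bounds: ({lower}, {upper})")
print(f"Flagged values: {flagged.tolist()}")
```

Quartiles are much less sensitive to a few extreme points than the mean and standard deviation, which is why the two methods disagree on this sample.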

Dealing with Outliers

After identifying outliers, we must determine how to handle them. Here are common approaches:

Removal Method

Removing outliers from the dataset is the simplest approach. Use it cautiously: every removed point discards information, and dropping too many shifts statistics such as the mean and variance:

import pandas as pd
import numpy as np

# Sample data with outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})

# Calculate IQR bounds
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
cleaned_data = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]

print("Original data:")
print(data.T)
print("\nCleaned data (outliers removed):")
print(cleaned_data.T)

The output is:

Original data:
       0  1  2  3  4   5   6   7   8   9  10
value  5  6  7  8  9  10  11  12  13  50  60

Cleaned data (outliers removed):
       0  1  2  3  4   5   6   7   8
value  5  6  7  8  9  10  11  12  13
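Since removed rows cannot be recovered later, it can help to flag outliers in a separate column before dropping them and to record how much data was removed. This is a minimal sketch, assuming the same sample data and IQR bounds; the column name `is_outlier` is an illustrative choice.

```python
import pandas as pd

data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})

q1 = data['value'].quantile(0.25)
q3 = data['value'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag first, so the decision stays visible and reversible
data['is_outlier'] = ~data['value'].between(lower_bound, upper_bound)

# Drop flagged rows only when building the cleaned dataset
cleaned_data = data.loc[~data['is_outlier'], ['value']]

removed = int(data['is_outlier'].sum())
print(f"Removed {removed} of {len(data)} rows ({removed / len(data):.1%})")
print(f"Mean before: {data['value'].mean():.2f}, after: {cleaned_data['value'].mean():.2f}")
```

Reporting the share of removed rows alongside the before and after statistics makes the effect of the removal easy to document.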

Transformation Method

Apply a mathematical function, such as the logarithm, to compress large values and reduce the impact of extremes. Unlike removal, this keeps every observation:

import pandas as pd
import numpy as np

# Sample data with outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})

# Apply log transformation
data['log_transformed'] = np.log(data['value'])

print("Original vs Log Transformed:")
print(data)

The output is:

Original vs Log Transformed:
   value  log_transformed
0      5         1.609438
1      6         1.791759
2      7         1.945910
3      8         2.079442
4      9         2.197225
5     10         2.302585
6     11         2.397895
7     12         2.484907
8     13         2.564949
9     50         3.912023
10    60         4.094345
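One caveat: `np.log` is only defined for positive values, so data containing zeros needs a different function. The sketch below uses NumPy's `np.log1p`, which computes log(1 + x), and its inverse `np.expm1`. The zero-containing sample values are made up for illustration, and this approach does not help with values at or below -1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that includes a zero, where np.log would return -inf
data = pd.DataFrame({'value': [0, 5, 10, 13, 50, 60]})

# log1p(x) = log(1 + x) is defined at x = 0
data['log1p_transformed'] = np.log1p(data['value'])

# expm1 reverses the transformation, so results can be reported in original units
data['restored'] = np.expm1(data['log1p_transformed'])

print(data)
```

Keeping the inverse at hand matters because any statistic computed on the transformed column is on the log scale, not the original one.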

Imputation Method

Replace outliers with statistical measures like mean or median values:

import pandas as pd
import numpy as np

# Sample data with outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})

# Calculate IQR bounds
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Replace outliers with the median (cast to float so the column keeps a single dtype)
median_value = data['value'].median()
data_imputed = data.astype({'value': float})
outlier_mask = (data_imputed['value'] < lower_bound) | (data_imputed['value'] > upper_bound)
data_imputed.loc[outlier_mask, 'value'] = median_value

print("Original data:")
print(data['value'].tolist())
print(f"\nAfter imputation with median ({median_value}):")
print(data_imputed['value'].tolist())

The output is:

Original data:
[5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]

After imputation with median (10.0):
[5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 10.0, 10.0]
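If the mean is used instead of the median, it is worth computing it from the non-outlier values only, because the overall mean is itself pulled upward by the points being replaced. This is a minimal sketch under the same data and IQR rule; variable names such as `inlier_mean` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]}, dtype=float)

q1 = data['value'].quantile(0.25)
q3 = data['value'].quantile(0.75)
iqr = q3 - q1
outlier_mask = (data['value'] < q1 - 1.5 * iqr) | (data['value'] > q3 + 1.5 * iqr)

overall_mean = data['value'].mean()                    # distorted by 50 and 60
inlier_mean = data.loc[~outlier_mask, 'value'].mean()  # computed without them

# Replace flagged values with the mean of the remaining values
data_imputed = data.copy()
data_imputed.loc[outlier_mask, 'value'] = inlier_mean

print(f"Overall mean: {overall_mean:.2f}, inlier mean: {inlier_mean:.2f}")
print(data_imputed['value'].tolist())
```

The median used in the example above avoids this problem by construction, which is one reason it is often the safer default for imputation.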

Comparison of Methods

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Z-Score | Simple to compute and interpret | Assumes a normal distribution; not robust for small datasets | Large, normally distributed data |
| IQR | Robust, no distribution assumption | May be too aggressive | General-purpose outlier detection |
| Removal | Clean dataset | Loss of information | Clear data errors |
| Imputation | Preserves dataset size | May introduce bias | Missing value scenarios |

Conclusion

Outlier detection and treatment are crucial steps in data preprocessing. Use the IQR method for robust detection and visual inspection to understand the data, then choose a treatment method based on your analysis goals. Always document your outlier handling process and consider its impact on your final results.

Updated on: 2026-03-27T00:26:08+05:30
