How to screen for outliers and deal with them?
Data points that stand out from the bulk of other data points in a dataset are known as outliers. They can distort statistical measurements and obscure underlying trends in the data, which can have a detrimental effect on data analysis, modeling, and visualization. Therefore, before beginning any analysis, it is crucial to recognize and handle outliers.
In this article, we'll explore different methods for screening outliers and various approaches to deal with them effectively.
Screening for Outliers
We must first identify outliers in order to deal with them. Here are popular techniques for detecting them:
Visual Inspection
Plotting the data with box plots, scatter plots, or histograms is one way to find outliers. Let's create a box plot to visualize them:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create sample data with outliers
data = [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]
# Create box plot
plt.figure(figsize=(8, 6))
plt.boxplot(data, vert=True)
plt.title('Box Plot showing Outliers')
plt.ylabel('Values')
plt.show()
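To read the plot, it helps to know which numbers it draws. The sketch below recomputes them with Python's standard library, assuming Matplotlib's default whisker setting of 1.5×IQR and linear interpolation for the quartiles (`method="inclusive"` in `statistics.quantiles`). The variable names are illustrative only.

```python
import statistics

data = [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]

# Quartiles with linear interpolation between neighbouring data points
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: points beyond these are drawn as individual markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
fliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Box: Q1={q1}, median={median}, Q3={q3}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Points plotted as outliers: {fliers}")
```

For this sample the box spans 14.5 to 23.5, the whiskers reach 10 and 25, and the values 100 and 120 appear as separate markers above the upper whisker.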
Z-Score Method
The z-score measures how many standard deviations a data point lies from the mean. A point whose absolute z-score exceeds 3 is commonly treated as an outlier:
import pandas as pd
import numpy as np
from scipy import stats
# Create sample dataset
data = pd.DataFrame({'value': [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]})
# Calculate z-scores
z_scores = np.abs(stats.zscore(data['value']))
# Identify outliers (z-score > 3)
outliers_mask = z_scores > 3
outliers = data[outliers_mask]
print("Z-scores:")
print(z_scores)
print("\nOutliers detected:")
print(outliers)
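Output (z-scores rounded to two decimals):
Z-scores:
[0.65 0.60 0.54 0.52 0.49 0.43 0.38 0.32 0.24 1.82 2.37]

Outliers detected:
Empty DataFrame
Columns: [value]
Index: []
No point is flagged. In a sample this small, the values 100 and 120 pull the mean up and inflate the standard deviation enough that their own z-scores stay below 3.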
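The formula behind the score is z = (x - mean) / std. As a minimal sketch, the same calculation can be done with the standard library alone. It uses the population standard deviation (`statistics.pstdev`, dividing by n), which corresponds to the default `ddof=0` of `scipy.stats.zscore`. The names `z_scores` and `flagged` are illustrative.

```python
import statistics

data = [10, 12, 14, 15, 16, 18, 20, 22, 25, 100, 120]

mean = statistics.fmean(data)
# Population standard deviation (divides by n, not n - 1)
std = statistics.pstdev(data)

# Absolute z-score of every point
z_scores = [abs(x - mean) / std for x in data]
flagged = [x for x, z in zip(data, z_scores) if z > 3]

print(f"mean={mean:.2f}, std={std:.2f}")
print("largest |z|:", round(max(z_scores), 2))
print("flagged:", flagged)
```

Switching to the sample standard deviation (`statistics.stdev`) changes the scores slightly, so it is worth stating which convention a threshold refers to.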
Interquartile Range (IQR) Method
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. Any data point below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is considered an outlier:
import pandas as pd
import numpy as np
# Create sample dataset with clear outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})
# Calculate Q1, Q3, and IQR
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")
# Identify outliers
outliers = data[(data['value'] < lower_bound) | (data['value'] > upper_bound)]
print(f"\nOutliers detected:\n{outliers}")
Q1: 7.5, Q3: 12.5, IQR: 5.0
Lower bound: 0.0, Upper bound: 20.0

Outliers detected:
    value
9      50
10     60
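The rule can also be wrapped in a small reusable helper. This is a sketch using only the standard library. The function names `iqr_bounds` and `iqr_outliers` and the multiplier parameter `k` are hypothetical, and `method="inclusive"` is chosen because it interpolates linearly between data points, as the default pandas `quantile` does.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [x for x in values if x < lower or x > upper]

values = [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]
print(iqr_outliers(values))          # -> [50, 60]
print(iqr_outliers(values, k=3.0))   # -> [50, 60], still flagged with wider fences
```

A larger `k` makes the test more conservative. Both extreme values here remain outside the fences even at `k=3.0`.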
Dealing with Outliers
After identifying outliers, we must determine how to handle them. Here are common approaches:
Removal Method
Removing outliers from the dataset is the simplest approach. Use it cautiously, because dropping points shrinks the sample and shifts statistics such as the mean and standard deviation. The example below removes every value outside the IQR bounds:
import pandas as pd
import numpy as np
# Sample data with outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})
# Calculate IQR bounds
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
cleaned_data = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]
print("Original data:")
print(data.T)
print(f"\nCleaned data (outliers removed):")
print(cleaned_data.T)
Original data:
       0  1  2  3  4   5   6   7   8   9  10
value  5  6  7  8  9  10  11  12  13  50  60

Cleaned data (outliers removed):
       0  1  2  3  4   5   6   7   8
value  5  6  7  8  9  10  11  12  13
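To see how much removal changes the summary statistics, one option is to compare them before and after. The sketch below does this with the standard library. The helper `summarize` is a hypothetical name, and the sample standard deviation (`statistics.stdev`) is an arbitrary choice for illustration.

```python
import statistics

values = [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]

# Same 1.5 * IQR rule as above, with linearly interpolated quartiles
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [x for x in values if lower <= x <= upper]

def summarize(label, xs):
    """Print size, mean, median and sample standard deviation."""
    print(f"{label}: n={len(xs)}, mean={statistics.fmean(xs):.2f}, "
          f"median={statistics.median(xs)}, stdev={statistics.stdev(xs):.2f}")

summarize("before", values)
summarize("after ", cleaned)
```

Dropping two of eleven points moves the mean from about 17.4 to 9.0, while the median only moves from 10 to 9. Reporting both sets of figures makes the effect of the removal explicit.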
Transformation Method
Apply a mathematical function such as the logarithm to compress large values and reduce the influence of extremes. The logarithm is only defined for positive values, so the data must contain no zeros or negatives:
import pandas as pd
import numpy as np
# Sample data with outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})
# Apply log transformation
data['log_transformed'] = np.log(data['value'])
print("Original vs Log Transformed:")
print(data)
Original vs Log Transformed:
    value  log_transformed
0       5         1.609438
1       6         1.791759
2       7         1.945910
3       8         2.079442
4       9         2.197225
5      10         2.302585
6      11         2.397895
7      12         2.484907
8      13         2.564949
9      50         3.912023
10     60         4.094345
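One way to quantify what the transformation achieves is to measure how far the largest value sits above Q3, in units of the IQR, before and after taking logs. This is an illustrative sketch only, and the helper `iqrs_above_q3` is a hypothetical name.

```python
import math
import statistics

def iqrs_above_q3(values, x):
    """How many IQR widths x lies above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (x - q3) / (q3 - q1)

values = [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]
logged = [math.log(v) for v in values]  # requires strictly positive values

raw_distance = iqrs_above_q3(values, max(values))
log_distance = iqrs_above_q3(logged, max(logged))

print(f"raw: {raw_distance:.2f} IQRs above Q3")
print(f"log: {log_distance:.2f} IQRs above Q3")
```

The largest value drops from 9.5 IQRs above Q3 to roughly 3. It is still beyond the 1.5×IQR cutoff, so the transformation reduces the influence of the extreme values here rather than eliminating them.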
Imputation Method
Replace outliers with a statistical measure such as the mean or median:
import pandas as pd
import numpy as np
# Sample data with outliers
data = pd.DataFrame({'value': [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]})
# Calculate IQR bounds
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Replace outliers with the median
median_value = data['value'].median()
# Work on a float copy so the median (a float) matches the column dtype
data_imputed = data.astype(float)
outlier_mask = (data_imputed['value'] < lower_bound) | (data_imputed['value'] > upper_bound)
data_imputed.loc[outlier_mask, 'value'] = median_value
print("Original data:")
print(data['value'].tolist())
print(f"\nAfter imputation with median ({median_value}):")
print(data_imputed['value'].tolist())
Original data:
[5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]

After imputation with median (10.0):
[5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 10.0, 10.0]
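Imputing with the mean needs more care than imputing with the median, because the mean of the full sample is itself pulled upward by the outliers. The sketch below is one possible variant, not the approach used in the example above: it computes the replacement value from the non-outlying points only. The function `impute_outliers` and its `strategy` parameter are hypothetical.

```python
import statistics

def impute_outliers(values, strategy="median", k=1.5):
    """Replace IQR outliers with the mean or median of the remaining points."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [x for x in values if lower <= x <= upper]

    if strategy == "median":
        fill = statistics.median(inliers)
    elif strategy == "mean":
        fill = statistics.fmean(inliers)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Keep inliers as they are and substitute the fill value elsewhere
    return [x if lower <= x <= upper else fill for x in values]

values = [5, 6, 7, 8, 9, 10, 11, 12, 13, 50, 60]
print(impute_outliers(values, "mean"))
print(impute_outliers(values, "median"))
```

Both strategies replace 50 and 60 with 9 here, because the remaining values are symmetric. Using the mean of all eleven points instead would insert about 17.4, a value larger than every non-outlying observation.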
Comparison of Methods
| Method | Pros | Cons | Best For |
|---|---|---|---|
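| Z-Score | Simple to compute and interpret | Assumes roughly normal data; unreliable on small samples, where outliers inflate the mean and standard deviation | Large, normally distributed data |
| IQR | Robust; no distribution assumption | Can flag legitimate values in skewed data | General-purpose outlier detection |
| Removal | Simple; leaves a clean dataset | Loss of information | Clear data errors |
| Transformation | Keeps every data point | Changes the scale; log requires positive values | Right-skewed, positive data |
| Imputation | Preserves dataset size | May introduce bias | Analyses that need every row kept |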
Conclusion
Outlier detection and treatment are crucial steps in data preprocessing. Use visual inspection to understand the data and the IQR method for robust detection, then choose a treatment method based on your analysis goals. Always document how you handled outliers and check how that choice affects your final results.
