How to screen for outliers and deal with them?


Introduction

Data points that stand out from the bulk of other observations in a dataset are known as outliers. They can distort statistical measures and obscure underlying trends, which can have a detrimental effect on data analysis, modeling, and visualization. It is therefore crucial to identify and handle outliers before beginning any analysis.

In this post, we'll look at how to screen for outliers and at different methods for dealing with them.

Screening for Outliers

We must first recognize outliers in order to deal with them. Here are a few popular techniques for identifying outliers −

1. Visual Inspection

One method for finding outliers is to visualize the data using graphs and plots such as box plots, scatter plots, and histograms. Points that sit far away from the bulk of the data stand out immediately in these plots, and by examining them we can judge whether they are genuine observations or the result of mistakes or corrupted data.
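
For instance, a box plot marks points beyond its whiskers as potential outliers. The following is a minimal sketch assuming a small DataFrame with a single, purely illustrative column named value and that matplotlib is available −

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data containing one obviously extreme value
data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 999, 5, 6]})

# A box plot draws points beyond the whiskers individually, making them easy to spot
data.boxplot(column='value')
plt.title('Box plot of value')
plt.show()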

2. Z-score

The z-score is a statistical metric that measures how many standard deviations a data point lies from the mean. By computing the z-score of each data point, we can find values that differ considerably from the rest of the data. A data point with an absolute z-score greater than 3 is frequently regarded as an outlier.
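
As a rough sketch, the z-score rule can be applied to a single column like this (the column name value and the generated sample are purely illustrative) −

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 50 typical values plus one extreme value of 100
rng = np.random.default_rng(0)
data = pd.DataFrame({'value': np.append(rng.normal(10, 2, 50), 100)})

# Flag rows whose absolute z-score exceeds 3
z_scores = np.abs(stats.zscore(data['value']))
print(data[z_scores > 3])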

3. Interquartile Range (IQR)

The interquartile range is the distance between the data's 25th percentile (Q1) and its 75th percentile (Q3). A common rule is to compute the IQR, multiply it by a factor of 1.5, and flag any data point that lies more than 1.5 × IQR below Q1 or above Q3 as an outlier.
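
Here is a minimal sketch of the 1.5 × IQR rule, using the same sample values as the example later in this post −

import pandas as pd

data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})

# Compute the quartiles and the IQR, then flag points outside the fences
q1 = data['value'].quantile(0.25)
q3 = data['value'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data['value'] < lower) | (data['value'] > upper)]
print(outliers)

For these values the fences work out to 0 and 16, so both 555 and 999 are flagged.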

Dealing with Outliers

After locating outliers, we must determine how to handle them. Here are a few typical methods for handling outliers −

1. Removal

The simplest way to handle outliers is to remove them from the dataset. This strategy should be employed with caution, though, as eliminating too many points can distort the dataset's statistical measures and obscure key trends. When removing outliers, it is crucial to document the procedure and the justification for doing so.
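
As a minimal sketch, rows flagged by the IQR rule described above can be dropped like this (the value column and sample values are illustrative) −

import pandas as pd

data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})
q1, q3 = data['value'].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only the rows that fall inside the 1.5 * IQR fences
cleaned = data[data['value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(cleaned)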

2. Transformation

Another strategy for addressing outliers is to transform the data using mathematical functions such as logarithmic, exponential, or power functions. A transformation compresses extreme values, so they have less influence on the dataset's statistical measures and underlying patterns become easier to spot.
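
A simple sketch of this idea is a log transform, which compresses large values (again using the illustrative value column) −

import numpy as np
import pandas as pd

data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})

# log1p(x) = log(1 + x) compresses large values while keeping small ones comparable
data['log_value'] = np.log1p(data['value'])
print(data)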

3. Imputation

Imputation is the process of substituting estimated values for missing or anomalous data. Data may be imputed using a variety of techniques, including mean imputation, median imputation, and regression imputation. This method should be used with caution, however, as it can introduce bias into the dataset and affect the accuracy of the analysis.
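
The following is a minimal sketch of median imputation, where values flagged by the IQR rule are replaced with the median of the remaining values −

import pandas as pd

data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})
q1, q3 = data['value'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = ~data['value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Replace flagged values with the median of the non-outlier values
data.loc[mask, 'value'] = data.loc[~mask, 'value'].median()
print(data)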

4. Segmentation

Segmentation involves breaking a dataset up into smaller groups according to shared traits or properties. By segmenting the data, we can study each group independently and find patterns that are specific to it. This strategy can be helpful when dealing with outliers that are valid observations but reflect a distinct portion of the data.
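
As a minimal sketch, a hypothetical segment column can be used to summarise each group separately −

import pandas as pd

# Hypothetical data where one segment naturally has much larger values
data = pd.DataFrame({
    'segment': ['retail', 'retail', 'retail', 'wholesale', 'wholesale', 'wholesale'],
    'value': [10, 12, 11, 950, 980, 1020],
})

# Describe each segment on its own instead of treating the large values as outliers
print(data.groupby('segment')['value'].describe())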

Example

import pandas as pd
import numpy as np
from scipy import stats

# Create a sample dataset with two apparently extreme values (555 and 999)
data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})

# Calculate the absolute z-score of each value in the 'value' column
z_scores = np.abs(stats.zscore(data['value']))

# Identify outliers as any value with an absolute z-score greater than 3
outliers = data[z_scores > 3]

# Replace the flagged values with the median of the column
data.loc[z_scores > 3, 'value'] = data['value'].median()

# Print the updated dataset
print(data)

Output

   value
0     10
1      9
2      8
3      7
4      6
5    555
6    999
7      5
8      6

Explanation

  • A sample dataset with a single column named value and nine values is created; it contains two values, 555 and 999, that are far larger than the rest.

  • The z-score of each value is computed with the stats.zscore function from the SciPy package. A z-score indicates how many standard deviations a data point lies from the mean.

  • The np.abs function takes the absolute value of each z-score, since we only care about the size of the deviation from the mean, not its direction.

  • The condition z_scores > 3 flags any value with an absolute z-score above 3 as an outlier.

  • Flagged values are replaced with the median of the value column, and the updated dataset is printed with the print function.

With this particular sample, however, none of the absolute z-scores exceeds 3: the two extreme values inflate the mean and standard deviation so much that even their own z-scores stay below the threshold, which is why the printed dataset is unchanged. This masking effect is common in very small samples, where the IQR rule is often a better choice; the z-score approach works best when the sample size is large or the data are approximately normally distributed.

It's important to keep in mind that the z-score method used in this example is only one of several approaches to dealing with outliers. Trimming, winsorizing, and using machine learning algorithms that are robust to outliers are other common techniques. The best approach depends on the characteristics of the dataset and the objectives of the analysis.
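
For instance, one way to winsorize is to clip the data at chosen percentiles rather than removing anything; here is a minimal sketch using the same sample values as above −

import pandas as pd

data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})

# Cap values below the 5th percentile and above the 95th percentile of the column
lower, upper = data['value'].quantile([0.05, 0.95])
data['winsorized'] = data['value'].clip(lower, upper)
print(data)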

Conclusion

In summary, outliers can negatively affect data analysis, modeling, and visualization, so it's critical to spot and handle them before beginning any analysis. By screening for outliers with visual inspection, z-scores, and the IQR, and then dealing with them through removal, transformation, imputation, or segmentation, we can make sure our analysis is accurate and insightful. However, it's crucial to apply these methods carefully and to document the process.
