Machine Learning - Percentiles



Percentiles are a statistical concept used in machine learning to describe the distribution of a dataset. A percentile is the value below which a given percentage of the observations in a dataset fall.

For example, the 25th percentile (also known as the first quartile) is the value below which 25% of the observations in the dataset fall, while the 75th percentile (also known as the third quartile) is the value below which 75% of the observations in the dataset fall.

Percentiles can be used to summarize the distribution of a dataset and identify outliers. In machine learning, they are often used during data preprocessing and exploratory data analysis, for example to check how spread out a feature is or to flag unusually large or small values.
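As a rough illustration of how percentiles can flag outliers, the sketch below applies the interquartile range (IQR) rule with NumPy. The sample values and the 1.5 multiplier are assumptions chosen for illustration. The multiplier is a common convention, not something the text above prescribes.

```python
import numpy as np

# Hypothetical sample; the value 40 is deliberately far from the rest.
values = np.array([10, 12, 13, 15, 16, 18, 40])

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Common convention: flag points more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print('Q1:', q1, 'Q3:', q3, 'IQR:', iqr)
print('Outliers:', outliers)
```

With this sample, only the value 40 falls outside the computed bounds. A different multiplier or a different pair of percentiles would make the rule stricter or looser.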

Python provides several libraries for calculating percentiles, including NumPy and Pandas.

Calculating Percentiles using NumPy

Below is an example of how to calculate percentiles using NumPy:

Example

import numpy as np

# Sample dataset
data = np.array([1, 2, 3, 4, 5])
# np.percentile() takes the percentage on a 0-100 scale
p25 = np.percentile(data, 25)
p75 = np.percentile(data, 75)
print('25th percentile:', p25)
print('75th percentile:', p75)

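In this example, we create a sample dataset using NumPy and then calculate the 25th and 75th percentiles using the np.percentile() function. The second argument is the percentage, given on a 0 to 100 scale. By default, NumPy interpolates linearly between neighbouring values when the requested percentile falls between two data points.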

Output

The output shows the values of the percentiles for the dataset.

25th percentile: 2.0
75th percentile: 4.0
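In the example above, every percentile lands exactly on a data point. The sketch below uses a hypothetical four-value sample to show what happens when it does not, and also shows that np.percentile() accepts a list of percentages. It relies on NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical four-value sample, chosen so the quartiles fall between data points.
data = np.array([1, 2, 3, 4])

# Passing a list of percentages returns one value per percentage.
p25, p50, p75 = np.percentile(data, [25, 50, 75])

print('25th percentile:', p25)
print('50th percentile (median):', p50)
print('75th percentile:', p75)
```

Here the 25th percentile sits three quarters of the way from 1 to 2, so the interpolated result is 1.75 rather than one of the original values.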

Calculating Percentiles using Pandas

Below is an example of how to calculate percentiles using Pandas:

Example

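import pandas as pd

# Sample dataset as a Pandas Series
data = pd.Series([1, 2, 3, 4, 5])
# quantile() takes a fraction between 0 and 1, not a percentage
p25 = data.quantile(0.25)
p75 = data.quantile(0.75)

print('25th percentile:', p25)
print('75th percentile:', p75)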

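In this example, we create a Pandas Series object and then calculate the 25th and 75th percentiles using the quantile() method of the Series object. Unlike np.percentile(), which expects a percentage between 0 and 100, quantile() expects a fraction between 0 and 1, so the 25th percentile is written as 0.25.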

Output

The output shows the values of the percentiles for the dataset.

25th percentile: 2.0
75th percentile: 4.0
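During exploratory data analysis it is often handy to get several percentiles in one call. The sketch below is one possible way to do that with Pandas. The DataFrame and its score column are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical column of exam scores, used only for illustration.
df = pd.DataFrame({'score': [55, 62, 70, 78, 85, 91, 99]})

# A list of fractions returns a Series indexed by those fractions.
quartiles = df['score'].quantile([0.25, 0.5, 0.75])
print(quartiles)

# describe() reports 25%, 50% and 75% rows alongside count, mean, std, min and max.
summary = df['score'].describe()
print(summary)
```

The describe() method is a quick first look at a numeric column, while quantile() with an explicit list is more suitable when other cut points, such as 0.05 and 0.95, are needed.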