Python Pandas - Missing Data



Missing data is a common problem in real-world datasets, particularly in areas like machine learning and data analysis. Missing values can significantly affect the accuracy of models and analyses, so it is crucial to handle them properly. This tutorial explains how to identify and handle missing data in Python Pandas.

When and Why Is Data Missing?

Consider a scenario where an online survey is conducted for a product. People often do not share all the information related to them and skip some questions, which leads to incomplete data. For example, some might share their experience with the product but not how long they have been using it, or vice versa. Missing data is a frequent occurrence in such real-world scenarios, and handling it effectively is essential.

Representing Missing Data in Pandas

Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.

  • numpy.nan: Used for NumPy data types. Because NaN is a floating-point value, introducing a missing value upcasts an integer array to np.float64 and a boolean array to object.

  • NaT: Used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".

  • <NA>: A more flexible missing value representation for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. This type preserves the original data type when missing values are introduced.

Example

Let us now see how Pandas represents missing data for different data types.

import pandas as pd
import numpy as np

ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

df = pd.DataFrame({'NumPy':ser1, 'Dates':ser2, 'Others':ser3} )
print(df)

Its output is as follows −

   NumPy                         Dates  Others
0    1.0 1970-01-01 00:00:00.000000001       1
1    2.0 1970-01-01 00:00:00.000000002       2
2    NaN                           NaT    <NA>
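
The upcasting rules described above can also be checked directly by inspecting the dtype of each result. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# NumPy integer dtype: reindexing adds a NaN, so the result becomes float64.
int_ser = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])

# NumPy boolean dtype: the result is upcast to object.
bool_ser = pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])

# Nullable extension dtype: the new slot holds pd.NA and the dtype stays Int64.
nullable_ser = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

print(int_ser.dtype)       # float64
print(bool_ser.dtype)      # object
print(nullable_ser.dtype)  # Int64
```

Choosing a nullable dtype such as "Int64" is one way to keep integer columns as integers when they contain gaps.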

Checking for Missing Values

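Pandas provides the isna() and notna() functions to detect missing values, which work across different data types. They are available both as top-level functions (pd.isna(), pd.notna()) and as Series and DataFrame methods, and they return a Boolean object of the same shape as the input, indicating which values are missing (isna) or present (notna).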

Example

The following example detects missing values using the pd.isna() function.

import pandas as pd
import numpy as np

ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
print(pd.isna(ser))

On executing the above code we will get the following output −

0    False
1     True
dtype: bool

It is important to note that None is also treated as a missing value when using isna() and notna().
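
As a small illustration of the points above, the sketch below applies isna() and notna() to a DataFrame that mixes None and NaN. The column names and values are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", None, "Cid"],   # None in a text column
    "score": [9.5, np.nan, 7.0],    # NaN in a float column
})

missing_mask = df.isna()              # Boolean DataFrame, same shape as df
present_mask = df.notna()             # element-wise inverse of isna()
missing_per_column = df.isna().sum()  # each True counts as 1

print(missing_mask)
print(missing_per_column)
```

Summing the Boolean mask is a common way to get a quick count of missing values per column.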

Calculations with Missing Data

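When performing calculations with missing data, Pandas skips NA values by default, so in a sum they contribute nothing, as if they were zero. If all the values are NA, sum() returns 0 in current Pandas versions (pass min_count=1 to get NaN instead), while mean() returns NaN.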

Example

This example calculates the sum of the values in the "one" column of a DataFrame that contains missing data.

import pandas as pd
import numpy as np

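# Build a 5x3 frame of random values, then reindex to add rows 'b', 'd' and 'g'.
# These rows have no data, so they are filled with NaN.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# sum() skips the NaN entries by default (skipna=True)
print(df['one'].sum())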

Its output is as follows (the exact value varies between runs because the data is random) −

2.02357685917
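
Because the example above uses random data, the effect of the missing values is hard to see in the result. The following sketch uses small fixed Series (chosen only for illustration) to show how sum(), mean() and cumsum() behave with missing values, and how the skipna and min_count parameters change the result.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

print(ser.sum())                     # 4.0, the NaN is skipped
print(ser.mean())                    # 2.0, averaged over the two present values
print(ser.sum(skipna=False))         # nan, the NaN propagates
print(all_missing.sum())             # 0.0
print(all_missing.sum(min_count=1))  # nan, fewer than 1 valid value
print(ser.cumsum().tolist())         # [1.0, nan, 4.0]
```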

Replacing/Filling Missing Data

Pandas provides several methods to handle missing data. One common approach is to replace missing values with a specific value using the fillna() method.

Example

The following program shows how to replace NaN with a scalar value (here, 0) using the fillna() method.

import pandas as pd
import numpy as np

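df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

# Reindexing adds row 'b' (all NaN) and drops row 'e'
df = df.reindex(['a', 'b', 'c'])

print("Input DataFrame:")
print(df)
print("Resultant DataFrame after NaN replaced with '0':")
# fillna() returns a new DataFrame; df itself is left unchanged
print(df.fillna(0))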

Its output is as follows (the values vary between runs because the data is random) −

Input DataFrame:
        one       two     three
a  0.188006 -0.685489 -2.088354
b       NaN       NaN       NaN
c -0.446296  2.298046  0.346000
Resultant DataFrame after NaN replaced with '0':
        one       two     three
a  0.188006 -0.685489 -2.088354
b  0.000000  0.000000  0.000000
c -0.446296  2.298046  0.346000
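
A single scalar is not the only way to fill gaps. The sketch below shows three other options that Pandas supports: a dictionary with a different fill value per column, each column's mean, and forward filling with ffill(). The data and the chosen fill values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 6.0]},
    index=["a", "b", "c"],
)

per_column = df.fillna({"one": 0.0, "two": -1.0})  # a different constant per column
with_mean = df.fillna(df.mean())                   # each column's own mean
forward = df.ffill()                               # carry the last valid value forward

print(per_column)
print(with_mean)
print(forward)
```

Note that forward filling leaves a leading gap untouched: the first value of column "two" stays NaN because there is no earlier value to copy.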

Drop Missing Values

If you simply want to exclude the missing values instead of replacing them, use the dropna() method. By default, it drops every row that contains at least one missing value.

Example

This example removes the rows that contain missing values using the dropna() method.

import pandas as pd
import numpy as np

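df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindexing adds rows 'b', 'd' and 'g', which contain only NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# dropna() removes those rows and returns a new DataFrame
print(df.dropna())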

Its output is as follows (the values vary between runs because the data is random) −

        one       two     three
a  0.170497 -0.118334 -1.078715
c  0.326345 -0.180102  0.700032
e  1.972619 -0.322132 -1.405863
f  1.760503 -1.179294  0.043965
h  0.747430  0.235682  0.973310
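
Dropping every row with any missing value can discard more data than intended. As a rough sketch, the how and subset parameters of dropna() give finer control. The small DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, 4.0],
        "two": [5.0, np.nan, np.nan, 8.0],
        "three": [9.0, np.nan, 11.0, np.nan],
    },
    index=["a", "b", "c", "d"],
)

any_missing = df.dropna()                        # default how="any": keeps only row 'a'
all_missing = df.dropna(how="all")               # drops only the fully empty row 'b'
key_columns = df.dropna(subset=["one", "two"])   # checks only the listed columns

print(any_missing.index.tolist())  # ['a']
print(all_missing.index.tolist())  # ['a', 'c', 'd']
print(key_columns.index.tolist())  # ['a', 'd']
```

Using subset is helpful when only some columns are required for the analysis and gaps elsewhere are acceptable.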