Python Pandas - Missing Data
Missing data is a common problem in real-world work, particularly in areas like machine learning and data analysis. Missing values can significantly affect the accuracy of models and analyses, so it is important to handle them properly. This tutorial explains how to identify and handle missing data in Python Pandas.
When and Why Is Data Missing?
Consider an online survey conducted for a product. People often do not share all the information asked of them; they may skip some questions, leaving the data incomplete. For example, some might describe their experience with the product but not how long they have been using it, or vice versa. Missing data is common in such real-world scenarios, and handling it effectively is essential.
Representing Missing Data in Pandas
Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.
numpy.nan: Used for NumPy data types. When missing values are introduced in an integer or boolean array, the array is upcast to np.float64 or object, as NaN is a floating-point value.
NaT: Used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".
<NA>: A more flexible missing value representation for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. This type preserves the original data type when missing values are introduced.
Example
Let us now see how Pandas represents missing data for different data types.
import pandas as pd
import numpy as np
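# NumPy int64 cannot hold NaN, so reindexing upcasts the values to float64
ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
# datetime64 values use NaT for the missing entry
ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
# The nullable Int64 dtype keeps integers and uses <NA> for the missing entry
ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
df = pd.DataFrame({'NumPy': ser1, 'Dates': ser2, 'Others': ser3})
print(df)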
Its output is as follows −
| | NumPy | Dates | Others |
|---|---|---|---|
| 0 | 1.0 | 1970-01-01 00:00:00.000000001 | 1 |
| 1 | 2.0 | 1970-01-01 00:00:00.000000002 | 2 |
| 2 | NaN | NaT | <NA> |
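The difference between upcasting and dtype preservation can be checked directly by inspecting the dtype after a missing value is introduced. The following is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy-backed integers cannot store NaN, so reindexing upcasts to float64.
numpy_ints = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])

# The nullable "Int64" extension dtype keeps integers and stores <NA> instead.
nullable_ints = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

print(numpy_ints.dtype)               # float64
print(nullable_ints.dtype)            # Int64
print(nullable_ints.isna().tolist())  # [False, False, True]
```

Choosing a nullable dtype is useful when a column must stay integer-typed even though some entries are missing.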
Checking for Missing Values
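Pandas provides the isna() and notna() functions to detect missing values, which work across different data types. They return a Boolean object of the same shape as the input (a Series for a Series, a DataFrame for a DataFrame): isna() marks missing values as True, and notna() marks them as False.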
Example
The following example detects missing values using the isna() function.
import pandas as pd
import numpy as np
ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
print(pd.isna(ser))
On executing the above code we will get the following output −
0    False
1     True
dtype: bool
It is important to note that None is also treated as a missing value when using isna() and notna().
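As a rough illustration of both points, the sketch below builds a small survey-style DataFrame that mixes np.nan and None, then counts the missing values per column. The column names and values are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; np.nan and None both count as missing.
survey = pd.DataFrame({
    "rating": [4.0, np.nan, 5.0, 3.0],
    "months_used": [12.0, 6.0, np.nan, np.nan],
    "comment": ["good", None, "great", None],
})

# isna() returns a Boolean DataFrame; summing it counts missing values per column.
missing_per_column = survey.isna().sum()

# notna().all(axis=1) flags the rows in which every column is present.
complete_rows = survey.notna().all(axis=1)

print(missing_per_column)
print(complete_rows.tolist())  # [True, False, False, False]
```

Counting missing values per column this way is a common first step before deciding whether to fill or drop them.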
Calculations with Missing Data
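When summing data, Pandas skips missing values, which has the same effect as treating them as zero. Other reductions such as mean() also skip missing values rather than counting them as zero. In current versions of Pandas, the sum of an empty or all-NA Series is 0; pass min_count=1 to sum() to get NaN instead.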
Example
This example calculates the sum of the values in the "one" column of a DataFrame that contains missing data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# sum() skips the NaN entries
print(df['one'].sum())
Because the values are random, the result differs on every run. A sample output is as follows −
2.02357685917
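Since the example above uses random data, its result cannot be reproduced exactly. The following sketch uses fixed values instead to show how reductions behave with missing data; the variable names are illustrative, while skipna and min_count are standard parameters of sum().

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

total = values.sum()                           # NaN is skipped: 1.0 + 3.0
average = values.mean()                        # mean of the two present values
strict_total = values.sum(skipna=False)        # NaN propagates into the result
empty_total = all_missing.sum()                # nothing to add, so the sum is 0.0
guarded_total = all_missing.sum(min_count=1)   # requires one valid value, else NaN

print(total)          # 4.0
print(average)        # 2.0
print(strict_total)   # nan
print(empty_total)    # 0.0
print(guarded_total)  # nan
```

Note that mean() returns 2.0 rather than 4/3, which shows that missing values are skipped and not counted as zeros.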
Replacing/Filling Missing Data
Pandas provides several methods to handle missing data. One common approach is to replace missing values with a specific value using the fillna() method.
Example
The following program shows how to replace NaN with a scalar value (here, 0) using the fillna() method.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print("Input DataFrame:\n",df)
print("Resultant DataFrame after NaN replaced with '0':")
print(df.fillna(0))
Its output is as follows −
Input DataFrame:
| | one | two | three |
|---|---|---|---|
| a | 0.188006 | -0.685489 | -2.088354 |
| b | NaN | NaN | NaN |
| c | -0.446296 | 2.298046 | 0.346000 |
Resultant DataFrame after NaN replaced with '0':
| | one | two | three |
|---|---|---|---|
| a | 0.188006 | -0.685489 | -2.088354 |
| b | 0.000000 | 0.000000 | 0.000000 |
| c | -0.446296 | 2.298046 | 0.346000 |
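A single scalar is not always the best replacement. As a sketch of two other options, fillna() also accepts a dictionary with one fill value per column, and ffill() carries the last valid value forward. The data and the choice of fill values below are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [10.0, np.nan, 30.0]},
    index=["a", "b", "c"],
)

# Fill each column with its own value: 0 for "one", the column mean for "two".
per_column = df.fillna({"one": 0.0, "two": df["two"].mean()})

# Forward fill: replace each NaN with the last valid value above it.
forward_filled = df.ffill()

print(per_column)
print(forward_filled)
```

Both calls return a new DataFrame and leave df unchanged, so the original missing values remain available for comparison.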
Drop Missing Values
If you want to simply exclude missing values instead of replacing them, use the dropna() function. By default, it drops every row that contains at least one missing value.
Example
This example removes the missing values using the dropna() function.
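import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# dropna() removes the rows that contain NaN
print(df.dropna())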
Its output is as follows −
| | one | two | three |
|---|---|---|---|
| a | 0.170497 | -0.118334 | -1.078715 |
| c | 0.326345 | -0.180102 | 0.700032 |
| e | 1.972619 | -0.322132 | -1.405863 |
| f | 1.760503 | -1.179294 | 0.043965 |
| h | 0.747430 | 0.235682 | 0.973310 |
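Dropping every row with any missing value can discard more data than intended. The sketch below uses fixed, made-up values to show how the how, subset, and thresh parameters of dropna() narrow what gets removed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [10.0, np.nan, np.nan, 40.0],
        "three": [100.0, np.nan, 300.0, 400.0],
    },
    index=["a", "b", "c", "d"],
)

# Default: drop rows with at least one missing value.
any_missing_dropped = df.dropna()

# Drop only the rows in which every value is missing.
all_missing_dropped = df.dropna(how="all")

# Consider only the "one" column when deciding which rows to drop.
subset_dropped = df.dropna(subset=["one"])

# Keep rows that have at least two non-missing values.
threshold_kept = df.dropna(thresh=2)

print(any_missing_dropped.index.tolist())  # ['a']
print(all_missing_dropped.index.tolist())  # ['a', 'c', 'd']
print(subset_dropped.index.tolist())       # ['a', 'c']
print(threshold_kept.index.tolist())       # ['a', 'c', 'd']
```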