
Python Pandas - Missing Data
Missing data is a common problem in real-world scenarios, particularly in areas like machine learning and data analysis. Missing values can significantly affect the accuracy of models and analyses, so it is crucial to address them properly. This tutorial explains how to identify and handle missing data in Python Pandas.
When and Why Is Data Missing?
Consider an online survey conducted for a product. People often do not share all the information requested; they may skip some questions, leading to incomplete data. For example, some might share their experience with the product but not how long they have been using it, or vice versa. Missing data is a frequent occurrence in such real-world scenarios, and handling it effectively is essential.
Representing Missing Data in Pandas
Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.
numpy.nan: Used for NumPy data types. When missing values are introduced in an integer or boolean array, the array is upcast to np.float64 or object, as NaN is a floating-point value.
NaT: Used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".
<NA>: A more flexible missing value representation for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. This type preserves the original data type when missing values are introduced.
Example
Let us now see how Pandas represents missing data for different data types.
    import pandas as pd
    import numpy as np

    # Reindexing to [0, 1, 2] adds label 2, which has no value in the original Series
    ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
    ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
    ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

    df = pd.DataFrame({'NumPy': ser1, 'Dates': ser2, 'Others': ser3})
    print(df)
Its output is as follows −
| | NumPy | Dates | Others |
|---|---|---|---|
| 0 | 1.0 | 1970-01-01 00:00:00.000000001 | 1 |
| 1 | 2.0 | 1970-01-01 00:00:00.000000002 | 2 |
| 2 | NaN | NaT | <NA> |
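To make the upcasting described above visible, the following minimal sketch compares the resulting dtypes directly. The variable names are illustrative only and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Reindexing adds label 2, which has no value in the original Series.
numpy_int = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
nullable_int = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])

print(numpy_int.dtype)           # float64: upcast so that NaN can be stored
print(nullable_int.dtype)        # Int64: dtype preserved, gap stored as <NA>
print(nullable_int[2] is pd.NA)  # True
```

If integer values must stay integers after missing values appear, the nullable "Int64" dtype is usually the safer choice.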
Checking for Missing Values
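Pandas provides the isna() and notna() functions to detect missing values, which work across different data types. They are available both as top-level functions (pd.isna()) and as Series and DataFrame methods, and they return a Boolean object of the same shape as the input indicating which values are missing.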
Example
The following example detects missing values using the isna() function.
    import pandas as pd
    import numpy as np

    # The second element is NaT, the missing-value marker for datetimes
    ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
    print(pd.isna(ser))
On executing the above code we will get the following output −
    0    False
    1     True
    dtype: bool
It is important to note that None is also treated as a missing value when using isna() and notna().
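As a small illustration of this point, the sketch below builds an object-dtype Series holding both None and NaN and checks it with the method forms isna() and notna(). The data is made up for demonstration.

```python
import numpy as np
import pandas as pd

# An object-dtype Series holding both None and NaN; isna() flags both.
ser = pd.Series(["a", None, np.nan, "d"], dtype=object)

missing_mask = ser.isna()
print(missing_mask.tolist())    # [False, True, True, False]
print(ser.notna().tolist())     # [True, False, False, True]
print(int(missing_mask.sum()))  # 2 missing values
```

Summing the Boolean mask is a common way to count missing values, because True counts as 1.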
Calculations with Missing Data
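When performing calculations with missing data, Pandas skips NA values by default. In a sum, this is equivalent to treating NA as zero. If all values in a sum are NA, the result is 0 in current Pandas versions (0.22 and later); pass min_count=1 to get NaN instead. Cumulative methods such as cumsum() also skip NA values but preserve them in the result.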
Example
This example calculates the sum of the values in the "one" column of a DataFrame that contains missing data.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    # Labels 'b', 'd' and 'g' do not exist yet, so their rows are filled with NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    print(df['one'].sum())
Its output is similar to the following (the exact value varies between runs because the data is random):
2.02357685917
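Because the example above uses random data, its result is hard to verify by eye. The following sketch uses fixed, made-up values to show how common reductions treat missing data; the all-NA behaviour assumes Pandas 0.22 or later.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

print(ser.sum())                     # 4.0: the NaN is skipped
print(ser.mean())                    # 2.0: mean of the two present values
print(ser.sum(skipna=False))         # nan: NaN propagates when not skipped
print(all_missing.sum())             # 0.0: default result for an all-NA sum
print(all_missing.sum(min_count=1))  # nan: at least one valid value is required
print(ser.cumsum().tolist())         # [1.0, nan, 4.0]
```

Note that the mean is 2.0, not 4/3: the missing value is left out of the count rather than counted as a zero.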
Replacing/Filling Missing Data
Pandas provides several methods to handle missing data. One common approach is to replace missing values with a specific value using the fillna() method.
Example
The following program shows how to replace NaN with a scalar value (here, 0) using the fillna() method.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                      columns=['one', 'two', 'three'])
    # Label 'b' does not exist in the original index, so its row is filled with NaN
    df = df.reindex(['a', 'b', 'c'])
    print("Input DataFrame:\n", df)
    print("Resultant DataFrame after NaN replaced with '0':")
    print(df.fillna(0))
Its output is as follows −
Input DataFrame:
| | one | two | three |
|---|---|---|---|
| a | 0.188006 | -0.685489 | -2.088354 |
| b | NaN | NaN | NaN |
| c | -0.446296 | 2.298046 | 0.346000 |
Resultant DataFrame after NaN replaced with '0':
| | one | two | three |
|---|---|---|---|
| a | 0.188006 | -0.685489 | -2.088354 |
| b | 0.000000 | 0.000000 | 0.000000 |
| c | -0.446296 | 2.298046 | 0.346000 |
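A single scalar is not the only way to fill gaps. As a minimal sketch with made-up data, the block below shows three other common strategies: a different value per column, forward filling, and filling with each column's mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, np.nan]},
    index=["a", "b", "c"],
)

# Different fill value per column, given as a dict keyed by column name.
per_column = df.fillna({"one": 0.0, "two": -1.0})

# Propagate the last valid value forward; a leading NaN has nothing to copy.
forward = df.ffill()

# Fill each column with its own mean (NaN is skipped when computing the mean).
mean_filled = df.fillna(df.mean())

print(per_column)
print(forward)
print(mean_filled)
```

Which strategy is appropriate depends on the data: forward filling suits ordered observations such as time series, while a constant or a mean is a cruder substitute that can distort later statistics.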
Drop Missing Values
If you simply want to exclude missing values instead of replacing them, use the dropna() method. By default, it drops every row that contains at least one missing value.
Example
This example removes the rows with missing values using the dropna() method.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    # Rows 'b', 'd' and 'g' are added by the reindex and contain only NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    print(df.dropna())
Its output is similar to the following (the exact values vary between runs because the data is random):
| | one | two | three |
|---|---|---|---|
| a | 0.170497 | -0.118334 | -1.078715 |
| c | 0.326345 | -0.180102 | 0.700032 |
| e | 1.972619 | -0.322132 | -1.405863 |
| f | 1.760503 | -1.179294 | 0.043965 |
| h | 0.747430 | 0.235682 | 0.973310 |
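The default behaviour can be adjusted with the how, subset, thresh, and axis parameters of dropna(). The following sketch uses a small made-up DataFrame to show each of them; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan],
        "two": [2.0, 5.0, np.nan],
        "three": [3.0, 6.0, np.nan],
    },
    index=["a", "b", "c"],
)

rows_complete = df.dropna()                  # drop rows with any NaN: keeps 'a'
rows_not_empty = df.dropna(how="all")        # drop rows that are entirely NaN: keeps 'a', 'b'
rows_with_two = df.dropna(subset=["two"])    # only look at column 'two': keeps 'a', 'b'
rows_min_two = df.dropna(thresh=2)           # need at least 2 non-NA values: keeps 'a', 'b'
full_columns = rows_not_empty.dropna(axis=1) # drop columns with any NaN: keeps 'two', 'three'

print(rows_complete.index.tolist())
print(rows_not_empty.index.tolist())
print(full_columns.columns.tolist())
```

dropna() returns a new object and leaves the original DataFrame unchanged, so the result must be assigned to a variable to be kept.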