Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
Write a program in Python to find which column has the minimum number of missing values in a given dataframe
When working with data analysis, it's common to encounter missing values in DataFrames. Sometimes you need to identify which column has the minimum number of missing values to help guide your data cleaning strategy.
Problem Statement
Given a DataFrame with missing values, we need to find which column has the fewest NaN values. This is useful for determining which columns are most complete in your dataset.
Sample DataFrame
Let's start by creating a sample DataFrame with missing values to demonstrate the solution ?
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Id': [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],
'Salary': [20000, np.nan, 50000, 40000, 80000, np.nan, 350000, 55000, 60000, 70000],
'Age': [22, 23, np.nan, 25, np.nan, 25, 26, 27, np.nan, 24]
})
print("DataFrame is:")
print(df)
DataFrame is:
Id Salary Age
0 1.0 20000.0 22.0
1 2.0 NaN 23.0
2 3.0 50000.0 NaN
3 NaN 40000.0 25.0
4 5.0 80000.0 NaN
5 6.0 NaN 25.0
6 7.0 350000.0 26.0
7 8.0 55000.0 27.0
8 9.0 60000.0 NaN
9 10.0 70000.0 24.0
Solution Approach
To find the column with minimum missing values, we'll follow these steps ?
Count the number of missing values in each column using
isnull().sum()Find the column name with the minimum count using
idxmin()
Method 1: Using isnull() and idxmin()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Id': [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],
'Salary': [20000, np.nan, 50000, 40000, 80000, np.nan, 350000, 55000, 60000, 70000],
'Age': [22, 23, np.nan, 25, np.nan, 25, 26, 27, np.nan, 24]
})
print("DataFrame is:")
print(df)
# Count missing values per column
missing_counts = df.isnull().sum()
print("\nMissing values per column:")
print(missing_counts)
# Find column with minimum missing values
min_missing_column = missing_counts.idxmin()
print(f"\nColumn with minimum missing values: {min_missing_column}")
DataFrame is:
Id Salary Age
0 1.0 20000.0 22.0
1 2.0 NaN 23.0
2 3.0 50000.0 NaN
3 NaN 40000.0 25.0
4 5.0 80000.0 NaN
5 6.0 NaN 25.0
6 7.0 350000.0 26.0
7 8.0 55000.0 27.0
8 9.0 60000.0 NaN
9 10.0 70000.0 24.0
Missing values per column:
Id 1
Salary 2
Age 3
dtype: int64
Column with minimum missing values: Id
Method 2: Using apply() with Lambda
Alternative approach using apply() function with a lambda expression ?
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Id': [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],
'Salary': [20000, np.nan, 50000, 40000, 80000, np.nan, 350000, 55000, 60000, 70000],
'Age': [22, 23, np.nan, 25, np.nan, 25, 26, 27, np.nan, 24]
})
# Using apply with lambda
missing_counts = df.apply(lambda x: x.isnull().sum(), axis=0)
min_missing_column = missing_counts.idxmin()
print(f"Column with minimum missing values: {min_missing_column}")
print(f"Number of missing values: {missing_counts[min_missing_column]}")
Column with minimum missing values: Id Number of missing values: 1
Complete Analysis
Here's a comprehensive function that provides detailed analysis of missing values ?
import pandas as pd
import numpy as np
def analyze_missing_values(df):
missing_counts = df.isnull().sum()
missing_percentages = (missing_counts / len(df)) * 100
# Create summary DataFrame
summary = pd.DataFrame({
'Missing_Count': missing_counts,
'Missing_Percentage': missing_percentages.round(2)
})
print("Missing Values Summary:")
print(summary)
min_column = missing_counts.idxmin()
min_count = missing_counts[min_column]
print(f"\nColumn with minimum missing values: '{min_column}' ({min_count} missing)")
return min_column
# Test the function
df = pd.DataFrame({
'Id': [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],
'Salary': [20000, np.nan, 50000, 40000, 80000, np.nan, 350000, 55000, 60000, 70000],
'Age': [22, 23, np.nan, 25, np.nan, 25, 26, 27, np.nan, 24]
})
result = analyze_missing_values(df)
Missing Values Summary:
Missing_Count Missing_Percentage
Id 1 10.0
Salary 2 20.0
Age 3 30.0
Column with minimum missing values: 'Id' (1 missing)
Key Points
isnull().sum()counts missing values per columnidxmin()returns the index (column name) with minimum valueBoth approaches yield the same result; choose based on preference
Consider analyzing missing percentages for better insights
Conclusion
Use df.isnull().sum().idxmin() to quickly find the column with minimum missing values. This approach helps identify the most complete columns in your dataset for further analysis.
