How to perform dimensionality reduction using Python Scikit-learn?


Dimensionality reduction is an unsupervised machine learning method that reduces the number of feature variables for each data sample by selecting a set of principal features. Principal Component Analysis (PCA) is one of the popular algorithms for dimensionality reduction available in Sklearn.

In this tutorial, we will perform dimensionality reduction using Principal Component Analysis and Incremental Principal Component Analysis with Python Scikit-learn (Sklearn).

Using Principal Component Analysis (PCA)

PCA is a statistical method that linearly projects the data into a new feature space by analyzing the features of the original dataset. The main concept behind PCA is to select the “principal” characteristics of the data and build new features from them. The result is a smaller dataset that retains most of the information contained in the original one.
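As a minimal, hedged sketch of this idea (the three-feature synthetic data below is purely illustrative and not part of the tutorial's dataset), PCA can compress correlated features into fewer components while keeping almost all of the variance:

# Illustrative sketch: reduce 3 correlated features to 2 principal components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                   # 100 samples, 3 features
X[:, 2] = X[:, 0] + 0.1 * X[:, 1]      # third feature is an exact combination of the first two

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # fit and project in one step

print(X_reduced.shape)                          # (100, 2)
print(pca.explained_variance_ratio_.sum())      # close to 1.0, i.e. almost no information lost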

Example

In the below example, we will fit the Iris plant dataset, which comes bundled with the scikit-learn package, with PCA (initialized with 2 components).

# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))

X_iris, Y_iris = iris.data, iris.target

# Initialize PCA and fit the data
pca_2 = decomposition.PCA(n_components=2)
pca_2.fit(X_iris)

# Transforming iris data to new dimensions (with 2 features)
X_iris_pca2 = pca_2.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations : ', X_iris_pca2.shape)
print('\n')

# Getting the direction of maximum variance in data
print("Components : ", pca_2.components_)
print('\n')

# Getting the amount of variance explained by each component
print("Explained Variance:", pca_2.explained_variance_)
print('\n')

# Getting the percentage of variance explained by each component
print("Explained Variance Ratio:", pca_2.explained_variance_ratio_)
print('\n')

# Getting the singular values for each component
print("Singular Values :", pca_2.singular_values_)
print('\n')

# Getting estimated noise covariance
print("Noise Variance :", pca_2.noise_variance_)

Output

It will produce the following output −

Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Features size : (150, 4)

Target names : ['setosa' 'versicolor' 'virginica']

Target size : (150,)
New Dataset size after transformations : (150, 2)

Components : [[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
   [ 0.65658877 0.73016143 -0.17337266 -0.07548102]]

Explained Variance: [4.22824171 0.24267075]

Explained Variance Ratio: [0.92461872 0.05306648]

Singular Values : [25.09996044 6.01314738]

Noise Variance : 0.051022296508184406
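The explained variance ratio above shows that the first two components capture about 97% of the variance in the Iris data. As a short, hedged follow-up sketch (re-loading the data and re-fitting a fresh 2-component PCA rather than reusing the objects above), the transform is simply a centered projection onto components_, and inverse_transform shows how little information the reduction discards:

# Follow-up sketch: relate transform() to the fitted attributes and measure reconstruction error
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_iris = datasets.load_iris().data
pca_2 = PCA(n_components=2).fit(X_iris)

# transform() is equivalent to centering the data and projecting it onto components_
manual = (X_iris - pca_2.mean_) @ pca_2.components_.T
print(np.allclose(manual, pca_2.transform(X_iris)))        # True

# inverse_transform() maps the 2-feature projection back to 4 features;
# the small mean squared reconstruction error shows how little information was lost
X_back = pca_2.inverse_transform(pca_2.transform(X_iris))
print(np.mean((X_iris - X_back) ** 2))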

Using Incremental Principal Component Analysis (IPCA)

Incremental Principal Component Analysis (IPCA) addresses the biggest limitation of Principal Component Analysis (PCA): PCA only supports batch processing, which means all the input data to be processed must fit in memory.

The Scikit-learn ML library provides the sklearn.decomposition.IncrementalPCA module, which makes it possible to implement out-of-core PCA either by using its partial_fit method on sequentially fetched chunks of data or by using np.memmap, a memory-mapped file, without loading the entire file into memory.
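The partial_fit approach might look roughly like the sketch below (the chunking of the in-memory Iris data is purely illustrative; in a real out-of-core setting each chunk would be read from disk or an np.memmap array):

# Illustrative sketch: fit IncrementalPCA chunk by chunk with partial_fit
import numpy as np
from sklearn import datasets
from sklearn.decomposition import IncrementalPCA

X = datasets.load_iris().data
ipca = IncrementalPCA(n_components=2)

# Feed the data one chunk at a time; only one chunk needs to be in memory at once
for chunk in np.array_split(X, 10):
   ipca.partial_fit(chunk)

print(ipca.transform(X).shape)   # (150, 2)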

As with PCA, when decomposing with IPCA the input data is centered but not scaled for each feature before the SVD is applied.
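This can be checked with a small, hedged sketch (plain PCA is used here for an exact comparison, since IPCA's incremental estimate differs slightly): the SVD of the centered, unscaled data reproduces the fitted components.

# Illustrative check: components come from the SVD of centered (not scaled) data
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
Xc = X - X.mean(axis=0)                 # centered, but NOT scaled
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

pca = PCA(n_components=2, svd_solver="full").fit(X)

# Same principal directions, up to the sign of each vector
print(np.allclose(np.abs(Vt[:2]), np.abs(pca.components_)))   # True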

Example

In the below example, we will fit the Iris plant dataset, which comes bundled with the scikit-learn package, with IPCA (initialized with 2 components and batch size = 20).

# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition

# Load iris plant dataset
iris = datasets.load_iris()

# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))

X_iris, Y_iris = iris.data, iris.target

# Initialize IPCA and fit the data
ipca_2 = decomposition.IncrementalPCA(n_components=2, batch_size=20)
ipca_2.fit(X_iris)

# Transforming iris data to new dimensions (with 2 features)
X_iris_ipca2 = ipca_2.transform(X_iris)

# Printing new dataset
print('New Dataset size after transformations : ', X_iris_ipca2.shape)
print('\n')

# Getting the direction of maximum variance in data
print("Components : ", ipca_2.components_)
print('\n')

# Getting the amount of variance explained by each component
print("Explained Variance:", ipca_2.explained_variance_)
print('\n')

# Getting the percentage of variance explained by each component
print("Explained Variance Ratio:", ipca_2.explained_variance_ratio_)
print('\n')

# Getting the singular values for each component
print("Singular Values :", ipca_2.singular_values_)
print('\n')

# Getting estimated noise covariance
print("Noise Variance :", ipca_2.noise_variance_)

Output

It will produce the following output −

Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Features size : (150, 4)

Target names : ['setosa' 'versicolor' 'virginica']

Target size : (150,)
New Dataset size after transformations : (150, 2)

Components : [[ 0.3622612 -0.0850586 0.85634557 0.35805603]
[ 0.64678214 0.73999163 -0.17069766 -0.07033882]]

Explained Variance: [4.22535552 0.24227125]

Explained Variance Ratio: [0.92398758 0.05297912]

Singular Values : [25.09139241 6.0081958 ]

Noise Variance : 0.00713274779746683
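Comparing the two outputs, IPCA recovers components and explained variance ratios that are very close to the exact PCA solution while only ever holding one batch in memory. A short, hedged comparison sketch (re-fitting both estimators on the Iris data):

# Illustrative side-by-side check: IPCA closely approximates the exact PCA solution
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA, IncrementalPCA

X = datasets.load_iris().data
pca = PCA(n_components=2).fit(X)
ipca = IncrementalPCA(n_components=2, batch_size=20).fit(X)

# The components agree up to a small batching error (and possibly a sign flip per component)
diff = np.abs(np.abs(pca.components_) - np.abs(ipca.components_)).max()
print("Max component difference :", diff)
print("PCA  explained variance ratio :", pca.explained_variance_ratio_)
print("IPCA explained variance ratio :", ipca.explained_variance_ratio_)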
