How to Select Important Variables from a Dataset?


Introduction

In machine learning, the features of the data are among the factors that affect a model's performance the most. The features, or variables, should be informative enough to feed to the learning algorithm; a model can often perform well even on a small amount of data, provided that data is of good quality.

A traditional machine learning algorithm generally performs better as it is fed more data. However, beyond a certain quantity of data, the model's performance plateaus and stops improving. This is the point where selecting the right variables can still help us enhance performance.

This article will discuss some of the best approaches for selecting the most useful variables from a dataset, along with their core intuition, working mechanism, and examples.

Feature Selection

Feature selection is a technique used to select the best features from a dataset. In practice, not every feature in a dataset is useful; uninformative features should be dropped or ignored while training and building machine learning models.

There are many methods for selecting variables from a dataset.

Approach 1: Use Your Knowledge

Before jumping directly to complex feature selection methods, using common knowledge about the data to drop uninformative features is the best way to save time and computational effort.

You can use your knowledge of the data and decide accordingly. For example, serial-number columns are usually meaningless for prediction, and columns like ID or No in a regression dataset are not helpful and can be dropped directly, as in the sketch below.
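A minimal pandas sketch of this idea might look as follows; the file name and column names here are only placeholders for whatever identifier-like columns appear in your own data.

import pandas as pd

# load the dataset (the file name is just an illustration)
df = pd.read_csv("data.csv")

# drop identifier-like columns that carry no predictive signal;
# the column names are hypothetical
df = df.drop(columns=["ID", "Serial No"], errors="ignore")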

Approach 2: Use Pearson Correlation

The correlation-based feature selection method is one of the easiest and least computationally expensive methods. Here, the correlation between each independent variable and the dependent variable is calculated, and based on the correlation values, the best features can be selected manually.

#importing pandas
import pandas as pd

#dataframe
df = pd.read_csv("data.csv")

#correlations
df.corr()
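Building on that, one possible way to automate the manual step is to keep only the columns whose absolute correlation with the target exceeds a threshold. In the sketch below, the column name "target" and the threshold of 0.3 are purely illustrative assumptions.

import pandas as pd

df = pd.read_csv("data.csv")

# correlation of every column with a hypothetical target column named "target"
corr_with_target = df.corr()["target"].drop("target")

# keep features whose absolute correlation exceeds an illustrative threshold of 0.3
selected = corr_with_target[corr_with_target.abs() > 0.3].index.tolist()
print(selected)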

Approach 3: Use SelectKBest

SelectKBest is one of the best-known and most useful methods for selecting the most appropriate features from the data. It is most useful in cases where both the dependent and independent columns are in numeric form.

This approach can be easily implemented with scikit-learn. We simply pass the number of best features we want; the selector scores every feature with the chosen scoring function and returns the specified number of top features as the output.

#importing libraries
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# dataset generation: 100 samples, 20 numeric features, 5 of them informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5)

# feature selection: keep the 5 highest-scoring features
fs = SelectKBest(score_func=f_regression, k=5)

# apply feature selection
X_best = fs.fit_transform(X, y)

As we can see in the above code, we define the number of best features we want using the parameter “k.” After applying this code to the data, the resulting shape is (100, 5), where 100 is the number of rows and 5 is the number of best features selected.
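Continuing the snippet above, one way to check which of the original columns were kept, and with what scores, is shown below; it relies only on the fs and X_best objects defined earlier.

# inspect which of the original columns were kept and their scores
import numpy as np

mask = fs.get_support()     # boolean mask over the original features
scores = fs.scores_         # f_regression score for each feature
print("selected feature indices:", np.where(mask)[0])
print("shape after selection:", X_best.shape)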

Approach 4: Use ANOVA Test

The ANOVA test, or analysis of variance test, is another well-known technique frequently used to select the best features from the data. With this test, too, we can define the number of best features we want, and it will produce a reduced dataset accordingly.

This method is mainly used when the input features are numerical and the target is categorical.

Example

# ANOVA feature selection
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# generate dataset: 100 samples, 100 features, 10 of them informative
X, y = make_classification(n_samples=100, n_features=100, n_informative=10)

# define feature selection: keep the 10 highest-scoring features
fs = SelectKBest(score_func=f_classif, k=10)

# apply feature selection
X_best = fs.fit_transform(X, y)

The output of the above code contains the best ten features out of the 100 given features. One convenient way to use such a selector in practice is inside a pipeline, as sketched below.
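This is only one possible arrangement, not the only way to do it; the LogisticRegression estimator and the 5-fold cross-validation are illustrative choices, and the sketch reuses the X, y, SelectKBest, and f_classif names from the block above.

# combine the selector with a model in a pipeline and evaluate it
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

# cross-validated accuracy using only the ten selected features in each fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())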

Approach 5: Use Chi-Square Test

The chi-square method is one of the most well-known statistical methods a data scientist can use for feature selection. This approach is applied when the independent features are categorical (or non-negative counts) and the dependent variable is categorical.

By passing “chi2” as the score_func parameter, we can quickly compute and select the best variables of the data in scikit-learn.

Example

# Chi2 feature selection
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# generate dataset: 100 samples, 100 features, 10 of them informative
X, y = make_classification(n_samples=100, n_features=100, n_informative=10)

# chi2 requires non-negative inputs, so rescale the features to [0, 1]
X = MinMaxScaler().fit_transform(X)

# define feature selection: keep the 10 highest-scoring features
fs = SelectKBest(score_func=chi2, k=10)

# apply feature selection
X_best = fs.fit_transform(X, y)

Similar to the code above, the chi-square score is calculated for every feature, and based on these chi-square values, the best features are returned as the final output.
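Since the chi-square test is most natural on truly categorical predictors, a small sketch of that case is shown below. The tiny DataFrame, its column names, and its values are made up purely for illustration; the categorical columns are one-hot encoded so that chi2 receives non-negative counts.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# a tiny, made-up categorical dataset; column names and values are hypothetical
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "size":  ["S", "M", "L", "S", "M", "L"],
    "label": [1, 0, 1, 0, 0, 1],
})

# one-hot encode the categorical predictors so chi2 receives non-negative values
X = pd.get_dummies(df[["color", "size"]]).astype(int)
y = df["label"]

fs = SelectKBest(score_func=chi2, k=3)
X_best = fs.fit_transform(X, y)

# names of the encoded columns that were kept
print(X.columns[fs.get_support()].tolist())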

Key Takeaways

  • Feature selection is one of the essential steps for reducing computational and storage requirements and for achieving higher model performance when the model is deployed.

  • Domain knowledge and the standard Pearson correlation method can be used for quick feature selection when there is limited time to build the model.

  • The ANOVA and chi-square tests can be used for accurate feature selection when the data is in numerical and categorical form, respectively.

Conclusion

In this article, we discussed some of the best feature selection techniques for choosing the most useful features or variables from a dataset to enhance a model's performance. Knowledge of these approaches will help you perform feature selection on any data efficiently and make the best decisions accordingly.

Updated on: 16-Jan-2023
