Beginner’s guide to feature selection

Different methods for feature selection, why anyone should bother with it, and a comparison of approaches for selecting the optimal feature selection method on a housing dataset.

Tamjid Ahsan
Towards Data Science

--

When working with a large dataset, modelling can be time consuming because of the number of features. It is not uncommon to have hundreds of features for a model, and then it becomes critical to weed out irrelevant and subpar features. This is when the concept of feature selection comes into play. In this article, I will cover a few of the widely used techniques for feature selection and demo some of them.

Feature selection is an extremely important step in building a computationally efficient model. There are a bunch of techniques for this. Let us start by defining the process of feature selection.

Feature selection is the process of selecting a subset of the most relevant predictive features for use in machine learning model building.

Feature elimination helps a model perform better by weeding out redundant features and features that do not provide much insight. It is economical in computing power because there are fewer features to train on. Results are more interpretable, the chance of overfitting is reduced by detecting collinear features, and model accuracy improves if the methods are used intelligently.

Feature Selection Methods:

Highlights of feature selection methods, Image by Author

1. Filter Methods:

Detection Ability: ★★☆☆☆ | Speed: ★★★★★

Picking the best from available set, Photo by UX Indonesia on Unsplash

This is not a machine learning approach. It filters features based on attributes of the features themselves. This approach is model agnostic, i.e., its performance does not depend on the model used.

This method should be used for preliminary screening. It can detect constant, duplicated, and correlated features. It usually does not give the best performance in terms of reducing features. That being said, it should be the first step for feature reduction, as it deals with multicollinearity of the features depending on the method used (a quick sketch follows the examples below).

Process of feature selection using filter methods, Image by Author

A few examples of this:

  1. Univariate selection (ANOVA: Analysis of variance)
  2. Chi Square
  3. Based on Pearson’s correlation
  4. Linear discriminant analysis (LDA): finds a linear combination of features that characterizes or separates two or more classes of a categorical variable
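
As a quick illustration, here is a minimal sketch of this kind of pre-screen, dropping constant and duplicated columns before any model is trained; the toy DataFrame is hypothetical.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],   # constant feature
    "c": [1, 2, 3, 4],   # duplicate of "a"
})

# drop constant features (zero variance)
vt = VarianceThreshold(threshold=0.0)
vt.fit(X)
X = X.loc[:, vt.get_support()]

# drop duplicated features (identical columns)
X = X.loc[:, ~X.T.duplicated()]
print(X.columns.tolist())  # ['a']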

2. Wrapper Methods:

Detection Ability: ★★★★☆ | Speed: ★☆☆☆☆

Machine Learning for the win, Photo by Possessed Photography on Unsplash

This approach uses a machine learning algorithm. Its performance depends on the model selected and the underlying data. It can usually suggest the optimal feature subset, trying different subsets of features to figure out the optimal ones. It is typically very computationally expensive, but can detect interactions between features.

It possibly gives the best performance in terms of feature elimination, but wrappers are terribly slow when it comes to large datasets (a sketch follows the examples below).

Process of feature selection using wrapper methods, Image by Author

A few examples of this:

  1. Forward Selection
  2. Backward Selection
  3. Exhaustive Search
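
A minimal sketch of a wrapper approach using scikit-learn's SequentialFeatureSelector for forward selection; the data and the choice of estimator are hypothetical.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=20, random_state=42)

# greedily add the feature that improves the cross-validated score the most
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features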

3. Embedded Methods:

Detection Ability: ★★★☆☆ | Speed: ★★★☆☆

Human-Machine working together, Photo from pixabay

Embedded methods perform feature selection while building the model. They are generally less computationally expensive than wrapper methods and often provide results that are the best of both worlds, making them a more realistic approach.

Process of feature selection using embedded methods, Image by Author

A few examples of this:

  1. Lasso
  2. Lasso with Ridge (regularized features using ElasticNet)
  3. Tree-based selection
  4. Regression coefficients (features must be standardized)

Hyperparameter tuning is very important for this approach; that is the human intervention component of this method.
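
A minimal sketch of one of the examples above, tree-based selection via SelectFromModel; the data and the choice of estimator are hypothetical.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=20, random_state=42)

# features whose importance falls below the mean importance are dropped
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
selector.fit(X, y)
print(selector.get_support().sum(), "features kept")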

4. Hybrid Methods:

Detection Ability: ★★★☆☆ | Speed: ★★★★☆

An amalgamation of all the techniques above. This approach is less computationally expensive than wrapper methods and has good performance (a sketch of one such technique follows the examples below).

Process of feature selection using hybrid methods, Image by Author

A few examples of this:

  1. Feature shuffling
  2. Recursive feature elimination
  3. Recursive feature addition
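
A minimal sketch of one of the examples above, feature shuffling via permutation importance; the data and the 0.01 cut-off are hypothetical.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# keep only features whose shuffling noticeably degrades the score
keep = result.importances_mean > 0.01
print(keep.sum(), "features kept")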

Take away message:

  1. Filter methods rank features based on their relevance, measured by their correlation with the target variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
  2. Filter methods are much faster than wrapper methods, as they do not actually train models; wrapper methods do, making them computationally expensive and sometimes impossible to run.
  3. Filter methods use selected statistical measures for feature selection, while wrapper methods perform cross validation to determine the best subset of features.
  4. Filter methods might fail to find the best subset of features because of the attributes of the features, but wrapper methods can usually provide the best subset of features.
  5. Wrapper methods tend to make the model more prone to overfitting; a train-test split is a must for these approaches.

To demo these techniques, I performed feature selection based on correlation among features, ANOVA, forward selection, RFE, and Lasso on the “King County housing dataset”.

To keep this article short and not diverge from the star here, feature selection, I will briefly go through the steps that I took for data preparation.

  1. Categorical features are OneHotEncoded.
  2. Dropped duplicates based on the `id` column.
  3. Filled in NaNs and other erroneous inputs with 0 in the `waterfront`, `view`, `yr_renovated`, and `sqft_basement` features.
  4. Cast appropriate dtypes.
  5. Removed outliers from the data.
  6. Scaled the data with MinMaxScaler (this scaler scales and translates each feature individually so that it falls within a given range, e.g., between zero and one) for use in Lasso.
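
A rough sketch of these preparation steps (not the exact notebook code; the file name, the outlier rule, and the list of categorical columns are assumptions):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("kc_house_data.csv")  # file name is an assumption

# drop duplicate listings
df = df.drop_duplicates(subset="id")

# fill NaNs and erroneous entries with 0
for col in ["waterfront", "view", "yr_renovated", "sqft_basement"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0)

# cast appropriate dtypes
df[["waterfront", "view", "yr_renovated"]] = df[["waterfront", "view", "yr_renovated"]].astype(int)

# remove outliers (a simple price cut-off, as an example)
df = df[df["price"] < df["price"].quantile(0.99)]

# one-hot encode categoricals (the column list is an assumption)
df_ohe = pd.get_dummies(df, columns=["zipcode", "grade", "view", "condition", "waterfront"], dtype=int)

# scale numeric features to [0, 1] for use in Lasso
scaled = MinMaxScaler().fit_transform(df_ohe.select_dtypes("number"))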

The notebook containing this work, named Feature_Selection.ipynb, can be found here on GitHub.

  1. Filter Methods:
  • Based on Pearson’s correlation

I use a function along these lines to get correlated features.
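
A minimal sketch of such a helper (the notebook's version may differ; the DataFrame name df_model_processed and the 0.8 threshold are assumptions):

def get_correlated_features(df, threshold=0.8):
    """Flag any feature that is highly correlated with an earlier feature."""
    correlated = set()
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                correlated.add(corr_matrix.columns[i])
    return correlated

corr_features = get_correlated_features(df_model_processed, threshold=0.8)
print('correlated features: ', len(corr_features))
print('correlated features:', corr_features)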

The result of this when run on the cleaned dataset is the following:

correlated features:  1
correlated features: {'sqft_above'}

This means the `sqft_above` feature is correlated with other features and should be dropped.

  • Univariate selection (ANOVA)

I used a short piece of code to perform ANOVA.
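
A sketch of such a univariate (ANOVA-style) screen using f_regression from scikit-learn; the notebook's exact code may differ, and the r squared column derived from the F statistic is added here for illustration:

import pandas as pd
from sklearn.feature_selection import f_regression

# hypothetical names: the OneHotEncoded DataFrame and the target
X = df_model_processed_ohe.drop(columns='price')
y = df_model_processed_ohe['price']

f_stats, p_values = f_regression(X, y)
n = len(y)
anova_results = pd.DataFrame({
    'feature': X.columns,
    'F': f_stats,
    'p_value': p_values,
    'r_squared': f_stats / (f_stats + n - 2),  # univariate r squared per feature
}).sort_values('r_squared', ascending=False)
print(anova_results.head(10))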

I sorted the result based on r squared. The result is the following.

ANOVA results sorted by r squared, Image by Author

From this I get an idea of the most important features to include in my model, e.g., grade, sqft_living, zipcode, and so on.

Feature p-values plotted, Image by Author

When plotted, I also get a sense of their importance based on their p-values, with grade and condition having very high p-values.

2. Wrapper Methods:

  • Forward selection

I used a function along these lines for forward selection on the OneHotEncoded DataFrame.
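
A sketch of such a forward selection helper, following the well-known statsmodels pattern of greedily adding the feature that most improves adjusted R squared (the notebook's version also produces diagnostic plots, which are omitted here):

import statsmodels.formula.api as smf

def forward_selected(data, response):
    """Forward selection on a DataFrame, scored by adjusted R squared."""
    remaining = set(data.columns) - {response}
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response, " + ".join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response, " + ".join(selected))
    return smf.ols(formula, data).fit()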

Then I ran the following code.

model = forward_selected(df_model_processed_ohe, 'price')
print(model.model.formula)
print(model.rsquared_adj)
model.summary()

The result gave me the features ranked from most important to least important.

price ~ sqft_living + yr_built + sqft_lot + sqft_living15 + zipcode_98004 + grade_9 + grade_8 + grade_10 + grade_7 + zipcode_98023 + zipcode_98033 + zipcode_98040 + zipcode_98092 + zipcode_98042 + zipcode_98003 + zipcode_98058 + zipcode_98038 + zipcode_98030 + zipcode_98031 + zipcode_98055 + zipcode_98002 + zipcode_98198 + zipcode_98032 + zipcode_98178 + zipcode_98168 + zipcode_98022 + zipcode_98112 + view_4 + zipcode_98199 + zipcode_98115 + zipcode_98103 + zipcode_98117 + zipcode_98119 + zipcode_98105 + zipcode_98107 + zipcode_98109 + zipcode_98116 + zipcode_98102 + zipcode_98122 + zipcode_98052 + zipcode_98006 + zipcode_98005 + zipcode_98053 + zipcode_98136 + zipcode_98144 + zipcode_98008 + zipcode_98029 + condition_5 + view_2 + zipcode_98188 + view_3 + zipcode_98027 + zipcode_98007 + zipcode_98074 + zipcode_98075 + zipcode_98034 + zipcode_98125 + zipcode_98039 + zipcode_98126 + zipcode_98177 + grade_11 + zipcode_98133 + zipcode_98118 + sqft_basement + condition_4 + yr_renovated + view_1 + zipcode_98155 + waterfront_1 + zipcode_98072 + zipcode_98011 + zipcode_98065 + zipcode_98028 + bathrooms + zipcode_98106 + floors + zipcode_98108 + zipcode_98077 + zipcode_98146 + zipcode_98056 + zipcode_98059 + zipcode_98045 + zipcode_98019 + zipcode_98166 + zipcode_98014 + zipcode_98024 + zipcode_98010 + condition_3 + zipcode_98148 + zipcode_98070 + grade_5 + bedrooms + sqft_lot15 + 1
r_sq: 0.8315104663680916

This also returns plots for checking homoscedasticity.
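
Such diagnostic plots can be produced along these lines (a sketch, not the notebook's exact plotting code):

import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# residuals vs. fitted values for the model returned by forward_selected
ax1.scatter(model.fittedvalues, model.resid, alpha=0.3)
ax1.axhline(0, color='red')
ax1.set(xlabel='Fitted values', ylabel='Residuals', title='Residuals vs. fitted')
# Q-Q plot of the residuals
sm.qqplot(model.resid, line='45', fit=True, ax=ax2)
plt.show()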

Residual diagnostic plots, Image by Author

I can see that the residual plot is far from perfect with an apparent bias in the model.

3. Embedded Methods:

I performed Lasso for this demo, using LassoCV from sklearn.linear_model.
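
A sketch of that step; X_scaled (the MinMax-scaled feature DataFrame) and y (price) are hypothetical names:

from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_scaled, y)

# features whose coefficient was shrunk to exactly zero are dropped
selected = X_scaled.columns[lasso.coef_ != 0]
print('Model r squared', lasso.score(X_scaled, y))
print('Number of Selected Features:', len(selected))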

There were 95 features to begin with. The result of the above code is:

Model r squared 0.8307881074614486
Number of Selected Features: 91

Lasso reduced the feature set to 91. These features were dropped: `sqft_lot15`, `zipcode_98070`, `zipcode_98148`, `grade_8`.

4. Hybrid Methods:

To demo this, I used SVR and RFE from scikit-learn, from the sklearn.svm and sklearn.feature_selection modules. If nothing is passed to RFE's `n_features_to_select` parameter, half of the features are selected. I used code along the following lines for this feature selection.
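
A sketch of that code; X_scaled and y are hypothetical names, and RFE needs an estimator that exposes coef_ or feature_importances_, hence the linear kernel:

from sklearn.feature_selection import RFE
from sklearn.svm import SVR

estimator = SVR(kernel='linear')
selector = RFE(estimator, step=1)  # n_features_to_select=None -> keep half of the features
selector.fit(X_scaled, y)

print('Model r squred:', selector.score(X_scaled, y))
print('number of selected feature', selector.n_features_)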

The result of this is:

Model r squred: 0.776523847362918
number of selected feature 47

Selecting features for a model is more of an art, where the discretion used is important. An often-used quote in modelling is "garbage in, garbage out." This is also true for feature selection. We have to be careful about the features we select while modelling, as sometimes less is more.

In the above demo, some techniques performed well while some tanked. It is up to the data scientist to select the optimal set of features for their model based on the goal of the analysis. That being said, these techniques can come in handy for any data scientist who wants to be an efficient one.

Another approach is principal component analysis (PCA) for dimensionality reduction. It reduces the number of features while retaining most of their information, but compromises the interpretability of the model.

That is all for today. Until next time!
