Smart Robust Feature Selection (SoFt) for imbalanced and heterogeneous data

Date

2021-11-30

Journal Title

Knowledge-Based Systems

Publisher

Elsevier

Type

Article

ISSN

0950-7051

Citation

Lee GK, Kasim H, Sirigina RP, et al. (2022) Smart Robust Feature Selection (SoFt) for imbalanced and heterogeneous data. Knowledge-Based Systems, Volume 236, January 2022, Article number 107197.

Abstract

Designing a smart and robust predictive model that can handle imbalanced data and a heterogeneous set of features is paramount to its widespread adoption by practitioners. By smart, we mean the model is either parameter-free or works well with default parameters, avoiding the challenge of parameter tuning. Furthermore, a robust model should consistently achieve high accuracy regardless of the dataset's characteristics (class imbalance, heterogeneous features) or its domain (e.g., medical or financial). To this end, a computationally inexpensive yet robust predictive model named smart robust feature selection (SoFt) is proposed. SoFt combines a learning algorithm with a filter-based feature selection algorithm named multi evaluation criteria and Pareto (MECP). Two state-of-the-art gradient boosting methods (GBMs), CatBoost and H2O GBM, are considered as candidate learning algorithms; CatBoost is selected over H2O GBM due to its robustness with both default and tuned parameters. MECP uses multiple parameter-free feature scores to rank the features. SoFt is validated against CatBoost with the full feature set and against wrapper-based CatBoost. On imbalanced datasets, SoFt is robust and consistent: both the average and the standard deviation of the log loss are low across the folds of K-fold cross-validation. The features selected by MECP are also stable, i.e., the features chosen by SoFt and by wrapper-based CatBoost agree across folds, demonstrating the effectiveness of MECP. On balanced datasets, however, MECP selects too few features, and hence the log loss of SoFt is significantly higher than that of CatBoost with the full feature set.
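
To make the pipeline described above concrete, here is a minimal Python sketch. The abstract does not name the paper's parameter-free feature scores, so mutual information and the ANOVA F-statistic are used below as assumed stand-ins, and the function names (mecp_select, pareto_front, soft_log_loss) are hypothetical, not the authors' implementation: features are scored on several criteria, the Pareto-optimal (non-dominated) ones are kept, and CatBoost with default parameters is evaluated by log loss under stratified K-fold cross-validation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from catboost import CatBoostClassifier


def pareto_front(scores):
    """Indices of non-dominated rows: drop feature i only if some feature j
    scores at least as high on every criterion and strictly higher on one."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i]):
                keep[i] = False
                break
    return np.flatnonzero(keep)


def mecp_select(X, y):
    """Score every feature under two criteria (assumed stand-ins for the
    paper's parameter-free scores) and keep the Pareto-optimal set."""
    mi = mutual_info_classif(X, y, random_state=0)
    f_stat, _ = f_classif(X, y)
    return pareto_front(np.column_stack([mi, f_stat]))


def soft_log_loss(X, y, n_splits=5):
    """Mean and standard deviation of the log loss across stratified folds,
    with CatBoost left at its defaults per the paper's "smart" criterion."""
    losses = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        cols = mecp_select(X[train_idx], y[train_idx])  # select on train fold only
        model = CatBoostClassifier(verbose=0)           # default parameters
        model.fit(X[train_idx][:, cols], y[train_idx])
        proba = model.predict_proba(X[test_idx][:, cols])
        losses.append(log_loss(y[test_idx], proba))
    return float(np.mean(losses)), float(np.std(losses))
```

Note that a two-criterion Pareto front can be very small, which is consistent with the abstract's observation that MECP selects too few features on balanced datasets; the published MECP may use more criteria or retain additional Pareto layers.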

Keywords

Class-imbalanced data, Heterogeneous features, Boosting algorithms, Feature selection, CatBoost, H2O GBM

Rights

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
