Smart Robust Feature Selection (SoFt) for imbalanced and heterogeneous data

dc.contributor.author: Lee, Gary Kee Khoon
dc.contributor.author: Kasim, Henry
dc.contributor.author: Sirigina, Rajendra Prasad
dc.contributor.author: How, Shannon Shi Qi
dc.contributor.author: King, Stephen
dc.contributor.author: Hung, Terence Gih Guang
dc.date.accessioned: 2022-01-21T12:23:04Z
dc.date.available: 2022-01-21T12:23:04Z
dc.date.issued: 2021-11-30
dc.description.abstract: Designing a smart and robust predictive model that can deal with imbalanced data and a heterogeneous set of features is paramount to its widespread adoption by practitioners. By smart, we mean the model is either parameter-free or works well with default parameters, avoiding the challenge of parameter tuning. Furthermore, a robust model should consistently achieve high accuracy regardless of the dataset (imbalanced, with a heterogeneous set of features) or domain (such as medical or financial). To this end, a computationally inexpensive and yet robust predictive model named smart robust feature selection (SoFt) is proposed. SoFt involves selecting a learning algorithm and designing a filter-based feature selection algorithm named multi-evaluation-criteria and Pareto (MECP). Two state-of-the-art gradient boosting methods (GBMs), CatBoost and H2O GBM, are considered as candidate learning algorithms. CatBoost is selected over H2O GBM due to its robustness with both default and tuned parameters. MECP uses multiple parameter-free feature scores to rank the features. SoFt is validated against CatBoost with a full feature set and wrapper-based CatBoost. SoFt is robust and consistent for imbalanced datasets, i.e., the average value and standard deviation of log loss are low across the different folds of K-fold cross-validation. The features selected by MECP are also consistent, i.e., the features selected by SoFt and wrapper-based CatBoost agree across folds, demonstrating the effectiveness of MECP. For balanced datasets, MECP selects too few features, and hence the log loss of SoFt is significantly higher than that of CatBoost with a full feature set.
dc.identifier.citation: Lee GK, Kasim H, Sirigina RP, et al., (2022) Smart Robust Feature Selection (SoFt) for imbalanced and heterogeneous data. Knowledge-Based Systems, Volume 236, January 2022, Article number 107197
dc.identifier.issn: 0950-7051
dc.identifier.uri: https://doi.org/10.1016/j.knosys.2021.107197
dc.identifier.uri: https://dspace.lib.cranfield.ac.uk/handle/1826/17470
dc.language.iso: en
dc.publisher: Elsevier
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Class-imbalanced data
dc.subject: Heterogeneous features
dc.subject: Boosting algorithms
dc.subject: Feature selection
dc.subject: CatBoost
dc.subject: H2O GBM
dc.title: Smart Robust Feature Selection (SoFt) for imbalanced and heterogeneous data
dc.type: Article
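The abstract describes MECP as ranking features by multiple parameter-free scores and applying a Pareto criterion. The paper's exact algorithm is not reproduced here; the following is only a minimal sketch of the general idea of Pareto-non-dominated feature selection under several score criteria. The function name and the example scores are hypothetical and for illustration only.

```python
# Illustrative sketch, NOT the paper's MECP implementation: keep the
# Pareto-non-dominated features under several feature scores, where a
# feature is dominated if another feature scores at least as high on
# every criterion and strictly higher on at least one.

def pareto_front(scores):
    """Return indices of non-dominated features.

    scores: one tuple of criterion values per feature; higher is
    better for every criterion.
    """
    n = len(scores)
    front = []
    for i in range(n):
        dominated = any(
            all(sj >= si for si, sj in zip(scores[i], scores[j]))
            and any(sj > si for si, sj in zip(scores[i], scores[j]))
            for j in range(n)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical scores for four features under two criteria
# (e.g., a mutual-information score and a correlation-based score):
scores = [(0.9, 0.2), (0.5, 0.5), (0.3, 0.9), (0.2, 0.1)]
print(pareto_front(scores))  # -> [0, 1, 2]: feature 3 is dominated by feature 1
```

Keeping only the Pareto front means no weighting between the criteria has to be tuned, which is consistent with the abstract's emphasis on parameter-free operation.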

Files

Original bundle
Name: SoFt_for_imbalanced_and_heterogeneous_data-2021.pdf
Size: 954.47 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.63 KB
Description: Item-specific license agreed upon to submission