Interpretability of selected variables and performance comparison of variable selection methods in a polyethylene and polypropylene NIR classification task
2021
Abstract: Near-infrared (NIR) spectra are collected as a large number of absorption values, which usually greatly exceeds the sample size. Variable selection methods are employed in NIR spectroscopy to avoid issues related to the “curse of dimensionality”. In this paper, we examined the interpretability of selected variables, that is, how closely the selected wavelengths relate to the chemical structure of the materials studied, and whether this relation matters for classification performance. Additionally, we examined how classification performance depends on the number of selected variables. For this purpose, relative standard deviation (RSD), successive projection algorithm (SPA), stepwise decorrelation of variables (SELECT), genetic algorithm (GA), principal component analysis (PCA), and random (RANDOM) variable selection were applied in two-class classification modelling using linear discriminant analysis (LDA) or a support vector machine (SVM). Different pre-treatments and sample sizes were considered. Variable selection improved classification performance, and the variables selected by a majority of the methods considered could readily be related to chemical structure. However, both interpretability and the gain or loss in performance depend strongly on the number of selected variables. Because the selected variables are highly chemically interpretable, some variable selection methods could be used to determine the characteristic absorption bands of a material. SELECT and SPA displayed the best properties among the methods considered. To avoid faulty results, optimization of the number of selected variables should be treated as a crucial stage of the variable selection process.
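As a rough illustration of the simplest method named in the abstract, the sketch below ranks wavelengths by relative standard deviation and keeps the top k. The abstract does not state the exact RSD criterion used in the paper, so the rule here (RSD = sample standard deviation divided by the absolute mean, highest values kept) is an assumption, and the function names `rsd_per_variable` and `select_top_k` as well as the toy spectra are hypothetical.

```python
import statistics


def rsd_per_variable(spectra):
    """Relative standard deviation (std / |mean|) of each column (wavelength)."""
    n_vars = len(spectra[0])
    rsd = []
    for j in range(n_vars):
        column = [row[j] for row in spectra]
        mean = statistics.fmean(column)
        std = statistics.stdev(column)  # sample standard deviation
        rsd.append(std / abs(mean) if mean != 0 else 0.0)
    return rsd


def select_top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up): 4 samples x 5 hypothetical wavelengths.
spectra = [
    [1.00, 0.50, 2.0, 0.10, 3.0],
    [1.01, 0.80, 2.0, 0.30, 3.1],
    [0.99, 0.20, 2.0, 0.20, 2.9],
    [1.00, 0.50, 2.0, 0.40, 3.0],
]

scores = rsd_per_variable(spectra)
selected = select_top_k(scores, k=2)
print(selected)
```

In line with the abstract's point about the number of selected variables, `k` would normally be tuned, for example by cross-validating the downstream LDA or SVM classifier over a range of values, rather than fixed in advance.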