MCEN: a method of simultaneous variable selection and clustering for high-dimensional multinomial regression

2019 
Multinomial regression is often used to investigate the association between potential independent variables and multi-class nominal responses such as multiple disease subtypes. However, it cannot identify groups of variables that have similar effects on predicting the same disease subtypes, which is an important problem in biomedical research. Clustering variables in this setting is not trivial, since correlated variables may have distinct predictive effects on the multi-class nominal responses. For example, a group of moderately to highly correlated expressed genes may be associated with different subtypes of a disease. This paper presents a new data-driven simultaneous variable selection and clustering method for high-dimensional multinomial regression. By using a novel penalty function that incorporates both regression coefficients and pairwise correlation to define clusters of variables, the proposed method provides a one-stop solution for simultaneously selecting and grouping important variables associated with different classes of a multinomial response. An alternating minimization algorithm, which incorporates both convex optimization and clustering steps, is developed to solve the resulting optimization problem. The proposed method is compared with the state of the art in terms of prediction and variable clustering performance through extensive simulation studies. In addition, three real data examples are presented to demonstrate how to apply our method and to further verify the findings of our simulation studies. The results of the simulation and real data studies also shed light on the strengths and weaknesses of several different penalized regression methods with respect to variable clustering and prediction in different scenarios.