Accurate feature extraction plays a vital role in machine learning, pattern recognition, and image processing. Feature extraction methods based on principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA) can improve classifier performance. In this paper, we propose two feature extraction approaches that integrate the features extracted by PCA and ICA through statistical criteria. The performance of the proposed approaches is evaluated on simulated data and three public data sets using the cross-validation accuracy of classifiers drawn from the statistics and machine learning literature. Our experimental results show that the integrated ICA and PCA features are more effective for classification analysis than the alternatives.
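A minimal sketch of this kind of PCA/ICA feature integration, assuming scikit-learn; the wine data, component counts, and logistic-regression classifier are illustrative stand-ins, not the paper's actual setup:

```python
# Minimal sketch: concatenate PCA and ICA projections into one feature space
# and evaluate with cross-validation. Dataset and settings are illustrative.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, FastICA
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Integrated feature space: PCA scores side by side with ICA scores.
integrated = FeatureUnion([
    ("pca", PCA(n_components=5)),
    ("ica", FastICA(n_components=5, max_iter=500, random_state=0)),
])

clf = make_pipeline(StandardScaler(), integrated, LogisticRegression(max_iter=1000))
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```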
Machine learning (ML) is important in the treatment of heart disease because it can analyze large amounts of patient data, such as medical records, imaging tests, and genetic information, to identify patterns and predict the risk of developing heart disease. However, most ML algorithms require accurate data to build a reliable prediction model and do not tolerate missing values. Handling missing risk factors is critical during dataset preprocessing and becomes more difficult when a risk factor is missing entirely. Removing such a completely missing feature may discard critical information, yet no readily available imputation method addresses this case, which presents a significant challenge. To overcome this difficulty, in this study we impute the missing values using statistical multiple linear regression and Huber regression (HR) methods on four blended datasets (Statlog, Cleveland, Hungarian, and Switzerland) sourced from the UCI ML repository. The combined dataset comprises 14 attributes, including one target variable; however, in the Switzerland dataset, one feature ("serum cholesterol") is entirely missing. In the proposed imputation methods, the missing "serum cholesterol" is predicted from related factors, including "chest pain," "maximum heart rate," "type of defect," "exercise-induced ST depression relative to rest," and "exercise-induced angina." We also propose applying a majority-voting ensemble technique to the individual and integrated datasets using ML algorithms as part of the risk-factor identification strategy. The results show that our proposed stacked algorithm on the combined dataset with the ensemble features achieved a significantly improved accuracy of 93.47% and an AUC score of 94.50%, demonstrating more accurate and earlier prediction than previous research while also providing diversity, resilience, generalization, and adaptability across varied datasets.
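As a hedged illustration of the imputation idea, the sketch below fits a Huber (or ordinary linear) regression on the rows where the target feature is observed and predicts it where it is missing; the column names follow the common UCI heart-disease naming (chol, cp, thalach, thal, oldpeak, exang) and are assumptions, as is the helper `impute_missing_feature`:

```python
# Hedged sketch: impute a feature that is entirely missing in one source
# dataset (e.g., "chol" in Switzerland) by regressing it on related
# predictors learned from the rows where it is observed.
import pandas as pd
from sklearn.linear_model import HuberRegressor, LinearRegression

def impute_missing_feature(df, target_col, predictor_cols, robust=True):
    """Fit a regression on rows where target_col is observed,
    then predict it for rows where it is missing."""
    observed = df[df[target_col].notna()]
    missing = df[df[target_col].isna()]
    model = HuberRegressor() if robust else LinearRegression()
    model.fit(observed[predictor_cols], observed[target_col])
    df = df.copy()
    df.loc[missing.index, target_col] = model.predict(missing[predictor_cols])
    return df

# Hypothetical usage with UCI-style column names on the blended data:
# combined = impute_missing_feature(combined, "chol",
#                                   ["cp", "thalach", "thal", "oldpeak", "exang"])
```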
Liver disease refers to inflammatory conditions of the liver, liver cirrhosis, cancer, or an overload of toxic substances. A liver transplant may restore and extend life when a patient has severe liver disease. In recent years, machine learning (ML) based diagnosis systems have played a vital role in assessing liver patients, which ultimately leads to proper treatment and saves lives. In this study, we predict liver disease by adopting a hybrid feature extraction method to enhance the performance of ML algorithms. Medical data frequently exhibit non-linear patterns and class imbalance, which degrade the performance of most ML algorithms. Here, we present a hybrid feature space that combines t-SNE and Isomap non-linear features with kernel principal components explaining 90% of the variation in the data as a solution to this issue. Before feeding the ML model, data preprocessing techniques including class balancing, outlier identification, and missing-value imputation are applied. A simulation study and ensemble learning are also conducted to justify the proposed prediction performance. Our suggested hybrid non-linear features yield a 2-20% improvement over existing studies, and the ensemble classifier achieved an outstanding accuracy of 91.33%.
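A minimal sketch of such a hybrid non-linear feature space with scikit-learn is shown below; the breast-cancer data and component counts are placeholders, and since t-SNE has no out-of-sample transform, the embedding is fit on the full matrix purely for illustration:

```python
# Sketch of the hybrid feature space: kernel PCA components covering ~90% of
# the retained kernel eigenvalue mass, concatenated with t-SNE and Isomap
# embeddings. Dataset and component counts are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE, Isomap

X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

kpca = KernelPCA(n_components=10, kernel="rbf").fit(X)
lam = kpca.eigenvalues_
k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), 0.90)) + 1  # ~90% cutoff
Z_kpca = kpca.transform(X)[:, :k]

Z_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # illustration only
Z_iso = Isomap(n_components=2).fit_transform(X)

X_hybrid = np.hstack([Z_kpca, Z_tsne, Z_iso])  # combined non-linear features
```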
Kurtosis plays an important role in defining the shape characteristics of a probability distribution, and also in extracting and sorting independent components. Recent research on various versions of classical kurtosis shows that all the measures substantially underestimate the kurtosis parameter and exhibit high variability when the underlying population distribution is highly skewed or heavy-tailed. This is undesirable for ICA. In this work, we propose a bootstrap bias-corrected estimator and compare it with the version of the classical measure found to be best in recent studies, using both simulated and real data. Our proposed estimator performs better in both cases. We then apply our measure to sorting independent components in two data sets and examine the capacity of PCA, ICA, and ICA on PCA for finding groups. In both data sets, ICA on PCA shows the greatest discriminating power and PCA the least. We recommend using our proposed measure for both extracting and sorting independent components.
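One standard form of bootstrap bias correction, theta_bc = 2*theta_hat - mean(theta_boot), is sketched below; the exact estimator proposed above may differ in the base kurtosis measure it corrects:

```python
# Sketch of a bootstrap bias-corrected kurtosis estimator under the standard
# correction theta_bc = 2*theta_hat - mean(theta_boot). The base measure here
# is classical (Pearson) kurtosis b2; the paper's variant may differ.
import numpy as np
from scipy.stats import kurtosis

def bc_kurtosis(x, n_boot=2000, seed=None):
    rng = np.random.default_rng(seed)
    theta_hat = kurtosis(x, fisher=False)  # classical kurtosis b2
    boot = np.array([
        kurtosis(rng.choice(x, size=len(x), replace=True), fisher=False)
        for _ in range(n_boot)
    ])
    return 2.0 * theta_hat - boot.mean()   # subtract the estimated bias

x = np.random.default_rng(0).lognormal(size=200)  # skewed, heavy-tailed sample
print(bc_kurtosis(x, seed=1))
```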
In the present study, pulque (Agave cantala) fibers treated with acetic anhydride were grafted with styrene monomer by free radical graft copolymerization techniques. The grafted fibers were then characterized by Fourier transform infrared spectroscopy (FTIR), scanning electron microscopy (SEM), thermogravimetric analysis (TGA), and tensile mechanical tests. Untreated raw pulque fibers were also taken as a control for comparison. FTIR analysis confirmed the grafting of styrene monomer onto the pulque fibers. It was also found that the polystyrene-grafted, acetic anhydride-treated pulque fibers displayed higher thermal resistance, improved tensile properties, and lower moisture content compared to the untreated raw pulque fibers.
Nomophobia is a term describing a growing fear in today's world: the fear of being without a mobile device or beyond mobile phone contact. It is often described as one of the biggest non-drug addictions of the 21st century and mainly affects teenage students. Those experiencing nomophobia may feel panic, anxiety, or distress when separated from their mobile phones. This work uses statistical tools to identify the risk factors of nomophobia and machine learning techniques to propose a fresh way to measure and understand it. To create a predictive model, we gathered information from a broad sample (n = 357) of smartphone users and applied a variety of machine learning methods. Using a questionnaire covering 17 factors, statistical significance tests (p < 0.05), and ordinal logistic regression on respondents' age, level of education, CGPA, self-evaluation, per-day mobile phone usage, and media use, we identify the most important features contributing to nomophobia. The context of maximum phone usage is a feature that strongly affects nomophobia, and about 201 respondents are at a moderate level of addiction. To develop a predictive model, decision tree (DT), random forest (RF), Gaussian Naïve Bayes (NB), and support vector machine (SVM) classifiers are used to recognize nomophobia addiction, and an ensemble method is proposed to refine predictive performance. From the analysis, we find that the SVM feature selector with the ensemble algorithm classifies the extent of smartphone addiction with a 57% accuracy rate. Our findings show that nomophobia tendencies can be captured and predicted by machine learning approaches, distinguishing students who show symptoms of nomophobia from those who do not. This machine-learning-based study presents a viable tool for diagnosing and treating nomophobia in students, ultimately assisting in the creation of targeted interventions and preventive measures.
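A plausible sketch of the reported pipeline, an SVM-based feature selector feeding a majority-voting ensemble of DT, RF, Gaussian NB, and SVM, is given below with scikit-learn; the hyperparameters are placeholders, not the study's tuned values:

```python
# Illustrative pipeline: SVM-based feature selection followed by a
# majority-voting ensemble of the four classifiers named in the study.
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Select features by the magnitude of linear-SVM coefficients.
selector = SelectFromModel(LinearSVC(C=0.1, dual=False))

ensemble = VotingClassifier([
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("svm", SVC()),
], voting="hard")  # majority vote

model = make_pipeline(StandardScaler(), selector, ensemble)
# Hypothetical usage on the survey features:
# model.fit(X_train, y_train); print(model.score(X_test, y_test))
```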
Osteoporosis, a common skeletal disorder, necessitates the identification of its risk factors to develop effective preventive measures. It is crucial to identify the underlying risk factors and their relationships with the response class attribute. Different machine learning (ML) algorithms and feature selection approaches are used to estimate the risk of osteoporosis. However, ML-based algorithms may struggle to detect risk factors and to grade osteoporosis due to the different measurement scales of the data and their distributional assumptions; when these assumptions are violated, for example under heteroscedasticity (unequal variance), results may be interpreted improperly. In this study, we seek to overcome these distributional constraints and improve the interpretability of our results by using rigorous statistical approaches, ensuring a robust and trustworthy analysis of osteoporosis risk factors. The study dataset consists of 40 clinical, lifestyle, and genetic attributes, allowing for a comprehensive analysis of potential risk factors associated with osteoporosis. After confirming the normality assumption using the Kolmogorov-Smirnov and Shapiro-Wilk tests, independent t-tests show that the factors ALT, FBG, HDL-C, LDL-C, FNT, TL, TLT, and URIC have a substantial impact on the risk of developing osteoporosis. The Mann-Whitney U test for the non-normal FN variable likewise gave a p-value below 0.05, indicating that this variable has a significant effect on the likelihood of developing osteoporosis. Based on the chi-square test p-values for the categorical factors, gender, calcium, calcitriol, bisphosphonate, calcitonin, COPD, CAD, and drinking are significantly associated with osteoporosis risk. For the predictive Gaussian process (GP) model, we propose two customized integrated GP kernels to enhance the modeling of complex relationships within the data. The proposed GP kernel model (modified kernel 2) outperforms the individual kernels in this experiment, with the best accuracy score of 86.64% and AUC score of 86.63% on the osteoporosis data. Moreover, a simulation study is conducted to assess the robustness of the proposed model; across different evaluation metrics, the results improve by 0.60-11.41% in accuracy and 0.50-11.60% in AUC.
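The exact form of the proposed "modified kernel 2" is not given here, so the following scikit-learn sketch only illustrates how an integrated GP kernel can be composed from standard building blocks by kernel arithmetic:

```python
# Illustrative composition of an integrated GP kernel: a scaled RBF term plus
# a Matern x RationalQuadratic product. This stands in for the idea of a
# customized integrated kernel; it is not the paper's "modified kernel 2".
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import (
    RBF, Matern, RationalQuadratic, ConstantKernel,
)

kernel = (ConstantKernel(1.0) * RBF(length_scale=1.0)
          + Matern(length_scale=1.0, nu=1.5) * RationalQuadratic())

gpc = GaussianProcessClassifier(kernel=kernel, random_state=0)
# Hypothetical usage on the osteoporosis features:
# gpc.fit(X_train, y_train); probs = gpc.predict_proba(X_test)
```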
Non-communicable diseases, such as cardiovascular disease, cancer, chronic respiratory diseases, and diabetes, are responsible for approximately 71% of all deaths worldwide. Stroke, a cerebrovascular disorder, is among the top three causes of death and one of the leading contributors to this burden. Early recognition of symptoms can encourage a balanced lifestyle and provide essential information for stroke prediction. Machine learning (ML) is a key tool for physicians in identifying stroke patients and risk factors. However, due to different data measurement scales and distributional assumptions, ML-based algorithms struggle to detect risk factors, and when risk factors have high-dimensional feature representations, learning algorithms struggle with complexity. In this study, rigorous statistical tests are used to identify risk factors, and PCA-FA (integration of principal components and factors) and FPCA (factor-based PCA) approaches are proposed to project suitable feature representations that improve learning algorithm performance. The study dataset contains 5110 patient records with clinical, lifestyle, and genetic attributes, allowing for a comprehensive analysis of potential risk factors associated with stroke. Using significance tests (p < 0.05), chi-square and independent-sample t-tests identified age, heart_disease, hypertension, work_type, ever_married, bmi, and smoking_status as risk factors for stroke. Among the models built on the proposed feature extraction techniques, the random forest approach with the PCA-FA method gives the best results: an accuracy of 92.55% and an AUC score of 98.15%, an improvement in prediction accuracy of 2.19% to 19.03% over existing work. Additionally, the prediction results are shown to be robust and reproducible with a stacking ensemble-based classification algorithm. We also developed a web-based application based on the findings of this study, which could serve as an additional tool to help doctors diagnose stroke risk.
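One plausible reading of the two proposed projections is sketched below with scikit-learn: PCA-FA concatenates principal-component scores with factor scores, while FPCA applies PCA to the factor scores; the component counts and helper names are illustrative assumptions:

```python
# Hedged sketch of the two feature projections as described above.
# pca_fa: PCA scores concatenated with factor-analysis scores (PCA-FA).
# fpca:   PCA applied to factor-analysis scores (factor-based PCA).
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

def pca_fa(X, n_pc=5, n_factors=5):
    Z_pca = PCA(n_components=n_pc).fit_transform(X)
    Z_fa = FactorAnalysis(n_components=n_factors).fit_transform(X)
    return np.hstack([Z_pca, Z_fa])  # integrated feature matrix

def fpca(X, n_factors=8, n_pc=5):
    Z_fa = FactorAnalysis(n_components=n_factors).fit_transform(X)
    return PCA(n_components=n_pc).fit_transform(Z_fa)  # PCA on factor scores
```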
For the last two decades, clustering has been a well-recognized area in data mining research. Data clustering plays a major role in pattern recognition, signal processing, bioinformatics, and artificial intelligence. Clustering is an unsupervised learning technique that groups objects by similarity, such that objects within the same group are similar and objects in different groups are dissimilar. This paper analyzes three techniques, K-means, principal component analysis (PCA), and independent component analysis (ICA), on real and simulated data of different types. Recent developments have found a rather unexpected application of ICA theory in data clustering, outlier detection, and multivariate data visualization, and accurate identification of clusters plays an important role in statistical analysis. In this paper, we explore the connections among these three techniques for identifying multivariate data clusters and develop a new method, k-means on PCA or ICA; the results show that ICA-based clustering performs better than the others.
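A minimal sketch of the comparison, clustering raw data, PCA scores, and ICA scores with k-means and scoring each against known labels via the adjusted Rand index (the iris data is an illustrative stand-in):

```python
# Compare k-means on raw features, PCA scores, and ICA scores; higher
# adjusted Rand index means the clustering agrees better with true labels.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

variants = [
    ("k-means on raw data", X),
    ("k-means on PCA", PCA(n_components=2).fit_transform(X)),
    ("k-means on ICA", FastICA(n_components=2, random_state=0).fit_transform(X)),
]
for name, Z in variants:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    print(name, adjusted_rand_score(y, labels))
```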