Several algorithms and software packages have been developed for inferring phylogenetic trees. However, some biological phenomena, such as hybridization, recombination, and horizontal gene transfer, cannot be represented by a tree topology. Phylogenetic networks are needed to represent these important evolutionary mechanisms adequately. In this article, we present a new efficient heuristic algorithm for inferring hybridization networks from evolutionary distance matrices between species. The networks are built using the well-known Neighbor-Joining concept and the least-squares criterion. At each step of the algorithm, before joining two given nodes, we check whether a hybridization event could be related to one or both of them. The proposed algorithm finds the exact tree solution when the considered distance matrix is a tree metric (i.e. it is representable by a unique phylogenetic tree). It also provides very good hybrid recovery rates for large trees (with 32 and 64 leaves in our simulations) for both distance and sequence data. The results yielded by the new algorithm for real and simulated datasets are illustrated and discussed in detail.
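To make the notion of a tree metric concrete, the following minimal sketch tests the classical four-point condition on a distance matrix. It is not part of the authors' algorithm; the function name `is_tree_metric`, the tolerance `tol`, and the two example matrices are assumptions introduced for illustration. The check assumes a symmetric matrix with a zero diagonal and only inspects quadruples of distinct taxa.

```python
from itertools import combinations


def is_tree_metric(d, tol=1e-9):
    """Return True if matrix d satisfies the four-point condition.

    For every quadruple (i, j, k, l), the two largest of the three sums
    d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k] must be equal.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # An additive (tree) metric has a tie between the two largest sums.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical example: distances induced by the tree ((A,B),(C,D)) with
# pendant edges of length 1 and an internal edge of length 2.
additive = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# The same matrix with d(A,C) shortened, which breaks additivity.
perturbed = [row[:] for row in additive]
perturbed[0][2] = perturbed[2][0] = 3

print(is_tree_metric(additive))
print(is_tree_metric(perturbed))
```

A matrix that fails this check cannot be fitted exactly by any tree, which is the situation in which a network with hybridization events may be a better representation.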
We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). Using RKHS-based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly embedded in a low-dimensional space while the embedded policy incurs almost no decrease in return.
Phages are one of the most abundant groups of organisms in the biosphere. New phages continue to be identified, and the existing taxonomies diverge. Moreover, because of their mode of evolution and the complexity of their species ecosystem, their classification remains incomplete. Here, we present a new approach to phage classification that combines methods of horizontal gene transfer detection and ancestral sequence reconstruction.
Credit scoring (CS) is an effective and crucial approach used for risk management in banks and other financial institutions. It provides appropriate guidance on granting loans and reduces risks in the financial area. Hence, companies and banks are trying to use novel automated solutions to deal with the CS challenge and protect their own finances and customers. Different machine learning (ML) and data mining (DM) algorithms have been used to improve various aspects of CS prediction. In this paper, we introduce a novel methodology, named Deep Genetic Hierarchical Network of Learners (DGHNL). The proposed methodology comprises different types of learners, including Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Probabilistic Neural Networks (PNN), and fuzzy systems. The Statlog German (1000 instances) credit approval dataset available in the UCI machine learning repository is used to test the effectiveness of our model in the CS domain. Our DGHNL model encompasses five kinds of learners, two kinds of data normalization procedures, two feature extraction methods, three kinds of kernel functions, and three kinds of parameter optimization. Furthermore, the model applies deep learning, ensemble learning, supervised training, layered learning, genetic selection of features (attributes), genetic optimization of learner parameters, and novel genetic layered training (selection of learners) approaches, used along with the stratified 10-fold cross-validation (CV) training-testing method. The novelty of our approach lies in a proper flow and fusion of information (the DGHNL structure and its optimization). We show that the proposed DGHNL model with a 29-layer structure is capable of achieving a prediction accuracy of 94.60% (54 errors per 1000 classifications) for the Statlog German credit approval data. This is the best prediction performance for this well-known credit scoring dataset, compared to the existing work in the field.
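As a rough illustration of the stratified 10-fold protocol and of how the reported accuracy relates to the error count, the sketch below splits sample indices into folds with equal class proportions. It is not the DGHNL training code; the helper names `stratified_folds` and `accuracy`, the round-robin assignment without shuffling, and the synthetic 700/300 label vector are assumptions made for this example.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy(n_errors, n_total):
    """Fraction of correct classifications given an error count."""
    return 1.0 - n_errors / n_total


# Synthetic labels (assumed for illustration): 1000 samples, two classes.
labels = ["good"] * 700 + ["bad"] * 300
folds = stratified_folds(labels, k=10)

for test_fold in folds:
    n_bad = sum(labels[i] == "bad" for i in test_fold)
    print(len(test_fold), n_bad)

# 54 errors over 1000 classifications corresponds to the reported 94.60%.
print(f"{100 * accuracy(54, 1000):.2f}%")
```

In a full experiment, each fold would serve once as the test set while the remaining nine are used for training, and the errors would be summed over the ten test folds.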
Biolinguistic IE data archive. This file includes phonetic data, data matrices, Newick strings and word trees discussed in this paper as well as Perl and Python scripts for computing the Levenshtein and SCA distances. (ZIP 328 kb)
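For readers unfamiliar with the Levenshtein distance mentioned above, a minimal sketch of the standard dynamic-programming computation follows. It is not one of the scripts in the archive; the function names and the normalization by the longer word are assumptions made for illustration, and the SCA distance is not covered here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(normalized_levenshtein("pater", "father"))
```

Pairwise values of this kind, computed over word lists, can be arranged into the sort of distance matrix from which word trees are inferred.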
Motivation: Accurate detection of sequence similarity and homologous recombination is an essential part of many evolutionary analyses. Results: We have developed SimPlot++, an open-source multiplatform application implemented in Python, which can be used to produce publication-quality sequence similarity plots using 63 nucleotide and 20 amino acid distance models, to detect intergenic and intragenic recombination events using the Phi, Max-χ², NSS or proportion tests, and to generate and analyze interactive sequence similarity networks. SimPlot++ supports multicore data processing and provides useful distance calculability diagnostics. Availability: SimPlot++ is freely available on GitHub at: https://github.com/Stephane-S/Simplot_PlusPlus, as both an executable file (for Windows) and Python scripts (for Windows/Linux/MacOS).
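The idea behind a sequence similarity plot can be sketched as a sliding-window comparison of two aligned sequences. The code below is not SimPlot++ code and does not use its API; the function `window_similarity`, its `window` and `step` parameters, and the toy sequences are assumptions for illustration, and raw percent identity stands in for the distance models the tool provides.

```python
def window_similarity(query, reference, window=200, step=20):
    """Percent identity of two aligned sequences in sliding windows.

    Returns a list of (window midpoint, percent identity) pairs.
    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")

    points = []
    for start in range(0, len(query) - window + 1, step):
        compared = matches = 0
        pairs = zip(query[start:start + window], reference[start:start + window])
        for q, r in pairs:
            if q == "-" or r == "-":
                continue  # ignore gap columns
            compared += 1
            matches += q == r
        if compared:
            points.append((start + window // 2, 100.0 * matches / compared))
    return points


# Toy alignment: identical first half, divergent second half.
query = "ACGT" * 10
reference = "ACGT" * 5 + "TTTT" * 5

for midpoint, identity in window_similarity(query, reference, window=20, step=20):
    print(midpoint, identity)
```

A sharp drop in similarity along the alignment, as in the second window of this toy example, is the kind of signal that motivates running formal recombination tests on the region.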