Development of algorithms and software for the classification of coding and non-coding nucleotide sequences

В. Р. Закирова, Д. А. Сырокваш, С. В. Гилевский, П. В. Назаров, Н. Н. Яцков

2019

Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven vectorization models for nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forest and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The classification error for coding and non-coding sequences using the random forest method on a set of the 23 most informative features is 2.93%.
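
As an illustration of the simplest of the vectorization ideas mentioned above (mono-, bi-, and trigram nucleotide frequencies), the following minimal sketch turns a sequence into a fixed-length frequency vector. It is not the authors' implementation: the function names `kmer_frequencies` and `vectorize`, the restriction to the alphabet `ACGT`, and the choice to skip windows containing other symbols are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four canonical nucleotides are counted


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers, in lexicographic order.

    Windows containing symbols outside ALPHABET (e.g. 'N') are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    # Return zeros if the sequence is shorter than k or has no valid windows.
    return [counts[km] / total if total else 0.0 for km in kmers]


def vectorize(seq):
    """Concatenate mono-, bi-, and trigram frequencies: 4 + 16 + 64 features."""
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_frequencies(seq, k))
    return features


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from the study's data.
    vec = vectorize("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
    print(len(vec))  # 84
```

Vectors of this kind could then be passed to a classifier such as a random forest. The feature selection step and the other models described in the abstract (category-position-frequency parameters, correlation factors, and so on) are not covered by this sketch.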
