Unsupervised methods for pattern discovery in high-throughput genomic data

Kristina Buschur

2019

Large -omics datasets are being generated at an increasingly fast pace. They present bountiful opportunities for insight into complex diseases and systems, but also new challenges in analysis. Novel approaches are needed to make sense of these high-throughput data and especially to consider them jointly for a more complete picture of the system’s biology. In this dissertation, we focused on improving clustering in high-throughput biological datasets by developing a variety of new features that are specifically tailored to reflect the biological properties of the systems we are trying to understand. We started by proposing new features for representing transcription factor (TF) binding sites that capture both the DNA sequence composition of the binding region and the TF-DNA binding strength. We observed that these new features aided clustering and thereby improved DNA binding motif discovery. Next, we presented a new method, single sample network perturbation assessment (ssNPA), and demonstrated how causal network learning algorithms can be used to build features that capture the complex interactions of variables within biological systems such as gene regulatory networks, and to cluster samples based on how these networks are deregulated in different subtypes. We validated this method in a murine liver cell development dataset and in transcriptomic datasets comparing breast cancer and lung adenocarcinoma tumor samples to normal tissue. We then used ssNPA to describe new subtypes of chronic obstructive pulmonary disease (COPD) based on their gene network deregulation relative to normal samples. Finally, we applied causal network modeling techniques to two datasets of chronic lung diseases, exploring the systems biology of lung function decline in COPD at the level of body systems, and cell type interactions in idiopathic pulmonary fibrosis (IPF) at the scale of gene expression in single cells.
