Effective clustering for Single Cell Sequencing cancer data

Simone Ciccolella,Murray Patterson,Paola Bonizzoni,G. Della Vedova

Effective clustering for Single Cell Sequencing cancer data

2019

Single Cell Sequencing (SCS) technologies provide a level of resolution that makes it indispensable for inferring from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes infeasible using some approaches and tools. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in size and resolution of these data begin to put a strain on such methods. One possible solution is to reduce the size of an SCS instance --- usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are of a categorical nature, we explore techniques for clustering categorical data --- commonly used in data mining and machine learning --- to SCS data, with this goal in mind. In this work, we explore such categorical clustering techniques, and show on a study of simulated cancer phylogenies that the k-modes technique is the most effective strategy, when coupled with our novel dissimilarity measure --- called the conflict dissimilarity measure --- for computing the resulting modes, or centroids. We demonstrate this by showing that k-modes clusters mutations with high precision: never pairing too many mutations that are unrelated in the ground truth, but also obtains accurate results in terms of the phylogeny inferred downstream from a method, when applied to the resulting reduced instance produced by k-modes. Finally, we apply the entire pipeline (clustering + inference method) to a real dataset which was previously too large for the inference method alone, showing that our clustering procedure is effective in reducing the running time, hence raising considerably the threshold on the instance size that can be solved.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations