537. New Graph-Based Algorithm for Comprehensive Identification and Tracking Retroviral Integration Sites

2016 
Vector integration sites (IS) in hematopoietic stem cell (HSC) gene therapy (GT) applications are stable genetic marks, distinctive for each independent cell clone and its progeny. Characterizing IS makes it possible to identify each cell clone and to track its fate individually across tissues, cell lineages and time, and is required for assessing the safety and efficacy of the treatment. Bioinformatics pipelines for IS detection used in GT treat sequence reads that map to the same position of the reference genome as a single IS, but discard reads that map ambiguously to multiple genomic regions. The loss of such a significant portion of patients’ IS may hide potential malignant events, reducing the reliability of IS studies. We developed a novel tool that accurately identifies IS in any genomic region, even one composed of repetitive genomic sequences. Our approach starts with a genome-free analysis of the sequencing reads: it builds an undirected graph in which nodes are the input sequences and edges represent valid alignments (above a specific identity threshold) between pairs of nodes. By analyzing and decomposing the graph, the method identifies indivisible subgraphs of sequences (clusters), each corresponding to an IS. Once the consensus sequence of each cluster has been extracted and aligned to the reference genome, we collect the alignment results and the annotation labels from RepeatMasker. By combining the set of genomic coordinates with the annotation labels, the method retraces the initial sequence graph, statistically validates the clusters through a permutation test and produces the final list of IS. We tested the reliability of our tool on 3 IS datasets generated from simulated sequencing reads with increasing rates of nucleotide variation (0%, 0.25% and 0.5%) and on real data from a cell line with known IS, and we compared our tool to VISPA and UClust, which are used for GT studies.
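The genome-free graph step described above can be illustrated with a minimal sketch. It is not the tool's implementation: the abstract does not specify the aligner, the identity threshold, or how the graph is decomposed, so this sketch assumes a crude similarity ratio from Python's `difflib` in place of a real pairwise alignment, a hypothetical threshold of 0.9, and plain connected components in place of the "indivisible subgraphs". The function names `identity`, `build_graph` and `clusters` are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; an assumed stand-in for a pairwise alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def build_graph(reads, threshold=0.9):
    """Build an undirected graph as adjacency sets.

    Nodes are read indices; an edge joins two reads whose identity
    reaches the (hypothetical) threshold.
    """
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if identity(reads[i], reads[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def clusters(graph):
    """Return connected components (sorted lists of node indices) via iterative DFS.

    Connected components are a simplification assumed here; the abstract's
    decomposition into indivisible subgraphs may be stricter.
    """
    seen = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        result.append(sorted(component))
    return result


# Toy reads: two pairs that differ by a single base within each pair.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "TTTTGGGGCCCCAAAATTTT",
    "TTTTGGGGCCCCAAAATTTA",
]
read_clusters = clusters(build_graph(reads))
```

In this toy input each pair of near-identical reads ends up in its own cluster, which is the role a candidate IS plays before consensus extraction and alignment to the reference genome. The all-pairs comparison is quadratic in the number of reads, so it is only suitable for illustration.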
In the simulated datasets our tool achieved precision and recall ranging from 0.85 to 0.97 and from 0.88 to 0.99 respectively, giving an aggregate F-score ranging from 0.86 to 0.98, higher than those of VISPA and UClust. In the experimental case of sequences from LAM-PCR products, our tool and VISPA identified all 6 known ISs for >98% of the reads produced, while UClust identified only 5 out of 6 ISs. We then used our tool to reanalyze the sequencing reads of our GT clinical trial for Metachromatic Leukodystrophy (MLD), recovering the previously hidden portion of IS. The overall number of ISs, sequencing reads and estimated actively repopulating HSCs increased by an average of ~1.5-fold with respect to the previously published data obtained through VISPA, whereas the diversity index of the population did not change and no aberrant clones in repeats occurred. Our tool addresses and solves important open issues in retroviral IS identification and clonal tracking, allowing the generation of a comprehensive repertoire of IS.
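As a quick check on how the reported metrics relate, the sketch below computes an F-score from precision and recall. The abstract does not say which F-score was used, so this assumes the balanced F1 (the harmonic mean of precision and recall). Pairing the lower bounds and the upper bounds of the reported ranges is also an assumption, since the abstract does not state which precision goes with which recall.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed pairing of the range endpoints reported in the abstract.
low = round(f_score(0.85, 0.88), 2)
high = round(f_score(0.97, 0.99), 2)
```

Under these assumptions the two values round to 0.86 and 0.98, which matches the reported F-score range.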