Christina Huan Shi

Discovery Institute

Author Statistics

Papers

Citation

H-Index

i-10 index

Research Trends

Author Order

Document Type

Co-Authors

Kevin Y. Yip

Discovery Institute

Peter D. Adams

Discovery Institute

Savio Ho-Chit Chow

Chinese University of Hong Kong

Adarsh Rajesh

Discovery Institute

Yat‐Yuen Lim

University of Malaya

Dharma Varapula

Drexel University

Ken Hung-On Yu

University of Hong Kong

Xue Lei

Sanford Burnham Prebys Medical Discovery Institute

Yilin Wang

Sichuan University

Lahari Uppuluri

Children's Hospital of Philadelphia

Cooperative Institutions

Discovery Institute

Sanford Burnham Prebys Medical Discovery Institute

Torrey Pines Institute For Molecular Studies

Chinese University of Hong Kong

Drexel University

Epigenomics (Germany)

La Jolla Alcohol Research

University of Pennsylvania

Prince of Wales Hospital

University of Hong Kong

Research Field

Linked-Pair Long-Read Sequencing Strategy for Targeted Resequencing and Enrichment

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Lahari Uppuluri, Christina Huan Shi, Dharma Varapula, Eleanor Young, Rachel Ehrlich

In this report, we present linked-pair sequencing, a novel strategy to construct a long-read sequencing library such that adjacent fragments are linked with end-terminal duplications. We use the CRISPR-Cas9 nickase enzyme and a pool of multiple sgRNAs to perform non-random fragmentation of targeted long DNA molecules (>300 kb) into smaller library-sized fragments (about 20 kbp) in a manner so as to retain physical linkage information (up to 1000 bp) between adjacent fragments. DNA molecules targeted for fragmentation are preferentially ligated with adaptors for sequencing, so this method can enrich targeted regions while taking advantage of the long-read sequencing platforms. This enables the sequencing of target regions with significantly lower total coverage, and the genome sequence within linker regions provides information for assembly and phasing. We demonstrated the validity and efficacy of the method first using phage and then by sequencing a panel of 100 full-length cancer-related genes (including both exons and introns) in the human genome. When the designed linkers contained heterozygous genetic variants, long haplotypes could be established. This sequencing strategy can be readily applied in both PacBio and Oxford Nanopore platforms. This economically viable approach is useful for targeted enrichment of hundreds of target genomic regions and where long no-gap contigs need deep sequencing.

Sequencing by hybridization

Minion

Hybrid genome assembly

Sequence assembly

10.1101/2023.10.26.564243

Citations (0)

A general near-exact k-mer counting method with low memory consumption enables de novo assembly of 106× human sequence data in 2.7 hours

Bioinformatics (2020)

Christina Huan Shi, Kevin Y. Yip

In de novo sequence assembly, a standard pre-processing step is k-mer counting, which computes the number of occurrences of every length-k sub-sequence in the sequencing reads. Sequencing errors can produce many k-mers that do not appear in the genome, leading to the need for an excessive amount of memory during counting. This issue is particularly serious when the genome to be assembled is large, the sequencing depth is high, or when the memory available is limited. Here, we propose a fast near-exact k-mer counting method, CQF-deNoise, which has a module for dynamically removing noisy false k-mers. It automatically determines the suitable time and number of rounds of noise removal according to a user-specified wrong removal rate. We tested CQF-deNoise comprehensively using data generated from a diverse set of genomes with various data properties, and found that the memory consumed was almost constant regardless of the sequencing errors while the noise removal procedure had minimal effects on counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consistently performed the best in terms of memory usage, consuming 49-76% less memory than the second best method. When counting the k-mers from a human dataset with around 60× coverage, the peak memory usage of CQF-deNoise was only 10.9 GB (gigabytes) for k = 28 and 21.5 GB for k = 55. De novo assembly of 106× human sequencing data using CQF-deNoise for k-mer counting required only 2.7 h and 90 GB peak memory. The source codes of CQF-deNoise and SH-assembly are available at https://github.com/Christina-hshi/CQF-deNoise.git and https://github.com/Christina-hshi/SH-assembly.git, respectively, both under the BSD 3-Clause license.
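The k-mer counting step described in this abstract can be illustrated with a minimal sketch. This is not CQF-deNoise: it uses an exact in-memory counter followed by a simple low-count filter, and the function names `count_kmers` and `drop_rare_kmers`, the `min_count` threshold, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads (exact counting)."""
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count=2):
    """Discard k-mers seen fewer than min_count times (a crude noise filter)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy input: the third read differs from the first two at one base,
# standing in for a sequencing error.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
solid = drop_rare_kmers(counts)
```

In this toy input, the single-base difference in the third read creates k-mers that occur only once, and the filter removes them. The abstract describes removing such false k-mers dynamically during counting so that memory stays nearly constant, whereas this sketch filters only after all k-mers have been stored.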

k-mer

Sequence (biology)

Sequence assembly

10.1093/bioinformatics/btaa890

Citations (3)

A novel method for discovering local spatial clusters of genomic regions with functional relationships from DNA contact maps

Bioinformatics (2016)

Xihao Hu, Christina Huan Shi, Kevin Y. Yip

The three-dimensional structure of genomes makes it possible for genomic regions not adjacent in the primary sequence to be spatially proximal. These DNA contacts have been found to be related to various molecular activities. Previous methods for analyzing DNA contact maps obtained from Hi-C experiments have largely focused on studying individual interactions, forming spatial clusters composed of contiguous blocks of genomic locations, or classifying these clusters into general categories based on some global properties of the contact maps. Here, we describe a novel computational method that can flexibly identify small clusters of spatially proximal genomic regions based on their local contact patterns. Using simulated data that highly resemble Hi-C data obtained from real genome structures, we demonstrate that our method identifies spatial clusters that are more compact than methods previously used for clustering genomic regions based on DNA contact maps. The clusters identified by our method enable us to confirm functionally related genomic regions previously reported to be spatially proximal in different species. We further show that each genomic region can be assigned a numeric affinity value that indicates its degree of participation in each local cluster, and these affinity values correlate quantitatively with DNase I hypersensitivity, gene expression, super enhancer activities and replication timing in a cell type specific manner. We also show that these cluster affinity values can precisely define boundaries of reported topologically associating domains, and further define local sub-domains within each domain. The source code of BNMF and tutorials on how to use the software to extract local clusters from contact maps are available at http://yiplab.cse.cuhk.edu.hk/bnmf/. Contact: kevinyip@cse.cuhk.edu.hk. Supplementary data are available at Bioinformatics online.

10.1093/bioinformatics/btw256

Citations (8)

Histone chaperone HIRA, promyelocytic leukemia protein, and p62/SQSTM1 coordinate to regulate inflammation during cell senescence

Molecular Cell (2024)

Nirmalya Dasgupta, Xue Lei, Christina Huan Shi, Rouven Arnold, Marcos G. Teneche

Promyelocytic leukemia protein

Senescence

Chaperone (clinical)

10.1016/j.molcel.2024.08.006

Citations (4)

A long-read sequencing strategy with overlapping linkers on adjacent fragments (OLAF-Seq) for targeted resequencing and enrichment

Scientific Reports (2024)

Lahari Uppuluri, Christina Huan Shi, Dharma Varapula, Eleanor Young, Rachel Ehrlich

In this report, we present OLAF-Seq, a novel strategy to construct a long-read sequencing library such that adjacent fragments are linked with end-terminal duplications. We use the CRISPR-Cas9 nickase enzyme and a pool of multiple sgRNAs to perform non-random fragmentation of targeted long DNA molecules (>300 kb) into smaller library-sized fragments (about 20 kbp) in a manner so as to retain physical linkage information (up to 1000 bp) between adjacent fragments. DNA molecules targeted for fragmentation are preferentially ligated with adaptors for sequencing, so this method can enrich targeted regions while taking advantage of the long-read sequencing platforms. This enables the sequencing of target regions with significantly lower total coverage, and the genome sequence within linker regions provides information for assembly and phasing. We demonstrated the validity and efficacy of the method first using phage and then by sequencing a panel of 100 full-length cancer-related genes (including both exons and introns) in the human genome. When the designed linkers contained heterozygous genetic variants, long haplotypes could be established. This sequencing strategy can be readily applied in both PacBio and Oxford Nanopore platforms for both long and short genes with an easy protocol. This economically viable approach is useful for targeted enrichment of hundreds of target genomic regions and where long no-gap contigs need deep sequencing.

Minion

Hybrid genome assembly

Sequence assembly

Sequencing by hybridization

Massive parallel sequencing

10.1038/s41598-024-56402-w

Citations (0)

Quantifying full-length circular RNAs in cancer

bioRxiv (Cold Spring Harbor Laboratory) (2021)

Ken Hung-On Yu, Christina Huan Shi, Bo Wang, Savio Ho-Chit Chow, Grace Tin‐Yun Chung

Circular RNAs (circRNAs) are abundantly expressed in cancer. Their resistance to exonucleases enables them to have potentially stable interactions with different types of biomolecules. Alternative splicing can create different circRNA isoforms that have different sequences and unequal interaction potentials. The study of circRNA function thus requires knowledge of complete circRNA sequences. Here we describe psirc, a method that can identify full-length circRNA isoforms and quantify their expression levels from RNA sequencing data. We confirm the effectiveness and computational efficiency of psirc using both simulated and actual experimental data. Applying psirc on transcriptome profiles from nasopharyngeal carcinoma and normal nasopharynx samples, we discover and validate circRNA isoforms differentially expressed between the two groups. Compared to the assumed circular isoforms derived from linear transcript annotations, some of the alternatively spliced circular isoforms have 100 times higher expression and contain substantially fewer microRNA response elements, demonstrating the importance of quantifying full-length circRNA isoforms.

Circular RNA

10.1101/2021.02.04.429722

Citations (3)

Histone chaperone HIRA, Promyelocytic Leukemia (PML) protein and p62/SQSTM1 coordinate to regulate inflammation during cell senescence and aging

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Nirmalya Dasgupta, Xue Lei, Christina Huan Shi, Rouven Arnold, Marcos G. Teneche

Cellular senescence, a stress-induced stable proliferation arrest associated with an inflammatory Senescence-Associated Secretory Phenotype (SASP), is a cause of aging. In senescent cells, Cytoplasmic Chromatin Fragments (CCFs) activate SASP via the anti-viral cGAS/STING pathway. PML protein organizes PML nuclear bodies (NBs), also involved in senescence and anti-viral immunity. The HIRA histone H3.3 chaperone localizes to PML NBs in senescent cells. Here, we show that HIRA and PML are essential for SASP expression, tightly linked to HIRA's localization to PML NBs. Inactivation of HIRA does not directly block expression of NF-κB target genes. Instead, an H3.3-independent HIRA function activates SASP through a CCF-cGAS-STING-TBK1-NF-κB pathway. HIRA physically interacts with p62/SQSTM1, an autophagy regulator and negative SASP regulator. HIRA and p62 co-localize in PML NBs, linked to their antagonistic regulation of SASP, with PML NBs controlling their spatial configuration. These results outline a role for HIRA and PML in regulation of SASP.

Senescence

Chaperone (clinical)

Cellular senescence

Promyelocytic leukemia protein

10.1101/2023.06.24.546372

Citations (1)

Quantifying full-length circular RNAs in cancer

Genome Research (2021)

Ken Hung-On Yu, Christina Huan Shi, Bo Wang, Savio Ho-Chit Chow, Grace Tin‐Yun Chung

Circular RNAs (circRNAs) are abundantly expressed in cancer. Their resistance to exonucleases enables them to have potentially stable interactions with different types of biomolecules. Alternative splicing can create different circRNA isoforms that have different sequences and unequal interaction potentials. The study of circRNA function thus requires knowledge of complete circRNA sequences. Here we describe psirc, a method that can identify full-length circRNA isoforms and quantify their expression levels from RNA sequencing data. We confirm the effectiveness and computational efficiency of psirc using both simulated and actual experimental data. Applying psirc on transcriptome profiles from nasopharyngeal carcinoma and normal nasopharynx samples, we discover and validate circRNA isoforms differentially expressed between the two groups. Compared with the assumed circular isoforms derived from linear transcript annotations, some of the alternatively spliced circular isoforms have 100 times higher expression and contain substantially fewer microRNA response elements, showing the importance of quantifying full-length circRNA isoforms.

Circular RNA

10.1101/gr.275348.121

Citations (15)

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

bioRxiv (Cold Spring Harbor Laboratory) (2019)

Christina Huan Shi, Kevin Y. Yip

K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.

10.1101/723833

Citations (2)