Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network

2021 
The Context Augmented Psi Dataset (CAPD) is a benchmark dataset for RNA splicing models. It contains percent-spliced-in (PSI) labels for 250 samples from each of 14 tissue types, covering all 23 human chromosomes, along with auxiliary signals for each sample (RBP transcript abundance levels) and a gene dictionary (the same gene sequences are used for all samples). The dataset is split into training, testing, and validation sets. Auxiliary signals are provided in a separate archive as a .csv table, where each line represents one sample with its tissue label and a number. Labels are stored as a separate .jsonl dictionary for each sample; each entry in the dictionary contains the gene name and the acceptor and donor coordinates (relative to the very first acceptor site of the gene), with their respective PSI levels in the range from 0 to 1. The gene dictionary is also stored as a .jsonl file, where each entry is a pre-mRNA gene sequence with 1000 nt of flanking sequence on each side.
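
The file layout described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loading code: the field names (`gene`, `acceptor`, `donor`, `psi`) and the .csv column names (`sample_id`, `tissue`, `RBP_1`, `RBP_2`) are hypothetical placeholders chosen for illustration, and the actual keys in the archives may differ. In-memory strings stand in for the real files.

```python
import csv
import io
import json


def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_aux_table(stream):
    """Return the rows of a .csv table as a list of dicts keyed by header."""
    return list(csv.DictReader(stream))


def psi_in_range(values):
    """Check that every PSI value lies in the closed interval [0, 1]."""
    return all(0.0 <= v <= 1.0 for v in values)


# Hypothetical records: the real field names in the dataset may differ.
# Coordinates are relative to the first acceptor site, so it sits at 0.
labels_text = (
    '{"gene": "GENE_A", "acceptor": [0, 350], "donor": [120, 480], "psi": [1.0, 0.42]}\n'
    '{"gene": "GENE_B", "acceptor": [0], "donor": [95], "psi": [0.87]}\n'
)
# Hypothetical auxiliary table: one line per sample.
aux_text = "sample_id,tissue,RBP_1,RBP_2\nS1,heart,3.2,0.5\n"

labels = list(read_jsonl(io.StringIO(labels_text)))
aux_rows = read_aux_table(io.StringIO(aux_text))
all_psi_valid = all(psi_in_range(rec["psi"]) for rec in labels)
```

To read an actual file, replace the `io.StringIO(...)` objects with `open(path, encoding="utf-8")`. Note that `csv.DictReader` returns every value as a string, so abundance levels would need an explicit `float(...)` conversion before numerical use.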