SuperCRUNCH: A toolkit for creating and manipulating supermatrices and other large phylogenetic datasets

2019 
Phylogenies with extensive taxon sampling have become indispensable for many types of ecological and evolutionary studies. Many large-scale trees are based on a supermatrix approach, which involves amalgamating thousands of published sequences for a group. Constructing up-to-date supermatrices can be challenging, especially as new sequences may become available within the group almost constantly. However, few tools exist for assembling large-scale, high-quality supermatrices (and other large datasets) for phylogenetic analysis. Here we present SuperCRUNCH, a Python toolkit for assembling large phylogenetic datasets from GenBank/NCBI nucleotide data. SuperCRUNCH searches for specified sets of taxa and loci to create species-level or population-level datasets. It offers many transparent options for orthology detection, sequence selection, alignment, and file manipulation for generating large-scale phylogenetic datasets. We compared SuperCRUNCH to the most recent alternative approach for generating supermatrices (PyPHLAWD) for two datasets. Given the same set of starting sequences, SuperCRUNCH required more computational time but it retrieved more taxa and total sequences and produced trees having greater congruence with previous studies. SuperCRUNCH can assemble supermatrices for genomic datasets with thousands of loci, and can also generate population-level datasets for phylogeographic analyses. We demonstrate clear advantages for using data downloaded directly from GenBank/NCBI rather than using intermediate databases. Furthermore, we show the effectiveness of initially identifying loci through label searching followed by rigorous orthology detection, rather than relying on automated clustering of all sequences. SuperCRUNCH is open-source, well-documented, and freely available at https://github.com/dportik/SuperCRUNCH, with several complete example analyses available at https://osf.io/bpt94/. 
SuperCRUNCH is a flexible method that can be used to assemble high-quality phylogenetic datasets for any taxonomic group and at any scale (from kingdoms to individuals). It allows rapid construction of supermatrices, greatly simplifying the process of updating large phylogenies with new data. SuperCRUNCH streamlines the major tasks required to process sequence data, including filtering, alignment, trimming, and formatting.
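The idea of identifying loci by label searching, as described in the abstract, can be illustrated with a minimal Python sketch. This is not SuperCRUNCH's code or interface: the function names (`parse_fasta`, `label_search`), the search terms, and the toy records with placeholder accessions are all assumptions made for illustration. The sketch keeps a record only if its description line names a requested taxon and contains a search term for a requested locus. The orthology detection step that would follow is not shown.

```python
import re

def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif line and description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)

def label_search(records, taxa, locus_terms):
    """Group sequences by locus and taxon using description-line labels.

    Hypothetical helper: `taxa` is a list of taxon names and `locus_terms`
    maps a locus name to the label terms that identify it.
    """
    hits = {}
    for description, sequence in records:
        lowered = description.lower()
        # Keep the record only if it names one of the requested taxa.
        taxon = next((t for t in taxa if t.lower() in lowered), None)
        if taxon is None:
            continue
        for locus, terms in locus_terms.items():
            # Whole-word match on any of the locus label terms.
            if any(re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
                   for term in terms):
                hits.setdefault(locus, {}).setdefault(taxon, []).append(sequence)
    return hits

# Toy records with placeholder accessions (not real GenBank entries).
toy_fasta = """\
>XX000001.1 Rana temporaria cytochrome b (cytb) gene, partial cds
ATGACCAACATT
>XX000002.1 Rana temporaria recombination activating protein 1 (RAG1) gene
GGCTTACCTGAA
>XX000003.1 Bufo bufo cytochrome b (cytb) gene, partial cds
ATGACTAATATC
>XX000004.1 Hyla arborea 12S ribosomal RNA gene
CAAAGGTTTGGT
"""

taxa = ["Rana temporaria", "Bufo bufo"]
locus_terms = {
    "CYTB": ["cytb", "cytochrome b"],
    "RAG1": ["rag1", "recombination activating protein 1"],
}

hits = label_search(parse_fasta(toy_fasta), taxa, locus_terms)
for locus, by_taxon in sorted(hits.items()):
    print(locus, {taxon: len(seqs) for taxon, seqs in by_taxon.items()})
```

In this toy run the Hyla arborea record is dropped because the taxon was not requested, and each remaining sequence is filed under the locus its label matches. Label matches alone can include mislabeled or non-homologous sequences, which is why the abstract pairs label searching with a subsequent orthology detection step.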