Pleiotropy, in which a single gene controls two or more seemingly unrelated traits, has been shown to impact genes with effects on flowering time, leaf architecture, and inflorescence morphology in maize. However, the genome-wide impact of biological pleiotropy across all maize phenotypes is largely unknown. Here, we investigate the extent to which biological pleiotropy impacts phenotypes within maize using GWAS summary statistics reanalyzed from previously published metabolite, field, and expression phenotypes across the Nested Association Mapping population and Goodman Association Panel. Through phenotypic saturation of 120,597 traits, we obtain over 480 million significant quantitative trait nucleotides. We estimate that only 1.56-32.3% of intervals show some degree of pleiotropy. We then assess the relationship between pleiotropy and various biological features such as gene expression, chromatin accessibility, sequence conservation, and enrichment for gene ontology terms. We find very little relationship between pleiotropy and these variables when compared to permuted pleiotropy. We hypothesize that biological pleiotropy of common alleles is not widespread in maize and is highly impacted by nuisance terms such as population structure and linkage disequilibrium. Natural selection on large standing natural variation in maize populations may target wide and large effect variants, leaving the prevalence of detectable pleiotropy relatively low.
Relevant Data and Code for Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize by Taylor Ferebee and Edward Buckler.
Input Data: The inputs of the models are enclosed in Input_data-2022-001.zip
Output Data: The outputs of the models are enclosed in Output_Results-2022-001.zip
Relevant Code: The code for all analyses is enclosed in Code_Archive.zip
Abstract Non-coding regions of the genome are just as important as coding regions for understanding the mapping from genotype to phenotype. Interpreting deep learning models trained on RNA-seq is an emerging method to highlight functional sites within non-coding regions. Most of the work on RNA abundance models has been done within humans and mice, with little attention paid to plants. Here, we benchmark four genomic deep learning model architectures with genomes and RNA-seq data from 18 species closely related to maize and sorghum within the Andropogoneae. The Andropogoneae are a tribe of C4 grasses that have adapted to a wide range of environments worldwide since diverging 18 million years ago. Hundreds of millions of years of evolution across these species has produced a large, diverse pool of training alleles across species sharing a common physiology. As model input, we extracted 1,026 base pairs upstream of each gene’s translation start site. We held out maize as our test set and two closely related species as our validation set, training each architecture on the remaining Andropogoneae genomes. Within a panel of 26 maize lines, all architectures predict expression across genes moderately well but poorly across alleles. DanQ consistently ranked highest or second highest among all architectures yet performance was generally very similar across architectures despite orders of magnitude differences in size. This suggests that state-of-the-art supervised genomic deep learning models are able to generalize moderately well across related species but not sensitively separate alleles within species, the latter of which agrees with recent work within humans. We are releasing the preprocessed data and code for this work as a community benchmark to evaluate new architectures on our across-species and across-allele tasks.
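The upstream-window extraction described in the abstract above can be illustrated with a minimal sketch. The function names, the 0-based coordinate convention, and the N-padding of truncated windows are assumptions made for illustration; they are not taken from the released benchmark code.

```python
# Hypothetical helpers: names, coordinate convention, and padding are
# illustrative assumptions, not the authors' released implementation.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_window(chrom_seq: str, start_pos: int, strand: str,
                    length: int = 1026) -> str:
    """Return `length` bases immediately upstream of a translation start site.

    `start_pos` is the 0-based forward-strand index of the first base of the
    start codon as read in the gene's own orientation. The window is returned
    5'->3' in the gene's orientation, so it always ends right before the
    start codon.
    """
    if strand == "+":
        begin = max(0, start_pos - length)
        window = chrom_seq[begin:start_pos]
    elif strand == "-":
        end = min(len(chrom_seq), start_pos + 1 + length)
        window = reverse_complement(chrom_seq[start_pos + 1:end])
    else:
        raise ValueError("strand must be '+' or '-'")
    # If the chromosome end truncated the window, the missing bases are the
    # ones farthest upstream, so pad on the left to keep a fixed input size.
    return window.rjust(length, "N")


if __name__ == "__main__":
    toy = "ACGTTGCA"
    print(upstream_window(toy, 4, "+", length=3))
    print(upstream_window(toy, 3, "-", length=3))
```

Returning every window in the gene's own orientation means the model sees plus- and minus-strand promoters in the same frame of reference.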
Abstract Assembled genomes and their associated annotations have transformed our study of gene function. However, each new assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million gene models in maize, reelGene found that 28% were incorrectly annotated or nonfunctional. By leveraging a large cohort of related species and through learning the conserved grammar of proteins, reelGene provides a tool for both evaluating gene model accuracy and genome biology.
Abstract Over the last 20 million years, the Andropogoneae tribe of grasses has evolved to dominate 17% of global land area. Domestication of these grasses in the last 10,000 years has yielded our most productive crops, including maize, sugarcane, and sorghum. The majority of Andropogoneae species, including maize, show a history of polyploidy – a condition that, while offering the evolutionary advantage of multiple gene copies, poses challenges to basic cellular processes, gene expression, and epigenetic regulation. Genomic studies of polyploidy have been limited by sparse sampling of taxa in groups with multiple polyploidy events. Here, we present 33 genome assemblies from 27 species, including chromosome-scale assemblies of maize relatives Zea and Tripsacum. In maize, the after-effects of polyploidy have been widely studied, showing reduced chromosome number, biased fractionation of duplicate genes, and transposable element (TE) expansions. While we observe these patterns within the genus Zea, 12 other polyploidy events deviate significantly. Those tetraploids and hexaploids retain elevated chromosome number, maintain nearly complete complements of duplicate genes, and have only stochastic TE amplifications. These genomes reveal variable outcomes of polyploidy, challenging simple predictions and providing a foundation for understanding its evolutionary implications in an ecologically and economically important clade.
Abstract Genomic selection and gene editing in crops could be enhanced by multi-species, mechanistic models predicting effects of changes in gene regulation. Current expression abundance prediction models require extensive computational resources, hard-to-measure species-specific training data, and often fail to incorporate data from multiple species. We hypothesize that gene expression prediction models that harness the regulatory network structure of Arabidopsis thaliana transcription factor-target gene interactions will improve on the present maize models. To this end, we collect 147 Oryza sativa and 99 Sorghum bicolor gene expression assays and assign them to maize family-based orthologous groups. Using three popular graph-based machine learning frameworks, including a shallow graph convolutional autoencoder, a deep graph convolutional autoencoder, and the inductive GraphSage strategy, we encode an Arabidopsis thaliana integrated gene regulatory network (iGRN) structure and TF gene expression values to predict gene expression both within and between species. We then evaluate the network methods against a partial least-squares baseline. We find that the baseline gives the best predictions within species, with Spearman correlations averaging between 0.74 and 0.78. The graph autoencoder methods were more variable, with correlations between -0.1 and 0.65. In particular, the GraphSage and deep autoencoders performed the worst, and the shallow autoencoders performed the best. In the most challenging prediction context, where predictions were made in new species and on genes that had not been seen, we found that the shallow graph autoencoder framework averaged around 0.65. Contrary to the initial expectation that preserved network structure would improve gene expression predictions, this study shows that within-species predictions need only simple models, such as partial least squares, to capture expression variation. In cross-species predictions, the best model is often a more complex strategy that uses regulatory network structure and expression data from other studies.
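The Spearman correlations used above to compare models measure how well the ranks of predicted expression values agree with the ranks of observed values. The following is a minimal pure-Python sketch of that metric; the function names are hypothetical and this is not the authors' evaluation code (in practice a library routine such as `scipy.stats.spearmanr` would typically be used).

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(observed, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(observed), average_ranks(predicted))


if __name__ == "__main__":
    # Toy values for illustration only; not data from the study.
    observed = [0.5, 2.0, 3.5, 8.0]
    predicted = [0.1, 0.4, 0.9, 50.0]
    print(spearman(observed, predicted))
```

Because the metric depends only on ranks, a model that orders genes correctly scores well even when its predicted expression values are on a different scale from the measurements.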