Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/α2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.
The genome sequences of six Candida species have been determined, and compared with those of Candida albicans, a marine yeast and baker's yeast. Candida species are the most common cause of opportunistic fungal infection in humans. The genomic comparisons reveal striking gene family expansions associated with pathogenic species. Other aspects of Candida biology, including evolution of the genetic code, and the architecture of mating and meiotic processes, can also be addressed in the interspecies comparisons.
Candida species are the most common cause of opportunistic fungal infection worldwide. Here, the genomes of six Candida species are sequenced and compared with each other and with related pathogens and non-pathogens, providing insight into the genetic features that underlie the diversity of Candida biology, including pathogenesis and the architecture of mating and meiotic processes.
TAC1 (for transcriptional activator of CDR genes) is critical for the upregulation of the ABC transporters CDR1 and CDR2, which mediate azole resistance in Candida albicans. While a wild-type TAC1 allele drives high expression of CDR1/2 in response to inducers, we showed previously that TAC1 can be hyperactive by a gain-of-function (GOF) point mutation responsible for constitutive high expression of CDR1/2. High azole resistance levels are achieved when C. albicans carries hyperactive alleles only as a consequence of loss of heterozygosity (LOH) at the TAC1 locus on chromosome 5 (Chr 5), which is linked to the mating-type-like (MTL) locus. Both are located on the Chr 5 left arm along with ERG11 (target of azoles). In this work, five groups of related isolates containing azole-susceptible and -resistant strains were analyzed for the TAC1 and ERG11 alleles and for Chr 5 alterations. While recovered ERG11 alleles contained known mutations, 17 new TAC1 alleles were isolated, including 7 hyperactive alleles with five separate new GOF mutations. Single-nucleotide-polymorphism analysis of Chr 5 revealed that azole-resistant strains acquired TAC1 hyperactive alleles and, in most cases, ERG11 mutant alleles by LOH events not systematically including the MTL locus. TAC1 LOH resulted from mitotic recombination of the left arm of Chr 5, gene conversion within the TAC1 locus, or the loss and reduplication of the entire Chr 5. In one case, two independent TAC1 hyperactive alleles were acquired. Comparative genome hybridization and karyotype analysis revealed the presence of isochromosome 5L [i(5L)] in two azole-resistant strains. i(5L) leads to increased copy numbers of azole resistance genes present on the left arm of Chr 5, among them TAC1 and ERG11. 
Our work shows that azole resistance was due not only to the presence of specific mutations in azole resistance genes (at least ERG11 and TAC1) but also to their increase in copy number by LOH and to the addition of extra Chr 5 copies. With the combination of these different modifications, sophisticated genotypes were obtained. The development of azole resistance in C. albicans is therefore a powerful instrument for generating genetic diversity.
TAC1, a Candida albicans transcription factor situated near the mating-type locus on chromosome 5, is necessary for the upregulation of the ABC-transporter genes CDR1 and CDR2, which mediate azole resistance. We showed previously the existence of both wild-type and hyperactive TAC1 alleles. Wild-type alleles mediate upregulation of CDR1 and CDR2 upon exposure to inducers such as fluphenazine, while hyperactive alleles result in constitutive high expression of CDR1 and CDR2. Here we recovered TAC1 alleles from two pairs of matched azole-susceptible (DSY294; FH1: heterozygous at mating-type locus) and azole-resistant isolates (DSY296; FH3: homozygous at mating-type locus). Two different TAC1 wild-type alleles were recovered from DSY294 (TAC1-3 and TAC1-4) while a single hyperactive allele (TAC1-5) was isolated from DSY296. A single amino acid (aa) difference between TAC1-4 and TAC1-5 (Asn977 to Asp or N977D) was observed in a region corresponding to the predicted activation domain of Tac1p. Two TAC1 alleles were recovered from FH1 (TAC1-6 and TAC1-7) and a single hyperactive allele (TAC1-7) was recovered from FH3. The N977D change was seen in TAC1-7 in addition to several other aa differences. The importance of N977D in conferring hyperactivity to TAC1 was confirmed by site-directed mutagenesis. Both hyperactive alleles TAC1-5 and TAC1-7 were codominant with wild-type alleles and conferred hyperactive phenotypes only when homozygous. The mechanisms by which hyperactive alleles become homozygous were addressed by comparative genome hybridization and single nucleotide polymorphism arrays, which indicated that loss of TAC1 heterozygosity can occur by recombination between portions of chromosome 5 or by chromosome 5 duplication.
Abstract
Genome rearrangements resulting in copy number variation (CNV) and loss of heterozygosity (LOH) are frequently observed during the somatic evolution of cancer and promote rapid adaptation of fungi to novel environments. In the human fungal pathogen Candida albicans, CNV and LOH confer increased virulence and antifungal drug resistance, yet the mechanisms driving these rearrangements are not completely understood. Here, we unveil an extensive array of long repeat sequences (65–6499 bp) that are associated with CNV, LOH, and chromosomal inversions. Many of these long repeat sequences are uncharacterized and encompass one or more coding sequences that are actively transcribed. Repeats associated with genome rearrangements are predominantly inverted and separated by up to ~1.6 Mb, an extraordinary distance for homology-based DNA repair/recombination in yeast. These repeat sequences are a significant source of genome plasticity across diverse strain backgrounds including clinical, environmental, and experimentally evolved isolates, and represent previously uncharacterized variation in the reference genome. https://doi.org/10.7554/eLife.45954.001
Introduction
Genome plasticity is surprisingly common in eukaryotes. DNA insertions and deletions (indels), copy number variations (CNV), and loss of heterozygosity (LOH) are frequently described during the evolution of organisms and of disease states such as cancer. In particular, the genome plasticity of fungal pathogens was recognized well before whole genome sequencing was available, including genome copy number variation (polyploidy), inter- and intra-chromosomal rearrangements, and aneuploidy (Chibana et al., 2000; Magee and Magee, 2000; Rustchenko-Bulgac, 1991; Suzuki et al., 1982).
Controlled in vitro and in vivo evolution experiments in combination with whole genome sequencing have further highlighted the speed in which specific genome rearrangements provide a fitness advantage that can be selected for in these fungal pathogens (Araya et al., 2010; Croll et al., 2013; Dunham et al., 2002; Forche et al., 2011; Ford et al., 2015; Gerstein et al., 2015; Hirakawa et al., 2015; Selmecki et al., 2009; Stukenbrock et al., 2010). Candida albicans is the most prevalent human fungal pathogen, associated with nearly half a million life-threatening infections annually, predominantly in immunocompromised individuals (Brown and Netea, 2012). C. albicans is a heterozygous diploid yeast capable of mating, yet true meiosis has not been observed. Instead, it undergoes a parasexual process that involves random chromosome loss and rare Spo11-dependent chromosome recombination events (Bennett and Johnson, 2003; Forche et al., 2008; Wang et al., 2018). The majority of genomic diversity observed in C. albicans is attributed to asexual mitotic genome rearrangements (Forche et al., 2011; Lephart and Magee, 2006). Despite this clonal lifestyle, C. albicans isolates exhibit extensive genomic diversity in the form of de novo base substitutions, indels, ploidy variation (haploid, diploid, and polyploid), karyotypic variation due to segmental and whole chromosome aneuploidies, and allele copy number variation including LOH (Chibana et al., 2000; Forche et al., 2011; Ford et al., 2015; Hickman et al., 2013; Hirakawa et al., 2015; Magee and Magee, 2000; Rustchenko-Bulgac, 1991; Selmecki et al., 2006; Suzuki et al., 1982). Additionally, while C. 
albicans did not undergo an ancient whole genome duplication event like Saccharomyces cerevisiae (Butler et al., 2009; Marcet-Houben et al., 2009; Wolfe and Shields, 1997), small-scale duplication events have resulted in gene family expansions, especially in sub-telomeric regions (Anderson et al., 2012; Butler et al., 2009; Dunn et al., 2018). A comprehensive analysis of these duplication events, their evolutionary trajectories and impact on genome stability, remains largely unexplored. Early comparative studies of the C. albicans genome identified diverse repetitive loci that contribute to genotypic and phenotypic plasticity (Braun et al., 2005; Jones et al., 2004). First, repeat analysis in C. albicans has characterized at least three major classes of long repetitive sequences: the 23 bp tandem telomeric repeat units and the 14 member telomere-associated (TLO) gene family residing in sub-telomeric regions; the Major Repeat Sequences (MRS) found, at least in part, on every C. albicans chromosome and formed by a long tandem array of ~2.1 kb RPS units flanking non-repetitive HOK and RBP-2 elements (Chibana et al., 1994; Chindamporn et al., 1998; Lephart and Magee, 2006); and the ribosomal DNA repeats (rDNA) found on ChrR, which are organized as a tandem array of up to ~200 copies of ~12 kb units (Freire-Benéitez et al., 2016; Jones et al., 2004; Rustchenko et al., 1993; Wickes et al., 1991). These long repetitive sequences can undergo both inter- and intra-locus recombination events that rapidly generate chromosome length polymorphisms, chimeric chromosomes, and telomere-telomere chromosomal fusions (Chu et al., 1992; Selmecki et al., 2006; Selmecki et al., 2010). Second, like most eukaryotes, C. albicans also encodes many 'lone' long terminal repeats (LTRs) and retroelements (Zorro, Tca2, Ty1/Copia) (Goodwin and Poulter, 1998; Goodwin and Poulter, 2000); however, the relative copy number of many of these genes is hypervariable between C. 
albicans isolates and is expanded relative to other Candida species (Butler et al., 2009; Hirakawa et al., 2015). Third, short repeat sequences (short tandem repeats and trinucleotide repeats) are significantly more frequent in protein-coding sequences of C. albicans than in S. cerevisiae and Schizosaccharomyces pombe (Braun et al., 2005; Jones et al., 2004). Fourth, expansions of multi-gene families (identified by protein alignment) were both more common and larger than the orthologous gene family size found in S. cerevisiae. These gene families often encode proteins with roles in commensalism and virulence, including the agglutinin-like sequence (ALS) family (eight genes) and other glycosylphosphatidylinositol (GPI)-anchored genes that encode large cell surface glycoproteins (five genes) (Levdansky et al., 2008; Wilkins et al., 2018). Among these gene families, recombination and/or slippage between repeat units yields extensive allelic variation, leading to functional and phenotypic diversity, similar to the FLO genes in S. cerevisiae (Hoyer et al., 1995; Kunkel, 1993; Pearson et al., 2005; Richard et al., 1999; Verstrepen et al., 2005; Zhang et al., 2003; Zhao et al., 2004). The evolution of different alleles in these repeat-containing ORFs predominantly occurs by the addition, deletion, and rearrangement of repeat units within an ORF and between different ORFs, not by the acquisition of point mutations or indels (Christiaens et al., 2012; Zhang et al., 2010). Importantly, these genomic studies focused on short repeat sequences and repeats found in protein-coding sequences. Less is known about long repeat sequences found throughout the genome, especially those encoding multiple ORFs and intergenic regions. Over 19 years ago, Wolfe and colleagues showed that the C. albicans genome contains thousands of small chromosomal inversion events (~10 genes long) relative to S. cerevisiae. 
These inversions resulted in substantially different gene order between these two species (Seoighe et al., 2000). Similarly, Dujon and colleagues demonstrated that the C. albicans genome had the highest rate of genome instability due to micro- and macro-rearrangements of syntenic gene blocks, relative to 11 other hemiascomycete species (Fischer et al., 2006). The loss of synteny primarily resulted from chromosomal rearrangements, not sequence divergence of orthologous regions. A mechanism proposed for this genome instability was a higher incidence of repetitive sequences and/or a less efficient DNA repair process (Fischer et al., 2006). The genomic diversity of C. albicans increases during in vitro and in vivo exposure to stress. For example, rates of LOH increase during exposure to elevated temperature (37°C), DNA transformation, and antifungal drugs (Bouchonville et al., 2009; Forche et al., 2011; Forche et al., 2018). LOH is also increased during in vivo models of infection (Ene et al., 2018; Forche et al., 2008; Forche et al., 2018). LOH events occur due to chromosome nondisjunction leading to whole chromosome LOH or via recombination, in which only part of the chromosome undergoes LOH. Exposure to stress also selects for isolates that have acquired adaptive mutations and genome rearrangements. For example, aneuploidy is found in ~50% of isolates resistant to the most common antifungal drug, fluconazole (FLC). The most common and only recurrent aneuploidy in different strain backgrounds is the amplification of the left arm of chromosome 5 (Chr5L), often through acquisition of a novel isochromosome structure (denoted as i(5L)), comprised of two copies of Chr5L separated by the centromere (Selmecki et al., 2006; Selmecki et al., 2008). 
Acquisition of i(5L) conferred FLC resistance via the amplification of two genes, ERG11 and TAC1, encoding the drug target (Erg11) and a transcriptional activator of drug efflux pumps (Tac1) (Selmecki et al., 2008; Selmecki et al., 2009). Importantly, the centromere of Chr5 contains a long inverted repeat sequence, and recombination between these repeats can form homozygous isochromosomes of both the left arm (i(5L)) and right arm of Chr5 (i(5R)) (Selmecki et al., 2006). The role of long repeat sequences in the formation of other segmental aneuploidies and other genome rearrangements has not been comprehensively addressed. We provide evidence that long repeat sequences are involved in the formation of all observed CNV breakpoints and chromosome inversions, and many LOH breakpoints, across 33 diverse clinical and experimentally evolved isolates. Our comprehensive analysis of long repeat sequences within the C. albicans genome identified hundreds of sequences representing novel multicopy repeats, none of which include MRS, rDNA, sub-telomeric repeats, known repeat families (ALS, TLOs) or known repetitive elements (tRNAs, LTRs, retrotransposons). Long repeats that are associated with genome rearrangements (CNV, LOH, and inversions) have on average higher sequence identity than all long repeats combined. Additionally, long repeats that contain ORFs (including partial ORF sequences, single complete ORF sequences (paralogs), or multiple ORFs and intergenic sequences) are longer and associated with more genome rearrangements than long repeats that contain other genomic features (such as LTRs, retrotransposons, or tRNAs). Additionally, repeat copies involved in genome rearrangements can be located up to ~1.6 Mb apart on the same chromosome, suggesting a non-conventional, long-range mechanism for DNA double-strand break (DSB) repair and somatic genome diversification. 
Results
An inverted repeat within CEN4 is associated with the formation of a novel isochromosome
To identify the mechanisms by which C. albicans isolates generate genome plasticity, we performed a comparative genomics analysis of 33 diverse clinical isolates (Supplementary file 1). This set of isolates included 11 that underwent controlled experimental evolution, where a known progenitor isolate was passaged in vitro or in vivo. Additionally, we performed comparative genomics on newly obtained clinical isolates, and clinical isolates whose genomes were published previously, including the reference isolate SC5314. Given the significant impact of i(5L) on antifungal drug resistance, we focused first on the characterization of a novel segmental aneuploidy detected on Chr4 that arose during in vitro evolution in the presence of FLC. Initially, we passaged a FLC-sensitive clinical isolate P78042, which was trisomic for Chr4 (Hirakawa et al., 2015; Lockhart et al., 2002), in the presence of FLC (128 µg/ml) for 100 generations by serial dilution (see Materials and methods). One evolved isolate (AMS3743) was selected, based on increased fitness in FLC (see below), and the whole genome was sequenced. Read depth analysis indicated that this isolate had four copies of the right arm of Chr4 (Chr4R), but only two copies of Chr4L, and the copy number breakpoint occurred at the centromere of Chr4 (CEN4) (Figure 1A). Wildtype CEN4, like CEN5, is comprised of a CENP-A-binding core sequence (~3.1 kb) flanked by a long (524 bp) inverted repeat (Burrack et al., 2016; Ketel et al., 2009; Sanyal et al., 2004).
Figure 1. Inverted repeat at CEN4 causes a novel isochromosome leading to increased fluconazole resistance.
(A) Whole genome sequence data plotted as a log2 ratio and converted to chromosome copy number (Y-axis) and chromosome location (X-axis) using YMAP, for the progenitor clinical isolate (P78042) and an isolate obtained after 100 generations in FLC (AMS3743). The copy number breakpoint in AMS3743 occurs at CEN4 (red arrow). (B) CHEF karyotype gel stained with ethidium bromide (left panel) identifies a novel band (asterisk) above Chr5. Southern blot analysis (right panel) of the same gel using a DIG-labeled CEN4 probe identifies the full-length Chr4 homolog in P78042 and AMS3743, and the novel band in AMS3743 that is twice the size of the right arm of Chr4 in an isochromosome structure (asterisk, i(4R)). (C) PCR validation of i(4R). Schematic representation of the Chr4 homologue (top) and i(4R), where the location of a single primer sequence (Primer 1, Supplementary file 7) that flanks the CEN4 inverted repeat is indicated. PCR with Primer 1 amplified the expected product of i(4R) in AMS3743. (D) 24 hr growth curves in YPAD (top panel) and YPAD +32 µg/ml FLC (bottom panel) for P78042 (black line) and AMS3743 (green line). Average slope and standard error of the mean for three biological replicates are indicated. The average maximum slope (n = 3) of P78042 and AMS3743 in YPAD was not significantly different (0.046 and 0.046, respectively, p>0.75, t-test). The average maximum slope (n = 3) of P78042 and AMS3743 was significantly different in FLC (0.002 and 0.003, respectively, p<0.0006, t-test). OD, optical density (Figure 1—source data 1). https://doi.org/10.7554/eLife.45954.002
Figure 1—source data 1. Growth curve analysis. https://doi.org/10.7554/eLife.45954.005
To test the hypothesis that this segmental aneuploidy is an isochromosome structure, we performed CHEF karyotype analysis. Isolate AMS3743 had a novel ~1.2 Mb chromosome band that hybridized to a CEN4 probe via Southern blot (Figure 1B).
This ~1.2 Mb band was twice the size of the right arm of Chr4 (~607 Kb). Consistent with an isochromosome i(4R) structure (a centromere flanked by inverted copies of Chr4R), a single primer amplified a ~4.1 kb product, from Chr4R through CEN4 and back to Chr4R in the isolate with i(4R) but did not amplify any sequence in the reference (SC5314), or progenitor (P78042) isolates (Figure 1C). Next, we determined the impact of i(4R) on fitness in the presence and absence of FLC over a 24 hr period. In the presence of FLC, the i(4R) isolate grew significantly better than the progenitor P78042 (p<0.0006, t-test, Figure 1D). Interestingly, in the absence of FLC, the i(4R) isolate grew as well as the progenitor P78042 (Figure 1D). Furthermore, i(4R) was maintained in 12/12 populations for over ~300 generations in the absence of FLC (see Materials and methods). One of the populations, AMS3743_10, appeared to be losing i(4R) as measured by CHEF gel densitometry (see Materials and methods) and was plated for single colonies in the absence of FLC. One colony (out of six) had lost i(4R) (AMS3743_10_S6, Figure 1—figure supplement 1A). To ask if i(4R) was necessary and sufficient for the increased fitness in FLC, fitness was determined in the presence and absence of FLC. The colony that had lost i(4R) had a reduced growth rate in the presence of FLC, similar to the progenitor P78042 (Figure 1—figure supplement 1B). Overall, these data imply that the long inverted repeat within CEN4 can generate an independent isochromosome structure comprised of two right arms of Chr4, and that i(4R) is necessary and sufficient for increased fitness in FLC. These results parallel the identification of isochromosomes associated with the long inverted repeat sequence within CEN5, which can result in the formation of i(5R) and i(5L), the latter of which confers FLC resistance (Selmecki et al., 2006; Selmecki et al., 2008). 
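The read-depth reasoning used in this section (a log2 ratio converted to copy number, a breakpoint at CEN4, and a band twice the size of Chr4R) can be illustrated with a minimal Python sketch. It assumes the log2 ratio is expressed relative to a diploid baseline, which is a simplification made here and not a description of how YMAP computes copy number. The bin values and function names are hypothetical.

```python
from typing import List, Optional


def log2_ratio_to_copy_number(log2_ratio: float, baseline_ploidy: float = 2.0) -> float:
    """Convert a log2 read-depth ratio to copy number.

    Assumption: the ratio is relative to a region present at `baseline_ploidy` copies.
    """
    return baseline_ploidy * 2.0 ** log2_ratio


def find_breakpoint(copy_numbers: List[int]) -> Optional[int]:
    """Return the index of the first bin whose copy number differs from the previous bin."""
    for i in range(1, len(copy_numbers)):
        if copy_numbers[i] != copy_numbers[i - 1]:
            return i
    return None


# Hypothetical bins along a chromosome: left arm near log2 ratio 0, right arm near 1.
log2_bins = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
copies = [round(log2_ratio_to_copy_number(x)) for x in log2_bins]
breakpoint_bin = find_breakpoint(copies)

# Expected isochromosome size: two copies of Chr4R (~607 kb each, from the text),
# ignoring the short centromere sequence between them.
chr4r_kb = 607
i4r_kb = 2 * chr4r_kb

print(copies, breakpoint_bin, i4r_kb)
```

In this toy input the copy number steps from two to four at the fourth bin, mirroring the Chr4L versus Chr4R pattern described for AMS3743, and 2 × 607 kb gives ~1.2 Mb, consistent with the size of the novel CHEF band.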
Inverted repeat sequences are associated with inversion of centromere sequences
During our investigation of the i(4R) structure, we unveiled a surprising feature of CEN4: the CENP-A-binding core sequence of CEN4 contained two different alleles. One homologue of Chr4 contained a ~3.1 kb sequence inversion between the inverted repeat associated with CEN4. The new, inverted CEN4 sequence was detected by PCR in the reference isolate SC5314, and in the distantly related isolates P78042 and AMS3743 (Figure 1—figure supplement 1C & D). Sanger sequencing indicated that a recombination event occurred between the two arms of the inverted repeat (Figure 1—figure supplement 2). Interestingly, the CENP-A-binding core sequence of CEN4 is asymmetrically positioned on one side of the inverted repeat sequence (Figure 1—figure supplement 1D, shaded region) (Burrack et al., 2016; Sanyal et al., 2004). Therefore, this inversion caused a separation between the known CENP-A-binding core sequence of CEN4 that is located to the right and outside of the inverted repeat.
Identification of long repeat sequences throughout the C. albicans genome
Given the extensive genome rearrangements observed at the long inverted repeat associated with CEN4, we sought to characterize all long repeat sequences within the C. albicans reference genome (SC5314). All long sequence matches within SC5314 were identified by aligning the reference genome sequence to itself using the bioinformatics suite MUMmer (Kurtz et al., 2004). First, all exact sequence matches of 20 nucleotides or longer were identified, then all matches were clustered and extended to obtain a maximum-length colinear string of matches, resulting in a final list of long repeat sequences that ranged from 65 bp to 6499 bp (median 318 bp) with sequence identities of ≥80% (see Materials and methods).
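The first step described above, finding exact matches of 20 nucleotides or longer when a sequence is compared with itself, can be sketched in a few lines of Python. This is only a toy k-mer index under stated assumptions: it looks at the forward strand only, and it does not perform the clustering and extension steps. It is not MUMmer's algorithm or interface. The function name and the toy sequence are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def exact_seed_matches(seq: str, k: int = 20) -> List[Tuple[int, int]]:
    """Return sorted (start_a, start_b) pairs of 0-based positions sharing an identical k-mer.

    Forward strand only; inverted repeats would also need the reverse complement.
    """
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    pairs: List[Tuple[int, int]] = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# Toy sequence: a 20 nt unit, a 5 nt spacer, then the same unit again.
unit = "ACGTTGCATGCCGATAGCTA"
toy_sequence = unit + "TTTTT" + unit
matches = exact_seed_matches(toy_sequence, k=20)
print(matches)
```

A real analysis would then merge neighbouring seeds into longer, possibly inexact, colinear matches and filter them by length and identity, as the text describes.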
The genomic position and percent identity of all matched repeats were determined with MUMmer and manually verified using BLASTN and IGV (Robinson et al., 2011; Thorvaldsdóttir et al., 2013). After excluding all rDNA, MRS and sub-telomeric repeat sequences, 1974 long repeat matches were identified (Supplementary file 2). The MUMmer analysis identified five ORFs and one gene family with known, complex embedded tandem repeat sequences (PGA18, PGA55, EAP1, orf19.1725, CSA1, and the ALS gene family, herein referred to as 'the complex tandem repeat genes'). The complexity of these repeat sequences prohibited the assignment of exact repeat copy number per genome, and they were removed from analyses when indicated. The remaining long repeat sequences cover 2.87% of the haploid reference genome (see Materials and methods). Long repeat matches occurred between sequences on the same chromosome (intra-chromosomal repeats, Figure 2A), on different chromosomes (inter-chromosomal repeats), or both. The number of all repeat matches per chromosome was correlated with chromosome size (R² = 0.65, p<0.016, Figure 2B); however, regions of high repeat density (e.g. ChrRR near the rDNA) or low repeat density (e.g. Chr7L) were detected on some chromosome arms. This repeat density did not correlate with GC content (R² = 0.063, p>0.32) or ORF density (R² = 0.02, p>0.59) on any chromosome arm (Figure 2—source data 1).
Figure 2. Long repeat sequences are found across the C. albicans genome. Detailed results for all long intra- and inter-chromosomal repeat positions, orientations, and gene features are found in Supplementary file 2. Repeats associated with the rDNA, major repeat sequences (MRS), and sub-telomeric repeats were removed prior to the analysis. (A) Representative image of the long intra-chromosomal repeat positions (colored lines – not to scale). Each repeat family is assigned a unique color within its respective chromosome.
Numbers and symbols below each chromosome indicate chromosomal position (Mb), MRS position (black circles), and rDNA locus (blue circle, ChrR). (B) Number of all repeat matches (excluding the complex tandem repeat genes) on each chromosome, ordered by chromosome size (R² = 0.65, p<0.016, gray indicates 95% confidence interval, Figure 2—source data 1). (C) The number of intra-chromosomal (Intra-Chr) and inter-chromosomal (Inter-Chr) repeat matches assigned to each genomic feature: Intergenic, LTR, ORF (excluding the complex tandem repeat genes), retrotransposon (Retro), and tRNA (Figure 2—source data 1). https://doi.org/10.7554/eLife.45954.006
Figure 2—source data 1. Distribution, features, and coverage of long repeat sequences in C. albicans. https://doi.org/10.7554/eLife.45954.010
Figure 2—source data 2. Analysis of long repeat spacer length in C. albicans. https://doi.org/10.7554/eLife.45954.011
Figure 2—source data 3. Analysis of key features of long repeat sequences in C. albicans. https://doi.org/10.7554/eLife.45954.012
We next calculated the orientation and distance between matched intra-chromosomal repeat sequences (Figure 2—figure supplement 1), both important factors for reconstructing the evolutionary history of these duplication events and for analyzing the frequency and outcome of homologous recombination events that occur between repeat sequences (Lobachev et al., 1998; Ramakrishnan et al., 2018). Intra-chromosomal repeats are often generated in tandem by recombination between sister chromatids or replication slippage, and these repeats can move further away from each other by chromosomal rearrangement events (including chromosomal inversions) (Achaz et al., 2000; Reams and Roth, 2015). Indeed, intra-chromosomal repeats were predominantly tandem, although inverted and mirrored repeats also occurred (Supplementary file 2).
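The two quantities introduced above, the orientation of a matched repeat pair and the distance between its copies, can be made concrete with a small Python sketch. It assumes 1-based inclusive coordinates and a strand label per copy, and it only separates same-strand ("tandem") pairs from opposite-strand ("inverted") pairs; mirrored repeats are not modelled. The `RepeatCopy` type, the function names, and the coordinates are hypothetical and chosen only to resemble a 524 bp inverted repeat flanking a ~3.1 kb core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatCopy:
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def spacer_length(a: RepeatCopy, b: RepeatCopy) -> int:
    """Number of bases between two copies on the same chromosome (0 if adjacent or overlapping)."""
    if a.chrom != b.chrom:
        raise ValueError("spacer length is defined only for intra-chromosomal pairs")
    first, second = sorted((a, b), key=lambda c: c.start)
    return max(0, second.start - first.end - 1)


def orientation(a: RepeatCopy, b: RepeatCopy) -> str:
    """Classify a pair as 'tandem' (same strand) or 'inverted' (opposite strands)."""
    return "tandem" if a.strand == b.strand else "inverted"


# Hypothetical pair: two 524 bp copies on opposite strands separated by 3100 bp.
copy_a = RepeatCopy("Chr4", 1000, 1523, "+")
copy_b = RepeatCopy("Chr4", 4624, 5147, "-")
print(spacer_length(copy_a, copy_b), orientation(copy_a, copy_b))
```

Applied to every intra-chromosomal pair, a calculation of this kind would yield the per-chromosome spacer length distributions that are compared in the following paragraph.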
We hypothesized that the distance between matched intra-chromosomal repeats (spacer length) would be predominantly short and that the distribution of spacer lengths on each chromosome would be similar. Strikingly, spacer length ranged from 1 bp to 2,856,212 bp (median ~82.8 kb, excluding the complex tandem repeat genes, see Materials and methods), and was correlated with chromosome size (Figure 2—figure supplement 2A, R² = 0.066, p<0.0001). Additionally, the distribution of spacer lengths was significantly different between chromosomes (Figure 2—figure supplement 2B, p<0.035, Kruskal-Wallis with Dunn's multiple comparison test) with the larger chromosomes (Chr1 and ChrR) containing many repeat matches that were separated by distances greater than ~1.5 Mb. The increased distance between repeat sequences likely occurred via additional large inversions, insertions or telomere-telomere recombination/fusion events. We further annotated the long repeat sequences according to the genomic features contained within each repeat (see Materials and methods). The most common long repeats contained lone long terminal repeats (LTRs) (775), followed by ORFs (339, excluding the complex tandem repeat genes), tRNAs (334), and retrotransposons (40). Repeat matches containing ORFs included partial ORF sequences (196/339, 57.8%), single complete ORF sequences (114/339, 33.6%), and multiple ORFs and intergenic sequences (29/339, 8.6%) (Supplementary file 2). Repeat matches containing complete ORFs and multiple ORFs represent paralogs and multi-gene duplication events. Additionally, there were 349 intergenic, unannotated sequences, 231 of which shared high sequence identity (>83%) with an annotated sequence found elsewhere in the genome, including known LTRs, retrotransposons, and ORFs (Supplementary file 2, 'Unannotated Intergenic Sequence'). For example, an additional 54 LTRs were identified in the reference genome with this analysis.
Interestingly, LTR matched repeat pairs were predominantly dispersed on different chromosomes (78%), while ORF matched repeat pairs were predominantly located on a single chromosome (64%, Figure 2C). Of the matched repeat pairs, the long repeat sequences containing ORFs had the lowest median sequence identity when compared to repeats containing other features (Figure 2—figure supplement 3A, p<0.0001, Kruskal-Wallis with Dunn's multiple comparison test). Conversely, repeats containing ORFs had a significantly longer copy length than repeats containing any other genomic feature (p<0.0001, Kruskal-Wallis with Dunn's multiple comparison test), and ORFs were the only feature with a significant increase in the copy length of intra-chromosomal matches relative to inter-chromosomal matches (Figure 2—figure supplement 3B, p<0.0001, Kruskal-Wallis with Dunn's multiple comparison test). The long repeat sequences containing ORFs were predominantly present in only two copies per genome, had pairwise coding sequences with similarly high identity, and therefore represent paralogous gene duplication events (Supplementary file 2). The origin, function, and evolutionary trajectory of these paralogs may provide insight into the evolution of fungal pathogens like C. albicans that did not undergo the ancient whole genome duplication event (Butler et al., 2009; Marcet-Houben et al., 2009; Wolfe and Shields, 1997). The complex tandem repeat genes, for which genome copy number could not be determined, had low sequence identity and were predominantly found on Chr6 (Figure 2—figure supplement 3C).
In contrast, the full-length coding sequences of all ORFs contained within long repeat sequences were significantly longer (median value of 1380 bp vs 1200 bp, Figure 2—figure supplement 3D, p<0.0008, Kolmogorov-Smirnov test) and had a significantly higher GC content (median value of 37.22% vs 35.22%, Figure 2—figure supplement 3E, p<0.0001, Kolmogorov-Smirnov test) than the full-length coding sequences of all ORFs not contained within long repeat sequences (genome-wide, excluding the complex tandem repeat genes, see Materials and methods). Interestingly, increased GC content was correlated with increased rates of both mitotic and meiotic recombination events in S. cerevisiae (Kiktev et al., 2018).

Identification of CNV breakpoints in isolates with segmental aneuploidies

Next, CNV breakpoints were determined across 13 additional isolates with one or more segmental aneuploidies. Six of these isolates were from in vitro evolution experiments in the presence of azole antifungal drugs (FLC or miconazole), four were from in vivo evolution experiments in a murine model of oropharyngeal candidiasis (OPC) performed in the absence of antifungal drugs, and three were human clinical isolates (Supplementary file 1). All segmental aneuploidies arose from a known euploid diploid progenitor (Abbey et al., 2014; Hirakawa et al., 2015), except two clinical isolates with unknown origin and the i(4R) isolate that arose from a trisomic progenitor, described above. Segmental aneuploidies were initially detected by CHEF karyotype analysis and ddRAD-seq, but the coordinates of the CNV breakpoints were not known (Abbey et al., 2014; Forche et al., 2018; Mount et al., 2018; Ropars et al., 2018). The ploidy of each isolate was measured by flow cytometry and the DNA copy number of all loci was determined using whole genome sequencing (see Materials and methods). 
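As a rough illustration of how sequencing depth and a flow cytometry ploidy estimate can be combined into a copy number, the sketch below scales a baseline ploidy by a log2 read-depth ratio. This is a generic textbook conversion written under stated assumptions, not a description of the pipeline used in this study; the function names, the `example_bins` values, and the default baseline of 2 are all hypothetical.

```python
import math


def copy_number_from_log2(log2_ratio, baseline_ploidy=2.0):
    """Convert a log2 depth ratio (sample vs. reference) to a copy number.

    A ratio of 0 means the region is present at the baseline ploidy.
    """
    return baseline_ploidy * 2.0 ** log2_ratio


def call_copy_number(log2_ratio, baseline_ploidy=2.0):
    """Round the estimate to the nearest whole copy, never below zero."""
    return max(0, round(copy_number_from_log2(log2_ratio, baseline_ploidy)))


# Hypothetical genomic bins in a diploid isolate (values for illustration).
example_bins = {
    "disomic": 0.0,                # same depth as the reference
    "trisomic": math.log2(3 / 2),  # 1.5x the reference depth
    "monosomic": -1.0,             # half the reference depth
}
calls = {name: call_copy_number(value) for name, value in example_bins.items()}
```

Passing a different `baseline_ploidy` (for example 4.0 for a tetraploid isolate) rescales the same depth ratios, which is why an independent ploidy measurement is needed to interpret relative depth.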
Among the 13 diverse isolates, 19 segmental aneuploidies were confirmed, with at least one segmental aneuploidy detected on each of the eight chromosomes (Figure 3A, Figure 3—figure supplement 1A–J). Segmental amplifications were more frequent (12/19, 63.2%) than segmental deletions (3/19, 15.8%). The remaining segmental aneuploidies (4/19, 21.1%) consisted of more complex rearrangements that resulted in a segmental amplification and a terminal chromosome deletion at the same breakpoint.

Figure 3 (with 1 supplement). All copy number breakpoints resulting in segmental aneuploidy occur at repeat sequences. (A) Whole genome sequence data plotted as a log2 ratio and converted to chromosome copy number (Y-axis) and chromosome location (X-axis) using YMAP. The source of each isolate is indicated in color: in vivo evolution experiments in a murine model of oropharyngeal candidiasis (OPC) (green), in vitro evolution experiments in the presence of azole antifungal drugs (red), and clinical isolates (blue). Ploidy, determined by flow cytometry, is indicated on the far right. Every copy number breakpoint occurred at a repeat sequence (red arrow); additional details are in Supplementary file 3. Locations of the Major Repeat Sequences (black circle) and rDNA array (blue circle) are shown below. Example copy number breakpoints for two isolates (B–C). (B) Isolate AMS3053 underwent a complex rearrangement on Chr3L at a long inverted repeat (Repeat 124, red lines). R
Resistance to the limited number of available antifungal drugs is a serious problem in the treatment of Candida albicans. We found that aneuploidy in general and a specific segmental aneuploidy, consisting of an isochromosome composed of the two left arms of chromosome 5, were associated with azole resistance. The isochromosome forms around a single centromere flanked by an inverted repeat and was found as an independent chromosome or fused at the telomere to a full-length homolog of chromosome 5. Increases and decreases in drug resistance were strongly associated with gain and loss of this isochromosome, which bears genes expressing the enzyme in the ergosterol pathway targeted by azole drugs, efflux pumps, and a transcription factor that positively regulates a subset of efflux pump genes.
Clinical isolates are prototrophic and hence are not amenable to genetic manipulation using nutritional markers. Here we describe a new set of plasmids carrying the NAT1 (nourseothricin) drug resistance marker (Shen et al., ), which can be used both in clinical isolates and in laboratory strains. We constructed novel plasmids containing HA-NAT1 or MYC-NAT1 cassettes to facilitate PCR-mediated construction of strains with C-terminal epitope-tagged proteins and a NAT1-pMet3-GFP plasmid to enable conditional expression of proteins with or without the green fluorescent protein fused at the N-terminus. Furthermore, for proteins that require both the endogenous N- and C-termini for function, we have constructed a GF-NAT1-FP cassette carrying truncated alleles that facilitate insertion of an intact, single copy of GFP internal to the coding sequence. In addition, GFP-NAT1, RFP-NAT1 and M-Cherry-NAT1 plasmids were constructed, expressing two differently labelled gene products for the study of protein co-expression and co-localization in vivo. Together, these vectors provide a useful set of genetic tools for studying diverse aspects of gene function in both clinical and laboratory strains of C. albicans.
Cellular ploidy is the number of complete sets of chromosomes in a cell. Many eukaryotic species have two (diploid) or more than two (polyploid) sets of chromosomes (1). These diploid and polyploid states are often the result of ancient whole-genome duplication (WGD) or hybridization events that occurred throughout the evolution of plants, animals, and fungi (2–4). Ploidy changes also occur during the development of many organisms and can vary within different tissues of the same organism and between individuals of the same species. For example, ploidy changes occur during the sexual cycle of eukaryotes, from haploid gametes to diploid somatic cells. Additionally, some cells continue to increase in ploidy during development, resulting in somatic tissues that have a mixture of diploid and polyploid cells, including human hepatocytes and megakaryocytes (5–7). These ongoing, developmentally programmed changes in ploidy are important for viability and are beneficial to many organisms (8), but the mechanisms controlling ploidy and the physiological significance of each ploidy level are not well characterized.
In vitro studies suggest that stress may generate random standing variation and that different cellular and ploidy states may evolve more rapidly under stress. Yet this idea has not been tested with pathogenic fungi growing within their host niche in vivo. Here, we analyzed the generation of both genotypic and phenotypic diversity during exposure of Candida albicans to the mouse oral cavity. Ploidy, aneuploidy, loss of heterozygosity (LOH), and recombination were determined using flow cytometry and double digest restriction site-associated DNA sequencing. Colony phenotypic changes in size and filamentous growth were evident without selection and were enriched among colonies selected for LOH of the GAL1 marker. Aneuploidy and LOH occurred on all chromosomes (Chrs), with aneuploidy more frequent for smaller Chrs and whole Chr LOH more frequent for larger Chrs. Large genome shifts in ploidy to haploidy often maintained one or more heterozygous disomic Chrs, consistent with random Chr missegregation events. Most isolates displayed several different types of genomic changes, suggesting that the oral environment rapidly generates diversity de novo. In sharp contrast, following in vitro propagation, isolates were not enriched for multiple LOH events, except in those that underwent haploidization and/or had high levels of Chr loss. The frequency of events was overall 100 times higher for C. albicans populations following in vivo passage compared with in vitro. These hyper-diverse in vivo isolates likely provide C. albicans with the ability to adapt rapidly to the diversity of stress environments it encounters inside the host.