Blood pressure (BP) was inconsistently associated with migraine and the mechanisms of BP-lowering medications in migraine prophylaxis are unknown. Leveraging large-scale summary statistics for migraine (N cases/N controls = 59,674/316,078) and BP (N = 757,601), we find positive genetic correlations of migraine with diastolic BP (DBP, rg = 0.11, P = 3.56 × 10⁻⁶) and systolic BP (SBP, rg = 0.06, P = 0.01), but not pulse pressure (PP, rg = −0.01, P = 0.75). Cross-trait meta-analysis reveals 14 shared loci (P ≤ 5 × 10⁻⁸), nine of which replicate (P < 0.05) in the UK Biobank. Five shared loci (ITGB5, SMG6, ADRA2B, ANKDD1B, and KIAA0040) are reinforced in gene-level analysis and highlight potential mechanisms involving vascular development, endothelial function and calcium homeostasis. Mendelian randomization reveals stronger instrumental estimates of DBP (OR [95% CI] = 1.20 [1.15–1.25]/10 mmHg; P = 5.57 × 10⁻²⁵) on migraine than SBP (1.05 [1.03–1.07]/10 mmHg; P = 2.60 × 10⁻⁷) and a corresponding opposite effect for PP (0.92 [0.88–0.95]/10 mmHg; P = 3.65 × 10⁻⁷). These findings support a critical role of DBP in migraine susceptibility and shared biology underlying BP and migraine.
Type 2 Diabetes (T2D) constitutes a global health burden. Efforts to uncover predisposing genetic variation have been considerable, yet detailed knowledge of the underlying pathogenesis remains poor. Here, we constructed a T2D phenotypic-linkage network (T2D-PLN) by integrating diverse gene functional information that highlights genes which, when disrupted in mice, elicit similar T2D-relevant phenotypes. Sensitising the network to T2D-relevant phenotypes enabled significant functional convergence to be detected between genes implicated in monogenic or syndromic diabetes and genes lying within genomic regions associated with T2D common risk. We extended these analyses to a recent multiethnic T2D case-control exome of 12,940 individuals that found no evidence of T2D risk association for rare frequency variants outside of previously known T2D risk loci. Examining associations involving protein-truncating variants (PTV), most at low population frequencies, the T2D-PLN was able to identify a convergent set of biological pathways that were perturbed within four of five independent T2D case/control ethnic sets of 2000 to 5000 exomes each. These same pathways were found to be over-represented among both known monogenic or syndromic diabetes genes and genes within T2D-associated common risk loci. Our study demonstrates convergent biology amongst variants representing different classes of T2D genetic risk. Although convergence was observed at the pathway level, few of the contributing genes were found in common between different cohorts or variant classes, most notably between the exome variant sets, which suggests that future rare variant studies may do better to focus their power on a single population of recent common ancestry.
Regional specificity of stem cell-derived astrocytes is believed to be an important prerequisite for their applications in disease modelling and cell-based therapies. The regional identity of these astrocytes is often defined by the positional characteristics of their antecedent, stem cell-derived neural progenitors patterned to a fate of interest, with the assumption that the positional specification is to be preserved by the derived astrocytes. Using a human induced pluripotent stem cell line designed for tracing midbrain floor plate derivatives, here we show that lineage composition of the derived astrocytes is not a faithful recapitulation of the founder progenitor population, as demonstrated by the loss of floor plate differentiated progeny in the final astrocyte products. Using deep single cell RNA sequencing, we identified distinct transcriptomic signatures of midbrain floor plate-derived astrocytes. Our study highlights the need for rigorous characterisation of pluripotent stem cell-derived regional astrocytes and provides a valuable resource for assessing midbrain floor plate-derived human astrocytes.
Copy-number variations (CNVs) are a common cause of intellectual disability and/or multiple congenital anomalies (ID/MCA). However, the clinical interpretation of CNVs remains challenging, especially for inherited CNVs. Well-phenotyped patients (5,531) with ID/MCA were screened for rare CNVs using a 250K single-nucleotide polymorphism array platform in order to improve the understanding of the contribution of CNVs to a patient's phenotype. We detected 1,663 rare CNVs in 1,388 patients (25.1%; range 0–5 per patient), of which 437 occurred de novo and 638 were inherited. The detected CNVs were analyzed for various characteristics, gene content, and genotype-phenotype correlations. Patients with severe phenotypes, including organ malformations, had more de novo CNVs (P < 0.001), whereas patient groups with milder phenotypes, such as facial dysmorphisms, were enriched for both de novo and inherited CNVs (P < 0.001), indicating that not only de novo but also inherited CNVs can be associated with a clinically relevant phenotype. Moreover, patients with multiple CNVs presented with a more severe phenotype than patients with a single CNV (P < 0.001), pointing to a combinatorial effect of the additional CNVs. In addition, we identified 20 de novo single-gene CNVs that directly indicate novel genes for ID/MCA, including ZFHX4, ANKH, DLG2, MPP7, CEP89, TRIO, ASTN2, and PIK3C3.
Autism Spectrum Disorders (ASD) are highly heritable and characterised by impairments in social interaction and communication, and restricted and repetitive behaviours. Considering four sets of de novo copy number variants (CNVs) identified in 181 individuals with autism and exploiting mouse functional genomics and known protein-protein interactions, we identified a large and significantly interconnected interaction network. This network contains 187 genes affected by CNVs drawn from 45% of the patients we considered and 22 genes previously implicated in ASD, of which 192 form a single interconnected cluster. On average, those patients with copy number changed genes from this network possess changes in 3 network genes, suggesting that epistasis mediated through the network is extensive. Correspondingly, genes that are highly connected within the network, and thus whose copy number change is predicted by the network to be more phenotypically consequential, are significantly enriched among patients that possess only a single ASD-associated network copy number changed gene (p = 0.002). Strikingly, deleted or disrupted genes from the network are significantly enriched in GO-annotated positive regulators (2.3-fold enrichment, corrected p = 2 × 10⁻⁵), whereas duplicated genes are significantly enriched in GO-annotated negative regulators (2.2-fold enrichment, corrected p = 0.005). The direction of copy change is highly informative in the context of the network, providing the means through which perturbations arising from distinct deletions or duplications can yield a common outcome. These findings reveal an extensive ASD-associated molecular network, whose topology indicates ASD-relevant mutational deleteriousness and that mechanistically details how convergent aetiologies can result extensively from CNVs affecting pathways causally implicated in ASD.
The generation of deeply phenotyped patient cohorts offers an enormous potential to identify disease subtypes with prognostic and therapeutic utility. Here, we quantify diverse Parkinson's disease patient phenotypes on continuous scales by identifying the underlying axes of phenotypic variation using a Bayesian multiple phenotype mixed model that incorporates genotypic relationships. This approach overcomes many of the limitations associated with clustering methods and better reflects the more continuous phenotypic variation observed amongst patients. We identify three principal axes of Parkinson's disease patient phenotypic variation which are reproducibly found across three independent, deeply and diversely phenotyped UK and US Parkinson's disease cohorts. These three axes explain over 75% of the observed clinical variation and remain robustly captured with a fraction of the clinically-recorded features. Using these axes as quantitative traits, we identify significant overlaps in the genetic risk associated with each axis and other human complex diseases, namely coronary artery disease and schizophrenia, providing new avenues for disease-modifying therapies. Our study demonstrates how deeply phenotyped cohorts can be used to identify latent heritable disease-modifying traits.
Abstract
Craniofacial characteristics are highly informative for clinical geneticists when diagnosing genetic diseases. As a first step towards the high-throughput diagnosis of ultra-rare developmental diseases we introduce an automatic approach that implements recent developments in computer vision. This algorithm extracts phenotypic information from ordinary non-clinical photographs and, using machine learning, models human facial dysmorphisms in a multidimensional 'Clinical Face Phenotype Space'. The space locates patients in the context of known syndromes and thereby facilitates the generation of diagnostic hypotheses. Consequently, the approach will aid clinicians by greatly narrowing (by 27.6-fold) the search space of potential diagnoses for patients with suspected developmental disorders. Furthermore, this Clinical Face Phenotype Space allows the clustering of patients by phenotype even when no known syndrome diagnosis exists, thereby aiding disease identification. We demonstrate that this approach provides a novel method for inferring causative genetic variants from clinical sequencing data through functional genetic pathway comparisons.
https://doi.org/10.7554/eLife.02020.001
eLife digest
Rare genetic disorders affect around 8% of people, many of whom live with symptoms that greatly reduce their quality of life. Genetic diagnoses can provide doctors with information that cannot be obtained by assessing clinical symptoms, and this allows them to select more suitable treatments for patients. However, only a minority of patients currently receive a genetic diagnosis. Alterations in the face and skull are present in 30–40% of genetic disorders, and these alterations can help doctors to identify certain disorders, such as Down's syndrome or Fragile X.
Extending this approach, Ferry et al. trained a computer-based model to identify the patterns of facial abnormalities associated with different genetic disorders. The model compares data extracted from a photograph of the patient's face with data on the facial characteristics of 91 disorders, and then provides a list of the most likely diagnoses for that individual. The model used 36 points to describe the space, including 7 for the jaw, 6 for the mouth, 7 for the nose, 8 for the eyes and 8 for the brow. This approach of Ferry et al. has three advantages. First, it provides clinicians with information that can aid their diagnosis of a rare genetic disorder. Second, it can narrow down the range of possible disorders for patients who have the same ultra-rare disorder, even if that disorder is currently unknown. Third, it can identify groups of patients who can have their genomes sequenced in order to identify the genetic variants that are associated with specific disorders. The work by Ferry et al. lays out the basic principles for automated approaches to analyze the shape of the face and skull. The next challenge is to integrate photos with genetic data for use in clinical settings.
https://doi.org/10.7554/eLife.02020.002
Introduction
Genetic disorders affect almost 8% of people (Baird et al., 1988), about a third of whom will have symptoms that greatly reduce their quality of life. While there are over 7000 known inherited disorders, only a minority of patients with a suspected developmental disorder receive a clinical, let alone a genetic, diagnosis (Hart and Hart, 2009). A genetic diagnosis allows more specific therapeutic interventions to be investigated and can aid the identification of primary vs secondary symptoms. The introduction of whole genome and exome sequencing into modern clinical medicine will be instrumental in raising the current low rate of genetic diagnoses for ultra-rare diseases.
Nevertheless, tools to accurately assign functional and disease relevance to sequence variants are substantially lacking. Projects that apply next generation sequencing to patients in clinical settings fail to report genetic diagnoses for approximately 80% of cases (de Ligt et al., 2012). The difficulty lies in identifying the causal variant in an individual patient: even when ignoring experimental error, each individual carries approximately 4 million differences, in the case of whole genome sequencing, relative to the reference genome. Computational analyses currently are able only to interpret the ∼2500 variants that alter protein sequence at evolutionarily conserved positions and ∼400 very rare variants that are likely to be causal for pathogenic processes (Abecasis et al., 2012). Notably, of the ∼10% of the genome that is functional all except the 1.2% that is protein-coding is often disregarded (Weischenfeldt et al., 2013). Therefore, the prediction of causal inherited variants in an individual can result in high false positive and high false negative rates. The most powerful approach to associate a particular gene with an ultra-rare disease is to identify multiple unrelated individuals with the disorder whose genomes harbor deleterious alleles in a shared gene, regulatory element or pathway (Schuurs-Hoeijmakers et al., 2012). However, this approach relies on at least two individuals with the same disorder being available for comparison, an unlikely event given that these two individuals are selected for comparison from the roughly 100 million people affected by rare developmental disorders (prevalence of less than 2 per 100,000 around the world) (Orphanet, 2013). For the past 65 years, clinical geneticists have studied, diagnosed, and characterized developmental disorders on the basis of common characteristics among patients (Rimoin and Hirschhorn, 2004). When a given causal variant is ultra-rare, however, this presents substantial difficulties. 
Consequently, to realize the full potential of next generation sequencing in clinical diagnostics, phenotypic characterization must also become correspondingly high throughput and sensitive (Hennekam and Biesecker, 2012). The facial gestalt provides valuable information to identify similarities between patients because 30–40% of genetic disorders manifest craniofacial abnormalities (Hart and Hart, 2009). The utility of computer vision for diagnosis and phenotyping of dysmorphic disorders has been explored previously by several groups and with varying approaches (Loos et al., 2003; Hammond et al., 2005; Hammond, 2007; Boehringer et al., 2006; Dalal and Phadke, 2007; Vollmar et al., 2008; Boehringer et al., 2011, reviewed in Hammond and Suttie, 2012; Baynam et al., 2013). The computational analysis of facial morphology using 3D imaging has been applied to conditions such as fetal alcohol syndrome (Suttie et al., 2013), schizophrenia (Buckley et al., 2005; Hennessy et al., 2006, 2007) and autism (Aldridge et al., 2011). While 3D imaging studies have shown high discriminatory power in terms of classification they have relied on specialized imaging equipment and patient cooperation. Previous work with 2D images has relied on manual annotation of images, controlling lighting, pose and expression to allow consistent analyses. These factors greatly limit the availability, and ultimately the potential widespread clinical utility of such approaches. We have adopted a complementary approach that takes advantage of the wealth of data available for human faces, an indirect result of the ubiquitous availability of cameras. To do so we provide a new representation ('Clinical Face Phenotype Space'), which is an application of computer vision and machine learning algorithms for analyzing craniofacial dysmorphisms from ordinary photographs. 
We have ensured that Clinical Face Phenotype Space is robust to spurious variations such as lighting, pose, and image quality which would otherwise bias analyses. The approach is fully automated and provides objective and consistent computational descriptions of facial gestalt. Our method both greatly narrows the search space for investigating known disorders and will increase the power of inferring causative variants in previously unidentified genetic disease.
Results
We sought to construct a database of patient photos within which faces would be automatically identified and their key features annotated. Our intent was to build a model of dysmorphic variation from a set of syndromes that, additionally, would be able to cluster syndromes not used in model training. Our schema by which a patient photo is automatically analyzed within the context of Clinical Face Phenotype Space is provided in Figure 1A.
Figure 1. Overview of the computational approach and average faces of syndromes. (A) A photo is automatically analyzed to detect faces and feature points are placed using computer vision algorithms. Facial feature annotation points delineate the supra-orbital ridge (8 points), the eyes (mid points of the eyelids and eye canthi, 8 points), nose (nasion, tip, ala, subnasale and outer nares, 7 points), mouth (vermilion border lateral and vertical midpoints, 6 points) and the jaw (zygoma mandibular border, gonion, mental protuberance and chin midpoint, 7 points). Shape and Appearance feature vectors are then extracted based on feature points and these determine the photo's location in Clinical Face Phenotype Space (further details on feature points in Figure 1—figure supplement 1). This location is then analyzed in the context of existing points in Clinical Face Phenotype Space to extract phenotype similarities and diagnosis hypotheses (further details on Clinical Face Phenotype Space with simulation examples in Figure 1—figure supplement 2). (B) Average faces of syndromes in the database constructed using AAM models ('Materials and methods') and the number of individuals that each average face represents. See online version of this manuscript for animated morphing images that show facial features differing between controls and syndromes (Figure 2).
https://doi.org/10.7554/eLife.02020.003
Image database composition
We first collected a database of 2878 images, including 1515 healthy controls and 1363 pictures for eight known developmental disorders from publicly available sources across the internet (Table 1, references for image sources are available from Supplementary file 1). Manual checks were performed to exclude images where the face or an eye was not clearly visible, or where an expert clinician (DRF) could not verify the diagnosis. Manual annotation of facial feature points was performed on all images to allow training and testing of an automated annotation algorithm. These initial requirements for manual intervention are dispensed with in the final automatic algorithm (see below).
Table 1. Composition of the database
https://doi.org/10.7554/eLife.02020.006
Syndrome                        Nr images
Public images online
Angelman                        205
Apert                           203
Cornelia de Lange               179
Down                            199
Fragile X                       164
Progeria                        78
Treacher Collins                103
Williams-Beuren                 232
Controls                        1515
22q11                           8
Marfan                          18
Sotos                           36
Turner                          12
Published images
PACS1                           2
BRAF                            35
CFC                             1
Costello                        10
ERF                             5
HRAS                            5
KRAS                            12
MAP2K1                          5
MAP2K2                          4
MEK1                            5
NRAS                            2
PTPN11                          19
RAF1                            9
SHOC2                           8
SOS1                            30
The Gorlin Collection
Aarskog                         19
Achondroplasia                  12
Alagille                        8
Albright                        7
Angelman                        13
Apert                           49
Beckwith-Wiedemann              11
Bloom                           9
BOF                             15
Cartilagehair                   13
CHARGE                          12
Cherubism                       20
CleidoCranialdysostosis         13
Coffin-Lowry                    20
Costello                        9
CriduChat                       17
Crouzon                         16
Crouzonodermoskeletal           5
Cutislaxa                       11
DeLange                         17
Diastrophicdysplasia            5
Down                            8
Dubowitz                        12
Dyggve-Melchior-Clausen         8
EEC                             6
Ehlers-Danlos                   17
Ellis-vanCreveld                3
FG                              11
FragileX                        27
Frontometaphysealdysplasia      12
Gorlin                          91
Gorlin_Chaudry_Moss             13
Greig                           7
Hallermann-Streiff              9
Incontinentiapigmenti           4
Kabuki                          25
Klippel-Feil                    3
Klippel-Trenaunay               10
Langer-Giedion                  14
Larsen                          11
Lenz_Majewski                   17
Lymphedema-Lymphangiectasia-MR  8
Melnick_Needles                 17
Moebius                         9
Muenke                          15
Myotonicdystrophy               9
Neurofibromatosis               7
Noonan                          29
OAVdysplasia                    18
ODD                             21
OFCD                            10
OFD                             18
OPD                             31
Osteopetrosis                   2
Osteosclerosis                  5
Otodental                       2
Poland                          4
Prader–Willi                    16
Progeria                        14
Proteus                         6
Rieger                          4
Rothmund-Thomson                13
Rubinstein-Taybi                8
Saethre-Chotzen                 25
Sclerosteosis                   4
SeckelMOD                       7
SEDcongenita                    6
Sotos                           16
Stickler                        42
TRP                             24
Waardenburg                     39
Weaver                          13
Williams-Beuren                 19
Computer vision algorithms
We proceeded to train a computer vision algorithm for automatic annotation of 36 feature points of interest across the face (Figure 1A). Our approach takes advantage of a variety of facial detection algorithms (OpenCV [Bradski, 2000], Viola Jones [Viola and Jones, 2001] and Everingham [Everingham et al., 2009]) and custom learning (consensus of exemplars [Belhumeur et al., 2011]) to accurately place feature points on a given face ('Materials and methods').
Across all images in our database, manual checking found that our algorithm detected and annotated 99.5% of tested faces correctly with accuracies in the range 6–60% of the width of an eye (individual feature point accuracies are provided in Figure 1—figure supplement 1). We used an Active Appearance Model ('Materials and methods') to calculate an average face within any set of images, representing consistent shape and appearance features within the group (Figure 1B and animated morphs in Figure 2). The average faces for each syndrome show that the algorithm effectively captures characteristic features of dysmorphic syndromes (Figure 2—figure supplement 1). For each feature point, the algorithm extracts a feature vector describing appearance of the surrounding patch. The algorithm then constructs a feature vector describing shape based on the relative pairwise distances between all feature points ('Materials and methods'). We next sought to compare the syndrome relevant information content of the feature descriptors to previous studies (Hammond et al., 2005; Boehringer et al., 2006; Hammond, 2007; Vollmar et al., 2008). We found that classification analysis based on support vector machines provided similar accuracies to previous work, despite disparities in image variability (average classification accuracy 94.4%, see Figure 4—figure supplement 1, Figure 4—figure supplement 2 and 'Materials and methods').
Figure 2. Animated morphs of average faces from controls to syndromes. (A) Angelman, (B) Apert, (C) Cornelia de Lange, (D) Down, (E) Fragile X, (F) Progeria, (G) Treacher-Collins, (H) Williams-Beuren. Delineation of syndrome gestalt relative to controls with distortion graphs in Figure 2—figure supplement 1.
https://doi.org/10.7554/eLife.02020.008
Figure 2—figure supplement 1. Distortion graphs representing the characteristic deformation of syndrome faces relative to the average control face. Each line reflects whether the distance is extended or contracted compared with the control face. White: the distance is similar to controls; blue: shorter relative to controls; red: extended in patients relative to controls.
https://doi.org/10.7554/eLife.02020.009
It is important to emphasize that the analyzed images vary greatly, as the two exclusion criteria (both eyes visible and diagnosis verified by DRF) imposed only minimal restrictions on image selection. Photos were analyzed irrespective of the subject's age, gender, facial expression or ethnicity or the background scenery. Principal component analysis (PCA) of facial descriptor vectors illustrates that the main sources of variation among images are indeed lighting, pose, and facial expression, rather than phenotypic features (Figure 1—figure supplement 2).
Constructing a Clinical Face Phenotype Space with metric learning
We next performed Metric Learning using a Large Margin Nearest Neighbor (Weinberger and Saul, 2009) approach for the eight syndromes in the database. This approach linearly transformed the multidimensional space of PCA feature vectors to optimize the separation of syndromes: dimensions informative for dysmorphism phenotypes are expanded while uninformative dimensions are compressed (thus changing the relative importance for clustering). We denote the resulting transformed 270-dimensional space as 'Clinical Face Phenotype Space' (see 'Materials and methods'). Due to its design, Clinical Face Phenotype Space clusters patient faces based on diagnostically relevant phenotypic features, while tolerating spurious variation.
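The effect of expanding informative dimensions and compressing uninformative ones can be illustrated with a minimal Python sketch. It does not implement Large Margin Nearest Neighbor: the diagonal matrix `L` below is chosen by hand purely for illustration, whereas in the approach described above the transform is learned from labelled faces. The synthetic two-dimensional data, the helper `loo_knn_accuracy` and all parameter values are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def loo_knn_accuracy(X, y, k=1):
    """Leave-one-out k-nearest-neighbour accuracy with Euclidean distance."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # a point may not vote for itself
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y[nearest]                       # labels of the k nearest neighbours
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float((pred == y).mean())

rng = np.random.default_rng(0)
n_per_class = 50
y = np.repeat([0, 1], n_per_class)

# Dimension 0 is informative (class means differ); dimension 1 is spurious
# variation with a much larger spread (standing in for, e.g., lighting).
informative = np.where(y == 0, -1.0, 1.0) + rng.normal(0.0, 0.5, size=y.size)
spurious = rng.normal(0.0, 100.0, size=y.size)
X = np.column_stack([informative, spurious])

# Hypothetical hand-picked linear transform: expand the informative
# dimension and compress the spurious one. A metric-learning method
# would estimate this matrix from the labelled data instead.
L = np.diag([2.0, 0.005])
X_t = X @ L.T

acc_before = loo_knn_accuracy(X, y)
acc_after = loo_knn_accuracy(X_t, y)
print("leave-one-out accuracy before transform:", acc_before)
print("leave-one-out accuracy after transform: ", acc_after)
```

In this toy setting nearest neighbours in the raw space are largely determined by the spurious dimension, while after the transform they are determined mainly by the informative one, which is the behaviour the transformed space is designed to have.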
The relative importance of spurious and phenotypic variation for clustering in Clinical Face Phenotype Space was tested using simulated faces ('Materials and methods'). For these faces, feature dimensions that reflected known spurious variation such as lighting and head orientation were compressed and hence were of less relevance for clustering (Figure 1—figure supplement 2). For the eight syndromes with which Clinical Face Phenotype Space was created, we performed tests with supervised learning and clustering. A kNN-classifier applied within Clinical Face Phenotype Space was able to correctly classify images with an accuracy of 99.5% using the leave-one-out method. However, to avoid biases introduced by training data size, we also assessed the improvements in clustering by measuring the search space reduction (hereafter referred to as the Clustering Improvement Factor or CIF, 'Materials and methods'). This estimates the factor by which the Clinical Face Phenotype Space improves the clustering of syndromes when compared with random chance (to 95% confidence). On average, the clustering of the eight syndromes within the database was improved by 11.0-fold (geometric mean of improved clustering, CIF range 9.1–23.5, maximum possible mean 12.5; Figure 3).
Figure 3. Clinical Face Phenotype Space enhances the separation of different dysmorphic syndromes. The graph shows a two-dimensional representation of the full Clinical Face Phenotype Space, with links to the 10 nearest neighbors of each photo (circle) and photos placed with force-directed graphing. The Clustering Improvement Factor (CIF, fold better clustering than random expectation) estimate for each of the syndromes is shown along the periphery.
https://doi.org/10.7554/eLife.02020.010
Next, we tested and confirmed our hypothesis that Clinical Face Phenotype Space could be generalized to dysmorphic syndromes that were not used in the training.
We had access to 75 syndromes from the Gorlin collection (a kind gift curated and annotated by Professor Raoul Hennekam, Academic Medical Center, University of Amsterdam), which we supplemented with additional images of 22q11, Marfan and Sotos syndromes. Furthermore, we collected images of patients with verified genetic mutations in PACS1 or in specific genes from the RAS/MEK pathway (references for image sources are in Supplementary file 1; see 'Materials and methods'). The number of individuals within each syndrome varied between 2 and 223. The search space reduction was on average 27.6-fold better than random chance (CIF range 1.0–700.0, maximum possible average CIF was 150.0; Figure 4A). That is to say, among 2754 patients' faces associated with any of 90 syndromes, Clinical Face Phenotype Space makes it 27.6-fold easier to make the correct diagnosis. This demonstrates that Clinical Face Phenotype Space is an effective approach to the identification of multiple individuals sharing ultra-rare, previously undocumented, genetic disorders.
Figure 4. Clinical Face Phenotype Space is generalizable to dysmorphic syndromes that are absent from a training set. (A) Clustering Improvement Factor (CIF) estimates are plotted vs the number of individuals per syndrome grouping in the Gorlin collection or patients with similar genetic variant diagnoses. As expected, the stochastic variance in CIF is inversely proportional to the number of individuals available for sampling. The median CIF across all groups is 27.6-fold over what is expected by clustering syndromes randomly. That is to say, the CIF of a randomly placed set is 1. The maximum CIF is fixed by the total number of images in the database and by the cardinality of a syndrome set: the theoretical maximal CIF upper bound is plotted as a red dotted line. The minimum and maximum CIFs, for Cutislaxa syndrome and Otodental syndrome, were 1.0 and 700.0, respectively.
(B) Average probabilistic classification accuracies of each individual face placed in Clinical Face Phenotype Space (class prioritization by 20 nearest neighbors weighted by prevalence in the database). The 8 initial syndromes used to train Clinical Face Phenotype Space are shown in color. For syndromes with fewer than 50 examples, accuracies were averaged across all syndromes binned by data set size (i.e., the average accuracy is shown for syndromes with 2–5, 6–10, 11–25, and 26–50 images in the database, Supplementary file 1). Classification accuracies increase in proportion to the number of individuals with the syndrome present in the database. Accuracies using support vector machines with binary and forced choice classifications are shown in Figure 4—figure supplement 1 and Figure 4—figure supplement 2. A simulation example of probabilistic querying of Clinical Face Phenotype Space is shown in Figure 4—figure supplement 3.
https://doi.org/10.7554/eLife.02020.011
We proceeded to test if Clinical Face Phenotype Space recapitulates the modularity of genetic diseases, where clusters of phenotypically similar disorders reflect functional relationships among the genes involved (see Oti and Brunner, 2007 for a review). We have shown that individuals with the same underlying genetic disease automatically cluster in Clinical Face Phenotype Space. We next tested whether disorders caused by mutations in different genes result in meaningful clusters in Clinical Face Phenotype Space. We selected disorders with a known genetic origin, using either gene associations from OMIM or publications describing the identification of causative genes (see 'Materials and methods'). For each pair of genes, the shortest path in a protein–protein interaction network was obtained from DAPPLE (Rossin et al., 2011), giving a protein interaction distance relevant to that gene pair.
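As a rough illustration of what a protein interaction distance is, the following Python sketch computes the shortest path length between two genes in a small undirected network using breadth-first search. It is not DAPPLE and does not use its data: the function `interaction_distance`, the `max_dist` cut-off and the toy network with placeholder gene names `G1` to `G6` are all hypothetical, introduced only for this example.

```python
from collections import deque

def interaction_distance(network, source, target, max_dist=4):
    """Shortest path length between two genes by breadth-first search.

    `network` maps each gene to the set of its direct interaction partners.
    Returns None when either gene is absent or no path of length
    <= max_dist exists (treated here as an unknown distance).
    """
    if source not in network or target not in network:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_dist:
            continue  # do not explore beyond the cut-off
        for partner in network[node]:
            if partner == target:
                return dist + 1
            if partner not in seen:
                seen.add(partner)
                queue.append((partner, dist + 1))
    return None

# Hypothetical toy network: a chain G1-G2-G3-G4-G5 plus an isolated gene G6.
toy_network = {
    "G1": {"G2"},
    "G2": {"G1", "G3"},
    "G3": {"G2", "G4"},
    "G4": {"G3", "G5"},
    "G5": {"G4"},
    "G6": set(),
}

for a, b in [("G1", "G2"), ("G1", "G4"), ("G1", "G5"), ("G1", "G6")]:
    print(a, b, interaction_distance(toy_network, a, b))
```

Because breadth-first search visits genes in order of increasing distance from the source, the first time the target is reached gives the shortest path length; gene pairs with no connecting path within the cut-off fall into the unknown-distance group.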
We compared genes underlying monogenic syndromes linked by path distances of 1, 2, or 3 with those with a path distance of 4 or an unknown distance; unknown distances are those where no genes are associated with a syndrome, the syndrome is multigenic, or DAPPLE has no documented interaction (see 'Materials and methods'). For each pair of syndromes, an average Euclidean distance in Clinical Face Phenotype Space was calculated. The distance in Clinical Face Phenotype Space is significantly shorter between syndromes associated with genes with protein interaction distances of 1, 2, or 3 compared with syndromes with 4 or no known interactions (p<0.01, p<0.05 and p<0.001 respectively, Figure 5). This demonstrates that the distance in Clinical Face Phenotype Space partly recapitulates the functional relatedness of underlying developmental processes known to be disrupted in genetic diseases.
Figure 5. Clinical Face Phenotype Space recapitulates features of functional gene links between syndromes. Protein–protein interaction distances of 1–3 for genetically characterized syndromes are associated with significantly shorter Euclidean distance (arbitrary units) between syndromes in Clinical Face Phenotype Space as compared to syndromes with distance 4 or no known interaction distance (shown in orange) (Kruskal–Wallis tests with Bonferroni corrected p-values indicated as *p<0.05, **p<0.01, ***p<0.001). The Spearman correlation across all distances was r = 0.09, p<0.001. The numbers of pairwise syndrome comparisons underlying each of the interaction distances are listed within the respective boxes.
https://doi.org/10.7554/eLife.02020.016
Querying Clinical Face Phenotype Space
Clinical Face Phenotype Space can provide clinical phenotyping and clustering to known genetic disorders that is objective and high-throughput.
The method is, however, neither sufficiently accurate nor intended to determine diagnosis, yet it can help to narrow the diagnostic search space in an unprejudiced manner. A clinician could easily photograph a patient and immediately obtain clinically useful diagnostic hypotheses and matching cases. To this end, we implemented two primary methods to automatically and objectively query Clinical Face Phenotype Space. For any given image located in Clinical Face Phenotype Space, we obtain confidence-ranked classifications to known disorders (see 'Materials and methods' and Figure 4—figure supplement 4). In addition, we objectively compare the image to others within the space. For any given query image, a probabilistic ranking of similar syndromes is obtained through nearest neighbor representation compared to random expectation of clustering among the 90 syndromes and 2754 faces. The classification confidence for a particular disorder depends on its location within the space, but also on the local densities of similar faces. We find that for the eight initial syndromes used to construct Clinical Face Phenotype Space, 93.1% (range 81.0–99.2%) are correctly classified as the top rank, cumulatively converging on 99.1% (95.8–100%) by the 20th rank (Figure 4B). For syndromes that were not part of the Clinical Face Phenotype Space training, classification accuracies correlated strongly and positively with the number of instances in the database (Figure 4B). For the 20 syndromes where the database held 5 or fewer examples (Table 1), we classify on average 20.3% correctly by the 6th rank (a 16.3-fold improvement over chance alone). For individuals with a suspected ultra-rare or an undocumented novel disorder, we developed a metric, p0p1, which assesses their similarity to others within Clinical Face Phenotype Space.
The metric estimates the relative closeness of two faces given an average local density within the space: a p0p1 value exceeding 1 indicates a potentially new cluster (see 'Materials and methods'). The 2 PACS1 cases reported by Schuurs-Hoeijmakers et al. (2012) placed within Clinical Face Phenotype Space have a p0p1 value of 1.05, meaning that they are 5% closer to one another than the geometric mean of the distances to their 20 nearest neighbors. Taking into account that this is a local density estimate among 2754 faces in Clinical Face Phenotype Space, the search space to find them has been reduced ∼690.4-fold (CIF, see 'Materials and methods'). The combination of syndrome clustering and de novo similarity metrics should aid the diagnosis of known syndromes and provide a means of clustering patients where no documented diagnosis exists.

Discussion

We have developed our algorithm on ordinary, everyday 2D photographs and have focused on 36 facial feature points. Given the orders of magnitude lower dimensionality of our data as compared to 3D imaging capture (Hammond et al., 2005), we were initially concerned that this would be insufficient to capture facial phenotypes. However, we then demonstrated that the approach is able to describe and discriminate between syndromes with an accuracy comparable to previous studies (Loos et al., 2003; Hammond et al., 2005; Hammond, 2007; Boehringer et al., 2006; Dalal and Phadke, 2007; Vollmar et al., 2008; Boehringer et al., 2011). The accessibility of normal 2D photographs (as opposed to 3D imaging) should outweigh the lower data resolution obtained from any one image. In future developments, using multiple profile perspectives will allow 3D structure to be inferred. With accurate registration of a person's face from multiple images across time, from a family photo album for instance, it should be possible to capture not only the 3D structure but also the progression and development of dysmorphic gestalt.
The automatic image analysis algorithm enables phenotypic metrics to be obtained with objective consistency from each image (Figure 1). Clinical Face Phenotype Space was instantiated using eight syndromes that were well populated in our database so as to be robust against spurious variation. In doing so, it has become a generalizable model for craniofacial dysmorphic variation (Figure 5). The high fidelity of the current Clinical Face Phenotype Space (Figure 3) shows promise given that known deficiencies have yet to be addressed: (1) We used only single image examples of individuals. (2) The spectrum of phenotypes represented was limited. (3) The average image quality in the database was low. (4) The current 36 facial feature points only capture full frontal facial phenotypes, and thus miss valuable information from the full cranium and profile perspectives. Among the approaches that will be tested in future work are: increasing the number of feature points across the cranium, using profile images, and taking advantage of multiple images of the same individual. Furthermore, we will explore explicit modeling of the 3D variation for 2D images (Ramnath et al., 2008), other types of feature descriptors, and alternative metric learning and dimensionality reduction approaches (Simonyan et al., 2013). As Clinical Face Phenotype Space is developed and po
Background

Array comparative genomic hybridization (aCGH) to detect copy number variants (CNVs) in mammalian genomes has led to a growing awareness of the potential importance of this category of sequence variation as a cause of phenotypic variation. Yet there are large discrepancies between studies, so that the extent of the genome affected by CNVs is unknown. We combined molecular and aCGH analyses of CNVs in inbred mouse strains to investigate this question.

Principal Findings

Using a 2.1 million probe array we identified 1,477 deletions and 499 gains in 7 inbred mouse strains. Molecular characterization indicated that approximately one third of the CNVs detected by the array were false positives and we estimate the false negative rate to be more than 50%. We show that low concordance between studies is largely due to the molecular nature of CNVs, many of which consist of a series of smaller deletions and gains interspersed by regions where the DNA copy number is normal.

Conclusions

Our results indicate that CNVs detected by arrays may be the coincidental co-localization of smaller CNVs, whose presence is more likely to perturb an aCGH hybridization profile than the effect of an isolated, small, copy number alteration. Our findings help explain the hitherto unexplored discrepancies between array-based studies of copy number variation in the mouse genome.