Abstract Inspecting concordance between self-reported sex and genotype-inferred sex is an important quality control measure in clinical genetic testing. Numerous tools have been developed to infer sex from genotyping array, whole-exome sequencing (WES), and whole-genome sequencing (WGS) data. However, sex inference from targeted gene sequencing panels still needs improvement. Here, we propose a new tool, seGMM, which applies unsupervised clustering (Gaussian mixture model) to determine the sex of a sample from called genotype data integrated with aligned reads. seGMM consistently demonstrated > 99% sex inference accuracy on publicly available (1000 Genomes) and in-house panel datasets, classifying sex more accurately than existing popular tools. Our results show that adding features from the Y chromosome (e.g. reads mapped to the Y chromosome) increases sex classification accuracy compared with using features from the X chromosome only. Notably, seGMM is also highly accurate for WES and WGS data. Finally, we demonstrated that seGMM can infer sex in single-patient or trio samples when combined with reference data, and can pinpoint samples with potential sex chromosome abnormalities. In general, seGMM provides a reproducible framework to infer sex from massively parallel sequencing data and has great promise in clinical genetics.
In clinical genetic testing, checking the concordance between self-reported gender and genotype-inferred gender is an important quality control measure, because a mismatch caused by sex chromosomal abnormalities or misregistered clinical information can significantly affect molecular diagnosis and treatment decisions. Targeted gene sequencing (TGS) is widely recommended as a first-tier diagnostic step in clinical genetic testing. However, existing gender-inference tools are optimized for whole genome sequencing (WGS) and whole exome sequencing (WES) data and are not sufficiently accurate for TGS data. In this study, we validated a new gender-inference tool, seGMM, which uses unsupervised clustering (Gaussian mixture model) to determine the gender of a sample. The seGMM tool can also identify sex chromosomal abnormalities in samples by using the aligned sequencing reads together with the genotype data. The seGMM tool consistently demonstrated >99% gender-inference accuracy in a publicly available 1,000-gene panel dataset from the 1000 Genomes Project, an in-house 785 hearing loss gene panel dataset of 16,387 samples, and a 187 autism risk gene panel dataset from the Autism Clinical and Genetic Resources in China (ACGC) database. The accuracy of seGMM on the TGS, WES, and WGS datasets was significantly higher than that of existing gender-inference tools such as PLINK, seXY, and XYalign. The results of seGMM were confirmed by short tandem repeat analysis of the sex chromosome marker gene, amelogenin. Furthermore, our data showed that seGMM accurately identified sex chromosomal abnormalities in the samples. In conclusion, the seGMM tool shows great potential in clinical genetics by determining the sex chromosomal karyotypes of samples from massively parallel sequencing data with high accuracy.
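The clustering idea described above can be illustrated with a minimal sketch. It is not the seGMM implementation: it assumes a single hypothetical feature (the fraction of reads mapped to the Y chromosome), invented toy values, and a hand-written two-component, one-dimensional Gaussian mixture fitted by expectation-maximization. The function names and the rule "the component with the higher mean is male" are assumptions made for illustration only.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_gmm(values, n_iter=100, min_var=1e-6):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = max(sum((v - overall_mean) ** 2 for v in values) / n, min_var)
    # Initialise the two components at the extremes of the data.
    means = [min(values), max(values)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pk / total for pk in p])
        # M-step: update weights, means, and variances from the posteriors.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids a collapsed component
    return weights, means, variances

def classify(values, weights, means, variances):
    """Label each sample by its more probable component.

    Assumption: the component with the higher mean Y-read fraction is male.
    """
    male_k = 0 if means[0] > means[1] else 1
    labels = []
    for v in values:
        p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
        labels.append("male" if p[male_k] >= p[1 - male_k] else "female")
    return labels

# Hypothetical Y-read fractions: five low values, then five high values.
y_fraction = [0.002, 0.004, 0.001, 0.003, 0.005, 0.10, 0.09, 0.11, 0.12, 0.08]
weights, means, variances = fit_two_component_gmm(y_fraction)
labels = classify(y_fraction, weights, means, variances)
print(labels)
```

Because the mixture is fitted without labels, no fixed threshold on the feature has to be chosen in advance. A real tool would combine several X- and Y-chromosome features and would need extra handling for samples that fit neither cluster well, such as those with sex chromosome abnormalities.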
Abstract Purpose The transcription factor TBX2 plays a critical role in inner hair cell development in mice. Yet, the link between TBX2 malfunction and human hearing-related disorders remains unexplored. Methods Linkage analysis combined with whole genome sequencing was applied to identify the causative gene in two autosomal dominant Chinese families characterized by late-onset progressive sensorineural hearing loss and incomplete penetrance of horizontal oscillatory nystagmus. Functional evaluation of TBX2 variants was performed through protein expression, localization, and transcriptional activity analysis in vitro, and through phenotypic analysis and mechanistic study in a knockout mouse model in vivo. Results Multipoint parametric linkage analysis of Family 1 revealed a maximum LOD score of 3.01 on chromosome 17q23.2. Whole genome sequencing identified distinct TBX2 variants, c.977delA (p.Asp326Alafs*42) and c.987delC (p.Ala330Argfs*38), one in each family, co-segregating with hearing loss. These variants resulted in premature termination and the generation of a new peptide segment, reducing transcriptional activity. Further, heterozygous Tbx2 knockout mice exhibited late-onset progressive hearing loss, along with ectopic expression of Prestin in IHCs and a gradual decrease in expression from P7 to P42. Conclusion Our findings indicate that heterozygous TBX2 frameshift variants are the genetic cause of late-onset progressive hearing loss and incomplete penetrance of nystagmus. The heterozygous Tbx2 knockout mouse model mirrored the human hearing loss phenotype, further validating TBX2’s role in auditory function. These insights enhance our understanding of TBX2 in the auditory system, providing valuable information for molecular diagnostics and genetic counseling in related hearing disorders.
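As a reading aid for the linkage result, the standard definition of the LOD score can be written out as follows. The symbols are generic textbook notation rather than the authors' own: L denotes the likelihood of the pedigree data and θ the recombination fraction between the disease locus and the marker. Under that definition, the reported maximum of 3.01 corresponds to odds of roughly 1000 to 1 in favor of linkage, just above the conventional significance threshold of 3.

```latex
% Standard LOD score definition (generic notation, not taken from the paper)
Z(\theta) = \log_{10} \frac{L(\theta)}{L(\theta = 1/2)}

% Worked value for the reported maximum LOD score
Z_{\max} = 3.01
\quad\Longrightarrow\quad
\frac{L(\hat{\theta})}{L(\theta = 1/2)} = 10^{3.01} \approx 1023
```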