Processed genes--genes that resemble processed RNA transcripts rather than interrupted genomic sequences--have been identified as dispersed members of several gene families. Here we describe a processed gene that is one of the three human IgE-like sequences present in the human genome. The processed IgE gene has precisely lost its three intervening sequences, thereby fusing its four coding domains. The homology of the gene to its functional counterpart ends in an adenine-rich tail followed by an 11-base-pair sequence that is directly repeated 150 base pairs 5' to its first coding domain. In addition, the processed gene is located on human chromosome 9 rather than on chromosome 14, the site of the active immunoglobulin locus. The structure and evident mobility of this sequence support the concept that sequences can move about in the genome via RNA intermediates and that processed genes are a prominent feature of genomic structure.
Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly and time-consuming, and it requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid unannotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we aim to address in this article. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach, since it allows us to compare annotation change both over time and between automated and manually curated annotations. Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free-text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Availability: Source code is available at the authors' website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation. Contact: phillip.lord@newcastle.ac.uk
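To make the idea of fitting a power law to word reuse concrete, the following is a minimal sketch rather than the authors' published method. It assumes a simple least-squares fit of log(frequency) against log(rank), which is only one of several ways to estimate a Zipf-style exponent. The function names `rank_frequency` and `zipf_slope` and the sample text are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a straight-line fit on log-log axes stands in for a
    proper power-law fit; a pure Zipf distribution gives a slope of -1.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical annotation text, not taken from UniProtKB.
sample = "binds ATP binds DNA binds ATP kinase activity kinase binds"
freqs = rank_frequency(sample)
alpha = -zipf_slope(freqs)  # positive exponent of the fitted power law
```

In a setting like the one described, such an exponent could be computed separately for each database release or for manual and automated annotation sets and then compared; how the exponent maps to quality is the subject of the article itself and is not asserted here.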
The chromosomal location of human constant region light chain immunoglobulin (Ig) genes has been determined by analyzing a group of human fibroblast/rodent somatic cell hybrids with nucleic acid probes prepared from cloned human kappa and lambda constant region genes. Human chromosomes in each cell line were identified by isoenzyme analysis. The DNA from hybrid cells was digested with restriction endonucleases, size fractionated by gel electrophoresis, transferred to nitrocellulose or DBM paper, and hybridized with (32)P-labeled nucleic acid probes. The C(kappa) gene was assigned to human chromosome 2 and the C(lambda) genes to chromosome 22, based upon analysis of these hybrid cell lines, and these assignments were confirmed by analysis of subclones. A group of previously unassigned loci can be mapped to chromosome 2 by virtue of their close linkage to C(kappa). The lambda and kappa light chain and heavy chain Ig genes have now been assigned to all three human chromosomes that are involved in translocations with chromosome 8 in human B cell neoplasms. These techniques and probes provide a means to study the detailed arrangement of human Ig genes and their pseudogenes.
Each file contains all SNPs in the individual that match an annotated SNP in SNPedia. SNPedia annotations contain a magnitude value (a subjective measure of the importance of the potential phenotypic effect) and a description of the phenotype or condition that a particular genotype affects.
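A record of this kind might be handled as in the sketch below. The file layout is not specified in the description, so the `AnnotatedSnp` fields, the `above_magnitude` helper, and the example identifiers and values are all hypothetical; only the presence of a magnitude and a phenotype description comes from the text.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedSnp:
    """Hypothetical record layout; field names are assumptions."""
    rsid: str
    genotype: str
    magnitude: float   # subjective importance score from the annotation
    description: str   # phenotype description from the annotation


def above_magnitude(snps, threshold):
    """Keep SNPs at or above the threshold, highest magnitude first."""
    kept = [s for s in snps if s.magnitude >= threshold]
    return sorted(kept, key=lambda s: s.magnitude, reverse=True)


# Made-up identifiers and values, for illustration only.
records = [
    AnnotatedSnp("rs0000001", "AA", 0.0, "common genotype"),
    AnnotatedSnp("rs0000002", "CT", 2.5, "example trait association"),
    AnnotatedSnp("rs0000003", "GG", 4.0, "example risk association"),
]
notable = above_magnitude(records, 2.0)
```

Because the magnitude is described as subjective, any threshold used this way is a convenience for triage rather than a statistical cutoff.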
Hybridization kinetic analyses with synthetic DNA indicate that there are only two to three copies of the κ constant region gene per haploid genome. This result lends weight to the argument that the immunoglobulin light chain is encoded by more than one continuous gene sequence.
Introduction: A cholesteatoma is a mass of keratinising epithelium in the middle ear. It is a rare disorder associated with significant morbidity. Its OMIM entry (#604183) cites minimal evidence for Mendelian inheritance, but we have observed 31 multiply affected families in Norfolk, including individuals with bilateral disease, suggesting a genetic component to its aetiology. Methods: We conducted a systematic literature review (SR) to identify any published studies about the genetics of cholesteatoma and established a national biobank for subsequent whole exome sequencing (WES) studies of familial disease. We have also completed a pilot sequencing study to identify candidate variants that segregate with the disease phenotype (using NimbleGen exome capture and the Illumina HiSeq4000 platform). Results: In our SR, we identified 8 case series with multiply affected families and associations with congenital malformation syndromes. DNA and clinical data have been collected from 42 participants (from 9 multiply affected Norfolk families) to date. In 2018, participants will also be recruited from 10 additional UK centres. Our pilot WES study of 16 participants from 4 families identified 95,437 variants. Variant filtering, using pedigree analysis, has identified 430 candidate genes for further filtering using the Ensembl Variant Effect Predictor. Conclusion: We have completed our SR (see PROSPERO register CRD42015023579) and established the first biobank to explore the genetics of cholesteatoma. A WES strategy and bioinformatics pipeline have been developed in the pilot study, and preliminary filtering has identified candidate variants that could have an impact on TGF-β signalling and inflammatory processes.
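The abstract does not state the exact segregation criteria used in the pedigree-based filtering step. The sketch below shows one simple rule under stated assumptions: a variant is kept only if every affected family member carries it and no unaffected member does. The function name `segregating_variants`, the sample IDs, and the variant IDs are hypothetical.

```python
def segregating_variants(carriers, affected, unaffected):
    """Return variant IDs carried by all affected and no unaffected samples.

    carriers maps a variant ID to the set of sample IDs that carry it.
    Assumption: strict segregation with full penetrance; a real pipeline
    may relax this rule.
    """
    keep = []
    for variant, samples in carriers.items():
        in_all_affected = affected <= samples       # subset test
        in_any_unaffected = bool(unaffected & samples)
        if in_all_affected and not in_any_unaffected:
            keep.append(variant)
    return sorted(keep)


# Made-up family: three affected members (A1-A3), one unaffected (U1).
carriers = {
    "var1": {"A1", "A2", "A3"},
    "var2": {"A1", "U1"},
    "var3": {"A1", "A2", "A3", "U1"},
}
affected = {"A1", "A2", "A3"}
unaffected = {"U1"}
candidates = segregating_variants(carriers, affected, unaffected)
```

A strict rule like this is easy to reason about but would discard true causal variants under incomplete penetrance, which is one reason the criteria here are presented as an assumption only.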