DOCUMENTATION

This page gives a more detailed explanation of the data in DIDA. DIDA provides information in the form of four tables: GENES, VARIANTS, DIGENIC COMBINATIONS and DISEASES. These tables can be accessed and downloaded through the BROWSE page.

GENES TABLE

  • Uniprot ACC – The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The Uniprot ACC represents the unique protein accession number for the gene-of-interest provided by UniProt.
  • Pathway (Reactome) – Reactome is a free, open-source, curated and peer-reviewed pathway database. The column reports the unique pathway ID for the gene-of-interest provided by Reactome.
  • Pathway (KEGG) – Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. The column reports the unique pathway ID for the gene-of-interest provided by KEGG.
  • InterPro – InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites.  The column reports the unique protein family, domain, repeat or site ID for the gene-of-interest provided by InterPro.
  • Pfam – Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The column reports the unique protein family ID for the gene-of-interest provided by Pfam.
  • GO molecular function – GO terms for molecular function, as retrieved from UniProt.
  • Interactors (intAct) – IntAct provides a freely available, open source database for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are freely available. The IntAct ID has the following format: “name of interacting gene[pubmed id(s)]”.
  • Interactors (BioGrid) – BioGRID is an interaction repository with data compiled through comprehensive curation efforts. As with IntAct, the BioGrid ID has the following format: “name of interacting gene[pubmed id(s)]”.
  • Interactors (ConsensusPath) – ConsensusPathDB integrates interaction networks in Homo sapiens including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways. Data currently originate from 32 public resources for interactions and curated interactions from literature. The ConsensusPath ID has the following format: “name of interacting gene[confidence index]”.
  • P(haploinsufficiency) – Estimated probability of haploinsufficiency of the gene. You can find more information in the original publication: Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and Predicting Haploinsufficiency in the Human Genome. PLoS Genet 6, e1001154 (2010). PMID:20976243
  • P(recessiveness) – Estimated probability that the gene is a recessive disease gene. You can find more information in the original publication: MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012). PMID:22344438
  • GDI – Gene damage index score, “a genome-wide, gene-level metric of the mutational damage that has accumulated in the general population” (from here). The higher the score, the less likely the gene is to be responsible for monogenic diseases.
  • Essential in Mouse – Known essential status of the gene. Two different values can be found: 1) essential (E) or 2) non-essential phenotype-changing (N) based on the Mouse Genome Informatics database. You can find more information in the original publication: Georgi, B., Voight, B. F. & Bućan, M. From Mouse to Human: Evolutionary Genomics Analysis of Human Orthologs of Essential Genes. PLoS Genet 9, (2013). PMID:23675308
  • Expression (GNF/Atlas) – The GNF/Atlas expression data provides information on gene expression patterns under different biological conditions such as a gene knock out, a plant treated with a compound, or in a particular organism part or cell. We annotated the tissues and/or organs in which the gene-of-interest is expressed.
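
The interactor columns above share a “name[payload]” layout, which can be split with a small parser. The following is a minimal sketch, not DIDA's own code: the function names are hypothetical, the gene symbols and PubMed IDs in the usage note are placeholders, and the separator between several PubMed IDs (comma, semicolon or whitespace) is an assumption, since the format description does not state it.

```python
import re

# One entry looks like GENE[payload]; the payload is whatever sits inside the brackets.
ENTRY_RE = re.compile(r"^\s*(?P<gene>[^\[\]]+?)\s*\[(?P<payload>[^\[\]]*)\]\s*$")


def parse_interactor(entry):
    """Split 'GENE[payload]' into (gene, payload); raise ValueError otherwise."""
    match = ENTRY_RE.match(entry)
    if match is None:
        raise ValueError(f"not in 'gene[...]' format: {entry!r}")
    return match.group("gene"), match.group("payload")


def parse_pubmed_interactor(entry):
    """IntAct/BioGrid style entry: gene name plus a list of PubMed IDs."""
    gene, payload = parse_interactor(entry)
    # Assumption: several IDs are separated by commas, semicolons or whitespace.
    pubmed_ids = [part for part in re.split(r"[;,\s]+", payload) if part]
    return gene, pubmed_ids


def parse_confidence_interactor(entry):
    """ConsensusPath style entry: gene name plus a numeric confidence index."""
    gene, payload = parse_interactor(entry)
    return gene, float(payload)
```

For instance, parse_pubmed_interactor("GENE1[11111111;22222222]") yields the gene name and a list of two ID strings, and parse_confidence_interactor("GENE2[0.98]") yields the gene name and a float.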


VARIANTS TABLE

  • Variant ID – The variant identifier in DIDA.
  • Genomic position – This value represents the position of the variant-of-interest in the human genome version hg19.
  • Ref allele – This represents the nucleotide(s) that is (are) present in the human reference genome at the position of the variant-of-interest.
  • Alt allele – This represents the new nucleotide(s) that is (are) present at the position of the variant-of-interest.
  • Transcript – This represents the unique NCBI transcript ID for the gene-of-interest. The cDNA change and protein change values are based on the transcript written in this column.
  • DNA strand – This represents the DNA strand on which the gene-of-interest is located in the human genome. Two different values are possible: “+” for the positive (forward) strand and “-” for the negative (reverse) strand.
  • cDNA change – This represents the change at the c(oding)DNA level for the variant-of-interest. Example: c.1022C>A represents a change in the cDNA at position 1022, where the reference nucleotide is a ‘C’ that changes to an ‘A’.
  • Protein change – This represents the change at the amino acid level for the variant-of-interest. Example: p.(A341E) represents a change in the protein sequence at position 341 where the reference amino acid ‘A (Alanine)’ changes to ‘E (Glutamic acid)’.
  • Variant effect – This represents the effect the variant-of-interest has on the amino acid level. Eight possible values can be found: intronic (located in an intronic sequence so no immediate effect on the amino acid sequence), silent (variant causes no amino acid change), missense (variant causes one amino acid change), splicing (variant changes an essential splice-site position (-2, -1, +1 or +2)), insertion (nucleotide(s) insertion), deletion (nucleotide(s) deletion), frameshift (variant causes a change in the reading frame), nonsense (variant causes the creation of a stop codon).
  • Polyphen2 prediction – PolyPhen-2 (Polymorphism Phenotyping v2) is a software tool that predicts the possible impact of amino acid substitutions on the structure and function of human proteins using straightforward physical and evolutionary comparative considerations. Three different values can be found: the missense variant is predicted as 1) probably damaging (D), 2) possibly damaging (P) or 3) benign (B). Predictions are based on the HumVar dataset. You can find more information in the original publication: Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat Methods 7, 248–249 (2010). PMID: 20354512
  • CADD prediction – CADD raw score for functional prediction of a SNP. The larger the score, the more likely the SNP has a damaging effect. Scores range from -7.535037 to 35.788538 in dbNSFP. Please note the following copyright statement for CADD: “CADD scores (http://cadd.gs.washington.edu/) are Copyright 2013 University of Washington and Hudson-Alpha Institute for Biotechnology (all rights reserved) but are freely available for all academic, non-commercial applications. For commercial licensing information contact Jennifer McCullar (mccullaj@uw.edu).” Please refer to Kircher et al., Nature Genetics (2014) for more details.
  • DEOGEN prediction – DEOGEN is a novel variant effect predictor for missense SNVs. It integrates information from different biological scales, mimicking the complex mixture of effects that lead from the variant to the phenotype. You can find more information in Raimondi et al., Bioinformatics (2016).
  • SIFT prediction – SIFT predicts whether an amino acid substitution affects protein function. A SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, collected through PSI-BLAST. SIFT can be applied to naturally occurring nonsynonymous polymorphisms or laboratory-induced missense mutations. Two different values can be found: the missense variant is predicted as 1) deleterious (D) or 2) tolerated (T). You can find more information in the original publication: Kumar P et al. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 4(7), 1073-1081 (2009). PMID: 19561590
  • rs ID – The dbSNP database serves as a central repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms. It is hosted by the National Center for Biotechnology Information (NCBI). The column reports the unique variant ID for dbSNP version 141.
  • ExAC allele frequency – Allele frequency in total ExAC samples (60,706 samples). The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies. You can find more information here.
  • 1000Gp3 Allele frequency – The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. The Phase 3 part of the project contains whole genome sequence data for 2,504 individuals from 26 populations. The column reports the variant allele frequency for all individuals sequenced in phase 3 of the project.
  • ESP6500 Allele frequency (AA) – The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community. The current ESP6500 release contains whole exome sequence data for approximately 6500 individuals. The column reports the variant allele frequency for all sequenced African American (AA) individuals.
  • ESP6500 Allele frequency (EA) – The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community. The current ESP6500 release contains whole exome sequence data for approximately 6500 individuals. The column reports the variant allele frequency for all sequenced European American (EA) individuals.
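
The cDNA change and protein change notations described above can be read programmatically. The sketch below is illustrative only and covers just the simple substitution forms used in the examples (c.1022C>A and p.(A341E)); insertions, deletions, intronic offsets and other notations are deliberately left unparsed. The function names are hypothetical, and the codon calculation assumes the usual convention that coding position 1 is the first base of the start codon.

```python
import re

# Simple substitutions only, e.g. "c.1022C>A" and "p.(A341E)" or "p.A341E".
CDNA_SUB_RE = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
PROTEIN_SUB_RE = re.compile(r"^p\.\(?([A-Z])(\d+)([A-Z])\)?$")


def parse_cdna_substitution(change):
    """Return (position, ref, alt) for a simple cDNA substitution, else None."""
    match = CDNA_SUB_RE.match(change)
    if match is None:
        return None
    position, ref, alt = match.groups()
    return int(position), ref, alt


def parse_protein_substitution(change):
    """Return (position, ref, alt) for a one-letter amino acid change, else None."""
    match = PROTEIN_SUB_RE.match(change)
    if match is None:
        return None
    ref, position, alt = match.groups()
    return int(position), ref, alt


def codon_number(cdna_position):
    """Codon (amino acid position) that contains a 1-based coding position."""
    return (cdna_position - 1) // 3 + 1
```

Applied to the two examples, coding position 1022 falls in codon 341, which agrees with the amino acid position in p.(A341E). A check like this can flag rows where the cDNA change and the protein change were annotated against different transcripts.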

 

DIGENIC COMBINATIONS TABLE

  • Digenic combination ID – The digenic combination identifier in DIDA.
  • Gene A – This represents the name of gene A following HGNC nomenclature. DIDA has no order or rule that defines a primary or a secondary gene, so the labels ‘A’ and ‘B’ were assigned arbitrarily.
  • Allele 1 Gene A cDNA change – This represents the cDNA position and change for the first allele in gene A of the digenic combination. The term ‘wild type’ represents the reference allele.
  • Allele 2 Gene A cDNA change – This represents the cDNA position and change for the second allele in gene A of the digenic combination.  The term ‘wild type’ represents the reference allele.
  • Zygosity Gene A – This represents the zygosity status of the variant in gene A of the digenic combination. Four possible values can be found: 1) ‘heterozygote’ when one variant and one reference allele are present, 2) ‘homozygote’ when two variant alleles are present, 3) ‘compound heterozygote’ when two different variant alleles are present or 4) ‘hemizygote’ when the variant is located in a gene on the X chromosome and the ‘digenic’ patient is a male. Males only have one copy of the X chromosome, so hemizygous refers to the presence of one variant allele.
  • Gene B – This represents the name of gene B following HGNC nomenclature. DIDA has no order or rule that defines a primary or a secondary gene, so the labels ‘A’ and ‘B’ were assigned arbitrarily.
  • Allele 1 Gene B cDNA change – This represents the cDNA position and change for the first allele in gene B of the digenic combination. The term ‘wild type’ represents the reference allele.
  • Allele 2 Gene B cDNA change – This represents the cDNA position and change for the second allele in gene B of the digenic combination. The term ‘wild type’ represents the reference allele.
  • Zygosity Gene B – This represents the zygosity status of the variant in gene B of the digenic combination. Four possible values can be found: 1) ‘heterozygote’ when one variant and one reference allele are present, 2) ‘homozygote’ when two variant alleles are present, 3) ‘compound heterozygote’ when two different variant alleles are present or 4) ‘hemizygote’ when the variant is located in a gene on the X chromosome and the ‘digenic’ patient is a male. Males only have one copy of the X chromosome, so hemizygous refers to the presence of one variant allele.
  • Disease name (ORPHANET) – Orphanet is the reference portal for information on rare diseases and orphan drugs, for all audiences. Orphanet’s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. This column represents the name of the disease as present in Orphanet.
  • Oligogenic effect – The majority of instances in DIDA are categorised into one of two simplified classes: either the digenic combination provides data on two variants in two genes that are both mandatory for the appearance of the disease, or a variant in one gene is enough to develop the disease but carrying a second one in another gene impacts the disease phenotype or affects the severity or age of onset. These two classes are a coarse-grained simplification of the original definition provided by Schaffer. The first class represents true digenic instances (labelled as “on/off” in the previous version of DIDA): mutations at both loci are required for disease, and mutations at only one of the two loci result in no phenotype. The second class is referred to as the composite class because it includes different possibilities (labelled as “severity” in the previous version of DIDA). A composite instance in DIDA could refer to a dual molecular diagnosis, wherein mutations at each locus may segregate independently and result in expression of part/all of the phenotype, or to an oligogenic mutational burden, when a driver mutation is necessary for the phenotype but rare variants in other genes, usually related to the same pathway/organ system, may modify the phenotype. In DIDA the true digenic class is annotated as TD and the composite class as CO. Further fine-tuning of these classes will become possible when more data on digenic diseases become available. For now we limit ourselves to this two-class scheme and to exploring why a given digenic combination belongs to the TD or CO class. You can find more information in Gazzo et al., Nucleic Acids Res (2017).
  • Familial evidence – When reading the original publication, in which the digenic combination was reported, we checked if the digenicity was supported by a familial study. In this study family members are genetically tested to determine their variant carrier status. Two values are possible: 1) YES when a family study provided evidence for digenicity or 2) NO when there was no family study conducted or the study was inconclusive.
  • Functional evidence – When reading the original publication, in which the digenic combination was reported, we checked if the digenicity was supported by a functional study. In this study the combined functional effect of the two variants was tested. Two values are possible: 1) YES when a functional study provided evidence for digenicity or 2) NO when there was no functional study conducted or the study was inconclusive.
  • Allelic state – This represents the allelic state of the digenic combination. There are three possible values: 1) ‘di-allelic’ when two variant alleles are present, 2) ‘tri-allelic’ when three variant alleles are present or 3) ‘tetra-allelic’ when four variant alleles are present.
  • Gene relationship – As already described in literature, digenic diseases are caused by mutations in two genes which often have a physical or functional relationship (1,2). For each digenic combination in DIDA we determined the relationship between the two genes carrying the mutations. There are 5 different types of relationship:
    1. Direct interaction: there is a direct protein-protein interaction between the protein products of the two genes. This information was retrieved from protein-protein-interaction databases (BioGrid, IntAct and ConsensusPathDB).
    2. Indirect interaction: there is an indirect or “two step” interaction between the protein products of the two genes. If protein “A” and protein “B” interact with protein “C”, protein A and protein B are indirectly interacting. In other words, they share a common interactor. This information was retrieved from protein-protein-interaction databases (BioGrid, IntAct and ConsensusPathDB).
    3. Pathway membership: the protein products from both genes belong to the same pathway. This information was retrieved from pathway databases (KEGG and REACTOME).
    4. Co-expression: the protein products from both genes are expressed in at least one common tissue or organ. This information was retrieved from GNF/Atlas.
    5. Similar function: the protein products from both genes contain the same functional conserved motifs or conserved domains. This information was retrieved from protein domain databases (InterPro and Pfam).
  • HPO – The Human Phenotype Ontology (HPO) aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. The column reports the unique ID(s) linked to each phenotypic abnormality present in the ‘digenic’ patient. The IDs were manually retrieved from the database through Phenomizer.
  • Biological distance – The Human Gene Connectome (HGC) is the set of all biologically plausible routes, distances, and degrees of separation between all pairs of human genes. A gene-specific connectome contains the set of all available human genes sorted on the basis of their predicted biological proximity to the specific gene of interest. You can find more information in Itan et al., PNAS (2013).
  • Reference – When the digenic combination is retrieved from a publication in a peer-reviewed journal, the PubMed ID will be visible. PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. When the digenic combination stems from unpublished data the term ‘unpublished’ will be visible in this column.
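
Two of the columns above follow rules that are simple enough to sketch in code: the allelic state can be derived by counting variant alleles from the two zygosity values, and the direct/indirect interaction types follow from comparing interactor sets. This is an illustrative sketch under stated assumptions, not DIDA's actual pipeline: the function names and the dictionary-of-sets input are hypothetical, and the allele counts per zygosity value are read off the definitions given in this table.

```python
# Variant alleles contributed by each zygosity value, per the definitions above.
VARIANT_ALLELES = {
    "heterozygote": 1,           # one variant and one reference allele
    "homozygote": 2,             # two variant alleles
    "compound heterozygote": 2,  # two different variant alleles
    "hemizygote": 1,             # single X-linked variant allele in a male
}

STATE_NAMES = {2: "di-allelic", 3: "tri-allelic", 4: "tetra-allelic"}


def allelic_state(zygosity_a, zygosity_b):
    """Name the allelic state from the zygosity of gene A and gene B."""
    # An unknown zygosity label raises KeyError rather than being guessed.
    total = VARIANT_ALLELES[zygosity_a] + VARIANT_ALLELES[zygosity_b]
    return STATE_NAMES[total]


def interaction_type(gene_a, gene_b, interactors):
    """Classify a gene pair as 'direct', 'indirect' or None.

    `interactors` maps a gene name to the set of genes it interacts with
    (hypothetical input, e.g. built from the Interactors columns).
    """
    partners_a = interactors.get(gene_a, set())
    partners_b = interactors.get(gene_b, set())
    if gene_b in partners_a or gene_a in partners_b:
        return "direct"
    if partners_a & partners_b:  # at least one shared interactor "C"
        return "indirect"
    return None
```

For example, a homozygous variant in gene A combined with a heterozygous variant in gene B gives three variant alleles, so allelic_state("homozygote", "heterozygote") returns "tri-allelic".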

 

DISEASES TABLE

  • ORPHANET ID – Orphanet is the reference portal for information on rare diseases and orphan drugs, for all audiences. Orphanet’s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. The column reports the identifier for the disease as retrieved from Orphanet.
  • Disease name (ORPHANET) – Orphanet is the reference portal for information on rare diseases and orphan drugs, for all audiences. Orphanet’s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. This column represents the name of the disease as present in Orphanet.
  • Disease category (ICD10) – The International Classification of Diseases (ICD) is the standard diagnostic tool for epidemiology, health management and clinical purposes. ICD is used to classify diseases and other health problems recorded on many types of health and vital records, including death certificates and health records. The column reports the ICD10 chapter disease category as obtained from the ICD10 online version: 2015.
  • Disease ID (OMIM) – The Online Mendelian Inheritance in Man (OMIM) is an online catalog of human genes and genetic disorders. The column reports the OMIM disease identifiers linked to the disease-of-interest as present in Orphanet.
  • NUMBER OF DIGENIC COMBINATIONS – The total number of digenic combinations for the disease-of-interest.
  • NUMBER OF GENES – The total number of genes for the disease-of-interest.
  • NUMBER OF VARIANTS – The total number of variants for the disease-of-interest.

 

GENERAL RESOURCE

  • dbNSFP: A large part of the information described above was retrieved from dbNSFP v3.4. This is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. It contains information both at the gene and the variant level. The following columns of data were retrieved from dbNSFP:
    • at the gene level:
      • Expression (GNF/Atlas)
      • Pathway (KEGG)
      • Interactors (BioGrid, intAct and ConsensusPath)
      • P(haploinsufficiency)
      • P(recessiveness)
      • GDI (gene damage index)
      • Essential in Mouse
    • at the variant level:
      • Predictions of variant pathogenicity (SIFT, Polyphen2, CADD)
      • Variant allele frequencies (1000G, ESP6500, ExAC)