Six lessons for variant interpretation
Today sees the publication of seven papers describing the creation and key analyses of the gnomAD resource [Karczewski 2020, Minikel 2020, Cummings 2020, Whiffin 2020, Whiffin, Kleinman & Armean 2020, Collins & Brand 2020, Wang 2020]. I was lucky enough to become part of this huge team effort when I visited Daniel MacArthur’s group at the Broad Institute for three months back in spring 2018. It has been really exciting to see all this work come together, and even lead on a couple of the analyses [Whiffin 2020, Whiffin, Kleinman & Armean 2020].
Clinical labs across the globe use gnomAD daily to annotate every single candidate variant: it is a crucial resource for identifying which variants are too common to be causing a patient’s specific disease. But as an enormous population resource, gnomAD can also inform much more of our variant interpretation workflows.
In this post, I wanted to take a moment to highlight the important lessons within these seven papers that relate to interpreting variants for their role in disease.
Lesson 1 – careful annotation and curation of loss-of-function variants is essential
Three of the papers focus on analysis of putative loss-of-function (pLoF) variants, namely nonsense, essential splice site and frameshift variants [Karczewski 2020, Minikel 2020, Whiffin, Kleinman & Armean 2020] (note that here we use pLoF to indicate identified variants that have not been functionally proven to cause loss of protein function). It has long been known that, given the rarity of pLoF variants, especially in haploinsufficient genes, those that we do observe are enriched for errors [MacArthur 2012].
The majority of these errors are due to incorrect annotations, i.e. a variant is called “LoF” when it doesn’t actually result in a loss of protein function. A common example is nonsense variants in the final exon of a gene, which are not recognised by the nonsense mediated decay (NMD) pathway, so the truncated protein is still created and is somewhat functional. Whilst in some instances a final exon truncation may remove a crucial protein domain and in effect cause a LoF, the impact of the vast majority of these end truncations is not as deleterious as complete LoF, where the protein isn’t made at all. In fact, when we look across all of the terminal truncation variants that occur in gnomAD, a much lower proportion of them are very rare when compared to high-confidence pLoF variants – an indication that they are under lower selective constraint (see Figure 2 of the flagship paper).
Common types of LoF annotation errors can easily be identified by the Loss-Of-Function Transcript Effect Estimator, or LOFTEE, created by Konrad Karczewski. LOFTEE annotates each LoF variant and outputs a ‘confidence’ in it being true LoF (https://github.com/konradjk/loftee).
As an example – when we applied LOFTEE to the 123 pLoF variants in the LRRK2 gene in gnomAD, it highlighted six that are unlikely to cause true LoF. Whilst this removes only ~5% of the identified pLoF variants, these variants are by far the most common in gnomAD – comprising 45% of all LRRK2 pLoF alleles [Whiffin, Kleinman & Armean 2020].
When we carefully inspected the remaining LRRK2 pLoF variants, we uncovered six more that look unlikely to cause true LoF, either because they are low-quality, erroneous calls or because the splicing defect is likely rescued. After both running LOFTEE and performing a manual curation, only 40% of the initial pLoF alleles in gnomAD are predicted to be real.
What this LRRK2 example demonstrates is how important it is to not solely trust the output of an effect annotator such as VEP. Careful curation of any candidate pLoF variant to ensure it is in fact likely to cause LoF is essential.
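To make the point about variants versus alleles concrete, here is a minimal Python sketch. The "LoF" key mirrors the HC/LC (high/low confidence) label that LOFTEE attaches to each variant, but the records, identifiers and allele counts are entirely made up for illustration and are not the LRRK2 data. It shows how removing a small share of variants can remove a large share of alleles when the flagged variants are the common ones.

```python
# Hypothetical pLoF records for one gene. "LoF" imitates LOFTEE's HC/LC
# confidence label; ids and allele counts are invented for this example.
variants = [
    {"id": "var1", "LoF": "LC", "allele_count": 90},
    {"id": "var2", "LoF": "HC", "allele_count": 3},
    {"id": "var3", "LoF": "HC", "allele_count": 1},
    {"id": "var4", "LoF": "HC", "allele_count": 6},
]


def split_by_confidence(records):
    """Separate high-confidence (HC) records from everything else."""
    high = [r for r in records if r["LoF"] == "HC"]
    low = [r for r in records if r["LoF"] != "HC"]
    return high, low


def fraction_removed(records, removed):
    """Return the share of variants and the share of alleles that were removed."""
    variant_fraction = len(removed) / len(records)
    total_alleles = sum(r["allele_count"] for r in records)
    removed_alleles = sum(r["allele_count"] for r in removed)
    return variant_fraction, removed_alleles / total_alleles


high_conf, low_conf = split_by_confidence(variants)
variant_frac, allele_frac = fraction_removed(variants, low_conf)
print(f"variants removed: {variant_frac:.0%}, alleles removed: {allele_frac:.0%}")
```

An automated filter like this is only the first pass; the manual curation step described above still has to happen on whatever survives it.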
Lesson 2 – not all exons are expressed equally
For some genes, we observe variants in gnomAD despite the fact we would not expect them to be there. One example is pLoF variants in TCF4, which are a known cause of the severe developmental disorder Pitt-Hopkins syndrome. 56 individuals in gnomAD have one of 20 TCF4 pLoF variants that look to be high confidence. On closer inspection using the GTEx database, it becomes clear that every single one of these variants is found in an exon with no evidence of being expressed across a wide set of tissues [Cummings 2020].
Beryl Cummings has created the super useful pext metric – a score that measures the proportion of expressed transcripts that would be affected by a variant of interest [Cummings 2020]. This metric is not only useful for pLoF variants, as in the TCF4 example, but can also be applied to any variant consequence. There is a track in the gnomAD browser where you can easily visualise the mean pext score across transcripts for a gene or click to look at this on a tissue-specific level.
For genes where transcript expression differs between exons, annotating candidate disease-causing variants with pext can flag any in lowly expressed regions that may be less likely to be pathogenic.
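The idea behind pext can be sketched in a few lines of Python. This is a simplified illustration, not the published implementation: the transcript names, tissues and expression values are hypothetical, and the real metric is computed from GTEx data with details I am not reproducing here. For each tissue, the score is the expression of the transcripts a variant affects divided by the expression of all transcripts of the gene, and the mean is then taken across tissues.

```python
from statistics import mean


def pext_in_tissue(transcript_tpm, affected):
    """Proportion of a gene's expression carried by the affected transcripts."""
    total = sum(transcript_tpm.values())
    if total == 0:
        return None  # gene not expressed in this tissue, score undefined
    hit = sum(tpm for tx, tpm in transcript_tpm.items() if tx in affected)
    return hit / total


def mean_pext(expression_by_tissue, affected):
    """Average the per-tissue scores, skipping tissues without expression."""
    scores = [pext_in_tissue(tpm, affected) for tpm in expression_by_tissue.values()]
    scores = [s for s in scores if s is not None]
    return mean(scores) if scores else None


# Hypothetical expression values (TPM) for two transcripts in two tissues.
expression_by_tissue = {
    "brain": {"tx_main": 45.0, "tx_alt": 5.0},
    "heart": {"tx_main": 20.0, "tx_alt": 0.0},
}

# A variant in an exon used only by the weakly expressed transcript...
score_rare_exon = mean_pext(expression_by_tissue, {"tx_alt"})
# ...versus a variant in an exon shared by both transcripts.
score_shared_exon = mean_pext(expression_by_tissue, {"tx_main", "tx_alt"})
print(score_rare_exon, score_shared_exon)
```

A pLoF variant with a low score, like the first one here, sits in sequence that is rarely part of an expressed transcript, which is the pattern seen for the TCF4 variants.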
Lesson 3 – not all LoF disease genes are severely depleted of pLoF variants
Back in 2016, in the ExAC paper [Lek 2016], we were introduced to Kaitlin Samocha’s fantastic pLI metric – a way of identifying genes that are likely intolerant to LoF, by comparing the number of pLoF variants that are observed to what we would expect under a null mutational model. The flagship gnomAD paper builds on this idea, introducing the new LOEUF (Loss-Of-Function Observed/Expected Upper bound Fraction) score [Karczewski 2020]. Whilst pLI was purposefully dichotomous, classifying genes as intolerant or not, the larger size of gnomAD allows LOEUF to rank all genes along a continuous spectrum of tolerance to LoF.
As expected, the vast majority of haploinsufficient genes appear at the very bottom of this ranking as they are severely depleted of pLoF variants in gnomAD (see Figure 3a in the flagship paper). This is not the case for all of them, however. It is also interesting to note the position of recessive genes, which span the centre of the LOEUF distribution.
When trying to assess if a pLoF variant in a novel gene is associated with disease we might be tempted to only consider genes that are in the most constrained set. What these data tell us is that for recessive genes, or later onset phenotypes, this might be too restrictive. This makes perfect sense - we expect the strength of selection against heterozygous pLoF variants in these genes to be much lower than for those that cause severe developmental phenotypes.
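For readers who like to see the numbers, here is a rough Python sketch of an "observed/expected upper bound" in the spirit of LOEUF. It assumes the observed pLoF count is Poisson distributed around ratio × expected, scans candidate ratios on a grid, and returns the ratio at which the cumulative normalised likelihood reaches 95% (the upper end of a 90% interval). The grid, the cut-off and the example counts are my assumptions for illustration; the flagship paper should be consulted for the exact procedure.

```python
import math


def poisson_log_pmf(k, mu):
    """Log-probability of observing k events from a Poisson with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def oe_upper_bound(observed, expected, upper=0.95, step=0.001, max_ratio=2.0):
    """Upper bound on observed/expected from a grid of candidate ratios."""
    n_steps = int(round(max_ratio / step))
    ratios = [(i + 1) * step for i in range(n_steps)]  # start above zero
    log_weights = [poisson_log_pmf(observed, r * expected) for r in ratios]
    peak = max(log_weights)
    weights = [math.exp(lw - peak) for lw in log_weights]  # avoid underflow
    total = sum(weights)
    cumulative = 0.0
    for ratio, weight in zip(ratios, weights):
        cumulative += weight / total
        if cumulative >= upper:
            return ratio
    return max_ratio


# Hypothetical genes: (observed pLoF variants, expected pLoF variants).
depleted_gene = oe_upper_bound(2, 50)
tolerant_gene = oe_upper_bound(40, 50)
small_gene = oe_upper_bound(1, 5)     # same observed/expected ratio as below...
large_gene = oe_upper_bound(10, 50)   # ...but far more data behind it
print(depleted_gene, tolerant_gene, small_gene, large_gene)
```

Using the upper bound rather than the raw ratio is what makes the score conservative: a small gene with few expected variants cannot look strongly constrained by chance alone, whereas a large gene with the same ratio can.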
The history of these constraint metrics and the differences between pLI and LOEUF are beautifully explained in this blog post by Eric Minikel.
Lesson 4 – an adjacent variant can change the predicted effect
Multi-nucleotide variants (MNVs) are variants that are very close together on the same haplotype. When these variants occur within the same codon, they can result in a consequence that differs from that of either individual variant. It is therefore important to consider the effect of these variants in tandem, rather than individually. This is not a novel concept, but it is one that is often ignored in variant calling and interpretation pipelines. Indeed, a previous paper investigated the importance of MNVs in developmental disorders [Kaplanis 2019].
More than 30,000 MNVs are identified in gnomAD, including 405 that result in a new nonsense variant. Conversely, a variant that initially appears to be nonsense may in fact be rescued by an adjacent variant, again highlighting the importance of curating pLoF variants discussed above [Wang 2020]. MNVs in gnomAD are now flagged in the browser with MNV-specific landing pages with more information, but it is important to also look out for them in our own data.
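Both situations are easy to demonstrate with the standard genetic code. The Python sketch below uses made-up codons (not variants from gnomAD) and a hypothetical helper, apply_snvs, to compare the consequence of each single-nucleotide change on its own with the consequence of the two together on the same haplotype.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snvs(codon, snvs):
    """Apply (position, alternate base) changes to a codon; positions are 0-based."""
    bases = list(codon)
    for position, alt in snvs:
        bases[position] = alt
    return "".join(bases)


def consequence(codon, snvs):
    """Amino acid (or '*' for stop) encoded after applying the changes."""
    return CODON_TABLE[apply_snvs(codon, snvs)]


# Case 1: two harmless-looking changes that together create a stop codon.
ref = "TCT"                                  # serine
first, second = (1, "A"), (2, "A")
print(consequence(ref, [first]))             # TAT, tyrosine (missense)
print(consequence(ref, [second]))            # TCA, serine (synonymous)
print(consequence(ref, [first, second]))     # TAA, stop

# Case 2: an apparent nonsense variant rescued by its neighbour.
ref2 = "CAA"                                 # glutamine
stop_gain, neighbour = (0, "T"), (1, "C")
print(consequence(ref2, [stop_gain]))             # TAA, stop
print(consequence(ref2, [stop_gain, neighbour]))  # TCA, serine
```

An annotator that scores each change separately would miss the stop in the first case and wrongly report one in the second.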
Lesson 5 – we can now assess the frequency of structural variants
One of the new papers describes the herculean effort of Ryan Collins and Harrison Brand to create a resource of structural variants (SVs) identified in nearly 15,000 whole genomes included in gnomAD [Collins & Brand 2020]. These SVs are found to account for >25% of all rare protein-truncating events per genome, indicating how SVs are likely a major contributor to rare disease. These structural variants are now displayed on the gnomAD browser providing a vital resource to annotate SVs identified in clinical sequencing.
Lesson 6 – we should be looking outside of coding regions
Finally, the lesson in this list that is closest to my heart. I am incredibly proud of this paper [Whiffin 2020] especially given that it is already helping to find new diagnoses for patients who have long been in the dark about the cause of their disease.
In the majority of clinical settings, one of the first steps in an annotation pipeline is to restrict candidate variants to only those that directly disrupt the protein sequence. In fact, for the most part we only sequence these regions of the genome (and some amount of buffer). This strategy makes perfect sense for two main reasons: (1) we only have to analyse a small amount of sequence (~1.5%) to diagnose a lot of cases (~50%, although for some genes/diseases >90%); (2) we have a straightforward way of predicting the effect of any identified variant on the protein by using the triplet amino acid code.
But half of all rare disease patients are still without a genetic diagnosis.
Untranslated regions (UTRs) are control elements that regulate the stability of an mRNA molecule, and the rate at which it is translated into protein. Variants within these regions that result in mRNA degradation or abolish translation can be equivalent to a coding LoF variant and lead to disease. We analysed 5’UTR variants in gnomAD that create or disrupt upstream open reading frames (uORFs) and found a subset of these that are under strong negative selection [Whiffin 2020], i.e. they appear less commonly than we would expect in gnomAD. We discuss examples of these variants that lead to rare disease, including neurofibromatosis. We have created a tool to identify and annotate these variants: here.
Here is a video I made when we first released this work that explains more about it:
Whilst this work highlights 5’UTR variants that are good candidates for being disease causing, interpreting these variants is still not straightforward. UTRs are complex regulatory elements, often containing multiple uORFs, and elements encoding secondary structure. It will be important to functionally validate variants found in these regions to confirm their downstream effect.
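As a toy illustration of what "creating a uORF" means at the sequence level, the Python sketch below checks whether a single-base change introduces a new upstream ATG in a 5’UTR and reports its frame relative to the coding sequence. It is not the annotation tool mentioned above: the function names and the UTR sequence are hypothetical, the sequence is assumed to be given on the transcript strand and to end immediately before the main start codon, and a real annotation would also need to consider downstream stop codons and the surrounding sequence context.

```python
def atg_positions(seq):
    """0-based positions of every ATG in the sequence."""
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def new_upstream_atgs(utr, position, alt):
    """Positions of ATGs present after the change but absent before it."""
    mutated = utr[:position] + alt + utr[position + 1:]
    return sorted(atg_positions(mutated) - atg_positions(utr))


def frame_relative_to_cds(utr, atg_position):
    """0 if the upstream ATG is in frame with the main start codon, else 1 or 2."""
    return (len(utr) - atg_position) % 3


# Hypothetical 5'UTR and a C>T change at 0-based position 4 (CACG -> CATG).
utr = "GGCACGGCTAAGC"
created = new_upstream_atgs(utr, 4, "T")
frames = [frame_relative_to_cds(utr, p) for p in created]
print(created, frames)
```

Whether such a new start site matters depends on much more than its presence, which is why functional validation remains important.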
One of the very cool things about gnomAD is that it just keeps growing. Whilst these papers focus on v2.1.1, v3 of the dataset already exists. As I write this, a v4 dataset of exomes is already in the pipeline. The insights that these ever-growing datasets can give us into what variation naturally exists in the population are hugely powerful for identifying variants that are likely to cause disease. To me, the excitement comes in using all those genomes to learn more about the non-coding genome.