Executive Summary
Variant annotation is the enrichment step that converts a raw VCF into a shortlist of clinically relevant candidates.
- Start with rigorous population filters using resources like dbSNP and gnomAD to remove common variants before deeper analysis.
- Layer functional consequence predictions from engines such as SnpEff or VEP with meta-predictor scores like CADD and REVEL to highlight high-impact events.
- Cross-reference clinical databases including ClinVar, HGMD, and OMIM, then extend to structural variant annotation and ACMG classification to synthesize an actionable interpretation.
Your whole exome sequencing run is complete, leaving you with a VCF file containing 80,000 genetic variants. Buried in that file is the single data point that could explain a patient's rare disease, predict their response to a drug, or unlock a novel therapeutic target. Finding it is the critical challenge.
The journey from a raw VCF file to a short list of clinically relevant candidates is a systematic process of enrichment known as variant annotation. It is the crucial step that translates raw genomic data into actionable biological and clinical insights. For teams in translational and clinical research, mastering this process is not just a technical exercise—it is the key to accelerating discovery.
This guide provides a step-by-step walkthrough of a best-practice variant annotation workflow, explaining the tools and databases that transform a list of variants into a powerful, interpretable dataset.
Step 1: The Starting Point – The Variant Call Format (VCF) File
The VCF file is the standard output from a variant calling pipeline. Each row in the file represents a genetic variant, defined by its core components:
- CHROM: The chromosome on which the variant occurs.
- POS: The genomic position of the variant.
- REF: The reference allele (the base or bases found at that position in the reference genome).
- ALT: The alternate allele (the base or bases observed in your sample).
In its raw form, the VCF file is simply a list of differences. The annotation process enriches this file by adding layers of contextual information into its ID and INFO columns, allowing us to systematically filter and prioritize.
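To make the structure concrete, here is a minimal Python sketch that splits a single VCF data line into these core columns. The record shown is invented for illustration; real files are tab-separated and typically compressed.

```python
# A hypothetical, tab-separated VCF data record (all values invented).
record = "chr1\t861332\trs0000001\tG\tA\t48.5\tPASS\tAC=1;AN=2;DP=32"

fields = record.split("\t")
chrom = fields[0]                # CHROM
pos = int(fields[1])             # POS
vcf_id = fields[2]               # ID (holds the rsID after dbSNP annotation)
ref, alt = fields[3], fields[4]  # REF and ALT alleles

# The INFO column is a semicolon-delimited set of key=value annotations.
info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
print(chrom, pos, vcf_id, f"{ref}>{alt}", info)
```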
Step 2: Population Filtering – Is This Variant Rare?
For Mendelian (single-gene) disorders, a disease-causing variant is almost always rare in the general population. The first and most powerful filter, therefore, is to check a variant's frequency against large-scale population databases. If a variant is common, it is highly unlikely to be the cause of a rare genetic disease.
- dbSNP: The Single Nucleotide Polymorphism Database assigns a unique identifier (an "rsID") to previously observed variants. This rsID is added to the ID column of the VCF file, where it serves as a useful cross-reference for logging and searching, but it does not provide the frequency information needed for robust filtering.
- gnomAD (Genome Aggregation Database): This is the essential tool for population frequency filtering. gnomAD aggregates data from over 125,000 exomes and 15,000 genomes from healthy, unrelated individuals, providing a powerful baseline of benign human genetic variation. By annotating our VCF with gnomAD data, we can filter out any variant with an allele frequency (AF) above a chosen threshold (e.g., >1% or >0.1%), dramatically reducing the list of candidates.
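As a concrete sketch of this filter in Python, the snippet below assumes (as illustrative choices, not fixed conventions) that the VCF has already been annotated with a gnomAD allele frequency in an INFO tag named gnomAD_AF (e.g., via VEP or vcfanno), that multiallelic sites have been decomposed to one ALT allele per line, and that a 0.1% cutoff suits the study design:

```python
import gzip

AF_CUTOFF = 0.001  # 0.1%; the right threshold depends on the disease model

def info_to_dict(info_field):
    """Parse a semicolon-delimited INFO string into a dict (flags map to True)."""
    out = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        out[key] = value if value else True
    return out

with gzip.open("cohort.annotated.vcf.gz", "rt") as fh:
    for line in fh:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        info = info_to_dict(fields[7])
        # Variants absent from gnomAD are treated as rare (AF = 0).
        af = float(info.get("gnomAD_AF", 0.0))
        if af <= AF_CUTOFF:
            print(fields[0], fields[1], fields[3], fields[4], af)
```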
Step 3: Predicting Functional Impact – What Does This Variant Do?
Once we have a list of rare variants, the next question is: what is their predicted effect on the gene and its protein product? This is accomplished using a suite of functional annotation and in silico prediction tools.
First, we predict the basic functional consequence of each variant using an annotation engine.
- SnpEff and VEP (Variant Effect Predictor): These are two of the most widely used annotation engines. They take a VCF file as input and, based on a database of known gene and transcript models (like Ensembl or RefSeq), predict the functional consequence of each variant.
This process adds critical information to the VCF INFO field (a short parsing sketch follows this list), including:
- Gene Name: The official symbol of the affected gene.
- Consequence: The predicted effect, such as:
  - High Impact (Loss-of-Function): frameshift_variant, stop_gained (nonsense), splice_acceptor_variant. These variants typically lead to a non-functional protein and are strong candidates.
  - Moderate Impact: missense_variant (changes one amino acid to another). The effect of these variants is highly variable and requires further investigation.
  - Low Impact: synonymous_variant (does not change the amino acid). These are usually benign.
- Putative Impact: A simple classification of the effect's severity (HIGH, MODERATE, LOW, MODIFIER).
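For example, SnpEff writes these predictions into an ANN INFO tag: a comma-separated list (one entry per transcript) of pipe-delimited strings whose leading subfields are allele, consequence, putative impact, and gene symbol. A minimal parsing sketch, with an invented ANN value (the gene and transcript IDs shown are illustrative):

```python
# A hypothetical ANN value with two transcript-level annotations.
ann_value = ("A|stop_gained|HIGH|BRCA2|ENSG00000139618|transcript|"
             "ENST00000380152|protein_coding,"
             "A|synonymous_variant|LOW|BRCA2|ENSG00000139618|transcript|"
             "ENST00000000001|protein_coding")

def parse_ann(value):
    """Split a SnpEff ANN string into per-transcript annotation dicts."""
    annotations = []
    for entry in value.split(","):
        sub = entry.split("|")
        annotations.append({"allele": sub[0], "consequence": sub[1],
                            "impact": sub[2], "gene": sub[3]})
    return annotations

for ann in parse_ann(ann_value):
    if ann["impact"] in ("HIGH", "MODERATE"):
        print(ann["gene"], ann["consequence"])  # -> BRCA2 stop_gained
```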
This step allows us to focus on variants with a predicted moderate or high impact. However, missense variants—the most common type in disease studies—require deeper analysis. To estimate how damaging a specific amino acid change is likely to be, we use a second category of tools.
These in silico tools use evidence like evolutionary conservation and protein biochemistry to generate a deleteriousness score. While foundational tools like SIFT (predicting "tolerated" vs. "deleterious") and PolyPhen-2 (predicting "benign" vs. "possibly/probably damaging") pioneered this space, modern pipelines rely on more advanced "meta-predictors."
- CADD (Combined Annotation Dependent Depletion): Provides a single, integrated score of deleteriousness for all types of variants, with higher scores indicating a higher likelihood of being damaging.
- REVEL (Rare Exome Variant Ensemble Learner): A leading meta-predictor that combines scores from multiple individual tools to produce a single, highly accurate score for missense variants. It is considered one of the best-performing predictors for discriminating pathogenic from benign variants.
These scores are added as another layer of information, helping us prioritize the missense variants that are most likely to be functionally significant.
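As an illustration of how these scores feed into prioritization, the sketch below flags missense variants that exceed commonly cited starting thresholds (CADD phred >= 20, REVEL >= 0.5). The INFO tag names and cutoffs are assumptions that vary by pipeline and disease context, not universal rules:

```python
def is_priority_missense(info):
    """Flag a missense variant whose in silico scores suggest deleteriousness.
    `info` is a dict of INFO-style annotations; tag names are pipeline-specific."""
    if info.get("Consequence") != "missense_variant":
        return False
    cadd = float(info.get("CADD_PHRED", 0.0))  # scaled CADD score
    revel = float(info.get("REVEL", 0.0))      # 0..1, higher = more damaging
    return cadd >= 20.0 or revel >= 0.5

candidate = {"Consequence": "missense_variant",
             "CADD_PHRED": "23.1", "REVEL": "0.62"}
print(is_priority_missense(candidate))  # True
```

Whether to require one score or both (or to use stricter cutoffs) is a design choice: stricter rules trade sensitivity for a shorter candidate list.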
Step 4: Clinical Annotation – Has This Variant Been Linked to Disease Before?
The final layer of annotation involves cross-referencing our variants with databases that aggregate clinical and research findings. This step directly connects a variant to human health.
- ClinVar: A free, public archive from the NIH that aggregates reports of the relationships between human variations and phenotypes. Submissions from clinical testing labs, research studies, and expert panels classify variants as "Pathogenic," "Likely Pathogenic," "Benign," "Likely Benign," or "Variant of Uncertain Significance (VUS)". A "Pathogenic" classification in ClinVar is a very strong piece of evidence.
- HGMD (Human Gene Mutation Database): A comprehensive, manually curated collection of disease-causing mutations. While the full, up-to-date HGMD Professional database requires a license, a more limited and less current version is available for free to academic and non-profit users as HGMD Public. It remains an excellent resource for known pathogenic variants.
- OMIM (Online Mendelian Inheritance in Man): A catalog of human genes and genetic disorders. While not a variant database per se, it provides the crucial link between a gene and a specific human disease, which is essential context for interpretation.
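In practice, if the variants were annotated against a ClinVar VCF release, the assertion typically lands in a CLNSIG INFO tag. A small sketch of flagging reportable assertions (the tag name and value spellings follow the ClinVar VCF convention, but verify them in your own annotation output):

```python
# ClinVar-style clinical significance values worth immediate attention.
REPORTABLE = {"Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic"}

def has_reportable_assertion(info):
    """True if the variant carries a pathogenic/likely-pathogenic ClinVar label."""
    return info.get("CLNSIG", "") in REPORTABLE

print(has_reportable_assertion({"CLNSIG": "Pathogenic"}))              # True
print(has_reportable_assertion({"CLNSIG": "Uncertain_significance"}))  # False
```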
Step 5: Beyond SNPs and Indels – Annotating Structural and Copy Number Variants
While small variants are critical, a complete analysis must also consider larger changes like Structural Variants (SVs) and Copy Number Variants (CNVs), which can involve deletions, duplications, or rearrangements of entire gene segments. These are often missed by standard exome pipelines but are a significant cause of genetic disease.
- Specialized Callers: Identifying these variants requires specialized algorithms such as Manta, Lumpy, or Canvas, which analyze sequencing data for signatures of large-scale genomic changes.
- SV/CNV Databases: Once called, these variants are annotated against databases like gnomAD-SV, which catalogs structural variants in a healthy population, and the Database of Genomic Variants (DGV). This allows for filtering of common, benign SVs and prioritization of rare events that overlap with clinically relevant genes.
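A common heuristic when matching a called SV against these catalogs is reciprocal overlap: the two intervals must each be covered by the other to at least some fraction, with 50% a frequently used default. A minimal sketch of that check:

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b, fraction=0.5):
    """True if intervals [start_a, end_a) and [start_b, end_b) each overlap
    the other by at least `fraction` of their own length."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return False
    return (overlap / (end_a - start_a) >= fraction and
            overlap / (end_b - start_b) >= fraction)

# A called 100 kb deletion vs. a known benign SV spanning most of the region:
print(reciprocal_overlap(1_000_000, 1_100_000, 1_010_000, 1_105_000))  # True
```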
Step 6: Synthesis and Classification – Applying the ACMG Framework
With all these layers of annotation in place, the final step is to synthesize the evidence and make a formal classification. The standard for this process is the framework developed by the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP).
The ACMG guidelines provide a set of 28 criteria for weighing different types of evidence (e.g., population, computational, functional, and segregation data). Each piece of evidence is assigned a weight (e.g., Very Strong, Strong, Moderate, or Supporting) for either a pathogenic or benign assertion. By combining these evidence codes according to a defined set of rules, a variant is ultimately classified into one of five tiers:
- Pathogenic
- Likely Pathogenic
- Variant of Uncertain Significance (VUS)
- Likely Benign
- Benign
This rigorous, evidence-based framework is the gold standard for clinical variant interpretation, ensuring that classifications are systematic, transparent, and reproducible.
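Because the combining rules are explicit, the pathogenic-side logic can be expressed compactly in code. The sketch below tallies met criteria by strength tier and applies the Pathogenic and Likely Pathogenic combinations from the 2015 guideline; benign-side codes, conflict handling, and lab-specific refinements are deliberately omitted, so treat it as illustrative only:

```python
import re
from collections import Counter

def classify_pathogenic_side(codes):
    """Classify from pathogenic-side ACMG/AMP criteria only,
    e.g. codes = ["PVS1", "PM2", "PP3"]."""
    # Tally criteria by strength tier: PVS (very strong), PS (strong),
    # PM (moderate), PP (supporting).
    tiers = Counter(re.sub(r"\d+$", "", c) for c in codes)
    pvs, ps, pm, pp = tiers["PVS"], tiers["PS"], tiers["PM"], tiers["PP"]

    if ((pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp >= 1) or pp >= 2))
            or ps >= 2
            or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2)
                             or (pm == 1 and pp >= 4)))):
        return "Pathogenic"
    if ((pvs == 1 and pm == 1)
            or (ps == 1 and 1 <= pm <= 2)
            or (ps == 1 and pp >= 2)
            or pm >= 3
            or (pm == 2 and pp >= 2)
            or (pm == 1 and pp >= 4)):
        return "Likely Pathogenic"
    return "Variant of Uncertain Significance"

print(classify_pathogenic_side(["PVS1", "PS3"]))        # Pathogenic
print(classify_pathogenic_side(["PM1", "PM2", "PM5"]))  # Likely Pathogenic
```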
Conclusion: From Data to Diagnosis
Variant annotation is a multi-layered process that transforms a raw VCF file into a clinically interpretable report. By systematically integrating population, functional, computational, and clinical data, we filter tens of thousands of variants down to a handful of high-confidence candidates.
Mastering this workflow is fundamental to leveraging genomic data in translational research. While the principles are established, the field continues to evolve, with new algorithms for predicting pathogenicity and growing databases that help resolve Variants of Uncertain Significance (VUS). This rigorous, systematic approach is the critical bridge between sequencing and discovery.
Need to turn your sequencing data into actionable insights? Our variant annotation services deliver the population, functional, and clinical context you need for confident decisions.