Executive Summary
Choosing the right variant caller is a critical trade-off between precision, recall, and computational cost. In this benchmark, we pitted the industry standard (GATK) against the deep-learning challenger (DeepVariant) and the speed demon (Strelka2).
The Verdict:
- DeepVariant is the accuracy king, achieving the highest F1 scores for Indels, particularly in difficult genomic regions (low complexity, homopolymers).
- Strelka2 is the efficiency champion, offering an unmatched balance of speed and accuracy, making it ideal for high-throughput population studies where compute cost is a primary constraint.
- GATK HaplotypeCaller remains the "lingua franca" for compatibility and joint genotyping but shows its age in runtime and Indel sensitivity compared to newer methods.
The Methodology
To ensure a fair comparison, we standardized the upstream pipeline. All tests were run on the Genome in a Bottle (GIAB) HG002 sample (the son in the Ashkenazi trio), aligned to the GRCh38 reference genome using BWA-MEM2. We used the "out-of-the-box" settings for all tools to simulate a typical user experience without extensive parameter tuning.
- Coverage: ~35x Mean Coverage (downsampled from deep sequencing).
- Hardware: AWS EC2 instances: c5.4xlarge (16 vCPU, 32 GB RAM) for CPU-only tools, and g4dn.xlarge (4 vCPU, 16 GB RAM, 1x NVIDIA T4 GPU) for DeepVariant GPU runs.
- Ground Truth: GIAB v4.2.1 high-confidence benchmark regions.
- Versions: GATK v4.6.0.0, DeepVariant v1.8.0, Strelka2 v2.9.10.
- Benchmarking Tool: GA4GH hap.py (v0.3.15) for precision/recall calculation.
The Contenders
GATK HaplotypeCaller
The Industry Standard
GATK is developed by the Broad Institute. While the new "DRAGEN-GATK" mode offers speed improvements, we benchmarked the classic HaplotypeCaller algorithm, which remains the most widely compatible baseline.
DeepVariant
The AI Challenger
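Google's deep learning caller treats variant calling as an image recognition problem. It encodes the read pileup around each candidate site as an image-like, multi-channel tensor and uses a Convolutional Neural Network (CNN) to classify the genotype at that site (homozygous reference, heterozygous, or homozygous alternate).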
Strelka2
The Speed Demon
Strelka2 is Illumina's highly optimized caller. Although it is largely in maintenance mode, it was the fastest of the three tools in this benchmark: its tiered mixture model performs a fast initial scan and runs expensive realignment only where necessary.
Benchmark Results
The following metrics represent performance on the HG002 sample.
| Metric | GATK 4.6 (Standard) | DeepVariant (v1.8) | Strelka2 (v2.9) |
|---|---|---|---|
| SNP F1 Score | 99.3% | 99.7% | 99.4% |
| Indel F1 Score | 96.8% | 99.4% | 97.8% |
| Runtime (30x WGS) | ~14 hours | ~4 hours (GPU) / ~20 hours (CPU) | ~1.5 hours |
| Est. Cloud Cost | $10 - $15 | $3 - $5 | < $1 |
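For readers less familiar with the F1 metric in the table, the sketch below shows how precision, recall, and F1 follow from true-positive, false-positive, and false-negative counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from this benchmark or from hap.py's output.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from TP, FP, and FN counts."""
    # Precision: fraction of called variants that are in the truth set.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truth-set variants that were called.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only (not benchmark data).
p, r, f1 = precision_recall_f1(tp=990, fp=10, fn=30)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two components, which is why a caller that misses many Indels cannot make up for it with high precision alone.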
Detailed Analysis
1. Accuracy: The "Black Box" Wins on Indels
For simple SNPs in high-complexity regions, all three callers are excellent (F1 > 99%). The differentiator is Indels (Insertions/Deletions) and complex regions. DeepVariant's neural network has learned to model sequencing artifacts—such as strand bias and homopolymer errors—that often confuse statistical models like GATK. While GATK requires complex hard-filtering or VQSR (Variant Quality Score Recalibration) to clean up calls, DeepVariant produces high-quality calls "out of the box" with a simple quality score cutoff. However, this comes with a trade-off: DeepVariant is a "black box." Unlike GATK, where you can trace the Bayesian logic, DeepVariant's decisions are hidden within the weights of its CNN.
2. Speed: Strelka2 is in a League of Its Own
If you are processing thousands of genomes, speed is not just a convenience; it's a budget constraint. Strelka2 is blazingly fast, completing a 30x WGS sample in under 2 hours on a standard server. This is achieved through careful engineering and a "cascading" probability model. In contrast, standard GATK is computationally intensive, often becoming the bottleneck unless you adopt the newer (but sometimes restrictive) DRAGEN-GATK mode. DeepVariant sits in the middle—slow on CPU, but competitive on GPU.
3. The "Hidden" Costs of Cloud Genomics
When running on AWS or Google Cloud, time is money.
- GATK is CPU-bound and slow, leading to high instance costs unless heavily parallelized (e.g., using Spark or scattering by chromosome).
- DeepVariant requires GPU instances for optimal speed. While GPU instances (like g4dn) are more expensive per hour, the reduced runtime often makes the total cost per sample lower than GATK.
- Strelka2 is the most cost-effective, capable of running on cheaper, standard CPU instances in a fraction of the time.
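The arithmetic behind this comparison is runtime multiplied by hourly price. The sketch below uses the runtimes from the results table; the hourly rates are assumed placeholders, not quoted AWS prices, and the helper names are hypothetical. Real rates vary by region and pricing model, so the printed numbers are illustrative and will not necessarily match the cost estimates in the table.

```python
def cost_per_sample(runtime_hours: float, hourly_rate_usd: float) -> float:
    """Instance cost for one sample: wall-clock hours times hourly price."""
    return runtime_hours * hourly_rate_usd


def cheapest_option(runs: dict) -> str:
    """Return the name of the run with the lowest per-sample cost."""
    return min(runs, key=lambda name: cost_per_sample(*runs[name]))


# (runtime in hours, hourly rate in USD).
# Runtimes are from the results table; rates are ASSUMED placeholders.
runs = {
    "GATK (CPU)": (14.0, 0.70),
    "DeepVariant (GPU)": (4.0, 0.55),
    "DeepVariant (CPU)": (20.0, 0.70),
    "Strelka2 (CPU)": (1.5, 0.70),
}

for name, (hours, rate) in runs.items():
    print(f"{name}: ${cost_per_sample(hours, rate):.2f}")
print("Cheapest:", cheapest_option(runs))
```

Substituting your own negotiated or spot rates shows quickly whether a pricier GPU instance still wins on total cost for your workload.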
4. Ease of Deployment & DevOps
For bioinformaticians managing pipelines, "ease of use" often means "ease of installation."
- DeepVariant is distributed primarily as a Docker image. This makes it incredibly easy to run if you have a container runtime (Docker/Singularity), but challenging if you are on a legacy HPC system without container support.
- Strelka2 provides a pre-compiled binary that "just works" on most Linux systems, making it the easiest to deploy in traditional environments.
- GATK requires a Java environment (JVM). While improved in version 4, managing Java dependencies and memory heaps (`-Xmx`) remains a common source of frustration and runtime errors.
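To make the container-based route concrete, here is a minimal sketch that assembles a DeepVariant Docker command for a WGS sample. The helper name, directory, and file names are hypothetical; the image tag assumes the v1.8.0 CPU image, and GPU runs need a different image and extra Docker flags that are not shown. The script only prints the command rather than running it.

```python
import shlex


def deepvariant_docker_cmd(data_dir, ref, bam, out_prefix,
                           shards=16, version="1.8.0"):
    """Build (but do not run) a DeepVariant WGS command as an argument list."""
    return [
        "docker", "run",
        "-v", f"{data_dir}:/data",          # mount host data into the container
        f"google/deepvariant:{version}",    # assumed CPU image tag
        "/opt/deepvariant/bin/run_deepvariant",
        "--model_type=WGS",
        f"--ref=/data/{ref}",
        f"--reads=/data/{bam}",
        f"--output_vcf=/data/{out_prefix}.vcf.gz",
        f"--output_gvcf=/data/{out_prefix}.g.vcf.gz",
        f"--num_shards={shards}",           # typically the number of CPU cores
    ]


# Hypothetical paths and file names for illustration.
cmd = deepvariant_docker_cmd("/home/user/data", "GRCh38.fa", "HG002.bam", "HG002")
print(shlex.join(cmd))
```

Building the command as a list keeps quoting safe if it is later passed to `subprocess.run`; on HPC systems without Docker, the same arguments would need to be adapted to the local container runtime.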
5. The Joint Genotyping Factor
If your goal is to analyze a cohort of N=100+ samples, you need Joint Genotyping—the ability to call variants across all samples simultaneously to rescue weak signals in individuals.
- GATK is the undisputed king here. Its HaplotypeCaller in `-ERC GVCF` mode, followed by GenomicsDBImport and GenotypeGVCFs, is the standard workflow for population genomics.
- DeepVariant can emit gVCFs, which are merged with GLnexus (a separate joint-genotyping tool) to produce cohort calls, but the ecosystem is less mature and documentation is sparser than GATK's.
- Strelka2 supports gVCF output, but its joint genotyping workflows are less commonly used in large-scale academic consortia compared to GATK.
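The GATK workflow described above can be sketched as three stages: per-sample HaplotypeCaller in GVCF mode, GenomicsDBImport to consolidate the gVCFs, and GenotypeGVCFs to produce the joint callset. The helper below only builds the command lines; the function name, sample names, file paths, interval, and the 8g heap size are hypothetical choices for illustration, not the settings used in this benchmark.

```python
import shlex


def gatk_joint_genotyping_cmds(ref, bams, workspace, interval, out_vcf, heap="8g"):
    """Build (but do not run) the GVCF-based joint genotyping commands.

    bams maps sample name -> BAM path.
    """
    base = ["gatk", "--java-options", f"-Xmx{heap}"]  # JVM heap set explicitly
    cmds, gvcfs = [], []

    # Stage 1: one gVCF per sample.
    for sample, bam in bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        cmds.append(base + ["HaplotypeCaller", "-R", ref, "-I", bam,
                            "-O", gvcf, "-ERC", "GVCF"])
        gvcfs.append(gvcf)

    # Stage 2: consolidate gVCFs into a GenomicsDB workspace for one interval.
    import_cmd = base + ["GenomicsDBImport",
                         "--genomicsdb-workspace-path", workspace,
                         "-L", interval]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    cmds.append(import_cmd)

    # Stage 3: joint genotyping across all imported samples.
    cmds.append(base + ["GenotypeGVCFs", "-R", ref,
                        "-V", f"gendb://{workspace}", "-O", out_vcf])
    return cmds


# Hypothetical inputs for illustration.
cmds = gatk_joint_genotyping_cmds(
    ref="GRCh38.fa",
    bams={"sampleA": "sampleA.bam", "sampleB": "sampleB.bam"},
    workspace="cohort_db",
    interval="chr20",
    out_vcf="cohort.chr20.vcf.gz",
)
for cmd in cmds:
    print(shlex.join(cmd))
```

In practice the import and genotyping stages are usually scattered across many intervals, which is also the main lever for reducing GATK's wall-clock time.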
Our Recommendation
Which tool should you choose?
- Choose DeepVariant if accuracy is paramount (e.g., clinical diagnostics, rare disease). It produces high-quality calls "out of the box" without extensive filtering, and it handles noisy data (such as PCR-amplified libraries or output from older sequencers) better than the others.
- Choose Strelka2 if you are cost-constrained or processing massive cohorts (1000+ samples). It is also an excellent choice for somatic variant calling (tumor/normal pairs), where it shines even brighter.
- Choose GATK if you require strict adherence to the "Best Practices" for publication, need robust Joint Genotyping for population genetics, or are calling variants from RNA-seq data (where GATK has specialized tools).
References & Further Reading
- DeepVariant GitHub Repository
- Strelka2 GitHub Repository
- GATK Documentation
- Poplin, R., et al. "A universal SNP and small-indel variant caller using deep neural networks." Nature Biotechnology (2018).
- Kim, S., et al. "Strelka2: fast and accurate calling of germline and somatic small variants." Nature Methods (2018).