Variant Caller Showdown: GATK vs. DeepVariant vs. Strelka2

For over a decade, the GATK Best Practices pipeline has been the gold standard for germline variant calling. But the landscape has shifted. With the rise of deep learning (DeepVariant) and highly optimized C++ callers (Strelka2), bioinformaticians are no longer bound to a single workflow. We benchmarked these three industry leaders on accuracy, speed, and cost to help you decide which tool belongs in your production pipeline.

HoppeSyler Scientific Team

Published November 29, 2025

Executive Summary

Choosing a variant caller means trading off precision, recall, and computational cost. In this benchmark, we pitted the industry standard (GATK) against the deep-learning challenger (DeepVariant) and the speed demon (Strelka2).

The Verdict:

DeepVariant is the accuracy king, achieving the highest F1 scores in our benchmark (99.7% for SNPs, 99.4% for Indels), with its clearest advantage on Indels in difficult genomic regions (low complexity, homopolymers).
Strelka2 is the efficiency champion, offering an unmatched balance of speed and accuracy, making it ideal for high-throughput population studies where compute cost is a primary constraint.
GATK HaplotypeCaller remains the "lingua franca" for compatibility and joint genotyping but shows its age in runtime and Indel sensitivity compared to newer methods.

The Methodology

To ensure a fair comparison, we standardized the upstream pipeline. All tests were run on the Genome in a Bottle (GIAB) HG002 dataset (Ashkenazi Son), aligned to the GRCh38 reference genome using BWA-MEM2. We used each tool's default ("out-of-the-box") settings to simulate a typical user experience without extensive parameter tuning.

Coverage: ~35x Mean Coverage (downsampled from deep sequencing).
Hardware: AWS EC2 c5.4xlarge (16 vCPU, 32 GB RAM) for CPU-only runs, and g4dn.xlarge (4 vCPU, 16 GB RAM, 1x NVIDIA T4 GPU) for DeepVariant GPU runs.
Ground Truth: GIAB v4.2.1 high-confidence benchmark regions.
Versions: GATK v4.6.0.0, DeepVariant v1.8.0, Strelka2 v2.9.10.
Benchmarking Tool: GA4GH hap.py (v0.3.15) for precision/recall calculation.
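For readers less familiar with the metrics, the F1 scores reported later are the harmonic mean of precision and recall. The sketch below shows only that arithmetic; the counts are hypothetical and the function name is ours, not part of hap.py, which computes these values itself per variant type.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative call counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the calculation.
p, r, f1 = precision_recall_f1(tp=990, fp=10, fn=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a caller cannot score well by maximizing recall at the expense of precision, or the reverse.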

The Contenders

GATK HaplotypeCaller

The Industry Standard

Developed by the Broad Institute, HaplotypeCaller is GATK's germline caller. Although the newer "DRAGEN-GATK" mode offers speed improvements, we benchmarked the classic HaplotypeCaller algorithm, which remains the most widely compatible baseline.

DeepVariant

The AI Challenger

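Google's deep learning caller treats variant calling as an image classification problem. It encodes the read pileup around each candidate site as a multi-channel, image-like tensor and uses a Convolutional Neural Network (CNN) to classify the site's genotype (homozygous reference, heterozygous, or homozygous alternate).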
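To make the "pileup as image" idea concrete, here is a toy sketch that turns a few aligned reads into a small multi-channel array. It is not DeepVariant's actual encoding: the three channels, the base-to-intensity mapping, and the quality cap of 40 are all assumptions picked for illustration.

```python
# Hypothetical mapping from base to pixel intensity (illustration only).
BASE_INTENSITY = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(reads, width):
    """Encode reads covering the same window as image[channel][row][col].

    reads: list of (sequence, base_qualities, is_reverse_strand) tuples.
    Channels in this toy version: 0 = base, 1 = base quality, 2 = strand.
    """
    channels = 3
    image = [[[0.0] * width for _ in reads] for _ in range(channels)]
    for row, (seq, quals, is_reverse) in enumerate(reads):
        for col in range(min(width, len(seq))):
            image[0][row][col] = BASE_INTENSITY.get(seq[col], 0.0)
            image[1][row][col] = min(quals[col], 40) / 40.0  # cap is an assumption
            image[2][row][col] = 1.0 if is_reverse else 0.5
    return image


reads = [
    ("ACGT", [30, 30, 40, 20], False),
    ("ACTT", [35, 10, 40, 40], True),  # mismatch at the third position
]
image = encode_pileup(reads, width=4)
print(len(image), len(image[0]), len(image[0][0]))  # channels, rows, columns
```

A CNN then consumes arrays like this one, which is why evidence such as low base quality or strand imbalance can influence the call without an explicit statistical rule for each.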

Strelka2

The Speed Demon

Illumina's highly optimized caller. Although largely in maintenance mode, its tiered mixture model remains unmatched for speed, performing fast initial scans and expensive alignment only where necessary.

Benchmark Results

The following metrics represent performance on the HG002 sample.

Metric	GATK HaplotypeCaller (v4.6.0.0)	DeepVariant (v1.8.0)	Strelka2 (v2.9.10)
SNP F1 Score	99.3%	99.7%	99.4%
Indel F1 Score	96.8%	99.4%	97.8%
Runtime (30x WGS)	~14 hours	~4 hours (GPU) / ~20 hours (CPU)	~1.5 hours
Est. Cloud Cost (per sample)	$10 - $15	$3 - $5	< $1
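The cost row is essentially runtime multiplied by the instance's hourly price. The sketch below shows that arithmetic using two runtimes from the table; the hourly rates are hypothetical placeholders, not quoted AWS prices, so substitute current pricing for your region before drawing conclusions.

```python
def run_cost_usd(runtime_hours: float, hourly_rate_usd: float) -> float:
    """Cost of one run: wall-clock hours times the instance's hourly price."""
    return runtime_hours * hourly_rate_usd


# Hypothetical hourly prices (assumptions, NOT quoted AWS rates).
CPU_RATE_USD = 0.80
GPU_RATE_USD = 1.00

cpu_cost = run_cost_usd(14.0, CPU_RATE_USD)  # ~14 h CPU run from the table
gpu_cost = run_cost_usd(4.0, GPU_RATE_USD)   # ~4 h GPU run from the table
print(f"CPU run: ${cpu_cost:.2f}  GPU run: ${gpu_cost:.2f}")
```

Even with a higher assumed hourly rate, the shorter GPU run comes out cheaper per sample, which is the pattern the table reflects.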

Detailed Analysis

1. Accuracy: The "Black Box" Wins on Indels

For simple SNPs in high-complexity regions, all three callers are excellent (F1 > 99%). The differentiator is Indels (insertions/deletions) and complex regions. DeepVariant's neural network has learned to model sequencing artifacts, such as strand bias and homopolymer errors, that often confuse statistical models like the one in GATK (Indel F1 of 99.4% versus 96.8% in our run). While GATK requires complex hard-filtering or VQSR (Variant Quality Score Recalibration) to clean up calls, DeepVariant produces high-quality calls "out of the box" with a simple quality score cutoff. However, this comes with a trade-off: DeepVariant is a "black box." Unlike GATK, where you can trace the Bayesian logic, DeepVariant's decisions are hidden within the weights of its CNN.
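The difference between a single quality cutoff and multi-annotation hard-filtering can be sketched on raw VCF lines. The SNP thresholds below (QD < 2.0, FS > 60.0, MQ < 40.0) are commonly cited starting points from GATK's hard-filtering guidance, shown here as a subset only; the QUAL cutoff of 20 and the two records are hypothetical. In practice you would use the tools' own filtering commands rather than this parser.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO column such as "QD=1.5;FS=70.2;DB"; flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def passes_qual_cutoff(vcf_line: str, min_qual: float) -> bool:
    """Single-threshold filter on the QUAL column (column 6 of a VCF record)."""
    qual = vcf_line.rstrip("\n").split("\t")[5]
    return qual != "." and float(qual) >= min_qual


def passes_snp_hard_filter(vcf_line: str) -> bool:
    """Multi-annotation filter; missing annotations pass in this sketch."""
    info = parse_info(vcf_line.rstrip("\n").split("\t")[7])
    qd = float(info.get("QD", "inf"))
    fs = float(info.get("FS", "0"))
    mq = float(info.get("MQ", "inf"))
    return not (qd < 2.0 or fs > 60.0 or mq < 40.0)


# Hypothetical records: CHROM POS ID REF ALT QUAL FILTER INFO
good = "chr1\t1000\t.\tA\tG\t55.0\t.\tQD=12.3;FS=1.2;MQ=60.0"
bad = "chr1\t2000\t.\tC\tT\t8.0\t.\tQD=1.1;FS=75.0;MQ=35.0"

for record in (good, bad):
    print(passes_qual_cutoff(record, min_qual=20.0), passes_snp_hard_filter(record))
```

The hard filter needs several annotations and a threshold for each, which is where most of the tuning effort goes; the single cutoff has one knob.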

2. Speed: Strelka2 is in a League of Its Own

If you are processing thousands of genomes, speed is not just a convenience; it's a budget constraint. Strelka2 is blazingly fast, completing a 30x WGS sample in under 2 hours (~1.5 hours in our run) on a standard server. This is achieved through careful engineering and a "cascading" probability model. In contrast, standard GATK is computationally intensive (~14 hours in our run), often becoming the bottleneck unless you adopt the newer (but sometimes restrictive) DRAGEN-GATK mode. DeepVariant's position depends on hardware: at ~4 hours on GPU it sits in the middle, but at ~20 hours on CPU it was the slowest of the three.

3. The "Hidden" Costs of Cloud Genomics

When running on AWS or Google Cloud, time is money.

GATK is CPU-bound and slow, leading to high instance costs unless heavily parallelized (e.g., using Spark or scattering by chromosome).
DeepVariant requires GPU instances for optimal speed. While GPU instances (like g4dn) are more expensive per hour, the reduced runtime often makes the total cost per sample lower than GATK.
Strelka2 is the most cost-effective, capable of running on cheaper, standard CPU instances in a fraction of the time.
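Scattering by chromosome, mentioned above for GATK, amounts to running one HaplotypeCaller job per contig with `-L` and merging the outputs afterwards. The sketch below only builds the command lines; the file names are hypothetical, and the contig list assumes GRCh38-style `chr` naming for the primary chromosomes.

```python
def scatter_haplotypecaller(reference: str, bam: str, sample: str, contigs: list) -> list:
    """Build one HaplotypeCaller command per contig (commands are not executed)."""
    commands = []
    for contig in contigs:
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-L", contig,                       # restrict calling to this contig
            "-O", f"{sample}.{contig}.vcf.gz",  # hypothetical per-contig output name
        ])
    return commands


# Primary chromosomes only; alternate contigs and chrM are left out of this sketch.
contigs = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
commands = scatter_haplotypecaller("GRCh38.fasta", "HG002.bam", "HG002", contigs)
print(len(commands))
print(" ".join(commands[0]))
```

Each command can be dispatched to its own worker or instance, and the per-contig VCFs combined afterwards (for example with `gatk MergeVcfs`). Scattering shortens wall-clock time, but total CPU-hours, and therefore cost, stay roughly the same.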

4. Ease of Deployment & DevOps

For bioinformaticians managing pipelines, "ease of use" often means "ease of installation."

DeepVariant is distributed primarily as a Docker image. This makes it incredibly easy to run if you have a container runtime (Docker/Singularity), but challenging if you are on a legacy HPC system without container support.
Strelka2 provides a pre-compiled binary that "just works" on most Linux systems, making it the easiest to deploy in traditional environments.
GATK requires a Java environment (JVM). While improved in version 4, managing Java dependencies and memory heaps (`-Xmx`) remains a common source of frustration and runtime errors.
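As a rough illustration of the deployment differences, the sketch below assembles a containerized DeepVariant invocation and a GATK invocation with an explicit heap size. The paths, file names, shard count, and the 24 GB heap are hypothetical; the flags follow the tools' documented command lines, but check them against the version you deploy.

```python
def deepvariant_docker_cmd(data_dir: str, ref: str, bam: str, out_vcf: str,
                           shards: int, version: str = "1.8.0") -> list:
    """Build a `docker run` command for DeepVariant's run_deepvariant wrapper."""
    return [
        "docker", "run",
        "-v", f"{data_dir}:/data",          # mount the host data directory
        f"google/deepvariant:{version}",
        "/opt/deepvariant/bin/run_deepvariant",
        "--model_type=WGS",
        f"--ref=/data/{ref}",
        f"--reads=/data/{bam}",
        f"--output_vcf=/data/{out_vcf}",
        f"--num_shards={shards}",
    ]


def gatk_cmd_with_heap(heap_gb: int, tool_args: list) -> list:
    """Prefix a GATK tool invocation with an explicit JVM heap limit."""
    return ["gatk", "--java-options", f"-Xmx{heap_gb}g"] + tool_args


dv_cmd = deepvariant_docker_cmd("/mnt/run1", "GRCh38.fasta", "HG002.bam",
                                "HG002.dv.vcf.gz", shards=16)
gatk_cmd = gatk_cmd_with_heap(24, ["HaplotypeCaller", "-R", "GRCh38.fasta",
                                   "-I", "HG002.bam", "-O", "HG002.vcf.gz"])
print(" ".join(dv_cmd))
print(" ".join(gatk_cmd))
```

Setting `-Xmx` somewhat below the machine's physical RAM leaves headroom for the JVM's non-heap memory, which is a common way to avoid the out-of-memory failures mentioned above.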

5. The Joint Genotyping Factor

If your goal is to analyze a cohort of 100 or more samples, you need Joint Genotyping: the ability to call variants across all samples simultaneously to rescue weak signals in individuals.

GATK is the undisputed king here. Its HaplotypeCaller in `-ERC GVCF` mode, followed by GenomicsDBImport and GenotypeGVCFs, is the standard workflow for population genomics.
DeepVariant can emit gVCFs, which are merged for joint genotyping with GLnexus, a separate tool rather than part of DeepVariant itself. This ecosystem is less mature and its documentation sparser than GATK's.
Strelka2 supports gVCF output, but its joint genotyping workflows are less commonly used in large-scale academic consortia compared to GATK.
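The three-step GATK workflow described above can be sketched as a sequence of command lines: one gVCF per sample, one import into a GenomicsDB workspace, then one joint genotyping call. Sample names, file names, the interval, and the workspace path are hypothetical, and the commands are only built, not run.

```python
def joint_genotyping_commands(reference: str, sample_bams: dict,
                              interval: str, workspace: str) -> list:
    """Build the HaplotypeCaller -> GenomicsDBImport -> GenotypeGVCFs commands."""
    commands, gvcfs = [], []
    # Step 1: per-sample calling in gVCF mode.
    for sample, bam in sample_bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(["gatk", "HaplotypeCaller", "-R", reference,
                         "-I", bam, "-O", gvcf, "-ERC", "GVCF"])
    # Step 2: consolidate all gVCFs for one interval into a workspace.
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace, "-L", interval]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    commands.append(import_cmd)
    # Step 3: joint genotyping across the consolidated cohort.
    commands.append(["gatk", "GenotypeGVCFs", "-R", reference,
                     "-V", f"gendb://{workspace}", "-O", "cohort.vcf.gz"])
    return commands


cohort = {"sampleA": "sampleA.bam", "sampleB": "sampleB.bam"}
jg_commands = joint_genotyping_commands("GRCh38.fasta", cohort, "chr20", "cohort_db")
for cmd in jg_commands:
    print(" ".join(cmd))
```

Because step 1 is independent per sample, new samples can be added later by generating their gVCFs and repeating only steps 2 and 3.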

Our Recommendation

Which tool should you choose?

Choose DeepVariant if accuracy is paramount (e.g., clinical diagnostics, rare disease). It is the easiest to use "out of the box" and handles noisy data (such as PCR-amplified libraries or older sequencers) better than the others.
Choose Strelka2 if you are cost-constrained or processing massive cohorts (1000+ samples). It is also an excellent choice for somatic variant calling (tumor/normal pairs), where it shines even brighter.
Choose GATK if you require strict adherence to the "Best Practices" for publication, need robust Joint Genotyping for population genetics, or are calling variants from RNA-seq data (where GATK has specialized tools).
