
Variant Effect Predictor
Examples and use cases
Example commands
-
Read input from STDIN, output to STDOUT
perl variant_effect_predictor.pl -cache -o stdout
-
Add regulatory region consequences
perl variant_effect_predictor.pl -cache -i variants.txt -regulatory
-
Input file variants.vcf.txt, input file format VCF, add gene symbol
identifiers
perl variant_effect_predictor.pl -cache -i variants.vcf.txt -format vcf -symbol
-
Filter out common variants based on 1000 Genomes data
perl variant_effect_predictor.pl -cache -i variants.txt -filter_common
-
Force overwrite of output file variants_output.txt, check for existing
co-located variants, output only coding sequence consequences, output
HGVS names
perl variant_effect_predictor.pl -cache -i variants.txt -o variants_output.txt -force -check_existing -coding_only -hgvs
-
Specify DB connection parameters in registry file ensembl.registry, add SIFT
score and prediction, PolyPhen prediction
perl variant_effect_predictor.pl -database -i variants.txt -registry ensembl.registry -sift b -polyphen p
-
Connect to Ensembl Genomes db server for A. thaliana
perl variant_effect_predictor.pl -database -i variants.txt -genomes -species arabidopsis_thaliana
-
Load config from ini file, run in quiet mode
perl variant_effect_predictor.pl -config vep.ini -i variants.txt -q
-
Use cache in /home/vep/mycache/, use gzcat instead of zcat
perl variant_effect_predictor.pl -cache -dir /home/vep/mycache/ -i variants.txt -compress gzcat
-
Convert RefSeq-based HGVS notations to genomic coordinates in VCF format
perl variant_effect_predictor.pl -database -i hgvs.txt -o hgvs.vcf -refseq -convert vcf
-
Filter input file on consequence type to include only variants that
cause a coding sequence change, write output in original input format
perl variant_effect_predictor.pl -cache -i variants.vcf -o variants_filtered.vcf -filter coding_change -original
-
Add custom position-based phenotype annotation from remote BED file
perl variant_effect_predictor.pl -cache -i variants.vcf -custom ftp://ftp.myhost.org/data/phenotypes.bed.gz,phenotype
-
Use the plugin named MyPlugin, output only the variation name, feature, consequence type and MyPluginOutput fields
perl variant_effect_predictor.pl -cache -i variants.vcf -plugin MyPlugin -fields Uploaded_variation,Feature,Consequence,MyPluginOutput
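For the first example above (reading from STDIN), the input line can be generated on the fly. The following is a minimal sketch, assuming the VEP's default whitespace-separated input format (chromosome, start, end, alleles as reference/alternate, strand). make_variant is a hypothetical helper name and the coordinates are illustrative values only.

```shell
# Hypothetical helper: print one variant in the default VEP input format
# (chromosome, start, end, reference/alternate alleles, strand), tab-separated.
make_variant() {
    printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Illustrative values only; this just prints the input line.
make_variant 5 140532 140532 T/C +
```

To annotate the variant, pipe the helper's output into the script:
make_variant 5 140532 140532 T/C + | perl variant_effect_predictor.pl -cache -o stdout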
GERP / conservation scores
You can use the VEP's custom annotation feature to add conservation scores to your output. For example, to add GERP scores, download the bigWig file from the list below, and run VEP with the following flag:
perl variant_effect_predictor.pl -cache -i example.vcf -custom All_hg19_RS.bw,GERP,bigwig
Note that to make use of bigWig files you will need the bigWigToWig utility from the Kent source package installed in your path.
Example conservation score files:
All files are provided by the UCSC genome browser. Files for other species are available from their FTP site; be sure to use the file corresponding to the correct assembly.
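Since bigWig support depends on an external program, it can be worth checking for it before starting a long run. A minimal sketch, assuming a POSIX shell; require_tool is a hypothetical helper name and not part of the VEP.

```shell
# Hypothetical helper: succeed only if the named program is found in PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required tool: $1" >&2
    return 1
}

# bigWigToWig is needed for -custom annotations given in bigwig format.
if require_tool bigWigToWig; then
    echo "bigWigToWig found"
else
    echo "install bigWigToWig (Kent source package) and add it to your PATH"
fi
```

The same helper can be reused for other external programs, such as tabix.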
dbNSFP
dbNSFP - "a lightweight database of human nonsynonymous SNPs and their functional predictions" - provides pathogenicity predictions from many tools (including SIFT, PolyPhen, LRT, MutationTaster, FATHMM) across every possible missense substitution in the human proteome. The data is available to download; it cannot be used by the VEP directly, but it is simple to process into a format that the dbNSFP.pm plugin can use.
After downloading the file, you will need to process it so that tabix can index it correctly. This will take a while as the file is very large! Note that you will need the tabix utility in your path to use dbNSFP.
unzip dbNSFP2.0.zip
cat dbNSFP2.0_variant.chr* > dbNSFP
rm dbNSFP2.0_variant.chr*
bgzip dbNSFP
tabix -s 1 -b 2 -e 2 dbNSFP.gz
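The tabix options above declare column 1 as the sequence name (-s) and column 2 as both the start and end position (-b, -e). A minimal sketch for confirming that layout against the file header; list_columns is a hypothetical helper, and the header shown is a shortened sample rather than the full dbNSFP header.

```shell
# Hypothetical helper: read a tab-separated header from standard input and
# print each field of its first line with a 1-based column number.
list_columns() {
    head -n 1 | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Shortened sample header for illustration only.
printf '#chr\tpos(1-coor)\tref\talt\n' | list_columns
```

On the real file, the same helper can be fed with gzip -dc dbNSFP.gz | list_columns, since bgzip output is readable by gzip.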
Then simply download the dbNSFP VEP plugin and place it either in $HOME/.vep/Plugins/ or a path in your $PERL5LIB. When you run the VEP with the plugin, you will need to select some of the columns that you wish to retrieve; to list them run the VEP with the plugin and the path to the dbNSFP file and no further parameters:
perl variant_effect_predictor.pl -cache -force -plugin dbNSFP,dbNSFP.gz
2014-04-04 11:27:05 - Read existing cache info
2014-04-04 11:27:05 - Auto-detected FASTA file in cache directory
2014-04-04 11:27:05 - Checking/creating FASTA index
2014-04-04 11:27:05 - Failed to instantiate plugin dbNSFP: ERROR: No columns selected to fetch. Available columns are:
#chr,pos(1-coor),ref,alt,aaref,aaalt,hg18_pos(1-coor),genename,Uniprot_acc,
Uniprot_id,Uniprot_aapos,Interpro_domain,cds_strand,refcodon,SLR_test_statistic,
codonpos,fold-degenerate,Ancestral_allele,Ensembl_geneid,Ensembl_transcriptid,
aapos,SIFT_score,Polyphen2_HDIV_score,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,
Polyphen2_HVAR_pred,LRT_score,LRT_pred,MutationTaster_score,MutationTaster_pred,
MutationAssessor_score,MutationAssessor_pred,FATHMM_score,GERP++_NR,GERP++_RS,
phyloP,29way_pi,29way_logOdds,LRT_Omega,UniSNP_ids,1000Gp1_AC,1000Gp1_AF,
1000Gp1_AFR_AC,1000Gp1_AFR_AF,1000Gp1_EUR_AC,1000Gp1_EUR_AF,1000Gp1_AMR_AC,
1000Gp1_AMR_AF,1000Gp1_ASN_AC,1000Gp1_ASN_AF,ESP6500_AA_AF,ESP6500_EA_AF
[Ctrl-C]
Note that some of these fields duplicate those produced by the core VEP code (e.g. SIFT, PolyPhen, the 1000 Genomes and ESP frequencies). Because the dbNSFP file covers only missense substitutions, you should enable these through the VEP's own options rather than taking them from dbNSFP. Other fields, such as the conservation scores, may be better served by using genome-wide files as described above.
To select fields, just add them as a comma-separated list to your command line:
perl variant_effect_predictor.pl -cache -force -plugin dbNSFP,dbNSFP.gz,LRT_score,FATHMM_score,MutationTaster_score
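A mistyped column name is easy to miss in a long plugin argument, so it may help to check the requested names against the available columns first. A minimal sketch, assuming a POSIX shell; check_columns is a hypothetical helper and not part of the VEP or the plugin, and the column list is a shortened excerpt of the plugin output shown earlier.

```shell
# Hypothetical helper: the first argument is a comma-separated list of
# available columns; the remaining arguments are the requested names.
# Reports each unknown name on stderr and returns non-zero if any is missing.
check_columns() {
    available=",$1,"
    shift
    status=0
    for name in "$@"; do
        case "$available" in
            *",$name,"*) ;;
            *) echo "unknown column: $name" >&2; status=1 ;;
        esac
    done
    return "$status"
}

# Shortened excerpt of the available columns, for illustration only.
available='LRT_score,LRT_pred,MutationTaster_score,FATHMM_score'
check_columns "$available" LRT_score FATHMM_score MutationTaster_score && echo "all columns found"
```

Each name is matched as a whole, so a partial or misspelled name is reported rather than silently accepted.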
One final point to note is that the dbNSFP scores are frozen on a particular Ensembl release's transcript set; check the readme file on their download site to find out exactly which. While in the majority of cases protein sequences don't change between releases, in some circumstances the protein sequence used by the VEP in the latest release may differ from the sequence used to calculate the scores in dbNSFP.
Citations and VEP users
The VEP is used by many organisations and projects:
- VEP forms a part of Illumina's VariantStudio software
- Gemini is a framework for exploring genome variation that uses the VEP
- The DECIPHER project uses VEP in its analysis pipelines
Other citations and use cases:
- VAX is a suite of plugins for VEP that expands its functionality
- pViz is a visualisation tool for VEP results files
- McCarthy et al. compare the VEP to ANNOVAR
- Pabinger et al. review variant analysis software, including the VEP