Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools

San Lucas, F. Anthony; Wang, Gao; Scheet, Paul; Peng, Bo

doi:10.1093/bioinformatics/btr667

Abstract

Motivation: Storing, annotating and analyzing variants from next-generation sequencing projects can be difficult due to the availability of a wide array of data formats, tools and annotation sources, as well as the sheer size of the data files. Useful tools, including the GATK, ANNOVAR and BEDTools can be integrated into custom pipelines for annotating and analyzing sequence variants. However, building flexible pipelines that support the tracking of variants alongside their samples, while enabling updated annotation and reanalyses, is not a simple task.

Results: We have developed variant tools, a flexible annotation and analysis toolset that greatly simplifies the storage, annotation and filtering of variants and the analysis of the underlying samples. variant tools can be used to manage and analyze genetic variants obtained from sequence alignments, and the command-line driven toolset could be used as a foundation for building more sophisticated analytical methods.

Availability and implementation: variant tools consists of two command-line driven programs vtools and vtools_report. It is freely available at http://varianttools.sourceforge.net, distributed under a GPL license.

Contact: bpeng@mdanderson.org

1 INTRODUCTION

Tracking samples and predicted variants from next-generation sequencing projects often requires building custom analysis pipelines. Data standards such as the Browser Extensible Data (BED) (Hinrichs et al., 2006), General Feature Format and Variant Call Format (VCF) (Danecek et al., 2011) file specifications can be used to represent these variants in a common format, simplifying integration of tools and the construction of these analysis pipelines. Difficulties include the integration of diverse annotation sources and the management of many large intermediate files containing millions of predicted variants and millions more associated annotations for each sample. These annotation sources and intermediate files often have fundamental inconsistencies using either 0- or 1-based coordinates and potentially different genomic builds, which can complicate their management and integration.

For biologists or analysts who have familiarity with programming and running tools from the command line, there are many useful tools that can be integrated into custom pipelines to annotate and filter variants. These tools include ANNOVAR (Wang et al., 2010) and BEDTools (Quinlan and Hall, 2010). However, building effective pipelines that relate variants to their samples and sample attributes (such as cases and controls), while applying multiple annotation sources require a large customization effort. A framework for building pipelines that facilitate simple, reproducible and recurrent analyses is currently lacking. Therefore, we have developed variant tools, a flexible, open-source toolset upon which custom pipelines can be easily constructed. This toolset facilitates the storage of variants (alongside their sample details) as well as the annotation, filtering and reporting of these variants at multiple levels—starting with variant reports based on individual samples to project-wide variant reports.

2 METHODS

variant tools is a command-line driven toolset written in the Python scripting language that incorporates either SQLite or MySQL as a backend database management system. The toolset is used to create a variant project, which is conceptually designed around a master variant table that often consists of millions of variants for all of the samples in a sequencing project along with variant attributes (called fields in variant tools). Variant fields can include sample statistics, which variant tools can generate, or information provided by annotation data sources. Regardless of the source of these fields, they can be used to select, output and analyze genetic variation from the project. As illustrated in Figure 1, analyzing genetic variants from next-generation sequencing projects typically involves four steps, namely importing, annotating, filtering and reporting. variant tools installs easily and sets up a working environment with human genome annotation sources that can be downloaded automatically. Since variant tools manages project variants and annotation sources for the user, it is easier to reanalyze variants as genomic builds change and as annotation sources are updated or become available. The burden of tracking VCF files, annotation files and numerous scripts is reduced. variant tools along with accompanying source code, documentation and tutorials are freely available at http://varianttools.sourceforge.net.

Sample and variant import: variant tools accommodates a variety of variant file formats. It supports import of VCF files or other tab-delimited formats such as intermediate output from ANNOVAR or BEDTools. It is capable of annotating and reporting on all types of variants, including indels, as long as annotation sources are available. The toolset also supports annotation and reporting of project variants using multiple genomic builds, by automatically downloading and integrating the UCSC liftOver tool (Hinrichs et al., 2006). As an example, if variants are imported to a project using build hg18, they can be annotated using annotation sources designed for build hg19, and exported based on either hg18 or hg19 coordinates.
Annotation: variant tools can incorporate databases that annotate individual variants or genomic regions, such as genes or pathways. A growing number of annotation sources such as dbNSFP (Liu et al., 2011) or KEGG pathways (Kanehisa and Goto, 2000) can be downloaded automatically whereas customized annotation databases could be created following a well-documented procedure. Any genomic data source can be imported into variant tools and used for annotations as long as variants can be linked to the new data through an existing variant attribute (such as genomic coordinates or a gene name).
Variant/sample selection (or filtering for quality control): variants or genotypes can be selected by read depth, by any of the available annotation fields (e.g. variant type, pathway or predicted damaging effects) or by sample frequency across subsets of samples, which is useful for comparing variants across populations (e.g. cases and controls). Complex criteria involving multiple fields from different annotation sources can be used to select or filter variants. Selected variants can be counted, exported or saved to separate underlying tables, where they can be annotated and filtered separately from other variants.
Report generation: variants can be exported in VCF or tab-delimited formats with an arbitrary number of fields. In addition to details provided by annotation sources, users can output sample statistics such as numbers of homozygous and heterozygous genotypes in samples for selected variants. Basic arithmetic operations and aggregate functions can also be used to output summary statistics of variants such as the average depth of coverage for a particular set of variants.

Fig. 1.

Open in new tab Download slide

Example pipeline constructed from vtools utilities. variant tools provides a simple command-based interface that communicates with a back-end database. A large focus of variant tools is managing a master variant table that contains all the distinct variants of the project.

3 DISCUSSION

Despite an intuitive command-line interface, some high-level reports, such as calculating sample transition/transversion ratios or reporting the number of variants per gene, involve several vtools commands. To simplify the use of variant tools, we provide a reporting command vtools_report that generates example summary reports. These reports make the use of variant tools more practical, and the vtools_report source code provides examples of how to combine and further customize vtools commands.

Within variant tools, variants are linked to but stored separately from their annotations within a relational database, helping to conserve disk space by removing the need to store large and repetitive intermediate annotation files. Database indexes are automatically created to improve query performance during annotation and filtering, though these do add to the storage requirements of variant tools.

For an example, we created a vtools project with 44 whole-genome VCF files with 161 million predicted sample variants. This required 3.3GB of disk space to store the variants and indexes within an SQLite database compared to 2GB of disk space for the VCF files compressed or 9GB uncompressed. A benefit, however, of the vtools approach is that these variants were stored using both hg18 and hg19 genomic coordinates within SQLite. When using a MacPro workstation with two Quad-Core Intel Xeon Processors at 2.26 GHz and 8GB of RAM, the project creation required 3.5 hours. This time can be reduced to an hour if variants are processed in parallel by vtools on a cluster system before they are merged to a larger project. The time required for subsequent annotation and filtering of these variants ranged from 1 to 10 min. Additional details and other examples can be found in the tutorials section of the software website.

We have provided a preconfigured but customizable framework for the analysis of variants from next-generation sequence data. Although our efforts were motivated by a desire to produce initial, non-statistical analyses, we are currently expanding our software to include a suite of powerful tests for association studies. Our general framework will allow the implementation and comparison of a wide array of analytical methods.

Funding: National Institutes of Health (grants R01AR044422, U01 GM 92666, 5R03CA143982 and 1R01HG005859); Schissler Foundation (to F.A.S.L.); Lyda Hill Foundation (to B.P.).

Conflict of Interest: none declared.

REFERENCES

Danecek

P.

, et al.

The variant call format and VCFtools

,

Bioinformatics

,

2011

, vol.

27

(pg.

2156

-

2158

)

Hinrichs

A.S.

, et al.

The UCSC genome browser database: update 2006

,

Nucleic Acids Res.

,

2006

, vol.

34

Suppl. 1

(pg.

D590

-

D598

)

Kanehisa

M.

,

Goto

S.

.

KEGG: Kyoto Encyclopedia of Genes and Genomes

,

Nucleic Acids Res.

,

2000

, vol.

28

(pg.

27

-

30

)

Liu

X.

, et al.

dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions

,

Hum. Mutat.

,

2011

, vol.

32

(pg.

894

-

899

)

Quinlan

A.R.

,

Hall

I. M.

.

BEDTools: a flexible suite of utilities for comparing genomic features

,

Bioinformatics

,

2010

, vol.

26

(pg.

841

-

842

)

Wang

K.

, et al.

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data

,

Nucleic Acids Res.

,

2010

, vol.

38

pg.

e164

Author notes

Associate Editor: Alex Bateman

Download all slides

Month:	Total Views:
December 2016	3
January 2017	6
February 2017	18
March 2017	25
April 2017	7
May 2017	15
June 2017	5
July 2017	20
August 2017	20
September 2017	7
October 2017	11
November 2017	26
December 2017	27
January 2018	26
February 2018	77
March 2018	42
April 2018	29
May 2018	61
June 2018	21
July 2018	43
August 2018	64
September 2018	54
October 2018	41
November 2018	46
December 2018	38
January 2019	31
February 2019	39
March 2019	52
April 2019	61
May 2019	56
June 2019	29
July 2019	58
August 2019	41
September 2019	39
October 2019	45
November 2019	34
December 2019	26
January 2020	40
February 2020	31
March 2020	24
April 2020	37
May 2020	17
June 2020	35
July 2020	23
August 2020	18
September 2020	25
October 2020	22
November 2020	13
December 2020	22
January 2021	25
February 2021	26
March 2021	28
April 2021	30
May 2021	27
June 2021	8
July 2021	22
August 2021	18
September 2021	15
October 2021	19
November 2021	25
December 2021	23
January 2022	36
February 2022	42
March 2022	36
April 2022	68
May 2022	38
June 2022	34
July 2022	23
August 2022	35
September 2022	20
October 2022	23
November 2022	20
December 2022	33
January 2023	23
February 2023	17
March 2023	27
April 2023	32
May 2023	22
June 2023	21
July 2023	19
August 2023	22
September 2023	21
October 2023	15
November 2023	30
December 2023	33
January 2024	26
February 2024	28
March 2024	36
April 2024	12

Article Contents

Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools

Abstract

1 INTRODUCTION

2 METHODS

3 DISCUSSION

REFERENCES

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools

Abstract

1 INTRODUCTION

2 METHODS

3 DISCUSSION

REFERENCES

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only