Skip to main content

calculate Genotype Networks from VCF files

Project description

VCF2Networks is a python script that allows to calculate Genotype Networks from a Variant Call Format (VCF) file.

Genotype Networks are a method used in the field of Systems Biology to study the “Evolvability” or the “Innovability” of a genetic system. They have been applied to a wide range of systems, like genetic circuits ([1], [2], [3])⁠⁠, metabolic networks [4], [5], [6], [7], [8]), and RNA and protein foldings ([9], [10])⁠. In recent times, they have also been used to reconcile the neutralist and selectionist views on evolution ([11])⁠, and to examine exaptation in metabolic networks ([12])⁠.

This library allows to apply genotype networks to Single Nucleotide Polimorphism data, thus studying these networks at the level of population genetics, or in case-control studies.

INSTALLING

VCF2Networks is a suite of python scripts that run on the terminal. At the moment, there is no graphical interface, and only UNIX systems are supported.

The recommended way to install VCF2Networks is to use easy_install or pip, installing it from the Python Package Index (PyPI):

$: easy_install vcf2networks

If the above command does not work, or if for whatever reason you are not able to install VCF2Networks, please consult the detailed Install Instructions. In particular, VCF2Networks is based on the python-igraph>0.6 package, which at the moment is not fully supported in PyPI, and must be installed manually. See the instructions for more details.

BASIC TUTORIAL

VCF2Networks allows to parse a VCF file, and produce a tabular report of the Genotype Network properties of the file.

Here is a short tutorial on how to run it on a sample VCF file, exploring the most important options.

Datafiles to follow this tutorial

This tutorial contains some examples, based on a sample VCF file and some parameters. To follow the tutorial, please download the example files from here: tutorial_data.zip, and decompress them to your working directory.

Checking if VCF2Networks is installed

The main script in VCF2Networks is called vcf2networks. If you installed VCF2Networks succesfully, this script should be installed in your PATH folder. To check if vcf2networks is installed, type:

$: vcf2networks --help

You should get a summary of all the options available for the script.

Simple example

The following example shows the basic usage of VCF2Networks:

$: vcf2networks  --vcf tutorial/MOGS.recode.vcf --individuals_file tutorial/individuals_annotations.txt
file_prefix continent   n_datapoints    n_components    av_path_length  av_degree   window      whole_gene_nsnps
MOGS        glob        2184            17              3.847           2.5037      window_0    37
MOGS        EUR         758             7               2.6508          2.3256      window_0    37
MOGS        ASN         572             6               2.9             2.0         window_0    37
MOGS        AMR         362             8               2.6477          2.1277      window_0    37
MOGS        AFR         492             15              3.4103          2.0         window_0    37

This command will calculate the properties of the Genotype Network of the file “MOGS.recode.vcf”. The first line of the output (after the headers) will show the properties of the network built using all the individuals in the file. The next lines give details of the networks built for each population separately.

Each line contains some information about the Genotype Networks of a population. The first three columns give details about the file name, the population, and the number of haplotypes for each population. The fourth, fifth and sixth columns show the Number of Components, Average Path Length and the Average Degree respectively. It seems that for this region, African populations have an high number of components (15), while Europeans and Americans have a very low average path length (2.6508 and 2.6477).

The last two columns contain debugging information, like a unique id for each network, and the total of SNPs available.

(note: the actual output may different from the one shown. In particular, fields will be separated by a space, instead of tabulation. Open the output in a spreadsheet to see it better.)

Let’s have a look at the options

Let’s have a look at the options used in the previous example.

–vcf tutorial/MOGS.recode.vcf:

The –vcf option is used to indicate which VCF file must be parsed to generate the networks. The VCF file must be sorted by position, and contain only diploid individuals. At the moment only Single Nucleotide Polimorphism data is supported, so you must remove any other type of variation before using the file.

–individuals_file tutorial/individuals_annotations.txt:

The –individuals_file option points to the file containig the list of individuals, and their annotations. This file also contains any information on the phenotype of the individuals. Have a look at the file data/individuals/individuals_annotations.txt for an example.

Another useful option is –config, which describes a parameters file containing the network properties that have to be calculated.

–config tutorial/simple_params.yaml:

The –config option is used to point to a configuration file, containing the list of the Genotype Network properties to be calculated. Have a look at the file params/simple.yaml to have an idea of which properties can be included in the output file.

The config file: let’s add new Network Properties to the output

If you look at the tutorial/simple_params.yaml file, you will see that there are a lot of properties that can be printed, apart from the ones in the first example.

Here are the contents of the tutorial/simple_params.yaml file:

$: cat tutorial/simple_params.yaml

# Properties to be included in the report of the network properties
networkproperties_report_fields:
#    - gene
    - file_prefix
#    - distance_definition
    - n_snps
    - central_snp
    - chromosome
    - central_snp_position
#    - upstream_position
#    - downstream_position
    - region_size  # size of the window, in nucleotides pairs
#    - distance_from_upstream_margin
#    - distance_from_downstream_margin
    - n_datapoints
    - av_datapoints_per_node
    - n_vertices
    - n_edges
    - n_components
    - av_path_length
#    - var_path_length
    - diameter
    - av_w_path_length
#    - av_w_path_length_inv
    - av_degree
#    - var_degree
#    - median_degree
    - max_degree
    - av_w_degree
#    - av_w_degree_inv

As you can see, there is a long list of network properties, the majority of which is uncommented.

Uncomment any of these network properties to include it in the output report, and then run the script using the –config option:

$: vcf2networks  --vcf tutorial/MOGS.recode.vcf --config tutorial/simple_config.yaml --individuals_file tutorial/individuals_annotations.txt
file_prefix continent   distance_definition n_snps  central_snp     region_size n_datapoints n_components   av_path_length  diameter    av_closeness max_closeness  av_betweenness  window  whole_gene_nsnps
MOGS        glob        1                   37      rs188042257     23805       2184            17          3.847           10          0.011       0.0132          57.6148         window_0    37
MOGS        EUR         1                   37      rs188042257     23805       758             7           2.6508          5           0.0414      0.0517          14.5116         window_0    37
MOGS        ASN         1                   37      rs188042257     23805       572             6           2.9             7           0.0514      0.064           11.2812         window_0    37
MOGS        AMR         1                   37      rs188042257     23805       362             8           2.6477          6           0.0292      0.0354          9.8511          window_0    37
MOGS        AFR         1                   37      rs188042257     23805       492             15          3.4103          8           0.0194      0.0252          21.726          window_0    37

A detailed description of all the columns available for VCF2Network is provided in the Supplementary Materials 1 of the paper (this part of the documentation will be updated after publication).

Grouping individuals by phenotype, instead of populations

Another useful option in VCF2Network is to generate networks using information on the phenotype, instead that by population. For example, you can create two networks for the “Tall” and the “Short” individuals, and see if they have different properties.

To produce networks based on phenotype, you must first be sure that the “individuals” file contains a column for the phenotype. For example, have a look at the individuals file provided in this repository:

$: cat tutorial/individuals_file.txt

#ID         POP     CONTINENT        PHENOTYPE1
HG00096     GBR     EUR             disease
HG00097     GBR     EUR             control
HG00099     GBR     EUR             control
HG00100     GBR     EUR             disease
HG00101     GBR     EUR             control
HG00102     GBR     EUR             control
HG00103     GBR     EUR             disease
HG00104     GBR     EUR             control
HG00106     GBR     EUR             control
HG00108     GBR     EUR             disease
HG00109     GBR     EUR             control
HG00110     GBR     EUR             control
HG00111     GBR     EUR             disease
HG00112     GBR     EUR             control
HG00113     GBR     EUR             control
HG00114     GBR     EUR             disease
HG00116     GBR     EUR             disease
HG00117     GBR     EUR             disease
HG00118     GBR     EUR             control
HG00119     GBR     EUR             disease

The fourth column in this file contains a phenotype called “PHENOTYPE1”, which classifies individuals into “disease” and “controls”.

Let’s use the –phenotype option to see if these two sets of individuals have different properties:

$: vcf2networks  -g tutorial/MOGS.recode.vcf -c tutorial/simple_config.yaml -i tutorial/individuals_annotations.txt --phenotype PHENOTYPE1
file_prefix PHENOTYPE1  n_components    av_path_length  av_degree   window      whole_gene_nsnps
MOGS        glob        17              3.847           2.5037      window_0    37
MOGS        disease     14              3.5768          2.1667      window_0    37
MOGS        control     15              3.8127          2.3148      window_0    37

Well, it seems that the “disease” individuals have a lower average degree than the control one (2.1667 agains 2.3148), although overall, the differences are not very big in this disease.

Subsampling individuals

When the dataset in your VCF file contains a different number of individuals for each population (phenotype group), it is useful to sample only a fixed number of individuals for each population. This can be done using the –sample option. The following example samples only 500 haplotypes for each group:

$: vcf2networks --sample 500 --vcf tutorial/MOGS.recode.vcf --config tutorial/simple_config.yaml -p PHENOTYPE1
file_prefix PHENOTYPE1  n_datapoints    n_components    av_path_length  av_degree   window      whole_gene_nsnps
MOGS        glob        2184            17              3.847           2.5037      window_0    37
MOGS        global_sub  500             12              3.3886          1.9697      window_0    37
MOGS        disease     500             12              3.3939          1.8113      window_0    37
MOGS        control     500             16              3.6625          2.2105      window_0    37

The first line of the output shows the values for the global population, using all the individuals. The second line shows the values of t

Applying a sliding window approach

An useful option in vcf2networks is -w, which allows to apply a sliding windows approach.

$: python src/vcf2networks.py  -g data/vcf_filtered/MOGS.recode.vcf -c params/default.yaml     -w 12 -p PHENOTYPE1
file_prefix PHENOTYPE1 n_snps central_snp chromosome central_snp_position region_size n_datapoints av_datapoints_per_node n_vertices n_edges n_components av_path_length diameter av_w_path_length av_degree max_degree av_w_degree av_closeness max_closeness max_betweenness window whole_gene_nsnps
MOGS glob 12 rs73949668 2 74681774 8871 2184 80.8889 27 35 1 3.8063 9 0.4238 2.5926 10 0.2244 0.2736 0.3662 168.2455 window_0 37
MOGS pink 12 rs73949668 2 74681774 8871 1508 62.8333 24 31 1 3.5471 8 0.5224 2.5833 10 0.2383 0.2949 0.4035 150.8242 window_0 37
MOGS blue 12 rs73949668 2 74681774 8871 676 32.1905 21 23 1 3.7905 9 0.6178 2.1905 8 0.1984 0.2785 0.3922 123.4333 window_0 37
MOGS glob 12 rs184552219 2 74690931 8383 2184 84.0 26 29 3 2.6346 6 0.3841 2.2308 7 0.2031 0.0713 0.0865 60.5 window_1 37
MOGS pink 12 rs184552219 2 74690931 8383 1508 65.5652 23 26 2 2.4331 5 0.4191 2.2609 7 0.2342 0.0811 0.0957 49.3333 window_1 37
MOGS blue 12 rs184552219 2 74690931 8383 676 28.1667 24 26 3 2.5188 6 0.4074 2.1667 7 0.2051 0.0778 0.0954 55.0 window_1 37
MOGS glob 12 rs139210283 2 74698509 5737 2184 66.1818 33 48 1 3.178 7 0.1858 2.9091 8 0.1363 0.3246 0.4923 234.2405 window_2 37
MOGS pink 12 rs139210283 2 74698509 5737 1508 68.5455 22 27 1 2.9567 5 0.2262 2.4545 6 0.1712 0.3494 0.5385 123.8667 window_2 37
MOGS blue 12 rs139210283 2 74698509 5737 676 23.3103 29 40 1 3.1133 7 0.1921 2.7586 8 0.1229 0.3324 0.5 189.9 window_2 37

As you can see, the data in the file MOGS.recode.vcf, which contains 37 SNPs, is splitted into windows of 12 SNPs (the remaining SNP is ignored), and a different output is given for each region of 12 SNPs.

The sliding windows approach is useful to compare files that have a different number of SNPs.

An useful option to be used with the sliding windows approach is the -l, which allows to do a sliding window approach using overlapping windows.

The GenotypeNetwork Class

Apart from the main vcf2networks script, there are many functions hidden in the code of the library, which allow better performances, and customization.

For example, the file GenotypeNetwork.py contains the definition of the GenotypeNetwork class, which can be used to calculate the network properties manually, or to implement new functions.

Basic usage of the Genotype Network class:

Create an empty Genotype Network: >>> mynetwork = GenotypeNetwork(chromosome_len=3, name=’mynetwork’)

Populate the graph from a set of genotypes:

>>> genotypes = ['001', '010', '000']
>>> mynetwork.populate_from_binary_strings(genotypes)

# Print informations about the network: >>> print(mynetwork) Genotype Network (name = ‘mynetwork’, chromosome lenght = 3, |V| = 3, |E| = 2)

>>> print(mynetwork.summary()) #doctest: +NORMALIZE_WHITESPACE
3 nodes, 2 edges, undirected
<BLANKLINE>
Number of components: 1
Diameter: 2
Density: 0.6667
Average path length: 1.3333

# Print node degree distribution for the network >>> mynetwork.degree() [1, 1, 2]

>>> print mynet.csv_report()
mynet glob 3 5 5 3 2 2 0.3 1.0 1.5 1.2 3 0.6 0.38

Availability

This repository is available at http://bitbucket.org/dalloliogm/vcf2networks/overview

Project outline, TO-DO list and bugs

Project Proposal and TO-DO lists are implemented in separate trello boards.

Bugs are usually posted to the TO-DO trello board, in one of the TO-DO columns.

REFERENCES

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vcf2networks-1.0.3.tar.gz (77.8 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page