Genomic Variant Aggregation and Mutation Correlation
mucor is software to aggregate variant information sourced from multiple VCF files (and some others, see below) into a variety of summary files with varying levels of detail and statistics. The outputs range from high-level summaries to full, detailed reports in text and Microsoft Excel formats. The intended audience is both biologists and bioinformaticians.
The following modules are required:
- numpy (http://www.numpy.org/)
- pandas (http://pandas.pydata.org/)
- HTSeq (http://www-huber.embl.de/HTSeq/)
The following modules are optional:
- bgzip and Tabix (optional, see Databases, below; https://github.com/samtools/htslib)
mucor requires a reference annotation file (in GTF or GFF3 format) for feature definition. Contig (chromosome) names must match between the input VCF and reference annotation file; for example, for Homo sapiens, UCSC uses the form chr17 while Ensembl uses the form 17.
Protip: If your human data are aligned and variants called with chr style contig names but you wish to use Ensembl genes, you can try the GENCODE annotation which contains most Ensembl genes but using chr style contig numbering.
mucor can optionally check variants against any number of supplied databases and report presence (including identifier) or absence in the database. Common choices would be one or more releases of dbSNP (e.g., dbSNP 137 and latest dbSNP; dbSNP clinvar and dbSNP all), 1000 Genomes, NHLBI 6500 Exomes, or COSMIC. Users may also supply their own custom databases.
Databses must meet the following requirements: * Conform to VCF standard format * be compressed with bgzip (not gzip; see below) * be indexed with tabix (see below).
bgzip and tabix provide massive speedups looking up variants in these external files. The database lookup feature is not available if files are not correctly bgzipped or tabix indexed, or if the Pytabix module is not installed. bgzip and tabix are available as part of htslib (https://github.com/samtools/htslib)
Beginning with an example VCF dbSNP.vcf, execute the following:
$ bgzip dbSNP.vcf $ tabix -p vcf dbSNP.vcf.gz
Note that bgzipped files can be treated as regular .gz files by other tools.
Variant call data
These data are the data that you wish to aggregate and summarize. In theory, mucor will work with any well-formed generic VCF, but it has been tested with and contains specific code to handle VCF files generated by the following tools:
- Ion Torrent PGM (default machine output)
- Illumina Miseq (default machine output)
- GATK SomaticIndelDetector
- GATK HaplotypeCaller
In addition, mucor can read the more detailed .out files produced by MuTect.
clone from https://github.com/blachlylab
Setup and Operation
Running mucor to aggregate and summarize variants is a two step process:
- Project setup / configuration
- mucor_config.py creates a JSON config file
- Variant Aggregation!
Configuration with mucor_config.py
The configuration step is completed using the provided mucor_config.py utility. It accepts the following command line arguments and creates a JSON file that will be passed to the main mucor script.
-ex, --example Print a valid, example JSON config file and exit. Function that will write a template of the JSON config. It can be edited manually and supplied to mucor.
-g GFF, --gff GFF Reference annotation GFF/GTF for feature binning. Required
-f FEATURETYPE, --featuretype FEATURETYPE Feature type into which to bin. Gencode GTF example: gene_name, gene_id, transcript_name, transcript_id, etc. Required
-db DATABASES, --databases DATABASES Colon delimited name and path to variant database in bgzipped VCF format. Can be declared >= 0 times. Ex: -db name1:/full/user/path/name1.vcf.gz. Optional
-s SAMPLES, --samples SAMPLES Text file containing sample names. One sample per line. mucor_config.py attempts to guess which files belong with which sample IDs using globbing (wildcard filename matching). This means that the auto-configuration may be incorrect if any sample names are contained within another sample name. Ex: U-23 and U-238. U-23 would erroneously identify U-238 files, requiring manual modification of the JSON file. Sample IDs U-023 and U-238 would not exhibit this problem. Required
-d PROJECT_DIRECTORY, --project_directory PROJECT_DIRECTORY Working/project directory, in which to find variant call files to aggregate. Variant calls can be in the provided directory, or any of its subdirectories. Default: current working directory
-vcff VCF_FILTERS, --vcf_filters VCF_FILTERS Comma separated list of VCF filters to allow. Default: PASS
-a ARCHIVE_DIRECTORY, --archive_directory ARCHIVE_DIRECTORY Specify directory in which to read/write archived annotations. This step will significantly speed up future runs that use the same annotation and feature type, even if the sample data changes. Undefined will prevent using the annotation archive features. Optional
-r REGIONS, --regions REGIONS Comma separated list of bed regions and/or bed files by which to limit output. Bed regions can be specific positions, or entire chromosomes. Ex: chr1:10230-10240,chr2,my_regions.bed. Optional
-u, --union Join all items with same ID for feature_type (specified by -f) into a single, continuous bin. For example, if you want intronic variants counted with a gene, use this option. WARNING, this will lead to spurious results for features that are duplicated on the same contig. When feature names are identical, the bin will range from the beginning of the first instance to the end of the last, even if they are several megabases apart. Refer to the documentation for a resolution using ‘detect_union_bin_errors.py.’ Optional.
-jco JSON_CONFIG_OUTPUT, --json_config_output JSON_CONFIG_OUTPUT Name of JSON configuration output file. This is the configuration file fed into mucor. Required
-outd OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY Name of directory in which to write mucor output. Required
-outt OUTPUT_TYPE, --output_type OUTPUT_TYPE Comma separated list of desired output types. Options include: counts, txt, longtxt, xls, longxls, bed, featXsamp, featmutXsamp, all. Default: counts,txt (See the detailed description of Output File Formats, below)
python ./mucor_config.py -g ~/references/human/gencode/gencode.v19.annotation.gtf -f gene_name -s samples.txt -a ./fast -u -jco mucor_config.json -outd ./mucor_output -outt all -db 1000G:~/references/human/1000_genomes/chrm_1000Genomes.20130502.genotypes.vcf.gz -db dbsnp:~/references/human/dbsnp/common_all.hg19.sorted.leftalign.vcf.gz
JSON config file
mucor_config.py produces a JSON-formatted configuration file embodying the selected options for the subsequent mucor run. The configuration could be edited manually (e.g., to tweak a database name or path) or programmatically (e.g., if mucor_config.py’s guesses about sample IDs were incomplete) at this stage.
Alternatively, mucor_config.py -ex produces a syntactically valid example JSON file for editing.
Producing a configuration file is intended to facilitate the following: * Consistency between runs * Documentation of settings * Easier use with dozens to thousands of input files
Execution: mucor.py <config.json>
mucor is executed by launching the main mucor.py script, passing as a single parameter the name of the previously generated/edited JSON file.
Output files of the specified type are placed in the output directory specified during configuration.
Output File Formats
Output types are specified at the time of configuration. The user may select any number and combination of output types from the list below.
all Execute all output types
counts Print counts of mutations per feature. Output: counts.txt
txt Print all information about each variant, with one-per-row, irrespective of how many samples in which it appears. Useful for variant-centric projects. Identical to xls in layout. Output: variant_details.txt
longtxt Similar to txt above, but writes each instance of a variant to a new row. Each variant is written once per source file, instead of combining recurrent variations into 1 unique row. Identical to longxls in layout. Output: long_variant_details.txt
xls Print all information about each variant, with one-per-row, irrespective of how many samples in which it appears. Useful for variant-centric projects. Identical to txt in layout. Output: variant_details.xls/xlsx
longxls Similar to xls above, but writes each instance of a variant to a new row. Each variant is written once per source file, instead of combining recurrent variations into 1 unique row. Identical to longtxt in layout. Output: long_variant_details.xls/xlsx
NB: The XLS format has a hard limit of 2^16 rows; in long record format, a moderate sized study could exceed this (2,000 total variants/sample * 32 samples = 65,536 rows). mucor can use Python’s xlwt module to write .xls format, but it is preferrable to have XlsxWriter or openPyxl installed for .xlsx support.
bed Print bed file of the variant locations Output: variant_locations.bed
vcf Print vcf file of the variant locations, features, depths, and variant frequencies. Output: variant_locations.vcf
featXsamp Print table of mutation counts per feature per sample. Samples are in columns, while features are in rows. The count of unique mutations per sample per feature are the table values. This output is useful for examining patterns in variation across samples, for example, to look at combinatoric mutation status for selected recurrently mutated genes. This output could be used directly to make a heatmap. Output: feature_by_sample.txt
mutXsamp Print table of mutations per sample. Unlike featXsamp, this differentiates among different variants within the same features. For example, in acute leukemia, the functional effect of mutations in DNMT3A depends on whether it is an R882 mutation or non-R882 mutation. As before, samples are in columns, with features in rows. However, rows 2-4 contain information about chromosome, position, ref, and alt. The table values are boolean: 1 for present mutation, 0 for missing mutation. This output could be used directly or with appropriate filtering to make an Oncoprint. Output: feature_and_mutation_by_sample.txt
mutXsampVAF Print table of mutations per sample. Identical to mutXsamp except boolean values are replaced by the respective variant VAF. Output: feature_and_mutation_by_sample.txt
DepthGauge, a companion utility to mucor, measures coverage at regions of interest to increase confidence in variant calls. The purpose of this tool is not only to verify sufficient depth at mutations, but more importantly, to verify sufficient depth in places that are not called mutant. The lack of a mutation call may indicate no mutation, or simply that the region of interest had insufficient coverage for analysis. Like mucor, DepthGauge is dependent only on the JSON formatted configuration file which may be entirely hand created, entirely automatically generated by mucor_config.py, or automatically generated and hand-tuned. DepthGauge has three additional options, the latter of which can override the regions, if any, specified in the JSON configuration file.
depth_gauge.py config_file.json [-p] [-c] [-r REGIONS]
-p, –point Instead of reporting an average depth for every location within a window (default), take the middle coordinate within the range and calculate the depth at that point as a surrogate for the entire region
-c, –count Instead of reporting an average depth for every location within a window (default), or taking a point estimate from the middle (-p), instead count the total number of reads mapped within the region.
-r REGIONS This option specifies a list of regions to query for depth. If present, it overrides the region(s) specified in the JSON configuration file. As an alternative, the JSON configuration file could be edited before running DepthGauge.
The –union feature will behave inappropriately when genomic feature names are duplicated on the same contig. For example, if gene “ABC” is duplicated on the beginning and the end of chromosome 1, the feature bin for gene “ABC” will cover the whole contig (from the beginning of the first copy of “ABC”, to the end of the last copy). Users may select another feature type, such as gene_id, which is unique to every copy of a gene. Otherwise, users may run the included python script [detect_union_bin_errors.py] to detect potential bin errors. It accepts a feature_type, a GTF/GFF annotation, and an output directory. The output is a text file list of feature names likely to cause large bin errors. Place this text document into the working directory where mucor will be executed. mucor will automatically search for the text file by name [‘union_incompatible_genes.txt’] in the current directory and print a message indicating when the file is found.
python ./detect_union_bin_errors.py -o ~/projects/mucor -g ~/references/human/gencode/gencode.v19.annotation.gtf -f gene_name
There is an issue when a variant file presents a contig that the pickled (archived) annotation does not have. This will throw a warning that shows how many contigs were unknown, and how many mutations were encountered on these contigs. The solution is to disable the archive feature by omitting the -a or --archive_directory option. This will permit the unknown contig in output, but all mutations on the unknown contig will be shown as having no feature.
*** WARNING: 18 Contigs and 39 mutations are in areas unknown to the genomic array of sets. If using --fast, perhaps try again without it. ***
The VCF files need to have columns #CHROM, POS … etc. The configuration script checks each VCF file for proper columns and will print a warning if any are missing or wrong. However, it does not halt execution and will include any malformed column VCF files and attempt to process them regardless. The main script may finish execution with the malformed VCF, but the output may be perturbed or useless.
File "mucor/inputs.py", in parse_VarScan position = int(row[fieldId['POS']]) KeyError: 'POS'
Users may not supply vcf files that have inconsistent ‘effect’ and/or ‘functional consequence’ for the same variant. Presumably, if the variant has the same location and reference and alternate allele, the effect and functional consequences would be predicted to be the same. The problem lies in collapsing mutations in the variant_details output type(s). This issue may arise when different platforms or pipelines are used for samples containing a common variant within a single run of mucor. The current solution is to annotate all VCF files with the same functional consequence decoration software.
Bug Reports and Issue Tracking
Check the issue tracker at the github repository. Please report bugs and request new features there for better tracking.