
Extracts mutational signatures from mutational catalogues


SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

PREREQUISITES

This module requires all the prerequisites of SigProfilerMatrixGenerator and SigProfilerPlotting, plus the packages listed in the table below:

Package                    Version          How to install
nimfa                      1.3.4 or more    conda install -c cdeepakroy nimfa
sigProfilerPlotting        1.1 or more      conda install sigProfilerPlotting
scipy                      1.1 or more      conda install scipy
scikit-learn               0.19.1 or more   conda install scikit-learn
mpi4py (only to use MPI)   2.0 or more      conda install mpi4py
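
The minimum versions in the table can be checked from Python before installing. The following is a minimal sketch, not part of SigProfilerExtractor: the helper names are hypothetical, and the distribution names used for the lookup are assumed to match the package names in the table, which may not hold on every system.

```python
import re
from importlib import metadata

# Minimum versions copied from the table above. The lookup names are an
# assumption; a package may be registered under a different name.
MINIMUMS = {
    "nimfa": "1.3.4",
    "scipy": "1.1",
    "scikit-learn": "0.19.1",
}

def version_tuple(text):
    """Turn '1.3.4' into (1, 3, 4), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match:
            parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_prerequisites(minimums=MINIMUMS):
    """Return 'ok', 'too old' or 'missing' for each package name."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report

if __name__ == "__main__":
    for name, status in check_prerequisites().items():
        print(name, status)
```

The comparison is deliberately simple and ignores pre-release suffixes, so treat its result as a hint only.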

"Conda" is a cross-platform, open source package manager that can be used to install different versions of software packages and libraries. To get help to install and use conda, please visit SigProfilerMatrixGenerator. Currently, SigProfilerExtractor supports only the Linux/Unix operating system.

INSTALLATION

To install SigProfilerExtractor, download the SigProfilerExtractor repository from GitHub.

After downloading the repository, enter the "SigProfilerExtractor" directory and run the following command:

bash installer.sh

This command installs the program, which normally takes a couple of hours. After SigProfilerExtractor is installed, it is ready to use from the Linux/Unix command line.

HOW TO USE

Accurately extracting signatures of mutational processes is computationally intensive, so SigProfilerExtractor is usually executed on a computational cluster. Parallel execution can be performed in two ways:

  1. Multiprocessing
  2. Message Passing Interface (MPI)

The following arguments are used to run SigProfilerExtractor.

COMMANDS

Required Arguments--

These arguments are required to run SigProfilerExtractor.

-t, --type: The input type. There are three available input types: "vcf", "text", "matobj". The "vcf" input type loads the mutational catalogue from variant caller data.

  • To use the "vcf" input type, first complete the following steps:

      1. Create a new folder for each project/job that you run within the SigProfilerMatrixGenerator/references/vcf files/ folder. Use a unique name for each project/job.
      2. Separate your INDEL mutations from your SNV mutations if they are present in the same files, and create a folder for each mutation type (ex: SigProfilerMatrixGenerator/references/vcf files/[project]/SNV/ or SigProfilerMatrixGenerator/references/vcf files/[project]/INDEL/).
      3. Place your vcf files within these new folders (either SigProfilerMatrixGenerator/references/vcf files/[project]/SNV/ or SigProfilerMatrixGenerator/references/vcf files/[project]/INDEL/).

  • To use the "matobj" input type, first place the MATLAB object file in the input folder.

  • To use the "text" input type, first place the tab-delimited text file (usually generated by SigProfilerMatrixGenerator) in the input folder.
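
The folder layout for "vcf" input described above can be created with a short script. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and the "vcf files" folder name is taken as written above, so it is exposed as a parameter in case your installation uses a different name.

```python
from pathlib import Path

def make_vcf_project_dirs(matrix_generator_root, project,
                          mutation_types=("SNV", "INDEL"),
                          vcf_dir_name="vcf files"):
    """Create <root>/references/<vcf_dir_name>/<project>/<type>/ for each type.

    vcf_dir_name defaults to the folder name as written in this document;
    change it if your SigProfilerMatrixGenerator copy names it differently.
    """
    base = Path(matrix_generator_root) / "references" / vcf_dir_name / project
    created = []
    for mutation_type in mutation_types:
        folder = base / mutation_type
        # parents=True builds the intermediate folders; exist_ok=True makes
        # the call safe to repeat for an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created

if __name__ == "__main__":
    for folder in make_vcf_project_dirs("SigProfilerMatrixGenerator", "projectA"):
        print(folder)
```

After the folders exist, the SNV and INDEL vcf files still have to be copied into them by hand or by a separate step.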

-o, --output: Users have to set the name of the output directory where the results will be stored.

Semi Required Arguments--

These arguments are required depending on the input type.

-p or --project: Name of the project folder (created earlier). This argument is mandatory for the "vcf" type input.

-r or --refgen: Name of the reference genome. This argument is mandatory for the "vcf" type input.

-i or --inputfile: The name of the input file. This argument is mandatory for "text" or "matobj" type input.
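
The dependency between the input type and these arguments can be written as a small lookup. The sketch below is illustrative only and is not taken from the tool's source; the function name and the dictionary of supplied values are hypothetical.

```python
# Which long-form arguments each input type needs, as described above.
REQUIRED_BY_TYPE = {
    "vcf": ("project", "refgen"),
    "text": ("inputfile",),
    "matobj": ("inputfile",),
}

def missing_arguments(input_type, supplied):
    """Return the names of required arguments that are absent or empty.

    supplied is a plain dict such as {"project": "projectA"}.
    """
    if input_type not in REQUIRED_BY_TYPE:
        raise ValueError("unknown input type: " + repr(input_type))
    return [name for name in REQUIRED_BY_TYPE[input_type]
            if not supplied.get(name)]

if __name__ == "__main__":
    print(missing_arguments("vcf", {"project": "projectA"}))
```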

Optional Arguments--

These arguments are optional when running SigProfilerExtractor.

-s or --startprocess: The minimum number of processes to be extracted. The default value is 1.

-e or --endprocess: The maximum number of processes to be extracted. The default value is 2.

-n or --n_iterations: The number of iterations to be executed. The default value is 3.

-c or --cpu: The number of CPUs to use for parallel computation. By default, all available CPUs are used.

-m or --mtypes: The mutation contexts to analyze. This argument is valid only when the input type is "vcf". Pass the intended mutation types separated by commas "," with no spaces. The SigProfiler engine will analyze only the mutation types passed to this argument. The valid mutation types are 6, 12, 96, 1536, 192, 3072 and DINUC. For example, to analyze mutation types 96, 192 and DINUC, pass "--mtypes 96,192,DINUC". If the argument is not used, "96", "DINUC" and "INDEL" (if --indel is used) will be extracted.

-l or --layer: Optional parameter that sets whether the signatures will be extracted in a hierarchical manner.

--indel: Optional parameter that instructs the script to create the catalogue for limited INDELs. This parameter is valid only for the "vcf" input.

--extended_indel: Optional parameter that instructs the script to create the catalogue for extended INDELs. This parameter is valid only for the "vcf" input.

--exome: Optional parameter that instructs the script to create the catalogues using only the exome regions. The whole genome context is used by default. This parameter is valid only for the "vcf" input.
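
The format of the --mtypes value can be illustrated with a small parser. This is a hypothetical sketch and not the tool's own parsing code. It accepts "INDEL" in addition to the listed types because Example 1 passes it to -m; that choice is an assumption.

```python
# Mutation types listed in the description of --mtypes.
LISTED_MTYPES = ("6", "12", "96", "1536", "192", "3072", "DINUC")
# Assumption: "INDEL" is also accepted, since Example 1 passes it to -m.
ACCEPTED_MTYPES = LISTED_MTYPES + ("INDEL",)

def parse_mtypes(value=None, indel=False):
    """Split a comma-separated --mtypes value and reject unknown entries.

    With no value, fall back to the documented defaults: "96" and "DINUC",
    plus "INDEL" when --indel is used.
    """
    if value is None:
        defaults = ["96", "DINUC"]
        if indel:
            defaults.append("INDEL")
        return defaults
    items = value.split(",")
    # An entry with a leading space (for example " 192") fails this check,
    # which enforces the "no spaces" rule.
    unknown = [item for item in items if item not in ACCEPTED_MTYPES]
    if unknown:
        raise ValueError("unsupported mutation type(s): "
                         + ", ".join(repr(item) for item in unknown))
    return items

if __name__ == "__main__":
    print(parse_mtypes("96,192,DINUC"))
```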

EXECUTION

USING MULTIPROCESSING

To execute a job using SigProfilerExtractor, run the sigpro.py file from the source directory inside SigProfilerExtractor, passing it the arguments described above.

Example 1:

python3 sigpro.py -t vcf -o results -p projectA -r GRCh37 -s 1 -e 10 -n 500 -c 8 -m "96","192","INDEL" -l --indel

The above command will extract the mutational processes from a "vcf" input type, from "projectA", using the reference genome "GRCh37". The start process will be 1, the end process 10, the number of iterations 500, and the number of CPUs participating in the computation 8. Processes will be extracted hierarchically (since -l is there) for the mutational contexts "96", "192" and "INDEL" (since --indel is there). Finally, the output will be stored in the "results" folder in the SigProfilerExtractor directory.

USING MESSAGE PASSING INTERFACE (MPI)

To execute a job using MPI, run the mpi_sigpro.py file from the source directory inside SigProfilerExtractor, passing it the arguments described above. In addition, add the mpiexec command and its -n parameter before the original command that runs the Python file. Here, n is the number of CPUs to be used. The "-c/--cpu" and "-l/--layer" arguments are NOT applicable for mpi_sigpro.py.

Example 2:

mpiexec -n 8 python3 mpi_sigpro.py -t vcf -o results -p projectA -r GRCh37 -s 1 -e 10 -n 500 -m "96","192","INDEL" --indel

The above command will perform a similar task to Example 1, except that the processes are not extracted in a hierarchical manner, since the "-l/--layer" argument is not applicable for mpi_sigpro.py.
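
The two examples differ only in the launcher, the script name, and the arguments that MPI does not accept. The sketch below assembles either command as an argument list. It is a hypothetical helper, not part of SigProfilerExtractor, and it only builds the list without running anything.

```python
def build_command(input_type, output, project, refgen, start, end,
                  iterations, mtypes, indel=False, cpu=None, layer=False,
                  mpi_processes=None):
    """Return the argument list for sigpro.py, or for mpi_sigpro.py when
    mpi_processes is given."""
    use_mpi = mpi_processes is not None
    # mpiexec and its own -n go before the python command.
    command = ["mpiexec", "-n", str(mpi_processes)] if use_mpi else []
    command += ["python3", "mpi_sigpro.py" if use_mpi else "sigpro.py"]
    command += ["-t", input_type, "-o", output, "-p", project, "-r", refgen,
                "-s", str(start), "-e", str(end), "-n", str(iterations)]
    # -c/--cpu and -l/--layer are not applicable for mpi_sigpro.py.
    if cpu is not None and not use_mpi:
        command += ["-c", str(cpu)]
    command += ["-m", ",".join(mtypes)]
    if layer and not use_mpi:
        command.append("-l")
    if indel:
        command.append("--indel")
    return command

if __name__ == "__main__":
    settings = dict(input_type="vcf", output="results", project="projectA",
                    refgen="GRCh37", start=1, end=10, iterations=500,
                    mtypes=["96", "192", "INDEL"], indel=True)
    print(" ".join(build_command(cpu=8, layer=True, **settings)))
    print(" ".join(build_command(mpi_processes=8, **settings)))
```

A list built this way could be passed to subprocess.run with the working directory set to the source directory, which avoids shell quoting issues around the -m value.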

OUTPUT

After SigProfilerExtractor executes successfully, an output directory named after the value of the -o/--output argument is created inside the SigProfilerExtractor directory. The output directory contains one subfolder for each mutational context. If the -l/--layer parameter was used to extract the mutational processes hierarchically, each mutational context subdirectory contains one subdirectory per layer (L). Each layer (L) subdirectory contains two subdirectories, "All solutions" and "Selected solution", an image file showing the "reconstruction error vs stability plot" for each number of processes, and a CSV file listing the reconstruction errors and stabilities for each number of processes. The "All solutions" subdirectory contains one subdirectory for each number of processes. Each of these contains the processes file, the exposures file, the probability file (the probability that each process caused each mutation type in a given sample) and a plot of the signatures. The "Selected solution" directory contains the same files for the number of mutational processes selected automatically by SigProfilerExtractor.
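
A results tree with this layout can be summarized with a short script. This is a sketch under stated assumptions: the function name is hypothetical, and the folder name "All solutions" is taken from the description above, so it is exposed as a parameter in case the tool spells it differently on disk.

```python
from pathlib import Path

def list_solutions(output_dir, all_solutions_name="All solutions"):
    """Map each folder that holds an 'All solutions' directory to the sorted
    names of the per-number-of-processes subdirectories inside it.

    Keys are paths relative to output_dir, for example a mutational context
    folder followed by a layer folder.
    """
    output_dir = Path(output_dir)
    found = {}
    # rglob searches every level, so the same code works with or without
    # the extra layer (L) directories.
    for folder in output_dir.rglob(all_solutions_name):
        if folder.is_dir():
            key = str(folder.parent.relative_to(output_dir))
            found[key] = sorted(p.name for p in folder.iterdir() if p.is_dir())
    return found

if __name__ == "__main__":
    for location, solutions in list_solutions("results").items():
        print(location, solutions)
```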

COPYRIGHT

This software and its documentation are copyright 2018 as a part of the sigProfiler project. The SigProfilerExtractor framework is free software and is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

CONTACT INFORMATION

Please address any queries or bug reports to S M Ashiqul Islam (Mishu) at m0islam@ucsd.edu


