System for turnkey analysis of semi-automated genome annotations
Reason this release was yanked:
extra file tmp.py contains non python code and can cause errors when installing.
Project description
Segzoo
What is segzoo?
Segzoo is a tool that automatically runs various genomic analyses on a segmentation obtained with Segway. The results of each analysis are made available, along with a visualization that summarizes them. The requirements for this tool include segtools, bedtools and several python packages, but all of them are dependencies that will be taken care of during installation.
Quick start
This quick start requires anaconda to be already installed on your local computer (with either python 2 or 3).
- Download the test segmentation and GMTK parameters and move them both into a directory called, for example, segzoo
- Open a terminal in the mentioned directory and run
conda create -c bioconda -n segzooenv python=3.6 r-base r-latticeextra r-reshape2 r-cairo r-cluster bedtools -y
- After the last command has finished, run
source activate segzooenv
followed by
pip install segzoo
- When finished, run
segzoo segway.bed.gz --parameters params.params
- After around 30 min, the resulting visualization will be stored in the outdir/plots folder of the current directory
How to install
Segzoo is a python 3 tool, so if you have python 2 installed it is highly recommended that you install segzoo in a separate python 3 environment.
Although it can run there, this tool is not designed to run on a cluster node without internet access, so all the following steps should be done on a local computer.
To create such an environment, run
conda create -n segzooenv python=3.8.* seaborn segtools ggd snakemake pybedtools -y
where you can change the name of the environment, segzooenv.
Next, you need to activate this environment. Run
conda activate segzooenv
specifying the name of the environment you chose before.
Now that you are inside the environment, you can install segzoo by running
pip install segzoo
Note: work is in progress to upload Segzoo to bioconda.
When this is finished, it will be possible to install it just by using
conda install -c bioconda segzoo
which will take care of all the dependencies.
After accepting all installations, segzoo will be good to go!
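As a quick sanity check after installation, the following minimal sketch reports which command-line tools are not visible on your PATH. The helper name find_missing_tools is hypothetical and is not part of segzoo; the list of tools checked is only an assumption based on the requirements mentioned above.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed list of executables; adjust it to your own setup.
    missing = find_missing_tools(["segzoo", "bedtools"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```

Run it with the segzooenv environment activated, since the tools are only on PATH inside that environment.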
How to use
To see the help on how to run segzoo, run segzoo -h or segzoo --help. Here's a look at all possible arguments:
- --version to check the current version of segzoo installed
- --parameters to specify a params.params file resulting from Segway's training, in order to obtain GMTK parameters in the final visualization. If not specified, GMTK parameters won't show in the final visualization
- --prefix to specify where you want all needed data (like the genome assembly) to be downloaded (default: the installation environment's directory)
- -o or --outdir to specify the folder where all the results and the final visualization will be created (default: outdir)
- -j to specify the number of cores to use (default: 1)
- --species and --build specify the species and the build for which the segmentation was created (default: Homo_sapiens and hg38)
- --download-only is an option to support cluster use. Running Segzoo with this argument will only run the downloading rules of the pipeline and store the data using the specified prefix. After that, runs on nodes without internet access can be done by specifying that same prefix
- --mne allows you to specify an mne file to translate the segment labels and track names shown in the figure. See the Using mne files section for details.
- --normalize-gmtk allows normalization of the GMTK parameters table row-wise (i.e. across a segmentation label)
- --dendrogram is an option to perform hierarchical clustering of GMTK parameters row-wise
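The two-step cluster workflow described for --download-only can be sketched as follows. This is only an illustration: build_segzoo_command is a hypothetical helper, the shared_data prefix is a made-up path, and passing the segmentation file together with --download-only is an assumption rather than documented behaviour. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def build_segzoo_command(segmentation, parameters=None, prefix=None,
                         outdir=None, cores=None, download_only=False):
    """Assemble a segzoo command line as a list of arguments."""
    command = ["segzoo", segmentation]
    if parameters is not None:
        command += ["--parameters", parameters]
    if prefix is not None:
        command += ["--prefix", prefix]
    if outdir is not None:
        command += ["--outdir", outdir]
    if cores is not None:
        command += ["-j", str(cores)]
    if download_only:
        command.append("--download-only")
    return command


# Step 1, on a machine with internet access: only download the data.
download = build_segzoo_command("segway.bed.gz", prefix="shared_data",
                                download_only=True)

# Step 2, on a node without internet access: reuse the same prefix.
offline = build_segzoo_command("segway.bed.gz", parameters="params.params",
                               prefix="shared_data", cores=4)

print(shlex.join(download))
print(shlex.join(offline))
```

To actually launch a command built this way, the list can be passed to subprocess.run(command, check=True).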
If you are interested in obtaining information on gene biotypes other than protein coding and lincRNA, which are the defaults,
you can go to the installation folder of segzoo and modify the file gene_biotypes.py
as you wish.
The same can be said for the final visualization, which can be altered by modifying some variables at the top of visualization.py
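If you are unsure where the installation folder is, a minimal sketch like the one below may help locate these files. It assumes that the package is importable under the name segzoo and that both files sit at the top level of the package; neither assumption is confirmed here, and locate_package_file is a hypothetical helper.

```python
import importlib.util
from pathlib import Path


def locate_package_file(package_name, file_name):
    """Return the path of file_name inside an installed package, or None."""
    spec = importlib.util.find_spec(package_name)
    # A missing package gives None; a plain module has no search locations.
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / file_name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumed package name and file layout; prints None when not found.
    for name in ("gene_biotypes.py", "visualization.py"):
        print(name, "->", locate_package_file("segzoo", name))
```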
After you run the segzoo command, specifying the segmentation file and any optional arguments you want, the execution of the pipeline will begin.
All necessary data will be downloaded, the tools will run the different analyses and the final visualization will be created. This execution may take some time.
Results
Once the execution has finished, a new directory will have been created (outdir is the default name). In its data folder you will find the results of all the tools' analyses. In results you will find the tables of processed results used in the visualization. Finally, the visualization will be in the plots directory. It will look something like this:
The y-axis of every heatmap shows the labels of the segmentation, while the x-axis shows the different results obtained for each of them.
- On the left are the parameters learned during the training of Segway.
- Next is a heatmap in which each column is normalized so that its maximum and minimum values are the limits of the color map used. This applies to all columns except the GC content, which is always normalized between 35% and 65%. All this information is displayed in the table below
- The aggregation tables are shown in the same order as specified in gene_biotypes.py, and can contain duplicates
- The aggregation results displayed for each label are the percentage of counts in one component relative to the whole idealized gene, so note that each row adds up to 100
- The number of genes found for each biotype shown is specified after the biotype's name
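The two normalizations described in the list above can be illustrated with a small sketch. This is not segzoo's actual code: the function names are hypothetical, and clamping values that fall outside a fixed range (such as 35 to 65 for GC content) is an assumption about how out-of-range values are handled.

```python
def minmax_columns(rows, fixed_limits=None):
    """Scale each column of a table to the range 0..1.

    rows is a list of equal-length lists of numbers. fixed_limits maps a
    column index to a (low, high) pair used instead of the column's own
    minimum and maximum.
    """
    fixed_limits = fixed_limits or {}
    n_cols = len(rows[0])
    scaled = [[0.0] * n_cols for _ in rows]
    for col in range(n_cols):
        values = [row[col] for row in rows]
        low, high = fixed_limits.get(col, (min(values), max(values)))
        span = high - low
        for i, value in enumerate(values):
            if span == 0:
                scaled[i][col] = 0.0  # constant column: nothing to scale
            else:
                # Assumption: values outside a fixed range are clamped.
                scaled[i][col] = min(1.0, max(0.0, (value - low) / span))
    return scaled


def row_percentages(counts):
    """Convert the counts of one row into percentages that sum to 100."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [100.0 * count / total for count in counts]


# Column 0 uses its own min and max; column 1 (GC content, in %) is fixed.
table = [[1, 40], [3, 50], [5, 65]]
print(minmax_columns(table, fixed_limits={1: (35, 65)}))
print(row_percentages([1, 1, 2]))
```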
Using mne files
The mne file can be used to translate segment labels and track names in the final figure.
The file is tab delimited and should contain three columns in any order. Each row represents a translation rule. The columns are defined as follows:
- old: the original label or track name that you can see from running segzoo with default parameters. The values in this column will be the keys in a python dict or lookup table.
- new: the value that replaces the old value.
- type: indicates whether the row should be used to translate a track or a label. It is specifically useful when tracks and labels have the same old name.
The file header is mandatory and should contain the three fields listed above: old, new and type.
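One way to produce such a file without worrying about stray spaces is to write it with python's csv module. The sketch below is only an illustration of the format described above; write_mne is a hypothetical helper and is not part of segzoo.

```python
import csv
import io


def write_mne(handle, rules):
    """Write translation rules as a tab-delimited table with a header.

    rules is a list of dicts with the keys old, new and type.
    """
    writer = csv.DictWriter(handle, fieldnames=["old", "new", "type"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rules)


# Write to an in-memory buffer here; a real file opened with
# open(path, "w", newline="") works the same way.
buffer = io.StringIO()
write_mne(buffer, [
    {"old": "0", "new": "Quiescent", "type": "label"},
    {"old": "H3K4me3_robust_peaks", "new": "H3K4me3", "type": "track"},
])
print(buffer.getvalue())
```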
Note that only the tracks and labels defined in the mne file will be updated. Specifically, it is possible to define more rows than needed in order to reuse the same file for different projects. The tracks and labels that are not defined in the mne file will remain unchanged.
Example of mne file:
old new type
0 Quiescent label
1 TSS label
H3K4me3_robust_peaks H3K4me3 track
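To make the translation rules concrete, here is a minimal sketch of how a file like the example above could be read into lookup tables and applied. It is not segzoo's actual implementation; load_mne and translate are hypothetical names, and the example content is the table above with tabs between the columns.

```python
import csv
import io


def load_mne(handle):
    """Read a tab-delimited mne table into one lookup dict per type."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"old", "new", "type"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("mne header is missing: " + ", ".join(sorted(missing)))
    tables = {"label": {}, "track": {}}
    for row in reader:
        tables.setdefault(row["type"], {})[row["old"]] = row["new"]
    return tables


def translate(name, kind, tables):
    """Return the new name for a label or track, or name itself if unlisted."""
    return tables.get(kind, {}).get(name, name)


EXAMPLE = (
    "old\tnew\ttype\n"
    "0\tQuiescent\tlabel\n"
    "1\tTSS\tlabel\n"
    "H3K4me3_robust_peaks\tH3K4me3\ttrack\n"
)

tables = load_mne(io.StringIO(EXAMPLE))
print(translate("0", "label", tables))  # listed label: translated
print(translate("2", "label", tables))  # unlisted label: left unchanged
print(translate("0", "track", tables))  # same old name, different type
```

Keeping one dict per type is what lets a track and a label share the same old name without colliding, and reading by header name means the three columns can appear in any order.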