convertes an MSA in FASTA to a GFA1 file with imbedded paths
Project description
MSA to GFA
Is a tool to turn MSA to GFA v1 with embedded paths corrisponding to groups original sequences. Results can be then visualized with gfaviz, or bandage or any other avialble GFA visualization tools out there.
The tool is split into two separate steps, building the graph, and adding the paths.
Global optional arguments
Before supplying one of the subcommands, the tool takes two general arguments
usage: test_main.py [-h] [--log LOG_FILE] [--dir OUTDIR] {build_graph,add_paths} ...
Build GFA v1 from MSA
Subcommands:
{build_graph,add_paths}
Available subcommands
build_graph Command for building the GFA from a given MSA
add_paths Add paths to the graph that was built
optional arguments:
-h, --help show this help message and exit
--log LOG_FILE Log file name/path. Default = out_log.log
--dir OUTDIR Output directory where to put the output files, default: .
--log
name of out log file, default islog_file.log
--dir
output directory for outputs, if this is specified then you don't need to specify full path for outputs, just the name of file and this directory path will be used.
Building the Graph
The following options are available for this step when calling msa_to_gfa biuld_graph
subcommand:
usage: test_main.py build_graph [-h] [-f MSA_PATH] [--compact] [-o OUT_GFA] [-n SEQ_NAMES]
[-c COLORS]
optional arguments:
-h, --help show this help message and exit
-f MSA_PATH, --in_msa MSA_PATH
Input MSA in FASTA format
--compact If this give, the graph will be compacted before writing
-o OUT_GFA, --out OUT_GFA
Output GFA name/path
-n SEQ_NAMES, --seq_name_tsv SEQ_NAMES
A tsv with two columns, first is sequence names, second is a shortened or
abbreviated name
-c COLORS, --nodes_info COLORS
Output JSON file with nodes information
Required arguments here are:
-f, --in_msa
for input MSA in FASTA format-o, --out
the name of the output graph file, if--dir
is specified, then you do not need to give the full path, just a name and the directory you gave will be used.
Other arguments:
--compact
which is recommended to compact linear stretches of nodes in the graph-n, --seq_name_tsv
which is a TSV file with two columns, first column is the original fasta sequence ids without the ">" character, and the second column is a shorter name for the same sequences, in case the user wants to use abbreviations or a smaller sequence ids for the sequences. E.g: We have the following 3 amino acid sequences:
>seq1
-MEPTPEQ
>seq2
---T--MA
>seq3
MSETQSTQ
The TSV file you can provide would be:
seq1 s1
seq2 s2
seq3 s3
- The option
-c, --nodes_info
will output a json file with information related to the nodes: node id, node sequence, and the sequences associated with that node. highly conserved regions will have nodes with many colors.
When running the first subcommand, you'll automatically get a JSON file containing the path and grouped paths informatin, this file is needed for the second step. The graph of the previous sequences would look like
Adding paths
The JSON file outputted along with the GFA file has information related to each sequencs and its path in the graph, as well, groupings of sequences that have the same path in the graph. The following options are available with this subcommand:
usage: test_main.py add_paths [-h] [-g GFA] [--in_groups IN_GROUPS] [--all_groups]
[--some_groups SOME_GROUPS]
optional arguments:
-h, --help show this help message and exit
-g GFA, --gfa GFA input graph to add paths to
--in_groups IN_GROUPS
the groups json file
--all_groups Adds all paths to the GFA file given
--some_groups SOME_GROUPS
adds only the specified paths
Required arguments:
-g, --gfa
is GFA file from the first step--in_groups
is the JSON file outputted along with the GFA file in the first step Optional arguments:--all_groups
all groups are added to the graph as paths--some_groups
user need to give a text file with one group or sequence name on each line, if abbreviated names were used in the first step, then the abbreviations need to be used
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file msa_to_gfa-1.0.1.tar.gz
.
File metadata
- Download URL: msa_to_gfa-1.0.1.tar.gz
- Upload date:
- Size: 14.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5ef958e5e1ce5a572cff4eefea2c79fc68d989c20104bcc865331daafd741028 |
|
MD5 | e77ee08fe406cfbf9b5b43e3e9072635 |
|
BLAKE2b-256 | 4bfd694f35f05bcd329e9c37ba1dfaca25fbd2330fc6a6e86122979064aa2b6a |