PolyA is a sequence annotation adjudicator.
Project description
AAAAAAAAAAAAAAAA (PolyA):
Automatically Adjudicate Any And All Arbitrary Annotations, Astutely Adjoin Abutting Alignments, And Also Amputate Anything Amiss
A tool for adjudicating between competing annotations of biological sequences.
About
Annotation of a biological sequence is usually performed by aligning that sequence to a database of known sequence elements. When that database contains elements that are highly similar to each other, the proper annotation may be ambiguous, because several entries in the database produce high-scoring alignments. Typical annotation methods work by assigning a label based on the candidate annotation with the highest alignment score; this can overstate annotation certainty, mislabel boundaries, and fails to identify large scale rearrangements or insertions within the annotated sequence.
PolyA is a software tool that adjudicates between competing alignment-based annotations by computing estimates of annotation confidence, identifying a trace with maximal confidence, and recursively splicing/stitching inserted elements. PolyA communicates annotation certainty, identifies large scale rearrangements, and detects boundaries between neighboring elements.
Using
PolyA may be consumed as an ordinary Python package:
Install using the pip
tool:
pip install polyA
Run from the command line:
polyA -h
or
python -m polyA -h
Command Line
Command line usage is available with polyA -h
. It is also included below for
convenience.
usage: polyA [-h] [-v] [--chunk-size CHUNK_SIZE] [--confidence]
[--prior-counts FILE] [--shard-gap SHARD_GAP] [--sequences SEQS]
[--ultra-data FILE] [--easel-path BIN] [--ultra-path BIN]
[--log-file LOG] [--log-level LEVEL] [--matrix-position]
[--output-path PATH] [--sequence-position] [--soda]
ALIGNMENTS MATRICES
PolyA sequence adjudication tool
positional arguments:
ALIGNMENTS alignments file in Stockholm format
MATRICES substitution matrices file in PolyA matrix format
optional arguments:
-h, --help show this help message and exit
-v, --version show version and exit
--chunk-size CHUNK_SIZE
size of the window in base pairs analyzed together
--confidence run the confidence calculation and then exit
--prior-counts FILE file containing query genomic counts
--shard-gap SHARD_GAP
maximum alignment gap before sharding occurs
--sequences SEQS FASTA file of the target sequence for using ULTRA
--ultra-data FILE text file of the output from ULTRA ran on the FASTA file
of the target sequence
--easel-path BIN path to the esl_scorematrix program, if necessary
(assumed to be in PATH)
--ultra-path BIN path to the ULTRA binary to use, if necessary (assumed
to be in PATH)
--log-file LOG file to store log output in, defaults to stderr
--log-level LEVEL logging level to use, 'debug' is the most noisy
--matrix-position produce output in terms of the matrix position
--output-path PATH directory to write output files to, defaults to
working directory
--sequence-position produce output in terms of the target sequence
position
--soda write a SODA visualization file to the output
directory
--complexity-adjustment complexity-adjust alignment scores (scores of matches
between sequence regions of biassed nucleotide
composition are adjusted downwards)
Input Formats
PolyA accepts two required inputs and several optional inputs that affect its behavior. The required inputs are an alignment file which must contain alignments for all possible queries matching the target sequence. This file must be in Stockholm format with several custom metadata fields. The other required input is a set of substitution matrices. This file uses a custom, but extremely simple format.
Alignment File Format
Alignments for all possible queries matching the target sequence should be contained in a single file in Stockholm format.
There are several special metadata fields that must exist for each alignment in this file. See the example below. An explanation is indented to the right of each field with additional detail as noted.
#=GF ID MERX#DNA/TcMar-Tigger query sequence (1)
#=GF TR chr1:11543-28567 target sequence
#=GF SC 1153 alignment score
#=GF SD + strand
#=GF TQ -1 (2)
#=GF ST 127 alignment start position on target
#=GF SP 601 alignment stop position on target
#=GF CST 135 alignment start position on query
#=GF CSP 628 alignment stop position on query
#=GF FL 128 (3)
#=GF MX matrix_name (4)
#=GF GI -25 gap init
#=GF GE -5 gap extension
- (1) query sequence names must be in the format 'name#family/class'
- (2) valid values: 'q' if the alignment is on the reverse strand and the reversed sequence is the query; 't' if the alignment is on the reverse strand and the reversed sequence is the target; '-1' if the alignment is on the positive strand
- (3) the flanking region of the unaligned query sequence
- (4) the name of the substitution matrix file used to create alignment
Creating Alignment files from cross_match alignments
We include a parser to convert cross_match alignment files to stockholm
alignments. This can be executed with the cm_to_stockholm
script (installed
with PolyA) or through the polyA.converters
module (python -m polyA.converters.cm_to_stockholm
).
usage: cm_to_stockholm align_file.cm
outputs (can be input directly into polyA):
align_file.cm.sto
align_file.cm.matrix
Substitution Matrix Files
Substitution matrix file example format (can include ambiguity codes):
- this file must include all of the matrices specified in the "#=GF MX" field of the alignment file, with corresponding and matching matrix names
- if lambda is not included polyA will use esl_scorematrix to calculate it for all matrices
matrix_name lambda(optional)
A G C T N
8 -6 -13 -15 -1
-2 10 -13 -13 -1
-13 -13 10 -2 -1
-15 -13 -6 8 -1
-1 -1 -1 -1 -1
//
matrix_name2 lambda2(optional)
A G C T N
8 -6 -13 -15 -1
-2 10 -13 -13 -1
-13 -13 10 -2 -1
-15 -13 -6 8 -1
-1 -1 -1 -1 -1
//
...
Sequence File (Optional)
A FASTA file of the target sequence is needed when using ULTRA. The target sequence must be the same genomic region that was used to get the cross_match alignment file. This file must follow the format of
>chrom:start-end
target_sequence
as shown in the example.
Sequence file example format:
>chr1:152302175-152325203
AATAGTTTATTTTTAATTTAGATGCAGCTTACTATAATATTAATTATGTCCAAGATGATT
TTTTGAATACAGAATACTAGAATTCCAATAGAAGGATAATAGAGAAAGATGTGCTAGCCC
...
Output Formats
start stop ID name
----------------------------------------
11990879 11991268 eaa042dd09f944f68dba2fd4727c64e2 LTR40a#LTR/ERVL
11991272 11991444 fb5ef5e0e2ca4e05837ddc34ca7ef9e4 MSTA1#LTR/ERVL-MaLR
11991445 11991562 bdfc4039b7d947d0b25bf1115cc282ed AluJr4#SINE/Alu
11991563 11991573 4871d91441a146209b98f645feae68c8 FLAM_C#SINE/Alu
11991574 11991818 fb5ef5e0e2ca4e05837ddc34ca7ef9e4 MSTB1#LTR/ERVL-MaLR
11991819 11991875 eaa042dd09f944f68dba2fd4727c64e2 LTR40a#LTR/ERVL
* Matching IDnums correspond to partial sequences that originate from
the same ancestral sequence.
Confidence only output file format
Computes confidence of a single input alignment region. Does not perform annotation or adjudication, simply outputs the confidence of all competing queries given in the input.
query_label confidence
LTR40a#LTR/ERVL 0.875
LTR40b#LTR/ERVL 0.052
LTR40c#LTR/ERVL 0.001
...
Extensions
Visualizing annotations using SODA
The command line option --soda will output the annotation data to a json file (output.0.viz) that can be used for visualization in SODA (linked below). The json file can be submitted on the browser to view the TE annotations from PolyA as well as the annotations from the UCSC Genome Browser for the same region of the human genome (hg38). The PolyA visualization can display the confidence values for all competing annotations of a selected region as well as their corresponding sequence alignments.
https://sodaviz.cs.umt.edu/polya-soda.html
Prior Counts Files
Default confidence calculations assume a uniform distribution over all competing queries. In the case of non uniform priors, the command line option --prior-counts prior_counts.txt includes prior genome counts in confidence calculations (see paper for more details).
https://www.biorxiv.org/content/10.1101/2021.02.13.430877v1
Prior counts file example format:
subfamily genome_count
AluYk2 6855
LTR38 255
L1PA7_5end 13261
...
Using ULTRA
The optional use of ULTRA allows polyA to include tandem repeats (TRs) in the competing annotations of the target sequence. Doing so removes the dependency on pre-masking TRs prior to annotation, allows TRs to outcompete potentially weak fragmentary family annotation , and allows a family annotation to outcompete a TR. The command line option --sequences seq.fasta (with --ultra-path if necessary) will run ULTRA with polyA or --ultra-data ultra_data.txt can be used if ULTRA was ran on seq.fasta prior.
Additional software
The esl_scorematrix
utility is a part of the Easel package in the HMMER
software suite. Its source code is available at
[https://github.com/EddyRivasLab/easel/].
We use it to compute a lambda
value for each score matrix, unless one is given
in the matrix file. The easiest way to run the utility, either manually or as
part of the PolyA pipeline is to use the Docker
container we
provide.
It is then possible to run esl_scorematrix
with the following command (which
just shows the help information):
docker run --mount src="${PWD}",target=/data,type=bind traviswheelerlab/polya-esl_scorematrix -h
Paths passed to the command will need to be relative to /data
, which is
mounted as the current working directory outside the container.
- will detect tandem repeat regions in a sequence FASTA file
Development
This project uses Pipenv, which can be installed through Homebrew for Mac users. It must be installed before the Makefile targets, or the other commands listed in this document will work.
In order to run a command which relies on the project virtual environment, such
as python foo.py
, it is necessary to either run pipenv shell
first, which
will put you into a shell that has the correct Python in its PATH
, or prefix
the command with pipenv run
(e.g. pipenv run python foo.py
).
To build a PyPI package, use make build-package
. To publish, use
make publish-package
. We use Flit to
handle building and publishing a package, so it is also possible to invoke Flit
directly to do anything else it is capable of.
Makefile
A Makefile is available, run make
or make help
in the project root to see
the available targets.
Dependencies
If you prefer not to use the Makefile, or if you need to add or remove dependencies, the following commands will allow you to manage dependencies.
# Get to work (from within project directory)
# This drops you into a shell inside the project virtual environment which means
# that commands that "pipenv run" may be elided from other commands.
pipenv shell
# Fetch runtime dependencies
pipenv install
# Fetch runtime and development dependencies
pipenv install --dev
# Add a runtime dependency
pipenv install <package>
# Add a development dependency
pipenv install --dev <package>
Docker Image
There is a Dockerfile
in the repo root. The image it describes is
used for running tests in CI and can be used locally for convenience.
When dependencies change the image must be rebuilt and the new
version pushed to Docker Hub. This can be done with make container
if you have the correct permissions. Otherwise, ask a maintainer
to do it for you.
Unit Tests
Unit tests use pytest. The linked documentation contains examples of how to use it to both write and run tests. The examples are quite extensive.
Documentation tests are also fine for simple cases.
# Run tests
pipenv run python -m pytest
# or
make check-fast check-slow
# or
make check
Documentation
We use Sphinx for project documentation. Run make docs
to update the
documentation sources and build the HTML. Use make docs-serve
to serve the
documentation locally.
License
BSD license. See LICENSE
.
Authors
Wheeler Lab at the University of Montana.
- Kaitlin Carey
- Audrey Shingleton
- George Lesica
- Jack Roddy
- Travis Wheeler
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for polyA-1.0.0-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1b4d89769ebdcae0959d13681c7c4eefe620df44c977e703561bc52d25439671 |
|
MD5 | d03677b8fd6896b2b981969801fb1661 |
|
BLAKE2b-256 | 5681a4c1c1a6cd2276522e9e9fa38aba023cb98311f12b53b10c6cc7bd39f4a8 |