tools for working with a3m files for structure prediction

a3mtools

DEPRECATED: This package has been renamed from a3mtools to a3mcat. Please use the new package instead: a3mcat

Tools for working with a3m files, designed to help generate input for structure prediction tools such as AlphaFold.

>meow
---------/\_/\--
--------( o.o )-
--------- >^< --

Main features:

  • Import a3m files from MMseqs2 search results into Python objects
  • Easily slice MSAs while preserving insertions
  • Easily combine multiple MSAs into a single a3m file for complex prediction
  • Save manipulated MSAs to new a3m files
  • Convert from fasta to a3m
  • Coming soon: convert from a3m to fasta

If you have questions, find issues or bugs, or have suggestions, please open an issue on the GitHub page.

Installation

pip install a3mtools

or if you want an editable version:

git clone https://github.com/jacksonh1/a3mtools.git
cd a3mtools
pip install -e .

What these tools do:

The current implementation is designed to handle MSAs retrieved with the ColabFold MMseqs2 search tool. Those files have some quirks that I tried to account for. Mainly, query sequence names are 3-digit integers starting at 101: if an MSA has only 1 query sequence it is named 101, if there are 2 query sequences they are named 101 and 102, and so on. This comes into play when combining MSAs to predict protein complexes.

When 2 MSAs are combined, the query sequences are concatenated into a single sequence, which is required to predict the complex structure. The query sequences are also added back to the MSA as individual sequences. MSAs are combined in unpaired format.

Here's an example (with no homologous sequences):
MSA1 a3m file:

#9	1
>101
ABCDEFGHI

MSA2 a3m file:

#17	1
>101
JKLMNOPQRSTUVWXYZ

MSA1 + MSA2 a3m file:

#9,17	1,1
>101	102
ABCDEFGHIJKLMNOPQRSTUVWXYZ
>101
ABCDEFGHI-----------------
>102
---------JKLMNOPQRSTUVWXYZ

The MSA1 + MSA2 a3m file could be used as input to many structure prediction tools to predict the structure of the complex formed by the 2 query sequences.
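
The unpaired combination shown above can be sketched in plain Python. This is only an illustration of the layout, not the a3mtools implementation: the function name combine_unpaired and the list-of-(header, row) representation are assumptions made for this example, and both cardinalities are fixed at 1.

```python
def combine_unpaired(msa_a, msa_b):
    """Combine two MSAs in unpaired format (illustrative sketch only).

    Each MSA is a list of (header, row) tuples whose first entry is the query.
    """
    query_a = msa_a[0][1]
    query_b = msa_b[0][1]
    len_a, len_b = len(query_a), len(query_b)

    # Header line, then the concatenated query named "101<TAB>102".
    lines = [f"#{len_a},{len_b}\t1,1", ">101\t102", query_a + query_b]

    # Rows from the first MSA are padded on the right with gaps.
    for i, (header, row) in enumerate(msa_a):
        name = "101" if i == 0 else header
        lines += [f">{name}", row + "-" * len_b]

    # Rows from the second MSA are padded on the left with gaps.
    for i, (header, row) in enumerate(msa_b):
        name = "102" if i == 0 else header
        lines += [f">{name}", "-" * len_a + row]

    return "\n".join(lines)


msa1 = [("101", "ABCDEFGHI")]
msa2 = [("101", "JKLMNOPQRSTUVWXYZ")]
combined = combine_unpaired(msa1, msa2)
print(combined)
```

Because insertions are lowercase and do not occupy alignment columns, padding with one gap per query residue of the other MSA keeps every row aligned to the concatenated query.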

I don't know if maintaining this specific 101, 102, 103 ... naming scheme is strictly necessary, but we'll stick with it for now.

The tools also allow for slicing the MSA and saving the MSA to a file.
When slicing or combining MSAs, insertions are maintained (see output of examples).

Usage

There will eventually be a more complete guide. For now, you can install the package and run the following code to see some of the basic functionality.
See the demo notebook (demo.ipynb) for additional examples.

import a3mtools
import a3mtools.examples as examples # examples that come installed with a3mtools

import an a3m file

msa = a3mtools.MSAa3m.from_a3m_file(examples.a3m_file1)
print(msa)
#9	1
>101
ABCDEFGHI
>ortho1
xxABCDxxEFGZZ
>ortho2
A--DxxE-GHIxxxx
>ortho3
----xxEFGH-

The input to the a3mtools.MSAa3m.from_a3m_file function is the path to an a3m file.
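
For readers unfamiliar with the file layout, here is a minimal sketch of reading such a file's text into Python values. It is not how a3mtools parses files; the function name parse_a3m_text and its return shape are assumptions for illustration. It assumes the first line is a ColabFold-style header of the form "#lengths<TAB>cardinalities".

```python
def parse_a3m_text(text):
    """Parse ColabFold-style a3m text (illustrative sketch only).

    Returns (lengths, cardinalities, records), where records is a list of
    (header, row) tuples in file order.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    lengths, cardinalities = [], []

    # Optional header line such as "#9,17<TAB>1,1".
    if lines and lines[0].startswith("#"):
        length_field, cardinality_field = lines[0][1:].split()
        lengths = [int(x) for x in length_field.split(",")]
        cardinalities = [int(x) for x in cardinality_field.split(",")]
        lines = lines[1:]

    records = []
    for ln in lines:
        if ln.startswith(">"):
            records.append([ln[1:], ""])
        elif records:
            # Sequence rows may wrap over several lines, so join them.
            records[-1][1] += ln.strip()
    return lengths, cardinalities, [tuple(r) for r in records]


example = "#9\t1\n>101\nABCDEFGHI\n>ortho1\nxxABCDxxEFGZZ\n"
print(parse_a3m_text(example))
```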

slicing the alignment

print(msa[2:5])
#3	1
>101
CDE
>ortho1
CDxxE
>ortho2
-DxxE
>ortho3
--xxE
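
The idea behind slicing relative to the query can be sketched as follows. This is not the package's code: slice_a3m_row is a hypothetical helper, and it assumes that uppercase letters and "-" occupy alignment columns, that lowercase letters are insertions, and that an insertion is kept only when it sits between two retained columns.

```python
def slice_a3m_row(row, start, stop):
    """Slice one a3m row by query column index (illustrative sketch only)."""
    out = []
    col = -1  # index of the most recent alignment column seen
    for ch in row:
        if ch.islower():
            # Insertion: keep it only if it lies strictly inside the slice.
            if start <= col < stop - 1:
                out.append(ch)
        else:
            # Uppercase residue or gap: occupies one alignment column.
            col += 1
            if start <= col < stop:
                out.append(ch)
    return "".join(out)


for row in ["ABCDEFGHI", "xxABCDxxEFGZZ", "A--DxxE-GHIxxxx", "----xxEFGH-"]:
    print(slice_a3m_row(row, 2, 5))
```

Under these assumptions the four rows print CDE, CDxxE, -DxxE and --xxE, matching the sliced rows in the example output.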

concatenating alignments

msa2 = a3mtools.MSAa3m.from_a3m_file(examples.a3m_file2)
print(msa2)
#17	1
>101
JKLMNOPQRSTUVW
>ortho1
JKLMNOPQRSTUVWxxxxxx
>ortho2
------PQRSTUVW

print(msa + msa2)
#9,17	1,1
>101	102
ABCDEFGHIJKLMNOPQRSTUVW
>101
ABCDEFGHI--------------
>ortho1
xxABCDxxEFGZZ--------------
>ortho2
A--DxxE-GHIxxxx--------------
>ortho3
----xxEFGH---------------
>102
---------JKLMNOPQRSTUVW
>ortho1
---------JKLMNOPQRSTUVWxxxxxx
>ortho2
---------------PQRSTUVW

print(msa + msa2[2:5] + msa[5:])
#9,3,4	1,1,1
>101	102	103
ABCDEFGHILMNFGHI
>101
ABCDEFGHI-------
>ortho1
xxABCDxxEFGZZ-------
>ortho2
A--DxxE-GHIxxxx-------
>ortho3
----xxEFGH--------
>102
---------LMN----
>ortho1
---------LMN----
>103
------------FGHI
>ortho1
------------FGZZ
>ortho2
-------------GHI
>ortho3
------------FGH-

saving the alignment to a file

complex_msa = msa + msa2
complex_msa.save("example_complex.a3m")

creating an empty MSAa3m object

empty_msa = a3mtools.MSAa3m.empty_MSA('ABCDEFG')
print(empty_msa)
#7	1
>101
ABCDEFG
>101
ABCDEFG

Notice that the query sequence appears twice in the empty MSAa3m object. This mimics the ColabFold behavior. I do not know whether it is actually necessary.

importing a fasta file

fasta_msa = a3mtools.MSAfasta.from_fasta_file(examples.fasta_file1)
print(fasta_msa)
>q
LVT---FLAGCQ---
>a
LVTTTTFL--CQQQQ
>b
LVTTTTFLAGCQQQQ
>c
LVT---FLAGCQQQQ

convert MSAfasta to MSAa3m

To convert a fasta to an a3m, you have to select a query sequence, since a3m files are all formatted relative to a query sequence.

y = fasta_msa.to_a3m(query_header="q")
print(y)
#9	1
>101
LVTFLAGCQ
>a
LVTtttFL--CQqqq
>b
LVTtttFLAGCQqqq
>c
LVTFLAGCQqqq

notice the difference in formatting when we choose a different query sequence:

y = fasta_msa.to_a3m(query_header="a")
print(y)
#13	1
>101
LVTTTTFLCQQQQ
>q
LVT---FLagCQ---
>b
LVTTTTFLagCQQQQ
>c
LVT---FLagCQQQQ
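
The conversion rule behind these two outputs can be sketched in a few lines. This is an illustration under stated assumptions, not the MSAfasta.to_a3m implementation: the helper names aligned_to_a3m_row and to_a3m_rows are made up for this example. Columns where the query has a residue are kept as match columns; columns where the query has a gap become lowercase insertions in the other rows, or are dropped if the other row also has a gap there.

```python
def aligned_to_a3m_row(query_row, other_row):
    """Re-express one aligned fasta row relative to the query (sketch only)."""
    out = []
    for q, o in zip(query_row, other_row):
        if q != "-":
            # Query has a residue here: this is a match column.
            out.append(o.upper() if o != "-" else "-")
        elif o != "-":
            # Query has a gap but this row has a residue: an insertion.
            out.append(o.lower())
        # Gap in both rows: the column disappears.
    return "".join(out)


def to_a3m_rows(records, query_header):
    """Convert aligned (header, row) records to a3m rows, query first as 101."""
    aligned = dict(records)
    query = aligned[query_header]
    rows = [("101", query.replace("-", ""))]
    for header, row in records:
        if header != query_header:
            rows.append((header, aligned_to_a3m_row(query, row)))
    return rows


fasta_records = [
    ("q", "LVT---FLAGCQ---"),
    ("a", "LVTTTTFL--CQQQQ"),
    ("b", "LVTTTTFLAGCQQQQ"),
    ("c", "LVT---FLAGCQQQQ"),
]
for header, row in to_a3m_rows(fasta_records, "q"):
    print(f">{header}\n{row}")
```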

important notes

  • slice numbering is relative to the query sequence
  • the query sequence is always the first sequence in the MSA, and is named 101, or a combination of concatenated query names if the MSA is the result of concatenation (e.g. 101\t102)
  • slicing with a step size other than 1 is not supported yet and probably will not be supported
  • MSAs are combined in unpaired format
    • combining MSAs in paired format is not supported yet

future features:


  • documentation on readthedocs
    • examples and code
  • convert from a3m to fasta
  • allow for more generic naming of query sequence
  • add better test functions
  • an option for combining MSAs in paired format
