A Python library for working with protein-containing FASTA files.

ProFASTA

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Introduction

ProFASTA is a Python library for working with FASTA files containing protein records. It prioritizes simplicity while providing a practical set of features for proteomics-based mass spectrometry workflows.

Core functionality includes:

  • Parsing and writing FASTA files via profasta.io
  • Structured header parsing via a registry of built-in and user-defined parsers
  • A protein database (ProteinDatabase) for managing entries loaded from one or more FASTA files
  • Decoy database generation by sequence reversal
  • Header validation for non-ASCII characters

ProFASTA is developed as part of the computational toolbox for the Mass Spectrometry Facility at the Max Perutz Labs (University of Vienna).

Similar projects

If ProFASTA doesn't meet your requirements, consider exploring these alternative Python packages with a focus on protein-containing FASTA files:

  • fastapy is a lightweight package with no dependencies that offers FASTA reading functionality.
  • protfasta is another library with no dependencies that provides reading functionality along with basic validation (e.g., duplicate headers, conversion of non-canonical amino acids). The library also allows writing FASTA files with the ability to specify the sequence line length.
  • pyteomics is a feature-rich package that provides tools to handle various sorts of proteomics data. It provides functions for FASTA reading, automatic parsing of headers (in various formats defined at uniprot.org), writing, and generation of decoy entries. Note that pyteomics is a large package with many dependencies.

Requirements

Python >= 3.11

ProFASTA has no dependencies beyond the Python standard library.

Installation

Install from PyPI:

pip install profasta

Key concepts

FASTA parsing

The profasta.io.parse_fasta function reads a FASTA file and yields FastaRecord objects. Sequences are automatically normalized: letters are converted to uppercase, spaces are removed, and trailing * characters are stripped.

import profasta.io

with open("proteins.fasta", "r") as f:
    for record in profasta.io.parse_fasta(f):
        print(record.header, record.sequence)
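The normalization rules described above can be illustrated without the library. The following is a minimal sketch, not ProFASTA's implementation; the helper name normalize_sequence and the order of the steps are assumptions made for illustration.

```python
def normalize_sequence(raw: str) -> str:
    """Hypothetical helper: uppercase, drop spaces, strip trailing '*'."""
    # Remove spaces and convert all letters to uppercase.
    cleaned = raw.replace(" ", "").upper()
    # Strip any trailing '*' characters (often used as a stop symbol).
    return cleaned.rstrip("*")


print(normalize_sequence("mkt ayia kqr*"))  # MKTAYIAKQR
```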

Header parsers and the registry

ProFASTA uses a registry system for header parsers and writers. Built-in parsers are registered under the following names:

Name             Description
"default"        Splits on the first whitespace; never fails
"uniprot"        Strict UniProt format parser
"uniprot_like"   Tolerant UniProt-like format parser

Built-in writers follow the same naming convention and include an additional "decoy" writer that prepends a rev_ tag to the header.

Custom parsers and writers can be registered via:

profasta.parser.register_parser("my_parser", MyParser)
profasta.parser.register_writer("my_writer", MyWriter)

A parser must implement a parse(header: str) -> ParsedHeader classmethod, and a writer must implement a write(parsed_header: ParsedHeader) -> str classmethod.
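To make the parser and writer contract more concrete, here is a self-contained sketch of a matching pair. It does not import ProFASTA: the ParsedHeader dataclass below is a stand-in whose fields (identifier, header, header_fields) are assumptions, and PipeParser, PipeWriter, and the pipe-separated header format are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedHeader:
    """Stand-in for the library's ParsedHeader; the fields are assumptions."""

    identifier: str
    header: str
    header_fields: dict[str, str] = field(default_factory=dict)


class PipeParser:
    """Hypothetical parser for headers like 'P12345|GENE1|some description'."""

    @classmethod
    def parse(cls, header: str) -> ParsedHeader:
        parts = header.split("|", maxsplit=2)
        if len(parts) != 3:
            raise ValueError(f"Unexpected header format: {header!r}")
        identifier, gene_name, description = parts
        fields = {"gene_name": gene_name, "description": description}
        return ParsedHeader(identifier, header, fields)


class PipeWriter:
    """Hypothetical writer that rebuilds the header from the parsed fields."""

    @classmethod
    def write(cls, parsed_header: ParsedHeader) -> str:
        fields = parsed_header.header_fields
        return "|".join(
            [parsed_header.identifier, fields["gene_name"], fields["description"]]
        )


parsed = PipeParser.parse("P12345|GENE1|Example protein")
print(parsed.identifier)         # P12345
print(PipeWriter.write(parsed))  # P12345|GENE1|Example protein
```

With the library installed, classes of this shape would presumably be passed to the register_parser and register_writer calls shown above, using the real ParsedHeader type instead of the stand-in.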

ProteinDatabase

The ProteinDatabase class provides a dict-like interface for managing protein entries loaded from FASTA files:

import profasta

db = profasta.ProteinDatabase()
db.add_fasta("proteins.fasta", header_parser="uniprot")

entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1

Multiple FASTA files can be added to the same database. Entries with unparseable headers can be skipped using skip_invalid=True.

A ProteinDatabase can also be created directly from one or more FASTA files using the from_fasta convenience constructor:

fasta_paths = ["proteome1.fasta", "proteome2.fasta"]
db = profasta.ProteinDatabase.from_fasta(*fasta_paths, header_parser="uniprot")

Entries can be filtered by a condition using the filter method, which returns a new ProteinDatabase:

human_db = db.filter(lambda e: e.header_fields.get("organism_identifier") == "9606")
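The filtering pattern can be sketched on plain Python objects. The Entry dataclass and filter_entries function below are hypothetical stand-ins rather than ProFASTA classes, and the identifiers other than O75385 are made up. The sketch shows why the condition uses header_fields.get(...): entries that lack the field are simply excluded instead of raising a KeyError.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Stand-in for a database entry; only header_fields matters here."""

    identifier: str
    header_fields: dict[str, str] = field(default_factory=dict)


def filter_entries(entries: dict[str, Entry], condition) -> dict[str, Entry]:
    """Return a new mapping with the entries for which condition is true."""
    return {key: entry for key, entry in entries.items() if condition(entry)}


entries = {
    "O75385": Entry("O75385", {"organism_identifier": "9606"}),
    "P00001": Entry("P00001", {"organism_identifier": "10090"}),
    "X1": Entry("X1"),  # no organism field, e.g. from a simpler header format
}
human = filter_entries(
    entries, lambda e: e.header_fields.get("organism_identifier") == "9606"
)
print(list(human))  # ['O75385']
```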

Header validation

The profasta.validation module provides a function for checking FASTA records for non-ASCII characters in their headers, which can cause issues in downstream processing:

import profasta.io
import profasta.validation

with open("proteins.fasta", "r") as f:
    records = list(profasta.io.parse_fasta(f))

issues = profasta.validation.find_header_ascii_issues(records)
for issue in issues:
    print(issue.header, issue.non_ascii_characters)
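The underlying check is simple enough to sketch in plain Python. The helper find_non_ascii below is a hypothetical illustration and not the library's find_header_ascii_issues; the second header is a made-up example containing a Greek letter.

```python
def find_non_ascii(header: str) -> list[str]:
    """Hypothetical helper: return the non-ASCII characters in a header."""
    return [char for char in header if not char.isascii()]


headers = [
    "sp|O75385|ULK1_HUMAN Serine/threonine-protein kinase ULK1",
    "custom|X1|β-galactosidase fragment",
]
for header in headers:
    bad_characters = find_non_ascii(header)
    if bad_characters:
        # Only the second header is reported, with ['β'].
        print(header, bad_characters)
```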

Usage examples

Load a UniProt FASTA file and access a protein entry

import profasta

db = profasta.ProteinDatabase()
db.add_fasta("./examples/uniprot_hsapiens_10entries.fasta", header_parser="uniprot")

entry = db["O75385"]
print(entry.header_fields["gene_name"])  # ULK1

Combine multiple FASTA files and add decoy entries

A common proteomics workflow is to combine one or more FASTA files and append reversed decoy sequences. Use profasta.decoy.write_decoy_fasta to write decoy entries directly to a FASTA file:

import profasta

# Load one or more forward databases
db = profasta.ProteinDatabase()
db.add_fasta("proteome.fasta", header_parser="uniprot")
db.add_fasta("additional.fasta", header_parser="uniprot")

# Write the forward entries, then append decoy entries with reversed sequences
output_path = "combined_with_decoys.fasta"
db.write_fasta(output_path, header_writer="default")
profasta.decoy.write_decoy_fasta(db, output_path, append=True)

Decoy headers are automatically prefixed with rev_. A custom prefix can be set via the decoy_tag argument:

profasta.decoy.write_decoy_fasta(db, output_path, append=True, decoy_tag="decoy_")
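What a reversed decoy entry looks like can be sketched independently of the library. The function make_decoy is hypothetical, the sequence is a short toy example, and the sketch assumes the tag is prepended to the full header string.

```python
def make_decoy(header: str, sequence: str, decoy_tag: str = "rev_") -> tuple[str, str]:
    """Hypothetical helper: tag the header and reverse the sequence."""
    return decoy_tag + header, sequence[::-1]


decoy_header, decoy_sequence = make_decoy("sp|O75385|ULK1_HUMAN", "MEPGRGGTE")
print(decoy_header)    # rev_sp|O75385|ULK1_HUMAN
print(decoy_sequence)  # ETGGRGPEM
```

Reversal preserves the amino acid composition and length of each protein, which is why it is a common way to build decoys for false discovery rate estimation.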
