
A pure-Python library for reading and writing PEFF (PSI Extended FASTA Format) files.


pefftacular


Python library for reading and writing PEFF (PSI Extended FASTA Format) files. PEFF is a superset of FASTA used in proteomics that carries rich per-entry annotations — PTMs, variants, processed forms, and more — encoded directly in the sequence header.

Install

pip install pefftacular

Dev install (from a clone of the repository, using the just command runner):

just install

Quick start

read_peff — load everything into memory at once:

from pefftacular import read_peff

header, entries = read_peff("proteins.peff")

for entry in entries:
    print(entry.db_unique_id, entry.pname, len(entry.sequence))

PeffReader — iterate lazily without loading the full file:

from pefftacular import PeffReader

with PeffReader("proteins.peff") as reader:
    file_header = reader.header
    for entry in reader:
        print(entry.db_unique_id)  # replace with your own per-entry processing
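
To make the eager versus lazy distinction concrete, here is a minimal standard-library sketch of lazy record iteration over FASTA-style text. It is not how pefftacular is implemented: iter_records is a hypothetical helper, the sample text is illustrative, and the sketch skips lines starting with # instead of parsing them into a header.

```python
import io
from typing import Iterator, TextIO


def iter_records(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header_line, sequence) pairs one at a time from FASTA-style text."""
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' file-header lines
        if line.startswith(">"):
            if header is not None:
                # A new record starts, so the previous one is complete.
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Illustrative input; only one record is held in memory at a time.
text = io.StringIO(
    "# PEFF 1.0\n"
    ">sp:P12345 \\PName=Example protein\n"
    "MKTIIALS\n"
    "YIFCLVFA\n"
    ">sp:P67890 \\PName=Another protein\n"
    "MAAA\n"
)
records = iter_records(text)
first = next(records)  # nothing past the first record has been joined yet
```

Because the generator only advances when asked, a consumer can stop early without reading the rest of the file, which is the property that makes a lazy reader attractive for large databases.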

Data model

read_peff and PeffReader yield SequenceEntry objects with these fields:

Field            Type                                   Description
prefix           str                                    Database prefix (e.g. sp, tr)
db_unique_id     str                                    Accession (e.g. P12345)
sequence         str                                    Amino acid sequence
pname            str | None                             Protein name (\PName=)
gname            str | None                             Gene name (\GName=)
ncbi_tax_id      int | None                             NCBI taxonomy ID (\NcbiTaxId=)
length           int | None                             Sequence length (\Length=)
sv               int | None                             Sequence version (\SV=)
ev               int | None                             Entry version (\EV=)
pe               int | None                             Protein existence level (\PE=)
variant_simple   tuple[VariantSimple, ...]              Simple sequence variants
variant_complex  tuple[VariantComplex, ...]             Multi-residue variants (start, end, new sequence, optional tag)
mod_res_unimod   tuple[ModResUnimod, ...]               UniMod modification sites
mod_res_psi      tuple[ModResPsi, ...]                  PSI-MOD modification sites
mod_res          tuple[ModRes, ...]                     Other named modification sites
processed        tuple[Processed, ...]                  Processed sequence forms
custom_values    dict[str, tuple[CustomKeyValue, ...]]  Header-declared custom keys, parsed by their CustomKeyDef
extra            dict[str, str]                         Non-standard keys with no CustomKeyDef
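
Since the annotation fields are tuples, they can be counted or iterated uniformly. The sketch below shows one way to summarise an entry. annotation_counts is a hypothetical helper, not part of pefftacular, and it is exercised here with a stand-in object rather than a real SequenceEntry.

```python
from types import SimpleNamespace


def annotation_counts(entry) -> dict[str, int]:
    """Count the items in each tuple-valued annotation field of an entry."""
    fields = (
        "variant_simple",
        "variant_complex",
        "mod_res_unimod",
        "mod_res_psi",
        "mod_res",
        "processed",
    )
    return {name: len(getattr(entry, name)) for name in fields}


# Stand-in object for demonstration; a real entry would come from read_peff.
demo = SimpleNamespace(
    variant_simple=("v1", "v2"),
    variant_complex=(),
    mod_res_unimod=("m1",),
    mod_res_psi=(),
    mod_res=(),
    processed=("p1",),
)
counts = annotation_counts(demo)
```

The helper relies only on the attribute names listed above, so it should work with any object exposing those fields.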

Annotations

Variants:

from pefftacular import read_peff

_, entries = read_peff("proteins.peff")
entry = entries[0]

for v in entry.variant_simple:
    print(v.position, v.new_amino_acid, v.tag)
    # e.g. 42, "K", "rs12345"
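
A common next step is to apply a simple variant to the entry's sequence. The following is a minimal sketch assuming that position is 1-based, which should be checked against the PEFF specification and the library's documentation. apply_simple_variant is a hypothetical helper, not a pefftacular function.

```python
def apply_simple_variant(sequence: str, position: int, new_amino_acid: str) -> str:
    """Return a copy of sequence with the residue at a 1-based position replaced."""
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside 1..{len(sequence)}")
    index = position - 1  # convert to Python's 0-based indexing (assumption: 1-based input)
    return sequence[:index] + new_amino_acid + sequence[index + 1:]


# Replace the third residue (T) with K.
variant_seq = apply_simple_variant("MKTIIALSYIFCLVFA", 3, "K")
```

With a real entry, the call would look like apply_simple_variant(entry.sequence, v.position, v.new_amino_acid).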

Modifications (UniMod):

for mod in entry.mod_res_unimod:
    print(mod.position, mod.accession, mod.name)
    # e.g. 17, "21", "Phospho"

Modifications (PSI-MOD):

for mod in entry.mod_res_psi:
    print(mod.position, mod.accession, mod.name)
    # e.g. 17, "MOD:00696", "phosphorylated residue"

Processed forms:

for proc in entry.processed:
    print(proc.start_pos, proc.end_pos, proc.accession, proc.name)
    # e.g. 1, 24, "PRO_0000012345", "Signal peptide"
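
To extract the residues covered by a processed form, the start and end positions can be turned into a slice. This sketch assumes start_pos and end_pos are 1-based and inclusive, which is an assumption rather than something stated above. processed_subsequence is a hypothetical helper.

```python
def processed_subsequence(sequence: str, start_pos: int, end_pos: int) -> str:
    """Slice out residues start_pos..end_pos, assuming 1-based inclusive coordinates."""
    if not 1 <= start_pos <= end_pos <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    # Python slices are 0-based and end-exclusive, so only the start is shifted.
    return sequence[start_pos - 1:end_pos]


signal = processed_subsequence("MKTIIALSYIFCLVFA", 1, 5)
```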

Custom keys (declared via # CustomKeyDef= in the header):

When the database header declares a custom key, entry values for that key are parsed using its RegExp / FieldNames / FieldTypes and exposed as typed fields on entry.custom_values. The original item text is preserved in raw for lossless round-trips.

Header excerpt:

# CustomKeyDef=(KeyName=SecondaryStructure|Description="..."|ConceptCURIE=BAO:0000014|RegExp="([0-9]+)\|([0-9]+)\|([A-Za-z]+:[A-Za-z0-9]+)?\|(.+)"|FieldNames=StartPosition,EndPosition,CURIE,Description|FieldTypes=integer,integer,string,string)

Entry usage:

>cu:P00001 \SecondaryStructure=(10|20|ncithesaurus:C47937|Helix)

Access:

ss = entry.custom_values["SecondaryStructure"]
ss[0].fields["StartPosition"]    # 10 (int)
ss[0].fields["Description"]      # "Helix"

Supported FieldTypes are XSD basic types (string, integer, decimal, boolean, date, time) plus enumeration(a|b|c). Coercion failures and enumeration mismatches emit UserWarning and fall back to the raw string. If no RegExp is declared, the value is split on | and zipped with FieldNames.
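
The no-RegExp path described above (split on |, zip with FieldNames, coerce by FieldTypes, warn and keep the raw string on failure) can be sketched with the standard library alone. This is an illustration of the described behaviour, not pefftacular's actual code: split_and_coerce and COERCERS are hypothetical names, and only four of the listed types are handled.

```python
import warnings
from decimal import Decimal


def _to_bool(text: str) -> bool:
    # XSD boolean lexical forms are true/false/1/0.
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Partial mapping for the sketch; date, time and enumeration are omitted.
COERCERS = {"string": str, "integer": int, "decimal": Decimal, "boolean": _to_bool}


def split_and_coerce(value: str, field_names: list[str], field_types: list[str]) -> dict:
    """Split a raw value on '|' and convert each part according to its declared type."""
    fields = {}
    for name, type_name, raw in zip(field_names, field_types, value.split("|")):
        coerce = COERCERS.get(type_name, str)  # unhandled types stay as strings
        try:
            fields[name] = coerce(raw)
        except (ValueError, ArithmeticError):
            # Mirror the documented fallback: warn, then keep the raw string.
            warnings.warn(f"could not read {raw!r} as {type_name}", UserWarning)
            fields[name] = raw
    return fields


fields = split_and_coerce(
    "10|20|ncithesaurus:C47937|Helix",
    ["StartPosition", "EndPosition", "CURIE", "Description"],
    ["integer", "integer", "string", "string"],
)
```

Here fields["StartPosition"] and fields["EndPosition"] come back as int, while the other two values remain strings.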

Other non-standard keys (no CustomKeyDef registered) still land in entry.extra as raw strings:

value = entry.extra.get("MyCustomKey")

Writing

Build a header and entries, then write:

from pefftacular import DatabaseHeader, FileHeader, SequenceEntry, write_peff

db_header = DatabaseHeader(
    prefix="sp",
    db_name="SwissProt",
    db_version="2024_01",
    number_of_entries=1,
)

file_header = FileHeader(
    peff_version="1.0",
    databases=(db_header,),
)

entry = SequenceEntry(
    prefix="sp",
    db_unique_id="P12345",
    sequence="MKTIIALSYIFCLVFA",
    pname="Example protein",
    gname="EXMP",
)

write_peff(file_header, [entry], "output.peff")

The third argument to write_peff (dest) can be a file path string, a pathlib.Path, or a text-mode file object.

Error handling

Parse errors raise PeffParseError:

from pefftacular import PeffParseError, read_peff

try:
    header, entries = read_peff("malformed.peff")
except PeffParseError as e:
    print(e.line)     # the offending line number
    print(e.context)  # surrounding context string

Write errors raise PeffWriteError:

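from pefftacular import PeffWriteError, write_peff

# file_header is a FileHeader and entries a list of SequenceEntry objects, built as in the Writing section.
try:
    write_peff(file_header, entries, "/read-only/output.peff")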
except PeffWriteError as e:
    print(e)

Development

just install      # install dependencies
just test         # run tests
just test-v       # run tests (verbose)
just test-file tests/test_reader.py   # run a single test file
just cov          # run tests with coverage
just lint         # ruff lint
just format       # ruff format
just check        # lint + type check + test
just build        # build the package
just clean        # remove cache files
just docs         # serve docs locally
just docs-deploy  # deploy docs to GitHub Pages

License

MIT
