A very simple FASTA file parser.

FastaFrames

Convert between UniProt FASTA files and pandas DataFrames.

Installation

pip install fastaframes

Quick Start

Read a FASTA file into a DataFrame

from fastaframes import to_df

df = to_df("proteins.fasta")
print(df.head())

Write a DataFrame back to FASTA

from fastaframes import to_fasta

to_fasta(df, output_file="output.fasta")
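
To make the round trip concrete, the sketch below shows one way a single record could be serialized to FASTA text using only the standard library. The helper name `format_fasta_record` and the 60-character line width are assumptions chosen for illustration. This is not the fastaframes implementation, and `to_fasta` may format its output differently.

```python
def format_fasta_record(header, sequence, width=60):
    """Return one FASTA record as text (hypothetical helper, not part of fastaframes).

    The header is written on a line starting with ">", and the sequence is
    wrapped to `width` characters per line (60 is an assumed default).
    """
    if width <= 0:
        raise ValueError("width must be a positive integer")
    # Accept headers with or without the leading ">".
    lines = [">" + header.lstrip(">")]
    # Slice the sequence into fixed-width chunks.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_fasta_record("sp|P12345|EXAMPLE_HUMAN Example protein", "MSEQUENCE"), end="")
```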

Work with individual entries

from fastaframes import fasta_to_entries, entries_to_fasta

for entry in fasta_to_entries("proteins.fasta"):
    print(entry.unique_identifier, entry.protein_name)

# Filter and write back
entries = [
    e
    for e in fasta_to_entries("proteins.fasta")
    if e.organism_name == "Homo sapiens"
]
entries_to_fasta(entries, output_file="human_only.fasta")

Multiple input formats

from io import StringIO
from fastaframes import to_df

# From a file path
df = to_df("proteins.fasta")

# From a string
df = to_df(">sp|P12345|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606\nMSEQUENCE\n")

# From a file object
with open("proteins.fasta") as f:
    df = to_df(f)

# From a StringIO
df = to_df(StringIO(">sp|P12345|EXAMPLE_HUMAN\nMSEQUENCE\n"))
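
A file path and raw FASTA text are both plain strings, so a reader that accepts both has to tell them apart somehow. The sketch below shows one possible dispatch, assuming that a string whose first non-blank character is ">" is FASTA content and that any other string is a path. The function name `open_fasta_source` and this heuristic are illustrative assumptions; how fastaframes actually distinguishes its inputs is not stated here.

```python
import io
import os


def open_fasta_source(source):
    """Return a text stream for a path, raw FASTA text, or a file-like object.

    Hypothetical helper, not part of fastaframes. The ">" check is an assumed
    heuristic for recognizing FASTA content passed as a string.
    """
    # File objects and StringIO already behave like streams: pass them through.
    if hasattr(source, "read"):
        return source
    if isinstance(source, str):
        # FASTA records start with ">", which is unlikely for a file path.
        if source.lstrip().startswith(">"):
            return io.StringIO(source)
        return open(source, encoding="utf-8")
    # pathlib.Path and other path-like objects.
    if isinstance(source, os.PathLike):
        return open(source, encoding="utf-8")
    raise TypeError(f"unsupported FASTA source: {type(source).__name__}")
```

In this sketch the caller is responsible for closing any stream that was opened from a path.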

Skip malformed entries

from fastaframes import to_df

df = to_df("messy_data.fasta", skip_error=True)
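
The general pattern behind a flag like `skip_error` can be sketched with the standard library: parse each record inside a `try` block and either skip or re-raise on failure. The functions `iter_records` and `parse_records` below are hypothetical, and the two conditions treated as malformed (a header without three `|`-separated parts, or an empty sequence) are assumptions. What fastaframes itself considers malformed is not specified in this description.

```python
def iter_records(text):
    """Yield (header, sequence) pairs from FASTA text (hypothetical helper)."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_records(text, skip_error=False):
    """Parse records, skipping bad ones when skip_error is True (illustrative only)."""
    parsed = []
    for header, sequence in iter_records(text):
        try:
            # Unpacking raises ValueError if the header lacks db|accession|name.
            db, accession, _remainder = header.split("|", 2)
            if not sequence:
                raise ValueError(f"empty sequence for {accession}")
        except ValueError:
            if skip_error:
                continue
            raise
        parsed.append(
            {"db": db, "unique_identifier": accession, "protein_sequence": sequence}
        )
    return parsed
```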

DataFrame Columns

Given this FASTA entry:

>sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1
MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNLLHVDFQNTPYCFDQ

to_df produces:

Column               Value (single row, shown transposed)
db                   sp
unique_identifier    A0A087X1C5
entry_name           CP2D7_HUMAN
protein_name         Putative cytochrome P450 2D7
organism_name        Homo sapiens
organism_identifier  9606
gene_name            CYP2D7
protein_existence    5
sequence_version     1
protein_sequence     MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNLLHVDFQNTPYCFDQ

Column descriptions (following the UniProt FASTA header format):

Column               Description
db                   Database source: sp (Swiss-Prot) or tr (TrEMBL)
unique_identifier    Primary UniProtKB accession number
entry_name           UniProtKB entry name
protein_name         Recommended protein name (RecName or first SubName)
organism_name        Scientific name of the source organism
organism_identifier  NCBI taxonomy identifier
gene_name            First gene name (if available)
protein_existence    Numerical evidence code for protein existence
sequence_version     Sequence version number
protein_sequence     Amino acid sequence
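
The mapping from a UniProt header line to these columns can be sketched with two regular expressions. The function `parse_uniprot_header` below is a hypothetical, standard-library-only illustration that reuses the column names above. It is not the parser fastaframes uses, and it returns every value as a string (or `None` when a field is absent).

```python
import re

# ">db|UniqueIdentifier|EntryName" followed by an optional free-text remainder.
HEADER_RE = re.compile(
    r"^>(?P<db>[^|]+)\|(?P<unique_identifier>[^|]+)\|(?P<entry_name>\S+)"
    r"(?:\s+(?P<rest>.*))?$"
)
# Tagged fields in the remainder: OS, OX, GN, PE, SV.
FIELD_RE = re.compile(r"\s(OS|OX|GN|PE|SV)=")


def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header into named fields (illustrative sketch)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style header: {header!r}")
    rest = match.group("rest") or ""
    # Splitting on a capturing group yields [protein_name, key, value, key, value, ...].
    # The leading space lets a remainder that starts with a tag split correctly.
    parts = FIELD_RE.split(" " + rest)
    tags = dict(zip(parts[1::2], (value.strip() for value in parts[2::2])))
    return {
        "db": match.group("db"),
        "unique_identifier": match.group("unique_identifier"),
        "entry_name": match.group("entry_name"),
        "protein_name": parts[0].strip() or None,
        "organism_name": tags.get("OS"),
        "organism_identifier": tags.get("OX"),
        "gene_name": tags.get("GN"),
        "protein_existence": tags.get("PE"),
        "sequence_version": tags.get("SV"),
    }
```

Because the tags are looked up by name, optional fields such as GN simply come back as `None` when the header omits them.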

Development

pip install -e ".[dev]"

Common commands via just:

just check      # Run all checks (lint, typecheck, test)
just lint       # Lint with ruff
just fmt        # Format with ruff
just typecheck  # Type check with ty
just test       # Run tests
just test -v    # Run tests verbosely

