A very simple FASTA file parser.
FastaFrames
Convert between UniProt FASTA files and pandas DataFrames.
Installation
pip install fastaframes
Quick Start
Read a FASTA file into a DataFrame
from fastaframes import to_df
df = to_df("proteins.fasta")
print(df.head())
Write a DataFrame back to FASTA
from fastaframes import to_fasta
to_fasta(df, output_file="output.fasta")
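As a rough illustration of what writing a row back to FASTA involves, the sketch below rebuilds a UniProt-style entry from a plain dictionary keyed by the DataFrame column names. It uses only the standard library and is not FastaFrames' own code: the `format_entry` helper and the 60-character line width are assumptions made for this example.

```python
# Hypothetical helper for illustration; not part of the fastaframes API.
HEADER_TAGS = (
    ("OS", "organism_name"),
    ("OX", "organism_identifier"),
    ("GN", "gene_name"),
    ("PE", "protein_existence"),
    ("SV", "sequence_version"),
)


def format_entry(row: dict, width: int = 60) -> str:
    """Render one row as a FASTA entry (header line plus wrapped sequence)."""
    header = f">{row['db']}|{row['unique_identifier']}|{row['entry_name']}"
    if row.get("protein_name"):
        header += f" {row['protein_name']}"
    # Optional fields are written only when a value is present.
    for tag, column in HEADER_TAGS:
        value = row.get(column)
        if value is not None:
            header += f" {tag}={value}"
    # Assumed layout: wrap the sequence at a fixed width.
    sequence = row["protein_sequence"]
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"


example_row = {
    "db": "sp",
    "unique_identifier": "P12345",
    "entry_name": "EXAMPLE_HUMAN",
    "protein_name": "Example protein",
    "organism_name": "Homo sapiens",
    "organism_identifier": 9606,
    "gene_name": None,
    "protein_sequence": "A" * 70,
}
print(format_entry(example_row))
```

Missing values (here `gene_name`) are simply left out of the header, which mirrors how the optional fields are absent from some UniProt entries.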
Work with individual entries
from fastaframes import fasta_to_entries, entries_to_fasta
for entry in fasta_to_entries("proteins.fasta"):
    print(entry.unique_identifier, entry.protein_name)
# Filter and write back
entries = [e for e in fasta_to_entries("proteins.fasta") if e.organism_name == "Homo sapiens"]
entries_to_fasta(entries, output_file="human_only.fasta")
Multiple input formats
from io import StringIO
from fastaframes import to_df
# From a file path
df = to_df("proteins.fasta")
# From a string
df = to_df(">sp|P12345|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606\nMSEQUENCE\n")
# From a file object
with open("proteins.fasta") as f:
    df = to_df(f)
# From a StringIO
df = to_df(StringIO(">sp|P12345|EXAMPLE_HUMAN\nMSEQUENCE\n"))
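For readers curious how a single function can accept all of these input types, here is a minimal standard-library sketch that normalises a path, a FASTA string, or a text file object into a stream of lines. The `iter_lines` helper is hypothetical, and the rule it uses to tell FASTA text from a file path (the string starts with `>` or contains a newline) is an assumption for this example, not a description of how FastaFrames decides.

```python
from io import StringIO
from pathlib import Path
from typing import IO, Iterable, Iterator, Union

Source = Union[str, Path, IO[str]]


def _clean(handle: Iterable[str]) -> Iterator[str]:
    """Yield stripped, non-empty lines."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def iter_lines(source: Source) -> Iterator[str]:
    """Hypothetical helper: accept a path, FASTA text, or an open text file."""
    if isinstance(source, Path):
        with open(source) as handle:
            yield from _clean(handle)
    elif isinstance(source, str):
        # Assumed heuristic: FASTA content starts with ">" or spans several lines.
        if source.lstrip().startswith(">") or "\n" in source:
            yield from _clean(StringIO(source))
        else:
            with open(source) as handle:
                yield from _clean(handle)
    else:
        # Anything else is treated as an already-open text stream.
        yield from _clean(source)


text = ">sp|P12345|EXAMPLE_HUMAN\nMSEQUENCE\n"
print(list(iter_lines(text)))
print(list(iter_lines(StringIO(text))))
```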
Skip malformed entries
from fastaframes import to_df
df = to_df("messy_data.fasta", skip_error=True)
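The general pattern behind an option like this can be sketched as follows: parse each record inside a `try` block, then either re-raise the failure or drop the record. The `split_records` and `read_records` helpers are hypothetical, and the validity rule (a header must contain three `|`-separated parts) is an assumption for this example. Which errors FastaFrames itself treats as skippable is not stated here.

```python
from typing import Iterable, Iterator, List, Tuple


def split_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Group FASTA lines into (header, sequence) pairs."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_records(lines: Iterable[str], skip_error: bool = False) -> List[Tuple[str, str, str]]:
    """Hypothetical reader: return (db, accession, sequence) tuples."""
    records = []
    for header, sequence in split_records(lines):
        try:
            parts = header[1:].split("|", 2)
            if len(parts) != 3:  # assumed validity rule for this sketch
                raise ValueError(f"malformed header: {header!r}")
            db, accession, _rest = parts
            records.append((db, accession, sequence))
        except ValueError:
            if not skip_error:
                raise  # strict mode: surface the first bad entry
    return records


messy = [
    ">sp|P12345|EXAMPLE_HUMAN Example protein",
    "MSEQ",
    "UENCE",
    ">broken header",
    "AAAA",
    ">tr|Q99999|OTHER_HUMAN",
    "MKV",
]
print(read_records(messy, skip_error=True))
```

With `skip_error=False` the same input raises a `ValueError` at the broken header, which is usually the safer default while you are still validating a data source.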
DataFrame Columns
Given this FASTA entry:
>sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1
MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNLLHVDFQNTPYCFDQ
to_df produces:
| db | unique_identifier | entry_name | protein_name | organism_name | organism_identifier | gene_name | protein_existence | sequence_version | protein_sequence |
|---|---|---|---|---|---|---|---|---|---|
| sp | A0A087X1C5 | CP2D7_HUMAN | Putative cytochrome P450 2D7 | Homo sapiens | 9606 | CYP2D7 | 5 | 1 | MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNLLHVDFQNTPYCFDQ |
Column descriptions (following the UniProt FASTA header format):
| Column | Description |
|---|---|
| db | Database source: sp (Swiss-Prot) or tr (TrEMBL) |
| unique_identifier | Primary UniProtKB accession number |
| entry_name | UniProtKB entry name |
| protein_name | Recommended protein name (RecName or first SubName) |
| organism_name | Scientific name of the source organism |
| organism_identifier | NCBI taxonomy identifier |
| gene_name | First gene name (if available) |
| protein_existence | Numerical evidence code for protein existence |
| sequence_version | Sequence version number |
| protein_sequence | Amino acid sequence |
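To make the mapping from header to columns concrete, here is a minimal standard-library sketch that splits a UniProt-style header into the fields listed above. It is not FastaFrames' actual parser: the `parse_header` function and both regular expressions are assumptions written for this example, and all values are kept as strings.

```python
import re

# Illustrative patterns; hypothetical, not taken from the fastaframes source.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<unique_identifier>[^|]+)\|(?P<entry_name>\S+)\s*(?P<rest>.*)$"
)
# A KEY=value field runs until the next " KEY=" tag or the end of the line.
FIELD_RE = re.compile(
    r"\b(?P<key>OS|OX|GN|PE|SV)=(?P<value>.*?)(?=\s+(?:OS|OX|GN|PE|SV)=|$)"
)
FIELD_TO_COLUMN = {
    "OS": "organism_name",
    "OX": "organism_identifier",
    "GN": "gene_name",
    "PE": "protein_existence",
    "SV": "sequence_version",
}


def parse_header(header: str) -> dict:
    """Split one header line into a dict keyed by column name."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style header: {header!r}")
    row = {
        "db": match["db"],
        "unique_identifier": match["unique_identifier"],
        "entry_name": match["entry_name"],
    }
    rest = match["rest"]
    first_field = FIELD_RE.search(rest)
    # Everything before the first KEY=value tag is the protein name.
    name_end = first_field.start() if first_field else len(rest)
    row["protein_name"] = rest[:name_end].strip() or None
    # Optional fields default to None when the tag is absent.
    for column in FIELD_TO_COLUMN.values():
        row[column] = None
    for field in FIELD_RE.finditer(rest):
        row[FIELD_TO_COLUMN[field["key"]]] = field["value"].strip()
    return row


row = parse_header(
    ">sp|A0A087X1C5|CP2D7_HUMAN Putative cytochrome P450 2D7 "
    "OS=Homo sapiens OX=9606 GN=CYP2D7 PE=5 SV=1"
)
for column, value in row.items():
    print(f"{column}: {value}")
```

Fields such as `GN` are optional in UniProt headers, so the sketch fills them with `None` rather than failing when they are missing.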
Development
pip install -e ".[dev]"
Common commands via just:
just check # Run all checks (lint, typecheck, test)
just lint # Lint with ruff
just fmt # Format with ruff
just typecheck # Type check with ty
just test # Run tests
just test -v # Run tests verbosely