Easy Fasta
A lightweight functional Python library for efficient FASTA file parsing and DNA sequence manipulation. No OOP bloat, only data.
Features
- Memory-efficient parsing: Stream through large FASTA files without loading everything into memory
- Random access: Jump directly to specific sequences with position tracking
- Sequence extraction: Filter sequences by identifiers
- DNA manipulation: Complete IUPAC-compliant complement and reverse complement operations
- Formatting: Convert sequences to multi-line FASTA format
- No input validation: users are responsible for providing correctly formatted files.
Installation
Requires Python 3.8+.
> pip install easyfasta
Alternatively, copy the module into your project.
Quick Start
from easyfasta import *
# Parse FASTA file sequence by sequence (memory efficient)
with open('sequences.fasta') as f:
    for header, sequence in fasta_iter(f):
        print(f">{header}")
        print(sequence[:50])  # First 50 bases
# Load entire FASTA into dictionary
sequences = load_fasta('sequences.fasta')
print(sequences['sequence_id'])
# Extract specific sequences
target_ids = ['seq1', 'seq2', 'seq3']
found = get_sequence_id('sequences.fasta', target_ids)
for header, seq in found:
    print(f"Found: {header}")
# Extract specific sequences using indexes
index = build_index('sequences.fasta')
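# The index can be saved and reloaded with pickle (the files must be opened in binary mode)
# import pickle
# with open("save_index_file.pkl", "wb") as fh:
#     pickle.dump(index, fh)
# with open("save_index_file.pkl", "rb") as fh:
#     index = pickle.load(fh)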
target_ids = ['seq1', 'seq2', 'seq3']
found = get_sequence_index('sequences.fasta', target_ids, index, ignore_unfound=True)
for header, seq in found:
    print(f"Found: {header}")
# DNA manipulation
dna = "ATCGGTAA"
print(complement(dna)) # TAGCCATT
print(reverse_complement(dna)) # TTACCGAT
API Reference
Parsing Functions
fasta_iter(open_file: TextIO) -> Generator[tuple[str, str], None, None]
Memory-efficient iterator over FASTA sequences.
with open('large_file.fasta') as f:
    for header, sequence in fasta_iter(f):
        # Process one sequence at a time
        process_sequence(header, sequence)
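To illustrate the streaming idea, here is a minimal sketch of how such a generator could be written. It is not the library's source code: `iter_fasta_sketch` is a hypothetical name, and the sketch assumes that header lines start with `>` and that blank lines can be skipped.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta_sketch(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time (hypothetical helper)."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the last record, which has no following header.
        yield header, "".join(chunks)


example = io.StringIO(">seq1 first record\nATCG\nGGTA\n>seq2\nTTAA\n")
records = list(iter_fasta_sketch(example))
print(records)  # [('seq1 first record', 'ATCGGGTA'), ('seq2', 'TTAA')]
```

Only the record currently being assembled is held in memory, which is why this pattern scales to files larger than RAM.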
load_fasta(fasta_path: str|Path) -> dict[str, str]
Load entire FASTA file into a dictionary mapping sequence IDs to sequences.
sequences = load_fasta('sequences.fasta')
my_sequence = sequences['sequence_id']
get_sequence_id(fasta_file: str|Path, identifiers: Iterable[str], identifier_only: bool = True) -> list[tuple[str, str]]
Extract sequences matching specific identifiers.
identifier_only: If True, match only the first part of headers (before whitespace)
wanted = ['seq1', 'seq2']
results = get_sequence_id('sequences.fasta', wanted)
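The effect of `identifier_only` can be pictured with the small sketch below. The helper `header_matches` is hypothetical and only mirrors the behaviour described above, assuming the identifier is the first whitespace-delimited token of the header and that the full header is compared otherwise.

```python
from typing import Set


def header_matches(header: str, wanted: Set[str], identifier_only: bool = True) -> bool:
    """Hypothetical helper: decide whether a header is one of the wanted IDs."""
    if identifier_only:
        # Compare only the text before the first whitespace.
        parts = header.split(None, 1)
        key = parts[0] if parts else ""
    else:
        # Compare the full header line (assumed behaviour).
        key = header
    return key in wanted


wanted_ids = {"seq1", "seq2"}
print(header_matches("seq1 Homo sapiens chromosome 1", wanted_ids))         # True
print(header_matches("seq1 Homo sapiens chromosome 1", wanted_ids, False))  # False
print(header_matches("seq3", wanted_ids))                                   # False
```

Storing the wanted identifiers in a set keeps each lookup constant-time, which matters when filtering millions of headers.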
build_index(fasta_file: str|Path) -> dict[str, int]
Build a FASTA index as a dictionary mapping each sequence identifier to an integer position in the file.
index = build_index(fasta_file)
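As a rough picture of what such an index can contain, the sketch below records the byte offset of every header line. This is an assumption about the approach, not the library's code, and `build_offset_index` is a hypothetical name. The file is read in binary mode so that offsets are exact.

```python
import os
import tempfile
from typing import Dict


def build_offset_index(path: str) -> Dict[str, int]:
    """Hypothetical helper: map each sequence ID to the byte offset of its header line."""
    index: Dict[str, int] = {}
    offset = 0
    with open(path, "rb") as handle:  # binary mode keeps byte offsets exact
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
            offset += len(line)
    return index


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w", newline="\n") as out:
        out.write(">seq1 first\nATCG\nGGTA\n>seq2\nTTAA\n")
    demo_index = build_offset_index(demo_path)

print(demo_index)  # {'seq1': 0, 'seq2': 22}
```

Building the index costs one full pass over the file, and later lookups can then skip directly to the records of interest.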
get_sequence_index(fasta_file: str|Path, identifiers: Iterable[str], index_dict: dict[str, int], ignore_unfound: bool = True) -> list[tuple[str, str]]
Use a prebuilt index to retrieve sequences (faster than scanning the whole file).
index = build_index(fasta_file)
wanted = ['seq1', 'seq2']
get_sequence_index(fasta_file, wanted, index)
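The speed-up comes from seeking straight to each record instead of reading the file from the start. The sketch below shows the idea under stated assumptions: `fetch_by_offset` is a hypothetical helper, the index holds byte offsets of header lines, and the meaning given to `ignore_unfound` here (skip missing identifiers, otherwise raise `KeyError`) is a guess rather than documented behaviour.

```python
import os
import tempfile
from typing import Dict, Iterable, List, Tuple


def fetch_by_offset(path: str, ids: Iterable[str], index: Dict[str, int],
                    ignore_unfound: bool = True) -> List[Tuple[str, str]]:
    """Hypothetical helper: read only the requested records by seeking to stored offsets."""
    results = []
    with open(path, "rb") as handle:
        for seq_id in ids:
            if seq_id not in index:
                if ignore_unfound:
                    continue  # assumed meaning: silently skip unknown IDs
                raise KeyError(seq_id)
            handle.seek(index[seq_id])
            header = handle.readline().decode().strip()[1:]
            chunks = []
            for line in handle:
                if line.startswith(b">"):
                    break  # next record reached
                chunks.append(line.decode().strip())
            results.append((header, "".join(chunks)))
    return results


demo_text = ">seq1 first\nATCG\nGGTA\n>seq2\nTTAA\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w", newline="\n") as out:
        out.write(demo_text)
    # Offsets of each header line; the demo text is ASCII, so character
    # positions and byte positions coincide.
    demo_index = {"seq1": 0, "seq2": demo_text.index(">seq2")}
    found = fetch_by_offset(demo_path, ["seq2", "missing", "seq1"], demo_index)

print(found)  # [('seq2', 'TTAA'), ('seq1 first', 'ATCGGGTA')]
```

Results come back in the order the identifiers were requested, not in file order.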
Sequence Manipulation
complement(seq: str) -> str
Return the complement of a DNA sequence (A↔T, C↔G, supports all IUPAC codes).
reverse(seq: str) -> str
Return the reverse of a sequence.
reverse_complement(seq: str) -> str
Return the reverse complement of a DNA sequence.
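For readers curious how IUPAC-aware complementing can work, here is a minimal sketch based on `str.translate`. It is an illustration, not the library's implementation: the names `complement_sketch` and `reverse_complement_sketch` are hypothetical, and handling of lowercase input is an added assumption.

```python
# Each IUPAC nucleotide code and, at the same position, its complement.
_IUPAC_FROM = "ACGTRYSWKMBDHVN"
_IUPAC_TO = "TGCAYRSWMKVHDBN"

# Translation table covering upper and lower case (lowercase support is an assumption).
COMPLEMENT_TABLE = str.maketrans(
    _IUPAC_FROM + _IUPAC_FROM.lower(),
    _IUPAC_TO + _IUPAC_TO.lower(),
)


def complement_sketch(seq: str) -> str:
    """Hypothetical helper: complement every base, leaving unknown characters untouched."""
    return seq.translate(COMPLEMENT_TABLE)


def reverse_complement_sketch(seq: str) -> str:
    """Hypothetical helper: complement the sequence, then reverse it."""
    return complement_sketch(seq)[::-1]


print(complement_sketch("ATCGGTAA"))          # TAGCCATT
print(reverse_complement_sketch("ATCGGTAA"))  # TTACCGAT
print(reverse_complement_sketch("ARYN"))      # NRYT
```

Ambiguity codes map to the code for the complementary base set, for example R (A or G) becomes Y (T or C), while S, W and N are their own complements.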
wrap_sequence(sequence: str, chunk_size: int = 80) -> str
Format sequence with line breaks every chunk_size characters (standard multiline FASTA format).
formatted = wrap_sequence("ATCGATCGATCG" * 10, 60)
print(formatted) # 60 characters per line
# Write to a file
out_file = 'output.fasta'  # example output path
with open(out_file, 'w') as fo:
    fo.write(">{}\n{}\n".format('seq_id', wrap_sequence("ATCGATCGATCG" * 10, 80)))
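Line wrapping of this kind reduces to slicing the sequence into fixed-size chunks. The sketch below shows one way to do it. `wrap_sketch` is a hypothetical name, and the choice not to append a trailing newline is an assumption, not a statement about the library.

```python
def wrap_sketch(sequence: str, chunk_size: int = 80) -> str:
    """Hypothetical helper: break a sequence into lines of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return "\n".join(
        sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)
    )


wrapped = wrap_sketch("ATCGATCGATCG" * 10, 60)
line_lengths = [len(line) for line in wrapped.split("\n")]
print(line_lengths)  # [60, 60]
```

When the sequence length is not a multiple of `chunk_size`, the last line is simply shorter.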
Design Philosophy
This library prioritizes:
- Memory efficiency: Built for large genomic files that don't fit in RAM
- Simplicity: Clean, predictable API with no external dependencies. No OOP bloat, only data.
- Performance: Stream-based processing; memory use while parsing depends on the size of a single record, not on the size of the file
- Standards compliance: Full IUPAC nucleotide code support
Use Cases
- Processing large FASTA files (e.g. metagenomes)
- Common DNA sequence manipulation
- Common operations on FASTA files, including parsing, indexing and sequence retrieval
- Bioinformatics workflows requiring memory efficiency
Requirements
- Python 3.8+
- No external dependencies
License
MIT
Contributing
Feel free to ask for new features. I published it as a lightweight library because these are the features I use the most, and I wanted to start with a solid foundation.
I have used this library for years, and it has been extensively tested. As such, I will only address issues that come with a minimal reproducible example.