Easy Fasta
A lightweight functional Python library for efficient FASTA file parsing and DNA sequence manipulation. No OOP bloat, only data.
Features
- Memory-efficient parsing: Stream through large FASTA files without loading everything into memory
- Random access: Jump directly to specific sequences with position tracking
- FAI indexing: Build and query standard .fai index files for fast random access
- Sequence extraction: Filter sequences by identifiers
- DNA manipulation: Complete IUPAC-compliant complement and reverse complement operations
- Formatting: Convert sequences to multi-line FASTA format
- No input validation: users are responsible for providing correctly formatted files.
Installation
Requires Python 3.8+.
pip install easyfasta
Or simply copy the module into your project.
Quick Start
from easyfasta import *
# Parse FASTA file sequence by sequence (memory efficient)
with open('sequences.fasta') as f:
    for header, sequence in fasta_iter(f):
        print(f">{header}")
        print(sequence[:50])  # First 50 bases
# Load entire FASTA into dictionary
sequences = load_fasta('sequences.fasta')
print(sequences['sequence_id'])
# Extract specific sequences
target_ids = ['seq1', 'seq2', 'seq3']
found = get_sequence_id('sequences.fasta', target_ids)
for header, seq in found:
    print(f"Found: {header}")
# Extract specific sequences using a dictionary index
index = build_dico_index('sequences.fasta')
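# using pickle you can save and load the index (pickle needs an open binary file, not a path)
#import pickle
#with open("save_index_file.pkl", "wb") as fh: pickle.dump(index, fh)
#with open("save_index_file.pkl", "rb") as fh: index = pickle.load(fh)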
target_ids = ['seq1', 'seq2', 'seq3']
found = get_sequence_dico_index('sequences.fasta', target_ids, index, ignore_unfound=True)
for header, seq in found:
    print(f"Found: {header}")
# FAI index for fast random access
build_index('sequences.fasta') # creates sequences.fasta.fai
index = load_index('sequences.fasta') # load into memory for repeated queries
seq = query('sequences.fasta', 'seq1', 0, 100, strand='+', dico_index=index)
# DNA manipulation
dna = "ATCGGTAA"
print(complement(dna)) # TAGCCATT
print(reverse_complement(dna)) # TTACCGAT
API Reference
Parsing Functions
fasta_iter(open_file: TextIO) -> Generator[tuple[str, str], None, None]
Memory-efficient iterator over FASTA sequences.
with open('large_file.fasta') as f:
    for header, sequence in fasta_iter(f):
        # Process one sequence at a time
        process_sequence(header, sequence)
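To illustrate what streaming one record at a time means, here is a minimal sketch of such an iterator. It is not the library's implementation: `iter_fasta_sketch` is a hypothetical name, and the choice to yield the full header line without the leading `>` is an assumption.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fasta_sketch(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, holding only one record in memory.

    Hypothetical helper for illustration; not part of easyfasta.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    # Emit the last record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)


demo = io.StringIO(">seq1 first record\nATCG\nGGTA\n>seq2\nTTAA\n")
records = list(iter_fasta_sketch(demo))
print(records)
```

Because the generator only keeps the current record's lines, peak memory depends on the longest single sequence rather than on the file size.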
load_fasta(fasta_path: str|Path) -> dict[str, str]
Load entire FASTA file into a dictionary mapping sequence IDs to sequences.
sequences = load_fasta('sequences.fasta')
my_sequence = sequences['sequence_id']
get_sequence_id(fasta_file: str|Path, identifiers: Iterable[str], identifier_only: bool = True) -> list[tuple[str, str]]
Extract sequences matching specific identifiers.
identifier_only: If True, match only the first part of headers (before whitespace)
wanted = ['seq1', 'seq2']
results = get_sequence_id('sequences.fasta', wanted)
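The effect of identifier_only can be sketched as follows. This is an illustration under assumptions, not the library's code: `header_key` is a hypothetical helper, and whether the returned tuples carry the full header or only the identifier is not stated above (the sketch keeps the full header).

```python
def header_key(header: str, identifier_only: bool = True) -> str:
    """Return the string a header would be compared on (hypothetical helper)."""
    if identifier_only:
        # Keep only the part before the first whitespace.
        parts = header.split(None, 1)
        return parts[0] if parts else ""
    return header


records = [
    ("seq1 Homo sapiens", "ATCG"),
    ("seq2", "GGCC"),
    ("seq3 sample B", "TTAA"),
]
wanted = {"seq1", "seq3"}

# With identifier_only=True, "seq1 Homo sapiens" matches the identifier "seq1".
hits = [(h, s) for h, s in records if header_key(h) in wanted]
print(hits)
```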
Dictionary Index Functions
build_dico_index(fasta_file: str|Path) -> dict[str, int]
Build an in-memory index as a dictionary mapping sequence identifiers to their byte position in the file.
index = build_dico_index('sequences.fasta')
get_sequence_dico_index(fasta_file: str|Path, identifiers: Iterable[str], index_dict: dict[str, int], ignore_unfound: bool = True) -> list[tuple[str, str]]
Use a dictionary index to retrieve sequences faster than parsing through the file.
index = build_dico_index('sequences.fasta')
wanted = ['seq1', 'seq2']
results = get_sequence_dico_index('sequences.fasta', wanted, index)
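The idea behind a byte-position index can be sketched with the standard library alone. The functions below are hypothetical stand-ins, not easyfasta's implementation; in particular, it is an assumption that the stored position points at the start of the header line and that keys are the first whitespace-delimited token of the header.

```python
import os
import tempfile
from typing import Dict, Tuple


def build_offset_index(path: str) -> Dict[str, int]:
    """Map each identifier to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
            offset += len(line)
    return index


def fetch_record(path: str, offset: int) -> Tuple[str, str]:
    """Seek to a stored offset and read a single record from there."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        header = handle.readline()[1:].strip().decode()
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # the next record starts here
            chunks.append(line.strip().decode())
    return header, "".join(chunks)


with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fasta")
    with open(fasta_path, "wb") as out:
        out.write(b">seq1 first\nATCG\nGGTA\n>seq2\nTTAA\n")
    offsets = build_offset_index(fasta_path)
    record = fetch_record(fasta_path, offsets["seq2"])

print(offsets, record)
```

Building the index costs one pass over the file; each later lookup is a seek plus a read of one record, which is why it beats re-parsing the file for repeated queries.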
FAI Index Functions
build_index(fasta: str|Path) -> None
Build a standard .fai index file next to the fasta file. Required before using load_index or query.
build_index('sequences.fasta') # creates sequences.fasta.fai
load_index(fasta: str|Path) -> dict[str, list]
Load a .fai index file into memory for repeated queries.
index = load_index('sequences.fasta')
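For readers unfamiliar with the format: a .fai file, as defined by samtools faidx, holds one tab-separated line per sequence with five columns (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH), where OFFSET is the byte position of the first base. The sketch below computes those fields for FASTA content held in memory. `fai_lines` is a hypothetical helper, not part of easyfasta, and the layout of the list values returned by load_index is not specified here.

```python
from typing import List


def fai_lines(fasta_bytes: bytes) -> List[str]:
    """Compute .fai records for in-memory FASTA bytes (illustrative only)."""
    records = []
    current = None  # [name, length, offset, linebases, linewidth]
    position = 0
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                records.append(current)
            name = line[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, position + len(line), 0, 0]
        elif current is not None and line.strip():
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                current[3] = bases      # bases per line
                current[4] = len(line)  # bytes per line, newline included
            current[1] += bases
        position += len(line)
    if current is not None:
        records.append(current)
    return ["\t".join(str(field) for field in rec) for rec in records]


demo = b">chr1 test\nACGTAC\nGTACGT\nAC\n>chr2\nGGGG\n"
index_lines = fai_lines(demo)
print("\n".join(index_lines))
```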
query(fasta: str|Path, name: str, start: int, end: int, strand: str = "+", dico_index: dict = None) -> str
Query a fasta file for a sequence by name and coordinates using the FAI index. Returns the reverse complement if strand is "-".
build_index('sequences.fasta')
index = load_index('sequences.fasta')
seq = query('sequences.fasta', 'chr1', 1000, 2000, strand='+', dico_index=index)
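The reason an FAI index allows random access is that the byte position of any base follows from three index fields by simple arithmetic. The sketch below shows that calculation on in-memory bytes. It assumes 0-based, end-exclusive coordinates, which may differ from the convention query uses; `byte_offset` and `slice_region` are hypothetical names, and strand handling is left out.

```python
def byte_offset(position: int, offset: int, linebases: int, linewidth: int) -> int:
    """Byte position of the base at 0-based `position` within a sequence."""
    full_lines, remainder = divmod(position, linebases)
    return offset + full_lines * linewidth + remainder


def slice_region(fasta_bytes: bytes, start: int, end: int,
                 offset: int, linebases: int, linewidth: int) -> str:
    """Return bases [start, end) using only index fields (assumed convention)."""
    first = byte_offset(start, offset, linebases, linewidth)
    last = byte_offset(end, offset, linebases, linewidth)
    raw = fasta_bytes[first:last]
    # Drop the line breaks that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


demo = b">chr1 test\nACGTAC\nGTACGT\nAC\n>chr2\nGGGG\n"
# Index fields for chr1 in this demo: offset=11, linebases=6, linewidth=7.
region = slice_region(demo, 4, 9, offset=11, linebases=6, linewidth=7)
print(region)
```

On a real file the same two offsets would drive a seek and a single read, so the cost depends on the region length rather than on the size of the file.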
Sequence Manipulation
complement(seq: str) -> str
Return the complement of a DNA sequence (A↔T, C↔G, supports all IUPAC codes).
reverse(seq: str) -> str
Return the reverse of a sequence.
reverse_complement(seq: str) -> str
Return the reverse complement of a DNA sequence.
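An IUPAC-aware complement can be expressed as a single translation table. The following is a sketch of the technique rather than the library's code; the `_sketch` names are hypothetical, and the handling of lowercase input shown here is an assumption.

```python
# IUPAC nucleotide codes and their complements, position by position.
_CODES = "ACGTRYSWKMBDHVN"
_COMPS = "TGCAYRSWMKVHDBN"
_TABLE = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())


def complement_sketch(seq: str) -> str:
    """Complement each base; characters outside the table pass through."""
    return seq.translate(_TABLE)


def reverse_complement_sketch(seq: str) -> str:
    """Complement, then reverse."""
    return complement_sketch(seq)[::-1]


dna = "ATCGGTAA"
print(complement_sketch(dna))
print(reverse_complement_sketch(dna))
print(complement_sketch("ARN"))
```

Ambiguity codes map to the code covering the complementary bases: R (A or G) pairs with Y (C or T), while S, W and N are their own complements.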
wrap_sequence(sequence: str, chunk_size: int = 80) -> str
Format sequence with line breaks every chunk_size characters (standard multiline FASTA format).
formatted = wrap_sequence("ATCGATCGATCG" * 10, 60)
print(formatted) # 60 characters per line
# write to a file
out_file = 'output.fasta'  # example output path
with open(out_file, 'w') as fo:
    fo.write(">{}\n{}\n".format('seq_id', wrap_sequence("ATCGATCGATCG" * 10, 80)))
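Line wrapping of this kind amounts to slicing the sequence at fixed intervals. A minimal sketch follows, assuming the result has no trailing newline (consistent with the explicit "\n" added in the write call above, but not confirmed). `wrap_sketch` is a hypothetical name.

```python
def wrap_sketch(sequence: str, chunk_size: int = 80) -> str:
    """Join fixed-width slices of the sequence with newlines."""
    return "\n".join(
        sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)
    )


wrapped = wrap_sketch("ATCGATCGATCG" * 10, 60)
line_lengths = [len(line) for line in wrapped.split("\n")]
print(line_lengths)
```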
Migration Guide: 1.0.14 → 1.1.0
Version 1.1.0 introduces FAI index support and contains breaking changes.
Breaking Changes
| 1.0.14 | 1.1.0 | Notes |
|---|---|---|
| build_index() | build_dico_index() | build_index() now builds a .fai file, not a dictionary |
| get_sequence_index() | get_sequence_dico_index() | straight rename |
New in 1.1.0
- build_index(): builds a standard .fai index file
- load_index(): loads a .fai index into memory
- query(): fast random access to any sequence region by coordinates
What you need to change
# 1.0.14
index = build_index('sequences.fasta')
results = get_sequence_index('sequences.fasta', ids, index)
# 1.1.0
index = build_dico_index('sequences.fasta')
results = get_sequence_dico_index('sequences.fasta', ids, index)
⚠️ Important:
build_index() no longer returns a dictionary. Calling it expecting a dictionary index will silently produce wrong results. Use build_dico_index() instead.
Design Philosophy
This library prioritizes:
- Memory efficiency: Built for large genomic files that don't fit in RAM
- Simplicity: Clean, predictable API with no external dependencies. No OOP bloat, only data.
- Performance: Stream-based parsing that handles one record at a time, so memory use does not grow with the number of sequences
- Standards compliance: Full IUPAC nucleotide code support
Use Cases
- Processing large FASTA files (e.g. metagenomes)
- Common DNA sequence manipulation
- Common operations on fasta including parsing, indexing and sequence retrieval
- Bioinformatics workflows requiring memory efficiency
Requirements
- Python 3.8+
- No external dependencies
License
MIT
Contributing
Feel free to ask for new features. I kept the first release lightweight because it covers the features I use the most, and I wanted to start from a solid foundation.
I have used this library for years, and it has been extensively tested. As such, I will only address issues that come with a minimal reproducible example.