A config-driven framework for curating biological sequence datasets
biocurator
A config-driven framework for curating biological sequence datasets from NCBI and UniProt. Define your search, filter, and export parameters in a YAML file; biocurator handles the rest.
Features
- Multi-database search — NCBI (nucleotide, protein, SRA) and UniProt
- Streaming Architecture — memory-efficient processing of large datasets
- Robustness — automatic retries with exponential backoff for API calls
- Typed config schema — validated YAML with sensible defaults
- Flexible filtering — length, quality score, organism, keywords, date range
- Multiple export formats — FASTA, CSV, JSON
- Rich CLI — progress bars, dry-run mode, per-job filtering
Supported Databases
| Database | Entrez / REST databases |
|---|---|
| NCBI | nuccore, nucleotide, protein, sra, pubmed, pmc, gene, taxonomy, and more |
| UniProt | Swiss-Prot (reviewed) and TrEMBL (unreviewed) protein entries |
Scalability & Robustness
biocurator is designed for high-throughput curation:
- Streaming: Sequences are processed one-by-one and streamed directly to disk, allowing you to curate thousands of sequences without exhausting system memory.
- NCBI History Server: Automatically uses the NCBI History Server (WebEnv) for scalable and stable data retrieval from large search results.
- Retry Logic: Built-in exponential backoff retries for all network operations to handle transient API failures gracefully.
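The retry behaviour described above can be illustrated with a minimal sketch. This is not biocurator's internal code: `retry_with_backoff` and all of its parameters are hypothetical names chosen for the example, and the exception types that count as transient are an assumption.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient network errors with growing delays.

    Hypothetical helper for illustration only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles on every attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Up to 10% random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable: the loop always returns or raises")
```

Passing `sleep` as a parameter keeps the helper easy to test, because a test can record the requested delays instead of actually waiting.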
Installation
Requires Python 3.13+.
# With uv (recommended)
uv pip install biocurator
# With pip
pip install biocurator
Quick Start
1. Generate a config file
biocurator init --output config.yaml
This writes a starter YAML to config.yaml. Use --template advanced to include all optional fields:
biocurator init --template advanced --output config.yaml
2. Edit the config
email: your@email.com
jobs:
  covid-genomes:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 50
    filter:
      min_length: 29000
      quality_threshold: 0.8
    export:
      outdir: results/covid
      formats: [fasta, csv]
      prefix: sars_cov2
3. Run a dry-run to preview
biocurator run config.yaml --dry-run
Dry run — 1 job(s) would execute:
• covid-genomes databases=['ncbi']
4. Execute
biocurator run config.yaml
Config Reference
Every config file has a top-level email (required for NCBI access) and a jobs map where each key is the job name.
email: your@email.com            # required
jobs:
  <job-name>:
    search:
      databases: [ncbi]          # required: ncbi | uniprot
      organism: null             # e.g. "SARS-CoV-2", "E. coli"
      sequence_type: nucleotide  # nucleotide | protein | sra
      keywords: []               # AND-joined with other terms
      max_results: 100
      exclude_terms: []          # excluded from search
      location: null             # geographic filter, e.g. "Philippines"
      taxonomy_filter: null      # taxon name or ID
      date_range:
        start: "2020/01/01"      # YYYY/MM/DD
        end: "2024/12/31"
    filter:
      min_length: null           # minimum sequence length (bp / aa)
      max_length: null           # maximum sequence length (bp / aa)
      exclude_terms: []          # excluded from title/description
      quality_threshold: null    # 0.0–1.0; filters on N/X content
    export:
      outdir: results            # output directory (created if absent)
      formats: [fasta]           # fasta | csv | json
      prefix: biocurator         # filename prefix for output files
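The reference above describes quality_threshold only as a 0.0–1.0 value that filters on N/X content. One plausible reading, sketched below, is that the score is the fraction of residues that are not ambiguous (N for nucleotides, X for proteins). The exact formula biocurator uses is not documented here, so `quality_score` and `passes_quality` are hypothetical helpers and the formula is an assumption.

```python
from typing import Optional


def quality_score(sequence: str, sequence_type: str = "nucleotide") -> float:
    """Return the fraction of residues that are NOT ambiguous.

    Assumption: ambiguity means 'N' for nucleotide and 'X' for protein sequences.
    """
    if not sequence:
        return 0.0  # an empty sequence carries no usable information
    ambiguous = "N" if sequence_type == "nucleotide" else "X"
    n_ambiguous = sequence.upper().count(ambiguous)
    return 1.0 - n_ambiguous / len(sequence)


def passes_quality(
    sequence: str,
    threshold: Optional[float],
    sequence_type: str = "nucleotide",
) -> bool:
    """Keep a sequence when no threshold is set or its score meets the threshold."""
    if threshold is None:
        return True
    return quality_score(sequence, sequence_type) >= threshold
```

Under this reading, a 30,000 bp genome with 3,000 N bases would score 0.9 and sit right at the boundary of a quality_threshold of 0.9.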
Defaults apply to any omitted field, so a minimal job only needs search.databases:
email: your@email.com
jobs:
  simple:
    search:
      databases: [ncbi]
      organism: "Homo sapiens"
    filter: {}
    export: {}
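To illustrate how omitted fields can fall back to defaults, here is a small sketch built on standard-library dataclasses. It mirrors the default values listed in the reference above, but `FilterSketch`, `ExportSketch`, and `from_mapping` are hypothetical stand-ins, not biocurator's actual schema classes, and the validation rules are assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class FilterSketch:
    """Hypothetical stand-in for a job's filter block."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    exclude_terms: List[str] = field(default_factory=list)
    quality_threshold: Optional[float] = None

    def __post_init__(self) -> None:
        q = self.quality_threshold
        if q is not None and not 0.0 <= q <= 1.0:
            raise ValueError("quality_threshold must be between 0.0 and 1.0")
        if (self.min_length is not None and self.max_length is not None
                and self.min_length > self.max_length):
            raise ValueError("min_length must not exceed max_length")


@dataclass
class ExportSketch:
    """Hypothetical stand-in for a job's export block."""
    outdir: str = "results"
    formats: List[str] = field(default_factory=lambda: ["fasta"])
    prefix: str = "biocurator"


def from_mapping(cls, raw: Dict[str, Any]):
    """Build cls from a parsed YAML mapping, rejecting keys it does not define."""
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    return cls(**raw)  # any key missing from raw keeps its default
```

With this shape, an empty mapping such as `export: {}` yields the documented defaults, and a typo in a key name fails early instead of being silently ignored.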
CLI Reference
biocurator init
Generate a starter config file.
Usage: biocurator init [OPTIONS]
Options:
  -o, --output TEXT    Write config to this file instead of stdout
  -t, --template TEXT  Template to use: basic (default) or advanced
biocurator run
Run all jobs defined in a config file.
Usage: biocurator run [OPTIONS] CONFIG
Arguments:
  CONFIG  Path to the YAML config file [required]
Options:
  -j, --jobs TEXT  Comma-separated job names to run (default: all)
  --dry-run        Validate config and preview jobs without downloading
Run only specific jobs:
biocurator run config.yaml --jobs covid-genomes,spike-proteins
Dry-run before committing:
biocurator run config.yaml --dry-run
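The --jobs option takes a comma-separated list of job names. A minimal sketch of how such a value could be resolved against the jobs defined in a config follows; `select_jobs` is a hypothetical helper, and the choice to reject unknown names (rather than skip them) is an assumption about reasonable behaviour, not a description of the CLI's internals.

```python
from typing import List, Optional


def select_jobs(available: List[str], jobs_option: Optional[str]) -> List[str]:
    """Resolve a --jobs style value against the job names found in the config."""
    if jobs_option is None:
        return list(available)  # default: run every job
    # Split on commas, tolerate surrounding whitespace, drop empty pieces.
    requested = [name.strip() for name in jobs_option.split(",") if name.strip()]
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"unknown job(s): {', '.join(unknown)}")
    return requested
```

For example, `select_jobs(["covid-genomes", "spike-proteins"], "spike-proteins")` would return only the second job, while passing `None` returns both.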
Usage Examples
Viral genome surveillance
Collect complete SARS-CoV-2 genomes deposited in 2024, filtered for quality:
email: researcher@uni.edu
jobs:
  sars-cov2-2024:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 500
      exclude_terms: [synthetic, artificial, recombinant]
      date_range:
        start: "2024/01/01"
        end: "2024/12/31"
    filter:
      min_length: 29000
      quality_threshold: 0.9
    export:
      outdir: results/sars_2024
      formats: [fasta, csv]
      prefix: sars_cov2_2024
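A search block like the one above has to become an Entrez query string at some point. The sketch below shows one way that translation might look, using the standard Entrez field tags [Organism], [All Fields], and [PDAT] (publication date). `build_entrez_query` is a hypothetical function, and the fields biocurator actually targets may differ.

```python
from typing import Optional, Sequence


def build_entrez_query(
    organism: Optional[str] = None,
    keywords: Sequence[str] = (),
    exclude_terms: Sequence[str] = (),
    date_start: Optional[str] = None,
    date_end: Optional[str] = None,
) -> str:
    """Join search criteria into a single Entrez-style query string."""
    parts = []
    if organism:
        parts.append(f'"{organism}"[Organism]')
    # Keywords are AND-joined with the other terms.
    parts.extend(f'"{kw}"[All Fields]' for kw in keywords)
    if date_start and date_end:
        # Entrez date ranges use the form start[PDAT] : end[PDAT].
        parts.append(f'("{date_start}"[PDAT] : "{date_end}"[PDAT])')
    if not parts:
        raise ValueError("at least one positive search term is required")
    query = " AND ".join(parts)
    # Exclusions are appended as NOT clauses after the positive terms.
    for term in exclude_terms:
        query += f' NOT "{term}"[All Fields]'
    return query
```

Running a job with --dry-run first is still the safer way to check a config, since this sketch only approximates the query the tool would send.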
Antibiotic resistance genes
Collect beta-lactamase nucleotide sequences from NCBI:
email: researcher@uni.edu
jobs:
  beta-lactamase:
    search:
      databases: [ncbi]
      sequence_type: nucleotide
      keywords: ["beta-lactamase", "bla gene"]
      max_results: 300
      exclude_terms: [partial, predicted]
    filter:
      min_length: 500
      max_length: 3000
    export:
      outdir: results/amr
      formats: [fasta, csv, json]
      prefix: bla_genes
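The min_length and max_length bounds in this job combine naturally with the streaming approach described earlier: records can be filtered and written one at a time, so the full result set never has to sit in memory. The sketch below uses plain generators; `filter_by_length` and `write_fasta` are hypothetical names, and representing a record as a (header, sequence) tuple is an assumption made for brevity.

```python
import io
from typing import Iterable, Iterator, Optional, TextIO, Tuple

Record = Tuple[str, str]  # (header, sequence); a simplification for this sketch


def filter_by_length(
    records: Iterable[Record],
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
) -> Iterator[Record]:
    """Lazily yield only the records whose length falls within the bounds."""
    for header, seq in records:
        if min_length is not None and len(seq) < min_length:
            continue
        if max_length is not None and len(seq) > max_length:
            continue
        yield header, seq


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> int:
    """Write records to handle one by one and return how many were written."""
    count = 0
    for header, seq in records:
        handle.write(f">{header}\n")
        # Wrap the sequence at a fixed line width, as is conventional for FASTA.
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
        count += 1
    return count


# Usage: only the 500-residue record passes the 500..3000 bounds.
source = iter([("short", "A" * 499), ("kept", "C" * 500), ("long", "G" * 3001)])
buffer = io.StringIO()
written = write_fasta(filter_by_length(source, 500, 3000), buffer)
```

Because both functions consume an iterator, the same pipeline works whether the records come from a list in a test or from a network download.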
Multi-database protein family study
Search both NCBI and UniProt for a protein family in the same job:
email: researcher@uni.edu
jobs:
  cytochrome-p450:
    search:
      databases: [ncbi, uniprot]
      sequence_type: protein
      keywords: ["cytochrome P450", "CYP"]
      max_results: 200
    filter:
      min_length: 300
      quality_threshold: 0.8
    export:
      outdir: results/cyp450
      formats: [fasta, csv]
      prefix: cyp450
Multiple independent jobs in one run
email: researcher@uni.edu
jobs:
  spike-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["spike glycoprotein"]
      max_results: 100
    filter:
      min_length: 1200
    export:
      outdir: results/spike
      formats: [fasta]
      prefix: spike
  nucleocapsid-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["nucleocapsid protein"]
      max_results: 100
    filter:
      min_length: 400
    export:
      outdir: results/ncap
      formats: [fasta]
      prefix: ncap
Run them together or selectively:
# All jobs
biocurator run config.yaml
# Only one
biocurator run config.yaml --jobs spike-proteins
Python API
You can drive curation from Python directly using Biocurator.run_job():
from biocurator.core.curator import Biocurator
from biocurator.config.schema import (
    JobConfig, SearchConfig, FilterConfig, ExportConfig,
)

job = JobConfig(
    name="my-job",
    search=SearchConfig(
        databases=["ncbi"],
        organism="SARS-CoV-2",
        sequence_type="nucleotide",
        keywords=["complete genome"],
        max_results=10,
    ),
    filter=FilterConfig(min_length=29000, quality_threshold=0.8),
    export=ExportConfig(outdir="results", formats=["fasta", "csv"], prefix="sars"),
)

curator = Biocurator(email="your@email.com")
output_files = curator.run_job(job)
# {"fasta": PosixPath("results/sars_sequences.fasta"), "csv": PosixPath("results/sars_metadata.csv")}
Progress callbacks let you integrate with your own UI:
def on_progress(phase: str, current: int, total: int):
    print(f"[{phase}] {current}/{total}")

curator.run_job(job, progress_callback=on_progress)
Load a config file programmatically:
from biocurator.config.loader import ConfigLoader
from biocurator.core.curator import Biocurator

config = ConfigLoader.load("config.yaml")
curator = Biocurator(email=config.email)
for job in config.jobs:
    output_files = curator.run_job(job)
    print(f"{job.name}: {list(output_files)}")
Provider internals
Each database provider exposes a QueryBuilder that translates search criteria into a database-specific query string. You can use these directly if you need the query without running the full pipeline:
from biocurator.providers.ncbi import NCBISearchCriteria, get_builder
from biocurator.providers.base import NCBIDatabase
# ProviderRegistry and DatabaseConfig are used below and must also be imported;
# their module paths are not shown here.

criteria = NCBISearchCriteria(
    database=NCBIDatabase.PUBMED,
    organism="Homo sapiens",
    keywords=["CRISPR"],
)
builder = get_builder(NCBIDatabase.PUBMED)
print(builder.build(criteria))
# '"Homo sapiens"[MeSH Terms] AND "CRISPR"[Title/Abstract]'

# Inspect available search fields
for field, desc in builder.available_fields().items():
    print(f"{field}: {desc}")

# Get records (returns an iterator)
searcher = ProviderRegistry.get("ncbi", DatabaseConfig(name="NCBI"), "your@email.com")
ids = searcher.search(criteria)
for record in searcher.fetch_metadata(ids, criteria):
    print(record.accession)
For UniProt:
from biocurator.providers.uniprot import UniProtQueryBuilder, UniProtSearchCriteria
criteria = UniProtSearchCriteria(organism="Mus musculus", reviewed=True)
query = UniProtQueryBuilder().build(criteria)
# 'organism:"Mus musculus" AND reviewed:true'
Output Files
| File | Format | Contents |
|---|---|---|
| <prefix>_sequences.fasta | FASTA | Downloaded sequences |
| <prefix>_metadata.csv | CSV | Per-sequence metadata |
| <prefix>_metadata.json | JSON | Per-sequence metadata (machine-readable) |
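The naming pattern in the table can be expressed as a small helper, which is handy when a downstream script needs to locate the files a job produced. `output_paths` is a hypothetical function written for this example; it only reproduces the pattern shown in the table and does not call biocurator.

```python
from pathlib import Path
from typing import Dict, Iterable

# Suffix per export format, taken from the table above.
_SUFFIXES = {
    "fasta": "_sequences.fasta",
    "csv": "_metadata.csv",
    "json": "_metadata.json",
}


def output_paths(outdir: str, prefix: str, formats: Iterable[str]) -> Dict[str, Path]:
    """Map each requested format to the file path the naming pattern implies."""
    paths = {}
    for fmt in formats:
        if fmt not in _SUFFIXES:
            raise ValueError(f"unsupported format: {fmt}")
        paths[fmt] = Path(outdir) / f"{prefix}{_SUFFIXES[fmt]}"
    return paths
```

Called with outdir "results", prefix "sars", and formats fasta and csv, it yields results/sars_sequences.fasta and results/sars_metadata.csv, matching the run_job() return value shown in the Python API section.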
Troubleshooting
No sequences returned
- Broaden keywords, remove exclude_terms, or increase max_results
- Verify the organism name matches NCBI/UniProt taxonomy exactly
NCBI rate-limit errors
- NCBI enforces 3 requests/second without an API key; the searcher already respects this, but heavy jobs may be slow
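If you call NCBI from your own scripts alongside biocurator, the same 3 requests/second limit applies to them. A minimal client-side limiter that spaces calls evenly could look like the sketch below. `RateLimiter` is a hypothetical class for illustration and is not how biocurator's searcher is necessarily implemented.

```python
import time
from typing import Callable


class RateLimiter:
    """Allow at most max_calls calls per period by spacing them evenly."""

    def __init__(
        self,
        max_calls: int = 3,
        period: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = period / max_calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()  # the first call may proceed immediately

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot relative to when this call actually ran.
        self._next_allowed = max(now, self._next_allowed) + self._min_interval
        return delay
```

Call `limiter.wait()` immediately before each request. The injectable `clock` and `sleep` arguments exist so the timing logic can be tested without real delays.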
InvalidConfigError: 'email' is required
- Add email: your@email.com at the top level of the YAML
ConfigNotFoundError
- Check the path passed to biocurator run; use --dry-run to validate before downloading
Enable debug logging
biocurator --debug run config.yaml
License
MIT