biocurator

A config-driven framework for curating biological sequence datasets from various databases. Define your search, filter, and export parameters in a YAML file; biocurator handles the rest.

Features

  • Multi-database search — NCBI (nucleotide, protein, SRA) and UniProt
  • Streaming Architecture — memory-efficient processing of large datasets
  • Robustness — automatic retries with exponential backoff for API calls
  • Typed config schema — validated YAML with sensible defaults
  • Flexible filtering — length, quality score, organism, keywords, date range
  • Multiple export formats — FASTA, CSV, JSON
  • Rich CLI — progress bars, dry-run mode, per-job filtering

Supported Databases

Database | Entrez / REST databases
NCBI     | nuccore, nucleotide, protein, sra, pubmed, pmc, gene, taxonomy, and more
UniProt  | Swiss-Prot (reviewed) and TrEMBL (unreviewed) protein entries

Scalability & Robustness

biocurator is designed for high-throughput curation:

  • Streaming: Sequences are processed one-by-one and streamed directly to disk, allowing you to curate thousands of sequences without exhausting system memory.
  • NCBI History Server: Automatically uses the NCBI History Server (WebEnv) for scalable and stable data retrieval from large search results.
  • Retry Logic: Built-in exponential backoff retries for all network operations to handle transient API failures gracefully.
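
To illustrate the retry behaviour described above, here is a minimal standard-library sketch of exponential backoff. It is not biocurator's implementation: the function name retry_with_backoff, its parameters, the jitter, and the choice of which exceptions count as transient are all assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying transient network errors with growing delays.

    Hypothetical helper for illustration; not part of biocurator.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Up to 10% random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The sleep parameter is injectable so the delays can be inspected in tests without actually waiting.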

Installation

Requires Python 3.13+.

# With uv (recommended)
uv pip install biocurator

# With pip
pip install biocurator

Quick Start

1. Generate a config file

biocurator init --output config.yaml

This writes a starter YAML to config.yaml. Use --template advanced to include all optional fields:

biocurator init --template advanced --output config.yaml

2. Edit the config

email: your@email.com

jobs:
  covid-genomes:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 50
    filter:
      min_length: 29000
      quality_threshold: 0.8
    export:
      outdir: results/covid
      formats: [fasta, csv]
      prefix: sars_cov2

3. Run a dry-run to preview

biocurator run config.yaml --dry-run
Dry run — 1 job(s) would execute:
  • covid-genomes  databases=['ncbi']

4. Execute

biocurator run config.yaml

Config Reference

Every config file has a top-level email (required for NCBI access) and a jobs map where each key is the job name.

email: your@email.com # required

jobs:
  <job-name>:
    search:
      databases: [ncbi] # required: ncbi | uniprot
      organism: null # e.g. "SARS-CoV-2", "E. coli"
      sequence_type: nucleotide # nucleotide | protein | sra
      keywords: [] # AND-joined with other terms
      max_results: 100
      exclude_terms: [] # excluded from search
      location: null # geographic filter, e.g. "Philippines"
      taxonomy_filter: null # taxon name or ID
      date_range:
        start: "2020/01/01" # YYYY/MM/DD
        end: "2024/12/31"
    filter:
      min_length: null # minimum sequence length (bp / aa)
      max_length: null # maximum sequence length (bp / aa)
      exclude_terms: [] # excluded from title/description
      quality_threshold: null # 0.0–1.0; filters on N/X content
    export:
      outdir: results # output directory (created if absent)
      formats: [fasta] # fasta | csv | json
      prefix: biocurator # filename prefix for output files
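
The reference above describes quality_threshold only as a 0.0–1.0 value that "filters on N/X content". One plausible reading, sketched below, is that a sequence is scored by the fraction of residues that are not the ambiguity placeholder (N for nucleotides, X for proteins) and kept when that score meets the threshold. This interpretation and the helper names known_fraction and keep_sequence are assumptions, not the package's documented behaviour.

```python
from typing import Optional


def known_fraction(sequence: str, sequence_type: str = "nucleotide") -> float:
    """Fraction of residues that are not the placeholder (assumed scoring rule)."""
    if not sequence:
        return 0.0
    placeholder = "X" if sequence_type == "protein" else "N"
    return 1.0 - sequence.upper().count(placeholder) / len(sequence)


def keep_sequence(
    sequence: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    quality_threshold: Optional[float] = None,
    sequence_type: str = "nucleotide",
) -> bool:
    """Apply length and quality checks; a None setting disables that check."""
    if min_length is not None and len(sequence) < min_length:
        return False
    if max_length is not None and len(sequence) > max_length:
        return False
    if quality_threshold is not None:
        return known_fraction(sequence, sequence_type) >= quality_threshold
    return True


print(known_fraction("ACGTNNNN"))                                        # 0.5
print(keep_sequence("ACGTNNNN", quality_threshold=0.8))                  # False
print(keep_sequence("ACGTACGTAN", min_length=10, quality_threshold=0.8))  # True
```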

Defaults apply to any omitted field. Within a job, only search.databases is required; the filter and export sections can be left empty:

email: your@email.com

jobs:
  simple:
    search:
      databases: [ncbi]
      organism: "Homo sapiens"
    filter: {}
    export: {}
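
As a rough illustration of how omitted fields fall back to defaults, the sketch below models a subset of the job sections with standard-library dataclasses, using the default values listed in the reference above. The class names SketchSearch, SketchFilter, and SketchExport and the helper job_from_dict are hypothetical; the package's real schema classes may be built differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchSearch:
    databases: list[str]                  # the only field without a default
    organism: Optional[str] = None
    sequence_type: str = "nucleotide"
    keywords: list[str] = field(default_factory=list)
    max_results: int = 100


@dataclass
class SketchFilter:
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    exclude_terms: list[str] = field(default_factory=list)
    quality_threshold: Optional[float] = None


@dataclass
class SketchExport:
    outdir: str = "results"
    formats: list[str] = field(default_factory=lambda: ["fasta"])
    prefix: str = "biocurator"


def job_from_dict(raw: dict) -> tuple[SketchSearch, SketchFilter, SketchExport]:
    """Build the three sections from a parsed YAML mapping, filling in defaults."""
    return (
        SketchSearch(**raw["search"]),
        SketchFilter(**(raw.get("filter") or {})),
        SketchExport(**(raw.get("export") or {})),
    )


search, flt, export = job_from_dict(
    {"search": {"databases": ["ncbi"], "organism": "Homo sapiens"}, "filter": {}, "export": {}}
)
print(search.max_results, flt.min_length, export.formats)  # 100 None ['fasta']
```

In this sketch a misspelled key raises TypeError when the dataclass is constructed, which is one simple way to get the kind of validation a typed schema provides.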

CLI Reference

biocurator init

Generate a starter config file.

Usage: biocurator init [OPTIONS]

Options:
  -o, --output TEXT    Write config to this file instead of stdout
  -t, --template TEXT  Template to use: basic (default) or advanced

biocurator run

Run all jobs defined in a config file.

Usage: biocurator run [OPTIONS] CONFIG

Arguments:
  CONFIG  Path to the YAML config file  [required]

Options:
  -j, --jobs TEXT  Comma-separated job names to run (default: all)
  --dry-run        Validate config and preview jobs without downloading

Run only specific jobs:

biocurator run config.yaml --jobs covid-genomes,spike-proteins

Dry-run before committing:

biocurator run config.yaml --dry-run

Usage Examples

Viral genome surveillance

Collect complete SARS-CoV-2 genomes deposited in 2024, filtered for quality:

email: researcher@uni.edu

jobs:
  sars-cov2-2024:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 500
      exclude_terms: [synthetic, artificial, recombinant]
      date_range:
        start: "2024/01/01"
        end: "2024/12/31"
    filter:
      min_length: 29000
      quality_threshold: 0.9
    export:
      outdir: results/sars_2024
      formats: [fasta, csv]
      prefix: sars_cov2_2024

Antibiotic resistance genes

Collect beta-lactamase nucleotide sequences from NCBI:

email: researcher@uni.edu

jobs:
  beta-lactamase:
    search:
      databases: [ncbi]
      sequence_type: nucleotide
      keywords: ["beta-lactamase", "bla gene"]
      max_results: 300
      exclude_terms: [partial, predicted]
    filter:
      min_length: 500
      max_length: 3000
    export:
      outdir: results/amr
      formats: [fasta, csv, json]
      prefix: bla_genes

Multi-database protein family study

Search both NCBI and UniProt for a protein family in the same job:

email: researcher@uni.edu

jobs:
  cytochrome-p450:
    search:
      databases: [ncbi, uniprot]
      sequence_type: protein
      keywords: ["cytochrome P450", "CYP"]
      max_results: 200
    filter:
      min_length: 300
      quality_threshold: 0.8
    export:
      outdir: results/cyp450
      formats: [fasta, csv]
      prefix: cyp450

Multiple independent jobs in one run

email: researcher@uni.edu

jobs:
  spike-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["spike glycoprotein"]
      max_results: 100
    filter:
      min_length: 1200
    export:
      outdir: results/spike
      formats: [fasta]
      prefix: spike

  nucleocapsid-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["nucleocapsid protein"]
      max_results: 100
    filter:
      min_length: 400
    export:
      outdir: results/ncap
      formats: [fasta]
      prefix: ncap

Run them together or selectively:

# All jobs
biocurator run config.yaml

# Only one
biocurator run config.yaml --jobs spike-proteins

Python API

You can drive curation from Python directly using Biocurator.run_job():

from biocurator.core.curator import Biocurator
from biocurator.config.schema import (
    JobConfig, SearchConfig, FilterConfig, ExportConfig,
)

job = JobConfig(
    name="my-job",
    search=SearchConfig(
        databases=["ncbi"],
        organism="SARS-CoV-2",
        sequence_type="nucleotide",
        keywords=["complete genome"],
        max_results=10,
    ),
    filter=FilterConfig(min_length=29000, quality_threshold=0.8),
    export=ExportConfig(outdir="results", formats=["fasta", "csv"], prefix="sars"),
)

curator = Biocurator(email="your@email.com")
output_files = curator.run_job(job)
# {"fasta": PosixPath("results/sars_sequences.fasta"), "csv": PosixPath("results/sars_metadata.csv")}

Progress callbacks let you integrate with your own UI:

def on_progress(phase: str, current: int, total: int):
    print(f"[{phase}] {current}/{total}")

curator.run_job(job, progress_callback=on_progress)
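
Any callable with the same (phase, current, total) signature should work as a callback. The sketch below is a small tracker object that remembers the latest update per phase and formats a one-line summary. The class ProgressTracker and the phase names "search" and "download" are made up for the example; the actual phase names biocurator reports are not listed here.

```python
class ProgressTracker:
    """Collects (phase, current, total) updates and reports percent complete."""

    def __init__(self) -> None:
        self.phases: dict[str, tuple[int, int]] = {}

    def __call__(self, phase: str, current: int, total: int) -> None:
        # Same signature as the on_progress function shown above.
        self.phases[phase] = (current, total)

    def percent(self, phase: str) -> float:
        current, total = self.phases.get(phase, (0, 0))
        return 100.0 * current / total if total else 0.0

    def summary(self) -> str:
        return ", ".join(f"{name} {self.percent(name):.0f}%" for name in self.phases)


tracker = ProgressTracker()
tracker("search", 1, 1)      # hypothetical phase names
tracker("download", 5, 20)
print(tracker.summary())     # search 100%, download 25%
```

An instance can then be passed in place of a plain function, for example curator.run_job(job, progress_callback=tracker).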

Load a config file programmatically:

from biocurator.config.loader import ConfigLoader
from biocurator.core.curator import Biocurator

config = ConfigLoader.load("config.yaml")
curator = Biocurator(email=config.email)

for job in config.jobs:
    output_files = curator.run_job(job)
    print(f"{job.name}: {list(output_files)}")

Provider internals

Each database provider exposes a QueryBuilder that translates search criteria into a database-specific query string. You can use these directly if you need the query without running the full pipeline:

from biocurator.providers.ncbi import NCBISearchCriteria, get_builder
from biocurator.providers.base import NCBIDatabase

criteria = NCBISearchCriteria(
    database=NCBIDatabase.PUBMED,
    organism="Homo sapiens",
    keywords=["CRISPR"],
)
builder = get_builder(NCBIDatabase.PUBMED)
print(builder.build(criteria))
# '"Homo sapiens"[MeSH Terms] AND "CRISPR"[Title/Abstract]'

# Inspect available search fields
for field, desc in builder.available_fields().items():
    print(f"{field}: {desc}")

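# Get records (returns an iterator)
# Note: ProviderRegistry and DatabaseConfig must also be imported; see the package source for their modules.
searcher = ProviderRegistry.get("ncbi", DatabaseConfig(name="NCBI"), "your@email.com")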
ids = searcher.search(criteria)
for record in searcher.fetch_metadata(ids, criteria):
    print(record.accession)
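
To make the query-building step more concrete, the sketch below shows one simple way a list of criteria can be turned into an Entrez-style query string with field tags and boolean operators. It is a standalone illustration, not the package's QueryBuilder: the function build_query, its parameters, and the handling of excluded terms are assumptions. The field tags are passed in by the caller, so the first call reproduces the PubMed string printed in the example above.

```python
from typing import Iterable, Optional


def tag(term: str, field_name: str) -> str:
    """Quote a term and attach an Entrez field tag, e.g. "CRISPR"[Title/Abstract]."""
    return f'"{term}"[{field_name}]'


def build_query(
    organism: Optional[str] = None,
    keywords: Iterable[str] = (),
    exclude_terms: Iterable[str] = (),
    *,
    organism_field: str,
    keyword_field: str,
) -> str:
    """AND-join the positive terms, then append NOT clauses (hypothetical helper)."""
    positive = []
    if organism:
        positive.append(tag(organism, organism_field))
    positive.extend(tag(kw, keyword_field) for kw in keywords)
    if not positive:
        raise ValueError("at least one organism or keyword is required")
    query = " AND ".join(positive)
    for term in exclude_terms:
        query += " NOT " + tag(term, keyword_field)
    return query


print(build_query("Homo sapiens", ["CRISPR"],
                  organism_field="MeSH Terms", keyword_field="Title/Abstract"))
# "Homo sapiens"[MeSH Terms] AND "CRISPR"[Title/Abstract]
```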

For UniProt:

from biocurator.providers.uniprot import UniProtQueryBuilder, UniProtSearchCriteria

criteria = UniProtSearchCriteria(organism="Mus musculus", reviewed=True)
query = UniProtQueryBuilder().build(criteria)
# 'organism:"Mus musculus" AND reviewed:true'

Output Files

File                     | Format | Contents
<prefix>_sequences.fasta | FASTA  | Downloaded sequences
<prefix>_metadata.csv    | CSV    | Per-sequence metadata
<prefix>_metadata.json   | JSON   | Per-sequence metadata (machine-readable)
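
For downstream processing, the FASTA and CSV outputs can be read back with the standard library alone. The sketch below streams records one at a time so large files do not need to fit in memory. The function names read_fasta and read_metadata are illustrative, and no particular CSV column names are assumed because the metadata columns are not listed here.

```python
import csv
from pathlib import Path
from typing import Iterator


def read_fasta(path: Path) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)


def read_metadata(path: Path) -> Iterator[dict[str, str]]:
    """Yield one dict per CSV row, keyed by whatever header row the file has."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```

With the Python API example earlier in this document, the calls would look like read_fasta(Path("results/sars_sequences.fasta")) and read_metadata(Path("results/sars_metadata.csv")).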

Troubleshooting

No sequences returned

  • Broaden keywords, remove exclude_terms, or increase max_results
  • Verify the organism name matches NCBI/UniProt taxonomy exactly

NCBI rate-limit errors

  • NCBI enforces 3 requests/second without an API key; the searcher already respects this, but heavy jobs may be slow
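
For readers curious what client-side throttling to 3 requests/second can look like, here is a minimal minimum-interval limiter. It is a generic sketch, not the code biocurator's searcher uses; the class name MinIntervalLimiter and its injectable clock and sleep parameters are assumptions made for the example.

```python
import time
from typing import Callable


class MinIntervalLimiter:
    """Spaces calls to wait() at least 1/requests_per_second seconds apart."""

    def __init__(
        self,
        requests_per_second: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        pause = max(0.0, self._next_allowed - now)
        if pause:
            self._sleep(pause)
        # Schedule the next slot one interval after this request's start time.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return pause
```

Calling limiter.wait() immediately before each HTTP request keeps a single-threaded client under the limit; a multi-threaded client would additionally need a lock around wait().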

InvalidConfigError: 'email' is required

  • Add email: your@email.com at the top level of the YAML

ConfigNotFoundError

  • Check that the path passed to biocurator run points to an existing file; use --dry-run to validate the config before downloading

Enable debug logging

biocurator --debug run config.yaml

License

MIT
