A config-driven framework for curating biological sequence datasets

biocurator

A config-driven framework for curating biological sequence datasets from various databases. Define your search, filter, and export parameters in a YAML file; biocurator handles the rest.

Features

Multi-database search — NCBI (nucleotide, protein, SRA) and UniProt
Streaming Architecture — memory-efficient processing of large datasets
Robustness — automatic retries with exponential backoff for API calls
Typed config schema — validated YAML with sensible defaults
Flexible filtering — length, quality score, organism, keywords, date range
Multiple export formats — FASTA, CSV, JSON
Rich CLI — progress bars, dry-run mode, per-job filtering

Supported Databases

Database	Entrez / REST databases
NCBI	nuccore, nucleotide, protein, sra, pubmed, pmc, gene, taxonomy, and more
UniProt	Swiss-Prot (reviewed) and TrEMBL (unreviewed) protein entries

Scalability & Robustness

biocurator is designed for high-throughput curation:

Streaming: Sequences are processed one-by-one and streamed directly to disk, allowing you to curate thousands of sequences without exhausting system memory.
NCBI History Server: Automatically uses the NCBI History Server (WebEnv) for scalable and stable data retrieval from large search results.
Retry Logic: Built-in exponential backoff retries for all network operations to handle transient API failures gracefully.
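
The retry behaviour described above can be pictured with a minimal sketch. This is not biocurator's own code: the function name `retry_with_backoff`, its parameters, and the choice of `ConnectionError` as the retryable error are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on ConnectionError with doubling delays.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on every failure: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. Production code would usually add random jitter to the delay so that many clients do not retry in lockstep.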

Installation

Requires Python 3.13+.

# With uv (recommended)
uv pip install biocurator

# With pip
pip install biocurator

Quick Start

1. Generate a config file

biocurator init --output config.yaml

This writes a starter YAML to config.yaml. Use --template advanced to include all optional fields:

biocurator init --template advanced --output config.yaml

2. Edit the config

email: your@email.com

jobs:
  covid-genomes:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 50
    filter:
      min_length: 29000
      quality_threshold: 0.8
    export:
      outdir: results/covid
      formats: [fasta, csv]
      prefix: sars_cov2

3. Run a dry-run to preview

biocurator run config.yaml --dry-run

Dry run — 1 job(s) would execute:
  • covid-genomes  databases=['ncbi']

4. Execute

biocurator run config.yaml

Config Reference

Every config file has a top-level email (required for NCBI access) and a jobs map where each key is the job name.

email: your@email.com # required

jobs:
  <job-name>:
    search:
      databases: [ncbi] # required: ncbi | uniprot
      organism: null # e.g. "SARS-CoV-2", "E. coli"
      sequence_type: nucleotide # nucleotide | protein | sra
      keywords: [] # AND-joined with other terms
      max_results: 100
      exclude_terms: [] # excluded from search
      location: null # geographic filter, e.g. "Philippines"
      taxonomy_filter: null # taxon name or ID
      date_range:
        start: "2020/01/01" # YYYY/MM/DD
        end: "2024/12/31"
    filter:
      min_length: null # minimum sequence length (bp / aa)
      max_length: null # maximum sequence length (bp / aa)
      exclude_terms: [] # excluded from title/description
      quality_threshold: null # 0.0–1.0; filters on N/X content
    export:
      outdir: results # output directory (created if absent)
      formats: [fasta] # fasta | csv | json
      prefix: biocurator # filename prefix for output files

Omitted fields fall back to their defaults, so the only field a job must set is search.databases. The example below also sets organism to narrow the search:

email: your@email.com

jobs:
  simple:
    search:
      databases: [ncbi]
      organism: "Homo sapiens"
    filter: {}
    export: {}
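
To illustrate how defaults can fill in omitted fields, here is a small sketch using standard-library dataclasses. `SearchSketch`, `ExportSketch`, and `parse_job` are hypothetical names and are not biocurator's real schema classes; only the default values are taken from the reference above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SearchSketch:
    databases: list                      # the only required field
    organism: Optional[str] = None
    sequence_type: str = "nucleotide"
    keywords: list = field(default_factory=list)
    max_results: int = 100


@dataclass
class ExportSketch:
    outdir: str = "results"
    formats: list = field(default_factory=lambda: ["fasta"])
    prefix: str = "biocurator"


def parse_job(raw: dict) -> tuple:
    """Build typed objects from a raw job mapping, applying defaults."""
    search = SearchSketch(**raw["search"])  # unknown keys raise TypeError
    unknown = set(search.databases) - {"ncbi", "uniprot"}
    if unknown:
        raise ValueError(f"unsupported databases: {sorted(unknown)}")
    export = ExportSketch(**raw.get("export", {}))
    return search, export


# The minimal job from the YAML above, as it would look once parsed.
raw_job: dict[str, Any] = {
    "search": {"databases": ["ncbi"], "organism": "Homo sapiens"},
    "filter": {},
    "export": {},
}
search, export = parse_job(raw_job)
print(search.max_results, export.formats, export.prefix)
# 100 ['fasta'] biocurator
```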

CLI Reference

`biocurator init`

Generate a starter config file.

Usage: biocurator init [OPTIONS]

Options:
  -o, --output TEXT    Write config to this file instead of stdout
  -t, --template TEXT  Template to use: basic (default) or advanced

`biocurator run`

Run all jobs defined in a config file.

Usage: biocurator run [OPTIONS] CONFIG

Arguments:
  CONFIG  Path to the YAML config file  [required]

Options:
  -j, --jobs TEXT  Comma-separated job names to run (default: all)
  --dry-run        Validate config and preview jobs without downloading

Run only specific jobs:

biocurator run config.yaml --jobs covid-genomes,spike-proteins

Dry-run before committing:

biocurator run config.yaml --dry-run

Usage Examples

Viral genome surveillance

Collect complete SARS-CoV-2 genomes deposited in 2024, filtered for quality:

email: researcher@uni.edu

jobs:
  sars-cov2-2024:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 500
      exclude_terms: [synthetic, artificial, recombinant]
      date_range:
        start: "2024/01/01"
        end: "2024/12/31"
    filter:
      min_length: 29000
      quality_threshold: 0.9
    export:
      outdir: results/sars_2024
      formats: [fasta, csv]
      prefix: sars_cov2_2024
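
The filter stage of this job can be sketched roughly as follows. The config reference only says that quality_threshold "filters on N/X content"; the exact formula is not documented here, so the score below (the fraction of residues that are not ambiguity codes) is an assumption, and both function names are hypothetical.

```python
from typing import Optional


def quality_score(sequence: str, ambiguous: str = "N") -> float:
    """Fraction of residues that are NOT ambiguity codes (assumed definition).

    Use ambiguous="N" for nucleotides and ambiguous="X" for proteins,
    because N is a regular amino acid (asparagine) in protein sequences.
    """
    if not sequence:
        return 0.0
    bad = sum(1 for ch in sequence.upper() if ch in ambiguous)
    return 1.0 - bad / len(sequence)


def passes_filter(
    sequence: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    quality_threshold: Optional[float] = None,
) -> bool:
    """Apply each check only when its setting is not None, as in the config."""
    if min_length is not None and len(sequence) < min_length:
        return False
    if max_length is not None and len(sequence) > max_length:
        return False
    if quality_threshold is not None and quality_score(sequence) < quality_threshold:
        return False
    return True


# A 30,000 bp toy genome in which 400 positions are N.
genome = "ACGT" * 7400 + "N" * 400
print(round(quality_score(genome), 4))   # 0.9867
print(passes_filter(genome, min_length=29000, quality_threshold=0.9))  # True
```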

Antibiotic resistance genes

Collect beta-lactamase nucleotide sequences from NCBI:

email: researcher@uni.edu

jobs:
  beta-lactamase:
    search:
      databases: [ncbi]
      sequence_type: nucleotide
      keywords: ["beta-lactamase", "bla gene"]
      max_results: 300
      exclude_terms: [partial, predicted]
    filter:
      min_length: 500
      max_length: 3000
    export:
      outdir: results/amr
      formats: [fasta, csv, json]
      prefix: bla_genes

Multi-database protein family study

Search both NCBI and UniProt for a protein family in the same job:

email: researcher@uni.edu

jobs:
  cytochrome-p450:
    search:
      databases: [ncbi, uniprot]
      sequence_type: protein
      keywords: ["cytochrome P450", "CYP"]
      max_results: 200
    filter:
      min_length: 300
      quality_threshold: 0.8
    export:
      outdir: results/cyp450
      formats: [fasta, csv]
      prefix: cyp450

Multiple independent jobs in one run

email: researcher@uni.edu

jobs:
  spike-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["spike glycoprotein"]
      max_results: 100
    filter:
      min_length: 1200
    export:
      outdir: results/spike
      formats: [fasta]
      prefix: spike

  nucleocapsid-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["nucleocapsid protein"]
      max_results: 100
    filter:
      min_length: 400
    export:
      outdir: results/ncap
      formats: [fasta]
      prefix: ncap

Run them together or selectively:

# All jobs
biocurator run config.yaml

# Only one
biocurator run config.yaml --jobs spike-proteins

Python API

You can drive curation from Python directly using Biocurator.run_job():

from biocurator.core.curator import Biocurator
from biocurator.config.schema import (
    JobConfig, SearchConfig, FilterConfig, ExportConfig,
)

job = JobConfig(
    name="my-job",
    search=SearchConfig(
        databases=["ncbi"],
        organism="SARS-CoV-2",
        sequence_type="nucleotide",
        keywords=["complete genome"],
        max_results=10,
    ),
    filter=FilterConfig(min_length=29000, quality_threshold=0.8),
    export=ExportConfig(outdir="results", formats=["fasta", "csv"], prefix="sars"),
)

curator = Biocurator(email="your@email.com")
output_files = curator.run_job(job)
# {"fasta": PosixPath("results/sars_sequences.fasta"), "csv": PosixPath("results/sars_metadata.csv")}

Progress callbacks let you integrate with your own UI:

def on_progress(phase: str, current: int, total: int):
    print(f"[{phase}] {current}/{total}")

curator.run_job(job, progress_callback=on_progress)
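
Any callable with the same three parameters can serve as the callback. As a sketch, the hypothetical `ProgressTracker` below records the latest count per phase so a UI can poll it. It assumes only the (phase, current, total) signature shown above; the phase names "search" and "fetch" are invented for the example.

```python
class ProgressTracker:
    """Callable that remembers the most recent progress for each phase."""

    def __init__(self) -> None:
        self.latest: dict = {}

    def __call__(self, phase: str, current: int, total: int) -> None:
        self.latest[phase] = (current, total)

    def percent(self, phase: str) -> float:
        current, total = self.latest.get(phase, (0, 0))
        # Guard against total == 0 (phase not started or nothing to do).
        return 100.0 * current / total if total else 0.0


tracker = ProgressTracker()
# In real use: curator.run_job(job, progress_callback=tracker)
tracker("search", 1, 1)    # hypothetical phase names
tracker("fetch", 25, 100)
print(tracker.percent("fetch"))  # 25.0
```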

Load a config file programmatically:

from biocurator.config.loader import ConfigLoader
from biocurator.core.curator import Biocurator

config = ConfigLoader.load("config.yaml")
curator = Biocurator(email=config.email)

for job in config.jobs:
    output_files = curator.run_job(job)
    print(f"{job.name}: {list(output_files)}")

Provider internals

Each database provider exposes a QueryBuilder that translates search criteria into a database-specific query string. You can use these directly if you need the query without running the full pipeline:

from biocurator.providers.ncbi import NCBISearchCriteria, get_builder
from biocurator.providers.base import NCBIDatabase

criteria = NCBISearchCriteria(
    database=NCBIDatabase.PUBMED,
    organism="Homo sapiens",
    keywords=["CRISPR"],
)
builder = get_builder(NCBIDatabase.PUBMED)
print(builder.build(criteria))
# '"Homo sapiens"[MeSH Terms] AND "CRISPR"[Title/Abstract]'

# Inspect available search fields
for field, desc in builder.available_fields().items():
    print(f"{field}: {desc}")

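# Get records (returns an iterator).
# ProviderRegistry and DatabaseConfig must be imported as well; their module paths are not shown here.
searcher = ProviderRegistry.get("ncbi", DatabaseConfig(name="NCBI"), "your@email.com")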
ids = searcher.search(criteria)
for record in searcher.fetch_metadata(ids, criteria):
    print(record.accession)

For UniProt:

from biocurator.providers.uniprot import UniProtQueryBuilder, UniProtSearchCriteria

criteria = UniProtSearchCriteria(organism="Mus musculus", reviewed=True)
query = UniProtQueryBuilder().build(criteria)
# 'organism:"Mus musculus" AND reviewed:true'
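
Both query strings shown above follow the same pattern: each criterion becomes one term, and the terms are joined with AND. The sketch below reproduces those two strings with hypothetical stand-alone functions. It is not the library's QueryBuilder, and the field names are copied from the example outputs rather than from the NCBI or UniProt documentation.

```python
from typing import Iterable, Optional


def build_entrez_style_query(terms: Iterable) -> str:
    """Join (value, field) pairs as "value"[field] AND "value"[field] ..."""
    return " AND ".join(f'"{value}"[{field_name}]' for value, field_name in terms)


def build_uniprot_style_query(
    organism: Optional[str] = None,
    reviewed: Optional[bool] = None,
    keywords: Iterable = (),
) -> str:
    """Emit one term per supplied criterion, skipping anything left unset."""
    terms = []
    if organism:
        terms.append(f'organism:"{organism}"')
    terms.extend(f'"{kw}"' for kw in keywords)
    if reviewed is not None:
        terms.append(f"reviewed:{str(reviewed).lower()}")
    return " AND ".join(terms)


print(build_entrez_style_query([("Homo sapiens", "MeSH Terms"), ("CRISPR", "Title/Abstract")]))
# "Homo sapiens"[MeSH Terms] AND "CRISPR"[Title/Abstract]
print(build_uniprot_style_query(organism="Mus musculus", reviewed=True))
# organism:"Mus musculus" AND reviewed:true
```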

Output Files

File	Format	Contents
`<prefix>_sequences.fasta`	FASTA	Downloaded sequences
`<prefix>_metadata.csv`	CSV	Per-sequence metadata
`<prefix>_metadata.json`	JSON	Per-sequence metadata (machine-readable)
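
The naming scheme in the table, together with the one-record-at-a-time writing described under Scalability & Robustness, can be sketched as below. `output_paths` and `write_fasta_stream` are hypothetical helpers, and the 60-character line width is an assumption; biocurator's actual exporter may differ.

```python
from pathlib import Path
from typing import Iterable

# Suffix per export format, as listed in the table above.
SUFFIXES = {
    "fasta": "_sequences.fasta",
    "csv": "_metadata.csv",
    "json": "_metadata.json",
}


def output_paths(outdir: str, prefix: str, formats: Iterable) -> dict:
    """Map each requested format to <outdir>/<prefix><suffix>."""
    return {fmt: Path(outdir) / f"{prefix}{SUFFIXES[fmt]}" for fmt in formats}


def write_fasta_stream(records: Iterable, path: Path, width: int = 60) -> int:
    """Write (header, sequence) pairs one at a time; return how many were written.

    `records` can be a generator, so only one sequence is held in memory.
    """
    path.parent.mkdir(parents=True, exist_ok=True)  # create outdir if absent
    count = 0
    with path.open("w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n")
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")
            count += 1
    return count


paths = output_paths("results", "sars", ["fasta", "csv"])
print(paths["fasta"].name, paths["csv"].name)
# sars_sequences.fasta sars_metadata.csv
```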

Troubleshooting

No sequences returned

Broaden keywords, remove exclude_terms, or increase max_results
Verify the organism name matches NCBI/UniProt taxonomy exactly

NCBI rate-limit errors

NCBI allows 3 requests per second without an API key. The searcher already respects this limit, so large jobs may run slowly.
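
For intuition about why large jobs are slow, here is a minimal sketch of a client-side limiter that spaces calls at least a third of a second apart. `RateLimiter` is a hypothetical class and is not taken from biocurator. NCBI documents a higher limit of 10 requests per second with an API key, but whether biocurator accepts a key is not stated here.

```python
import time
from typing import Callable


class RateLimiter:
    """Space out calls so that at most `max_per_second` happen each second."""

    def __init__(
        self,
        max_per_second: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the time slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot relative to when this call starts.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


limiter = RateLimiter(max_per_second=3.0)
for _ in range(3):
    limiter.wait()  # call this before each HTTP request
```

At 3 requests per second, a job that needs 300 separate requests spends at least about 100 seconds waiting on the limit alone.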

InvalidConfigError: 'email' is required

Add email: your@email.com at the top level of the YAML

ConfigNotFoundError

Check the path passed to biocurator run. Use --dry-run to validate the config before downloading anything.

Enable debug logging

biocurator --debug run config.yaml

License

MIT

Release history

Version	Released
0.2.0	May 20, 2026
0.1.1	May 16, 2026
0.1.0	May 16, 2026

