Skip to main content

A config-driven framework for curating biological sequence datasets

Project description

biocurator

A config-driven framework for curating biological sequence datasets from various databases. Define your search, filter, and export parameters in a YAML file; biocurator handles the rest.

Features

  • Multi-database search — NCBI (nucleotide, protein, SRA) and UniProt
  • Typed config schema — validated YAML with sensible defaults
  • Flexible filtering — length, quality score, organism, keywords, date range
  • Multiple export formats — FASTA, CSV, JSON
  • Rich CLI — progress bars, dry-run mode, per-job filtering

Supported Databases

  • NCBI
  • UniProt

Installation

Requires Python 3.13+.

# With uv (recommended)
uv pip install biocurator

# With pip
pip install biocurator

Quick Start

1. Generate a config file

biocurator init --output config.yaml

This writes a starter YAML to config.yaml. Use --template advanced to include all optional fields:

biocurator init --template advanced --output config.yaml

2. Edit the config

email: your@email.com

jobs:
  covid-genomes:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 50
    filter:
      min_length: 29000
      quality_threshold: 0.8
    export:
      outdir: results/covid
      formats: [fasta, csv]
      prefix: sars_cov2

3. Run a dry-run to preview

biocurator run config.yaml --dry-run
Dry run — 1 job(s) would execute:
  • covid-genomes  databases=['ncbi']

4. Execute

biocurator run config.yaml

Config Reference

Every config file has a top-level email (required for NCBI access) and a jobs map where each key is the job name.

email: your@email.com # required

jobs:
  <job-name>:
    search:
      databases: [ncbi] # required: ncbi | uniprot
      organism: null # e.g. "SARS-CoV-2", "E. coli"
      sequence_type: nucleotide # nucleotide | protein | sra
      keywords: [] # AND-joined with other terms
      max_results: 100
      exclude_terms: [] # excluded from search
      location: null # geographic filter, e.g. "Philippines"
      taxonomy_filter: null # taxon name or ID
      date_range:
        start: "2020/01/01" # YYYY/MM/DD
        end: "2024/12/31"
    filter:
      min_length: null # minimum sequence length (bp / aa)
      max_length: null # maximum sequence length (bp / aa)
      exclude_terms: [] # excluded from title/description
      quality_threshold: null # 0.0–1.0; filters on N/X content
    export:
      outdir: results # output directory (created if absent)
      formats: [fasta] # fasta | csv | json
      prefix: biocurator # filename prefix for output files

Defaults apply to any omitted field, so a minimal job only needs search.databases:

email: your@email.com

jobs:
  simple:
    search:
      databases: [ncbi]
      organism: "Homo sapiens"
    filter: {}
    export: {}

CLI Reference

biocurator init

Generate a starter config file.

Usage: biocurator init [OPTIONS]

Options:
  -o, --output TEXT    Write config to this file instead of stdout
  -t, --template TEXT  Template to use: basic (default) or advanced

biocurator run

Run all jobs defined in a config file.

Usage: biocurator run [OPTIONS] CONFIG

Arguments:
  CONFIG  Path to the YAML config file  [required]

Options:
  -j, --jobs TEXT  Comma-separated job names to run (default: all)
  --dry-run        Validate config and preview jobs without downloading

Run only specific jobs:

biocurator run config.yaml --jobs covid-genomes,spike-proteins

Dry-run before committing:

biocurator run config.yaml --dry-run

Usage Examples

Viral genome surveillance

Collect complete SARS-CoV-2 genomes deposited in 2024, filtered for quality:

email: researcher@uni.edu

jobs:
  sars-cov2-2024:
    search:
      databases: [ncbi]
      organism: "SARS-CoV-2"
      sequence_type: nucleotide
      keywords: ["complete genome"]
      max_results: 500
      exclude_terms: [synthetic, artificial, recombinant]
      date_range:
        start: "2024/01/01"
        end: "2024/12/31"
    filter:
      min_length: 29000
      quality_threshold: 0.9
    export:
      outdir: results/sars_2024
      formats: [fasta, csv]
      prefix: sars_cov2_2024

Antibiotic resistance genes

Collect beta-lactamase nucleotide sequences from NCBI:

email: researcher@uni.edu

jobs:
  beta-lactamase:
    search:
      databases: [ncbi]
      sequence_type: nucleotide
      keywords: ["beta-lactamase", "bla gene"]
      max_results: 300
      exclude_terms: [partial, predicted]
    filter:
      min_length: 500
      max_length: 3000
    export:
      outdir: results/amr
      formats: [fasta, csv, json]
      prefix: bla_genes

Multi-database protein family study

Search both NCBI and UniProt for a protein family in the same job:

email: researcher@uni.edu

jobs:
  cytochrome-p450:
    search:
      databases: [ncbi, uniprot]
      sequence_type: protein
      keywords: ["cytochrome P450", "CYP"]
      max_results: 200
    filter:
      min_length: 300
      quality_threshold: 0.8
    export:
      outdir: results/cyp450
      formats: [fasta, csv]
      prefix: cyp450

Multiple independent jobs in one run

email: researcher@uni.edu

jobs:
  spike-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["spike glycoprotein"]
      max_results: 100
    filter:
      min_length: 1200
    export:
      outdir: results/spike
      formats: [fasta]
      prefix: spike

  nucleocapsid-proteins:
    search:
      databases: [uniprot]
      organism: "SARS-CoV-2"
      sequence_type: protein
      keywords: ["nucleocapsid protein"]
      max_results: 100
    filter:
      min_length: 400
    export:
      outdir: results/ncap
      formats: [fasta]
      prefix: ncap

Run them together or selectively:

# All jobs
biocurator run config.yaml

# Only one
biocurator run config.yaml --jobs spike-proteins

Python API

You can drive curation from Python directly using Biocurator.run_job():

from biocurator.core.curator import Biocurator
from biocurator.config.schema import (
    JobConfig, SearchConfig, FilterConfig, ExportConfig,
)

job = JobConfig(
    name="my-job",
    search=SearchConfig(
        databases=["ncbi"],
        organism="SARS-CoV-2",
        sequence_type="nucleotide",
        keywords=["complete genome"],
        max_results=10,
    ),
    filter=FilterConfig(min_length=29000, quality_threshold=0.8),
    export=ExportConfig(outdir="results", formats=["fasta", "csv"], prefix="sars"),
)

curator = Biocurator(email="your@email.com")
output_files = curator.run_job(job)
# {"fasta": PosixPath("results/sars_sequences.fasta"), "csv": PosixPath("results/sars_metadata.csv")}

Progress callbacks let you integrate with your own UI:

def on_progress(phase: str, current: int, total: int):
    print(f"[{phase}] {current}/{total}")

curator.run_job(job, progress_callback=on_progress)

Load a config file programmatically:

from biocurator.config.loader import ConfigLoader

config = ConfigLoader.load("config.yaml")
curator = Biocurator(email=config.email)

for job in config.jobs:
    output_files = curator.run_job(job)
    print(f"{job.name}: {list(output_files)}")

Output Files

File Format Contents
<prefix>_sequences.fasta FASTA Downloaded sequences
<prefix>_metadata.csv CSV Per-sequence metadata
<prefix>_metadata.json JSON Per-sequence metadata (machine-readable)

Troubleshooting

No sequences returned

  • Broaden keywords, remove exclude_terms, or increase max_results
  • Verify the organism name matches NCBI/UniProt taxonomy exactly

NCBI rate-limit errors

  • NCBI enforces 3 requests/second without an API key; the searcher already respects this, but heavy jobs may be slow

InvalidConfigError: 'email' is required

  • Add email: your@email.com at the top level of the YAML

ConfigNotFoundError

  • Check the path passed to biocurator run — use --dry-run to validate before downloading

Enable debug logging

biocurator --debug run config.yaml

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biocurator-0.1.1.tar.gz (26.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

biocurator-0.1.1-py3-none-any.whl (28.4 kB view details)

Uploaded Python 3

File details

Details for the file biocurator-0.1.1.tar.gz.

File metadata

  • Download URL: biocurator-0.1.1.tar.gz
  • Upload date:
  • Size: 26.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for biocurator-0.1.1.tar.gz
Algorithm Hash digest
SHA256 a07497755c3489d7977394b443f6d9a5bd407ad222eb6937efc262791252a894
MD5 1d1b97e2285453add04cf9ec09ce392a
BLAKE2b-256 3c47deb1265bdc1edf011d51967dc87f08b2ad7f10fd6817155a715bd581a94d

See more details on using hashes here.

File details

Details for the file biocurator-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: biocurator-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 28.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for biocurator-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2c81ea0ac04438de800bdc71ca854986ef240b9cac39dae5a9cad4463665ca71
MD5 275175ee661937aa62221c5b33e51c81
BLAKE2b-256 1f928f5617234516acd48783ecc1a8c8d61409e1228d3299514572401951df4c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page