FlyBase sync/query helper for agents.

FlyBase local sync/query

Use FlyBase bulk files for agent workloads. The live API is treated as a helper only.

Why

  • https://api.flybase.org/api/v1.0/ exists.
  • some endpoints return useful JSON today, e.g. domain/FBgn0001250 and sequence/id/FBgn0001250.
  • some plausible-looking endpoints return an empty body today.
  • bulk bucket + release files: better for repeatable agent queries.
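
A minimal sketch of why the live API needs defensive handling: the caller has to treat an empty body as "no answer" rather than as JSON. The names api_url, decode_body, and fetch are hypothetical and are not part of this package; only the API root and the example endpoint path come from the list above.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://api.flybase.org/api/v1.0/"  # API root listed above


def api_url(path: str) -> str:
    """Join an endpoint path such as 'domain/FBgn0001250' onto the API root."""
    return API_ROOT + path.lstrip("/")


def decode_body(body: bytes):
    """Return parsed JSON, or None when the endpoint answered with an empty body."""
    if not body.strip():
        return None
    return json.loads(body)


def fetch(path: str, timeout: float = 10.0):
    """Hypothetical helper: GET one endpoint and decode it (needs network access)."""
    with urlopen(api_url(path), timeout=timeout) as resp:
        return decode_body(resp.read())
```

A caller would check for None before using the result, e.g. `fetch("domain/FBgn0001250")`, and fall back to the local bulk files otherwise.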

Current surfaces checked

  • release bucket: https://s3ftp.flybase.org/releases/current/
  • precomputed files: https://s3ftp.flybase.org/releases/current/precomputed_files/
  • Postgres dump: https://s3ftp.flybase.org/releases/current/psql/FB2026_01.sql.gz
  • API root: https://api.flybase.org/api/v1.0/
  • batch download: https://flybase.org/batchdownload

Layout

  • src/flybase_cli/: package code
  • tests/: stdlib unittest
  • flybase_cli.py: thin repo-root shim
  • pyproject.toml: package metadata / console entrypoint

Install

PyPI with pipx:

pipx install flybase

PyPI with plain pip:

python3 -m pip install flybase

Homebrew:

brew tap gumadeiras/tap
brew install flybase

From source:

python3 -m pip install -e .

Release

Current release: v0.1.4.

Tag pushes like vX.Y.Z run the release workflow: build artifacts, create a GitHub release, publish to PyPI, and update gumadeiras/homebrew-tap.

Release prerequisites:

  • PyPI trusted publishing configured for this repo.
  • HOMEBREW_TAP_TOKEN repository secret can write to gumadeiras/homebrew-tap.

CLI

python3 flybase_cli.py presets

python3 flybase_cli.py sync gene-core

python3 flybase_cli.py sync gene-core --release FB2026_01

python3 flybase_cli.py sync gene-knowledge --release FB2026_01

python3 flybase_cli.py full-sync --release FB2026_01

python3 flybase_cli.py full-sync \
  --release FB2026_01 \
  --include 'best_gene_summary|entity_publication'

python3 flybase_cli.py sync-incremental \
  gene-knowledge \
  --from-release FB2025_06 \
  --release FB2026_01

python3 flybase_cli.py release-diff \
  --preset gene-knowledge \
  --from-release FB2025_06 \
  --to-release FB2026_01

python3 flybase_cli.py genomes --release FB2026_01

python3 flybase_cli.py sync-genome \
  --release FB2026_01 \
  --genome dmel_r6.67 \
  --section fasta \
  --asset mirna

python3 flybase_cli.py genome-presets

python3 flybase_cli.py sync-genome \
  --release FB2026_01 \
  --genome dmel_r6.67 \
  --preset mirna-fasta

PYTHONPATH=src python3 -m flybase_cli sync gene-expression

python3 flybase_cli.py manifest \
  --url https://s3ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r6.67_FB2026_01/fasta/ \
  --include 'miRNA'

python3 flybase_cli.py sync-url \
  --url https://s3ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r6.67_FB2026_01/fasta/ \
  --include 'miRNA'

python3 flybase_cli.py ingest \
  data/flybase/precomputed_files/genes/best_gene_summary_fb_2026_01.tsv.gz \
  data/flybase/precomputed_files/genes/fbgn_fbtr_fbpp_fb_2026_01.tsv.gz \
  data/flybase/precomputed_files/genes/fbgn_annotation_ID_fb_2026_01.tsv.gz

python3 flybase_cli.py tables --columns

python3 flybase_cli.py describe --sample-values 2
python3 flybase_cli.py schema-export --sample-values 1
python3 flybase_cli.py query-plan --sample-values 1 --limit 5
python3 flybase_cli.py query-run --template-name gene-summary-by-fbgn --param fbgn_id=FBgn0002121

python3 flybase_cli.py fts-build

python3 flybase_cli.py search 'memory formation'

python3 flybase_cli.py pg-load --release FB2026_01

python3 flybase_cli.py sql \
  "select * from fb_best_gene_summary_fb_2026_01 limit 5"

python3 flybase_cli.py sql \
  "select s.fbgn_id, s.gene_symbol, a.annotation_id, p.flybase_fbtr, p.flybase_fbpp \
   from fb_best_gene_summary_fb_2026_01 s \
   join fb_fbgn_annotation_id_fb_2026_01 a on a.primary_fbgn = s.fbgn_id \
   left join fb_fbgn_fbtr_fbpp_fb_2026_01 p on p.flybase_fbgn = s.fbgn_id \
   limit 5"

python3 flybase_cli.py api domain/FBgn0001250

Sync presets

  • gene-core: summaries + FBgn/FBtr/FBpp + annotation IDs + SO annotations
  • gene-expression: curated/high-throughput/scRNA expression slices
  • references: publication/link tables
  • gene-knowledge: core gene facts + representative publications + orthology tables
  • orthology: ortholog, paralog, and disease-association tables
  • interactions: gene- and allele-level interaction tables

Full sync

  • full-sync crawls an entire release prefix, default precomputed_files/
  • default behavior: download only files the current loaders can ingest into SQLite
  • use --all-files if you want non-ingestable release artifacts too
  • use --include / --exclude to stage a narrower smoke or partial warehouse
  • default manifest path: data/flybase/manifests/<release>/full-sync.json
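
The --include / --exclude narrowing can be pictured with a small sketch. It assumes the patterns are regular expressions searched anywhere in the file name, which the `a|b` form of the include example suggests but this README does not confirm. select_files is a hypothetical name, and entity_publication_fb_2026_01.tsv.gz is an illustrative file name.

```python
import re


def select_files(names, include=None, exclude=None):
    """Keep names matching `include` (if given) and not matching `exclude` (if given).

    Assumption: both patterns are regexes applied with re.search.
    """
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    kept = []
    for name in names:
        if inc and not inc.search(name):
            continue
        if exc and exc.search(name):
            continue
        kept.append(name)
    return kept


# Illustrative listing; the last name is made up for the example.
listing = [
    "best_gene_summary_fb_2026_01.tsv.gz",
    "fbgn_fbtr_fbpp_fb_2026_01.tsv.gz",
    "entity_publication_fb_2026_01.tsv.gz",
]
smoke_set = select_files(listing, include="best_gene_summary|entity_publication")
```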

Discovery

  • genomes --release FB2026_01 lists genome builds linked from that FlyBase release
  • sync-url turns a crawlable FlyBase directory URL into a one-step local sync
  • sync-genome resolves a release/build pair into the right genome-section URL automatically
  • genome-presets lists reusable genome asset sync recipes
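
A sketch of the release/build resolution that sync-genome performs, inferred only from the genome FASTA URL used in the CLI examples. The function name and the species lookup table are assumptions; the real command may discover the URL differently (for instance by crawling the release listing).

```python
GENOMES_ROOT = "https://s3ftp.flybase.org/genomes/"

# Only this mapping is evidenced by the README's example URL; others are unknown here.
SPECIES_DIRS = {"dmel": "Drosophila_melanogaster"}


def genome_section_url(release: str, genome: str, section: str) -> str:
    """Build <root>/<species>/<genome>_<release>/<section>/ from a release/build pair."""
    prefix = genome.split("_", 1)[0]      # 'dmel_r6.67' -> 'dmel'
    species = SPECIES_DIRS[prefix]        # raises KeyError for unmapped species
    return f"{GENOMES_ROOT}{species}/{genome}_{release}/{section}/"
```

The resulting URL is the kind of directory that manifest --url and sync-url accept.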

Genome sync

  • sections: fasta, gff, gtf, dna, chado-xml
  • asset shortcuts include mirna, transcript, translation, gene, chromosome, cds, ncrna, gff, gtf
  • presets include mirna-fasta, transcript-fasta, translation-fasta, gene-fasta, chromosome-fasta, ncrna-fasta, gff-all, gtf-all
  • use --include/--exclude for narrower file selection on top of the asset preset

Ingest formats

  • delimited: tsv, csv, gzipped variants
  • sequence: fasta, fa, fna, faa, gzipped variants
  • annotation: gff, gff3, gtf, gzipped variants
  • JSON: json, json.gz
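
One way to route a file to a loader family is by extension, peeling off a trailing .gz first. This is a sketch under that assumption; classify and FORMATS are hypothetical names, and the package may detect formats differently.

```python
from pathlib import Path

# Extension families taken from the list above.
FORMATS = {
    "delimited": {"tsv", "csv"},
    "sequence": {"fasta", "fa", "fna", "faa"},
    "annotation": {"gff", "gff3", "gtf"},
    "json": {"json"},
}


def classify(path):
    """Return (family, gzipped); family is None when the extension is not recognised."""
    suffixes = [s.lstrip(".").lower() for s in Path(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == "gz"
    if gzipped:
        suffixes = suffixes[:-1]
    ext = suffixes[-1] if suffixes else ""
    for family, extensions in FORMATS.items():
        if ext in extensions:
            return family, gzipped
    return None, gzipped
```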

JSON ingest

  • top-level scalar JSON fields become queryable SQLite columns
  • one nested dict level is flattened, eg gene.symbol -> gene_symbol
  • repeated top-level lists become child tables, eg symbolSynonyms -> <table>_symbolsynonyms
  • repeated lists nested inside child dict rows become descendant tables, eg genomeLocations[].exons[] -> <table>_genomelocations_exons
  • full source record remains in payload_json

Example:

python3 flybase_cli.py sql \
  "select record_id, symbol, gene_geneId from fb_ncrna_genes_fb_2026_01 limit 5"

python3 flybase_cli.py sql \
  "select parent_record_id, ordinal, value \
   from fb_ncrna_genes_fb_2026_01_symbolsynonyms \
   limit 5"

python3 flybase_cli.py sql \
  "select parent_record_id, parent_ordinal, ordinal, startPosition, endPosition \
   from fb_ncrna_genes_fb_2026_01_genomelocations_exons \
   limit 5"
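
The flattening rules above can be sketched as follows. This is an illustration, not the package's loader: flatten_record is a hypothetical name, ordinals are assumed to be 0-based, and the sample record is made up.

```python
import json


def is_scalar(value):
    return value is None or isinstance(value, (str, int, float, bool))


def scalar_fields(item):
    """Scalar columns of a dict item; non-dict items land in a single 'value' column."""
    if isinstance(item, dict):
        return {k: v for k, v in item.items() if is_scalar(v)}
    return {"value": item}


def flatten_record(table, record_id, record):
    """Return {table_name: [row, ...]} for one JSON record."""
    tables = {table: []}
    # The full source record is kept alongside the flattened columns.
    row = {"record_id": record_id, "payload_json": json.dumps(record, sort_keys=True)}
    for key, value in record.items():
        if is_scalar(value):
            row[key] = value                                   # top-level scalar
        elif isinstance(value, dict):
            for sub_key, sub_value in scalar_fields(value).items():
                row[f"{key}_{sub_key}"] = sub_value            # gene.symbol -> gene_symbol
        elif isinstance(value, list):
            child = f"{table}_{key.lower()}"                   # <table>_symbolsynonyms
            for ordinal, item in enumerate(value):
                lineage = {"parent_record_id": record_id, "ordinal": ordinal}
                tables.setdefault(child, []).append({**lineage, **scalar_fields(item)})
                if not isinstance(item, dict):
                    continue
                for sub_key, sub_list in item.items():
                    if not isinstance(sub_list, list):
                        continue
                    grandchild = f"{child}_{sub_key.lower()}"  # <table>_genomelocations_exons
                    for sub_ordinal, sub_item in enumerate(sub_list):
                        lineage = {
                            "parent_record_id": record_id,
                            "parent_ordinal": ordinal,
                            "ordinal": sub_ordinal,
                        }
                        tables.setdefault(grandchild, []).append(
                            {**lineage, **scalar_fields(sub_item)}
                        )
    tables[table].append(row)
    return tables


# Made-up record, shaped only to exercise each rule.
sample = {
    "symbol": "example",
    "gene": {"geneId": "FBgn0000000", "symbol": "ex"},
    "symbolSynonyms": ["a", "b"],
    "genomeLocations": [{"assembly": "R6", "exons": [{"startPosition": 1, "endPosition": 10}]}],
}
flat = flatten_record("t", "r1", sample)
```

Here flat holds rows for three derived tables next to the parent table t: t_symbolsynonyms, t_genomelocations, and t_genomelocations_exons.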

Search

  • fts-build creates a local SQLite FTS5 index from ingested tables
  • search queries that index without calling the live FlyBase API
  • record ids prefer stable FlyBase-like columns such as fbgn_id, primary_fbgn, flybase_fbtr
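
For readers unfamiliar with FTS5, a minimal sketch of a local full-text index using Python's stdlib sqlite3. The table name fts_docs, its columns, and both function names are assumptions, not the schema fts-build actually creates, and the indexed rows are placeholders. It requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def build_fts(conn, rows):
    """Index (record_id, source_table, body) tuples in an FTS5 virtual table."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS fts_docs "
        "USING fts5(record_id UNINDEXED, source_table UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO fts_docs (record_id, source_table, body) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def search(conn, query, limit=10):
    """Return (record_id, source_table) pairs, best match first."""
    cur = conn.execute(
        "SELECT record_id, source_table FROM fts_docs "
        "WHERE fts_docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_fts(conn, [
    ("FBgn0000001", "summaries", "placeholder text about memory formation in flies"),
    ("FBgn0000002", "summaries", "placeholder text about wing development"),
])
hits = search(conn, "memory formation")
```

With the default FTS5 query syntax, a two-word query like this matches rows that contain both terms.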

Metadata

  • describe summarizes ingested tables with row counts, source paths, semantic tags, columns, and representative non-empty values
  • schema-export writes the same metadata to a deterministic JSON artifact beside the SQLite DB, eg FB2026_01.schema.json
  • schema-export also includes inferred relationships for nested child tables and common FlyBase ID joins
  • schema-export also emits semantic_summary for table/entity tag coverage
  • schema-export also emits ready-to-run query_templates
  • query-plan prints starter SQL without the larger schema payload
  • query-plan now includes named biological templates such as gene-summary-by-fbgn, transcript-protein-links, publications-for-gene, and coordinate lookups when matching tables exist
  • query-run selects one template and executes it with parameter values
  • useful first step before writing ad hoc SQL or building agent query plans

Example:

python3 flybase_cli.py schema-export \
  --db data/flybase/FB2026_01.sqlite \
  --sample-values 1

python3 flybase_cli.py query-plan \
  --db data/flybase/FB2026_01.sqlite \
  --sample-values 1 \
  --limit 5

python3 flybase_cli.py query-run \
  --db data/flybase/FB2026_01.sqlite \
  --template-name gene-summary-by-fbgn \
  --param fbgn_id=FBgn0002121
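
A sketch of what running a named template with parameters can look like in plain sqlite3. The template SQL, the output shape, and the function names are assumptions for illustration; only the template name, the --param key=value form, and the table/column names (from the sql examples earlier) come from this README.

```python
import sqlite3

# Hypothetical template text; the real query_templates may differ.
TEMPLATES = {
    "gene-summary-by-fbgn": (
        "SELECT fbgn_id, gene_symbol "
        "FROM fb_best_gene_summary_fb_2026_01 "
        "WHERE fbgn_id = :fbgn_id"
    ),
}


def parse_params(pairs):
    """Turn ['fbgn_id=FBgn0002121'] into {'fbgn_id': 'FBgn0002121'}."""
    return dict(pair.split("=", 1) for pair in pairs)


def run_template(conn, name, params):
    """Execute one template with bound parameters and return record-oriented output."""
    cur = conn.execute(TEMPLATES[name], params)   # named placeholders, no string formatting
    columns = [col[0] for col in cur.description]
    records = [dict(zip(columns, row)) for row in cur]
    return {"template": name, "row_count": len(records), "records": records}
```

Binding values through named placeholders keeps agent-supplied parameters out of the SQL text itself.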

Notes

  • nested JSON child tables keep lineage columns like parent_record_id, parent_ordinal, ordinal.
  • many FlyBase files start with ## metadata lines; loader skips those.
  • sync writes a preset manifest under data/flybase/manifests/<release>/.
  • full-sync is the broadest offline path for release bulk data without going through the full Postgres dump.
  • sync --release FB2026_01 defaults to data/flybase/FB2026_01.sqlite to avoid cross-release mixing.
  • sync-incremental uses stable manifest keys so release-renamed files still land in updated instead of noisy add/remove pairs.
  • release-diff compares releases either by raw prefix or by curated multi-prefix preset.
  • manifest --url lets you crawl non-releases/ FlyBase directories such as genome FASTA/GFF trees.
  • sync-url is the shortest path for genome assets once you know the directory URL.
  • sync-genome is the shortest path when you know the FlyBase release + genome build label.
  • sync-genome --preset ... is the preferred path for common genome asset pulls.
  • some FlyBase .gff.gz assets are tar-wrapped gzip archives; loader handles that transparently.
  • sql and query-run shape results as record-oriented JSON with summary metadata for agent chaining.
  • pg-load stages the full Postgres import script for releases/<release>/psql/<release>.sql.gz.
  • pg-load --execute runs the staged script when createdb and psql are installed locally.
  • SQLite keeps setup minimal; switch to DuckDB/Postgres if you want bigger joins/faster scans.
  • if you only need a few IDs, FlyBase Batch Download may be simpler than syncing files.
  • use --no-header for files whose first non-comment row is data, not column names.
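
The notes on ## metadata lines and --no-header can be sketched as a small reader. This is a simplification with hypothetical names: it drops every line starting with ##, treats the first remaining row as the header unless no_header is set, strips a single leading # from header names, and invents col_N names otherwise. The package's loader may pick headers differently.

```python
import csv
import gzip


def parse_delimited(lines, delimiter="\t", no_header=False):
    """Return (header, rows) from an iterable of text lines."""
    kept = (ln for ln in lines if ln.strip() and not ln.startswith("##"))
    # QUOTE_NONE: assume quote characters in bulk files are literal text.
    rows = list(csv.reader(kept, delimiter=delimiter, quoting=csv.QUOTE_NONE))
    if not rows:
        return [], []
    if no_header:
        width = max(len(row) for row in rows)
        return [f"col_{i + 1}" for i in range(width)], rows
    header = [name.lstrip("#").strip() for name in rows[0]]
    return header, rows[1:]


def read_delimited(path, **kwargs):
    """Open a plain or gzipped file and parse it with parse_delimited."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        return parse_delimited(handle, **kwargs)
```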

Tests

python3 -m unittest discover -s tests

Download files

Source Distribution

flybase-0.1.4.tar.gz (35.6 kB view details)

Uploaded Source

Built Distribution

flybase-0.1.4-py3-none-any.whl (33.6 kB view details)

Uploaded Python 3

File details

Details for the file flybase-0.1.4.tar.gz.

File metadata

  • Download URL: flybase-0.1.4.tar.gz
  • Upload date:
  • Size: 35.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flybase-0.1.4.tar.gz
Algorithm Hash digest
SHA256 cbe902c0b21507d3c16a9a8b3881c574af1c8c5490d29fa55ce95735cbb34d37
MD5 cf9f7690e2c773055578ac2d112452f1
BLAKE2b-256 7a3781b2ad29fac35d102ded6108bc6dd0421d280476593bf6036ceebdebcf9e

Provenance

The following attestation bundles were made for flybase-0.1.4.tar.gz:

Publisher: release.yml on gumadeiras/flybase-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file flybase-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: flybase-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 33.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for flybase-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 68519bd8dea217ed7966ee0689b378505fea2d8f4e3527c64ce0508d24ad90a2
MD5 79b45ea96e7dde7646b61c2531d08b36
BLAKE2b-256 40540f74934556c7ebd09435cd5711729a9eb9e4d26f965bf5d06c52a3fd428f

Provenance

The following attestation bundles were made for flybase-0.1.4-py3-none-any.whl:

Publisher: release.yml on gumadeiras/flybase-cli
