FlyBase sync/query helper for agents.

FlyBase local sync/query

Use FlyBase bulk release files for agent workloads. The live API is a helper only.

Why

  • https://api.flybase.org/api/v1.0/ exists.
  • some endpoints return useful JSON now, e.g. domain/FBgn0001250, sequence/id/FBgn0001250.
  • some plausible-looking endpoints return an empty body today.
  • bulk bucket + release files: better for repeatable agent queries.
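
A minimal sketch of why the live API is treated as a helper only: a caller has to cope with endpoints that answer with an empty body. The function names here (parse_api_body, fetch) are hypothetical and are not part of flybase_cli; only the API root and the example endpoint come from the list above.

```python
import json
import urllib.request

API_ROOT = "https://api.flybase.org/api/v1.0/"


def parse_api_body(body: bytes):
    """Return decoded JSON, or None when the body is empty or not valid JSON."""
    text = body.decode("utf-8", errors="replace").strip()
    if not text:
        return None  # the endpoint answered, but with an empty body
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # something other than JSON came back


def fetch(endpoint: str, timeout: float = 10.0):
    """Fetch one endpoint under API_ROOT (requires network access)."""
    with urllib.request.urlopen(API_ROOT + endpoint, timeout=timeout) as resp:
        return parse_api_body(resp.read())


# Usage (needs network): fetch("domain/FBgn0001250")
```

Returning None for both the empty and the non-JSON case keeps the caller's fallback simple: if the result is None, query the local bulk data instead.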

Current surfaces checked

  • release bucket: https://s3ftp.flybase.org/releases/current/
  • precomputed files: https://s3ftp.flybase.org/releases/current/precomputed_files/
  • Postgres dump: https://s3ftp.flybase.org/releases/current/psql/FB2026_01.sql.gz
  • API root: https://api.flybase.org/api/v1.0/
  • batch download: https://flybase.org/batchdownload

Layout

  • src/flybase_cli/: package code
  • tests/: stdlib unittest
  • flybase_cli.py: thin repo-root shim
  • pyproject.toml: package metadata / console entrypoint

CLI

python3 flybase_cli.py presets

python3 flybase_cli.py sync gene-core

python3 flybase_cli.py sync gene-core --release FB2026_01

python3 flybase_cli.py sync gene-knowledge --release FB2026_01

python3 flybase_cli.py full-sync --release FB2026_01

python3 flybase_cli.py full-sync \
  --release FB2026_01 \
  --include 'best_gene_summary|entity_publication'

python3 flybase_cli.py sync-incremental \
  gene-knowledge \
  --from-release FB2025_06 \
  --release FB2026_01

python3 flybase_cli.py release-diff \
  --preset gene-knowledge \
  --from-release FB2025_06 \
  --to-release FB2026_01

python3 flybase_cli.py genomes --release FB2026_01

python3 flybase_cli.py sync-genome \
  --release FB2026_01 \
  --genome dmel_r6.67 \
  --section fasta \
  --asset mirna

python3 flybase_cli.py genome-presets

python3 flybase_cli.py sync-genome \
  --release FB2026_01 \
  --genome dmel_r6.67 \
  --preset mirna-fasta

PYTHONPATH=src python3 -m flybase_cli sync gene-expression

python3 flybase_cli.py manifest \
  --url https://s3ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r6.67_FB2026_01/fasta/ \
  --include 'miRNA'

python3 flybase_cli.py sync-url \
  --url https://s3ftp.flybase.org/genomes/Drosophila_melanogaster/dmel_r6.67_FB2026_01/fasta/ \
  --include 'miRNA'

python3 flybase_cli.py ingest \
  data/flybase/precomputed_files/genes/best_gene_summary_fb_2026_01.tsv.gz \
  data/flybase/precomputed_files/genes/fbgn_fbtr_fbpp_fb_2026_01.tsv.gz \
  data/flybase/precomputed_files/genes/fbgn_annotation_ID_fb_2026_01.tsv.gz

python3 flybase_cli.py tables --columns

python3 flybase_cli.py describe --sample-values 2
python3 flybase_cli.py schema-export --sample-values 1
python3 flybase_cli.py query-plan --sample-values 1 --limit 5
python3 flybase_cli.py query-run --template-name gene-summary-by-fbgn --param fbgn_id=FBgn0002121

python3 flybase_cli.py fts-build

python3 flybase_cli.py search 'memory formation'

python3 flybase_cli.py pg-load --release FB2026_01

python3 flybase_cli.py sql \
  "select * from fb_best_gene_summary_fb_2026_01 limit 5"

python3 flybase_cli.py sql \
  "select s.fbgn_id, s.gene_symbol, a.annotation_id, p.flybase_fbtr, p.flybase_fbpp \
   from fb_best_gene_summary_fb_2026_01 s \
   join fb_fbgn_annotation_id_fb_2026_01 a on a.primary_fbgn = s.fbgn_id \
   left join fb_fbgn_fbtr_fbpp_fb_2026_01 p on p.flybase_fbgn = s.fbgn_id \
   limit 5"

python3 flybase_cli.py api domain/FBgn0001250

Sync presets

  • gene-core: summaries + FBgn/FBtr/FBpp + annotation IDs + SO annotations
  • gene-expression: curated/high-throughput/scRNA expression slices
  • references: publication/link tables
  • gene-knowledge: core gene facts + representative publications + orthology tables
  • orthology: ortholog, paralog, and disease-association tables
  • interactions: gene- and allele-level interaction tables

Full sync

  • full-sync crawls an entire release prefix, default precomputed_files/
  • default behavior: download only files the current loaders can ingest into SQLite
  • use --all-files if you want non-ingestable release artifacts too
  • use --include / --exclude to stage a narrower smoke or partial warehouse
  • default manifest path: data/flybase/manifests/<release>/full-sync.json
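
As a rough illustration of how --include / --exclude narrowing could behave, the sketch below filters a file listing with regular expressions. It assumes the patterns are regexes searched anywhere in the path, which the 'best_gene_summary|entity_publication' example suggests but the source does not state. select_files is a hypothetical helper, and the references/ path is illustrative.

```python
import re


def select_files(names, include=None, exclude=None):
    """Keep names that match `include` (if given) and do not match `exclude`."""
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    selected = []
    for name in names:
        if inc and not inc.search(name):
            continue  # an include pattern was given and this name misses it
        if exc and exc.search(name):
            continue  # explicitly excluded
        selected.append(name)
    return selected


names = [
    "genes/best_gene_summary_fb_2026_01.tsv.gz",
    "genes/fbgn_annotation_ID_fb_2026_01.tsv.gz",
    "references/entity_publication_fb_2026_01.tsv.gz",  # illustrative path
]
smoke = select_files(names, include="best_gene_summary|entity_publication")
print(smoke)
```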

Discovery

  • genomes --release FB2026_01 lists genome builds linked from that FlyBase release
  • sync-url turns a crawlable FlyBase directory URL into a one-step local sync
  • sync-genome resolves a release/build pair into the right genome-section URL automatically
  • genome-presets lists reusable genome asset sync recipes

Genome sync

  • sections: fasta, gff, gtf, dna, chado-xml
  • asset shortcuts include mirna, transcript, translation, gene, chromosome, cds, ncrna, gff, gtf
  • presets include mirna-fasta, transcript-fasta, translation-fasta, gene-fasta, chromosome-fasta, ncrna-fasta, gff-all, gtf-all
  • use --include/--exclude for narrower file selection on top of the asset preset
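
The release/build-to-URL resolution can be pictured with a small sketch. The URL shape is taken from the manifest --url example in the CLI section. The species lookup table and the function name are assumptions made for illustration, and only the dmel prefix is mapped.

```python
GENOMES_ROOT = "https://s3ftp.flybase.org/genomes/"

# Assumption: only the dmel build prefix is mapped; other species need entries.
SPECIES_DIRS = {"dmel": "Drosophila_melanogaster"}


def genome_section_url(release: str, genome: str, section: str) -> str:
    """Build the directory URL for one section of a genome build."""
    prefix = genome.split("_", 1)[0]  # "dmel_r6.67" -> "dmel"
    species = SPECIES_DIRS[prefix]    # raises KeyError for unmapped species
    return f"{GENOMES_ROOT}{species}/{genome}_{release}/{section}/"


print(genome_section_url("FB2026_01", "dmel_r6.67", "fasta"))
```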

Ingest formats

  • delimited: tsv, csv, gzipped variants
  • sequence: fasta, fa, fna, faa, gzipped variants
  • annotation: gff, gff3, gtf, gzipped variants
  • JSON: json, json.gz
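
One way to map a file name onto the format families above is to look at the last extension before an optional .gz. This is a sketch of that idea only, not the loader's actual dispatch logic; classify and FORMAT_FAMILIES are hypothetical names.

```python
from pathlib import PurePosixPath

FORMAT_FAMILIES = {
    "delimited": {"tsv", "csv"},
    "sequence": {"fasta", "fa", "fna", "faa"},
    "annotation": {"gff", "gff3", "gtf"},
    "json": {"json"},
}


def classify(path: str):
    """Return (family, is_gzipped); family is None for unknown extensions."""
    suffixes = [s.lstrip(".").lower() for s in PurePosixPath(path).suffixes]
    gz = bool(suffixes) and suffixes[-1] == "gz"
    if gz:
        suffixes = suffixes[:-1]  # look at the extension underneath .gz
    ext = suffixes[-1] if suffixes else ""
    for family, exts in FORMAT_FAMILIES.items():
        if ext in exts:
            return family, gz
    return None, gz


print(classify("best_gene_summary_fb_2026_01.tsv.gz"))
```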

JSON ingest

  • top-level scalar JSON fields become queryable SQLite columns
  • one nested dict level is flattened, e.g. gene.symbol -> gene_symbol
  • repeated top-level lists become child tables, e.g. symbolSynonyms -> <table>_symbolsynonyms
  • repeated lists nested inside child dict rows become descendant tables, e.g. genomeLocations[].exons[] -> <table>_genomelocations_exons
  • full source record remains in payload_json
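
The flattening rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's loader: flatten_record is a hypothetical function, ordinals are assumed to start at 0, and the record content (including the FBgn0000000 placeholder) is made up.

```python
import json


def flatten_record(record_id, record, table="t"):
    """Split one JSON object into a parent row plus child-table rows."""
    parent = {"record_id": record_id,
              "payload_json": json.dumps(record, sort_keys=True)}
    children = {}  # child table name -> list of row dicts
    for key, value in record.items():
        if isinstance(value, dict):
            # One nested dict level is flattened: gene.symbol -> gene_symbol
            for sub_key, sub_value in value.items():
                if not isinstance(sub_value, (dict, list)):
                    parent[f"{key}_{sub_key}"] = sub_value
        elif isinstance(value, list):
            child_table = f"{table}_{key.lower()}"
            rows = children.setdefault(child_table, [])
            for ordinal, item in enumerate(value):
                row = {"parent_record_id": record_id, "ordinal": ordinal}
                if not isinstance(item, dict):
                    row["value"] = item
                    rows.append(row)
                    continue
                for k, v in item.items():
                    if isinstance(v, list):
                        # Lists inside child dict rows become descendant tables.
                        desc = children.setdefault(f"{child_table}_{k.lower()}", [])
                        for j, d in enumerate(v):
                            drow = {"parent_record_id": record_id,
                                    "parent_ordinal": ordinal, "ordinal": j}
                            drow.update(d if isinstance(d, dict) else {"value": d})
                            desc.append(drow)
                    elif not isinstance(v, dict):
                        row[k] = v
                rows.append(row)
        else:
            parent[key] = value  # top-level scalar becomes a column
    return parent, children


record = {  # made-up record for illustration
    "symbol": "example",
    "gene": {"geneId": "FBgn0000000", "symbol": "example"},
    "symbolSynonyms": ["a", "b"],
    "genomeLocations": [{"assembly": "R6",
                         "exons": [{"startPosition": 1, "endPosition": 10}]}],
}
parent, children = flatten_record("r1", record)
print(parent["gene_geneId"], sorted(children))
```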

Example:

python3 flybase_cli.py sql \
  "select record_id, symbol, gene_geneId from fb_ncrna_genes_fb_2026_01 limit 5"

python3 flybase_cli.py sql \
  "select parent_record_id, ordinal, value \
   from fb_ncrna_genes_fb_2026_01_symbolsynonyms \
   limit 5"

python3 flybase_cli.py sql \
  "select parent_record_id, parent_ordinal, ordinal, startPosition, endPosition \
   from fb_ncrna_genes_fb_2026_01_genomelocations_exons \
   limit 5"

Search

  • fts-build creates a local SQLite FTS5 index from ingested tables
  • search queries that index without calling the live FlyBase API
  • record ids prefer stable FlyBase-like columns such as fbgn_id, primary_fbgn, flybase_fbtr
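
For readers unfamiliar with FTS5, here is a self-contained sketch of a local full-text index in SQLite. The table name gene_fts, its columns, and the sample rows are assumptions made for illustration, and the schema that fts-build creates may differ. It also assumes the Python sqlite3 build includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# record_id is stored but not indexed; only `body` is searchable.
conn.execute("CREATE VIRTUAL TABLE gene_fts USING fts5(record_id UNINDEXED, body)")
conn.executemany(
    "INSERT INTO gene_fts (record_id, body) VALUES (?, ?)",
    [
        ("FBgn0000001", "required for long-term memory formation"),  # made-up rows
        ("FBgn0000002", "wing vein formation"),
    ],
)


def search(conn, query, limit=5):
    """Return record ids whose body matches every term in `query`, best first."""
    rows = conn.execute(
        "SELECT record_id FROM gene_fts WHERE gene_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [r[0] for r in rows]


hits = search(conn, "memory formation")
print(hits)
```

In FTS5 a query of two bare words is an implicit AND, so only rows containing both terms match.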

Metadata

  • describe summarizes ingested tables with row counts, source paths, semantic tags, columns, and representative non-empty values
  • schema-export writes the same metadata to a deterministic JSON artifact beside the SQLite DB, e.g. FB2026_01.schema.json
  • schema-export also includes inferred relationships for nested child tables and common FlyBase ID joins
  • schema-export also emits semantic_summary for table/entity tag coverage
  • schema-export also emits ready-to-run query_templates
  • query-plan prints starter SQL without the larger schema payload
  • query-plan also includes named biological templates such as gene-summary-by-fbgn, transcript-protein-links, publications-for-gene, and coordinate lookups when matching tables exist
  • query-run selects one template and executes it with parameter values
  • useful first step before writing ad hoc SQL or building agent query plans

Example:

python3 flybase_cli.py schema-export \
  --db data/flybase/FB2026_01.sqlite \
  --sample-values 1

python3 flybase_cli.py query-plan \
  --db data/flybase/FB2026_01.sqlite \
  --sample-values 1 \
  --limit 5

python3 flybase_cli.py query-run \
  --db data/flybase/FB2026_01.sqlite \
  --template-name gene-summary-by-fbgn \
  --param fbgn_id=FBgn0002121
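
To make the idea of a parameterized template with record-oriented output concrete, here is a sketch using sqlite3 named placeholders. The template SQL, the gene_summary table, and the result shape (columns, row_count, records) are all assumptions for illustration. The real templates come from schema-export / query-plan, and the CLI's JSON layout may differ.

```python
import json
import sqlite3

# Hypothetical template; :fbgn_id is bound from the supplied parameters.
TEMPLATE = "SELECT fbgn_id, gene_symbol FROM gene_summary WHERE fbgn_id = :fbgn_id"


def run_template(conn, sql, params):
    """Execute a parameterized query and shape rows as a list of dicts."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    records = [dict(zip(columns, row)) for row in cur.fetchall()]
    return {"columns": columns, "row_count": len(records), "records": records}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_summary (fbgn_id TEXT, gene_symbol TEXT)")
conn.execute("INSERT INTO gene_summary VALUES ('FBgn0000000', 'example')")  # made-up row

result = run_template(conn, TEMPLATE, {"fbgn_id": "FBgn0000000"})
print(json.dumps(result, indent=2))
```

Binding values through placeholders rather than string formatting keeps agent-supplied parameters from altering the SQL itself.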

Notes

  • nested JSON child tables keep lineage columns like parent_record_id, parent_ordinal, ordinal.
  • many FlyBase files start with ## metadata lines; loader skips those.
  • sync writes a preset manifest under data/flybase/manifests/<release>/.
  • full-sync is the broadest offline path for release bulk data without going through the full Postgres dump.
  • sync --release FB2026_01 defaults to data/flybase/FB2026_01.sqlite to avoid cross-release mixing.
  • sync-incremental uses stable manifest keys, so files renamed between releases are reported as updated rather than as noisy add/remove pairs.
  • release-diff compares releases either by raw prefix or by curated multi-prefix preset.
  • manifest --url lets you crawl FlyBase directories outside releases/, such as genome FASTA/GFF trees.
  • sync-url is the shortest path for genome assets once you know the directory URL.
  • sync-genome is the shortest path when you know the FlyBase release + genome build label.
  • sync-genome --preset ... is the preferred path for common genome asset pulls.
  • some FlyBase .gff.gz assets are tar-wrapped gzip archives; loader handles that transparently.
  • sql and query-run shape results as record-oriented JSON with summary metadata for agent chaining.
  • pg-load stages the full Postgres import script for releases/<release>/psql/<release>.sql.gz.
  • pg-load --execute runs the staged script when createdb and psql are installed locally.
  • SQLite keeps setup minimal; switch to DuckDB/Postgres if you want bigger joins/faster scans.
  • if you only need a few IDs, FlyBase Batch Download may be simpler than syncing files.
  • use --no-header for files whose first non-comment row is data, not column names.
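
The ## metadata and --no-header notes can be illustrated with a small reader for gzipped TSV bytes. This is a sketch, not the package's loader: read_flybase_tsv is a hypothetical name, the col_N fallback names are an assumption, and it assumes the header row carries at most a single leading '#'.

```python
import csv
import gzip


def read_flybase_tsv(raw: bytes, has_header: bool = True):
    """Parse gzipped TSV bytes, skipping blank lines and '##' metadata lines."""
    text = gzip.decompress(raw).decode("utf-8")
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("##")]
    rows = list(csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))
    if not rows:
        return [], []
    if has_header:
        # Assumption: the header row may start with a single '#'.
        header = [h.lstrip("#").strip() for h in rows[0]]
        return header, rows[1:]
    # No header row: the first non-comment row is data, so invent column names.
    header = [f"col_{i + 1}" for i in range(len(rows[0]))]
    return header, rows


sample = gzip.compress(
    b"## made-up metadata line\n#FBgn_ID\tGene_Symbol\nFBgn0000000\texample\n"
)
header, rows = read_flybase_tsv(sample)
print(header, rows)
```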

Tests

python3 -m unittest discover -s tests
