Skip to main content

Goose — a columnar storage engine with 10 specialized encodings, roaring-bitmap predicate pushdown, bloom-filter partition skipping, zone-map pruning, probabilistic sketches, and an SQL query advisor. Ships with a PostgreSQL profiler/exporter CLI.

Project description

Goose

A columnar storage engine for analytical workloads — 10 specialized column encodings, roaring-bitmap predicate pushdown, bloom-filter partition skipping, zone-map pruning, probabilistic sketches, and an SQL query advisor. Ships with a PostgreSQL profiler/exporter CLI.

CI License: MIT Python 3.10+

Goose is a universal booster: a column-encoding and predicate-pushdown layer you can use three ways —

  1. As a Python library — point it at data, let it pick the best encoding per column, write a compressed columnar store, and query it with predicate pushdown.
  2. As a PostgreSQL cold tier — profile a Postgres table, export it to Goose, and query the compressed copy (CDC keeps them in sync).
  3. As an embedded encoding library — link the C (libgoose.a) or Rust (goose-encoding) reference implementation into another database to get encoding intelligence and predicate pushdown without changing your storage engine.

The binary format, encoding selection algorithm, and predicate evaluation logic are specified language-agnostically in reference/SPEC.md, so any database can implement a compatible reader/writer.


Why

PostgreSQL is great at OLTP and terrible at analytical scans over years of history. Goose takes the analytical workload off Postgres: a verb_id column with 6 distinct values compresses 58×; a monotonic timestamp column compresses 10.8× with FOR+ZSTD; a sparse boolean flag becomes a roaring bitmap you can filter without decoding the column. Predicate pushdown then skips whole partitions using bloom filters and zone maps before touching row data.

Features

  • 10 column encodingsraw, bitpacked, dictionary, offset_blob, delta_zstd, zstd, for (frame-of-reference), for_zstd, roaring (roaring-bitmap booleans), zstd_dict (shared trained dictionary).
  • Predicate pushdowneq, neq, gt, gte, lt, lte, in, compound and/or. Each predicate produces a roaring bitmap of matching rows; predicates are intersected, never materializing the full column.
  • Partition skipping — per-partition and per-block bloom filters + zone maps let the reader rule out entire partitions before decode.
  • Adaptive selectivity feedback — observed predicate selectivities reorder filters so the most selective (cheapest) predicates run first.
  • Cross-column correlation statistics — propagate partition pruning across correlated columns.
  • Probabilistic sketches — approximate answers (count-distinct, etc.) for fast exploration.
  • Atomic writes + CRC32 checksums — every file is written to a .tmp then renamed; partition manifests carry checksums, verified on read.
  • SQL query advisorgoose.advisor.optimize() takes a SQL WHERE clause, reorders predicates by selectivity, and returns optimized SQL + a structured predicate tree for GooseTable.query().
  • PostgreSQL integrationgoose-pg CLI to profile, export, benchmark, and query.

Install

pip install weji-goose                 # core engine (numpy, zstandard, sqlglot)
pip install "weji-goose[pg]"           # + psycopg2-binary for the goose-pg CLI
pip install "weji-goose[dev]"          # + pytest for running the test suite

Note: the PyPI distribution is weji-goose; the import name is import weji_goose.

Requires Python ≥ 3.10.

Quick start — Python library

import numpy as np
from weji_goose.schema import TableSchema
from weji_goose.column import ColumnSpec, ColumnType, Encoding
from weji_goose.table import GooseTable
from weji_goose.query import Predicate, CompoundPredicate

schema = TableSchema("demo", columns=[
    ColumnSpec("id",    ColumnType.INT64,   Encoding.DELTA_ZSTD),  # monotonic → delta
    ColumnSpec("flag",  ColumnType.BOOLEAN, Encoding.ROARING),     # sparse bool → bitmap
    ColumnSpec("city",  ColumnType.TEXT,    Encoding.DICTIONARY),  # low-card text → dict
])

# Create and write
table = GooseTable.create(schema, "./demo_table")
table.insert({
    "id":   np.arange(1_000_000, dtype=np.int64),
    "flag": np.array([i % 5 == 0 for i in range(1_000_000)], dtype=bool),
    "city": np.array(["stjohns", "corner", "bay"] * 333_333 + ["stjohns"], dtype=object),
})

# Predicate pushdown: filter without decoding the columns
table = GooseTable.open("./demo_table")
result = table.query(
    ["city", "id"],
    where=CompoundPredicate("and", [
        Predicate("flag", "eq", True),
        Predicate("id",   "gte", 999_000),
    ]),
)
print(result["city"][:5], result["id"][:5])
print("on-disk size:", table.total_size_bytes(), "bytes")

Supported types

int64, int32, int16, float64, float32, boolean, uuid, text, interval.

Quick start — PostgreSQL cold tier (goose-pg)

# 1. Profile a table → schema + per-column compression estimates
goose-pg profile \
    --db-url "postgresql://user:pass@host/db" \
    --table sensor_readings \
    --output ./sensor_readings_schema.json -v

# 2. Export the table to Goose format (auto-partitioned, resumable)
goose-pg export \
    --db-url "postgresql://user:pass@host/db" \
    --table sensor_readings \
    --output-dir ./goose_data \
    --partition-column recorded_at --partition-interval month --resume -v

# 3. Benchmark: profile + export + measured compression ratio
goose-pg benchmark \
    --db-url "postgresql://user:pass@host/db" \
    --table sensor_readings \
    --output-dir ./bench

# 4. Query the compressed Goose data with predicate pushdown
goose-pg query \
    --goose-dir ./goose_data/sensor_readings \
    --columns recorded_at,pm25 \
    --where "pm25 >= 35 AND region = 'bay_st_george'" \
    --format csv

goose-pg query --where parses the SQL WHERE clause through the Goose advisor (using the opened table's schema) and pushes it down. Output formats: table (default), csv, json.

A docker-compose.yml is included for integration testing against a throwaway Postgres. The goose service is profile-gated under cli, so pass --profile cli to invoke it:

docker compose up -d postgres
docker compose --profile cli run --rm goose --help
docker compose --profile cli run --rm goose query --goose-dir ./goose_data/mytable --columns id,name

The query advisor

goose.advisor.optimize(sql, schema) reorders WHERE predicates by selectivity — roaring-bitmap lookups first, then bloom-filtered ID lookups, then range predicates — and returns optimized SQL plus a predicate tree:

from weji_goose.advisor import optimize, SCHEMAS

oq = optimize(
    "SELECT id, verb_id FROM xapi_events "
    "WHERE verb_id IN ('completed', 'passed') AND actor_id = 42",
    schema=SCHEMAS["xapi_events"],
)
print(oq.optimized_sql)
# SELECT id, verb_id
# FROM xapi_events
# WHERE actor_id = 42 AND verb_id IN ('completed', 'passed')

print(oq.predicate_json)
# {'op': 'and', 'predicates': [
#     {'column': 'actor_id', 'op': 'eq',  'value': 42},
#     {'column': 'verb_id',  'op': 'in',  'value': ['completed', 'passed']}]}

OptimizedQuery exposes: original_sql, optimized_sql, predicate_json, table_name, selected_columns, warnings.

Encodings

Encoding Best for
raw High-cardinality, incompressible numeric data
bitpacked Dense booleans (8 values/byte)
dictionary Low-cardinality text/categorical (≈ < 50k distinct)
offset_blob Variable-length blobs with a dictionary offset index
delta_zstd Monotonic / near-sequential int64 (timestamps, IDs)
zstd Generic high-entropy compressible data
for Frame-of-reference: clustered int ranges
for_zstd FOR + ZSTD for clustered ints that still compress
roaring Sparse booleans — queryable without column decode
zstd_dict Repeated text patterns via a shared trained dictionary

Benchmarks

On a 10,000-row synthetic sample (weji_goose/benchmarks/benchmark_results.json, reproducible via python -m weji_goose.benchmarks.demo):

Metric Value
Overall compression vs PG 3.83×
Write throughput ~10,000 rows/s
Full-scan throughput ~3.96M rows/s
Best per-column compression verb_id58.5× (dictionary, 6 distinct)
Column Encoding Compression
verb_id dictionary 58.5×
context_org dictionary 14.8×
stored delta_zstd 10.8×
actor_id for 8.0×
object_id dictionary 7.4×

Project layout

weji_goose/            core engine (import as `import weji_goose`)
  schema.py            TableSchema, ColumnType, Encoding enums
  table.py             GooseTable: create / open / insert / query / scan
  query.py             predicate pushdown engine, Predicate / CompoundPredicate
  advisor.py           SQL WHERE → optimized predicates
  reader.py writer.py  checksummed, atomic, mmap-backed I/O
  bloom.py roaring.py  partition skipping + row bitmaps
  sketch.py            probabilistic sketches
  correlation.py       cross-column pruning stats
goose_pg/              PostgreSQL integration
  cli.py               goose-pg CLI: profile / export / benchmark / query
  profiler.py          schema introspection + data profiling
  exporter.py          bulk export to Goose partitions
  type_map.py          PostgreSQL → Goose type/encoding mapping
reference/
  SPEC.md              language-agnostic binary format + algorithm spec
  c/                   C reference (libgoose.a / libgoose.so + tests)
  rust/                Rust reference crate (goose-encoding)
tests/                 272 tests

Reference implementations & spec

Goose's on-disk format and algorithms are defined in reference/SPEC.md so any database can implement a compatible reader/writer. Reference implementations live under reference/:

  • Creference/c/ builds libgoose.a / libgoose.so (make, make test).
  • Rustreference/rust/ is the goose-encoding crate (cargo build).

These are maintained alongside the Python engine as the canonical cross-language contract.

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pg]"
pytest -q          # 272 tests

The Python reference encoders/decoders are in weji_goose/; the C and Rust references are in reference/. Benchmarks: python -m weji_goose.benchmarks.demo.

License

MIT © 2026 WEJI Northern Technologies Inc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

weji_goose-0.3.0.tar.gz (112.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

weji_goose-0.3.0-py3-none-any.whl (94.8 kB view details)

Uploaded Python 3

File details

Details for the file weji_goose-0.3.0.tar.gz.

File metadata

  • Download URL: weji_goose-0.3.0.tar.gz
  • Upload date:
  • Size: 112.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for weji_goose-0.3.0.tar.gz
Algorithm Hash digest
SHA256 d5401c01c4aa24832249ac3726eb2f32a1dba77c7f853f33fecec2439e9bb922
MD5 81fdd36e84b86dec3593d0ba4ed4fc11
BLAKE2b-256 69b38574d0768be80585c8971f03ae2f2627ccf1cdbdc9a19f4d9b64f66f4630

See more details on using hashes here.

Provenance

The following attestation bundles were made for weji_goose-0.3.0.tar.gz:

Publisher: publish.yml on wejinortherntechnologiesinc/Goose

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file weji_goose-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: weji_goose-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 94.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for weji_goose-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 029ac7aa8664ecd361f2de17ae4c038b1266f50e227b70fe1fcddc2187154800
MD5 ab124a2324d6d63247a571b5128a9523
BLAKE2b-256 8da2f213095ecd3717b31f184ce3cb354f537d586d77d496311f710090ab40bd

See more details on using hashes here.

Provenance

The following attestation bundles were made for weji_goose-0.3.0-py3-none-any.whl:

Publisher: publish.yml on wejinortherntechnologiesinc/Goose

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page