Goose — a columnar storage engine with 10 specialized encodings, roaring-bitmap predicate pushdown, bloom-filter partition skipping, zone-map pruning, probabilistic sketches, and an SQL query advisor. Ships with a PostgreSQL profiler/exporter CLI.
Project description
Goose
A columnar storage engine for analytical workloads — 10 specialized column encodings, roaring-bitmap predicate pushdown, bloom-filter partition skipping, zone-map pruning, probabilistic sketches, and an SQL query advisor. Ships with a PostgreSQL profiler/exporter CLI.
Goose is a universal booster: a column-encoding and predicate-pushdown layer you can use three ways —
- As a Python library — point it at data, let it pick the best encoding per column, write a compressed columnar store, and query it with predicate pushdown.
- As a PostgreSQL cold tier — profile a Postgres table, export it to Goose, and query the compressed copy (CDC keeps them in sync).
- As an embedded encoding library — link the C (
libgoose.a) or Rust (goose-encoding) reference implementation into another database to get encoding intelligence and predicate pushdown without changing your storage engine.
The binary format, encoding selection algorithm, and predicate evaluation logic are specified language-agnostically in reference/SPEC.md, so any database can implement a compatible reader/writer.
Why
PostgreSQL is great at OLTP and terrible at analytical scans over years of history. Goose takes the analytical workload off Postgres: a verb_id column with 6 distinct values compresses 58×; a monotonic timestamp column compresses 10.8× with FOR+ZSTD; a sparse boolean flag becomes a roaring bitmap you can filter without decoding the column. Predicate pushdown then skips whole partitions using bloom filters and zone maps before touching row data.
Features
- 10 column encodings —
raw,bitpacked,dictionary,offset_blob,delta_zstd,zstd,for(frame-of-reference),for_zstd,roaring(roaring-bitmap booleans),zstd_dict(shared trained dictionary). - Predicate pushdown —
eq,neq,gt,gte,lt,lte,in, compoundand/or. Each predicate produces a roaring bitmap of matching rows; predicates are intersected, never materializing the full column. - Partition skipping — per-partition and per-block bloom filters + zone maps let the reader rule out entire partitions before decode.
- Adaptive selectivity feedback — observed predicate selectivities reorder filters so the most selective (cheapest) predicates run first.
- Cross-column correlation statistics — propagate partition pruning across correlated columns.
- Probabilistic sketches — approximate answers (count-distinct, etc.) for fast exploration.
- Atomic writes + CRC32 checksums — every file is written to a
.tmpthen renamed; partition manifests carry checksums, verified on read. - SQL query advisor —
goose.advisor.optimize()takes a SQLWHEREclause, reorders predicates by selectivity, and returns optimized SQL + a structured predicate tree forGooseTable.query(). - PostgreSQL integration —
goose-pgCLI to profile, export, benchmark, and query.
Install
pip install weji-goose # core engine (numpy, zstandard, sqlglot)
pip install "weji-goose[pg]" # + psycopg2-binary for the goose-pg CLI
pip install "weji-goose[dev]" # + pytest for running the test suite
Note: the PyPI distribution is
weji-goose; the import name isimport weji_goose.
Requires Python ≥ 3.10.
Quick start — Python library
import numpy as np
from weji_goose.schema import TableSchema
from weji_goose.column import ColumnSpec, ColumnType, Encoding
from weji_goose.table import GooseTable
from weji_goose.query import Predicate, CompoundPredicate
schema = TableSchema("demo", columns=[
ColumnSpec("id", ColumnType.INT64, Encoding.DELTA_ZSTD), # monotonic → delta
ColumnSpec("flag", ColumnType.BOOLEAN, Encoding.ROARING), # sparse bool → bitmap
ColumnSpec("city", ColumnType.TEXT, Encoding.DICTIONARY), # low-card text → dict
])
# Create and write
table = GooseTable.create(schema, "./demo_table")
table.insert({
"id": np.arange(1_000_000, dtype=np.int64),
"flag": np.array([i % 5 == 0 for i in range(1_000_000)], dtype=bool),
"city": np.array(["stjohns", "corner", "bay"] * 333_333 + ["stjohns"], dtype=object),
})
# Predicate pushdown: filter without decoding the columns
table = GooseTable.open("./demo_table")
result = table.query(
["city", "id"],
where=CompoundPredicate("and", [
Predicate("flag", "eq", True),
Predicate("id", "gte", 999_000),
]),
)
print(result["city"][:5], result["id"][:5])
print("on-disk size:", table.total_size_bytes(), "bytes")
Supported types
int64, int32, int16, float64, float32, boolean, uuid, text, interval.
Quick start — PostgreSQL cold tier (goose-pg)
# 1. Profile a table → schema + per-column compression estimates
goose-pg profile \
--db-url "postgresql://user:pass@host/db" \
--table sensor_readings \
--output ./sensor_readings_schema.json -v
# 2. Export the table to Goose format (auto-partitioned, resumable)
goose-pg export \
--db-url "postgresql://user:pass@host/db" \
--table sensor_readings \
--output-dir ./goose_data \
--partition-column recorded_at --partition-interval month --resume -v
# 3. Benchmark: profile + export + measured compression ratio
goose-pg benchmark \
--db-url "postgresql://user:pass@host/db" \
--table sensor_readings \
--output-dir ./bench
# 4. Query the compressed Goose data with predicate pushdown
goose-pg query \
--goose-dir ./goose_data/sensor_readings \
--columns recorded_at,pm25 \
--where "pm25 >= 35 AND region = 'bay_st_george'" \
--format csv
goose-pg query --where parses the SQL WHERE clause through the Goose advisor (using the opened table's schema) and pushes it down. Output formats: table (default), csv, json.
A docker-compose.yml is included for integration testing against a throwaway Postgres. The goose service is profile-gated under cli, so pass --profile cli to invoke it:
docker compose up -d postgres
docker compose --profile cli run --rm goose --help
docker compose --profile cli run --rm goose query --goose-dir ./goose_data/mytable --columns id,name
The query advisor
goose.advisor.optimize(sql, schema) reorders WHERE predicates by selectivity — roaring-bitmap lookups first, then bloom-filtered ID lookups, then range predicates — and returns optimized SQL plus a predicate tree:
from weji_goose.advisor import optimize, SCHEMAS
oq = optimize(
"SELECT id, verb_id FROM xapi_events "
"WHERE verb_id IN ('completed', 'passed') AND actor_id = 42",
schema=SCHEMAS["xapi_events"],
)
print(oq.optimized_sql)
# SELECT id, verb_id
# FROM xapi_events
# WHERE actor_id = 42 AND verb_id IN ('completed', 'passed')
print(oq.predicate_json)
# {'op': 'and', 'predicates': [
# {'column': 'actor_id', 'op': 'eq', 'value': 42},
# {'column': 'verb_id', 'op': 'in', 'value': ['completed', 'passed']}]}
OptimizedQuery exposes: original_sql, optimized_sql, predicate_json, table_name, selected_columns, warnings.
Encodings
| Encoding | Best for |
|---|---|
raw |
High-cardinality, incompressible numeric data |
bitpacked |
Dense booleans (8 values/byte) |
dictionary |
Low-cardinality text/categorical (≈ < 50k distinct) |
offset_blob |
Variable-length blobs with a dictionary offset index |
delta_zstd |
Monotonic / near-sequential int64 (timestamps, IDs) |
zstd |
Generic high-entropy compressible data |
for |
Frame-of-reference: clustered int ranges |
for_zstd |
FOR + ZSTD for clustered ints that still compress |
roaring |
Sparse booleans — queryable without column decode |
zstd_dict |
Repeated text patterns via a shared trained dictionary |
Benchmarks
On a 10,000-row synthetic sample (weji_goose/benchmarks/benchmark_results.json, reproducible via python -m weji_goose.benchmarks.demo):
| Metric | Value |
|---|---|
| Overall compression vs PG | 3.83× |
| Write throughput | ~10,000 rows/s |
| Full-scan throughput | ~3.96M rows/s |
| Best per-column compression | verb_id → 58.5× (dictionary, 6 distinct) |
| Column | Encoding | Compression |
|---|---|---|
verb_id |
dictionary | 58.5× |
context_org |
dictionary | 14.8× |
stored |
delta_zstd | 10.8× |
actor_id |
for | 8.0× |
object_id |
dictionary | 7.4× |
Project layout
weji_goose/ core engine (import as `import weji_goose`)
schema.py TableSchema, ColumnType, Encoding enums
table.py GooseTable: create / open / insert / query / scan
query.py predicate pushdown engine, Predicate / CompoundPredicate
advisor.py SQL WHERE → optimized predicates
reader.py writer.py checksummed, atomic, mmap-backed I/O
bloom.py roaring.py partition skipping + row bitmaps
sketch.py probabilistic sketches
correlation.py cross-column pruning stats
goose_pg/ PostgreSQL integration
cli.py goose-pg CLI: profile / export / benchmark / query
profiler.py schema introspection + data profiling
exporter.py bulk export to Goose partitions
type_map.py PostgreSQL → Goose type/encoding mapping
reference/
SPEC.md language-agnostic binary format + algorithm spec
c/ C reference (libgoose.a / libgoose.so + tests)
rust/ Rust reference crate (goose-encoding)
tests/ 272 tests
Reference implementations & spec
Goose's on-disk format and algorithms are defined in reference/SPEC.md so any database can implement a compatible reader/writer. Reference implementations live under reference/:
- C —
reference/c/buildslibgoose.a/libgoose.so(make,make test). - Rust —
reference/rust/is thegoose-encodingcrate (cargo build).
These are maintained alongside the Python engine as the canonical cross-language contract.
Development
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pg]"
pytest -q # 272 tests
The Python reference encoders/decoders are in weji_goose/; the C and Rust references are in reference/. Benchmarks: python -m weji_goose.benchmarks.demo.
License
MIT © 2026 WEJI Northern Technologies Inc.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file weji_goose-0.3.0.tar.gz.
File metadata
- Download URL: weji_goose-0.3.0.tar.gz
- Upload date:
- Size: 112.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d5401c01c4aa24832249ac3726eb2f32a1dba77c7f853f33fecec2439e9bb922
|
|
| MD5 |
81fdd36e84b86dec3593d0ba4ed4fc11
|
|
| BLAKE2b-256 |
69b38574d0768be80585c8971f03ae2f2627ccf1cdbdc9a19f4d9b64f66f4630
|
Provenance
The following attestation bundles were made for weji_goose-0.3.0.tar.gz:
Publisher:
publish.yml on wejinortherntechnologiesinc/Goose
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
weji_goose-0.3.0.tar.gz -
Subject digest:
d5401c01c4aa24832249ac3726eb2f32a1dba77c7f853f33fecec2439e9bb922 - Sigstore transparency entry: 1867998711
- Sigstore integration time:
-
Permalink:
wejinortherntechnologiesinc/Goose@7e64f472f20545311559033affe9cc7d990db847 -
Branch / Tag:
refs/tags/v0.3.0 - Owner: https://github.com/wejinortherntechnologiesinc
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@7e64f472f20545311559033affe9cc7d990db847 -
Trigger Event:
push
-
Statement type:
File details
Details for the file weji_goose-0.3.0-py3-none-any.whl.
File metadata
- Download URL: weji_goose-0.3.0-py3-none-any.whl
- Upload date:
- Size: 94.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
029ac7aa8664ecd361f2de17ae4c038b1266f50e227b70fe1fcddc2187154800
|
|
| MD5 |
ab124a2324d6d63247a571b5128a9523
|
|
| BLAKE2b-256 |
8da2f213095ecd3717b31f184ce3cb354f537d586d77d496311f710090ab40bd
|
Provenance
The following attestation bundles were made for weji_goose-0.3.0-py3-none-any.whl:
Publisher:
publish.yml on wejinortherntechnologiesinc/Goose
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
weji_goose-0.3.0-py3-none-any.whl -
Subject digest:
029ac7aa8664ecd361f2de17ae4c038b1266f50e227b70fe1fcddc2187154800 - Sigstore transparency entry: 1867998749
- Sigstore integration time:
-
Permalink:
wejinortherntechnologiesinc/Goose@7e64f472f20545311559033affe9cc7d990db847 -
Branch / Tag:
refs/tags/v0.3.0 - Owner: https://github.com/wejinortherntechnologiesinc
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@7e64f472f20545311559033affe9cc7d990db847 -
Trigger Event:
push
-
Statement type: