Astronomical Catalog Inference Driver: XMATCH SQL over HATS-partitioned Parquet via DuckDB
Project description
acid — Astronomical Catalog Inference Driver
SQL-driven crossmatching and analysis of HEALPix-partitioned
astronomical catalogs, executed by DuckDB. acid extends SQL with an
XMATCH(...) predicate for spherical-distance joins, runs each anchor
partition independently against a boundary-safe margin cache, and
aggregates the results.
Reads and writes the HATS format used by LINCC Frameworks (LSDB, hats-import) and by published catalogs such as Gaia DR3 and Rubin DP1.
Quick start
One-shot query
import acid
r = acid.sql(
"""
SELECT a.id, b.id AS b_id, XMATCH_DISTANCE(b) AS d_arcsec
FROM dia_source AS a
JOIN object AS b ON XMATCH(radius_arcsec => 1.0)
JOIN lightcurve AS lc ON a.id = lc.object_id
""",
catalogs="catalogs.yaml",
workers=8,
)
r.df() # -> pandas.DataFrame
r.arrow() # -> pyarrow.Table
r.to_polars() # -> polars.DataFrame
for batch in r.batches(): ... # -> RecordBatch iterator
r.write_parquet("results.parquet", layout="single")
Exploratory session (recommended for repeated queries)
with acid.connect("catalogs.yaml", workers=8) as s:
r1 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 1.0)")
r2 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 0.5)")
# Promote an intermediate to a real catalog usable in later queries.
s.materialize(
"nearby",
"SELECT a.id, a.ra, a.dec FROM a JOIN b ON XMATCH(r => 1.0)",
)
r3 = s.sql("SELECT * FROM nearby JOIN c ON XMATCH(r => 5.0, mode => 'all')")
A Session keeps the worker pool, the DuckDB connection in each worker,
and the parsed registry alive across queries. On a 64-core box it's
typically 5–10× faster than calling acid.sql() in a loop.
CLI
# Query execution (--db accepts a directory of HATS catalogs or a YAML file)
acid query "SELECT COUNT(*) FROM object" --db datasets/ --out /tmp/result
acid query -f query.sql --db catalogs.yaml --out results/ --workers 32
echo "SELECT ..." | acid query --db datasets/ --out results/
acid validate "SELECT ..." --db datasets/
# Download catalogs (HTTP, SSH, or local; full or spatial subset)
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,10
acid download user@server:/hats/gaia /data/gaia --columns ra,dec,mag --cone 180,0,5
# Inspect catalogs (local or remote)
acid inspect /data/two_mass # summary
acid inspect schema /data/two_mass # column schema
acid inspect https://data.lsdb.io/hats/two_mass/two_mass # remote
# Build margin caches locally
acid hats build-margin /data/two_mass --margin-arcsec 5.0 --workers 16
results/ is itself a valid HATS catalog (lsdb.open_catalog(...) and
hats.read_hats(...) will read it). Downloaded subsets are also valid
HATS catalogs with rebuilt _metadata.
Catalog registry
The simplest way: point --db at a directory of HATS catalogs. Each
subdirectory with a properties file becomes a table named after the
directory. Margin caches (dataproduct_type=margin) are auto-skipped.
For more control, use a YAML file:
catalogs:
dia_source:
path: /data/dia_source # HATS root, or CatalogCollection root
# Auto-detected from <path>/properties when present:
# ra_col (from hats_col_ra)
# dec_col (from hats_col_dec)
# hpix_order (from <path>/partition_info.csv)
# neighbor_path (from collection.properties or sibling '_margin' dir)
# neighbor_margin_arcsec (from hats_margin_threshold)
# npix_suffix (from hats_npix_suffix; default '.parquet')
# Any auto-detected value can be overridden here.
object:
path: /data/object_collection # a CatalogCollection root works too
lightcurve:
path: /data/lightcurve
hpix_order: 5 # explicit when partition_info.csv is absent
# Named MOC footprints for IN_MOC() filtering.
# Each entry is a path to a FITS file (HEALPix image or MOC FITS).
mocs:
des_dr2: /data/mocs/des_dr2.fits
known_artifacts: /data/mocs/artifacts.fits
# If a catalog has a point_map.fits at its root, IN_MOC(<alias>, '<catalog_name>')
# auto-loads it — no explicit entry needed.
What XMATCH does
JOIN b ON XMATCH(radius_arcsec => 1.0) -- nearest, inner
JOIN b ON XMATCH(r => 1.0) -- 'r' is an alias
JOIN b ON XMATCH(r => 1.0, mode => 'all') -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0) -- keep unmatched anchors
-- Distance is exposed as a SELECT-level function over the right alias.
SELECT a.id, XMATCH_DISTANCE(b) AS d FROM a JOIN b ON XMATCH(r => 1.0)
WHERE XMATCH_DISTANCE(b) < 0.5
-- Ordinary joins, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET,
-- DISTINCT all work; cross-partition reduction is handled internally.
SELECT a.id, COUNT(*) AS n, AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM a
JOIN b ON XMATCH(r => 1.0)
JOIN lightcurve AS lc ON a.id = lc.object_id
GROUP BY a.id
ORDER BY n DESC LIMIT 100
-- Footprint filtering via MOC (Multi-Order Coverage maps):
-- Restrict rows to a survey footprint or sky region.
SELECT a.id, a.ra, a.dec
FROM a JOIN b ON XMATCH(r => 1.0)
WHERE IN_MOC(a, 'des_dr2') -- anchor inside DES footprint
AND NOT IN_MOC(b, 'known_artifacts') -- exclude artifact regions
-- IN_MOC also works in SELECT projections (per-row boolean):
SELECT a.id, IN_MOC(a, 'des_dr2') AS in_des FROM a
Semantics, in short:
- All XMATCHes in a query use the anchor (first FROM) table's
coordinates, even after a
mode => 'all'expansion. - A right-table radius must be ≤ that catalog's
neighbor_margin_arcsec. Otherwise we'd silently miss boundary pairs; the analyzer rejects the query. ORDER BY ... LIMIT Kpushes the top-K to each partition first; the reducer re-sorts the union and applies the global LIMIT/OFFSET.- Aggregates / GROUP BY / DISTINCT / HAVING run in a phase-2 reducer over the per-partition Parquet output.
Python API surface
# Sessions (preferred for EDA)
s = acid.connect(catalogs, *, workers=1, duckdb_threads=None, cache_dir=None)
s.sql(query, *, output=None, inmem_row_limit=50_000_000) -> acid.Result
s.run(query, *, output=None) -> ExecutionResult
s.validate(query) -> QueryPlan
s.explain(query) -> str (rewritten SQL)
s.list_catalogs() -> list[str]
s.add_catalog(name, **kwargs) -> TableSpec
s.materialize(name, query, *, ra_col=None, dec_col=None, overwrite=False) -> TableSpec
s.register_moc(name, source) # FITS path, mocpy.MOC, or (N,2) ranges
s.close() # or use as a context manager
# One-shot convenience (transient Session under the hood)
acid.sql(query, *, catalogs, output=None, workers=1, ...) -> acid.Result
acid.run(query, *, catalogs, output, workers=1) -> ExecutionResult
# Result
r.num_rows, r.column_names, r.schema
r.column(name) -> pa.ChunkedArray
r.arrow() -> pa.Table
r.df() / r.to_pandas() -> pandas.DataFrame
r.to_polars() -> polars.DataFrame
r.to_pylist() -> list[dict]
r.batches(batch_size=None) -> Iterator[pa.RecordBatch]
r.head(n=10) -> Result
r.write_parquet(path, layout="hats"|"single") -> Path
len(r), for batch in r: ...
# Errors (all inherit from acid.AcidError)
acid.RegistryError # catalog config (missing path, mixed Norder, ...)
acid.ParseError # SQL parse failures
acid.ValidationError # unsupported XMATCH constructs
acid.ExecutionError # per-partition execution failures
acid.SessionClosedError # method called on a closed Session
Layout assumptions
- Catalogs follow the HATS layout:
<root>/dataset/Norder=N/Dir=D/Npix=P.parquet(orNpix=P/*.parquetwhenhats_npix_suffix='/'). - Margin caches live as sibling catalogs (HATS canonical), at
<root>/margin_cache/..., or any sibling dir matching<name>_margin*.collection.propertiesis consumed if present. - Adaptive (per-pixel) Norder is supported: a catalog's
partition_info.csvmay list pixels at any orders, and XMATCH/ordinary joins across mixed-Norder catalogs are run via a refinement-tree enumeration that emits one work unit per coarsest cursor pixel where every joined catalog has ≤ 1 partition. Output is itself a valid HATS catalog whosepartition_info.csvreflects the refinement.
What's the speed story?
- DuckDB does the heavy SQL work and Parquet I/O. acid is roughly a
Python orchestrator + a vectorized SciPy
cKDTreecrossmatcher. - Each partition is independent → embarrassingly parallel across HEALPix pixels.
- The per-worker DuckDB connection is opened once and reused for every
partition the worker processes; temp namespace is dropped between
partitions. With a
Session, the connection survives across queries. - Top-K queries push the LIMIT to each partition. Aggregates write partial data to disk and reduce centrally.
- Column pruning: anchor and right tables are TEMP views over
read_parquet(), so the final SELECT only pulls referenced columns from disk. Wide catalogs (150+ columns) don't slow down narrow SELECTs. - Auto-spill: when
output_diris unset and the running result exceedsinmem_row_limit(default 50M rows),acidspills to a tempdir rather than OOM-ing the parent.
See bench/match_all.py and bench/session_vs_oneshot.py for
microbenchmarks.
Install
With uv (recommended for development)
uv sync --dev # creates .venv, installs all deps + test + hats
uv run pytest # run tests
With pip
pip install -e .
# extras: pip install -e .[hats,dev]
Requires Python 3.10+, DuckDB 1.x, SQLGlot ≥ 27, PyArrow ≥ 14, NumPy ≥ 1.24, SciPy ≥ 1.10, PyYAML ≥ 6.
Status
-
v0 (correctness): XMATCH inner/left, mode 'nearest'/'all', chains, ordinary joins, distance via
XMATCH_DISTANCE(alias). -
v1 (scale): views + narrow side-tables, vectorized matcher, worker initializer, auto-spill, top-K pushdown, manifest.
-
v1.1 (HATS spec): writes valid HATS catalogs, reads canonical property keys, supports
hats_npix_suffix='/', auto-discovers margin siblings viacollection.properties. -
v2 (EDA): persistent
Session, per-worker DuckDB connection,Resultwrapper,materialize(). -
v3 (adaptive Norder): per-catalog
PartitionIndex, refinement-tree tuple enumeration, vectorized pure-numpyang2pix_nestedexposed as a DuckDB UDF for per-pixel row filtering, LEFT-XMATCH/JOIN over partitions without coverage. -
v4 (MOC footprint filtering):
IN_MOC(<alias>, '<name>')in WHERE restricts rows to a named sky region (Multi-Order Coverage map). SupportsNOT IN_MOC, multiple predicates (AND-combined via mocpy set ops), catalog auto-resolution frompoint_map.fits, and use in SELECT/ORDER BY/ CASE expressions. Three-level optimization: catalog-footprint scoping, cursor-pixel intersection, and partition-level pruning — all via the existing_healpix_29row-group pushdown fast path. -
v5 (catalog ops):
acid hats build-marginbuilds HATS margin caches locally (validated against hats-import).acid downloadgeneratespoint_map.fits, auto-includes HATS RA/Dec/healpix columns.acid queryaccepts--db <directory>for zero-config usage, fails fast on errors, shows tqdm progress, shuffles work for load balancing. Bare column resolution via schema introspection.LocalFetcherfor local I/O.
Tests: 176 passing (~30s parallel via pytest-xdist). Fixtures cached across runs.
Known limitations
- XMATCH must be the entire
ONpredicate. Compound predicates likeXMATCH(...) AND b.mag < 20are rejected. - No CTEs / subqueries in the anchor position.
- RIGHT / FULL / CROSS JOIN XMATCH not supported.
IN_MOCmust be in conjunctive WHERE position (top-level AND-chain, optionally negated). Disjunctive use (IN_MOC(...) OR ...) andIN_MOCinJOIN ONare rejected.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyacid-0.1.0a2.tar.gz.
File metadata
- Download URL: pyacid-0.1.0a2.tar.gz
- Upload date:
- Size: 349.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9f268def72c2e8d2bcb89761705332029ba2b3e2bf9e5bdae7b8c88a6c87dd5d
|
|
| MD5 |
99d40e7f4dfb7f6508b7d5a0a57eb178
|
|
| BLAKE2b-256 |
8951d9995858c7df5924d7fa8d361adea7d642609484aa4826db527596521789
|
Provenance
The following attestation bundles were made for pyacid-0.1.0a2.tar.gz:
Publisher:
publish.yml on mjuric/acid
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pyacid-0.1.0a2.tar.gz -
Subject digest:
9f268def72c2e8d2bcb89761705332029ba2b3e2bf9e5bdae7b8c88a6c87dd5d - Sigstore transparency entry: 1611735429
- Sigstore integration time:
-
Permalink:
mjuric/acid@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4 -
Branch / Tag:
refs/tags/v0.1.0a2 - Owner: https://github.com/mjuric
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4 -
Trigger Event:
release
-
Statement type:
File details
Details for the file pyacid-0.1.0a2-py3-none-any.whl.
File metadata
- Download URL: pyacid-0.1.0a2-py3-none-any.whl
- Upload date:
- Size: 101.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6548585372a96002ede04703e5b2a4f4c26dd18fbc738bed70e35bac8e75f8b3
|
|
| MD5 |
e6178aeba97f146afc891e2b0475eb40
|
|
| BLAKE2b-256 |
4a383d457138c732fae02cc816124f849929e4be58ca7f30b6c57e5fab91b6f6
|
Provenance
The following attestation bundles were made for pyacid-0.1.0a2-py3-none-any.whl:
Publisher:
publish.yml on mjuric/acid
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pyacid-0.1.0a2-py3-none-any.whl -
Subject digest:
6548585372a96002ede04703e5b2a4f4c26dd18fbc738bed70e35bac8e75f8b3 - Sigstore transparency entry: 1611735906
- Sigstore integration time:
-
Permalink:
mjuric/acid@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4 -
Branch / Tag:
refs/tags/v0.1.0a2 - Owner: https://github.com/mjuric
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4 -
Trigger Event:
release
-
Statement type: