Skip to main content

Astronomical Catalog Inference Driver: XMATCH SQL over HATS-partitioned Parquet via DuckDB

Project description

acid — Astronomical Catalog Inference Driver

SQL-driven crossmatching and analysis of HEALPix-partitioned astronomical catalogs, executed by DuckDB. acid extends SQL with an XMATCH(...) predicate for spherical-distance joins, runs each anchor partition independently against a boundary-safe margin cache, and aggregates the results.

Reads and writes the HATS format used by LINCC Frameworks (LSDB, hats-import) and by published catalogs such as Gaia DR3 and Rubin DP1.


Quick start

One-shot query

import acid

r = acid.sql(
    """
    SELECT a.id, b.id AS b_id, XMATCH_DISTANCE(b) AS d_arcsec
    FROM dia_source AS a
    JOIN object AS b ON XMATCH(radius_arcsec => 1.0)
    JOIN lightcurve AS lc ON a.id = lc.object_id
    """,
    catalogs="catalogs.yaml",
    workers=8,
)

r.df()              # -> pandas.DataFrame
r.arrow()           # -> pyarrow.Table
r.to_polars()       # -> polars.DataFrame
for batch in r.batches(): ...      # -> RecordBatch iterator
r.write_parquet("results.parquet", layout="single")

Exploratory session (recommended for repeated queries)

with acid.connect("catalogs.yaml", workers=8) as s:
    r1 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 1.0)")
    r2 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 0.5)")

    # Promote an intermediate to a real catalog usable in later queries.
    s.materialize(
        "nearby",
        "SELECT a.id, a.ra, a.dec FROM a JOIN b ON XMATCH(r => 1.0)",
    )
    r3 = s.sql("SELECT * FROM nearby JOIN c ON XMATCH(r => 5.0, mode => 'all')")

A Session keeps the worker pool, the DuckDB connection in each worker, and the parsed registry alive across queries. On a 64-core box it's typically 5–10× faster than calling acid.sql() in a loop.

CLI

# Query execution (--db accepts a directory of HATS catalogs or a YAML file)
acid query "SELECT COUNT(*) FROM object" --db datasets/ --out /tmp/result
acid query -f query.sql --db catalogs.yaml --out results/ --workers 32
echo "SELECT ..." | acid query --db datasets/ --out results/
acid validate "SELECT ..." --db datasets/

# Download catalogs (HTTP, SSH, or local; full or spatial subset)
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,10
acid download user@server:/hats/gaia /data/gaia --columns ra,dec,mag --cone 180,0,5

# Inspect catalogs (local or remote)
acid inspect /data/two_mass                          # summary
acid inspect schema /data/two_mass                   # column schema
acid inspect https://data.lsdb.io/hats/two_mass/two_mass  # remote

# Build margin caches locally
acid hats build-margin /data/two_mass --margin-arcsec 5.0 --workers 16

results/ is itself a valid HATS catalog (lsdb.open_catalog(...) and hats.read_hats(...) will read it). Downloaded subsets are also valid HATS catalogs with rebuilt _metadata.

Catalog registry

The simplest way: point --db at a directory of HATS catalogs. Each subdirectory with a properties file becomes a table named after the directory. Margin caches (dataproduct_type=margin) are auto-skipped.

For more control, use a YAML file:

catalogs:
  dia_source:
    path: /data/dia_source      # HATS root, or CatalogCollection root
    # Auto-detected from <path>/properties when present:
    #   ra_col            (from hats_col_ra)
    #   dec_col           (from hats_col_dec)
    #   hpix_order        (from <path>/partition_info.csv)
    #   neighbor_path     (from collection.properties or sibling '_margin' dir)
    #   neighbor_margin_arcsec  (from hats_margin_threshold)
    #   npix_suffix       (from hats_npix_suffix; default '.parquet')
    # Any auto-detected value can be overridden here.

  object:
    path: /data/object_collection    # a CatalogCollection root works too

  lightcurve:
    path: /data/lightcurve
    hpix_order: 5                    # explicit when partition_info.csv is absent

# Named MOC footprints for IN_MOC() filtering.
# Each entry is a path to a FITS file (HEALPix image or MOC FITS).
mocs:
  des_dr2: /data/mocs/des_dr2.fits
  known_artifacts: /data/mocs/artifacts.fits
  # If a catalog has a point_map.fits at its root, IN_MOC(<alias>, '<catalog_name>')
  # auto-loads it — no explicit entry needed.

What XMATCH does

JOIN  b ON XMATCH(radius_arcsec => 1.0)                   -- nearest, inner
JOIN  b ON XMATCH(r => 1.0)                               -- 'r' is an alias
JOIN  b ON XMATCH(r => 1.0, mode => 'all')                -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0)                           -- keep unmatched anchors

-- Distance is exposed as a SELECT-level function over the right alias.
SELECT a.id, XMATCH_DISTANCE(b) AS d FROM a JOIN b ON XMATCH(r => 1.0)
WHERE  XMATCH_DISTANCE(b) < 0.5

-- Ordinary joins, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET,
-- DISTINCT all work; cross-partition reduction is handled internally.
SELECT a.id, COUNT(*) AS n, AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM a
JOIN  b ON XMATCH(r => 1.0)
JOIN  lightcurve AS lc ON a.id = lc.object_id
GROUP BY a.id
ORDER BY n DESC LIMIT 100

-- Footprint filtering via MOC (Multi-Order Coverage maps):
-- Restrict rows to a survey footprint or sky region.
SELECT a.id, a.ra, a.dec
FROM a JOIN b ON XMATCH(r => 1.0)
WHERE IN_MOC(a, 'des_dr2')              -- anchor inside DES footprint
  AND NOT IN_MOC(b, 'known_artifacts')  -- exclude artifact regions

-- IN_MOC also works in SELECT projections (per-row boolean):
SELECT a.id, IN_MOC(a, 'des_dr2') AS in_des FROM a

Semantics, in short:

  • All XMATCHes in a query use the anchor (first FROM) table's coordinates, even after a mode => 'all' expansion.
  • A right-table radius must be ≤ that catalog's neighbor_margin_arcsec. Otherwise we'd silently miss boundary pairs; the analyzer rejects the query.
  • ORDER BY ... LIMIT K pushes the top-K to each partition first; the reducer re-sorts the union and applies the global LIMIT/OFFSET.
  • Aggregates / GROUP BY / DISTINCT / HAVING run in a phase-2 reducer over the per-partition Parquet output.

Python API surface

# Sessions (preferred for EDA)
s = acid.connect(catalogs, *, workers=1, duckdb_threads=None, cache_dir=None)
s.sql(query, *, output=None, inmem_row_limit=50_000_000) -> acid.Result
s.run(query, *, output=None)                              -> ExecutionResult
s.validate(query)                                          -> QueryPlan
s.explain(query)                                           -> str (rewritten SQL)
s.list_catalogs()                                          -> list[str]
s.add_catalog(name, **kwargs)                              -> TableSpec
s.materialize(name, query, *, ra_col=None, dec_col=None, overwrite=False) -> TableSpec
s.register_moc(name, source)                           # FITS path, mocpy.MOC, or (N,2) ranges
s.close()    # or use as a context manager

# One-shot convenience (transient Session under the hood)
acid.sql(query, *, catalogs, output=None, workers=1, ...)  -> acid.Result
acid.run(query, *, catalogs, output, workers=1)            -> ExecutionResult

# Result
r.num_rows, r.column_names, r.schema
r.column(name)         -> pa.ChunkedArray
r.arrow()              -> pa.Table
r.df() / r.to_pandas() -> pandas.DataFrame
r.to_polars()          -> polars.DataFrame
r.to_pylist()          -> list[dict]
r.batches(batch_size=None) -> Iterator[pa.RecordBatch]
r.head(n=10)           -> Result
r.write_parquet(path, layout="hats"|"single") -> Path
len(r), for batch in r: ...

# Errors (all inherit from acid.AcidError)
acid.RegistryError       # catalog config (missing path, mixed Norder, ...)
acid.ParseError          # SQL parse failures
acid.ValidationError     # unsupported XMATCH constructs
acid.ExecutionError      # per-partition execution failures
acid.SessionClosedError  # method called on a closed Session

Layout assumptions

  • Catalogs follow the HATS layout: <root>/dataset/Norder=N/Dir=D/Npix=P.parquet (or Npix=P/*.parquet when hats_npix_suffix='/').
  • Margin caches live as sibling catalogs (HATS canonical), at <root>/margin_cache/..., or any sibling dir matching <name>_margin*. collection.properties is consumed if present.
  • Adaptive (per-pixel) Norder is supported: a catalog's partition_info.csv may list pixels at any orders, and XMATCH/ordinary joins across mixed-Norder catalogs are run via a refinement-tree enumeration that emits one work unit per coarsest cursor pixel where every joined catalog has ≤ 1 partition. Output is itself a valid HATS catalog whose partition_info.csv reflects the refinement.

What's the speed story?

  • DuckDB does the heavy SQL work and Parquet I/O. acid is roughly a Python orchestrator + a vectorized SciPy cKDTree crossmatcher.
  • Each partition is independent → embarrassingly parallel across HEALPix pixels.
  • The per-worker DuckDB connection is opened once and reused for every partition the worker processes; temp namespace is dropped between partitions. With a Session, the connection survives across queries.
  • Top-K queries push the LIMIT to each partition. Aggregates write partial data to disk and reduce centrally.
  • Column pruning: anchor and right tables are TEMP views over read_parquet(), so the final SELECT only pulls referenced columns from disk. Wide catalogs (150+ columns) don't slow down narrow SELECTs.
  • Auto-spill: when output_dir is unset and the running result exceeds inmem_row_limit (default 50M rows), acid spills to a tempdir rather than OOM-ing the parent.

See bench/match_all.py and bench/session_vs_oneshot.py for microbenchmarks.


Install

With uv (recommended for development)

uv sync --dev          # creates .venv, installs all deps + test + hats
uv run pytest          # run tests

With pip

pip install -e .
# extras:  pip install -e .[hats,dev]

Requires Python 3.10+, DuckDB 1.x, SQLGlot ≥ 27, PyArrow ≥ 14, NumPy ≥ 1.24, SciPy ≥ 1.10, PyYAML ≥ 6.


Status

  • v0 (correctness): XMATCH inner/left, mode 'nearest'/'all', chains, ordinary joins, distance via XMATCH_DISTANCE(alias).

  • v1 (scale): views + narrow side-tables, vectorized matcher, worker initializer, auto-spill, top-K pushdown, manifest.

  • v1.1 (HATS spec): writes valid HATS catalogs, reads canonical property keys, supports hats_npix_suffix='/', auto-discovers margin siblings via collection.properties.

  • v2 (EDA): persistent Session, per-worker DuckDB connection, Result wrapper, materialize().

  • v3 (adaptive Norder): per-catalog PartitionIndex, refinement-tree tuple enumeration, vectorized pure-numpy ang2pix_nested exposed as a DuckDB UDF for per-pixel row filtering, LEFT-XMATCH/JOIN over partitions without coverage.

  • v4 (MOC footprint filtering): IN_MOC(<alias>, '<name>') in WHERE restricts rows to a named sky region (Multi-Order Coverage map). Supports NOT IN_MOC, multiple predicates (AND-combined via mocpy set ops), catalog auto-resolution from point_map.fits, and use in SELECT/ORDER BY/ CASE expressions. Three-level optimization: catalog-footprint scoping, cursor-pixel intersection, and partition-level pruning — all via the existing _healpix_29 row-group pushdown fast path.

  • v5 (catalog ops): acid hats build-margin builds HATS margin caches locally (validated against hats-import). acid download generates point_map.fits, auto-includes HATS RA/Dec/healpix columns. acid query accepts --db <directory> for zero-config usage, fails fast on errors, shows tqdm progress, shuffles work for load balancing. Bare column resolution via schema introspection. LocalFetcher for local I/O.

Tests: 176 passing (~30s parallel via pytest-xdist). Fixtures cached across runs.

Known limitations

  • XMATCH must be the entire ON predicate. Compound predicates like XMATCH(...) AND b.mag < 20 are rejected.
  • No CTEs / subqueries in the anchor position.
  • RIGHT / FULL / CROSS JOIN XMATCH not supported.
  • IN_MOC must be in conjunctive WHERE position (top-level AND-chain, optionally negated). Disjunctive use (IN_MOC(...) OR ...) and IN_MOC in JOIN ON are rejected.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyacid-0.1.0a1.tar.gz (346.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyacid-0.1.0a1-py3-none-any.whl (99.9 kB view details)

Uploaded Python 3

File details

Details for the file pyacid-0.1.0a1.tar.gz.

File metadata

  • Download URL: pyacid-0.1.0a1.tar.gz
  • Upload date:
  • Size: 346.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyacid-0.1.0a1.tar.gz
Algorithm Hash digest
SHA256 cc94e27545ba65a74ab6e55c20feb2790e18d91be9839884ab5c97df11a25ff7
MD5 b7f587a1ed8d659512c8455faf86b82d
BLAKE2b-256 f83528cc429e7494e84250a1833b3f49580b2e08e59d0634c2f2a97d215b28db

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyacid-0.1.0a1.tar.gz:

Publisher: publish.yml on mjuric/acid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pyacid-0.1.0a1-py3-none-any.whl.

File metadata

  • Download URL: pyacid-0.1.0a1-py3-none-any.whl
  • Upload date:
  • Size: 99.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyacid-0.1.0a1-py3-none-any.whl
Algorithm Hash digest
SHA256 e9c65a0199c4f7dc50d048e8b9ce9cd892f2fe1ec4c8834518d0249e3212c02f
MD5 fbb9ddd64f3ed87e9be3193895540a1d
BLAKE2b-256 b5a21b5489327e1cda5a6fc4e1017de600e3ac7a00b9661a1fe57e1028b802a9

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyacid-0.1.0a1-py3-none-any.whl:

Publisher: publish.yml on mjuric/acid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page