Skip to main content

Astronomical Catalog Inference Driver: XMATCH SQL over HATS-partitioned Parquet via DuckDB

Project description

acid — Astronomical Catalog Inference Driver

SQL-driven crossmatching and analysis of HEALPix-partitioned astronomical catalogs, executed by DuckDB. acid extends SQL with an XMATCH(...) predicate for spherical-distance joins, runs each anchor partition independently against a boundary-safe margin cache, and aggregates the results.

Reads and writes the HATS format used by LINCC Frameworks (LSDB, hats-import) and by published catalogs such as Gaia DR3 and Rubin DP1.


Quick start

One-shot query

import acid

r = acid.sql(
    """
    SELECT a.id, b.id AS b_id, XMATCH_DISTANCE(b) AS d_arcsec
    FROM dia_source AS a
    JOIN object AS b ON XMATCH(radius_arcsec => 1.0)
    JOIN lightcurve AS lc ON a.id = lc.object_id
    """,
    catalogs="catalogs.yaml",
    workers=8,
)

r.df()              # -> pandas.DataFrame
r.arrow()           # -> pyarrow.Table
r.to_polars()       # -> polars.DataFrame
for batch in r.batches(): ...      # -> RecordBatch iterator
r.write_parquet("results.parquet", layout="single")

Exploratory session (recommended for repeated queries)

with acid.connect("catalogs.yaml", workers=8) as s:
    r1 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 1.0)")
    r2 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 0.5)")

    # Promote an intermediate to a real catalog usable in later queries.
    s.materialize(
        "nearby",
        "SELECT a.id, a.ra, a.dec FROM a JOIN b ON XMATCH(r => 1.0)",
    )
    r3 = s.sql("SELECT * FROM nearby JOIN c ON XMATCH(r => 5.0, mode => 'all')")

A Session keeps the worker pool, the DuckDB connection in each worker, and the parsed registry alive across queries. On a 64-core box it's typically 5–10× faster than calling acid.sql() in a loop.

CLI

# Query execution (--db accepts a directory of HATS catalogs or a YAML file)
acid query "SELECT COUNT(*) FROM object" --db datasets/ --out /tmp/result
acid query -f query.sql --db catalogs.yaml --out results/ --workers 32
echo "SELECT ..." | acid query --db datasets/ --out results/
acid validate "SELECT ..." --db datasets/

# Download catalogs (HTTP, SSH, or local; full or spatial subset)
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,10
acid download user@server:/hats/gaia /data/gaia --columns ra,dec,mag --cone 180,0,5

# Inspect catalogs (local or remote)
acid inspect /data/two_mass                          # summary
acid inspect schema /data/two_mass                   # column schema
acid inspect https://data.lsdb.io/hats/two_mass/two_mass  # remote

# Build margin caches locally
acid hats build-margin /data/two_mass --margin-arcsec 5.0 --workers 16

results/ is itself a valid HATS catalog (lsdb.open_catalog(...) and hats.read_hats(...) will read it). Downloaded subsets are also valid HATS catalogs with rebuilt _metadata.

Catalog registry

The simplest way: point --db at a directory of HATS catalogs. Each subdirectory with a properties file becomes a table named after the directory. Margin caches (dataproduct_type=margin) are auto-skipped.

For more control, use a YAML file:

catalogs:
  dia_source:
    path: /data/dia_source      # HATS root, or CatalogCollection root
    # Auto-detected from <path>/properties when present:
    #   ra_col            (from hats_col_ra)
    #   dec_col           (from hats_col_dec)
    #   hpix_order        (from <path>/partition_info.csv)
    #   neighbor_path     (from collection.properties or sibling '_margin' dir)
    #   neighbor_margin_arcsec  (from hats_margin_threshold)
    #   npix_suffix       (from hats_npix_suffix; default '.parquet')
    # Any auto-detected value can be overridden here.

  object:
    path: /data/object_collection    # a CatalogCollection root works too

  lightcurve:
    path: /data/lightcurve
    hpix_order: 5                    # explicit when partition_info.csv is absent

# Named MOC footprints for IN_MOC() filtering.
# Each entry is a path to a FITS file (HEALPix image or MOC FITS).
mocs:
  des_dr2: /data/mocs/des_dr2.fits
  known_artifacts: /data/mocs/artifacts.fits
  # If a catalog has a point_map.fits at its root, IN_MOC(<alias>, '<catalog_name>')
  # auto-loads it — no explicit entry needed.

What XMATCH does

JOIN  b ON XMATCH(radius_arcsec => 1.0)                   -- nearest, inner
JOIN  b ON XMATCH(r => 1.0)                               -- 'r' is an alias
JOIN  b ON XMATCH(r => 1.0, mode => 'all')                -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0)                           -- keep unmatched anchors

-- Distance is exposed as a SELECT-level function over the right alias.
SELECT a.id, XMATCH_DISTANCE(b) AS d FROM a JOIN b ON XMATCH(r => 1.0)
WHERE  XMATCH_DISTANCE(b) < 0.5

-- Ordinary joins, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET,
-- DISTINCT all work; cross-partition reduction is handled internally.
SELECT a.id, COUNT(*) AS n, AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM a
JOIN  b ON XMATCH(r => 1.0)
JOIN  lightcurve AS lc ON a.id = lc.object_id
GROUP BY a.id
ORDER BY n DESC LIMIT 100

-- Footprint filtering via MOC (Multi-Order Coverage maps):
-- Restrict rows to a survey footprint or sky region.
SELECT a.id, a.ra, a.dec
FROM a JOIN b ON XMATCH(r => 1.0)
WHERE IN_MOC(a, 'des_dr2')              -- anchor inside DES footprint
  AND NOT IN_MOC(b, 'known_artifacts')  -- exclude artifact regions

-- IN_MOC also works in SELECT projections (per-row boolean):
SELECT a.id, IN_MOC(a, 'des_dr2') AS in_des FROM a

Semantics, in short:

  • All XMATCHes in a query use the anchor (first FROM) table's coordinates, even after a mode => 'all' expansion.
  • A right-table radius must be ≤ that catalog's neighbor_margin_arcsec. Otherwise we'd silently miss boundary pairs; the analyzer rejects the query.
  • ORDER BY ... LIMIT K pushes the top-K to each partition first; the reducer re-sorts the union and applies the global LIMIT/OFFSET.
  • Aggregates / GROUP BY / DISTINCT / HAVING run in a phase-2 reducer over the per-partition Parquet output.

Python API surface

# Sessions (preferred for EDA)
s = acid.connect(catalogs, *, workers=1, duckdb_threads=None, cache_dir=None)
s.sql(query, *, output=None, inmem_row_limit=50_000_000) -> acid.Result
s.run(query, *, output=None)                              -> ExecutionResult
s.validate(query)                                          -> QueryPlan
s.explain(query)                                           -> str (rewritten SQL)
s.list_catalogs()                                          -> list[str]
s.add_catalog(name, **kwargs)                              -> TableSpec
s.materialize(name, query, *, ra_col=None, dec_col=None, overwrite=False) -> TableSpec
s.register_moc(name, source)                           # FITS path, mocpy.MOC, or (N,2) ranges
s.close()    # or use as a context manager

# One-shot convenience (transient Session under the hood)
acid.sql(query, *, catalogs, output=None, workers=1, ...)  -> acid.Result
acid.run(query, *, catalogs, output, workers=1)            -> ExecutionResult

# Result
r.num_rows, r.column_names, r.schema
r.column(name)         -> pa.ChunkedArray
r.arrow()              -> pa.Table
r.df() / r.to_pandas() -> pandas.DataFrame
r.to_polars()          -> polars.DataFrame
r.to_pylist()          -> list[dict]
r.batches(batch_size=None) -> Iterator[pa.RecordBatch]
r.head(n=10)           -> Result
r.write_parquet(path, layout="hats"|"single") -> Path
len(r), for batch in r: ...

# Errors (all inherit from acid.AcidError)
acid.RegistryError       # catalog config (missing path, mixed Norder, ...)
acid.ParseError          # SQL parse failures
acid.ValidationError     # unsupported XMATCH constructs
acid.ExecutionError      # per-partition execution failures
acid.SessionClosedError  # method called on a closed Session

Layout assumptions

  • Catalogs follow the HATS layout: <root>/dataset/Norder=N/Dir=D/Npix=P.parquet (or Npix=P/*.parquet when hats_npix_suffix='/').
  • Margin caches live as sibling catalogs (HATS canonical), at <root>/margin_cache/..., or any sibling dir matching <name>_margin*. collection.properties is consumed if present.
  • Adaptive (per-pixel) Norder is supported: a catalog's partition_info.csv may list pixels at any orders, and XMATCH/ordinary joins across mixed-Norder catalogs are run via a refinement-tree enumeration that emits one work unit per coarsest cursor pixel where every joined catalog has ≤ 1 partition. Output is itself a valid HATS catalog whose partition_info.csv reflects the refinement.

What's the speed story?

  • DuckDB does the heavy SQL work and Parquet I/O. acid is roughly a Python orchestrator + a vectorized SciPy cKDTree crossmatcher.
  • Each partition is independent → embarrassingly parallel across HEALPix pixels.
  • The per-worker DuckDB connection is opened once and reused for every partition the worker processes; temp namespace is dropped between partitions. With a Session, the connection survives across queries.
  • Top-K queries push the LIMIT to each partition. Aggregates write partial data to disk and reduce centrally.
  • Column pruning: anchor and right tables are TEMP views over read_parquet(), so the final SELECT only pulls referenced columns from disk. Wide catalogs (150+ columns) don't slow down narrow SELECTs.
  • Auto-spill: when output_dir is unset and the running result exceeds inmem_row_limit (default 50M rows), acid spills to a tempdir rather than OOM-ing the parent.

See bench/match_all.py and bench/session_vs_oneshot.py for microbenchmarks.


Install

With uv (recommended for development)

uv sync --dev          # creates .venv, installs all deps + test + hats
uv run pytest          # run tests

With pip

pip install -e .
# extras:  pip install -e .[hats,dev]

Requires Python 3.10+, DuckDB 1.x, SQLGlot ≥ 27, PyArrow ≥ 14, NumPy ≥ 1.24, SciPy ≥ 1.10, PyYAML ≥ 6.


Status

  • v0 (correctness): XMATCH inner/left, mode 'nearest'/'all', chains, ordinary joins, distance via XMATCH_DISTANCE(alias).

  • v1 (scale): views + narrow side-tables, vectorized matcher, worker initializer, auto-spill, top-K pushdown, manifest.

  • v1.1 (HATS spec): writes valid HATS catalogs, reads canonical property keys, supports hats_npix_suffix='/', auto-discovers margin siblings via collection.properties.

  • v2 (EDA): persistent Session, per-worker DuckDB connection, Result wrapper, materialize().

  • v3 (adaptive Norder): per-catalog PartitionIndex, refinement-tree tuple enumeration, vectorized pure-numpy ang2pix_nested exposed as a DuckDB UDF for per-pixel row filtering, LEFT-XMATCH/JOIN over partitions without coverage.

  • v4 (MOC footprint filtering): IN_MOC(<alias>, '<name>') in WHERE restricts rows to a named sky region (Multi-Order Coverage map). Supports NOT IN_MOC, multiple predicates (AND-combined via mocpy set ops), catalog auto-resolution from point_map.fits, and use in SELECT/ORDER BY/ CASE expressions. Three-level optimization: catalog-footprint scoping, cursor-pixel intersection, and partition-level pruning — all via the existing _healpix_29 row-group pushdown fast path.

  • v5 (catalog ops): acid hats build-margin builds HATS margin caches locally (validated against hats-import). acid download generates point_map.fits, auto-includes HATS RA/Dec/healpix columns. acid query accepts --db <directory> for zero-config usage, fails fast on errors, shows tqdm progress, shuffles work for load balancing. Bare column resolution via schema introspection. LocalFetcher for local I/O.

Tests: 176 passing (~30s parallel via pytest-xdist). Fixtures cached across runs.

Known limitations

  • XMATCH must be the entire ON predicate. Compound predicates like XMATCH(...) AND b.mag < 20 are rejected.
  • No CTEs / subqueries in the anchor position.
  • RIGHT / FULL / CROSS JOIN XMATCH not supported.
  • IN_MOC must be in conjunctive WHERE position (top-level AND-chain, optionally negated). Disjunctive use (IN_MOC(...) OR ...) and IN_MOC in JOIN ON are rejected.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyacid-0.1.0a2.tar.gz (349.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyacid-0.1.0a2-py3-none-any.whl (101.9 kB view details)

Uploaded Python 3

File details

Details for the file pyacid-0.1.0a2.tar.gz.

File metadata

  • Download URL: pyacid-0.1.0a2.tar.gz
  • Upload date:
  • Size: 349.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyacid-0.1.0a2.tar.gz
Algorithm Hash digest
SHA256 9f268def72c2e8d2bcb89761705332029ba2b3e2bf9e5bdae7b8c88a6c87dd5d
MD5 99d40e7f4dfb7f6508b7d5a0a57eb178
BLAKE2b-256 8951d9995858c7df5924d7fa8d361adea7d642609484aa4826db527596521789

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyacid-0.1.0a2.tar.gz:

Publisher: publish.yml on mjuric/acid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pyacid-0.1.0a2-py3-none-any.whl.

File metadata

  • Download URL: pyacid-0.1.0a2-py3-none-any.whl
  • Upload date:
  • Size: 101.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyacid-0.1.0a2-py3-none-any.whl
Algorithm Hash digest
SHA256 6548585372a96002ede04703e5b2a4f4c26dd18fbc738bed70e35bac8e75f8b3
MD5 e6178aeba97f146afc891e2b0475eb40
BLAKE2b-256 4a383d457138c732fae02cc816124f849929e4be58ca7f30b6c57e5fab91b6f6

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyacid-0.1.0a2-py3-none-any.whl:

Publisher: publish.yml on mjuric/acid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page