Astronomical Catalog Inference Driver: XMATCH SQL over HATS-partitioned Parquet via DuckDB

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

acid — Astronomical Catalog Inference Driver

SQL-driven crossmatching and analysis of HEALPix-partitioned astronomical catalogs, executed by DuckDB. acid extends SQL with an XMATCH(...) predicate for spherical-distance joins, runs each anchor partition independently against a boundary-safe margin cache, and aggregates the results.

Reads and writes the HATS format used by LINCC Frameworks (LSDB, hats-import) and by published catalogs such as Gaia DR3 and Rubin DP1.

Quick start

One-shot query

import acid

r = acid.sql(
    """
    SELECT a.id, b.id AS b_id, XMATCH_DISTANCE(b) AS d_arcsec
    FROM dia_source AS a
    JOIN object AS b ON XMATCH(radius_arcsec => 1.0)
    JOIN lightcurve AS lc ON a.id = lc.object_id
    """,
    catalogs="catalogs.yaml",
    workers=8,
)

r.df()              # -> pandas.DataFrame
r.arrow()           # -> pyarrow.Table
r.to_polars()       # -> polars.DataFrame
for batch in r.batches(): ...      # -> RecordBatch iterator
r.write_parquet("results.parquet", layout="single")

Exploratory session (recommended for repeated queries)

with acid.connect("catalogs.yaml", workers=8) as s:
    r1 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 1.0)")
    r2 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 0.5)")

    # Promote an intermediate to a real catalog usable in later queries.
    s.materialize(
        "nearby",
        "SELECT a.id, a.ra, a.dec FROM a JOIN b ON XMATCH(r => 1.0)",
    )
    r3 = s.sql("SELECT * FROM nearby JOIN c ON XMATCH(r => 5.0, mode => 'all')")

A Session keeps the worker pool, the DuckDB connection in each worker, and the parsed registry alive across queries. On a 64-core box it's typically 5–10× faster than calling acid.sql() in a loop.

CLI

# Query execution (--db accepts a directory of HATS catalogs or a YAML file)
acid query "SELECT COUNT(*) FROM object" --db datasets/ --out /tmp/result
acid query -f query.sql --db catalogs.yaml --out results/ --workers 32
echo "SELECT ..." | acid query --db datasets/ --out results/
acid validate "SELECT ..." --db datasets/

# Download catalogs (HTTP, SSH, or local; full or spatial subset)
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,10
acid download user@server:/hats/gaia /data/gaia --columns ra,dec,mag --cone 180,0,5

# Inspect catalogs (local or remote)
acid inspect /data/two_mass                          # summary
acid inspect schema /data/two_mass                   # column schema
acid inspect https://data.lsdb.io/hats/two_mass/two_mass  # remote

# Build margin caches locally
acid hats build-margin /data/two_mass --margin-arcsec 5.0 --workers 16

results/ is itself a valid HATS catalog (lsdb.open_catalog(...) and hats.read_hats(...) will read it). Downloaded subsets are also valid HATS catalogs with rebuilt _metadata.

Catalog registry

The simplest way: point --db at a directory of HATS catalogs. Each subdirectory with a properties file becomes a table named after the directory. Margin caches (dataproduct_type=margin) are auto-skipped.

For more control, use a YAML file:

catalogs:
  dia_source:
    path: /data/dia_source      # HATS root, or CatalogCollection root
    # Auto-detected from <path>/properties when present:
    #   ra_col            (from hats_col_ra)
    #   dec_col           (from hats_col_dec)
    #   hpix_order        (from <path>/partition_info.csv)
    #   neighbor_path     (from collection.properties or sibling '_margin' dir)
    #   neighbor_margin_arcsec  (from hats_margin_threshold)
    #   npix_suffix       (from hats_npix_suffix; default '.parquet')
    # Any auto-detected value can be overridden here.

  object:
    path: /data/object_collection    # a CatalogCollection root works too

  lightcurve:
    path: /data/lightcurve
    hpix_order: 5                    # explicit when partition_info.csv is absent

# Named MOC footprints for IN_MOC() filtering.
# Each entry is a path to a FITS file (HEALPix image or MOC FITS).
mocs:
  des_dr2: /data/mocs/des_dr2.fits
  known_artifacts: /data/mocs/artifacts.fits
  # If a catalog has a point_map.fits at its root, IN_MOC(<alias>, '<catalog_name>')
  # auto-loads it — no explicit entry needed.

What `XMATCH` does

JOIN  b ON XMATCH(radius_arcsec => 1.0)                   -- nearest, inner
JOIN  b ON XMATCH(r => 1.0)                               -- 'r' is an alias
JOIN  b ON XMATCH(r => 1.0, mode => 'all')                -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0)                           -- keep unmatched anchors

-- Distance is exposed as a SELECT-level function over the right alias.
SELECT a.id, XMATCH_DISTANCE(b) AS d FROM a JOIN b ON XMATCH(r => 1.0)
WHERE  XMATCH_DISTANCE(b) < 0.5

-- Ordinary joins, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET,
-- DISTINCT all work; cross-partition reduction is handled internally.
SELECT a.id, COUNT(*) AS n, AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM a
JOIN  b ON XMATCH(r => 1.0)
JOIN  lightcurve AS lc ON a.id = lc.object_id
GROUP BY a.id
ORDER BY n DESC LIMIT 100

-- Footprint filtering via MOC (Multi-Order Coverage maps):
-- Restrict rows to a survey footprint or sky region.
SELECT a.id, a.ra, a.dec
FROM a JOIN b ON XMATCH(r => 1.0)
WHERE IN_MOC(a, 'des_dr2')              -- anchor inside DES footprint
  AND NOT IN_MOC(b, 'known_artifacts')  -- exclude artifact regions

-- IN_MOC also works in SELECT projections (per-row boolean):
SELECT a.id, IN_MOC(a, 'des_dr2') AS in_des FROM a

Semantics, in short:

All XMATCHes in a query use the anchor (first FROM) table's coordinates, even after a mode => 'all' expansion.
A right-table radius must be ≤ that catalog's neighbor_margin_arcsec. Otherwise we'd silently miss boundary pairs; the analyzer rejects the query.
ORDER BY ... LIMIT K pushes the top-K to each partition first; the reducer re-sorts the union and applies the global LIMIT/OFFSET.
Aggregates / GROUP BY / DISTINCT / HAVING run in a phase-2 reducer over the per-partition Parquet output.

Python API surface

# Sessions (preferred for EDA)
s = acid.connect(catalogs, *, workers=1, duckdb_threads=None, cache_dir=None)
s.sql(query, *, output=None, inmem_row_limit=50_000_000) -> acid.Result
s.run(query, *, output=None)                              -> ExecutionResult
s.validate(query)                                          -> QueryPlan
s.explain(query)                                           -> str (rewritten SQL)
s.list_catalogs()                                          -> list[str]
s.add_catalog(name, **kwargs)                              -> TableSpec
s.materialize(name, query, *, ra_col=None, dec_col=None, overwrite=False) -> TableSpec
s.register_moc(name, source)                           # FITS path, mocpy.MOC, or (N,2) ranges
s.close()    # or use as a context manager

# One-shot convenience (transient Session under the hood)
acid.sql(query, *, catalogs, output=None, workers=1, ...)  -> acid.Result
acid.run(query, *, catalogs, output, workers=1)            -> ExecutionResult

# Result
r.num_rows, r.column_names, r.schema
r.column(name)         -> pa.ChunkedArray
r.arrow()              -> pa.Table
r.df() / r.to_pandas() -> pandas.DataFrame
r.to_polars()          -> polars.DataFrame
r.to_pylist()          -> list[dict]
r.batches(batch_size=None) -> Iterator[pa.RecordBatch]
r.head(n=10)           -> Result
r.write_parquet(path, layout="hats"|"single") -> Path
len(r), for batch in r: ...

# Errors (all inherit from acid.AcidError)
acid.RegistryError       # catalog config (missing path, mixed Norder, ...)
acid.ParseError          # SQL parse failures
acid.ValidationError     # unsupported XMATCH constructs
acid.ExecutionError      # per-partition execution failures
acid.SessionClosedError  # method called on a closed Session

Layout assumptions

Catalogs follow the HATS layout: <root>/dataset/Norder=N/Dir=D/Npix=P.parquet (or Npix=P/*.parquet when hats_npix_suffix='/').
Margin caches live as sibling catalogs (HATS canonical), at <root>/margin_cache/..., or any sibling dir matching <name>_margin*. collection.properties is consumed if present.
Adaptive (per-pixel) Norder is supported: a catalog's partition_info.csv may list pixels at any orders, and XMATCH/ordinary joins across mixed-Norder catalogs are run via a refinement-tree enumeration that emits one work unit per coarsest cursor pixel where every joined catalog has ≤ 1 partition. Output is itself a valid HATS catalog whose partition_info.csv reflects the refinement.

What's the speed story?

DuckDB does the heavy SQL work and Parquet I/O. acid is roughly a Python orchestrator + a vectorized SciPy cKDTree crossmatcher.
Each partition is independent → embarrassingly parallel across HEALPix pixels.
The per-worker DuckDB connection is opened once and reused for every partition the worker processes; temp namespace is dropped between partitions. With a Session, the connection survives across queries.
Top-K queries push the LIMIT to each partition. Aggregates write partial data to disk and reduce centrally.
Column pruning: anchor and right tables are TEMP views over read_parquet(), so the final SELECT only pulls referenced columns from disk. Wide catalogs (150+ columns) don't slow down narrow SELECTs.
Auto-spill: when output_dir is unset and the running result exceeds inmem_row_limit (default 50M rows), acid spills to a tempdir rather than OOM-ing the parent.

See bench/match_all.py and bench/session_vs_oneshot.py for microbenchmarks.

Install

With uv (recommended for development)

uv sync --dev          # creates .venv, installs all deps + test + hats
uv run pytest          # run tests

With pip

pip install -e .
# extras:  pip install -e .[hats,dev]

Requires Python 3.10+, DuckDB 1.x, SQLGlot ≥ 27, PyArrow ≥ 14, NumPy ≥ 1.24, SciPy ≥ 1.10, PyYAML ≥ 6.

Status

v0 (correctness): XMATCH inner/left, mode 'nearest'/'all', chains, ordinary joins, distance via XMATCH_DISTANCE(alias).
v1 (scale): views + narrow side-tables, vectorized matcher, worker initializer, auto-spill, top-K pushdown, manifest.
v1.1 (HATS spec): writes valid HATS catalogs, reads canonical property keys, supports hats_npix_suffix='/', auto-discovers margin siblings via collection.properties.
v2 (EDA): persistent Session, per-worker DuckDB connection, Result wrapper, materialize().
v3 (adaptive Norder): per-catalog PartitionIndex, refinement-tree tuple enumeration, vectorized pure-numpy ang2pix_nested exposed as a DuckDB UDF for per-pixel row filtering, LEFT-XMATCH/JOIN over partitions without coverage.
v4 (MOC footprint filtering): IN_MOC(<alias>, '<name>') in WHERE restricts rows to a named sky region (Multi-Order Coverage map). Supports NOT IN_MOC, multiple predicates (AND-combined via mocpy set ops), catalog auto-resolution from point_map.fits, and use in SELECT/ORDER BY/ CASE expressions. Three-level optimization: catalog-footprint scoping, cursor-pixel intersection, and partition-level pruning — all via the existing _healpix_29 row-group pushdown fast path.
v5 (catalog ops): acid hats build-margin builds HATS margin caches locally (validated against hats-import). acid download generates point_map.fits, auto-includes HATS RA/Dec/healpix columns. acid query accepts --db <directory> for zero-config usage, fails fast on errors, shows tqdm progress, shuffles work for load balancing. Bare column resolution via schema introspection. LocalFetcher for local I/O.

Tests: 176 passing (~30s parallel via pytest-xdist). Fixtures cached across runs.

Known limitations

XMATCH must be the entire ON predicate. Compound predicates like XMATCH(...) AND b.mag < 20 are rejected.
No CTEs / subqueries in the anchor position.
RIGHT / FULL / CROSS JOIN XMATCH not supported.
IN_MOC must be in conjunctive WHERE position (top-level AND-chain, optionally negated). Disjunctive use (IN_MOC(...) OR ...) and IN_MOC in JOIN ON are rejected.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

mjuric

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0a2 pre-release

May 23, 2026

0.1.0a1 pre-release

May 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyacid-0.1.0a2.tar.gz (349.7 kB view details)

Uploaded May 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyacid-0.1.0a2-py3-none-any.whl (101.9 kB view details)

Uploaded May 23, 2026 Python 3

File details

Details for the file pyacid-0.1.0a2.tar.gz.

File metadata

Download URL: pyacid-0.1.0a2.tar.gz
Upload date: May 23, 2026
Size: 349.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyacid-0.1.0a2.tar.gz
Algorithm	Hash digest
SHA256	`9f268def72c2e8d2bcb89761705332029ba2b3e2bf9e5bdae7b8c88a6c87dd5d`
MD5	`99d40e7f4dfb7f6508b7d5a0a57eb178`
BLAKE2b-256	`8951d9995858c7df5924d7fa8d361adea7d642609484aa4826db527596521789`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyacid-0.1.0a2.tar.gz:

Publisher: publish.yml on mjuric/acid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyacid-0.1.0a2.tar.gz
- Subject digest: 9f268def72c2e8d2bcb89761705332029ba2b3e2bf9e5bdae7b8c88a6c87dd5d
- Sigstore transparency entry: 1611735429
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: mjuric/acid@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4
- Branch / Tag: refs/tags/v0.1.0a2
- Owner: https://github.com/mjuric
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4
- Trigger Event: release

File details

Details for the file pyacid-0.1.0a2-py3-none-any.whl.

File metadata

Download URL: pyacid-0.1.0a2-py3-none-any.whl
Upload date: May 23, 2026
Size: 101.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pyacid-0.1.0a2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6548585372a96002ede04703e5b2a4f4c26dd18fbc738bed70e35bac8e75f8b3`
MD5	`e6178aeba97f146afc891e2b0475eb40`
BLAKE2b-256	`4a383d457138c732fae02cc816124f849929e4be58ca7f30b6c57e5fab91b6f6`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyacid-0.1.0a2-py3-none-any.whl:

Publisher: publish.yml on mjuric/acid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyacid-0.1.0a2-py3-none-any.whl
- Subject digest: 6548585372a96002ede04703e5b2a4f4c26dd18fbc738bed70e35bac8e75f8b3
- Sigstore transparency entry: 1611735906
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: mjuric/acid@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4
- Branch / Tag: refs/tags/v0.1.0a2
- Owner: https://github.com/mjuric
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7dc40ebcb0d5081d90b51bc4b758aa5abee782a4
- Trigger Event: release

pyacid 0.1.0a2

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

acid — Astronomical Catalog Inference Driver

Quick start

One-shot query

Exploratory session (recommended for repeated queries)

CLI

Catalog registry

What XMATCH does

Python API surface

Layout assumptions

What's the speed story?

Install

With uv (recommended for development)

With pip

Status

Known limitations

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

What `XMATCH` does