Astronomical Catalog Inference Driver: XMATCH SQL over HATS-partitioned Parquet via native Polars

ACID — Astronomical Catalog Inference Driver

Cross-match and query HEALPix-partitioned astronomical catalogs from Python. ACID gives you a fluent Catalog surface (crossmatch, where, select, group_by / aggregate, save) for the common shapes, and a SQL escape hatch (db.sql(...)) with one astronomy extension — XMATCH(radius_arcsec => ...) — for everything else. Each anchor partition runs independently against a boundary-safe margin cache.

Reads and writes the HATS format used by LINCC Frameworks (LSDB, hats-import) and by published catalogs such as Gaia DR3 and Rubin DP1.

Quick start

Run a query

acid is module-level and singleton-by-default (like Ray / DuckDB / Polars): acid.open(...) returns a catalog handle, acid.sql.query(...) is the escape hatch, and the worker pool is built once and reused. acid.init(...) is optional — call it to pin the source / worker count; otherwise the first acid.open() lazy-inits with defaults.

import acid
import astropy.units as u

acid.init("catalogs.yaml", workers=8)       # optional — pins config

gaia = acid.open("gaia_dr3")
twomass = acid.open("twomass_psc")

# Fluent: composable verbs, lazy until materialized.
matches = (gaia
           .crossmatch(twomass, radius=1*u.arcsec)
           .where("phot_g_mean_mag < 16")
           .select("source_id, designation"))

matches.head(10).show()         # pretty-print to stdout
df = matches.to_polars()        # Catalog converters: .to_polars(),
                                #   .to_astropy(), .to_arrow(), .to_pandas()

# SQL escape hatch for aggregates / HAVING / windows / DISTINCT.
r = acid.sql.query("""
    SELECT g.source_id, t.designation, d
    FROM   gaia_dr3 AS g
    JOIN   twomass_psc AS t ON XMATCH(radius_arcsec => 1.0, dist_col => 'd')
    ORDER BY d
    LIMIT  20
""")
print(r)

Need two simultaneous connections, or full isolation (e.g. in a library or a test)? Construct acid.Connection(...) explicitly and use it as a context manager — it bypasses the module-level default entirely:

with acid.Connection("catalogs.yaml", workers=8) as db:
    df = db.open("gaia_dr3").head(10).to_polars()
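
The split between a lazily created process-wide default and an explicit, isolated connection is a common pattern. The following is a minimal sketch of that pattern only, not ACID's implementation: the `Connection` class, `init`, `default_connection`, and the default worker count of 4 are hypothetical stand-ins.

```python
import threading
from typing import Optional


class Connection:
    """Hypothetical stand-in for a connection that owns a worker pool."""

    def __init__(self, source: Optional[str] = None, workers: Optional[int] = None):
        self.source = source
        self.workers = workers if workers is not None else 4  # assumed default
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "Connection":
        return self

    def __exit__(self, *exc) -> None:
        self.close()


_default: Optional[Connection] = None
_lock = threading.Lock()


def init(source: Optional[str] = None, *, workers: Optional[int] = None) -> None:
    """Pin the process-wide default explicitly (optional)."""
    global _default
    with _lock:
        if _default is not None:
            _default.close()  # assumption: re-init replaces the old default
        _default = Connection(source, workers)


def default_connection() -> Connection:
    """Return the shared default, creating it with defaults on first use."""
    global _default
    with _lock:
        if _default is None:
            _default = Connection()
        return _default
```

An explicit `Connection(...)` never touches `_default`, which is what makes it safe inside a library or a test.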

Every materialization call returns a Result, a thin wrapper around an Arrow table. .show() prints it; .to_astropy() / .to_polars() / .to_arrow() / .to_pandas() / .to_pylist() convert it (the same converter names as Catalog). .export("results.parquet") writes one flat file (csv/parquet/fits, chosen by extension or format=), while .save(path) writes a HATS catalog directory. This is the same stays-in-the-system / leaves-the-system pair as Catalog.save / Catalog.export, minus the name registration, which only makes sense on a connection.

Restrict to a region while you iterate

acid.in_cone(...) (or db.in_cone(...)) is a context manager that scopes a spatial cone to every query executed inside the with block — both the fluent surface and db.sql(...). The cone is applied at execution time, not when the query was built, so you can build a query once and run it scoped inside the block and full-sky outside it. Use it for a "debug small, run big" workflow: keep the block in while you iterate, remove it for the production run.

import acid
import astropy.units as u

gaia = acid.open("gaia_dr3")

with acid.in_cone((180, 0), radius=1*u.deg):
    small = gaia.where("phot_g_mean_mag < 16").to_polars()

# Same query, full sky, no edits:
big = gaia.where("phot_g_mean_mag < 16").to_polars()

Cones do not nest; one in_cone block at a time.

Materialize an intermediate

Catalog.save(...) writes a query result as a HATS catalog and registers it on the connection so later queries can reference it by name. This is the canonical EDA pattern: run a heavy crossmatch once, save it, iterate cheaply on the cached output.

import acid
import astropy.units as u

nearby = (acid.open("gaia_dr3")
              .crossmatch(acid.open("twomass_psc"), radius=1*u.arcsec)
              .save("./out/gaia_x_2mass", name="nearby"))

# `nearby` is a normal Catalog; "nearby" is also resolvable by name.
r = acid.sql.query("SELECT COUNT(*) FROM nearby")
print(r)

CLI

# Query execution. The SQL query is required (use '-' to read stdin, or -f).
# --db is a ':'-separated list of HATS dirs / registry YAMLs; it's optional,
# falling back to $ACID_PATH, the acid.conf 'path' setting, then ~/datasets.
acid query "SELECT COUNT(*) FROM object" --db datasets/ --output /tmp/result
acid query -f query.sql --db catalogs.yaml --output results/ --workers 32
# --ram-budget bounds the RAM the planner sizes work tuples for
# (default: 25% of available RAM); bytes or 64GB / 512MiB forms.
acid query "SELECT ..." --db datasets/ --ram-budget 64GB
echo "SELECT ..." | acid query - --db datasets/ --output results/   # '-' reads stdin
# --format is optional: it's inferred from the --output extension
# (.parquet/.pq, .csv, .fits/.fit, .hats); no extension → HATS tree.
acid query "SELECT ..." --db datasets/ --output results.csv
# --open uses a raw file (parquet/csv/fits/arrow/…) as a table, alongside the
# --db catalogs. The ra/dec column names are required. Two forms: positional
# 'PATH,RA,DEC' (table name = file basename) or named 'NAME=PATH,ra=RA,dec=DEC'.
acid query "SELECT t.id, g.source_id FROM t JOIN gaia ON XMATCH(radius_arcsec => 1.0)" \
    --db datasets/ --open t=candidates.csv,ra=RA,dec=DEC
acid query "SELECT * FROM candidates" --db datasets/ --open candidates.csv,RA,DEC
acid validate "SELECT ..." --db datasets/

# List what's already in your catalog path (ACID_PATH) — names you can query
acid list                                             # every catalog acid can open by name
acid list gaia                                        # filter by name (substring)
# → same one-line-per-catalog format as `acid search` (margin radii, shadowing),
#   but over ACID_PATH (what you already have) rather than the download path.

# Discover what's available to download (across the download path)
acid search                                           # list every downloadable catalog
acid search gaia                                      # filter by name (substring)
# → one line per catalog with its margin-cache radii (arcsec); names like
#   `wise/allwise` come from namespace dirs on the mirror. Remote listings are
#   cached ~1h; --cache refresh re-crawls. Piped, it emits TSV for scripting.

# Download catalogs (HTTP, SSH, or local; full or spatial subset)
acid download two_mass                                # resolve name + dest (see below)
acid download wise/allwise                            # nested name (as shown by `acid search`)
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,10
acid download user@server:/hats/gaia /data/gaia --columns ra,dec,mag --cone 180,0,5

# Import tabular files into a HATS catalog — name the source kind first
# (parquet / csv / fits / arrow / butler), then the sources.
acid import peek data/brick.fits                              # inspect: format, rows, RA/Dec
acid import fits data/*.fits --out my_catalog --ra RA --dec DEC   # bare name → ACID_PATH root
acid import fits survey/ --out /data/cat --ra ra --dec dec    # a directory of files
acid import csv lc/*.csv --out lc --ra ra --dec dec --schema schema.yaml   # coerce CSV types
acid import csv survey.dat --out cat --ra ra --dec dec --delimiter '|'     # any extension
# Sources can be remote fsspec URLs (http(s)/s3/gs/…; install s3fs/gcsfs for those):
acid import parquet https://example.org/survey/bricks.parquet --out cat --ra ra --dec dec
acid import parquet 's3://bucket/cat/*.parquet' --out cat --ra ra --dec dec \
  --storage-option anon=true               # public bucket; creds also via AWS_* env
# Import a Rubin Butler dataset directly (needs the LSST stack; always shuffles):
acid import butler --repo /repo/dp1 \
  --collection LSSTComCam/runs/DRP/DP1/... --dataset object \
  --ra coord_ra --dec coord_dec --where "tract = 4849" --out dp1_object
# A QUOTED glob is expanded in-process (handles directories with 70k+ files,
# local or remote). http(s):// has no directory listing — pass explicit URLs.
# Files are read as the named format with the same columns (order may differ; a
# --schema coerces CSV). Small inputs import in memory; large ones use an
# out-of-core shuffle (--mode auto|inmem|shuffle; --order/--max-order/
# --rows-per-partition tune it). A margin cache is built automatically after
# import (default 10 arcsec; --margin-arcsec R / --no-margin), so boundary
# crossmatches are correct without a separate step.

# Inspect catalogs (local or remote)
acid inspect two_mass                                # bare name → resolved on ACID_PATH
acid inspect /data/two_mass                          # summary
acid inspect schema /data/two_mass                   # column schema
acid inspect https://data.lsdb.io/hats/two_mass/two_mass  # remote

# Build margin caches locally (--margin-arcsec defaults to 10.0)
acid hats build-margin /data/two_mass --margin-arcsec 10.0 --workers 16

results/ is itself a valid HATS catalog (lsdb.open_catalog(...) and hats.read_hats(...) will read it). Downloaded subsets are also valid HATS catalogs with rebuilt _metadata.
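
Two of the CLI conventions described in the comments above (inferring the format from the --output extension, and the two --open forms) are easy to get wrong when scripting around acid. The sketch below restates them in Python. It is not acid's parser: `infer_format` and `parse_open_spec` are hypothetical helpers, and rejecting unknown extensions with `ValueError` is an assumption.

```python
from pathlib import PurePosixPath

# Extension -> format, as listed in the CLI comments above.
_FORMATS = {
    ".parquet": "parquet", ".pq": "parquet",
    ".csv": "csv",
    ".fits": "fits", ".fit": "fits",
    ".hats": "hats",
}


def infer_format(output: str) -> str:
    """Pick a format from the --output extension; no extension means a HATS tree."""
    suffix = PurePosixPath(output).suffix.lower()
    if not suffix:
        return "hats"
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unrecognized output extension: {suffix!r}") from None


def parse_open_spec(spec: str) -> dict:
    """Parse 'PATH,RA,DEC' or 'NAME=PATH,ra=RA,dec=DEC' into its parts."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated fields")
    first, ra, dec = parts
    if "=" in first:
        # Named form: every field carries a key.
        name, path = first.split("=", 1)
        if not (ra.startswith("ra=") and dec.startswith("dec=")):
            raise ValueError("named form needs ra=... and dec=...")
        ra, dec = ra[len("ra="):], dec[len("dec="):]
    else:
        # Positional form: the table name is the file basename.
        path = first
        name = PurePosixPath(path).stem
    return {"name": name, "path": path, "ra": ra, "dec": dec}
```

A path that itself contains `=` or `,` would be misread by this sketch.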

acid download name resolution. Give a bare catalog name and both the source and destination are resolved for you:

acid download two_mass
# source ← first ACID_DOWNLOAD_PATH root that has it (collection-aware),
#          e.g. https://data.lsdb.io/hats/two_mass/two_mass
# dest   ← <first writable local ACID_PATH root>/two_mass (created if needed)

The source search path is ACID_DOWNLOAD_PATH → the download_path config setting → the built-in defaults (https://data.lsdb.io/hats/, then the SLAC ssh://slacd/sdf/home/m/mjuric/datasets dir), searched in order. Each candidate <root>/<name> is probed; a directory holding collection.properties is a collection, so its hats_primary_table_url child is downloaded. The name may be nested (wise/allwise) to reach a catalog under a namespace directory (these are exactly the names acid search prints), and it lands locally under its leaf (<ACID_PATH>/allwise). An explicit source (a path with a leading ./, /, or ~, or a URL) skips the lookup and is used verbatim; give a local relative dir a leading ./ to copy from it. Use acid search to see which names resolve.

The destination follows the same bare-vs-path rule: omit it and the catalog lands in <first writable local ACID_PATH root>/<catalog name>; pass a bare name (acid download two_mass tm) and it resolves to <ACID_PATH root>/tm; pass a path with a / (./tm, /data/tm) and it's used verbatim. The ACID_PATH root is the same search path acid query uses (URL entries are skipped), created with a notice if it doesn't exist. An explicit source with an omitted destination is an error (there's no name to resolve a destination from).
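
The bare-vs-path rule for the destination can be written down as a small function. This is a sketch of the rule as described above, not acid's resolver: `looks_explicit` and `resolve_destination` are hypothetical names, and treating any source that contains a `:` as explicit (to cover URLs and `user@server:/path` forms) is an assumption.

```python
from typing import Optional


def looks_explicit(source: str) -> bool:
    """True for a ./, /, or ~ path, or anything with a ':' (URL or ssh form)."""
    return source.startswith(("./", "/", "~")) or ":" in source


def resolve_destination(source: str, dest: Optional[str], acid_path_root: str) -> str:
    """Apply the destination rule: omitted, bare name, or verbatim path."""
    root = acid_path_root.rstrip("/")
    if dest is None:
        if looks_explicit(source):
            # No catalog name to derive a destination from.
            raise ValueError("an explicit source needs an explicit destination")
        leaf = source.rstrip("/").split("/")[-1]   # wise/allwise -> allwise
        return f"{root}/{leaf}"
    if "/" in dest:
        return dest                                 # path: used verbatim
    return f"{root}/{dest}"                         # bare name: under the root
```

For example, `resolve_destination("wise/allwise", None, "/data/hats")` places the catalog under its leaf name, while a destination of `./tm` is returned unchanged.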

Catalog registry

The simplest way: point --db (or acid.init(...)) at a directory of HATS catalogs. Each subdirectory with a properties file becomes a table named after the directory. Margin caches (dataproduct_type=margin) are auto-skipped.
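
A rough picture of what such a directory scan involves, assuming the properties file is a plain key=value text file. `read_properties` and `scan_registry` are hypothetical helpers written for this illustration, and the sketch ignores CatalogCollection roots entirely.

```python
import tempfile
from pathlib import Path


def read_properties(path: Path) -> dict:
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def scan_registry(root: Path) -> dict:
    """Map directory name -> spec for each subdirectory holding a properties file."""
    tables = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        prop_file = sub / "properties"
        if not prop_file.is_file():
            continue                      # not a catalog
        props = read_properties(prop_file)
        if props.get("dataproduct_type") == "margin":
            continue                      # margin caches are skipped
        tables[sub.name] = {
            "path": sub,
            "ra_col": props.get("hats_col_ra"),
            "dec_col": props.get("hats_col_dec"),
        }
    return tables


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {
        "gaia": "hats_col_ra=ra\nhats_col_dec=dec\n",
        "gaia_margin": "dataproduct_type=margin\n",
    }
    for name, text in layout.items():
        (root / name).mkdir()
        (root / name / "properties").write_text(text)
    (root / "scratch").mkdir()            # no properties file: ignored
    tables = scan_registry(root)
    table_names = sorted(tables)
    gaia_cols = (tables["gaia"]["ra_col"], tables["gaia"]["dec_col"])
```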

For more control, use a YAML file:

catalogs:
  dia_source:
    path: /data/dia_source      # HATS root, or CatalogCollection root
    # Auto-detected from <path>/properties when present:
    #   ra_col            (from hats_col_ra)
    #   dec_col           (from hats_col_dec)
    #   hpix_order        (from <path>/partition_info.csv)
    #   neighbor_path     (from collection.properties or sibling '_margin' dir)
    #   neighbor_margin_arcsec  (from hats_margin_threshold)
    #   npix_suffix       (from hats_npix_suffix; default '.parquet')
    # Any auto-detected value can be overridden here.

  object:
    path: /data/object_collection    # a CatalogCollection root works too

  lightcurve:
    path: /data/lightcurve
    hpix_order: 5                    # explicit when partition_info.csv is absent

# Named MOC footprints for IN_MOC() filtering.
# Each entry is a path to a FITS file (HEALPix image or MOC FITS).
mocs:
  des_dr2: /data/mocs/des_dr2.fits
  known_artifacts: /data/mocs/artifacts.fits
  # If a catalog has a point_map.fits at its root, IN_MOC(<alias>, '<catalog_name>')
  # auto-loads it — no explicit entry needed.

Configuration (`acid.conf`)

So you don't re-type --db/--workers on every invocation, acid reads an INI config. The first existing file wins, searched highest-priority first: --config FILE / $ACID_CONFIG, then ~/.config/acid/acid.conf ($XDG_CONFIG_HOME), /sdf/data/rubin/user/mjuric/etc/acid.conf, /sdf/home/m/mjuric/etc/acid.conf, $XDG_CONFIG_DIRS, /etc/acid/acid.conf.

# ~/.config/acid/acid.conf
[acid]
path = /data/hats:~/datasets        # ':'-separated HATS dirs / registry YAMLs
download_path = https://data.lsdb.io/hats/   # 'acid download' name search path
workers = 32                        # query worker pool ("auto" = cgroup-aware)
mem_per_worker_gb = 4               # RAM/worker bounding "auto" (CPU and memory)
tmpdir = /scratch/$USER             # base temp dir (a per-run subdir is made + cleaned)
inmem_row_limit = 50_000_000        # spill threshold
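
To see how a file like this parses, here is a sketch using the standard library's configparser. It is not how acid reads its config: `first_existing` and `load_settings` are hypothetical helpers, and splitting path on ':' assumes local paths only (a URL entry would be split incorrectly).

```python
import configparser
import tempfile
from pathlib import Path

SAMPLE = """\
# ~/.config/acid/acid.conf
[acid]
path = /data/hats:~/datasets        # ':'-separated HATS dirs / registry YAMLs
workers = 32                        # query worker pool
inmem_row_limit = 50_000_000        # spill threshold
"""


def first_existing(candidates):
    """Return the first candidate that exists as a file (first match wins)."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None


def load_settings(conf_path):
    """Read the [acid] section, dropping trailing '# ...' comments."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",), interpolation=None)
    parser.read(conf_path)
    section = parser["acid"]
    settings = {}
    if "path" in section:
        settings["path"] = section["path"].split(":")
    if "workers" in section:
        raw = section["workers"]
        settings["workers"] = raw if raw == "auto" else int(raw)
    if "inmem_row_limit" in section:
        # int() accepts underscore-grouped digits such as 50_000_000.
        settings["inmem_row_limit"] = int(section["inmem_row_limit"])
    return settings


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "acid.conf"
    conf.write_text(SAMPLE)
    chosen = first_existing([Path(tmp) / "missing.conf", conf])
    settings = load_settings(chosen)
```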

Each setting resolves explicit flag/arg → env var → config → built-in. Env overrides: ACID_PATH, ACID_DOWNLOAD_PATH, ACID_WORKERS, ACID_MEM_PER_WORKER_GB, ACID_TMPDIR, ACID_INMEM_ROW_LIMIT.

tmpdir is special: it also honors the standard OS $TMPDIR — --tmpdir/kwarg > $ACID_TMPDIR > $TMPDIR > config > system default — so a batch system's node-local $TMPDIR (e.g. SLURM) is used automatically; set $ACID_TMPDIR to point acid elsewhere. This applies in the library (Jupyter) too, and acid only ever uses it for its own temp files — it never changes the process $TMPDIR, so other libraries are unaffected.
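
The resolution order, including the extra $TMPDIR step for tmpdir, can be summarized in a few lines. This is a sketch of the precedence rule only; `resolve` is a hypothetical function and the provenance labels it returns are made up for the example.

```python
def resolve(name, *, explicit=None, environ=None, config=None, builtin=None):
    """Return (value, provenance) following flag/arg > env > config > built-in."""
    environ = environ or {}
    config = config or {}
    if explicit is not None:
        return explicit, "explicit"
    env_keys = ["ACID_" + name.upper()]
    if name == "tmpdir":
        env_keys.append("TMPDIR")      # standard OS variable, after ACID_TMPDIR
    for key in env_keys:
        if environ.get(key):
            return environ[key], f"env:{key}"
    if name in config:
        return config[name], "config"
    return builtin, "built-in"
```

With `environ={"TMPDIR": "/lscratch/job1"}` and a config file that sets tmpdir, the node-local directory wins; adding `ACID_TMPDIR` overrides both.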

Inspect and edit with acid config:

acid config show                 # values set in the file (--effective: resolved + provenance)
acid config get workers          # file value; exits 1 (prints nothing) if unset
acid config set path /data/hats:~/datasets

With a config in place, --db becomes optional (falls back to the path setting, then ~/datasets). See docs/archive/CONFIG-SYSTEM.md for the full design.

What `XMATCH` does

JOIN  b ON XMATCH(radius_arcsec => 1.0)                   -- nearest, inner
JOIN  b ON XMATCH(r => 1.0)                               -- 'r' is an alias
JOIN  b ON XMATCH(r => 1.0, mode => 'all')                -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0)                           -- keep unmatched anchors

-- Distance is exposed as a named column via dist_col on the XMATCH call.
SELECT a.id, d FROM a JOIN b ON XMATCH(r => 1.0, dist_col => 'd')
WHERE  d < 0.5

-- Ordinary joins, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET,
-- DISTINCT all work; cross-partition reduction is handled internally.
SELECT a.id, COUNT(*) AS n, AVG(d) AS avg_d
FROM a
JOIN  b ON XMATCH(r => 1.0, dist_col => 'd')
JOIN  lightcurve AS lc ON a.id = lc.object_id
GROUP BY a.id
ORDER BY n DESC LIMIT 100

-- Footprint filtering via MOC (Multi-Order Coverage maps):
-- Restrict rows to a survey footprint or sky region.
SELECT a.id, a.ra, a.dec
FROM a JOIN b ON XMATCH(r => 1.0)
WHERE IN_MOC(a, 'des_dr2')              -- anchor inside DES footprint
  AND NOT IN_MOC(b, 'known_artifacts')  -- exclude artifact regions

-- IN_MOC also works in SELECT projections (per-row boolean):
SELECT a.id, IN_MOC(a, 'des_dr2') AS in_des FROM a

The fluent equivalent of the simple shapes:

a.crossmatch(b, radius=1*u.arcsec)                          # nearest, inner
a.crossmatch(b, radius=1*u.arcsec, how="all")               # every match within r
a.crossmatch(b, radius=1*u.arcsec, how="left")              # LEFT XMATCH
a.in_region("des_dr2")                                      # IN_MOC mask, per-receiver
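
For readers new to cross-matching, the three shapes (nearest, all, left) can be illustrated with a brute-force matcher over tiny in-memory lists. This is a conceptual sketch, not ACID's matcher, which is vectorized and partition-aware: `sep_arcsec` and `crossmatch` are hypothetical helpers, and treating the radius as inclusive is an assumption.

```python
import math


def sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula; degrees in, arcsec out."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def crossmatch(anchor, other, radius_arcsec, how="nearest"):
    """Match (id, ra_deg, dec_deg) rows; return (anchor_id, other_id, dist) rows."""
    rows = []
    for a_id, a_ra, a_dec in anchor:
        # All candidates within the radius, closest first.
        hits = sorted(
            (sep_arcsec(a_ra, a_dec, ra, dec), b_id) for b_id, ra, dec in other)
        hits = [(d, b_id) for d, b_id in hits if d <= radius_arcsec]
        if how == "all":
            rows.extend((a_id, b_id, d) for d, b_id in hits)
        elif hits:
            d, b_id = hits[0]
            rows.append((a_id, b_id, d))        # nearest match only
        elif how == "left":
            rows.append((a_id, None, None))     # keep the unmatched anchor
    return rows


anchor = [("a1", 10.0, 0.0), ("a2", 20.0, 0.0)]
other = [("b1", 10.0, 0.0002), ("b2", 10.0, -0.0001), ("b3", 30.0, 0.0)]

nearest = crossmatch(anchor, other, 1.0)
every = crossmatch(anchor, other, 1.0, how="all")
left = crossmatch(anchor, other, 1.0, how="left")
```

Here `a1` has two neighbours within 1 arcsec and `a2` has none, so the three modes return one, two, and two rows respectively.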

Semantics, in short:

All XMATCHes in a query use the anchor (first FROM) table's coordinates, even after a mode => 'all' expansion.
A right-table radius must be ≤ that catalog's neighbor_margin_arcsec. Otherwise we'd silently miss boundary pairs; the analyzer rejects the query.
ORDER BY ... LIMIT K pushes the top-K to each partition first; the reducer re-sorts the union and applies the global LIMIT/OFFSET.
Aggregates / GROUP BY / DISTINCT / HAVING run in a phase-2 reducer over the per-partition Parquet output.
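
The two-phase aggregation in the last point works because COUNT and AVG are decomposable: each partition can emit a small partial state that the reducer merges. A minimal sketch of the idea follows. It is not ACID's reducer; `partial_aggregate` and `reduce_partials` are hypothetical, and the partial state here is a simple (count, sum) pair held in memory rather than Parquet on disk.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Phase 1, per partition: (key, value) rows -> key -> (count, sum)."""
    acc = defaultdict(lambda: [0, 0.0])
    for key, value in rows:
        acc[key][0] += 1
        acc[key][1] += value
    return {key: tuple(state) for key, state in acc.items()}


def reduce_partials(partials):
    """Phase 2: merge partial states, then finalize COUNT(*) and AVG."""
    merged = defaultdict(lambda: [0, 0.0])
    for part in partials:
        for key, (n, total) in part.items():
            merged[key][0] += n
            merged[key][1] += total
    return {key: {"n": n, "avg_d": total / n} for key, (n, total) in merged.items()}


# Two partitions that both contain rows for id "x".
partition_1 = [("x", 0.25), ("x", 0.5), ("y", 1.0)]
partition_2 = [("x", 0.75)]
result = reduce_partials(
    [partial_aggregate(partition_1), partial_aggregate(partition_2)])
```

A median has no such fixed-size partial state, which is why non-decomposable aggregates need the full data in one place.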

Python API surface

# Module-level API (singleton-by-default). The first call lazy-inits;
# acid.init(...) pins config; acid.shutdown() tears down.
acid.init(source=None, *, workers=None, threads=None,
          inmem_row_limit=50_000_000) -> None    # progress: set via acid.configure(...)
acid.shutdown() / acid.is_initialized() / acid.configure(progress=...)
acid.open(name_or_path, *, alias=None, columns=None) -> Catalog
acid.register_catalog(name, **spec_kwargs) / acid.register_file(name, src, *, ra=, dec=)
acid.list_catalogs() / acid.register_moc(...)
acid.in_cone(center, *, radius) / acid.status()

# SQL escape hatch — the acid.sql submodule
acid.sql.query(query, *, output=None)                -> Result
acid.sql.validate(query) / acid.sql.explain(query)

# Explicit, isolated Connection (escape hatch — two connections / two configs)
db = acid.Connection(source, *, workers=None, threads=None,
                     inmem_row_limit=50_000_000, progress="auto")
# ...then db.open(...) / db.sql(...) / etc. — the same methods, on `db`.
db.in_cone(center, *, radius)                       # ctx manager
db.status() / db.validate(q) / db.explain(q)
db.close()    # or use as a context manager

# Catalog (composable, lazy) — composition verbs return Catalog;
# materialization verbs execute and return Result (or, for to_*, the
# converted type).
cat.where(predicate)        -> Catalog
cat.select(*cols)           -> Catalog
cat.limit(n)                -> Catalog
cat.in_region(moc_or_cat)   -> Catalog
cat.crossmatch(other, *, radius, how="nearest"|"all"|"left",
               dist_col=None, suffix=None)                   -> Catalog
cat.join(other, *, on, how="inner"|"left")                   -> Catalog
# Fluent aggregation — decomposable-only (acid.agg constructors).
cat.group_by(*keys)                        -> Catalog
cat.aggregate(**named_aggs)                -> Catalog
# After .aggregate(), verbs compose over the aggregate output in written
# order: a post-aggregate .where(...) is the old HAVING (and .limit(5).where(...)
# filters the top-5 — fluent composes by position, no separate .having()).
cat.sort(*keys, descending=False, nulls_last=False) -> Catalog
# Reduction shortcuts — one aggregate, no agg.* ceremony. Global (no
# group_by) materializes and returns a bare Python scalar; grouped returns a
# lazy Catalog (column `count` / `mean_<col>` / …, so a following .where() is
# HAVING). For mixed stats / named outputs use .aggregate(...).
cat.count(col=None)                         -> int | Catalog
cat.sum/mean/min/max/std/var(col)           -> scalar | Catalog
# Decomposable aggregate constructors (acid.agg):
#   agg.count(col=None), agg.sum, agg.mean, agg.min, agg.max,
#   agg.std, agg.var, agg.all, agg.any, agg.list.
# (No agg.median / agg.mode — non-decomposable; rejected with
# ValidationError. Drop into Polars after .to_polars() if you need them.)

cat.columns / cat.alias / cat.describe() / cat.explain()
cat.head(n=10)              -> Result
cat.execute()               -> Result
# These convert and return the target type directly (no Result detour):
cat.to_pandas() / cat.to_astropy() / cat.to_polars() / cat.to_arrow()
cat.save(path, *, name=None, overwrite=False) -> Catalog

# Result — comes back from Catalog.head / .execute and from db.sql.
# A thin wrapper around an in-memory pa.Table or a partitioned dir;
# same converter / terminal names as Catalog.
r.num_rows, r.column_names, r.schema
r.column(name)         -> pa.ChunkedArray
r.show(n=20)           # pretty-print first n rows (CLI renderer)
print(r)               # renders the result as a Polars DataFrame (__str__)
r.to_arrow()           -> pa.Table
r.to_polars()          -> polars.DataFrame
r.to_astropy()         -> astropy.table.Table
r.to_pandas()          -> pandas.DataFrame
r.to_pylist()          -> list[dict]
r.batches(batch_size=None) -> Iterator[pa.RecordBatch]
r.head(n=10)           -> Result
r.export(path, format=None) -> Path  # one flat file; format from extension
                                     # (a Result has left the system — no .save();
                                     #  write HATS from a Catalog or Connection.sql(output=))
len(r), for batch in r: ...

# Errors (all inherit from acid.AcidError)
acid.RegistryError           # catalog config (missing path, mixed Norder, ...)
acid.ParseError              # SQL parse failures
acid.ValidationError         # unsupported XMATCH constructs
acid.ExecutionError          # per-partition execution failures
acid.ConnectionClosedError   # method called on a closed Connection

Layout assumptions

Catalogs follow the HATS layout: <root>/dataset/Norder=N/Dir=D/Npix=P.parquet (or Npix=P/*.parquet when hats_npix_suffix='/').
Margin caches live as sibling catalogs (HATS canonical), at <root>/margin_cache/..., or any sibling dir matching <name>_margin*. collection.properties is consumed if present.
Adaptive (per-pixel) Norder is supported: a catalog's partition_info.csv may list pixels at any orders, and XMATCH/ordinary joins across mixed-Norder catalogs are run via a refinement-tree enumeration that emits one work unit per coarsest cursor pixel where every joined catalog has ≤ 1 partition. Output is itself a valid HATS catalog whose partition_info.csv reflects the refinement.
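
As a concrete reading of the layout in the first point, the helper below builds the path for one partition. `partition_path` is a hypothetical function, and the rule that Dir is the pixel number rounded down to a multiple of 10000 reflects my understanding of the HATS convention rather than anything stated above, so treat it as an assumption.

```python
def partition_path(root: str, norder: int, npix: int, npix_suffix: str = ".parquet") -> str:
    """Build the file path (or glob, for directory-style pixels) of one partition."""
    directory = (npix // 10000) * 10000          # assumed HATS grouping rule
    base = f"{root.rstrip('/')}/dataset/Norder={norder}/Dir={directory}/Npix={npix}"
    if npix_suffix == "/":
        return base + "/*.parquet"               # pixel stored as a directory
    return base + npix_suffix


single = partition_path("/data/two_mass", 6, 12345)
multi = partition_path("/data/two_mass", 6, 12345, npix_suffix="/")
```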

What's the speed story?

Each partition is independent → embarrassingly parallel across HEALPix pixels.
Top-K queries push the LIMIT to each partition. Aggregates write partial data to disk and reduce centrally.
Column pruning: the anchor and right relations are Polars LazyFrames over scan_parquet(), so the final projection only pulls referenced columns from disk. Wide catalogs (150+ columns) don't slow down narrow SELECTs.
Auto-spill: when output is unset and the running result exceeds inmem_row_limit (default 50M rows), acid spills to a tempdir rather than OOM-ing the parent.
Allocator tuning: acid ships a jemalloc default that avoids page-purge contention at high worker counts (~2× faster wall-clock time for ~20% more RSS). It is controlled by a single overridable env var; see MEMORY-TUNING.md if you're memory-constrained or scaling workers on a large machine.
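
The top-K pushdown in the second point can be sketched as follows. This illustrates the general technique, not ACID's planner: `partition_top_k` and `global_top_k` are hypothetical, and keeping k + offset rows per partition when an OFFSET is present is my inference about what correctness requires.

```python
import heapq


def partition_top_k(rows, k, key):
    """Per partition: keep only the k smallest rows under the sort key."""
    return heapq.nsmallest(k, rows, key=key)


def global_top_k(partitions, k, key, offset=0):
    """Reducer: re-sort the union of per-partition candidates, apply LIMIT/OFFSET."""
    candidates = []
    for rows in partitions:
        # Any partition might hold all of the first k + offset rows.
        candidates.extend(partition_top_k(rows, k + offset, key))
    candidates.sort(key=key)
    return candidates[offset:offset + k]


partitions = [
    [("a", 0.5), ("b", 0.1), ("c", 0.9)],
    [("d", 0.2), ("e", 0.8)],
    [("f", 0.3), ("g", 0.7), ("h", 0.05)],
]
by_distance = lambda row: row[1]
top3 = global_top_k(partitions, 3, by_distance)
page2 = global_top_k(partitions, 3, by_distance, offset=1)
```

Each partition ships at most k + offset rows to the reducer, so the central sort stays small no matter how large the partitions are.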

See bench/match_all.py and bench/session_vs_oneshot.py for microbenchmarks.

Install

With uv (recommended for development)

uv sync --dev          # creates .venv, installs all deps + test + hats
uv run pytest          # run tests

With pip

pip install -e .
# extras:  pip install -e .[hats,dev]

Requires Python 3.10+. Runtime dependencies (installed automatically): Polars ≥ 1.41, SQLGlot ≥ 27 (< 31), PyArrow ≥ 14, NumPy ≥ 1.24, SciPy ≥ 1.10, cdshealpix ≥ 0.7, fsspec ≥ 2023.1, Astropy ≥ 5, PyYAML ≥ 6, rich ≥ 13. mocpy is not a runtime dependency — ACID ships a dependency-light MOC implementation.

Status

v0 (correctness): XMATCH inner/left, mode 'nearest'/'all', chains, ordinary joins, distance via XMATCH(..., dist_col => '<name>').
v1 (scale): views + narrow side-tables, vectorized matcher, worker initializer, auto-spill, top-K pushdown, manifest.
v1.1 (HATS spec): writes valid HATS catalogs, reads canonical property keys, supports hats_npix_suffix='/', auto-discovers margin siblings via collection.properties.
v2 (EDA): persistent Connection, per-worker Polars engine, Result wrapper, Catalog.save() for materialization.
v3 (adaptive Norder): per-catalog PartitionIndex, refinement-tree tuple enumeration, integer _healpix_29 range filtering for per-pixel row pruning, LEFT-XMATCH/JOIN over partitions without coverage.
v4 (Polars-native): single native-Polars engine; DuckDB, the SQL rewriter/reducer, the engine abstraction, and the QueryPlan IR removed (see CHANGELOG.md / ARCHITECTURE.md).
v4 (MOC footprint filtering): IN_MOC(<alias>, '<name>') in WHERE restricts rows to a named sky region (Multi-Order Coverage map). Supports NOT IN_MOC, multiple predicates (AND-combined via mocpy set ops), and catalog auto-resolution from point_map.fits. IN_MOC is a footprint restriction only — it must sit in conjunctive WHERE position (top-level AND-chain, optionally negated); use in SELECT/ORDER BY/CASE/JOIN ON or inside a disjunction is rejected (see Known limitations). Three-level optimization: catalog-footprint scoping, cursor-pixel intersection, and partition-level pruning — all via the existing _healpix_29 row-group pushdown fast path.
v5 (catalog ops): acid hats build-margin builds HATS margin caches locally (validated against hats-import). acid download generates point_map.fits, auto-includes HATS RA/Dec/healpix columns. acid query accepts --db <directory> for zero-config usage, fails fast on errors, shows tqdm progress, shuffles work for load balancing. Bare column resolution via schema introspection. LocalFetcher for local I/O.
v6 (fluent Catalog API): acid.init() builds a process-wide default (or acid.Connection() an explicit) Connection; db.open(name) returns a lazy Catalog; verbs (where, select, crossmatch, join, in_region, save) compose without writing SQL. db.in_cone(...) scopes a cone to every query in a with block. db.sql(...) remains the escape hatch for decomposable aggregates, HAVING, and top-K (ORDER BY ... LIMIT). Window functions, DISTINCT, COUNT(DISTINCT), bare GROUP BY, and unbounded ORDER BY are rejected with a ValidationError.

Tests: ~545 passing (~60s parallel via pytest-xdist) on the native Polars engine. Fixtures cached across runs.
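
The integer _healpix_29 range filtering mentioned under v3 relies on a property of the nested HEALPix scheme: every order-29 index inside a coarser pixel falls in one contiguous integer range. A minimal sketch, assuming _healpix_29 holds the nested order-29 pixel index; `healpix29_range` and `in_pixel` are hypothetical helpers.

```python
def healpix29_range(norder: int, npix: int):
    """Half-open [lo, hi) range of order-29 nested indices inside one pixel."""
    shift = 2 * (29 - norder)        # each finer order adds two bits
    return npix << shift, (npix + 1) << shift


def in_pixel(healpix_29: int, norder: int, npix: int) -> bool:
    """Row-level test: does this order-29 index fall inside the given pixel?"""
    lo, hi = healpix29_range(norder, npix)
    return lo <= healpix_29 < hi


parent = healpix29_range(0, 1)       # a base pixel
child = healpix29_range(1, 5)        # 5 >> 2 == 1, so it lies inside the parent
```

Because the test is two integer comparisons, it can be pushed down to Parquet row-group statistics instead of being evaluated per row.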

Known limitations

XMATCH must be the entire ON predicate. Compound predicates like XMATCH(...) AND b.mag < 20 are rejected.
No CTEs / subqueries in the anchor position.
RIGHT / FULL / CROSS JOIN XMATCH not supported.
IN_MOC must be in conjunctive WHERE position (top-level AND-chain, optionally negated). Disjunctive use (IN_MOC(...) OR ...) and IN_MOC in JOIN ON are rejected.
No nested db.in_cone(...) blocks. The true intersection of two non-concentric cones is not a cone; we refuse rather than silently approximate.

Release history

0.7.0a2 (pre-release): Jun 28, 2026
0.7.0a1 (pre-release): Jun 25, 2026
0.6.1 (this version): Jun 22, 2026
0.6.0: Jun 22, 2026
0.5.0: Jun 17, 2026
0.5.0a1 (pre-release): Jun 15, 2026
0.4.0a4 (pre-release): Jun 14, 2026
0.4.0a3 (pre-release): Jun 14, 2026
0.4.0a2 (pre-release): Jun 14, 2026
0.4.0a1 (pre-release): Jun 13, 2026
0.3.0a1 (pre-release): Jun 10, 2026
0.2.0a3 (pre-release): Jun 3, 2026
0.2.0a2 (pre-release): Jun 1, 2026
0.2.0a1 (pre-release): May 31, 2026
0.1.0a3 (pre-release): May 26, 2026
0.1.0a2 (pre-release): May 23, 2026
0.1.0a1 (pre-release): May 23, 2026

Download files

Source distribution: pyacid-0.6.1.tar.gz (2.9 MB, tags: Source). Uploaded Jun 22, 2026 via uv/0.11.23 using Trusted Publishing.

Algorithm	Hash digest
SHA256	`7bc9bcf45cf9d1e5c2659f425f95bb6c85125e38e5753dfa4499eafba149544a`
MD5	`7052913c5fedf1dbaaa044a5b074a79f`
BLAKE2b-256	`29c940376423720eec612ba63ed26982ae7cb61f9bbfd3b452eb01fcc005c67e`

Built distribution: pyacid-0.6.1-py3-none-any.whl (558.4 kB, tags: Python 3). Uploaded Jun 22, 2026 via uv/0.11.23 using Trusted Publishing.

Algorithm	Hash digest
SHA256	`30e2d9a8bfd28e146c911d9e0377ced710d38765a27f6cbc9a4bf4c148f424d4`
MD5	`bf9663c0b2c8053eca61f5a66b8bbde5`
BLAKE2b-256	`1f8a65227eff73af3486c03574ba57272115882d68ca275d7ac301d778a8423c`
