

votpipe

Tests · Python 3.10+ · License: MIT

Streaming VOTable processing for Python

votpipe is a lightweight streaming parser for large VOTable files. It lets you process rows incrementally, apply transformations or filters, and write the results to formats such as Parquet, CSV, or ECSV.

The focus is simplicity and pipeline integration:

  • stream rows without loading the entire table into memory
  • transform rows as they pass through the pipeline
  • write to modern formats such as Parquet
  • install and run easily with pip or pipx

Unlike tools that require a Java runtime or complex setup, votpipe is a small Python-native utility designed to integrate naturally into data pipelines.


Features

  • Streaming VOTable parsing (TABLEDATA, BINARY, BINARY2)
  • Batch callback API and row-iterator API
  • CLI: votpipe convert with column selection (--select) and row filtering (--where)
  • Compiled filter expressions (e.g. parallax > 10 and phot_g_mean_mag < 15)
  • Works with arbitrarily large tables
  • Parquet, CSV, and ECSV output (CLI and Python); compression via .gz or .xz extension

Supported VOTable serializations:

  • TABLEDATA
  • BINARY
  • BINARY2

External serializations such as FITS and PARQUET, which reference tables stored in separate files, are intentionally out of scope for the core parser.


Comparison with STILTS

STILTS is a powerful and mature toolkit for working with VOTable and other astronomy table formats. If you simply need to convert tables between formats or perform standard VO table operations, STILTS is often the best and most feature-complete solution.

votpipe is designed for a different niche: Python-native streaming pipelines. It can be installed and run with a single pip install votpipe[cli,parquet] or pipx install votpipe[cli,parquet], without requiring a Java runtime or managing multiple JAR dependencies. It is particularly useful when you want to integrate VOTable processing directly into a Python workflow, apply custom filtering or transformations in Python, or stream large VOTables directly into modern analytics formats such as Parquet.

In short:

  • Use STILTS if you need the most complete astronomy table toolkit and are comfortable working with the Java-based ecosystem.
  • Use votpipe if you want a lightweight Python tool that streams VOTables into Python pipelines, supports simple CLI filtering (--select, --where), and writes directly to Parquet, CSV, or ECSV with minimal setup.

Installation

Install for CLI use (includes convert command and progress bar):

pip install votpipe[cli,parquet]

Or with pipx:

pipx install votpipe[cli,parquet]

Minimal install (Python API only, no CLI deps):

pip install votpipe

Command Line Usage

The CLI provides a single command, convert, which streams a VOTable to Parquet, CSV, or ECSV with optional column selection and row filtering. Install CLI dependencies with pip install votpipe[cli] or pip install votpipe[parquet,cli].

Output format is detected from the output file extension (e.g. .parquet, .csv, .ecsv, .csv.gz, .ecsv.xz), or set explicitly with --format auto|csv|ecsv|parquet. The default output path replaces .vot/.vot.gz with the extension appropriate for the chosen format.

Basic conversion:

votpipe convert input.vot.gz
votpipe convert input.vot.gz output.parquet
votpipe convert input.vot.gz output.csv
votpipe convert input.vot.gz output.ecsv.gz
votpipe convert input.vot.gz --format ecsv

Select specific columns:

votpipe convert input.vot.gz output.parquet \
  --select source_id,ra,dec,parallax

Filter rows with a --where expression:

votpipe convert input.vot.gz output.parquet \
  --where "parallax > 10 and phot_g_mean_mag < 15"

Combined select and filter:

votpipe convert input.vot.gz output.parquet \
  --select source_id,ra,dec,parallax \
  --where "parallax > 10 and phot_g_mean_mag < 15"

Other options:

  • --progress / --no-progress — show a progress bar (default: on)
  • --format — output format: auto (default, from extension), csv, ecsv, or parquet
  • --compression — Parquet only: zstd (default), snappy, or none. CSV/ECSV use .gz or .xz in the filename.
  • --batch-size — max rows per batch (default: 8192)

The filter expression (--where) supports column names; numeric, string, bool, and None constants; and:

  • Boolean: and, or, not
  • Comparisons: ==, !=, <, <=, >, >=, is None, is not None
  • Chained comparisons: e.g. 0 < parallax <= 10

Comparisons involving a None value evaluate to false, so filters drop rows with nulls (e.g. parallax > 10 drops rows where parallax is None).
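As a rough illustration of this null-handling rule, the pure-Python sketch below mimics the behaviour with a hypothetical helper (this is not votpipe's actual filter implementation):

```python
# Hypothetical sketch of the --where null-handling rule: any comparison
# involving None evaluates to false, so the filter drops rows with missing
# values. Illustration only, not votpipe's internal implementation.

def safe_gt(value, threshold):
    """Comparison that treats a None value as failing the test."""
    return value is not None and value > threshold

rows = [
    {"parallax": 12.5},
    {"parallax": None},   # dropped: comparison with None is false
    {"parallax": 3.0},    # dropped: fails parallax > 10
]
kept = [r for r in rows if safe_gt(r["parallax"], 10)]
print(kept)  # [{'parallax': 12.5}]
```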


Python Usage

You can consume parsed data in two ways; the shape of the data differs:

  • Batch callback: on_batch(fields, rows) is called repeatedly; rows is a list of tuples, each holding values in the same order as fields (no column names).
  • Row iterator: iterate over the stream; each item is one dict keyed by column name (e.g. row["ra"]).

Use the batch API when you want maximum throughput and are feeding a batch-oriented sink (Parquet, CSV adapter, or CompiledBatchQuery). Use the iterator when you want to loop over rows by name or compose with Python generators.
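The difference between the two shapes can be shown with synthetic data (no votpipe call here; the fields dicts assume a "name" key, as described in the batch callback section):

```python
# Illustration of the two row shapes with synthetic data (not calling votpipe).
# Batch API: field metadata plus tuples in field order; iterator: dicts.

fields = [{"name": "source_id"}, {"name": "ra"}, {"name": "dec"}]
batch_rows = [(1001, 266.4, -29.0), (1002, 83.8, -5.4)]  # tuples, field order

# Converting batch tuples into iterator-style dicts via the field names:
names = [f["name"] for f in fields]
dict_rows = [dict(zip(names, row)) for row in batch_rows]
print(dict_rows[0]["ra"])  # 266.4
```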

Batch callback interface

parse_votable(source, on_batch, batch_size=8192) parses the VOTable and calls on_batch(fields, rows) for each batch. Here fields is a list of field metadata dicts (name, datatype, etc.) and rows is a list of tuples: each tuple is one row, with values in the same order as fields. The row data carries no column names; use the field list to interpret indices. There is no threading, so overhead is lowest. Use it when pushing directly into a sink such as ParquetAdapter or CompiledBatchQuery.

from votpipe import parse_votable
from votpipe.parquet import ParquetAdapter

with ParquetAdapter("output.parquet") as parquet:
    parse_votable("table.vot.gz", parquet.on_batch)

With column selection and filtering (same logic as the CLI):

from votpipe import parse_votable
from votpipe.parquet import ParquetAdapter
from votpipe.query import CompiledBatchQuery

with ParquetAdapter("output.parquet") as parquet:
    query = CompiledBatchQuery(
        parquet.on_batch,
        select="source_id,ra,dec,parallax",
        where="parallax > 10 and phot_g_mean_mag < 15",
    )
    with query:
        parse_votable("table.vot.gz", query.on_batch)

Iterator interface

VOTableStreamingParser(source) is iterable and yields one row per item, each row as a dict with column names as keys (e.g. row["ra"]). Unlike the batch API, you get named access per row. The implementation uses a background thread and a queue to bridge SAX’s push model to Python’s pull model. Use it when you want to filter, transform, or chain row streams and prefer dict-style access.

from votpipe import VOTableStreamingParser

for row in VOTableStreamingParser("table.vot.gz"):
    print(row["source_id"], row["ra"], row["dec"])  # each row is a dict

Which interface should I use?

  • Batch callback — Tuples in field order; zero threading overhead; best for one-shot conversion (e.g. VOTable → Parquet with optional --select/--where). Use parse_votable with ParquetAdapter and optionally CompiledBatchQuery.
  • Iterator — Dicts keyed by column name; composable with generator transforms and for loops. Slightly higher cost due to the thread and queue. Use VOTableStreamingParser when you need row-by-row logic in Python or named column access.

Streaming Pipelines

For batch-oriented conversion with filtering, use the CLI or the Python batch API with CompiledBatchQuery (see Batch callback interface above); the sink receives batches of tuples. For row-by-row logic with dict access, use VOTableStreamingParser and compose with generator transforms.

Example: filter rows in Python and pass batches to Parquet.

from votpipe import parse_votable
from votpipe.parquet import ParquetAdapter
from votpipe.query import CompiledBatchQuery

with ParquetAdapter("output.parquet") as parquet:
    query = CompiledBatchQuery(parquet.on_batch, where="parallax > 10")
    with query:
        parse_votable("input.vot.gz", query.on_batch)

Or iterate and transform row dicts with VOTableStreamingParser, then feed a custom sink (e.g. build a list or write CSV) in your own loop.
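A custom sink only needs to expose the on_batch(fields, rows) signature described above. The class below is a hypothetical example (not part of votpipe) that writes batches to CSV with the stdlib csv module:

```python
import csv
import io

# Hypothetical custom sink exposing the on_batch(fields, rows) signature used
# by votpipe's batch API; writes each batch as CSV rows. Illustration only,
# not votpipe's CsvAdapter.

class CsvSink:
    def __init__(self, stream):
        self.writer = csv.writer(stream)
        self.header_written = False

    def on_batch(self, fields, rows):
        if not self.header_written:
            self.writer.writerow(f["name"] for f in fields)
            self.header_written = True
        self.writer.writerows(rows)

buf = io.StringIO()
sink = CsvSink(buf)
# In real use you would call: parse_votable("input.vot.gz", sink.on_batch)
sink.on_batch([{"name": "ra"}, {"name": "dec"}], [(266.4, -29.0)])
print(buf.getvalue())
```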


Row Transformations

When using the iterator (VOTableStreamingParser), you get a stream of row dicts. Transforms on that stream can:

  • drop rows
  • mutate rows
  • emit multiple rows

Example (each row is a dict, so you can use row["column_name"]):

from votpipe import VOTableStreamingParser

def add_distance(rows):
    for row in rows:
        if row["parallax"] is None:
            continue
        row["distance_pc"] = 1000.0 / row["parallax"]
        yield row

# Iterator yields dicts; batch API would give you (fields, list of tuples).
for row in add_distance(VOTableStreamingParser("input.vot.gz")):
    process(row)
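The example above drops and mutates rows; a transform can also emit several output rows per input row. The sketch below uses a plain list of dicts in place of the parser, with hypothetical per-band magnitude columns:

```python
# Sketch of a transform that emits multiple rows per input row, using a plain
# list of dicts in place of VOTableStreamingParser. Column names are
# illustrative.

def explode_bands(rows):
    """Emit one output row per photometric band present in the input row."""
    for row in rows:
        for band in ("g", "bp", "rp"):
            mag = row.get(f"phot_{band}_mean_mag")
            if mag is not None:
                yield {"source_id": row["source_id"], "band": band, "mag": mag}

sample = [{"source_id": 1, "phot_g_mean_mag": 15.2,
           "phot_bp_mean_mag": None, "phot_rp_mean_mag": 14.7}]
out = list(explode_bands(sample))
print(out)  # two rows: one for g, one for rp
```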

Astropy Integration

You can consume the row stream from VOTableStreamingParser and build an astropy.table.Table (e.g. by collecting rows into a list and calling Table(rows)). Because that builds the full table in memory, there is little advantage over using Astropy’s own VOTable reader unless you want to filter or aggregate in a streaming way before materializing the table. For large files, the batch callback + Parquet path is usually preferable; read the Parquet output with Astropy or pandas as needed.


Output Formats

  • Parquet — CLI (votpipe convert with .parquet or --format parquet) and votpipe.parquet.ParquetAdapter. Install with pip install votpipe[parquet].
  • CSV — CLI (e.g. output.csv, output.csv.gz, output.csv.xz or --format csv) and votpipe.csv.CsvAdapter.
  • ECSV — CLI (e.g. output.ecsv, output.ecsv.gz, output.ecsv.xz or --format ecsv) and votpipe.csv.EcsvAdapter. Compression for CSV/ECSV is inferred from the output filename (.gz or .xz; stdlib gzip and lzma).

Design Philosophy

votpipe follows a simple streaming model:

VOTable → parser → row stream → transform → serializer

The parser produces rows lazily. Transforms operate on row streams. Serializers consume the stream and write output.

This design allows large datasets to be processed with predictable memory usage.
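The model can be sketched with plain generators (no votpipe involved): each stage pulls one row at a time from the previous one, so memory stays bounded regardless of table size.

```python
# Minimal model of the streaming design with plain generators: a stand-in
# parser yields rows lazily, a transform filters them, and a stand-in
# serializer consumes the stream one row at a time.

def fake_parser():
    for i in range(5):
        yield {"id": i, "flux": i * 10.0}   # rows produced lazily

def keep_bright(rows, threshold):
    for row in rows:
        if row["flux"] > threshold:
            yield row

def serialize(rows):
    return [f"{r['id']},{r['flux']}" for r in rows]  # stand-in sink

lines = serialize(keep_bright(fake_parser(), 15.0))
print(lines)  # ['2,20.0', '3,30.0', '4,40.0']
```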


Scope

votpipe focuses on streaming VOTable payloads embedded directly in XML.

Supported:

  • TABLEDATA
  • BINARY
  • BINARY2

Not supported:

  • FITS external serialization
  • PARQUET external serialization

These serializations reference external files and are better handled by specialised readers.


Development Status

votpipe is an early-stage project. Implemented:

  • Streaming parser for TABLEDATA, BINARY, BINARY2 (including .vot.gz)
  • CLI: votpipe convert with --select, --where, --format, --compression, --batch-size, optional progress bar; Parquet, CSV, and ECSV output (format/compression from extension or --format)
  • Compiled filter/select: CompiledBatchQuery with a small expression language for --where
  • Batch callback API: parse_votable + ParquetAdapter (and optionally CompiledBatchQuery)
  • Row iterator: VOTableStreamingParser yielding row dicts
  • Parquet and CSV/ECSV adapters for programmatic use

Planned improvements include:

  • full datatype coverage
  • round-trip tests against Astropy
  • improved metadata preservation

License

MIT License.
