Lightweight streaming parser for VOTable files; convert to Parquet, CSV, or ECSV with optional filtering.
votpipe
Streaming VOTable processing for Python
votpipe is a lightweight streaming parser for large VOTable files.
It allows you to process rows incrementally, apply transformations or filters, and write the results to formats such as Parquet, CSV, or Astropy tables.
The focus is simplicity and pipeline integration:
- stream rows without loading the entire table into memory
- transform rows as they pass through the pipeline
- write to modern formats such as Parquet
- install and run easily with pip or pipx
Unlike tools that require a Java runtime or complex setup, votpipe is a small Python-native utility designed to integrate naturally into data pipelines.
Features
- Streaming VOTable parsing (TABLEDATA, BINARY, BINARY2)
- Batch callback API and row-iterator API
- CLI: votpipe convert with column selection (--select) and row filtering (--where)
- Compiled filter expressions (e.g. parallax > 10 and phot_g_mean_mag < 15)
- Works with arbitrarily large tables
- Parquet, CSV, and ECSV output (CLI and Python); compression via .gz or .xz extension
Supported VOTable serializations:
TABLEDATA, BINARY, BINARY2
External serializations such as FITS and PARQUET references are intentionally out of scope for the core parser.
Comparison with STILTS
STILTS is a powerful and mature toolkit for working with VOTable and other astronomy table formats. If you simply need to convert tables between formats or perform standard VO table operations, STILTS is often the best and most feature-complete solution.
votpipe is designed for a different niche: Python-native streaming pipelines. It can be installed and run with a single pip install votpipe[cli,parquet] or pipx install votpipe[cli,parquet], without requiring a Java runtime or managing multiple JAR dependencies. It is particularly useful when you want to integrate VOTable processing directly into a Python workflow, apply custom filtering or transformations in Python, or stream large VOTables directly into modern analytics formats such as Parquet.
In short:
- Use STILTS if you need the most complete astronomy table toolkit and are comfortable working with the Java-based ecosystem.
- Use votpipe if you want a lightweight Python tool that streams VOTables into Python pipelines, supports simple CLI filtering (--select, --where), and writes directly to Parquet, CSV, or ECSV with minimal setup.
Installation
Install for CLI use (includes convert command and progress bar):
pip install votpipe[cli,parquet]
Or with pipx:
pipx install votpipe[cli,parquet]
Minimal install (Python API only, no CLI deps):
pip install votpipe
Command Line Usage
The CLI provides a single command, convert, which streams a VOTable to Parquet, CSV, or ECSV with optional column selection and row filtering. Install CLI dependencies with pip install votpipe[cli] or pip install votpipe[parquet,cli].
Output format is detected from the output file extension (e.g. .parquet, .csv, .ecsv, .csv.gz, .ecsv.xz), or set explicitly with --format auto|csv|ecsv|parquet. Default output path replaces .vot/.vot.gz with the appropriate extension for the chosen format.
Basic conversion:
votpipe convert input.vot.gz
votpipe convert input.vot.gz output.parquet
votpipe convert input.vot.gz output.csv
votpipe convert input.vot.gz output.ecsv.gz
votpipe convert input.vot.gz --format ecsv
Select specific columns:
votpipe convert input.vot.gz output.parquet \
--select source_id,ra,dec,parallax
Filter rows with a --where expression:
votpipe convert input.vot.gz output.parquet \
--where "parallax > 10 and phot_g_mean_mag < 15"
Combined select and filter:
votpipe convert input.vot.gz output.parquet \
--select source_id,ra,dec,parallax \
--where "parallax > 10 and phot_g_mean_mag < 15"
Other options:
- --progress / --no-progress — show a progress bar (default: on)
- --format — output format: auto (default, from extension), csv, ecsv, or parquet
- --compression — Parquet only: zstd (default), snappy, or none. CSV/ECSV use .gz or .xz in the filename.
- --batch-size — max rows per batch (default: 8192)
Filter expression (--where) supports column names, numeric/string/bool/None constants, and:
- Boolean: and, or, not
- Comparisons: ==, !=, <, <=, >, >=, is None, is not None
- Chained comparisons: e.g. 0 < parallax <= 10
When a column value is None, any comparison on it evaluates to false, so an expression like parallax > 10 drops rows with a null parallax.
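For example, a chained comparison can be combined with a null check in a single expression (the columns here are just illustrative):
votpipe convert input.vot.gz output.parquet \
    --where "0 < parallax <= 10 and phot_g_mean_mag is not None"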
Python Usage
You can consume parsed data in two ways; the shape of the data differs:
| API | What you get | Row shape |
|---|---|---|
| Batch callback | on_batch(fields, rows) called repeatedly | rows is a list of tuples; each tuple has values in the same order as fields (no column names). |
| Row iterator | Iterate over the stream | Each item is one dict keyed by column name (e.g. row["ra"]). |
Use the batch API when you want maximum throughput and are feeding a batch-oriented sink (Parquet, CSV adapter, or CompiledBatchQuery). Use the iterator when you want to loop over rows by name or compose with Python generators.
Batch callback interface
parse_votable(source, on_batch, batch_size=8192) parses the VOTable and calls on_batch(fields, rows) for each batch. Here fields is a list of field metadata dicts (name, datatype, etc.) and rows is a list of tuples: each tuple is one row, with values in the same order as fields. There are no column names in the row data—you use the field list to interpret indices. No threading; lowest overhead. Use it when pushing directly into a sink such as ParquetAdapter or CompiledBatchQuery.
from votpipe import parse_votable
from votpipe.parquet import ParquetAdapter
with ParquetAdapter("output.parquet") as parquet:
    parse_votable("table.vot.gz", parquet.on_batch)
With column selection and filtering (same logic as the CLI):
from votpipe import parse_votable
from votpipe.parquet import ParquetAdapter
from votpipe.query import CompiledBatchQuery
with ParquetAdapter("output.parquet") as parquet:
    query = CompiledBatchQuery(
        parquet.on_batch,
        select="source_id,ra,dec,parallax",
        where="parallax > 10 and phot_g_mean_mag < 15",
    )
    with query:
        parse_votable("table.vot.gz", query.on_batch)
Iterator interface
VOTableStreamingParser(source) is iterable and yields one row per item, each row as a dict with column names as keys (e.g. row["ra"]). Unlike the batch API, you get named access per row. The implementation uses a background thread and a queue to bridge SAX’s push model to Python’s pull model. Use it when you want to filter, transform, or chain row streams and prefer dict-style access.
from votpipe import VOTableStreamingParser
for row in VOTableStreamingParser("table.vot.gz"):
    print(row["source_id"], row["ra"], row["dec"])  # each row is a dict
Which interface should I use?
- Batch callback — Tuples in field order; zero threading overhead; best for one-shot conversion (e.g. VOTable → Parquet with optional --select/--where). Use parse_votable with ParquetAdapter and optionally CompiledBatchQuery.
- Iterator — Dicts keyed by column name; composable with generator transforms and for loops. Slightly higher cost due to the thread and queue. Use VOTableStreamingParser when you need row-by-row logic in Python or named column access.
Streaming Pipelines
For batch-oriented conversion with filtering, use the CLI or the Python batch API with CompiledBatchQuery (see Batch callback interface above); the sink receives batches of tuples. For row-by-row logic with dict access, use VOTableStreamingParser and compose with generator transforms.
Example: filter rows in Python and pass batches to Parquet.
from votpipe import parse_votable
from votpipe.parquet import ParquetAdapter
from votpipe.query import CompiledBatchQuery
with ParquetAdapter("output.parquet") as parquet:
    query = CompiledBatchQuery(parquet.on_batch, where="parallax > 10")
    with query:
        parse_votable("input.vot.gz", query.on_batch)
Or iterate and transform row dicts with VOTableStreamingParser, then feed a custom sink (e.g. build a list or write CSV) in your own loop.
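As a minimal sketch of that pattern, the loop below streams rows, keeps those with a measured parallax above 10 mas, and writes selected columns to CSV with the standard library (the column names and output path are only illustrative):
import csv

from votpipe import VOTableStreamingParser

columns = ["source_id", "ra", "dec", "parallax"]

with open("nearby.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in VOTableStreamingParser("input.vot.gz"):
        # Each row is a dict keyed by column name; drop rows without a parallax.
        if row["parallax"] is not None and row["parallax"] > 10:
            writer.writerow(row)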
Row Transformations
When using the iterator (VOTableStreamingParser), you get a stream of row dicts. Transforms on that stream can:
- drop rows
- mutate rows
- emit multiple rows
Example (each row is a dict, so you can use row["column_name"]):
from votpipe import VOTableStreamingParser
def add_distance(rows):
    for row in rows:
        if row["parallax"] is None:
            continue
        row["distance_pc"] = 1000.0 / row["parallax"]
        yield row

# Iterator yields dicts; batch API would give you (fields, list of tuples).
for row in add_distance(VOTableStreamingParser("input.vot.gz")):
    process(row)
Astropy Integration
You can consume the row stream from VOTableStreamingParser and build an astropy.table.Table (e.g. by collecting rows into a list and calling Table(rows)). Because that builds the full table in memory, there is little advantage over using Astropy’s own VOTable reader unless you want to filter or aggregate in a streaming way before materializing the table. For large files, the batch callback + Parquet path is usually preferable; read the Parquet output with Astropy or pandas as needed.
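A minimal sketch of that approach, assuming the filtered result is small enough to hold in memory (astropy's Table constructor accepts a list of row dicts via rows=):
from astropy.table import Table

from votpipe import VOTableStreamingParser

# Stream and filter first, then materialize only the surviving rows.
rows = [
    row
    for row in VOTableStreamingParser("input.vot.gz")
    if row["parallax"] is not None and row["parallax"] > 10
]
table = Table(rows=rows)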
Output Formats
- Parquet — CLI (votpipe convert with .parquet or --format parquet) and votpipe.parquet.ParquetAdapter. Install with pip install votpipe[parquet].
- CSV — CLI (e.g. output.csv, output.csv.gz, output.csv.xz or --format csv) and votpipe.csv.CsvAdapter.
- ECSV — CLI (e.g. output.ecsv, output.ecsv.gz, output.ecsv.xz or --format ecsv) and votpipe.csv.EcsvAdapter.

Compression for CSV/ECSV is inferred from the output filename (.gz or .xz; stdlib gzip and lzma).
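The CSV/ECSV adapters can also be driven from Python. A sketch, assuming CsvAdapter exposes the same context-manager and on_batch interface as ParquetAdapter:
from votpipe import parse_votable
from votpipe.csv import CsvAdapter

# Assumption: CsvAdapter mirrors ParquetAdapter's context-manager / on_batch API.
with CsvAdapter("output.csv") as sink:
    parse_votable("table.vot.gz", sink.on_batch)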
Design Philosophy
votpipe follows a simple streaming model:
VOTable → parser → row stream → transform → serializer
The parser produces rows lazily. Transforms operate on row streams. Serializers consume the stream and write output.
This design allows large datasets to be processed with predictable memory usage.
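In practice, the transform and serializer stages can be ordinary Python generators and functions chained together. A sketch using the iterator API (the derived column and output path are only illustrative):
from votpipe import VOTableStreamingParser

def with_distance(rows):
    # Transform stage: drop rows without a usable parallax, add a derived column.
    for row in rows:
        if row["parallax"] is None or row["parallax"] <= 0:
            continue
        row["distance_pc"] = 1000.0 / row["parallax"]
        yield row

def write_distances(rows, path):
    # Serializer stage: consume the stream and write one line per row.
    with open(path, "w") as out:
        for row in rows:
            out.write(f"{row['source_id']},{row['distance_pc']:.1f}\n")

# VOTable -> parser -> row stream -> transform -> serializer
write_distances(with_distance(VOTableStreamingParser("input.vot.gz")), "distances.csv")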
Scope
votpipe focuses on streaming VOTable payloads embedded directly in XML.
Supported:
- TABLEDATA
- BINARY
- BINARY2
Not supported:
- FITS external serialization
- PARQUET external serialization
These serializations reference external files and are better handled by specialised readers.
Development Status
votpipe is an early-stage project. Implemented:
- Streaming parser for TABLEDATA, BINARY, BINARY2 (including .vot.gz)
- CLI: votpipe convert with --select, --where, --format, --compression, --batch-size, optional progress bar; Parquet, CSV, and ECSV output (format/compression from extension or --format)
- Compiled filter/select: CompiledBatchQuery with a small expression language for --where
- Batch callback API: parse_votable + ParquetAdapter (and optionally CompiledBatchQuery)
- Row iterator: VOTableStreamingParser yielding row dicts
- Parquet and CSV/ECSV adapters for programmatic use
Planned improvements include:
- full datatype coverage
- round-trip tests against Astropy
- improved metadata preservation
License
MIT License.