Parquet Metadata Reader

rugo

rugo is a Parquet metadata reader for Python, built on C++17 and Cython. It inspects metadata at high throughput without loading columnar data pages.
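Metadata-only inspection is possible because the Parquet format keeps the file metadata in a footer at the end of the file: the last eight bytes are a 4-byte little-endian footer length followed by the magic bytes PAR1. The sketch below is not rugo's implementation. It only illustrates, using the standard library and a hypothetical helper named footer_span, how a reader can locate the footer without touching any data pages. The buffer is synthetic, not a valid Parquet file.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def footer_span(buf: bytes) -> Tuple[int, int]:
    """Return (start, length) of the footer region inside buf (hypothetical helper)."""
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file")
    # The 4 bytes before the trailing magic hold the footer length.
    (length,) = struct.unpack("<I", buf[-8:-4])
    start = len(buf) - 8 - length
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, length


# Synthetic stand-in: magic, fake data pages, fake footer, length, magic.
fake_footer = b"\x00" * 10
fake_file = (
    MAGIC + b"data-pages" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
)
print(footer_span(fake_file))  # (14, 10)
```

Only the bytes in the returned span need to be decoded to recover the schema and row-group details; everything before it can stay unread.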

Key Features

  • Fast metadata extraction backed by an optimized C++17 parser and thin Python bindings.
  • Complete schema and row-group details, including encodings, codecs, offsets, bloom filter pointers, and custom key/value metadata.
  • Accepts file paths, byte strings, and contiguous memoryviews; memoryviews are parsed without copying.
  • Optional schema conversion helpers for Orso.
  • No runtime dependencies beyond the Python standard library.

Installation

PyPI

pip install rugo

# Optional extras (quoted so that shells such as zsh do not expand the brackets)
pip install "rugo[orso]"
pip install "rugo[dev]"

From source

git clone https://github.com/mabel-dev/rugo.git
cd rugo
python -m venv .venv
source .venv/bin/activate
make update
make compile
pip install -e .

Requirements

  • Python 3.9 or newer
  • A C++17 compatible compiler (clang, gcc, or MSVC)
  • Cython and setuptools for source builds (installed by the commands above)

Quickstart

import rugo.parquet as parquet_meta

metadata = parquet_meta.read_metadata("example.parquet")

print(f"Rows: {metadata['num_rows']}")
print("Schema columns:")
for column in metadata["schema_columns"]:
    print(f"  {column['name']}: {column['physical_type']} ({column['logical_type']})")

first_row_group = metadata["row_groups"][0]
for column in first_row_group["columns"]:
    print(
        f"{column['name']}: codec={column['compression_codec']}, "
        f"nulls={column['null_count']}, range=({column['min']}, {column['max']})"
    )

read_metadata returns dictionaries composed of Python primitives, ready for JSON serialisation or downstream processing.
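As a small illustration of the JSON point, the sketch below serialises a hand-written dictionary shaped like a read_metadata result. The values are invented for the example, and default=str is a defensive assumption in case a statistic is ever a type that json cannot encode directly.

```python
import json

# Hand-written stand-in shaped like a read_metadata result (values are invented).
metadata = {
    "num_rows": 3,
    "schema_columns": [
        {"name": "id", "physical_type": "INT64", "logical_type": "", "nullable": False},
    ],
    "row_groups": [],
}

# default=str is a fallback for any value json cannot encode natively.
encoded = json.dumps(metadata, indent=2, default=str)
restored = json.loads(encoded)
print(restored["schema_columns"][0]["name"])
```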

Returned metadata layout

{
    "num_rows": int,
    "schema_columns": [
        {
            "name": str,
            "physical_type": str,
            "logical_type": str,
            "nullable": bool,
        },
        ...
    ],
    "row_groups": [
        {
            "num_rows": int,
            "total_byte_size": int,
            "columns": [
                {
                    "name": str,
                    "path_in_schema": str,
                    "type": str,
                    "logical_type": str,
                    "num_values": Optional[int],
                    "total_uncompressed_size": Optional[int],
                    "total_compressed_size": Optional[int],
                    "data_page_offset": Optional[int],
                    "index_page_offset": Optional[int],
                    "dictionary_page_offset": Optional[int],
                    "min": Any,
                    "max": Any,
                    "null_count": Optional[int],
                    "distinct_count": Optional[int],
                    "bloom_offset": Optional[int],
                    "bloom_length": Optional[int],
                    "encodings": List[str],
                    "compression_codec": Optional[str],
                    "key_value_metadata": Optional[Dict[str, str]],
                },
                ...
            ],
        },
        ...
    ],
}

Fields that are not present in the source Parquet file are reported as None. Minimum and maximum values are decoded into Python types when possible; otherwise hexadecimal strings are returned.
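One common use of this layout is row-group pruning: skipping row groups whose min/max range cannot contain a value of interest. The sketch below uses a hypothetical helper, candidate_row_groups, on a hand-written dictionary. It assumes the statistics were decoded into values comparable with the search value; statistics returned as hexadecimal strings would need separate handling.

```python
from typing import Any, Dict, List


def candidate_row_groups(metadata: Dict[str, Any], column: str, value: Any) -> List[int]:
    """Indices of row groups whose [min, max] range could contain value."""
    keep = []
    for index, group in enumerate(metadata["row_groups"]):
        for col in group["columns"]:
            if col["name"] != column:
                continue
            low, high = col["min"], col["max"]
            # Missing statistics (None) cannot rule a row group out.
            if low is None or high is None or low <= value <= high:
                keep.append(index)
            break
    return keep


# Hand-written stand-in containing only the keys the helper reads.
sample = {
    "row_groups": [
        {"columns": [{"name": "id", "min": 1, "max": 100}]},
        {"columns": [{"name": "id", "min": 101, "max": 200}]},
        {"columns": [{"name": "id", "min": None, "max": None}]},
    ]
}
print(candidate_row_groups(sample, "id", 150))  # [1, 2]
```

The third row group is kept because its statistics are absent, so it can only be excluded by reading the data itself.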

Parsing options

All entry points share the same keyword arguments:

  • schema_only (default False): return only the top-level schema without row group details.
  • include_statistics (default True): skip min/max/num_values decoding when set to False.
  • max_row_groups (default -1): limit the number of row groups inspected; handy for very large files.

metadata = parquet_meta.read_metadata(
    "large_file.parquet",
    schema_only=False,
    include_statistics=False,
    max_row_groups=2,
)
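One way to combine these options is a two-step inspection: a schema-only pass, followed by a pass limited to the first few row groups. The helper peek below is hypothetical and takes the reader as an argument, so the sketch runs without rugo or a Parquet file; in real use one would pass parquet_meta.read_metadata. The stand-in fake_reader treats a negative max_row_groups as "no limit", which is an assumption, and .get is used because the README does not say whether a schema-only result contains a row_groups key.

```python
from typing import Any, Callable, Dict


def peek(reader: Callable[..., Dict[str, Any]], source: Any, sample: int = 2) -> Dict[str, Any]:
    """Summarise a file from a schema-only pass plus a few row groups."""
    schema = reader(source, schema_only=True)
    partial = reader(source, include_statistics=False, max_row_groups=sample)
    return {
        "columns": [c["name"] for c in schema["schema_columns"]],
        "sampled_row_groups": len(partial.get("row_groups", [])),
    }


def fake_reader(source, schema_only=False, include_statistics=True, max_row_groups=-1):
    """Stand-in that accepts the documented keyword arguments; not rugo itself."""
    groups = [] if schema_only else [{"num_rows": 10, "columns": []}] * 5
    if max_row_groups >= 0:
        groups = groups[:max_row_groups]
    return {"num_rows": 50, "schema_columns": [{"name": "id"}], "row_groups": groups}


print(peek(fake_reader, "large_file.parquet"))
```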

Working with in-memory data

with open("example.parquet", "rb") as fh:
    data = fh.read()

from_bytes = parquet_meta.read_metadata_from_bytes(data)
from_view = parquet_meta.read_metadata_from_memoryview(memoryview(data))

read_metadata_from_memoryview performs zero-copy parsing when given a contiguous buffer.
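Whether a buffer is contiguous can be checked with the standard memoryview.contiguous attribute before handing it over. The helper below, contiguous_view, is a hypothetical sketch: it returns the view unchanged when it is already contiguous and makes a single copy otherwise, for example for a strided slice.

```python
def contiguous_view(buffer) -> memoryview:
    """Return a contiguous memoryview, copying only when unavoidable."""
    view = memoryview(buffer)
    if view.contiguous:
        return view  # shares memory with the original buffer
    # Strided or otherwise non-contiguous: one copy makes it contiguous.
    return memoryview(view.tobytes())


data = bytes(range(10))
whole = contiguous_view(data)                      # no copy
strided = contiguous_view(memoryview(data)[::2])   # copied once
print(whole.obj is data, strided.tobytes())
```

A memoryview over a memory-mapped file (mmap.mmap) is another contiguous buffer that avoids reading the whole file into a bytes object first.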

Optional Orso conversion

Install the optional extra (pip install rugo[orso]) to enable Orso helpers:

from rugo.converters.orso import extract_schema_only, rugo_to_orso_schema

metadata = parquet_meta.read_metadata("example.parquet")
relation = rugo_to_orso_schema(metadata, "example_table")
schema_info = extract_schema_only(metadata)

See examples/orso_conversion.py for a complete walkthrough.

Development

make update     # install build and test tooling (uses uv under the hood)
make compile    # rebuild the Cython extension with -O3 and C++17 flags
make test       # run pytest-based validation (includes PyArrow comparisons)
make lint       # run ruff, isort, pycln, cython-lint
make mypy       # type checking

make compile clears previous build artefacts before rebuilding the extension in-place.

Project layout

rugo/
├── rugo/__init__.py
├── rugo/parquet/
│   ├── metadata_reader.pyx
│   ├── metadata.cpp
│   ├── metadata.hpp
│   └── thrift.hpp
├── rugo/converters/orso.py
├── examples/
│   ├── comprehensive_metadata.py
│   └── orso_conversion.py
├── tests/
│   ├── data/
│   ├── test_all_metadata_fields.py
│   ├── test_logical_types.py
│   ├── test_orso_converter.py
│   └── test_statistics.py
├── Makefile
├── pyproject.toml
└── README.md

Status and limitations

  • Alpha status, under active development; API details may change.
  • Focused on metadata inspection; columnar data reads are out of scope.
  • Requires a C++17 compiler when installing from source or editing the Cython bindings.
  • Bloom filter information is exposed via offsets and lengths; higher-level helpers are planned.
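Until higher-level helpers exist, the offsets and lengths can be used to slice the raw bloom filter bytes out of a buffer. The sketch below uses a hypothetical helper, bloom_filter_bytes, and a synthetic buffer. It assumes bloom_offset is a byte offset from the start of the file and bloom_length is the size of the region in bytes, mirroring the Parquet column metadata fields; decoding the filter itself is out of scope here.

```python
from typing import Any, Dict, Optional


def bloom_filter_bytes(buf: bytes, column: Dict[str, Any]) -> Optional[memoryview]:
    """Return the raw bloom filter region for a column, or None if not recorded."""
    offset, length = column.get("bloom_offset"), column.get("bloom_length")
    if offset is None or length is None:
        return None
    if offset < 0 or length < 0 or offset + length > len(buf):
        raise ValueError("bloom filter region lies outside the buffer")
    # Slicing a memoryview does not copy the underlying bytes.
    return memoryview(buf)[offset:offset + length]


# Synthetic stand-ins; a real buffer would hold the whole Parquet file.
fake_file = b"HEADER" + b"BLOOMDATA" + b"TRAILER"
with_filter = {"name": "id", "bloom_offset": 6, "bloom_length": 9}
without_filter = {"name": "note", "bloom_offset": None, "bloom_length": None}

print(bytes(bloom_filter_bytes(fake_file, with_filter)))
print(bloom_filter_bytes(fake_file, without_filter))
```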

License

Licensed under the Apache License 2.0. See LICENSE for full terms.

Maintainer

Created and maintained by Justin Joyce (@joocer). Contributions are welcome via issues and pull requests.
