
Pyarrow Dataset wrapper for reading parquet datasets as rows


SMurphyDev - Parquet Reader

version = 0.0.1

The purpose of this library is to enable reading parquet files one row at a time in a relatively memory-conscious manner. I say relatively because this library is a thin wrapper over pyarrow and pyarrow Datasets, and Arrow favors greedy allocation.

Parquet is a columnar format, which is compressed on disk. Its intended use case is analytics workflows where you may need to persist large amounts of data to disk that you will want to query later. The problem which inspired this library is a very different use case: I needed to extract data from a parquet file for use in an ETL style workflow. If you have a similar problem, maybe this will be useful for you too.

Installation

Installation is straightforward. Just use pip:

pip install parquetreader

Usage

In the simplest case you should be able to read a parquet file like so:

import parquetreader.reader as pr

# Fields/Columns you want to read from the parquet file.
fields = ["Field_1", "Field_2", "Field_3"]

# Path to the file you want to read.
# (Or to a directory containing parquet files, or a list of parquet files)
file_path = "path/to/file.parquet"

reader = pr.ParquetReader(file_path)

for row in reader.get_rows(fields):
    print(row["Field_1"])
    print(row["Field_2"])
    print(row["Field_3"])

get_rows returns a generator which yields the data in the underlying file one row at a time. Files/datasets are read in batches of 10k records; each batch is converted into dictionaries of plain Python types, and the rows are yielded lazily one at a time.
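Conceptually, the pattern behind this is simple: each batch becomes a list of plain-dict rows, and the rows are yielded one at a time so only a single batch is materialized at once. A minimal, self-contained sketch of that pattern (using stand-in lists of dicts rather than real pyarrow RecordBatches):

```python
from typing import Dict, Iterator, List


def rows_from_batches(batches: List[List[Dict]]) -> Iterator[Dict]:
    """Yield rows lazily, one at a time, from a sequence of batches.

    This mirrors the shape of get_rows(): each pyarrow RecordBatch is
    converted to a list of plain-dict rows (e.g. via to_pylist()), and
    the rows are then yielded without materializing everything at once.
    """
    for batch in batches:
        for row in batch:
            yield row


# Stand-in for two batches (tiny here; the library uses 10k records).
batches = [
    [{"Field_1": 1}, {"Field_1": 2}],
    [{"Field_1": 3}],
]

gen = rows_from_batches(batches)
print(next(gen))  # {'Field_1': 1}
```

Nothing is read until you ask for a row, which is what keeps memory usage bounded by the batch size rather than the file size.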

If you need more control you can create the pyarrow dataset yourself. Under the hood get_rows() calls Dataset.to_batches(), and you can also pass arguments through directly, which lets you tune the performance of reading the parquet files.

import parquetreader.reader as pr
import pyarrow.dataset as ds

# Fields/Columns you want to read from the parquet file.
fields = ["Field_1", "Field_2", "Field_3"]

# Path to the file you want to read.
# (Or to a directory containing parquet files, or a list of parquet files)
file_path = "path/to/file.parquet"

dataset = ds.dataset(
    file_path,
    format="parquet",
    exclude_invalid_files=True,
)

reader = pr.ParquetReader(dataset)

# Accepts the same arguments as Dataset.to_batches()
for row in reader.get_rows_with_args(
            columns=fields,
            batch_size=10_000,
            batch_readahead=4,  # Number of batches to read ahead in a file
            fragment_readahead=2,  # Number of files to read ahead in a dataset
            use_threads=False,
        ):
    print(row["Field_1"])
    print(row["Field_2"])
    print(row["Field_3"])

You can read more about the arguments you can pass when creating a dataset or reading a batch from the arrow docs:

  1. Dataset Args
  2. to_batches()/get_rows_with_args() Args
  3. Pyarrow docs on batch reads

Development

To get up and running if you want to contribute:

git clone https://github.com/SMurphyDev/parquet-batch.git
cd parquet-batch

python3 -m venv venv
source venv/bin/activate
pip install pip-tools
pip-sync requirements.txt dev-requirements.txt

At this point you should have all of the required dependencies set up and you should be good to go.
