Python support for Parquet file format

parquet-python

parquet-python is a pure-python implementation (currently with only read-support) of the parquet format. It comes with a script for reading parquet files and outputting the data to stdout as JSON or TSV (without the overhead of JVM startup). Performance has not yet been optimized, but it’s useful for debugging and quick viewing of data in files.

Not all parts of the parquet-format have been implemented or tested yet (e.g. nested data); see Todos below for a full list. That said, parquet-python is capable of reading all the data files from the parquet-compatibility project.

requirements

parquet-python has been tested on Python 2.7, 3.6, and 3.7. It depends on thriftpy2 and optionally on python-snappy (for snappy-compressed files, please also install parquet-python[snappy]).
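
Because python-snappy is optional, it can help to check whether it is importable before trying to read snappy-compressed files. The following is a minimal sketch using only the standard library; the helper name has_module is hypothetical and not part of parquet-python, and it assumes python-snappy is importable under the module name snappy.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(name) is not None


# Assumption: python-snappy installs a module named "snappy".
print("snappy support available:", has_module("snappy"))
```

If this prints False, snappy-compressed files will presumably not be readable until the optional dependency is installed.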

getting started

parquet-python is available via PyPI and can be installed using pip install parquet. The package includes the parquet command for reading parquet files, e.g. parquet test.parquet. See parquet --help for full usage.

Example

parquet-python currently has two programmatic interfaces with similar functionality to Python’s csv reader. First, it supports a DictReader which returns a dictionary per row. Second, it has a reader which returns a list of values for each row. Both functions require a file-like object and support an optional columns argument to read only the specified columns.

import parquet
import json

## assuming parquet file with two rows and three columns:
## foo bar baz
## 1   2   3
## 4   5   6

# parquet files are binary, so open them in "rb" mode
with open("test.parquet", "rb") as fo:
    # prints:
    # {"foo": 1, "bar": 2}
    # {"foo": 4, "bar": 5}
    for row in parquet.DictReader(fo, columns=['foo', 'bar']):
        print(json.dumps(row))


with open("test.parquet", "rb") as fo:
    # prints:
    # 1,2
    # 4,5
    for row in parquet.reader(fo, columns=['foo', 'bar']):
        print(",".join([str(r) for r in row]))
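
Since DictReader yields one dictionary per row, the rows can be regrouped by column with a few lines of plain Python. This is a minimal sketch, not part of parquet-python: the helper rows_to_columns is hypothetical, and the sample rows are stand-ins shaped like the DictReader output in the example above, so the snippet runs without a parquet file.

```python
import json


def rows_to_columns(rows):
    """Pivot an iterable of row dicts into a dict mapping column name to values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Create the column list on first sight, then append in row order.
            columns.setdefault(name, []).append(value)
    return columns


# Stand-in for rows yielded by a dictionary-per-row reader (hypothetical data).
sample_rows = [{"foo": 1, "bar": 2}, {"foo": 4, "bar": 5}]
print(json.dumps(rows_to_columns(sample_rows)))
```

With a real file, the same helper should accept parquet.DictReader(fo, columns=['foo', 'bar']) in place of sample_rows, since it only needs an iterable of dictionaries.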

Todos

  • Support the deprecated bitpacking

  • Fix handling of repetition-levels and definition-levels

  • Tests for nested schemas, null data

  • Support reading of data from HDFS via snakebite and/or webhdfs.

  • Implement writing

  • Performance evaluation and optimization (i.e. how does it compare to the C++ and Java implementations?)

Contributing

Contributions are made via pull requests. Please include tests with your changes and follow PEP 8.

To run the tests for all supported Python versions, install and run tox (pip install tox). To run them for just your current version, execute pip install -r requirements-development.txt and then nosetests.
