Python support for Parquet file format

Project description

parquet-python

parquet-python is a pure-python implementation (currently with only read-support) of the parquet format. It comes with a script for reading parquet files and outputting the data to stdout as JSON or TSV (without the overhead of JVM startup). Performance has not yet been optimized, but it’s useful for debugging and quick viewing of data in files.

Not all parts of the parquet-format have been implemented or tested yet (for example, nested data); see Todos below for a full list. That said, parquet-python is capable of reading all the data files from the parquet-compatibility project.

requirements

parquet-python has been tested on python 2.7, 3.4, and 3.5. It depends on thrift (0.9) and python-snappy (for snappy compressed files).
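As a quick way to confirm the dependencies above are present, the following sketch tries to import each one. It assumes the usual import names, thrift for the thrift package and snappy for python-snappy; the helper is_importable is a hypothetical name used only for illustration and is not part of parquet-python.

```python
def is_importable(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# Assumed import names: "thrift" (thrift) and "snappy" (python-snappy).
dependency_status = {
    name: is_importable(name) for name in ("thrift", "snappy")
}

for name, available in sorted(dependency_status.items()):
    print("%s: %s" % (name, "found" if available else "missing"))
```

A missing snappy module should only matter for snappy-compressed files, per the note above.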

getting started

parquet-python is available via PyPI and can be installed using pip install parquet. The package includes the parquet command for reading parquet files, e.g. parquet test.parquet. See parquet --help for full usage.

Example

parquet-python currently has two programmatic interfaces with similar functionality to Python’s csv reader. First, it supports a DictReader which returns a dictionary per row. Second, it has a reader which returns a list of values for each row. Both functions require a file-like object and support an optional columns argument to read only the specified columns.

import parquet
import json

# Assuming a parquet file with two rows and three columns:
# foo bar baz
# 1   2   3
# 4   5   6

# Parquet is a binary format, so open the file in binary mode.
with open("test.parquet", "rb") as fo:
    # prints:
    # {"foo": 1, "bar": 2}
    # {"foo": 4, "bar": 5}
    for row in parquet.DictReader(fo, columns=['foo', 'bar']):
        print(json.dumps(row))


with open("test.parquet", "rb") as fo:
    # prints:
    # 1,2
    # 4,5
    for row in parquet.reader(fo, columns=['foo', 'bar']):
        print(",".join([str(r) for r in row]))
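Because reader yields one list of values per row, its output can be handed to Python's csv module. The sketch below writes rows as tab-separated text; rows_to_tsv is a hypothetical helper, not part of parquet-python, and the sample rows stand in for what parquet.reader(fo, columns=['foo', 'bar']) would yield for the example file above. It assumes Python 3, where csv writes to text streams.

```python
import csv
import io


def rows_to_tsv(rows, out):
    """Write an iterable of row lists to `out` as TSV; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Stand-in rows (assumption): replace with parquet.reader(fo, columns=[...])
# when reading a real file opened in binary mode.
sample_rows = [[1, 2], [4, 5]]

buf = io.StringIO()
row_count = rows_to_tsv(sample_rows, buf)
tsv_text = buf.getvalue()
print(tsv_text, end="")
```

Using csv.writer rather than a plain "\t".join means values that themselves contain tabs or newlines are quoted instead of corrupting the output.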

Todos

  • Support the deprecated bitpacking

  • Fix handling of repetition-levels and definition-levels

  • Tests for nested schemas, null data

  • Support reading of data from HDFS via snakebite and/or webhdfs.

  • Implement writing

  • Performance evaluation and optimization (e.g. how does it compare to the C++ and Java implementations?)

Contributing

Contributions are made via pull requests. Please include tests with your changes and follow PEP 8.

Download files

Source Distribution

parquet-1.2.tar.gz (21.5 kB)

Uploaded Source

Built Distributions

parquet-1.2-py3-none-any.whl (20.2 kB)

Uploaded Python 3

parquet-1.2-py2-none-any.whl (20.2 kB)

Uploaded Python 2

File details

Details for the file parquet-1.2.tar.gz.

File metadata

  • Download URL: parquet-1.2.tar.gz
  • Size: 21.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for parquet-1.2.tar.gz
Algorithm Hash digest
SHA256 5b45b63f3381af8d059ecc301954fa15babb6ba96e95939382e42c94520e8045
MD5 05aacec0620ac63ecd7dd77bf7fb9fee
BLAKE2b-256 74b5bc459aab0566fc3cf3397467922c37411ab6e3361bab9e0ca165e1089ce8

File details

Details for the file parquet-1.2-py3-none-any.whl.

File hashes

Hashes for parquet-1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 67a9ac65b3748a4ae1185facd70540cfb5534416b43d0a1650422dbb4f52eb91
MD5 c536aaea853f87cd685e0573e86c7a5f
BLAKE2b-256 d42c31867848b0238fb1cf0b2fcb60296b3bd7e3c455b97c92026b6be652d34c

File details

Details for the file parquet-1.2-py2-none-any.whl.

File hashes

Hashes for parquet-1.2-py2-none-any.whl
Algorithm Hash digest
SHA256 ff39f63160a1b6226eb0257c0cd6a3d6f015e10681bdd6e4e0713c9df5e8b94e
MD5 63f9785af4d486dcfd708846e0590a55
BLAKE2b-256 4ba3c0aae38ac1bc7137a510d326fb99482ddd6cb6d468e9875e28be011bb833
