
Python support for Parquet file format

Project description


fastparquet is a python implementation of the parquet format, aiming to integrate into python-based big-data workflows.

Not all parts of the parquet format have been implemented or tested yet; see, e.g., the to-dos linked below. That said, fastparquet is capable of reading all the data files from the parquet-compatibility project.

Introduction

This software is alpha, expect frequent API changes and breakages.

A list of expected features and their status in this branch can be found in this issue. Please feel free to comment on that list regarding missing items and priorities.

In the meantime, the more eyes on this code, and the more example files and use cases, the better.

Requirements

(all development is against recent versions in the default anaconda channels)

Required:

  • numba

  • numpy

  • pandas

Optional (compression algorithms; gzip is always available; a quick availability check follows this list):

  • snappy

  • lzo

  • brotli
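
If you are unsure which of these are present in your environment, a quick import check will tell you. This is a minimal sketch, not fastparquet API; the module names assumed below are those of the common python bindings (python-snappy, python-lzo, brotli):

# Check which optional compression bindings are importable.
# gzip ships with the standard library, so it needs no check.
for codec in ('snappy', 'lzo', 'brotli'):
    try:
        __import__(codec)
        print('%s: available' % codec)
    except ImportError:
        print('%s: not installed' % codec)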

Installation

Install from github:

> pip install git+https://github.com/martindurant/fastparquet

This assumes the requirements above have been met. Numba should be installed using conda; a conda package of fastparquet itself is forthcoming.
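
For example, one plausible sequence in a conda environment (a sketch, not an official recipe; package names assume the default anaconda channels):

> conda install numba numpy pandas
> pip install git+https://github.com/martindurant/fastparquet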

Usage

Reading

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()                                        # load all columns
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])  # subset; col1 as categorical

You may specify which columns to load and which of those to keep as categorical (if the data uses dictionary encoding). The file path can be a single file, a metadata file pointing to other data files, or a directory (tree) containing data files; the latter is the typical output of hive/spark.
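
For instance, reading from a hive/spark-style directory might look like the following sketch. The paths 'mydata/' and 'mydata/_metadata' are hypothetical, though '_metadata' is the conventional name for the summary file such tools write:

from fastparquet import ParquetFile
# 'mydata/' is a hypothetical directory tree of data files,
# e.g. as written out by hive or spark.
pf = ParquetFile('mydata/')
# Alternatively, point at a metadata file referencing the data files:
# pf = ParquetFile('mydata/_metadata')
df = pf.to_pandas(['col1'])   # load only the columns you need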

Writing

from fastparquet import write
write('outfile.parq', df)                                  # single file, single row-group
write('outfile2.parq', df, partitions=[0, 10000, 20000],   # row-groups start at these rows
      compression='GZIP', file_scheme='hive')              # gzip-compressed, hive-style output

The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data types and plain encoding are supported, so expect performance similar to that of numpy.savez.
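
As a concrete illustration, here is a minimal round-trip using simple dtypes only, which suit the plain encoding supported so far ('roundtrip.parq' is just an example filename):

import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.5, 1.5, 2.5]})
write('roundtrip.parq', df)    # one file, one row-group, no compression
df2 = ParquetFile('roundtrip.parq').to_pandas()
assert (df2['col1'] == df['col1']).all()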

History

Since the second week of October, this fork of parquet-python has been undergoing considerable redevelopment. The aim is a small, simple, and performant library for reading and writing the parquet format from python.

