
Python support for Parquet file format

Project description


fastparquet is a Python implementation of the Parquet format, aiming to integrate into Python-based big-data workflows.

Not all parts of the parquet-format have been implemented or tested yet; see, for example, the Todos linked below. That said, fastparquet is capable of reading all the data files from the parquet-compatibility project.

Introduction

Details of this project can be found in the documentation.

The original plan listing expected features can be found in this issue. Please feel free to comment on that list as to missing items and priorities, or raise new issues with bugs or requests.

Requirements

(all development is against recent versions in the default anaconda channels)

Required:

  • numba (requires LLVM 4.0.x)

  • numpy

  • pandas

  • cython

  • six

Optional (compression algorithms; gzip is always available):

  • snappy (aka python-snappy)

  • lzo

  • brotli
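As a quick way to see which of the optional compression libraries are present in an environment, the following minimal sketch probes for their modules without importing them. The import names (`snappy`, `lzo`, `brotli`) and the upper-case codec labels other than `GZIP` are assumptions based on the usual packages, and `available_codecs` is a hypothetical helper, not part of fastparquet.

```python
import importlib.util

# Assumed import names: python-snappy -> "snappy", python-lzo -> "lzo",
# brotli -> "brotli". The labels follow the 'GZIP' style used when writing.
OPTIONAL_CODECS = {"SNAPPY": "snappy", "LZO": "lzo", "BROTLI": "brotli"}


def available_codecs():
    """Return the codec labels whose backing module can be located."""
    found = ["GZIP"]  # gzip ships with the Python standard library
    for codec, module in OPTIONAL_CODECS.items():
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module) is not None:
            found.append(codec)
    return found


print(available_codecs())
```

The result always starts with `GZIP`; any other entries depend on what is installed.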

Installation

Install using conda:

conda install -c conda-forge fastparquet

Install from PyPI:

pip install fastparquet

Or install the latest version from GitHub:

pip install git+https://github.com/dask/fastparquet

For the pip methods, numba must have been previously installed (using conda).

Usage

Reading

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
# Load every column
df = pf.to_pandas()
# Load only two columns, keeping col1 as a categorical
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load and which of those to keep as categoricals (if the data uses dictionary encoding). The file path can be a single file, a metadata file pointing to other data files, or a directory (tree) containing data files. The last of these is what hive/spark typically output.
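To make the directory-tree case more concrete, here is a minimal sketch of the `key=value` directory naming that hive/spark commonly use for partitioned output. The file names are hypothetical and `partition_values` is an illustrative helper. It only shows how such paths encode partition columns, and it is not a description of how fastparquet itself walks the tree.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract key=value directory components from a data-file path."""
    parts = PurePosixPath(path).parent.parts
    return dict(p.split("=", 1) for p in parts if "=" in p)


# Hypothetical file names in a hive/spark-style partitioned layout
files = [
    "dataset/year=2016/month=10/part-0.parquet",
    "dataset/year=2016/month=11/part-0.parquet",
]
for f in files:
    print(f, partition_values(f))
```

Each data file's partition values are read from its parent directories, which is why a whole tree can be treated as one logical dataset.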

Writing

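from fastparquet import write
# df is an existing pandas DataFrame
# Default: single file, single row-group, no compression
write('outfile.parq', df)
# Explicit row-group start rows, gzip compression, hive-style multi-file output
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')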

The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data-types and plain encoding are supported, so expect performance to be similar to numpy.savez.
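The `row_group_offsets` list in the example above can be read as the starting row of each row-group. The sketch below shows that interpretation with plain Python, assuming each offset opens a new row-group and the last group runs to the end of the data. `row_group_slices` is a hypothetical helper for illustration, not a fastparquet function.

```python
def row_group_slices(offsets, n_rows):
    """Turn row-group start offsets into (start, stop) pairs covering n_rows."""
    # Ignore offsets that fall beyond the end of the data
    starts = [o for o in offsets if o < n_rows]
    # Each group stops where the next one starts; the last stops at n_rows
    stops = starts[1:] + [n_rows]
    return list(zip(starts, stops))


print(row_group_slices([0, 10000, 20000], 25000))
# [(0, 10000), (10000, 20000), (20000, 25000)]
```

Under this reading, a 25,000-row frame written with offsets `[0, 10000, 20000]` would end up in three row-groups, the last holding the remaining 5,000 rows.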

History

Since early October 2016, this fork of parquet-python has been undergoing considerable redevelopment. The aim is to have a small, simple, and performant library for reading and writing the Parquet format from Python.
