Easy, efficient and Pythonic data loading of Parquet files for PyTorch-based libraries

PyTorch Parquet Data Loader

This library provides a number of classes that make it easy to read data from Parquet files into the PyTorch ecosystem. Although it is intended for natural language processing projects and NLP libraries, it is flexible enough for other uses. Feel free to use, modify or fork this library in any way!

Supported Libraries

| Library | Requirements | Usage Notes |
| --- | --- | --- |
| PyTorch-Lightning | Requires PyArrow and (optionally) Petastorm | The more basic PyArrow implementation is far easier to understand, but not battle-tested. |
| Transformers | | Can be used with either PyTorch-Lightning implementation, but Petastorm casts data types from one format to another several times midway, which can impair performance. |
| AllenNLP | PyArrow | Not implemented yet |

Please look here for further information on using Petastorm with Hugging Face Transformers.

Difference from Petastorm

Petastorm is a great (albeit complex) library for using Parquet files in a large variety of situations. Although it has basic PyTorch support, its solution is tough to understand, which can make it difficult to debug and modify for personal use.

In contrast, PyParquetLoaders focuses on providing Python classes that are easy to use, understand and modify. This means anyone can get started with their PyTorch models reasonably quickly, even if they're doing something slightly different or unique. PyParquetLoaders currently supports PyTorch-Lightning and Hugging Face Transformers, with AllenNLP support coming soon!

Usage Guide

To use a Parquet file for training a PyTorch model, choose and import the right dataset/loader for your library of choice. You can then use it just as you would for any other (simple) text/image file (see your library's relevant docs). Some examples are included in the Tests.py script.

Developer/Contributor Guide

To help develop/extend this library please use the following workflow:

  1. Fork and clone the repository
  2. Make your modifications
  3. Test your modifications (install in editable mode with pip install -e . and run the tests with python -m pytest)
  4. Commit (git commit -m "description of changes") and push (git push) your changes
  5. Create a pull request

Feel free to add any feature you see fit, or to fix or report any bug you find (with GitHub Issues). When creating a pull request, please describe carefully what you're doing and why, and include a brief overview of your changes. Commits should be small and change only one feature at a time.

This version

0.1
