Parallel parser for large files
About
ParPar (Parallel Parser) is a lightweight tool that makes it easy to distribute a function across a large file.
ParPar is meant for serialized data where some values are highly repeated across records for a given field, e.g.:
a 1 1
a 1 2
a 2 3
a 2 4
a 3 5
a 3 6
b 1 7
b 1 8
b 2 9
b 2 10
b 3 11
b 3 12
Although we have 12 records, column 1 contains only 2 unique values; likewise, column 2 contains only 3. We could break this file up into smaller files under a directory:
<out-dir>/<col-1-value>/<col-2-value>
or vice versa.
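For the 12-record example above, sharding on column 1 and then column 2 would produce a layout like the following (the exact naming of the leaf files is an assumption for illustration):

<out-dir>/
    a/
        1    <- holds "a 1 1" and "a 1 2"
        2    <- holds "a 2 3" and "a 2 4"
        3    <- holds "a 3 5" and "a 3 6"
    b/
        1    <- holds "b 1 7" and "b 1 8"
        2    <- holds "b 2 9" and "b 2 10"
        3    <- holds "b 3 11" and "b 3 12"

Each leaf shard can then be processed independently, which is what makes the parallel apply step below possible.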
How to use
- Start by importing the ParPar class:
from parpar import ParPar
- Initialize an instance:
ppf = ParPar()
- Shard a large file into sub-files (see the end-to-end sketch after these steps):
ppf.shard(
    <input-file>, <output-directory>,
    <columns>, <delim>, <newline>
)
- Check to make sure the number of records is the same:
from parpar.utils import filelines

files = ppf.shard_files(<output-directory>)
records = ppf.sharded_records(files)
print(records == filelines(<input-file>))
- Map an arbitrary function across all sharded files:
def foo(line, *args, **kwargs):
pass
args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}
ppf.shard_apply(
    <output-directory>,
    foo, args, kwargs,
    processes=<number-of-processes>
)
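Putting the steps together, here is a minimal end-to-end sketch using the 12-record example file from above. The file name records.txt, the shards output directory, the 0-based column indices, the space delimiter, the newline argument, and the process count are all illustrative assumptions rather than documented defaults:

from parpar import ParPar
from parpar.utils import filelines

ppf = ParPar()

# Shard records.txt on its first two columns (assumed 0-based),
# producing files under shards/<col-1-value>/<col-2-value>.
ppf.shard(
    "records.txt", "shards",
    [0, 1], " ", "\n"
)

# Verify that no records were lost during sharding.
files = ppf.shard_files("shards")
records = ppf.sharded_records(files)
assert records == filelines("records.txt")

# A trivial per-record function; each line of a shard is passed to it.
def report_length(line, *args, **kwargs):
    print(len(line))

# Apply the function across all shards in parallel.
ppf.shard_apply(
    "shards",
    report_length, [], {},
    processes=4
)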