Parallel parser for large files
About
ParPar (Parallel Parser) is a lightweight tool that makes it easy to distribute a function across a large file.
ParPar is meant for serialized data in which some values are highly repeated across records for a given field. For example:
a 1 1
a 1 2
a 2 3
a 2 4
a 3 5
a 3 6
b 1 7
b 1 8
b 2 9
b 2 10
b 3 11
b 3 12
Although there are 12 records, column 1 has only 2 unique values, and column 2 has only 3. We could therefore break this file up into smaller files under a directory:
<out-dir>/<col-1-value>/<col-2-value>
or vice versa.
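For the example above, sharding on both columns yields six shard files, <out-dir>/a/1 through <out-dir>/b/3, each holding the two matching records. As a plain-Python illustration of that layout (a sketch of the idea, not ParPar's implementation; the function name, file names, and 0-based column indices are assumptions):

import os
from collections import defaultdict

def shard_by_columns(input_file, out_dir, columns, delim=" "):
    # Group each record by the values of the chosen columns, then write
    # one file per unique key under out_dir.
    buckets = defaultdict(list)
    with open(input_file) as fh:
        for line in fh:
            fields = line.rstrip("\n").split(delim)
            buckets[tuple(fields[c] for c in columns)].append(line)
    for key, lines in buckets.items():
        # All key components but the last become directory levels;
        # the last component names the shard file itself.
        os.makedirs(os.path.join(out_dir, *key[:-1]), exist_ok=True)
        with open(os.path.join(out_dir, *key), "w") as out:
            out.writelines(lines)

shard_by_columns("records.txt", "out", [0, 1])  # columns 1 and 2, 0-based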
How to use
- Start by importing the ParPar class:
from parpar import ParPar
- Initialize an instance:
ppf = ParPar()
- Shard a large file into sub-files:
ppf.shard(
    <input-file>, <output-directory>,
    <columns>, <delim>, <newline>
)
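For instance, to shard the example file above on its first two columns (the file name, directory name, 0-based column indices, and delimiter values below are assumptions; check the signature in your installed version):

ppf.shard(
    "records.txt", "shards",  # hypothetical input file and output directory
    [0, 1],                   # columns to shard on (assumed 0-based indices)
    " ", "\n"                 # field delimiter and record separator
)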
- Check to make sure the number of records is the same:
files = ppf.shard_files(<output-directory>)
records = ppf.sharded_records(files)
from parpar.utils import filelines
print(records == filelines(<input-file>))
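Presumably shard_files lists every shard file under the output directory and sharded_records totals their record counts, so with the hypothetical names from the previous step the check looks like:

files = ppf.shard_files("shards")
records = ppf.sharded_records(files)

from parpar.utils import filelines
print(records == filelines("records.txt"))  # True if no records were lost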
- Map an arbitrary function across all sharded files:
def foo(line, *args, **kwargs):
    pass
args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}
ppf.shard_apply(
    <output-directory>,
    foo, args, kwargs,
    processes=<number-of-processes>
)
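As a concrete sketch (the worker and every argument value below are hypothetical; the source only states that the function receives each record line plus the forwarded args and kwargs):

def score_line(line, multiplier, offset=0):
    # Hypothetical per-record worker: score each record by its length.
    return len(line) * multiplier + offset

ppf.shard_apply(
    "shards",        # hypothetical output directory from the shard step
    score_line,
    [2],             # positional args forwarded to score_line
    {"offset": 1},   # keyword args forwarded to score_line
    processes=4
)

Whether shard_apply collects and returns the per-record results is not stated here, so treat return handling as something to verify against the docs.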