Parallel parser for large files
About
ParPar (Parallel Parser) is a lightweight tool that makes it easy to distribute a function across a large file.
ParPar is meant for serialized data in which some values are highly repeated across records for a given field, e.g.:
a 1 1
a 1 2
a 2 3
a 2 4
a 3 5
a 3 6
b 1 7
b 1 8
b 2 9
b 2 10
b 3 11
b 3 12
Although we have 12 records, column 1 has only 2 unique values; likewise, column 2 has only 3. We could break this file up into smaller files under a directory:
<out-dir>/<col-1-value>/<col-2-value>
or vice versa.
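For the sample data above, sharding on columns 1 and 2 would give six shards, one per unique (column-1, column-2) pair, following the directory scheme just described (whether each leaf is a single file or a directory of files is an implementation detail; the layout here is illustrative):

<out-dir>/a/1   (holds "a 1 1" and "a 1 2")
<out-dir>/a/2   (holds "a 2 3" and "a 2 4")
<out-dir>/a/3   (holds "a 3 5" and "a 3 6")
<out-dir>/b/1   (holds "b 1 7" and "b 1 8")
<out-dir>/b/2   (holds "b 2 9" and "b 2 10")
<out-dir>/b/3   (holds "b 3 11" and "b 3 12")

Each shard can then be read and processed independently, which is what makes the parallel apply step below possible.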
How to use
- Start by importing the ParPar class:
from parpar import ParPar
- Initialize an instance:
ppf = ParPar()
- Shard a large file into sub-files (a complete end-to-end sketch follows this list):
ppf.shard(
    <input-file>, <output-directory>,
    <columns>, <delim>, <newline>
)
- Check that the number of records is the same:
from parpar.utils import filelines

files = ppf.shard_files(<output-directory>)
records = ppf.sharded_records(files)
print(records == filelines(<input-file>))
- Map an arbitrary function across all sharded files:
def foo(line, *args, **kwargs):
    pass

args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}

ppf.shard_apply(<output-directory>,
    foo, args, kwargs,
    processes=<number-of-processes>
)
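Putting the steps together, here is a minimal end-to-end sketch. The file names, the worker function, the process count, and the use of 0-based column indices are assumptions for illustration; the calls simply mirror the signatures shown above:

from parpar import ParPar
from parpar.utils import filelines

def count_fields(line, *args, **kwargs):
    # Hypothetical worker: receives one record (line) at a time.
    return len(line.split(" "))

ppf = ParPar()

# Shard the space-delimited input on its first two columns
# (assuming 0-based column indices).
ppf.shard("data.txt", "shards", [0, 1], " ", "\n")

# Sanity check: the shards should contain exactly as many
# records as the original file.
files = ppf.shard_files("shards")
records = ppf.sharded_records(files)
assert records == filelines("data.txt")

# Fan the worker out over all sharded files using 4 processes.
ppf.shard_apply("shards", count_fields, [], {}, processes=4)

Because every shard is a separate file, each process can read its own shards without coordinating access to the original input.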