Parallel parser for large files
About
ParPar (Parallel Parser) is a lightweight tool that makes it easy to distribute a function across a large file.
ParPar is meant for serialized data in which some fields take only a few distinct values that repeat across many records. For example:
a 1 1
a 1 2
a 2 3
a 2 4
a 3 5
a 3 6
b 1 7
b 1 8
b 2 9
b 2 10
b 3 11
b 3 12
Although there are 12 records, column 1 has only 2 unique values and column 2 has only 3. We could therefore break this file up into smaller files under a directory:
<out-dir>/<col-1-value>/<col-2-value>
or vice versa.
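This description does not show how ParPar builds that layout internally. As a rough, standard-library-only sketch of the idea, the hypothetical function `shard_by_columns` below (not part of ParPar's API) writes each record to a file whose path is built from the values in the chosen columns. It assumes zero-based column indices, `\n` line endings, and column values that are safe to use as file names.

```python
import os
from contextlib import ExitStack


def shard_by_columns(in_path, out_dir, columns, delim=" "):
    """Write each record to out_dir/<value of columns[0]>/<value of columns[1]>/...

    Hypothetical helper for illustration only; not ParPar's implementation.
    Returns the sorted list of shard file paths that were written.
    """
    handles = {}  # one open file handle per unique combination of column values
    with ExitStack() as stack, open(in_path) as src:
        for line in src:  # stream the input so it never has to fit in memory
            record = line.rstrip("\n")
            if not record:
                continue  # skip blank lines
            fields = record.split(delim)
            key = tuple(fields[c] for c in columns)
            if key not in handles:
                path = os.path.join(out_dir, *key)
                os.makedirs(os.path.dirname(path), exist_ok=True)
                # ExitStack closes every shard file when the block exits
                handles[key] = stack.enter_context(open(path, "w"))
            handles[key].write(record + "\n")
    return sorted(os.path.join(out_dir, *key) for key in handles)
```

Called with `columns=[0, 1]` on the 12-record sample above, this sketch produces six files, such as `<out-dir>/a/1` holding the two records that start with `a 1`. Keeping one open handle per shard works because the number of unique values is small, which is the case this layout is meant for.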
How to use
- Start by importing the ParPar class:
from parpar import ParPar
- Initialize an instance:
ppf = ParPar()
- Shard a large file into sub-files:
# shard by columns
ppf.shard(
    <input-file>, <output-directory>,
    <columns>, <delim>, <newline>
)
# shard by lines
ppf.shard_by_lines(
    <input-file>, <output-directory>,
    <number_of_lines>, <delim>, <newline>
)
- Check that the number of records is the same before and after sharding:
from parpar.utils import filelines
files = ppf.shard_files(<output-directory>)
records = ppf.sharded_records(files)
print(records == filelines(<input-file>))
- Map an arbitrary function across each line of all sharded files:
def foo(line, *args, **kwargs):
    pass
args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}
ppf.shard_line_apply(
    <output-directory>,
    foo, args, kwargs,
    processes=<number-of-processes>
)
- Map an arbitrary function across all sharded files:
def bar(file, *args, **kwargs):
    pass
args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}
ppf.shard_file_apply(
    <output-directory>,
    bar, args, kwargs,
    processes=<number-of-processes>
)
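This description does not say how ParPar implements the record check or the mapping steps above, or what the mapping calls return. As a rough, standard-library-only sketch of the general idea, the block below walks a shard directory, counts records, and calls a function on every line of every shard, optionally spreading the files over a process pool. All names here (`list_shard_files`, `count_lines`, `apply_to_lines`, `map_over_shards`) are hypothetical and are not ParPar's API; returning the results as a dict keyed by file path is also an assumption.

```python
import os
from functools import partial
from multiprocessing import Pool


def list_shard_files(out_dir):
    """Return every file under out_dir, sorted for a stable order."""
    found = []
    for root, _dirs, names in os.walk(out_dir):
        found.extend(os.path.join(root, name) for name in names)
    return sorted(found)


def count_lines(path):
    """Count the records (lines) in one file."""
    with open(path) as fh:
        return sum(1 for _ in fh)


def apply_to_lines(path, func, args=(), kwargs=None):
    """Call func(line, *args, **kwargs) on each line of one file; collect results."""
    kwargs = kwargs or {}
    with open(path) as fh:
        return [func(line.rstrip("\n"), *args, **kwargs) for line in fh]


def map_over_shards(out_dir, func, args=(), kwargs=None, processes=1):
    """Apply func to every line of every shard file under out_dir.

    Hypothetical helper for illustration only. With processes > 1, each file
    is handled by a worker process, so func must be a picklable, module-level
    function.
    """
    files = list_shard_files(out_dir)
    worker = partial(apply_to_lines, func=func, args=args, kwargs=kwargs)
    if processes <= 1:
        results = [worker(path) for path in files]  # run in the current process
    else:
        with Pool(processes) as pool:
            results = pool.map(worker, files)  # one task per shard file
    return dict(zip(files, results))
```

Summing `count_lines` over `list_shard_files(<output-directory>)` and comparing the total with the line count of the input file mirrors the record check in the steps above. When running with `processes` greater than 1 on platforms that start workers with `spawn` (Windows, macOS), the calling script should sit under an `if __name__ == "__main__":` guard.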