Parallel parser for large files
About
ParPar (Parallel Parser) is a lightweight tool that makes it easy to distribute a function across the records of a large file.
ParPar is meant to work on serialized data in which some fields take only a few distinct values that repeat across many records. For example:
a 1 1
a 1 2
a 2 3
a 2 4
a 3 5
a 3 6
b 1 7
b 1 8
b 2 9
b 2 10
b 3 11
b 3 12
Although there are 12 records, column 1 has only 2 unique values and column 2 has only 3. We could break this file up into smaller files under a directory:
<out-dir>/<col-1-value>/<col-2-value>
or vice versa.
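The directory layout described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not ParPar's implementation: `shard_by_columns` and `list_shards` are hypothetical helpers, column indices are assumed to be zero-based, and field values are assumed to be safe to use as path components.

```python
import os
import tempfile


def shard_by_columns(in_path, out_dir, columns, delim=" "):
    """Append each record to <out_dir>/<field value>/<field value>/... (hypothetical helper)."""
    with open(in_path) as src:
        for line in src:
            record = line.rstrip("\n")
            if not record:
                continue  # skip blank lines
            fields = record.split(delim)
            # One path component per sharding column, in the order given.
            shard_path = os.path.join(out_dir, *(fields[c] for c in columns))
            os.makedirs(os.path.dirname(shard_path), exist_ok=True)
            # Reopening the shard per record keeps the sketch short;
            # a real tool would likely cache open file handles.
            with open(shard_path, "a") as dst:
                dst.write(record + "\n")


def list_shards(out_dir):
    """Return shard paths relative to out_dir, sorted, with '/' as separator."""
    found = []
    for root, _dirs, files in os.walk(out_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), out_dir)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# The 12-record sample from the text above.
SAMPLE = [
    "a 1 1", "a 1 2", "a 2 3", "a 2 4", "a 3 5", "a 3 6",
    "b 1 7", "b 1 8", "b 2 9", "b 2 10", "b 3 11", "b 3 12",
]

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.txt")
    with open(in_path, "w") as f:
        f.write("\n".join(SAMPLE) + "\n")
    out_dir = os.path.join(tmp, "shards")
    shard_by_columns(in_path, out_dir, columns=[0, 1])
    print(list_shards(out_dir))
    # prints ['a/1', 'a/2', 'a/3', 'b/1', 'b/2', 'b/3']
```

Passing `columns=[1, 0]` instead would give the reverse layout, with the second column as the directory and the first as the file name.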
How to use
- Start by importing the ParPar class:
from parpar import ParPar
- Initialize an instance:
ppf = ParPar()
- Shard a large file into sub-files:
# shard by columns
ppf.shard(
    <input-file>, <output-directory>,
    <columns>, <delim>, <newline>
)
# shard by lines
ppf.shard_by_lines(
    <input-file>, <output-directory>,
    <number_of_lines>, <delim>, <newline>
)
- Check that the number of records is the same before and after sharding:
from parpar.utils import filelines
files = ppf.shard_files(<output-directory>)
records = ppf.sharded_records(files)
print(records == filelines(<input-file>))
- Map an arbitrary function across each line of all sharded files:
def foo(line, *args, **kwargs):
    pass
args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}
ppf.shard_line_apply(
    <output-directory>,
    foo, args, kwargs,
    processes=<number-of-processes>
)
- Map an arbitrary function across all sharded files:
def bar(file, *args, **kwargs):
    pass
args = [1, 2, 3]
kwargs = {'a': 'x', 'b': 'y'}
ppf.shard_file_apply(
    <output-directory>,
    bar, args, kwargs,
    processes=<number-of-processes>
)
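To make the last two ideas concrete (applying a function to every shard file in parallel, and checking that no records were lost), here is a rough standard-library sketch. It does not use or reproduce ParPar's internals: `apply_to_shard_files`, `count_records`, and the `executor_cls` parameter are hypothetical names introduced only for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def count_records(path, *args, **kwargs):
    """Example per-file function: the number of lines in one shard."""
    with open(path) as f:
        return sum(1 for _ in f)


def apply_to_shard_files(out_dir, func, args=(), kwargs=None, processes=2,
                         executor_cls=ProcessPoolExecutor):
    """Run func(path, *args, **kwargs) on every file under out_dir.

    Returns a dict mapping each shard path to the function's result.
    With the default process pool, func must be picklable, which in
    practice means it is defined at module level.
    """
    kwargs = kwargs or {}
    paths = sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(out_dir)
        for name in files
    )
    with executor_cls(max_workers=processes) as pool:
        # Submit one task per shard, then collect results in path order.
        futures = {path: pool.submit(func, path, *args, **kwargs) for path in paths}
        return {path: fut.result() for path, fut in futures.items()}
```

Summing the values returned by `apply_to_shard_files(<output-directory>, count_records)` gives the total number of sharded records, which can be compared with the line count of the input file. When a script uses the default process pool, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.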
File details
Details for the file parpar-0.0.19.tar.gz.
File metadata
- Download URL: parpar-0.0.19.tar.gz
- Upload date:
- Size: 7.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d72349456f083121b9995e10bb1f69eb2be2aaab6c1cdab0df5976c8d4c7bc3c |
| MD5 | a1fb32f4b03cced81ca79ca38e42bb30 |
| BLAKE2b-256 | f359f96aa503c15fd01725779cd369be5d0ba890215d1964e22fcb7fdb353253 |
File details
Details for the file parpar-0.0.19-py3-none-any.whl.
File metadata
- Download URL: parpar-0.0.19-py3-none-any.whl
- Upload date:
- Size: 8.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cbf554f3ae7775558b00c67062f3d72dc9337c05c97e80ce742da7583d4b4d4a |
| MD5 | f0ab93712d944c425c176996f4c685bb |
| BLAKE2b-256 | b67dbcef68c5809355278d259d5f7a747576dea3ffd10d5a28cea0fb15c34e02 |