Change Large CSVs Faster with Easy Parallel Processing
Project description
Celery Stalk provides a way to quickly mutate values in extremely large CSV files using parallel processing. Under the hood it uses Celery, which performs operations about as fast as Python's built-in multithreading/multiprocessing, but is resilient to errors and outages and reallocates tasks across CPUs. You might find it useful for data cleaning tasks. Typical usage looks like this:
Example command line invocation:

```shell
python3 paralleliser.py --task_file parallelised_pandas_apply.py --ip_loc /Users/rudyvenguswamy/Coding/samplecsv.csv --op_loc /Users/rudyvenguswamy/Coding/
```
Example broken down:
- You will always run `paralleliser.py` (the script that sets up the workers).
- `--task_file`: the .py file containing the tasks that will be parallelized. `parallelised_pandas_apply.py` is an example task file that parallelizes the pandas apply. If you make your own task file, you will need to modify `paralleliser.py` so that the `master_run()` function takes the inputs your function requires (and modify the argument parser if you want to pass those inputs from the command line).
- `--ip_loc`: location of the input CSV file.
- `--op_loc`: where the resulting CSV will be placed.
CeleryStalk uses the `examplefn` function inside `parallelised_pandas_apply.py` by default. Don't worry, it prints out a line letting you know it's using the default function. However, if you'd like to apply your own function, write it inside the task file (`parallelised_pandas_apply.py`) and pass it as an argument to `master_run()`.
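A custom function for the task file should follow the shape pandas `apply` expects: take a row, return the mutated row. A minimal sketch, where the function name `clean_row` and the column `price` are illustrative and not part of CeleryStalk's API:

```python
import pandas as pd

# Hypothetical per-row task function: strip whitespace from a string
# column and cast it to float. Write something like this in the task
# file and pass it to master_run() instead of the default examplefn.
def clean_row(row):
    row['price'] = float(str(row['price']).strip())
    return row

# Quick local check with pandas apply before handing it to CeleryStalk
df = pd.DataFrame({'price': [' 1.5', '2.0 ', ' 3 ']})
df = df.apply(clean_row, axis=1)
```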
To report any bugs, email rvenguswamy@vmware.com
When to Use Celery Stalk
Celery Stalk should be used when:
- Your data set is large (personal testing shows it's better to just do the mutation directly [via `pd.read_csv` > mutate > `df.to_csv`] when the data set is small).
- The computation performed on each row is intensive.
- Your computer has at least a few CPU cores (if it doesn't and you're working with extremely large data set computation, I recommend reconsidering your hardware choices or using a cloud GPU/CPU service).
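For small data sets, the direct route mentioned above is all you need. A sketch of that plain pandas round trip (the column name `value` is illustrative, and `StringIO` stands in for a real file path):

```python
import io
import pandas as pd

# Small-file path: read, mutate, write -- no parallelism needed.
csv_data = io.StringIO("id,value\n1,10\n2,20\n")  # stand-in for a CSV on disk
df = pd.read_csv(csv_data)          # ingest
df['value'] = df['value'] + 1       # mutate (example: add 1 to a column)
out = io.StringIO()                 # stand-in for an output path
df.to_csv(out, index=False)         # write
```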
Benchmark
Parallelizing the pandas apply function improves performance significantly. On a 705 MB CSV file, the example function, which adds a value to a numerical column of the DataFrame that ingests the CSV, takes about 20 minutes 40 seconds to run serially.
With CeleryStalk on a computer with 12 logical cores (6 physical CPU cores ×2), the same process takes 4 minutes 50 seconds.
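Working out the speedup from those two timings:

```python
# Benchmark timings from above, converted to seconds
serial = 20 * 60 + 40     # 20:40 -> 1240 seconds
parallel = 4 * 60 + 50    #  4:50 ->  290 seconds
speedup = serial / parallel
print(f"{speedup:.1f}x faster")   # -> "4.3x faster"
```

A roughly 4.3× speedup on 12 logical cores, short of linear scaling because of task distribution overhead and the serial read/write steps.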
Project details
Download files
Source Distribution
Built Distribution
Hashes for CeleryStalk-1.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f551630e124db2ae2ea8d759b977527dfddc9ca159557de29c1f8d600545d480
MD5 | 9c0ea84fd67570cb18dfdded54784737
BLAKE2b-256 | fae6257de79f78720940b33406197e706651547a1e156a01d772ce0afd3bf64a