Doing some data things in a memory-efficient manner
How To Data
1. Split the data.
2. Create a generator that takes the data as an iterator, yielding key/value pairs.
3. Sort each list of key/value pairs by key.
4. Use a heap to merge the sorted lists of key/value pairs by key.
5. Group the key/value pairs by key.
6. Reduce each key's grouped values to one value, yielding a single key/value pair.

In lieu of a key, you may use a key function, as long as it produces the same key throughout the map-sort-merge-group phases.
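The steps above can be sketched with the standard library alone. This is a minimal illustration, not karld's API: the in-memory `shards` stand in for the split files, and the reduce step (a sum) is a hypothetical choice.

```python
import heapq
import itertools
import operator

# Hypothetical in-memory "shards"; in practice each would be a split file
# read back as an iterator of key/value pairs.
shards = [
    [("b", 2), ("a", 1), ("a", 3)],
    [("c", 5), ("b", 4)],
]

first = operator.itemgetter(0)

# Sort each shard's key/value pairs by key.
sorted_shards = [sorted(shard, key=first) for shard in shards]

# Use a heap to merge the sorted shards into one sorted stream.
merged = heapq.merge(*sorted_shards, key=first)

# Group by key, then reduce each key's grouped values to one value.
results = [
    (key, sum(value for _, value in group))
    for key, group in itertools.groupby(merged, key=first)
]

print(results)  # [('a', 4), ('b', 6), ('c', 5)]
```

Because `heapq.merge` and `itertools.groupby` both work lazily on iterators, only one record per shard needs to be in memory at a time once the per-shard sorts are done.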
Split data
Use split_file to split up your data files, or use split_csv_file to split up CSV files that may have multi-line fields, ensuring those fields are not broken across shards:

```python
import os

import karld

big_file_names = [
    "bigfile1.csv",
    "bigfile2.csv",
    "bigfile3.csv",
]

data_path = os.path.join('path', 'to', 'data', 'root')


def main():
    for filename in big_file_names:
        # Name the directory to write the split files into based
        # on the name of the file.
        out_dir = os.path.join(data_path, 'split_data',
                               filename.replace('.csv', ''))

        # Split the file, with a default max_lines=200000 per shard of the file.
        karld.io.split_csv_file(os.path.join(data_path, filename), out_dir)


if __name__ == "__main__":
    main()
```
When you're generating data and want to shard it out to files based on quantity, use one of the split output functions, such as split_file_output_csv, split_file_output, or split_file_output_json:

```python
import os
import pathlib

import karld


def main():
    """
    Python 2 version
    """
    items = (str(x) + os.linesep for x in range(2000))

    out_dir = pathlib.Path('shgen')
    karld.io.ensure_dir(str(out_dir))

    karld.io.split_file_output('big_data', items, str(out_dir))


if __name__ == "__main__":
    main()
```
CSV-serializable data:

```python
import pathlib

import karld


def main():
    """
    From a source of data, shard it to csv files.
    """
    if karld.is_py3():
        third = chr
    else:
        third = unichr

    # Your data source
    items = ((x, x + 1, third(x + 10)) for x in range(2000))

    out_dir = pathlib.Path('shard_out_csv')
    karld.io.ensure_dir(str(out_dir))

    karld.io.split_file_output_csv('big_data.csv', items, str(out_dir))


if __name__ == "__main__":
    main()
```
Rows of JSON-serializable data:

```python
import pathlib

import karld


def main():
    """
    From a source of data, shard it to json files.
    """
    if karld.is_py3():
        third = chr
    else:
        third = unichr

    # Your data source
    items = ((x, x + 1, third(x + 10)) for x in range(2000))

    out_dir = pathlib.Path('shard_out_json')
    karld.io.ensure_dir(str(out_dir))

    karld.io.split_file_output_json('big_data.json', items, str(out_dir))


if __name__ == "__main__":
    main()
```
Contributing:
- Make pull requests to the development branch of the repository.
- Documentation is written in reStructuredText and currently uses the Sphinx style for field lists: http://sphinx-doc.org/domains.html#info-field-lists
Check out closed pull requests to see the flow of development; almost every change to master is done via a pull request on GitHub. Code reviews are welcome, even on merged pull requests. Feel free to ask questions about the code.
Documentation
Read the docs: http://karld.readthedocs.org/en/latest/