Doing some data things in a memory-efficient manner
How To Data
Split data.
Create a generator that takes the data as an iterator and yields key/value pairs.
Sort each list of key/value pairs by the key.
Use a heap to merge the sorted lists of key/value pairs by the key.
Group key/value pairs by the key.
Reduce each key's grouped values to one value, yielding a single key/value pair.
In lieu of a key, you may use a key function as long as it produces the same key throughout the map-sort-merge-group phases.
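The phases above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not karld's own API: `map_pairs` and `run_pipeline` are hypothetical names, the input is assumed to be lines of the form `key,value`, and the shards are small in-memory lists so the example stays self-contained. In a real run each sorted shard would normally be written to disk and streamed back for the merge.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_pairs(lines):
    """Map phase: yield (key, value) pairs from 'key,value' strings."""
    for line in lines:
        key, value = line.strip().split(",", 1)
        yield key, int(value)


def run_pipeline(shards):
    """Sort, merge, group and reduce the pairs produced from each shard."""
    # The same key function is used in every phase.
    key = itemgetter(0)
    # Sort each shard's pairs by key. Kept in memory here for brevity.
    sorted_shards = [sorted(map_pairs(shard), key=key) for shard in shards]
    # heapq.merge lazily merges the already sorted inputs by key.
    merged = heapq.merge(*sorted_shards, key=key)
    # groupby relies on the merged stream being ordered by key.
    for group_key, group in groupby(merged, key=key):
        # Reduce: collapse each group's values into a single value.
        yield group_key, sum(value for _, value in group)


shards = [
    ["b,2", "a,1", "c,5"],
    ["a,10", "c,1"],
]
result = list(run_pipeline(shards))
print(result)
```

Because `heapq.merge` and `groupby` are both lazy, only one pair per shard plus the current group needs to be held at a time during the merge and reduce phases.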
Split data
Use split_file to split up your data files, or use split_csv_file to split up CSV files that may have multi-line fields, so that those fields are not broken up:
    import os

    import karld

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into based
            # on the name of the file.
            out_dir = os.path.join(data_path, 'split_data',
                                   filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            karld.io.split_csv_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
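To see why a CSV-aware split matters, here is a small standard-library sketch. It is not how karld implements split_csv_file; `split_csv_rows` is a hypothetical helper that batches parsed rows rather than raw lines, so a quoted field containing a newline always stays inside one shard.

```python
import csv
import io
from itertools import islice


def split_csv_rows(stream, max_rows):
    """Yield lists of at most max_rows parsed CSV rows from a text stream."""
    reader = csv.reader(stream)
    while True:
        # islice pulls whole rows, not physical lines, from the reader.
        batch = list(islice(reader, max_rows))
        if not batch:
            return
        yield batch


# The second row has a quoted field that spans two physical lines.
data = 'id,note\r\n1,"line one\nline two"\r\n2,plain\r\n3,last\r\n'

# newline="" leaves line endings untouched, as the csv module expects.
shards = list(split_csv_rows(io.StringIO(data, newline=""), max_rows=2))

# Splitting on physical lines would cut the quoted field in half.
physical_lines = data.splitlines()
print(len(physical_lines), "physical lines,", sum(map(len, shards)), "rows")
```

Counting rows instead of lines is the design choice that keeps multi-line fields whole; the cost is that every row is parsed once during the split.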
When you’re generating data and want to shard it out to files based on quantity, use one of the split output functions such as split_file_output_csv, split_file_output or split_file_output_json:
    import os
    import pathlib

    import karld


    def main():
        """
        Python 2 version
        """
        items = (str(x) + os.linesep for x in range(2000))

        out_dir = pathlib.Path('shgen')
        karld.io.ensure_dir(str(out_dir))

        karld.io.split_file_output('big_data', items, str(out_dir))


    if __name__ == "__main__":
        main()
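The idea of sharding generated data by quantity can be sketched without karld. The helper below, `shard_lines`, is hypothetical, and the `<index>_<name>` file naming and the `max_lines=800` limit are assumptions chosen for illustration; karld's own naming and defaults may differ.

```python
import os
import tempfile
from itertools import islice


def shard_lines(items, out_dir, base_name, max_lines):
    """Write items to numbered files of at most max_lines lines each."""
    iterator = iter(items)
    paths = []
    index = 0
    while True:
        # Only one batch of lines is held in memory at a time.
        batch = list(islice(iterator, max_lines))
        if not batch:
            break
        # Hypothetical naming scheme: shard index, then the base name.
        path = os.path.join(out_dir, "{0}_{1}".format(index, base_name))
        with open(path, "w") as out:
            out.writelines(batch)
        paths.append(path)
        index += 1
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    items = (str(x) + "\n" for x in range(2000))
    paths = shard_lines(items, out_dir, "big_data", max_lines=800)
    # Count the lines in each shard to confirm the split.
    counts = []
    for path in paths:
        with open(path) as shard:
            counts.append(sum(1 for _ in shard))

print(counts)
```

Since the items come from a generator, the full data set is never materialized; memory use is bounded by `max_lines`.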
CSV serializable data:
    import pathlib

    import karld


    def main():
        """
        From a source of data, shard it to csv files.
        """
        if karld.is_py3():
            third = chr
        else:
            third = unichr

        # Your data source
        items = ((x, x + 1, third(x + 10)) for x in range(2000))

        out_dir = pathlib.Path('shard_out_csv')

        karld.io.ensure_dir(str(out_dir))

        karld.io.split_file_output_csv('big_data.csv', items, str(out_dir))


    if __name__ == "__main__":
        main()
Rows of JSON-serializable data:
    import pathlib

    import karld


    def main():
        """
        From a source of data, shard it to json files.
        """
        if karld.is_py3():
            third = chr
        else:
            third = unichr

        # Your data source
        items = ((x, x + 1, third(x + 10)) for x in range(2000))

        out_dir = pathlib.Path('shard_out_json')

        karld.io.ensure_dir(str(out_dir))

        karld.io.split_file_output_json('big_data.json', items, str(out_dir))


    if __name__ == "__main__":
        main()
Documentation
Read the docs: http://karld.readthedocs.org/en/latest/
Expanded Getting Started at http://karld.readthedocs.org/en/latest/getting-started.html.
More examples are documented at http://karld.readthedocs.org/en/latest/source/example.html. View the source of the example files, for examples…
Contributing:
Submit any issues or questions here: https://github.com/johnwlockwood/karl_data/issues.
- Make pull requests to development branch of
- Documentation is written in reStructuredText and currently uses the Sphinx style for field lists: http://sphinx-doc.org/domains.html#info-field-lists
Check out closed pull requests to see the flow of development, as almost every change to master is done via a pull request on GitHub. Code reviews are welcome, even on merged pull requests. Feel free to ask questions about the code.
Hashes for karld-0.3.1.macosx-10.12-x86_64.exe
Algorithm | Hash digest
---|---
SHA256 | a75f90072b372dc46b2ae567b87cbb5ab07f360c1c7f854ebcd45555dbdbf668
MD5 | 24bca1de47408f38c926fe085d2b55f5
BLAKE2b-256 | 13e4131a6ca81969db6be8bb87e4b7b78f2e8fbd115c874d6ec3c291aca5e0f9