Doing some data things in a memory efficient manner
How To Data
1. Split the data.
2. Create a generator that takes the data as an iterator, yielding key/value pairs.
3. Sort each list of key/value pairs by the key.
4. Use a heap to merge the sorted lists of key/value pairs by the key.
5. Group the key/value pairs by the key.
6. Reduce each key's grouped values to one value, yielding a single key/value pair.
In lieu of a key, you may use a key function as long as it produces the same key throughout the map-sort-merge-group phases.
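The steps above can be sketched with the standard library alone; the word-count data and the key choice here are made up for illustration and do not come from karld::

```python
import heapq
import itertools


def map_to_pairs(chunk):
    # Map phase: yield (key, value) pairs from one chunk of data.
    # Lowercasing the word is the "key function" in this sketch.
    for word in chunk:
        yield word.lower(), 1


def pipeline(chunks):
    # Map each chunk to key/value pairs, then sort each list by key.
    sorted_lists = [sorted(map_to_pairs(chunk)) for chunk in chunks]

    # Merge the sorted lists with a heap, then group by key.
    merged = heapq.merge(*sorted_lists)
    for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
        # Reduce each key's grouped values to a single key/value pair.
        yield key, sum(value for _, value in group)


if __name__ == "__main__":
    chunks = [["a", "b", "a"], ["B", "c"]]
    print(dict(pipeline(chunks)))  # {'a': 2, 'b': 2, 'c': 1}
```

Because each list is sorted before merging, ``heapq.merge`` emits the pairs in global key order, which is what lets ``itertools.groupby`` collect each key's values in a single streaming pass.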
Split data
Use split_file to split up your data files, or use split_csv_file for csv files that may have multi-line fields, so those fields are not broken apart::
    import os

    import karld

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv",
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into based
            # on the name of the file.
            out_dir = os.path.join(data_path, 'split_data',
                                   filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000
            # per shard of the file.
            karld.io.split_csv_file(os.path.join(data_path, filename),
                                    out_dir)


    if __name__ == "__main__":
        main()
When you’re generating data and want to shard it out to files based on quantity, use one of the split output functions, such as split_file_output_csv, split_file_output or split_file_output_json::
    import os
    import pathlib

    import karld


    def main():
        """
        Python 2 version
        """
        items = (str(x) + os.linesep for x in range(2000))

        out_dir = pathlib.Path('shgen')
        karld.io.ensure_dir(str(out_dir))
        karld.io.split_file_output('big_data', items, str(out_dir))


    if __name__ == "__main__":
        main()
CSV serializable data::
    import pathlib

    import karld


    def main():
        """
        From a source of data, shard it to csv files.
        """
        if karld.is_py3():
            third = chr
        else:
            third = unichr

        # Your data source
        items = ((x, x + 1, third(x + 10)) for x in range(2000))

        out_dir = pathlib.Path('shard_out_csv')
        karld.io.ensure_dir(str(out_dir))
        karld.io.split_file_output_csv('big_data.csv', items,
                                       str(out_dir))


    if __name__ == "__main__":
        main()
Rows of json serializable data::
    import pathlib

    import karld


    def main():
        """
        From a source of data, shard it to json files.
        """
        if karld.is_py3():
            third = chr
        else:
            third = unichr

        # Your data source
        items = ((x, x + 1, third(x + 10)) for x in range(2000))

        out_dir = pathlib.Path('shard_out_json')
        karld.io.ensure_dir(str(out_dir))
        karld.io.split_file_output_json('big_data.json', items,
                                        str(out_dir))


    if __name__ == "__main__":
        main()
Contributing:
- Make pull requests to the development branch.
- Documentation is written in reStructuredText and currently uses the Sphinx style for field lists: http://sphinx-doc.org/domains.html#info-field-lists
Check out closed pull requests to see the flow of development: almost every change to master is made via a pull request on GitHub. Code reviews are welcome, even on merged pull requests. Feel free to ask questions about the code.
Documentation
Read the docs: http://karld.readthedocs.org/en/latest/