Doing some data things in a memory-efficient manner
How To Data
======================
1.) Split the data into smaller files.
2.) Create a generator that takes the data as an iterator and yields key/value pairs.
3.) Sort each list of key/value pairs by the key.
4.) Use a heap to merge the sorted lists of key/value pairs by the key.
5.) Group the key/value pairs by the key.
6.) Reduce each key's grouped values to one value, yielding a single key/value pair per key.
In lieu of a key, you may use a key function as long as it produces the
same key throughout the map-sort-merge-group phases.
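The steps above can be sketched with the standard library alone. This is a minimal illustration and not karld's own API: the in-memory ``shards`` stand in for split files, ``map_pairs`` and ``get_key`` are hypothetical names, and the ``key`` argument of ``heapq.merge`` assumes Python 3.5 or later.

```python
import heapq
import itertools
import operator


def map_pairs(rows):
    """Yield (key, value) pairs from an iterator of rows (step 2)."""
    for name, amount in rows:
        yield name, amount


# Hypothetical shards standing in for the split data files (step 1).
shards = [
    [("pear", 2), ("apple", 1), ("fig", 5)],
    [("fig", 1), ("apple", 4)],
    [("pear", 3), ("kiwi", 7)],
]

# The same key function is used for sorting, merging and grouping.
get_key = operator.itemgetter(0)

# Step 3: sort each shard's pairs by key.
sorted_shards = [sorted(map_pairs(shard), key=get_key) for shard in shards]

# Step 4: heap-merge the sorted shards lazily into one key-ordered stream.
merged = heapq.merge(*sorted_shards, key=get_key)

# Step 5: group consecutive pairs that share a key.
grouped = itertools.groupby(merged, key=get_key)

# Step 6: reduce each group's values to a single value per key.
totals = [(key, sum(value for _, value in group)) for key, group in grouped]
print(totals)  # [('apple', 5), ('fig', 6), ('kiwi', 7), ('pear', 5)]
```

Grouping only works here because the merged stream is ordered by the same key that ``groupby`` uses, which is why the key function must stay consistent across the phases.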
Split data
----------------------
Use split_file to split up your data files, or use split_csv_file to split up
CSV files, which may contain multi-line fields, so that those fields are not broken up.
::

    import os

    from karld import split_csv_file

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path', 'to', 'data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into.
            # I'll make it after the name of the file, removing the extension.
            out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            split_csv_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
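To see why CSV files need a CSV-aware splitter, here is a minimal standard-library sketch. It is not karld's implementation: ``split_csv_rows`` and ``max_rows`` are hypothetical names, and it works on an in-memory stream rather than writing shard files. It splits on parsed rows instead of raw lines, so a quoted field containing a newline stays in one piece.

```python
import csv
import io
import itertools


def split_csv_rows(stream, max_rows):
    """Yield lists of at most max_rows parsed CSV rows from a text stream.

    csv.reader parses quoted fields, so a field containing newlines
    stays inside one row and is never cut across two shards.
    """
    reader = csv.reader(stream)
    while True:
        shard = list(itertools.islice(reader, max_rows))
        if not shard:
            return
        yield shard


# Hypothetical in-memory data; the second row has a multi-line field.
data = io.StringIO('a,1\n"two\nlines",2\nc,3\n')
shards = list(split_csv_rows(data, max_rows=2))
print(shards)
```

Splitting the same text on raw lines with a limit of two would cut the second row in the middle of its quoted field.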
Contributing:
==================
Make pull requests to the **development** branch of
https://github.com/johnwlockwood/karl_data.
**Documentation** is written in reStructuredText and currently uses the
Sphinx style for field lists:
http://sphinx-doc.org/domains.html#info-field-lists
Check out closed pull requests to see the flow of development, as almost
every change to master is done via a pull request on **GitHub**. Code reviews
are welcome, even on merged pull requests. Feel free to ask questions about
the code.
Documentation
========================
Read the docs: http://karld.readthedocs.org/en/latest/