Doing some data things in a memory-efficient manner.
How To Data
======================
1.) Split the data.
2.) Create a generator that takes the data as an iterator and yields key/value pairs.
3.) Sort each list of key/value pairs by key.
4.) Use a heap to merge the sorted lists of key/value pairs by key.
5.) Group the key/value pairs by key.
6.) Reduce each key's grouped values to a single value, yielding one key/value pair per key.
In lieu of a key, you may use a key function as long as it produces the
same key throughout the map-sort-merge-group phases.
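The steps above can be sketched with the standard library alone. This is a minimal illustration, not karld's own API: `map_pairs` and `run_pipeline` are hypothetical names, the shards are small in-memory lists rather than files on disk, and the reduce step is assumed to be a sum.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_pairs(rows):
    """Step 2: yield (key, value) pairs from an iterator of rows."""
    for name, amount in rows:
        yield name, amount


def run_pipeline(shards):
    """Steps 3-6: sort each shard, heap-merge, group, and reduce."""
    key = itemgetter(0)  # the same key function is used in every phase
    # Step 3: sort each shard's pairs by key.
    sorted_shards = [sorted(map_pairs(shard), key=key) for shard in shards]
    # Step 4: heap-merge the sorted shards into one stream ordered by key.
    merged = heapq.merge(*sorted_shards, key=key)
    # Step 5: group adjacent pairs that share a key.
    for k, group in groupby(merged, key=key):
        # Step 6: reduce the grouped values to a single value (a sum here).
        yield k, sum(value for _, value in group)


# Step 1 is assumed to have happened already: each inner list is one shard.
shards = [
    [("b", 2), ("a", 1)],
    [("a", 10), ("c", 3), ("b", 20)],
]
result = list(run_pipeline(shards))
print(result)  # [('a', 11), ('b', 22), ('c', 3)]
```

Grouping only works because the merged stream is already ordered by key, which is why every phase must agree on the key function. In a real run each sorted shard would be written back to disk and read lazily, so that only one shard is held in memory while sorting.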
Split data
----------------------
Use split_file to split up your data files::

    import os

    from karld.loadump import split_file

    big_file_names = [
        "bigfile1.csv",
        "bigfile2.csv",
        "bigfile3.csv"
    ]

    data_path = os.path.join('path','to','data', 'root')


    def main():
        for filename in big_file_names:
            # Name the directory to write the split files into.
            # I'll make it after the name of the file, removing the extension.
            out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

            # Split the file, with a default max_lines=200000 per shard of the file.
            split_file(os.path.join(data_path, filename), out_dir)


    if __name__ == "__main__":
        main()
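To make the idea of sharding concrete, here is a rough standard-library sketch of splitting a file into shards of at most `max_lines` lines. `split_lines` is a hypothetical stand-in written for illustration; it is not karld's `split_file`, and its shard naming and naive line-based splitting (no handling of CSV headers or quoted newlines) are assumptions.

```python
import os
import tempfile
from itertools import islice


def split_lines(path, out_dir, max_lines=200000):
    """Write consecutive runs of at most max_lines lines to numbered shards.

    Hypothetical stand-in for illustration; not karld's split_file.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    with open(path) as source:
        while True:
            # Only one shard's worth of lines is held in memory at a time.
            lines = list(islice(source, max_lines))
            if not lines:
                break
            shard_path = os.path.join(out_dir, "{}.txt".format(len(shard_paths)))
            with open(shard_path, "w") as shard:
                shard.writelines(lines)
            shard_paths.append(shard_path)
    return shard_paths


def count_lines(path):
    """Count the lines in a file without loading it whole."""
    with open(path) as handle:
        return sum(1 for _ in handle)


# Demo on a small temporary file: 5 lines, at most 2 per shard.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "small.csv")
with open(src, "w") as f:
    f.writelines("row{}\n".format(i) for i in range(5))

shard_paths = split_lines(src, os.path.join(work_dir, "split_data", "small"), max_lines=2)
shard_sizes = [count_lines(p) for p in shard_paths]
print(shard_sizes)  # [2, 2, 1]
```

Each shard is small enough to sort in memory, which is what makes the later sort and heap-merge steps possible on files that do not fit in memory as a whole.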