
Doing some data things in a memory-efficient manner

Project description

How To Data

  1. Split the data into manageable pieces.

  2. Map each piece with a generator that consumes the data as an iterator and yields key/value pairs.

  3. Sort each list of key/value pairs by the key.

  4. Merge the sorted lists by key with a heap.

  5. Group the merged key/value pairs by the key.

  6. Reduce each key's grouped values to one value, yielding a single key/value pair per key.

In lieu of a key, you may use a key function as long as it produces the same key throughout the map-sort-merge-group phases.
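The map-sort-merge-group-reduce steps above can be sketched with the standard library alone (heapq and itertools). The record shape and the map function here are invented for illustration; karld's own helpers wrap these same building blocks:

```python
import heapq
import itertools
import operator


def map_records(records):
    """Step 2: yield (key, value) pairs from an iterator of records."""
    for name, amount in records:
        yield (name, amount)


# Step 1: the data, already split into pieces.
shards = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

# Steps 2-3: map each piece to key/value pairs, sort each list by key.
sorted_shards = [sorted(map_records(shard)) for shard in shards]

# Step 4: merge the sorted lists with a heap (heapq.merge is lazy).
merged = heapq.merge(*sorted_shards)

# Step 5: group the merged pairs by key.
grouped = itertools.groupby(merged, key=operator.itemgetter(0))

# Step 6: reduce each key's values to a single key/value pair.
totals = [(key, sum(value for _, value in pairs)) for key, pairs in grouped]
# totals == [("a", 4), ("b", 2), ("c", 4)]
```

Because heapq.merge and groupby are both lazy, only one row per shard needs to be in memory at a time once the per-shard sorts are done.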

Split data

Use split_file to split up your data files, or use split_csv_file for csv files that may contain multi-line fields, so those fields are not broken across shards:

import os

import karld

big_file_names = [
    "bigfile1.csv",
    "bigfile2.csv",
    "bigfile3.csv"
]

data_path = os.path.join('path','to','data', 'root')


def main():
    for filename in big_file_names:
        # Name the directory to write the split files into based
        # on the name of the file.
        out_dir = os.path.join(data_path, 'split_data', filename.replace('.csv', ''))

        # Split the file, with a default max_lines=200000 per shard of the file.
        karld.io.split_csv_file(os.path.join(data_path, filename), out_dir)


if __name__ == "__main__":
    main()
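Once a file has been split, the shards can be streamed back as one lazy sequence of rows. This is a standard-library sketch; the shard file names below are invented for illustration (karld chooses its own names when it writes the split files):

```python
import csv
import glob
import itertools
import os
import tempfile

# Stand-in for a directory of split files: write two small csv shards.
out_dir = tempfile.mkdtemp()
for name, shard_rows in [("part1.csv", [["a", "1"]]), ("part2.csv", [["b", "2"]])]:
    with open(os.path.join(out_dir, name), "w", newline="") as f:
        csv.writer(f).writerows(shard_rows)


def iter_csv_rows(paths):
    """Lazily yield rows from each csv file in turn."""
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield row


# Sort the paths so the shards are read back in a stable order.
paths = sorted(glob.glob(os.path.join(out_dir, "*.csv")))
rows = list(iter_csv_rows(paths))
# rows == [["a", "1"], ["b", "2"]]
```

The generator opens only one shard at a time, so memory use stays flat no matter how many shards there are.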

When you’re generating data and want to shard it out to files based on quantity, use one of the split output functions such as split_file_output_csv, split_file_output or split_file_output_json:

import os
import pathlib

import karld


def main():
    """
    Python 2 version
    """

    items = (str(x) + os.linesep for x in range(2000))

    out_dir = pathlib.Path('shgen')
    karld.io.ensure_dir(str(out_dir))

    karld.io.split_file_output('big_data', items, str(out_dir))


if __name__ == "__main__":
    main()

CSV serializable data:

import pathlib

import karld


def main():
    """
    From a source of data, shard it to csv files.
    """
    if karld.is_py3():
        third = chr
    else:
        third = unichr

    # Your data source
    items = ((x, x + 1, third(x + 10)) for x in range(2000))

    out_dir = pathlib.Path('shard_out_csv')

    karld.io.ensure_dir(str(out_dir))

    karld.io.split_file_output_csv('big_data.csv', items, str(out_dir))


if __name__ == "__main__":
    main()

Rows of json serializable data:

import pathlib

import karld


def main():
    """
    From a source of data, shard it to json files.
    """
    if karld.is_py3():
        third = chr
    else:
        third = unichr

    # Your data source
    items = ((x, x + 1, third(x + 10)) for x in range(2000))

    out_dir = pathlib.Path('shard_out_json')

    karld.io.ensure_dir(str(out_dir))

    karld.io.split_file_output_json('big_data.json', items, str(out_dir))


if __name__ == "__main__":
    main()

Documentation

Read the docs: http://karld.readthedocs.org/en/latest/

Expanded Getting Started at http://karld.readthedocs.org/en/latest/getting-started.html.

More examples are documented at http://karld.readthedocs.org/en/latest/source/example.html. View the source of the example files for more examples.

Contributing:

Submit any issues or questions here: https://github.com/johnwlockwood/karl_data/issues.

Make pull requests to the development branch of https://github.com/johnwlockwood/karl_data.

Documentation is written in reStructuredText and currently uses the Sphinx style for field lists: http://sphinx-doc.org/domains.html#info-field-lists

Check out closed pull requests to see the flow of development, as almost every change to master is done via a pull request on GitHub. Code Reviews are welcome, even on merged Pull Requests. Feel free to ask questions about the code.

Project details


Download files

Download the file for your platform.

Source Distribution

  • karld-0.3.1.tar.gz (36.4 kB)

Built Distributions

  • karld-0.3.1.macosx-10.12-x86_64.exe (94.2 kB)
  • karld-0.3.1-py3.5.egg (62.7 kB)
  • karld-0.3.1-py2.7.egg (61.3 kB)

