
chunkr


A library for chunking different types of data files into another file format, designed with large files in mind.

Currently supported input formats: csv, parquet

Getting started

pip install chunkr

Usage

from chunkr import create_chunks_dir

with create_chunks_dir(format, name, path, output_path, chunk_size, storage_options, write_options, **extra_args) as chunks_dir:
    ...  # process the chunk files inside the directory

Parameters:

  • format (str): the input format (csv or parquet)
  • name (str): a distinct name for the chunking job
  • path (str): the path of the input (local, SFTP, etc.; see fsspec for the possible inputs)
  • output_path (str): the path of the directory to write the chunks to
  • chunk_size (int, optional): number of records in a chunk. Defaults to 100_000.
  • storage_options (dict, optional): extra options to pass to the underlying storage, e.g. username, password. Defaults to None.
  • write_options (dict, optional): extra options for writing the chunks, passed to the respective library. Defaults to None.
  • extra_args (dict, optional): extra options passed on to the parsing system; specific to the file format

Note: chunkr currently supports only parquet as the output chunk file format.
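To make the chunk_size semantics concrete, here is a stdlib-only sketch of the general chunking idea. This is an illustration of the technique, not chunkr's actual implementation (which writes parquet via the respective parsing library); the chunk_csv helper and its file naming are made up for the example.

```python
# A stdlib-only sketch of the chunking idea: stream a CSV and write
# fixed-size record chunks to an output directory.
import csv
import itertools
import os

def chunk_csv(path, output_dir, chunk_size):
    """Split the CSV at `path` into files of at most `chunk_size` records each."""
    os.makedirs(output_dir, exist_ok=True)
    chunk_paths = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # repeat the header in every chunk
        for idx in itertools.count():
            rows = list(itertools.islice(reader, chunk_size))
            if not rows:
                break
            out_path = os.path.join(output_dir, f"chunk_{idx}.csv")
            with open(out_path, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(rows)
            chunk_paths.append(out_path)
    return chunk_paths
```

With chunk_size=10, a 25-record input yields three chunks of 10, 10, and 5 records.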

Examples

CSV

Suppose you want to chunk a CSV file of 1 million records into 10 parquet pieces. You can do it like this:

The CSV extra args are passed to PyArrow's parse options (pyarrow.csv.ParseOptions):

from chunkr import create_chunks_dir
import pandas as pd

with create_chunks_dir(
    'csv',
    'csv_test',
    'path/to/file',
    'temp/output',
    100_000,
    None,
    None,
    quote_char='"',
    delimiter=',',
    escape_char='\\',
) as chunks_dir:
    assert 1_000_000 == sum(
        len(pd.read_parquet(file)) for file in chunks_dir.iterdir()
    )

Parquet

from chunkr import create_chunks_dir
import pandas as pd

with create_chunks_dir(
    'parquet',
    'parquet_test',
    'path/to/file',
    'temp/output'
) as chunks_dir:
    assert 1_000_000 == sum(
        len(pd.read_parquet(file)) for file in chunks_dir.iterdir()
    )

Reading file(s) inside an archive (zip, tar)

Reading multiple files from a zip archive is possible. For CSV files matching /folder_in_archive/*.csv within an archive csv/archive.zip, you can do:

from chunkr import create_chunks_dir
import pandas as pd

with create_chunks_dir(
    'csv',
    'csv_test_zip',
    'zip://folder_in_archive/*.csv::csv/archive.zip',
    'temp/output'
) as chunks_dir:
    assert 1_000_000 == sum(
        len(pd.read_parquet(file)) for file in chunks_dir.iterdir()
    )
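The zip://...::... form is fsspec's URL chaining: the pattern before :: is a glob over member names inside the archive named after ::. The member-globbing idea can be sketched with the stdlib alone (fsspec handles this for you; this is just an illustration, and the file names are made up):

```python
# Build a small in-memory zip with two CSVs in a folder, then select
# the members matching a glob pattern, as fsspec's zip:// chaining does.
import fnmatch
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("folder_in_archive/a.csv", "id\n1\n")
    zf.writestr("folder_in_archive/b.csv", "id\n2\n")
    zf.writestr("readme.txt", "not a csv")

with zipfile.ZipFile(buf) as zf:
    matches = fnmatch.filter(zf.namelist(), "folder_in_archive/*.csv")
```

Only the two CSV members under folder_in_archive/ match; readme.txt is skipped.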

The one exception is reading CSV files from a tar.gz archive: in that case there can be only one CSV file within the archive:

from chunkr import create_chunks_dir
import pandas as pd

with create_chunks_dir(
    'csv',
    'csv_test_tar',
    'tar://*.csv::csv/archive_single.tar.gz',
    'temp/output'
) as chunks_dir:
    assert 1_000_000 == sum(
        len(pd.read_parquet(file)) for file in chunks_dir.iterdir()
    )
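For illustration, the single-CSV-in-a-tar.gz case can be reproduced with the stdlib tarfile module. This sketch is unrelated to chunkr's internals; the archive contents are invented for the example.

```python
# Create an in-memory tar.gz holding a single CSV, then read that one
# member back, mirroring the single-CSV-per-tar.gz case above.
import csv
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"id\n1\n2\n"
    info = tarfile.TarInfo("only.csv")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(".csv"))
    rows = list(csv.reader(io.TextIOWrapper(tar.extractfile(member))))
```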

Other file types, such as parquet, are not affected by this limitation:

from chunkr import create_chunks_dir
import pandas as pd

with create_chunks_dir(
    'parquet',
    'parquet_test',
    'tar://partition_idx=*/*.parquet::test/parquet/archive.tar.gz',
    'temp/output'
) as chunks_dir:
    assert 1_000_000 == sum(
        len(pd.read_parquet(file)) for file in chunks_dir.iterdir()
    )

Reading from an SFTP remote system

To authenticate to the SFTP server, you can pass the credentials via storage_options:

from chunkr import create_chunks_dir
import pandas as pd

# sftpserver stands in for your server; substitute its real host and port
sftp_path = f"sftp://{sftpserver.host}:{sftpserver.port}/parquet/pyarrow_snappy.parquet"

with create_chunks_dir(
    'parquet',
    'parquet_test_sftp',
    sftp_path,
    'temp/output',
    1000,
    {
        "username": "user",
        "password": "pw",
    }
) as chunks_dir:
    assert 1_000_000 == sum(
        len(pd.read_parquet(file)) for file in chunks_dir.iterdir()
    )

Download files

Source distribution: chunkr-0.2.0.tar.gz (6.3 kB)

Built distribution: chunkr-0.2.0-py3-none-any.whl (6.1 kB)
