A utility to split large files into smaller chunks.

PyFileChunker

A Python utility to intelligently split large files into smaller, manageable chunks based on user-defined record boundaries, not just arbitrary sizes or line counts.

Pain Points

Working with extremely large text-based files (e.g., logs, data dumps, XML/JSON streams) can be challenging:

  • Memory Issues: Loading the entire file into memory is often impossible.
  • Processing Bottlenecks: Processing the file sequentially can be slow.
  • Incomplete Records: Simple splitting by size or line count can break records mid-way, corrupting data or making parsing difficult.

PyFileChunker addresses these problems by splitting files only between records, so every chunk contains complete records. It uses memory mapping (mmap) to search for record boundaries without reading the whole file into memory.
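To make the "incomplete records" problem concrete, here is a minimal sketch that compares a fixed-size split with a split aligned to an end marker. It works on an in-memory byte string for brevity and is not PyFileChunker's code; the names naive_split and record_split and the <R>...</R> markers are made up for illustration.

```python
# Illustrative only: hypothetical helpers, not part of PyFileChunker.
data = b"<R>alpha</R><R>beta</R><R>gamma</R><R>delta</R>"
END = b"</R>"


def naive_split(buf: bytes, parts: int) -> list[bytes]:
    """Cut at fixed byte offsets, ignoring record markers."""
    size = -(-len(buf) // parts)  # ceiling division
    return [buf[i:i + size] for i in range(0, len(buf), size)]


def record_split(buf: bytes, parts: int, end: bytes) -> list[bytes]:
    """Cut just past the first end marker found at or after each ideal offset."""
    chunks, start = [], 0
    for k in range(1, parts):
        ideal = max(start, k * len(buf) // parts)
        hit = buf.find(end, ideal)
        if hit == -1:  # no further complete record to cut after
            break
        cut = hit + len(end)
        if cut > start:
            chunks.append(buf[start:cut])
            start = cut
    if start < len(buf):  # whatever remains forms the last chunk
        chunks.append(buf[start:])
    return chunks


naive = naive_split(data, 2)
aligned = record_split(data, 2, END)

print(naive[0][-8:])    # tail of the first fixed-size chunk (cut mid-record)
print(aligned[0][-8:])  # tail of the first aligned chunk (ends on the marker)
```

The aligned split can return fewer chunks than requested, and the chunks are uneven in size. That is the trade-off for keeping every record intact.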

Features

  • Record Boundary Splitting: Define custom start and end markers for your records (e.g., <RECORD>, </RECORD>, BEGIN TRANSACTION, END TRANSACTION).
  • Intelligent Boundary Finding: Attempts to find the nearest record boundaries around the ideal chunk split points.
  • Memory Efficient: Uses mmap to avoid loading the entire file into memory when locating split points.
  • Configurable: Control the desired number of chunks and the record markers.
  • Command-Line Interface: Easy-to-use CLI powered by Typer.
  • Python Module: Can be imported and used directly in your Python scripts.
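As a rough illustration of boundary finding with mmap, the sketch below looks for the end marker closest to each ideal split point, searching both backward and forward. It uses only the standard library and is an assumption about how such a search could work, not the package's actual implementation; nearest_boundary and split_offsets are hypothetical names.

```python
import mmap


def nearest_boundary(mm: mmap.mmap, ideal: int, end_marker: bytes) -> int:
    """Return the offset just past the end marker closest to `ideal`."""
    # Backward search covers markers that start before `ideal`,
    # including one that straddles it.
    stop = min(ideal + len(end_marker) - 1, len(mm))
    before = mm.rfind(end_marker, 0, stop)
    # Forward search covers markers that start at or after `ideal`.
    after = mm.find(end_marker, ideal)
    candidates = [pos + len(end_marker) for pos in (before, after) if pos != -1]
    if not candidates:
        return len(mm)  # no marker anywhere: nothing safe to cut at
    return min(candidates, key=lambda off: abs(off - ideal))


def split_offsets(path: str, num_chunks: int, end_marker: bytes) -> list[int]:
    """Return sorted byte offsets at which the file could be cut."""
    # Note: mmap cannot map an empty file, so callers should check the size first.
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        offsets = {
            nearest_boundary(mm, k * size // num_chunks, end_marker)
            for k in range(1, num_chunks)
        }
    # Drop duplicates and offsets that would produce an empty chunk.
    return sorted(off for off in offsets if 0 < off < size)
```

Because the operating system pages the mapped file in on demand, find and rfind only touch the regions they scan. Two ideal points can resolve to the same boundary, which is one reason a tool like this may produce fewer chunks than requested.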

Installation

From PyPI (Recommended):

pip install pyfilechunker

From Source:

  1. Clone the repository:
    git clone https://github.com/fxyzbtc/pyfilechunker.git
    cd pyfilechunker
    
  2. Install using pip:
    pip install .
    
    For development, install in editable mode with development dependencies:
    pip install -e ".[dev]"
    

Usage

You can use PyFileChunker in three ways:

1. As an Installed Script (filechunker)

This is the most common approach once the package has been installed with pip.

filechunker [OPTIONS] FILE_PATH

Arguments:

  • FILE_PATH: The path to the large file you want to chunk. [required]

Options:

  • --num-chunks INTEGER: The desired number of chunks. [default: 5]
  • --record-begin TEXT: String marking the beginning of a record. [default: <SUBBEGIN>]
  • --record-end TEXT: String marking the end of a record. [default: <SUBEND>]
  • --output-dir DIRECTORY: The directory to save the chunk files. [default: . (current directory)]
  • --help: Show the help message and exit.

Example:

Split my_large_log.log into approximately 10 chunks, using START and END as record markers, saving chunks to the output_chunks/ directory:

filechunker my_large_log.log --num-chunks 10 --record-begin "START" --record-end "END" --output-dir output_chunks/
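To try the command above without a real log at hand, a small test file can be generated first. The sketch below is only a convenience and not part of PyFileChunker; write_sample_log is a hypothetical helper, and the record layout (a START line, two data lines, an END line) is an arbitrary choice.

```python
from pathlib import Path


def write_sample_log(path: str, num_records: int = 1000) -> int:
    """Write records wrapped in START/END lines and return the byte count."""
    lines = []
    for i in range(num_records):
        lines.append("START")
        lines.append(f"id={i}")
        lines.append(f"payload={'x' * (i % 50)}")  # vary record length a little
        lines.append("END")
    data = ("\n".join(lines) + "\n").encode("utf-8")
    Path(path).write_bytes(data)  # bytes avoid platform newline translation
    return len(data)
```

Calling write_sample_log("my_large_log.log") in the working directory produces an input file that matches the START and END markers used in the example.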

2. As a Python Module (python -m pyfilechunker)

You can run the module directly using Python's -m flag. This is useful if the script isn't in your PATH or you prefer this invocation. The arguments and options are the same as the script.

python -m pyfilechunker [OPTIONS] FILE_PATH

Example:

python -m pyfilechunker data.xml --num-chunks 20 --record-begin "<item>" --record-end "</item>" --output-dir ./chunks

3. Importing in Python Code

You can import and use the chunk_it function directly in your Python scripts for more complex workflows.

from pyfilechunker import chunk_it
from pathlib import Path

input_file = "path/to/your/large_file.txt"
output_directory = "chunk_output"
num_chunks_desired = 15
start_marker = "BEGIN_RECORD"
end_marker = "END_RECORD"

try:
    # Ensure output directory exists
    Path(output_directory).mkdir(parents=True, exist_ok=True)

    created_files = chunk_it(
        filename=input_file,
        num_chunks=num_chunks_desired,
        record_begin=start_marker,
        record_end=end_marker,
        output_dir=output_directory
    )

    if created_files:
        print(f"Successfully created {len(created_files)} chunks in '{output_directory}':")
        for f in created_files:
            print(f"- {f}")
    else:
        print("No chunk files were created.")

except FileNotFoundError:
    print(f"Error: Input file not found at {input_file}")
except Exception as e:
    print(f"An error occurred: {e}")
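After chunking, it can be worth checking that no chunk ends mid-record. The sketch below counts begin and end markers in each chunk and reports any file where the two counts differ. It assumes the value returned by chunk_it is an iterable of file paths, as the example above suggests; marker_counts and check_chunks are hypothetical helpers, not part of the package.

```python
from pathlib import Path


def marker_counts(path, begin: str, end: str, encoding: str = "utf-8") -> tuple[int, int]:
    """Return (begin_count, end_count) for one chunk file."""
    # Reads the whole chunk; fine for chunks, not for the original large file.
    text = Path(path).read_text(encoding=encoding)
    return text.count(begin), text.count(end)


def check_chunks(paths, begin: str, end: str) -> list[str]:
    """Return the paths whose begin and end marker counts differ."""
    bad = []
    for p in paths:
        begins, ends = marker_counts(p, begin, end)
        if begins != ends:
            bad.append(str(p))
    return bad
```

With the variables from the example, check_chunks(created_files, start_marker, end_marker) should return an empty list when every chunk holds only complete records. The check is a heuristic: it miscounts if one marker is a substring of the other, or if marker text also appears inside record bodies.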

Development Guide

Prerequisites:

  • Python >= 3.12
  • Git
  • pip and venv (recommended)

Setup:

  1. Clone: git clone https://github.com/fxyzbtc/pyfilechunker.git && cd pyfilechunker
  2. Create Virtual Environment: python -m venv .venv
  3. Activate Environment:
    • Windows: .venv\Scripts\activate
    • macOS/Linux: source .venv/bin/activate
  4. Install Dependencies: pip install -e ".[dev]" (This installs the package in editable mode along with pytest. The quotes keep shells such as zsh from interpreting the square brackets.)

Running Tests:

Make sure your virtual environment is activated.

pytest

Building the Package:

Ensure you have the build tools installed:

pip install build

Then run the build command:

python -m build

This will create distribution files (wheel and sdist) in the dist/ directory.

Contributing:

Contributions are welcome! Please feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.
