A utility to split large files into smaller chunks.

Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Utilities

Project description

PyFileChunker

A Python utility to intelligently split large files into smaller, manageable chunks based on user-defined record boundaries, not just arbitrary sizes or line counts.

Pain Points

Working with extremely large text-based files (e.g., logs, data dumps, XML/JSON streams) can be challenging:

Memory Issues: Loading the entire file into memory is often impossible.
Processing Bottlenecks: Processing the file sequentially can be slow.
Incomplete Records: Simple splitting by size or line count can break records mid-way, corrupting data or making parsing difficult.
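
The last point can be seen in a few lines of plain Python. The sketch below is an illustration only and does not use PyFileChunker; the helper names `split_by_size` and `holds_whole_records` are made up for this example. It builds four small records wrapped in the default markers and cuts the buffer into three equal-sized pieces, the way a naive size-based splitter would.

```python
BEGIN, END = b"<SUBBEGIN>", b"<SUBEND>"

# Four small records, each wrapped in the default markers.
data = b"".join(BEGIN + b"record %d" % i + END + b"\n" for i in range(4))


def split_by_size(buf: bytes, num_chunks: int) -> list[bytes]:
    """Cut buf into num_chunks pieces of (almost) equal byte length."""
    size = -(-len(buf) // num_chunks)  # ceiling division
    return [buf[i:i + size] for i in range(0, len(buf), size)]


def holds_whole_records(chunk: bytes) -> bool:
    """Cheap check: the chunk starts with a begin marker and ends with an end marker."""
    body = chunk.strip()
    return body.startswith(BEGIN) and body.endswith(END)


naive_chunks = split_by_size(data, 3)
for chunk in naive_chunks:
    print(holds_whole_records(chunk), chunk)
```

None of the three pieces passes the check, because every cut lands in the middle of a record or a marker.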

PyFileChunker addresses these problems by splitting files only at record boundaries, so every chunk contains complete records. It uses memory mapping (mmap) to locate those boundaries without reading the whole file into memory.

Features

Record Boundary Splitting: Define custom start and end markers for your records (e.g., <RECORD>, </RECORD>, BEGIN TRANSACTION, END TRANSACTION).
Intelligent Boundary Finding: Attempts to find the nearest record boundaries around the ideal chunk split points.
Memory Efficient: Uses mmap to avoid loading the entire file into memory when locating split points.
Configurable: Control the desired number of chunks and the record markers.
Command-Line Interface: Easy-to-use CLI powered by Typer.
Python Module: Can be imported and used directly in your Python scripts.

Installation

From PyPI (Recommended):

pip install pyfilechunker

From Source:

Clone the repository:

git clone https://github.com/fxyzbtc/pyfilechunker.git
cd pyfilechunker

Install using pip:
```
pip install .
```
For development, install in editable mode with development dependencies:
```
pip install -e ".[dev]"
```

Usage

You can use PyFileChunker in three ways:

1. As an Installed Script (filechunker)

This is the most common way to run the tool after installing it via pip.

filechunker [OPTIONS] FILE_PATH

Arguments:

FILE_PATH: The path to the large file you want to chunk. [required]

Options:

--num-chunks INTEGER: The desired number of chunks. [default: 5]
--record-begin TEXT: String marking the beginning of a record. [default: <SUBBEGIN>]
--record-end TEXT: String marking the end of a record. [default: <SUBEND>]
--output-dir DIRECTORY: The directory to save the chunk files. [default: . (current directory)]
--help: Show the help message and exit.

Example:

Split my_large_log.log into approximately 10 chunks, using START and END as record markers, saving chunks to the output_chunks/ directory:

filechunker my_large_log.log --num-chunks 10 --record-begin "START" --record-end "END" --output-dir output_chunks/

2. As a Python Module (python -m pyfilechunker)

You can run the module directly using Python's -m flag. This is useful if the script isn't in your PATH or you prefer this invocation. The arguments and options are the same as for the installed script.

python -m pyfilechunker [OPTIONS] FILE_PATH

Example:

python -m pyfilechunker data.xml --num-chunks 20 --record-begin "<item>" --record-end "</item>" --output-dir ./chunks

3. Importing in Python Code

You can import and use the chunk_it function directly in your Python scripts for more complex workflows.

from pyfilechunker import chunk_it
from pathlib import Path

input_file = "path/to/your/large_file.txt"
output_directory = "chunk_output"
num_chunks_desired = 15
start_marker = "BEGIN_RECORD"
end_marker = "END_RECORD"

try:
    # Ensure output directory exists
    Path(output_directory).mkdir(parents=True, exist_ok=True)

    created_files = chunk_it(
        filename=input_file,
        num_chunks=num_chunks_desired,
        record_begin=start_marker,
        record_end=end_marker,
        output_dir=output_directory
    )

    if created_files:
        print(f"Successfully created {len(created_files)} chunks in '{output_directory}':")
        for f in created_files:
            print(f"- {f}")
    else:
        print("No chunk files were created.")

except FileNotFoundError:
    print(f"Error: Input file not found at {input_file}")
except Exception as e:
    print(f"An error occurred: {e}")
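
Once the chunks exist, they can be processed concurrently. The following is a minimal sketch using only the standard library. It assumes `chunk_it` returns an iterable of chunk file paths, as the example above suggests; `count_records` and `process_chunks` are hypothetical helpers written for this illustration, and the demonstration creates its own small files instead of calling `chunk_it`.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_records(chunk_path, record_begin: str = "BEGIN_RECORD") -> int:
    """Example workload: count how many records one chunk file contains."""
    text = Path(chunk_path).read_text(encoding="utf-8")
    return text.count(record_begin)


def process_chunks(chunk_paths, worker=count_records, max_workers: int = 4) -> list:
    """Apply worker to every chunk file concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunk_paths))


# Stand-in for the list returned by chunk_it: three small chunk files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for n in (1, 2, 3):
        p = Path(tmp) / f"chunk_{n}.txt"
        p.write_text("BEGIN_RECORD payload END_RECORD\n" * n, encoding="utf-8")
        paths.append(p)
    counts = process_chunks(paths)
print(counts)  # [1, 2, 3]
```

Threads suit I/O-bound work such as reading files. For CPU-heavy parsing, `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface and may be a better fit.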

Development Guide

Prerequisites:

Python >= 3.12
Git
pip and venv (recommended)

Setup:

Clone: git clone https://github.com/fxyzbtc/pyfilechunker.git && cd pyfilechunker
Create Virtual Environment: python -m venv .venv
Activate Environment:
- Windows: .venv\Scripts\activate
- macOS/Linux: source .venv/bin/activate
Install Dependencies: pip install -e ".[dev]" (This installs the package in editable mode along with pytest.)

Running Tests:

Make sure your virtual environment is activated.

pytest

Building the Package:

Ensure you have the build tools installed:

pip install build

Then run the build command:

python -m build

This will create distribution files (wheel and sdist) in the dist/ directory.

Contributing:

Contributions are welcome! Please feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.

Release history

0.1.2 (Apr 28, 2025), this version
0.1.1 (Apr 27, 2025)
0.1.0 (Apr 27, 2025)

Download files

Source Distribution

pyfilechunker-0.1.2.tar.gz (14.2 kB), uploaded Apr 28, 2025 (Source)

Built Distribution

pyfilechunker-0.1.2-py3-none-any.whl (10.1 kB), uploaded Apr 28, 2025 (Python 3)

File details

Details for the file pyfilechunker-0.1.2.tar.gz.

File metadata

Download URL: pyfilechunker-0.1.2.tar.gz
Upload date: Apr 28, 2025
Size: 14.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.25

File hashes

Hashes for pyfilechunker-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`46d1a586b8166354c94dd7afa0dc55577ed65e36f2039da94c76a85195a188a3`
MD5	`10218bae33fe38551d420f9be3e0dbd4`
BLAKE2b-256	`d53a56510bc1cb409a2937937f2f16c9283f211453007f0b5ab70ca88fe34c4f`

File details

Details for the file pyfilechunker-0.1.2-py3-none-any.whl.

File metadata

Download URL: pyfilechunker-0.1.2-py3-none-any.whl
Upload date: Apr 28, 2025
Size: 10.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.25

File hashes

Hashes for pyfilechunker-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`15f6a74080f39e50ba8218ec5699ec280fb2414f83790dec6e8bee7fb06ed704`
MD5	`9f4a219a3ab691ecc1c1f4ca32eab290`
BLAKE2b-256	`f786ced708e119195e25c3acea3d73f4d4cf9ee7b34bee63017e998b5a506d90`
