A utility to split large files into smaller chunks.
PyFileChunker
A Python utility to intelligently split large files into smaller, manageable chunks based on user-defined record boundaries, not just arbitrary sizes or line counts.
Pain Points
Working with extremely large text-based files (e.g., logs, data dumps, XML/JSON streams) can be challenging:
- Memory Issues: Loading the entire file into memory is often impossible.
- Processing Bottlenecks: Processing the file sequentially can be slow.
- Incomplete Records: Simple splitting by size or line count can break records mid-way, corrupting data or making parsing difficult.
PyFileChunker addresses these pains by allowing you to split files between records, ensuring each chunk contains complete records. It uses memory mapping (mmap) for efficiency when searching for boundaries.
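To make the "incomplete records" problem concrete, here is a minimal in-memory sketch that compares a plain size-based split with a split that only cuts right after an end marker. The function names `naive_split` and `record_aware_split` are illustrative assumptions and are not part of PyFileChunker; the markers are the CLI defaults documented under Usage.

```python
BEGIN = b"<SUBBEGIN>"
END = b"<SUBEND>"


def naive_split(data: bytes, num_chunks: int) -> list[bytes]:
    """Cut data into pieces of (almost) equal size, ignoring record markers."""
    size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def record_aware_split(data: bytes, num_chunks: int, record_end: bytes) -> list[bytes]:
    """Cut near the same offsets, but only directly after a record_end marker."""
    ideal = len(data) // num_chunks
    chunks, start = [], 0
    for i in range(1, num_chunks):
        target = max(i * ideal, start)       # never search before the last cut
        pos = data.find(record_end, target)  # next end marker at or after the target
        if pos == -1:
            break                            # no further boundary; the rest is one chunk
        cut = pos + len(record_end)
        chunks.append(data[start:cut])
        start = cut
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Six small records, split into four chunks both ways.
sample = b"".join(b"<SUBBEGIN>record %d<SUBEND>\n" % i for i in range(6))
naive = naive_split(sample, 4)
aware = record_aware_split(sample, 4, END)

for label, chunks in (("naive", naive), ("record-aware", aware)):
    balanced = all(c.count(BEGIN) == c.count(END) for c in chunks)
    print(f"{label}: {len(chunks)} chunks, every chunk has complete records: {balanced}")
```

In the record-aware version a chunk may be somewhat larger or smaller than the ideal size, which is the trade-off for never cutting a record in half.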
Features
- Record Boundary Splitting: Define custom start and end markers for your records (e.g., <RECORD>, </RECORD>, BEGIN TRANSACTION, END TRANSACTION).
- Intelligent Boundary Finding: Attempts to find the nearest record boundaries around the ideal chunk split points.
- Memory Efficient: Uses mmap to avoid loading the entire file into memory when locating split points.
- Configurable: Control the desired number of chunks and the record markers.
- Command-Line Interface: Easy-to-use CLI powered by Typer.
- Python Module: Can be imported and used directly in your Python scripts.
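The boundary-finding and mmap features can be pictured with a short sketch. It is an assumption about how such a search might look, not PyFileChunker's actual implementation: the hypothetical helper `nearest_boundary` maps the file read-only and looks both backward and forward from an ideal offset for the closest end marker.

```python
import mmap
import os


def nearest_boundary(path: str, ideal_offset: int, record_end: bytes) -> int:
    """Return the offset just past the record_end marker closest to ideal_offset.

    Returns -1 when the file is empty or contains no marker at all.
    """
    if not record_end:
        raise ValueError("record_end must not be empty")
    if os.path.getsize(path) == 0:
        return -1  # an empty file cannot be memory-mapped

    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        ideal_offset = max(0, min(ideal_offset, len(mm)))
        # Backward search: markers that start before the ideal offset,
        # including one that straddles it.
        limit = min(len(mm), ideal_offset + len(record_end) - 1)
        before = mm.rfind(record_end, 0, limit)
        # Forward search: markers that start at or after the ideal offset.
        after = mm.find(record_end, ideal_offset)

        cuts = [pos + len(record_end) for pos in (before, after) if pos != -1]
        if not cuts:
            return -1
        return min(cuts, key=lambda cut: abs(cut - ideal_offset))
```

Because the operating system pages the mapped file in on demand, only the regions touched by `find` and `rfind` need to be resident, which is why this style of search suits files larger than available memory.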
Installation
From PyPI (Recommended):
pip install pyfilechunker
From Source:
- Clone the repository:
git clone https://github.com/fxyzbtc/pyfilechunker.git # Replace with your actual repo URL
cd pyfilechunker
- Install using pip:
pip install .
For development, install in editable mode with development dependencies:
pip install -e .[dev]
Usage
You can use PyFileChunker in three ways:
1. As an Installed Script (filechunker)
This is the most common way after installing via pip.
filechunker [OPTIONS] FILE_PATH
Arguments:
FILE_PATH: The path to the large file you want to chunk. [required]
Options:
--num-chunks INTEGER: The desired number of chunks. [default: 5]
--record-begin TEXT: String marking the beginning of a record. [default: <SUBBEGIN>]
--record-end TEXT: String marking the end of a record. [default: <SUBEND>]
--output-dir DIRECTORY: The directory to save the chunk files. [default: . (current directory)]
--help: Show the help message and exit.
Example:
Split my_large_log.log into approximately 10 chunks, using START and END as record markers, saving chunks to the output_chunks/ directory:
filechunker my_large_log.log --num-chunks 10 --record-begin "START" --record-end "END" --output-dir output_chunks/
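To try the command above without a real log at hand, a small test file can be generated first. The helper below is a hypothetical convenience script, not something shipped with PyFileChunker, and the record layout (an id line and a message line between START and END) is invented purely for illustration.

```python
from pathlib import Path


def write_sample_log(path: str, num_records: int = 1000) -> int:
    """Write num_records multi-line records delimited by START and END lines.

    Returns the number of bytes written.
    """
    lines = []
    for i in range(num_records):
        lines.append("START")
        lines.append(f"id={i}")
        lines.append(f"message=sample event {i}")
        lines.append("END")
    payload = ("\n".join(lines) + "\n").encode("utf-8")
    Path(path).write_bytes(payload)  # bytes avoid platform newline translation
    return len(payload)


# Example call (writes into the current directory):
# write_sample_log("my_large_log.log", num_records=5000)
```

With the file in place, the example invocation with --record-begin "START" and --record-end "END" has matching markers to split on.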
2. As a Python Module (python -m pyfilechunker)
You can run the module directly using Python's -m flag. This is useful if the script isn't in your PATH or you prefer this invocation. The arguments and options are the same as the script.
python -m pyfilechunker [OPTIONS] FILE_PATH
Example:
python -m pyfilechunker data.xml --num-chunks 20 --record-begin "<item>" --record-end "</item>" --output-dir ./chunks
3. Importing in Python Code
You can import and use the chunk_it function directly in your Python scripts for more complex workflows.
from pathlib import Path

from pyfilechunker import chunk_it

input_file = "path/to/your/large_file.txt"
output_directory = "chunk_output"
num_chunks_desired = 15
start_marker = "BEGIN_RECORD"
end_marker = "END_RECORD"

try:
    # Ensure output directory exists
    Path(output_directory).mkdir(parents=True, exist_ok=True)

    created_files = chunk_it(
        filename=input_file,
        num_chunks=num_chunks_desired,
        record_begin=start_marker,
        record_end=end_marker,
        output_dir=output_directory,
    )

    if created_files:
        print(f"Successfully created {len(created_files)} chunks in '{output_directory}':")
        for f in created_files:
            print(f"- {f}")
    else:
        print("No chunk files were created.")
except FileNotFoundError:
    print(f"Error: Input file not found at {input_file}")
except Exception as e:
    print(f"An error occurred: {e}")
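After chunking, it can be reassuring to confirm that each chunk really holds whole records. The sketch below is a hypothetical post-check, not part of PyFileChunker. It assumes that each entry returned by chunk_it can be passed to pathlib.Path, that the markers never appear inside record bodies, and that neither marker is a substring of the other.

```python
from pathlib import Path


def count_records(chunk_path, record_begin: str, record_end: str) -> int:
    """Count the records in one chunk file; raise if the markers are unbalanced."""
    data = Path(chunk_path).read_bytes()
    begins = data.count(record_begin.encode("utf-8"))
    ends = data.count(record_end.encode("utf-8"))
    if begins != ends:
        raise ValueError(
            f"{chunk_path}: {begins} begin markers but {ends} end markers"
        )
    return begins


def summarize_chunks(chunk_paths, record_begin: str, record_end: str) -> dict:
    """Map each chunk path (as a string) to its record count."""
    return {str(p): count_records(p, record_begin, record_end) for p in chunk_paths}
```

For the example above, `summarize_chunks(created_files, start_marker, end_marker)` would give a per-chunk record count, and the sum can be compared against the count for the original file.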
Development Guide
Prerequisites:
- Python >= 3.12
- Git
- pip and venv (recommended)
Setup:
- Clone:
git clone https://github.com/fxyzbtc/pyfilechunker.git && cd pyfilechunker
- Create Virtual Environment:
python -m venv .venv
- Activate Environment:
  - Windows: .venv\Scripts\activate
  - macOS/Linux: source .venv/bin/activate
- Install Dependencies:
pip install -e .[dev]
(This installs the package in editable mode along with pytest)
Running Tests:
Make sure your virtual environment is activated.
pytest
Building the Package:
Ensure you have the build tools installed:
pip install build
Then run the build command:
python -m build
This will create distribution files (wheel and sdist) in the dist/ directory.
Contributing:
Contributions are welcome! Please feel free to open an issue or submit a pull request.
License
This project is licensed under the MIT License.
File details
Details for the file pyfilechunker-0.1.2.tar.gz.
File metadata
- Download URL: pyfilechunker-0.1.2.tar.gz
- Upload date:
- Size: 14.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 46d1a586b8166354c94dd7afa0dc55577ed65e36f2039da94c76a85195a188a3 |
| MD5 | 10218bae33fe38551d420f9be3e0dbd4 |
| BLAKE2b-256 | d53a56510bc1cb409a2937937f2f16c9283f211453007f0b5ab70ca88fe34c4f |
File details
Details for the file pyfilechunker-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pyfilechunker-0.1.2-py3-none-any.whl
- Upload date:
- Size: 10.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15f6a74080f39e50ba8218ec5699ec280fb2414f83790dec6e8bee7fb06ed704 |
| MD5 | 9f4a219a3ab691ecc1c1f4ca32eab290 |
| BLAKE2b-256 | f786ced708e119195e25c3acea3d73f4d4cf9ee7b34bee63017e998b5a506d90 |
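A downloaded distribution file can be checked against the SHA256 digests listed in the file hashes tables. The following is a generic sketch using Python's standard hashlib module; the helper name `sha256_of_file` and the local file path in the usage comment are assumptions.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Example check (assumes the sdist was downloaded into the current directory):
# expected = "46d1a586b8166354c94dd7afa0dc55577ed65e36f2039da94c76a85195a188a3"
# assert sha256_of_file("pyfilechunker-0.1.2.tar.gz") == expected
```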