A Python command-line tool to split large Excel (.xls or .xlsx) files into smaller parts with low memory usage.
Project description
Spreadsheet Splitter
spreadsheet_splitter is a Python command-line tool designed to split large Excel files (.xls or .xlsx) into smaller, manageable parts. It supports processing large files iteratively with low memory usage, making it ideal for handling files that exceed available RAM (e.g., >4 GB). The tool preserves specified title rows, logs processing details (including file profiling and memory usage), and supports resuming interrupted tasks.
Features
- Split
.xlsand.xlsxfiles into two equal parts or smaller chunks. - Process large files iteratively with configurable rows per iteration.
- Preserve title rows in non-large mode.
- Validate column counts and profile data types for
.xlsxfiles. - Monitor memory usage during processing.
- Resume processing from the last checkpoint using a log file.
- Built with
xlrd,xlwt,openpyxl,pandas,pyexcelerate, andpsutil.
Prerequisites
To build and run spreadsheet_splitter locally, ensure you have:
-
Python: Version 3.13 or higher.
-
uv: A Python package manager for managing dependencies and virtual environments. Install it with:
# macOS/Linux curl -LsSf https://astral.sh/uv/install.sh | sh
<https://docs.astral.sh/uv/getting-started/installation/#__tabbed_1_1>
# Windows powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
<https://docs.astral.sh/uv/getting-started/installation/#__tabbed_1_2>
-
Git: To clone the repository.
-
An Excel file (
.xlsor.xlsx) to process.
Building from Source
Follow these steps to build spreadsheet_splitter from source:
-
Clone the Repository
git clone https://github.com/parsaloi/spreadsheet_splitter.git cd spreadsheet_splitter
-
Set Up a Virtual Environment Create and activate a virtual environment using
uv:uv venv source .venv/bin/activate # On Linux/macOS .venv\Scripts\activate # On Windows
-
Install Dependencies Install the required dependencies specified in
pyproject.toml:uv sync -
Verify Setup Check that the
spreadsheet-splittercommand is available:uv run spreadsheet-splitter --help
This should display the CLI help message with available arguments.
Running the Tool
The spreadsheet_splitter tool provides two modes:
- Default Mode: Splits an Excel file into two equal parts, preserving title rows.
- Large Mode: Splits large files into smaller chunks with a specified number of rows per iteration, ideal for low-memory environments.
Example Commands
-
Split a File into Two Parts
uv run spreadsheet-splitter /path/to/file.xlsx --output-dir split_files --title-rows 1
- Splits
file.xlsxintosplit_files/file_part1.xlsxandsplit_files/file_part2.xlsx. - Preserves 1 title row in each output file.
- Splits
-
Split a Large File Iteratively
uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
- Processes
very_large_file.xlsxin chunks of 500 rows, up to 5 iterations (2500 rows total). - Reads only the first 22 columns.
- Outputs files like
split_files/very_large_file_part001.xlsx,split_files/very_large_file_part002.xlsx, etc. - Logs progress to
split_log.json.
- Processes
-
Resume Processing
uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --resume --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
- Resumes from the last processed row recorded in
split_log.json.
- Resumes from the last processed row recorded in
Command-Line Arguments
| Argument | Description | Default |
|---|---|---|
input_file |
Path to the input .xls or .xlsx file |
Required |
-o, --output-dir |
Directory to save output files | output |
-t, --title-rows |
Number of top rows to treat as title rows (ignored in --large mode) |
1 |
--large |
Process large files iteratively with limited rows per iteration | False |
--rows-per-iteration |
Number of data rows to process per iteration | 500 |
--max-iterations |
Maximum number of iterations to process | 10 |
--resume |
Resume processing from the last row in the log file | False |
--log-file |
Path to the log file for tracking processed rows | split_log.json |
--columns |
Number of columns to process | All columns |
Log File
The tool generates a split_log.json file to track processing details, including:
- Input and Output: Input file path and output directory.
- Processed Ranges: Start and end rows for each output file.
- File Profile: Total rows, columns, and data types (for
.xlsx). - Memory Usage: Memory before and after reading each chunk (for
.xlsx). - Timestamp: When the log was last updated.
Example split_log.json:
{
"input_path": "/path/to/very_large_file.xlsx",
"output_dir": "split_files",
"rows_per_iteration": 500,
"processed_ranges": [
{
"file": "split_files/very_large_file_part001.xlsx",
"start_row": 1,
"end_row": 500,
"iteration": 1
}
],
"last_row_processed": 499,
"timestamp": "2025-05-27T11:12:46.172255",
"columns": 22,
"file_profile": {
"total_rows": 1,
"total_columns": 22,
"data_types": {
"ZONE": "float64",
"SITE": "object",
...
}
},
"memory_usage": [
{
"iteration": 1,
"memory_before_mb": 256.55078125,
"memory_after_mb": 259.6328125
}
]
}
Troubleshooting
-
Column Count Error: If you see
Specified columns (N) exceeds actual columns (M), verify the column count:import pandas as pd df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=1, engine='openpyxl') print(df.shape[1], "columns")
Use the correct number with
--columns. -
Empty File Warning: If the tool warns that the input file has no rows, check the file content:
import pandas as pd df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=10, engine='openpyxl') print(df.shape[0], "rows")
-
Memory Issues: If memory usage is high, reduce
--rows-per-iteration(e.g., to 100). Checksplit_log.jsonfor memory usage details. -
Slow Processing: Ensure the input file is on a fast local drive (e.g., SSD). For
.xlsxfiles with many unique strings, consider converting to.csvfirst:import pandas as pd df = pd.read_excel("/path/to/very_large_file.xlsx", engine='openpyxl') df.to_csv("very_large_file.csv", index=False)
Future Work
- Package the tool for PyPI with cross-platform installation instructions.
- Add support for
.csvinput files. - Enhance profiling with more detailed statistics (e.g., unique values per column).
License
MIT License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file spreadsheet_splitter-0.1.0.tar.gz.
File metadata
- Download URL: spreadsheet_splitter-0.1.0.tar.gz
- Upload date:
- Size: 19.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8359ca62c06b93841fb397a5a4caa5426fa0d4a68d1338571a0c033d77fd384f
|
|
| MD5 |
e2bf0c3ea24152051b3eef4ea8b7b23f
|
|
| BLAKE2b-256 |
f77a54b07bbbb51f5755b13ed80353bef7a05ef86b197145b3bec64ffefb3c4c
|
File details
Details for the file spreadsheet_splitter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: spreadsheet_splitter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
225ff75533369f9ca21b9e2f00b754268a9e797590a14761436ed6df28d756a4
|
|
| MD5 |
f3e3c25bb379cbac249ccacb9d844c0d
|
|
| BLAKE2b-256 |
c5eaacf342438768b15a3873c48502f80b753fdbc9a36f63098d39c7e6a3993e
|