A Python command-line tool to split large Excel (.xls or .xlsx) files into smaller parts with low memory usage.

These details have not been verified by PyPI

Project links

Project description

Spreadsheet Splitter

spreadsheet_splitter is a Python command-line tool designed to split large Excel files (.xls or .xlsx) into smaller, manageable parts. It supports processing large files iteratively with low memory usage, making it ideal for handling files that exceed available RAM (e.g., >4 GB). The tool preserves specified title rows, logs processing details (including file profiling and memory usage), and supports resuming interrupted tasks.

Features

Split .xls and .xlsx files into two equal parts or smaller chunks.
Process large files iteratively with configurable rows per iteration.
Preserve title rows in non-large mode.
Validate column counts and profile data types for .xlsx files.
Monitor memory usage during processing.
Resume processing from the last checkpoint using a log file.
Built with xlrd, xlwt, openpyxl, pandas, pyexcelerate, and psutil.

Prerequisites

To build and run spreadsheet_splitter locally, ensure you have:

Python: Version 3.13 or higher.
uv: A Python package manager for managing dependencies and virtual environments. Install it with:
```
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```
<https://docs.astral.sh/uv/getting-started/installation/#__tabbed_1_1>
```
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
<https://docs.astral.sh/uv/getting-started/installation/#__tabbed_1_2>
Git: To clone the repository.
An Excel file (.xls or .xlsx) to process.

Building from Source

Follow these steps to build spreadsheet_splitter from source:

Clone the Repository

git clone https://github.com/parsaloi/spreadsheet_splitter.git
cd spreadsheet_splitter

Set Up a Virtual Environment Create and activate a virtual environment using uv:

uv venv
source .venv/bin/activate  # On Linux/macOS
.venv\Scripts\activate     # On Windows

Install Dependencies Install the required dependencies specified in pyproject.toml:
```
uv sync
```
Verify Setup Check that the spreadsheet-splitter command is available:
```
uv run spreadsheet-splitter --help
```
This should display the CLI help message with available arguments.

Running the Tool

The spreadsheet_splitter tool provides two modes:

Default Mode: Splits an Excel file into two equal parts, preserving title rows.
Large Mode: Splits large files into smaller chunks with a specified number of rows per iteration, ideal for low-memory environments.

Example Commands

Split a File into Two Parts
```
uv run spreadsheet-splitter /path/to/file.xlsx --output-dir split_files --title-rows 1
```
- Splits file.xlsx into split_files/file_part1.xlsx and split_files/file_part2.xlsx.
- Preserves 1 title row in each output file.
Split a Large File Iteratively
```
uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
```
- Processes very_large_file.xlsx in chunks of 500 rows, up to 5 iterations (2500 rows total).
- Reads only the first 22 columns.
- Outputs files like split_files/very_large_file_part001.xlsx, split_files/very_large_file_part002.xlsx, etc.
- Logs progress to split_log.json.

Resume Processing

uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --resume --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22

Resumes from the last processed row recorded in split_log.json.

Command-Line Arguments

Argument	Description	Default
`input_file`	Path to the input `.xls` or `.xlsx` file	Required
`-o, --output-dir`	Directory to save output files	`output`
`-t, --title-rows`	Number of top rows to treat as title rows (ignored in `--large` mode)	1
`--large`	Process large files iteratively with limited rows per iteration	False
`--rows-per-iteration`	Number of data rows to process per iteration	500
`--max-iterations`	Maximum number of iterations to process	10
`--resume`	Resume processing from the last row in the log file	False
`--log-file`	Path to the log file for tracking processed rows	`split_log.json`
`--columns`	Number of columns to process	All columns

Log File

The tool generates a split_log.json file to track processing details, including:

Input and Output: Input file path and output directory.
Processed Ranges: Start and end rows for each output file.
File Profile: Total rows, columns, and data types (for .xlsx).
Memory Usage: Memory before and after reading each chunk (for .xlsx).
Timestamp: When the log was last updated.

Example split_log.json:

{
  "input_path": "/path/to/very_large_file.xlsx",
  "output_dir": "split_files",
  "rows_per_iteration": 500,
  "processed_ranges": [
    {
      "file": "split_files/very_large_file_part001.xlsx",
      "start_row": 1,
      "end_row": 500,
      "iteration": 1
    }
  ],
  "last_row_processed": 499,
  "timestamp": "2025-05-27T11:12:46.172255",
  "columns": 22,
  "file_profile": {
    "total_rows": 1,
    "total_columns": 22,
    "data_types": {
      "ZONE": "float64",
      "SITE": "object",
      ...
    }
  },
  "memory_usage": [
    {
      "iteration": 1,
      "memory_before_mb": 256.55078125,
      "memory_after_mb": 259.6328125
    }
  ]
}

Troubleshooting

Column Count Error: If you see Specified columns (N) exceeds actual columns (M), verify the column count:

import pandas as pd
df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=1, engine='openpyxl')
print(df.shape[1], "columns")

Use the correct number with --columns.

Empty File Warning: If the tool warns that the input file has no rows, check the file content:

import pandas as pd
df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=10, engine='openpyxl')
print(df.shape[0], "rows")

Memory Issues: If memory usage is high, reduce --rows-per-iteration (e.g., to 100). Check split_log.json for memory usage details.
Slow Processing: Ensure the input file is on a fast local drive (e.g., SSD). For .xlsx files with many unique strings, consider converting to .csv first:
```
import pandas as pd
df = pd.read_excel("/path/to/very_large_file.xlsx", engine='openpyxl')
df.to_csv("very_large_file.csv", index=False)
```

Future Work

Package the tool for PyPI with cross-platform installation instructions.
Add support for .csv input files.
Enhance profiling with more detailed statistics (e.g., unique values per column).

License

MIT License

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

May 28, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spreadsheet_splitter-0.1.0.tar.gz (19.8 kB view details)

Uploaded May 28, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

spreadsheet_splitter-0.1.0-py3-none-any.whl (9.3 kB view details)

Uploaded May 28, 2025 Python 3

File details

Details for the file spreadsheet_splitter-0.1.0.tar.gz.

File metadata

Download URL: spreadsheet_splitter-0.1.0.tar.gz
Upload date: May 28, 2025
Size: 19.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.8

File hashes

Hashes for spreadsheet_splitter-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8359ca62c06b93841fb397a5a4caa5426fa0d4a68d1338571a0c033d77fd384f`
MD5	`e2bf0c3ea24152051b3eef4ea8b7b23f`
BLAKE2b-256	`f77a54b07bbbb51f5755b13ed80353bef7a05ef86b197145b3bec64ffefb3c4c`

See more details on using hashes here.

File details

Details for the file spreadsheet_splitter-0.1.0-py3-none-any.whl.

File metadata

Download URL: spreadsheet_splitter-0.1.0-py3-none-any.whl
Upload date: May 28, 2025
Size: 9.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.8

File hashes

Hashes for spreadsheet_splitter-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`225ff75533369f9ca21b9e2f00b754268a9e797590a14761436ed6df28d756a4`
MD5	`f3e3c25bb379cbac249ccacb9d844c0d`
BLAKE2b-256	`c5eaacf342438768b15a3873c48502f80b753fdbc9a36f63098d39c7e6a3993e`

See more details on using hashes here.

spreadsheet-splitter 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Spreadsheet Splitter

Features

Prerequisites

Building from Source

Running the Tool

Example Commands

Command-Line Arguments

Log File

Troubleshooting

Future Work

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes