Skip to main content

A Python command-line tool to split large Excel (.xls or .xlsx) files into smaller parts with low memory usage.

Project description

Spreadsheet Splitter

spreadsheet_splitter is a Python command-line tool designed to split large Excel files (.xls or .xlsx) into smaller, manageable parts. It supports processing large files iteratively with low memory usage, making it ideal for handling files that exceed available RAM (e.g., >4 GB). The tool preserves specified title rows, logs processing details (including file profiling and memory usage), and supports resuming interrupted tasks.

Features

  • Split .xls and .xlsx files into two equal parts or smaller chunks.
  • Process large files iteratively with configurable rows per iteration.
  • Preserve title rows in non-large mode.
  • Validate column counts and profile data types for .xlsx files.
  • Monitor memory usage during processing.
  • Resume processing from the last checkpoint using a log file.
  • Built with xlrd, xlwt, openpyxl, pandas, pyexcelerate, and psutil.

Prerequisites

To build and run spreadsheet_splitter locally, ensure you have:

Building from Source

Follow these steps to build spreadsheet_splitter from source:

  1. Clone the Repository

    git clone https://github.com/parsaloi/spreadsheet_splitter.git
    cd spreadsheet_splitter
    
  2. Set Up a Virtual Environment Create and activate a virtual environment using uv:

    uv venv
    source .venv/bin/activate  # On Linux/macOS
    .venv\Scripts\activate     # On Windows
    
  3. Install Dependencies Install the required dependencies specified in pyproject.toml:

    uv sync
    
  4. Verify Setup Check that the spreadsheet-splitter command is available:

    uv run spreadsheet-splitter --help
    

    This should display the CLI help message with available arguments.

Running the Tool

The spreadsheet_splitter tool provides two modes:

  • Default Mode: Splits an Excel file into two equal parts, preserving title rows.
  • Large Mode: Splits large files into smaller chunks with a specified number of rows per iteration, ideal for low-memory environments.

Example Commands

  1. Split a File into Two Parts

    uv run spreadsheet-splitter /path/to/file.xlsx --output-dir split_files --title-rows 1
    
    • Splits file.xlsx into split_files/file_part1.xlsx and split_files/file_part2.xlsx.
    • Preserves 1 title row in each output file.
  2. Split a Large File Iteratively

    uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
    
    • Processes very_large_file.xlsx in chunks of 500 rows, up to 5 iterations (2500 rows total).
    • Reads only the first 22 columns.
    • Outputs files like split_files/very_large_file_part001.xlsx, split_files/very_large_file_part002.xlsx, etc.
    • Logs progress to split_log.json.
  3. Resume Processing

    uv run spreadsheet-splitter /path/to/very_large_file.xlsx --large --resume --output-dir split_files --rows-per-iteration 500 --max-iterations 5 --columns 22
    
    • Resumes from the last processed row recorded in split_log.json.

Command-Line Arguments

Argument Description Default
input_file Path to the input .xls or .xlsx file Required
-o, --output-dir Directory to save output files output
-t, --title-rows Number of top rows to treat as title rows (ignored in --large mode) 1
--large Process large files iteratively with limited rows per iteration False
--rows-per-iteration Number of data rows to process per iteration 500
--max-iterations Maximum number of iterations to process 10
--resume Resume processing from the last row in the log file False
--log-file Path to the log file for tracking processed rows split_log.json
--columns Number of columns to process All columns

Log File

The tool generates a split_log.json file to track processing details, including:

  • Input and Output: Input file path and output directory.
  • Processed Ranges: Start and end rows for each output file.
  • File Profile: Total rows, columns, and data types (for .xlsx).
  • Memory Usage: Memory before and after reading each chunk (for .xlsx).
  • Timestamp: When the log was last updated.

Example split_log.json:

{
  "input_path": "/path/to/very_large_file.xlsx",
  "output_dir": "split_files",
  "rows_per_iteration": 500,
  "processed_ranges": [
    {
      "file": "split_files/very_large_file_part001.xlsx",
      "start_row": 1,
      "end_row": 500,
      "iteration": 1
    }
  ],
  "last_row_processed": 499,
  "timestamp": "2025-05-27T11:12:46.172255",
  "columns": 22,
  "file_profile": {
    "total_rows": 1,
    "total_columns": 22,
    "data_types": {
      "ZONE": "float64",
      "SITE": "object",
      ...
    }
  },
  "memory_usage": [
    {
      "iteration": 1,
      "memory_before_mb": 256.55078125,
      "memory_after_mb": 259.6328125
    }
  ]
}

Troubleshooting

  • Column Count Error: If you see Specified columns (N) exceeds actual columns (M), verify the column count:

    import pandas as pd
    df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=1, engine='openpyxl')
    print(df.shape[1], "columns")
    

    Use the correct number with --columns.

  • Empty File Warning: If the tool warns that the input file has no rows, check the file content:

    import pandas as pd
    df = pd.read_excel("/path/to/very_large_file.xlsx", nrows=10, engine='openpyxl')
    print(df.shape[0], "rows")
    
  • Memory Issues: If memory usage is high, reduce --rows-per-iteration (e.g., to 100). Check split_log.json for memory usage details.

  • Slow Processing: Ensure the input file is on a fast local drive (e.g., SSD). For .xlsx files with many unique strings, consider converting to .csv first:

    import pandas as pd
    df = pd.read_excel("/path/to/very_large_file.xlsx", engine='openpyxl')
    df.to_csv("very_large_file.csv", index=False)
    

Future Work

  • Package the tool for PyPI with cross-platform installation instructions.
  • Add support for .csv input files.
  • Enhance profiling with more detailed statistics (e.g., unique values per column).

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spreadsheet_splitter-0.1.0.tar.gz (19.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spreadsheet_splitter-0.1.0-py3-none-any.whl (9.3 kB view details)

Uploaded Python 3

File details

Details for the file spreadsheet_splitter-0.1.0.tar.gz.

File metadata

File hashes

Hashes for spreadsheet_splitter-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8359ca62c06b93841fb397a5a4caa5426fa0d4a68d1338571a0c033d77fd384f
MD5 e2bf0c3ea24152051b3eef4ea8b7b23f
BLAKE2b-256 f77a54b07bbbb51f5755b13ed80353bef7a05ef86b197145b3bec64ffefb3c4c

See more details on using hashes here.

File details

Details for the file spreadsheet_splitter-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for spreadsheet_splitter-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 225ff75533369f9ca21b9e2f00b754268a9e797590a14761436ed6df28d756a4
MD5 f3e3c25bb379cbac249ccacb9d844c0d
BLAKE2b-256 c5eaacf342438768b15a3873c48502f80b753fdbc9a36f63098d39c7e6a3993e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page