Skip to main content

A simple and flexible Excel-to-CSV extraction tool using JSON configuration.

Project description

excelextract

excelextract is a simple yet powerful command-line tool to extract structured data from Excel spreadsheets using a declarative JSON configuration.
It’s designed to be accessible to researchers and data collectors working with standardized interview forms and surveys data.

What It Does

  • Parses Excel (.xlsx) files from a folder
  • Applies looping logic across sheets, rows, and columns
  • Uses a JSON configuration to define where and how to extract data
  • Outputs clean CSV files for each defined table

Installation

You can install excelextract using pip:

pip install excelextract

Make sure your Python version is 3.8 or higher.

Example Usage

Given a JSON configuration like:

{
  "exports": [
    {
      "inputFolder": "data",
      "inputRegex": ".*",
      "output": "outputFileName",
      "loops" : [
        {
          "type" : "sheetLoop",
          "token" : "SHEET_NAME",
          "regex" : "Survey (.*)"
        },
        {
          "type": "dynamicRowLoop",
          "token": "ROW",
          "sheet": "%%SHEET_NAME%%",
          "start": 
            {
              "regex": "Participants",
              "column": "C",
              "offset": 3
            },
          "end": 
            {
              "regex": "Total",
              "column": "O",
              "offset": -1
            }
        }
      ],
      "columns": [
        {
          "name": "survey_file",
          "type": "string",
          "value": "%%FILE_NAME%%",
          "doNotInitiate": true
        },
        {
          "name": "survey_name",
          "type": "string",
          "value": "overview!C2",
          "doNotInitiate": true
        },
        {
          "name": "participant",
          "type": "string",
          "value": "%%SHEET_NAME%%!B%%ROW%%"
        },
        {
          "name": "number",
          "type": "number",
          "value": "%%SHEET_NAME%%!I%%ROW%%",
        },
      ]
    }
  ]
}

You can run:

excelextract config.json

And it will generate outputFileName.csv in the working directory.

Features Illustrated in the Example

This configuration is designed for a common research scenario:
you have many Excel files, each containing an overview sheet and multiple survey sheets (e.g., Survey A, Survey B, ...).
Each survey sheet includes a list of participants, but the number of rows varies per file and sheet.

The configuration extracts these participant rows dynamically and compiles them into a single CSV file, with one row per participant per sheet.

It demonstrates:

  • File matching: Processes all .xlsx files in the data/ folder using a regex.
  • Sheet loop: Iterates over all sheets matching "Survey (.*)", storing the sheet name as %%SHEET_NAME%%.
  • Dynamic row loop:
    • Starts 3 rows after "Participants" in column C
    • Ends 1 row before "Total" in column O
    • Sets the %%ROW%% token for use in cell references
  • Token-based cell addressing: Extracts values like participant name or count using dynamic tokens.
  • Fixed metadata: Reads values like survey name from static cells (e.g., overview!C2) once per file.
  • Selective row inclusion:
    • doNotInitiate: true excludes metadata fields from triggering output
    • type: "number" ensures proper numeric conversion and filtering

This setup allows you to turn a folder of semi-structured Excel forms into a single clean dataset with no manual editing.

Features

  • Token-based substitution using %%SHEET_NAME%%, %%ROW%%, etc.
  • Supports sheet loops, row loops, column loops, and regex-based searching
  • Dynamically determines start/end positions in Excel
  • Cleanly separates config from logic — no programming needed to extract new forms
  • UTF-8-BOM output (compatible with Excel and Windows)

License

MIT License © 2025 Philippe

This project is not affiliated with or endorsed by Microsoft. "Excel" is a registered trademark of Microsoft Corporation. This tool uses the .xlsx format purely as a data source.

Contributing

This tool is shared in the hope it helps others with structured data collection workflows. Pull requests, feedback, and improvements are welcome!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

excelextract-0.1.0.tar.gz (4.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

excelextract-0.1.0-py3-none-any.whl (5.2 kB view details)

Uploaded Python 3

File details

Details for the file excelextract-0.1.0.tar.gz.

File metadata

  • Download URL: excelextract-0.1.0.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for excelextract-0.1.0.tar.gz
Algorithm Hash digest
SHA256 944172e5a44bf78a8a7bd8693dc60614498488349d41d418618611cebeba7d71
MD5 63282244f520a5ee756f32ba90a8ec59
BLAKE2b-256 58766d60b178b9b1083dca4ed2f491b207f29c62a7a1e4d4f9dcb27801dd351e

See more details on using hashes here.

File details

Details for the file excelextract-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: excelextract-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 5.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for excelextract-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b8cc4cf9bcb8bf45cb8e573e6d97272460d35eddf042c0ac52567f3e43765edf
MD5 381d42e84db74b1d6476ace25e9077a0
BLAKE2b-256 4b537ade2d84dc0a78ace8866ee5d78181eb2de7169938ba4fe782b85e28650b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page