Extract information-region bounding boxes from Excel sheets without semantic labels

Excel Region Extractor

Extract rectangular information regions from Excel workbooks and return them as range strings such as A1:D10.

The extractor uses cell values, merged cells, borders, embedded image anchors, and chart anchors. It can write sheet JSON, workbook summary JSON, optional overlay PNGs, extracted embedded image files, and simple chart preview PNGs.

Install

pip install excel-region-extractor

Install directly from GitHub:

pip install git+https://github.com/LampSeeker/ExcelRegionExtractor.git

For local development:

pip install -e .

CLI Usage

Run on your workbook:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions

Run one sheet:

excel-regions --workbook path/to/workbook.xlsx --sheet "Sheet1" --out outputs/sheet1

Skip overlay PNG generation:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions --no-images

excel-info-regions is kept as a backward-compatible alias.
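
When several workbooks need the same flags, the command line can be assembled in Python and handed to subprocess. The sketch below is only an illustration: build_command is a hypothetical helper, not part of the package, and it uses only the flags documented above.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_command(workbook: Path, out_dir: Path,
                  sheet: Optional[str] = None,
                  no_images: bool = False) -> List[str]:
    """Build an excel-regions argument list (hypothetical helper)."""
    cmd = ["excel-regions", "--workbook", str(workbook), "--out", str(out_dir)]
    if sheet is not None:
        cmd += ["--sheet", sheet]      # restrict the run to one sheet
    if no_images:
        cmd.append("--no-images")      # skip overlay PNG generation
    return cmd


# Print the command instead of running it; to execute it, pass the list to
# subprocess.run(cmd, check=True).
cmd = build_command(Path("workbook.xlsx"), Path("outputs"), sheet="Sheet1")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with sheet names that contain spaces.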

Python API

from excel_info_region import extract_workbook_info_regions
from excel_info_region.config import load_config

config = load_config("config/default.json")
result = extract_workbook_info_regions("path/to/workbook.xlsx", config=config)

To also write JSON, overlay PNGs, and extracted images to disk:

from excel_info_region import run_and_write

run_and_write("path/to/workbook.xlsx", out_dir="outputs/regions")

Demo

The source repository includes a synthetic, non-sensitive workbook:

excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo

Example overlay:

[Image: synthetic Excel region overlay]

Chart demo:

excel-regions --workbook examples/chart.xlsx --sheet "XXX Summary Northeast" --out outputs/chart_demo

Chart overlay:

[Image: chart Excel region overlay]

Extracted chart preview:

[Image: chart preview]

Output

outputs/chart_demo/
  info_regions_full.json
  info_regions_summary.json

  XXX Summary Northeast/
    info_regions.json
    info_regions.png
    charts/
      CHART001_E3_Q20_Chart_1.png

Sheet JSON:

{
  "sheet_name": "XXX Summary Northeast",
  "regions": [
    "A1:D15",
    "E3:Q20",
    "A25:M49"
  ],
  "images": [],
  "charts": [
    {
      "name": "Chart 1",
      "kind": "BarChart",
      "range_ref": "E3:Q20",
      "path": "charts/CHART001_E3_Q20_Chart_1.png",
      "sources": [
        {
          "role": "cat",
          "range_ref": "B25:M25",
          "cached_values": ["CCRX [Person_12]", "CCRX Newport", "..."]
        },
        {
          "role": "val",
          "range_ref": "B26:M26",
          "values": [["=AVERAGEIF(B27:B45,\">0\")", "..."]],
          "cached_values": [0.0, 2.0, 4.219298245614035, "..."]
        }
      ]
    }
  ]
}

regions is the list of detected Excel ranges. images records embedded image metadata. charts records chart metadata, source ranges, cached chart values when available, and a preview PNG path.
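
Downstream code usually needs numeric bounds rather than range strings. The following is a minimal sketch of one way to consume the regions list, assuming every entry has the two-corner form shown above (such as A1:D15). The helper names are illustrative and not part of the package.

```python
import json
import re
from typing import Dict, List, Tuple

_RANGE_RE = re.compile(r"^([A-Z]+)([0-9]+):([A-Z]+)([0-9]+)$")


def column_index(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_range(ref: str) -> Tuple[int, int, int, int]:
    """Return (min_row, min_col, max_row, max_col), 1-based and inclusive."""
    match = _RANGE_RE.match(ref)
    if match is None:
        raise ValueError(f"unsupported range reference: {ref!r}")
    col1, row1, col2, row2 = match.groups()
    return int(row1), column_index(col1), int(row2), column_index(col2)


def summarize(sheet: Dict) -> List[Tuple[str, int, int]]:
    """List (range, row count, column count) for every detected region."""
    rows = []
    for ref in sheet["regions"]:
        r1, c1, r2, c2 = parse_range(ref)
        rows.append((ref, r2 - r1 + 1, c2 - c1 + 1))
    return rows


# Trimmed copy of the sheet JSON above; a real run would json.load the
# info_regions.json file from the sheet's output folder instead.
sheet = json.loads(
    '{"sheet_name": "XXX Summary Northeast",'
    ' "regions": ["A1:D15", "E3:Q20", "A25:M49"]}'
)
for ref, n_rows, n_cols in summarize(sheet):
    print(f"{ref}: {n_rows} rows x {n_cols} columns")
```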

How It Works

Current extractor flow:

1. Calculate working bounds from non-empty cells, merged cells, and images
2. Collect non-empty cells as occupied cells
3. Find connected components from occupied cells
4. Convert each connected component to a rectangular bbox
5. Expand bboxes with border/table shell information
6. Merge some boxes that touch the same border component
7. Add images as separate regions
8. Output range refs such as A1:D10

Images are intentionally kept separate from cell connected components. This avoids over-merging drawings with nearby tables.
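
Steps 2 to 4 and step 8 of the flow can be sketched in plain Python. This is an illustration of the general technique, not the package's code: it assumes 8-neighbour connectivity (cells that touch diagonally count as connected) and leaves out the border expansion, box merging, and image steps. All function names are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column), both 1-based


def column_letters(index: int) -> str:
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def connected_components(occupied: Iterable[Cell]) -> List[Set[Cell]]:
    """Group occupied cells with a breadth-first search over 8 neighbours."""
    remaining = set(occupied)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:
            row, col = queue.popleft()
            for d_row in (-1, 0, 1):
                for d_col in (-1, 0, 1):
                    neighbour = (row + d_row, col + d_col)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        component.add(neighbour)
                        queue.append(neighbour)
        components.append(component)
    return components


def bbox_to_range(component: Set[Cell]) -> str:
    """Turn a component into the range string of its bounding box."""
    rows = [r for r, _ in component]
    cols = [c for _, c in component]
    return (f"{column_letters(min(cols))}{min(rows)}:"
            f"{column_letters(max(cols))}{max(rows)}")


def extract_ranges(occupied: Iterable[Cell]) -> List[str]:
    """Components -> bounding boxes -> range strings, ordered top to bottom."""
    components = sorted(connected_components(occupied), key=min)
    return [bbox_to_range(c) for c in components]


# A 2x2 block at A1:B2 plus two diagonal cells at D5 and E6.
occupied = {(1, 1), (1, 2), (2, 1), (2, 2), (5, 4), (6, 5)}
print(extract_ranges(occupied))  # ['A1:B2', 'D5:E6']
```

A bounding box can cover empty cells, as in D5:E6 here, which is why the result is a rectangular region rather than an exact cell set.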

Configuration

Default config lives at:

config/default.json

Common options:

{
  "include_values": true,
  "include_merged_cells": true,
  "include_images": true,
  "include_grouped_drawing_images": true,
  "include_charts": true,
  "include_chart_source_values": true,
  "respect_hidden_rows_cols": false,
  "use_print_area_bounds": false,
  "use_borders": true,
  "strong_borders_only": true,
  "use_border_contact_merge": true,
  "extract_embedded_images": true,
  "embedded_image_dir": "images",
  "extract_chart_images": true,
  "chart_image_dir": "charts"
}
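
To try different settings without editing the default file, one option is to write a full copy with a few keys changed and pass that path to load_config. The sketch below uses only the standard library. It writes a complete file because this README does not say whether load_config fills in keys that are missing. write_config_variant is a hypothetical helper, and the temporary default.json is a stand-in for config/default.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_config_variant(default_path: Path, out_path: Path,
                         overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the default config, apply top-level overrides, save the result."""
    config = json.loads(default_path.read_text(encoding="utf-8"))
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        # Guard against typos: accept only keys present in the default file.
        raise KeyError(f"keys not present in {default_path}: {unknown}")
    config.update(overrides)
    out_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demonstration with a temporary stand-in for config/default.json.
with tempfile.TemporaryDirectory() as tmp:
    default_path = Path(tmp) / "default.json"
    default_path.write_text(
        json.dumps({"use_borders": True, "include_charts": True}),
        encoding="utf-8",
    )
    variant = write_config_variant(
        default_path, Path(tmp) / "no_borders.json", {"use_borders": False}
    )
    print(variant)
```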

Set a font path if text in overlay PNGs renders as missing glyphs (for example, Korean text without a suitable font):

{
  "visualization": {
    "font_path": "C:/Windows/Fonts/malgun.ttf"
  }
}

--no-images skips overlay PNG generation. Embedded image extraction still runs when extract_embedded_images is true.

Project Structure

src/excel_info_region/
  cli.py             console entrypoint
  runner.py          writes JSON, overlay PNG, extracted images
  extractor.py       workbook/sheet orchestration
  cells.py           cell and merged-cell occupied logic
  borders.py         border expansion and border-contact merge
  components.py      connected components and bbox helpers
  image_regions.py   image anchors to region boxes
  image_export.py    embedded image extraction
  raw_drawing.py     raw xlsx DrawingML parsing
  visualize.py       overlay PNG renderer

Development

pytest
excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo --no-images

Run without --no-images when changing visualization or image extraction.

Private/local Excel samples are ignored:

examples/sample.xlsx
examples/sample2.xlsx

Notes

openpyxl does not calculate formulas. Overlay rendering loads the workbook with data_only=True, so a formula cell shows its result only if Excel saved a cached value with the file. Without a cached value, the cell reads as empty.
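
To check whether a workbook carries cached formula results, the sheet XML inside the .xlsx archive can be inspected directly: a formula cell stores its formula in an f element and its cached result in a v element. The following standard-library sketch is not part of the package. Its function names are hypothetical, and check_workbook assumes the usual xl/worksheets/ layout inside the archive.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict, List

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def formulas_without_cache(sheet_xml: bytes) -> List[str]:
    """Return references of formula cells that have no cached <v> element."""
    root = ET.fromstring(sheet_xml)
    missing = []
    for cell in root.iterfind(".//m:c", NS):
        has_formula = cell.find("m:f", NS) is not None
        has_value = cell.find("m:v", NS) is not None
        if has_formula and not has_value:
            missing.append(cell.get("r"))
    return missing


def check_workbook(path: str) -> Dict[str, List[str]]:
    """Map each worksheet part in the archive to its uncached formula cells."""
    with zipfile.ZipFile(path) as archive:
        return {
            name: formulas_without_cache(archive.read(name))
            for name in archive.namelist()
            if name.startswith("xl/worksheets/") and name.endswith(".xml")
        }


# Hand-written sheet fragment: B1 has a cached result, C1 does not.
SAMPLE = b"""<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>2</v></c>
      <c r="B1"><f>A1*2</f><v>4</v></c>
      <c r="C1"><f>A1+B1</f></c>
    </row>
  </sheetData>
</worksheet>"""
print(formulas_without_cache(SAMPLE))  # ['C1']
```

If the list is non-empty, opening the workbook in Excel and saving it again should store the calculated results.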

License

MIT
