Extract information-region bounding boxes from Excel sheets without semantic labels

Excel Region Extractor

Extract rectangular information regions from Excel workbooks and return them as range strings such as A1:D10.

The extractor detects regions from cell values, merged cells, borders, embedded image anchors, and chart anchors. It can write per-sheet JSON, a workbook summary JSON, optional overlay PNGs, extracted embedded image files, and simple chart preview PNGs.

Install

pip install excel-region-extractor

Install directly from GitHub:

pip install git+https://github.com/LampSeeker/ExcelRegionExtractor.git

For local development:

pip install -e .

CLI Usage

Run on your workbook:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions

Run one sheet:

excel-regions --workbook path/to/workbook.xlsx --sheet "Sheet1" --out outputs/sheet1

Skip overlay PNG generation:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions --no-overlay

excel-info-regions is kept as a backward-compatible alias.

Python API

from excel_info_region import extract_workbook_info_regions
from excel_info_region.config import load_config

config = load_config()
result = extract_workbook_info_regions("path/to/workbook.xlsx", config=config)

For writing JSON, overlay PNGs, and extracted images:

from excel_info_region import run_and_write

run_and_write("path/to/workbook.xlsx", out_dir="outputs/regions")
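
To process several workbooks, run_and_write can be called in a loop. The following is a minimal sketch, not part of the package: process_folder and its writer parameter are hypothetical names, and the only package call used is run_and_write(path, out_dir=...) as shown above.

```python
from pathlib import Path
from typing import Callable, List, Optional


def process_folder(
    folder: str,
    out_root: str,
    writer: Optional[Callable[..., object]] = None,
) -> List[Path]:
    """Run the extractor on every .xlsx file in `folder` (hypothetical helper)."""
    if writer is None:
        # Imported lazily so the helper can be exercised without the package.
        from excel_info_region import run_and_write as writer

    processed = []
    for workbook in sorted(Path(folder).glob("*.xlsx")):
        # Excel lock files start with "~$" and are not real workbooks.
        if workbook.name.startswith("~$"):
            continue
        # One output directory per workbook, named after the file stem.
        writer(str(workbook), out_dir=str(Path(out_root) / workbook.stem))
        processed.append(workbook)
    return processed
```

Calling process_folder("workbooks", "outputs") would write each workbook's results under outputs/<workbook name>/. Passing a different writer is only a convenience for testing the loop.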

Demo

The source repository includes a synthetic, non-sensitive workbook:

excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo

Example overlay:

[Image: Synthetic Excel region overlay]

Chart demo:

excel-regions --workbook examples/chart.xlsx --sheet "XXX Summary Northeast" --out outputs/chart_demo

Chart overlay:

[Image: Chart Excel region overlay]

Extracted chart preview:

[Image: Chart preview]

Output

outputs/chart_demo/
  info_regions_full.json
  info_regions_summary.json

  XXX Summary Northeast/
    info_regions.json
    info_regions.png
    charts/
      CHART001_E3_Q20_Chart_1.png

Sheet JSON:

{
  "sheet_name": "XXX Summary Northeast",
  "regions": [
    "A1:D15",
    "E3:Q20",
    "A25:M49"
  ],
  "images": [],
  "charts": [
    {
      "name": "Chart 1",
      "kind": "BarChart",
      "range_ref": "E3:Q20",
      "path": "charts/CHART001_E3_Q20_Chart_1.png",
      "sources": [
        {
          "role": "cat",
          "range_ref": "B25:M25",
          "cached_values": ["CCRX [Person_12]", "CCRX Newport", "..."]
        },
        {
          "role": "val",
          "range_ref": "B26:M26",
          "values": [["=AVERAGEIF(B27:B45,\">0\")", "..."]],
          "cached_values": [0.0, 2.0, 4.219298245614035, "..."]
        }
      ]
    }
  ]
}

regions is the list of detected Excel ranges. images records embedded image metadata. charts records chart metadata, source ranges, cached chart values when available, and a preview PNG path.
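
Downstream code usually needs numeric bounds rather than range strings. The sketch below reads a sheet JSON shaped like the example above and converts each entry in regions to row and column counts. parse_range, column_index, and summarize_sheet are illustrative helpers, not package functions, and the sketch assumes every region is written as a two-corner range such as A1:D15.

```python
import json
import re

RANGE_RE = re.compile(r"^([A-Z]+)(\d+):([A-Z]+)(\d+)$")


def column_index(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_range(ref: str):
    """Return (min_row, min_col, max_row, max_col) for a ref such as 'A1:D15'."""
    match = RANGE_RE.match(ref)
    if match is None:
        raise ValueError(f"not a two-corner range: {ref!r}")
    c1, r1, c2, r2 = match.groups()
    return int(r1), column_index(c1), int(r2), column_index(c2)


def summarize_sheet(sheet: dict):
    """List each region with its height in rows and width in columns."""
    rows = []
    for ref in sheet.get("regions", []):
        min_row, min_col, max_row, max_col = parse_range(ref)
        rows.append((ref, max_row - min_row + 1, max_col - min_col + 1))
    return rows


# Inline sample; in practice this would be json.load() on info_regions.json.
sheet = json.loads(
    '{"sheet_name": "XXX Summary Northeast",'
    ' "regions": ["A1:D15", "E3:Q20", "A25:M49"],'
    ' "images": [], "charts": []}'
)
for ref, n_rows, n_cols in summarize_sheet(sheet):
    print(ref, n_rows, n_cols)
```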

How It Works

Current extractor flow:

1. Calculate working bounds from non-empty cells, merged cells, and images
2. Collect non-empty cells as occupied cells
3. Find connected components from occupied cells
4. Convert each connected component to a rectangular bbox
5. Expand bboxes with border/table shell information
6. Merge some boxes that touch the same border component
7. Add images as separate regions
8. Output range refs such as A1:D10
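
Steps 2 to 4 and step 8 can be sketched in plain Python as follows. This is an illustration of the general technique, not the package's code: it assumes 4-connectivity (cells sharing an edge), which the flow above does not specify, and it omits border expansion, border-contact merging, and image regions. find_regions, to_range, and column_letters are hypothetical names.

```python
from collections import deque


def column_letters(index: int) -> str:
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_range(min_row, min_col, max_row, max_col) -> str:
    """Format a bounding box as a range ref such as A1:D10."""
    return f"{column_letters(min_col)}{min_row}:{column_letters(max_col)}{max_row}"


def find_regions(occupied):
    """Group occupied (row, col) cells into 4-connected components and
    return each component's bounding box as a range string."""
    remaining = set(occupied)
    boxes = []
    while remaining:
        start = remaining.pop()
        queue = deque([start])
        min_row = max_row = start[0]
        min_col = max_col = start[1]
        while queue:
            row, col = queue.popleft()
            # Grow the bounding box to cover the current cell.
            min_row, max_row = min(min_row, row), max(max_row, row)
            min_col, max_col = min(min_col, col), max(max_col, col)
            # Assumption: only edge-adjacent cells count as connected.
            for neighbor in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    queue.append(neighbor)
        boxes.append((min_row, min_col, max_row, max_col))
    return [to_range(*box) for box in sorted(boxes)]


# Two separate blocks of occupied cells, given as 1-based (row, col) pairs.
occupied = {(1, 1), (1, 2), (2, 1), (2, 2), (5, 4), (5, 5)}
print(find_regions(occupied))
```

A bounding box can cover empty cells when a component is not rectangular, which is why the later border steps matter for tables with sparse content.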

Images are intentionally kept separate from cell connected components. This avoids over-merging drawings with nearby tables.

Configuration

The packaged default config is loaded by:

load_config()

Common options:

{
  "include_values": true,
  "include_merged_cells": true,
  "include_images": true,
  "include_grouped_drawing_images": true,
  "include_charts": true,
  "include_chart_source_values": true,
  "respect_hidden_rows_cols": false,
  "use_print_area_bounds": false,
  "use_borders": true,
  "strong_borders_only": true,
  "use_border_contact_merge": true,
  "extract_embedded_images": true,
  "embedded_image_dir": "images",
  "extract_chart_images": true,
  "chart_image_dir": "charts"
}

Set a font path if text is broken in overlay PNGs:

{
  "visualization": {
    "font_path": "C:/Windows/Fonts/malgun.ttf"
  }
}
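
Because the options are nested JSON, a small set of overrides can be layered onto a base config with a recursive merge. The sketch below uses only the standard library. merge_config is a hypothetical helper, and how the package itself combines user settings with the packaged default is not documented here, so treat this as one possible approach.

```python
import copy
import json


def merge_config(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections such as "visualization" key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A subset of the options listed above, used here as the base config.
base = {"include_charts": True, "use_borders": True, "strong_borders_only": True}
overrides = {
    "include_charts": False,
    "visualization": {"font_path": "C:/Windows/Fonts/malgun.ttf"},
}
config = merge_config(base, overrides)
print(json.dumps(config, indent=2))
```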

--no-overlay skips overlay PNG generation. Embedded image extraction still runs when extract_embedded_images is true.

Project Structure

src/excel_info_region/
  cli.py             console entrypoint
  runner.py          writes JSON, overlay PNG, extracted images
  extractor.py       workbook/sheet orchestration
  cells.py           cell and merged-cell occupied logic
  borders.py         border expansion and border-contact merge
  components.py      connected components and bbox helpers
  image_regions.py   image anchors to region boxes
  image_export.py    embedded image extraction
  raw_drawing.py     raw xlsx DrawingML parsing
  visualize.py       overlay PNG renderer

Development

pytest
excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo --no-overlay

Run without --no-overlay when changing visualization or image extraction.

Private/local Excel samples are ignored:

examples/sample.xlsx
examples/sample2.xlsx

Notes

openpyxl does not calculate formulas. Overlay rendering loads the workbook with data_only=True, so formula cells show calculated results only if Excel saved cached values in the file.
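
The behavior can be reproduced with openpyxl alone. In this sketch the workbook is created by openpyxl and never opened in Excel, so no cached value exists for the formula cell. It requires openpyxl to be installed; the variable names are illustrative.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a small workbook with one formula cell, entirely in memory.
wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = 3
ws["A3"] = "=SUM(A1:A2)"

buffer = BytesIO()
wb.save(buffer)

# Default load (data_only=False) returns the formula text.
buffer.seek(0)
formula_text = load_workbook(buffer).active["A3"].value

# data_only=True returns the cached result, which is missing here
# because the file was never calculated and saved by Excel.
buffer.seek(0)
cached_value = load_workbook(buffer, data_only=True).active["A3"].value

print(formula_text, cached_value)
```

Opening the file in Excel and saving it stores the calculated result, after which the data_only=True load returns a number instead of None.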

License

MIT
