
Extract information-region bounding boxes from Excel sheets without semantic labels


Excel Region Extractor

Extract rectangular information regions from Excel workbooks and return them as range strings such as A1:D10.

The extractor uses cell values, merged cells, borders, embedded image anchors, and chart anchors. It can write sheet JSON, workbook summary JSON, optional overlay PNGs, extracted embedded image files, and simple chart preview PNGs.

Install

pip install excel-region-extractor

Install directly from GitHub:

pip install git+https://github.com/LampSeeker/ExcelRegionExtractor.git

For local development:

pip install -e .

CLI Usage

Run on your workbook:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions

Run one sheet:

excel-regions --workbook path/to/workbook.xlsx --sheet "Sheet1" --out outputs/sheet1

Skip overlay PNG generation:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions --no-overlay

The excel-info-regions command is kept as a backward-compatible alias for excel-regions.

Python API

from excel_info_region import extract_workbook_info_regions
from excel_info_region.config import load_config

config = load_config()
result = extract_workbook_info_regions("path/to/workbook.xlsx", config=config)

To write JSON, overlay PNGs, and extracted images to an output directory:

from excel_info_region import run_and_write

run_and_write("path/to/workbook.xlsx", out_dir="outputs/regions")

Demo

The source repository includes a synthetic, non-sensitive workbook:

excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo

Example overlay: [Figure: Synthetic Excel region overlay]

Chart demo:

excel-regions --workbook examples/chart.xlsx --sheet "XXX Summary Northeast" --out outputs/chart_demo

Chart overlay: [Figure: Chart Excel region overlay]

Extracted chart preview: [Figure: Chart preview]

Output

outputs/chart_demo/
  info_regions_full.json
  info_regions_summary.json

  XXX Summary Northeast/
    info_regions.json
    info_regions.png
    charts/
      CHART001_E3_Q20_Chart_1.png

Sheet JSON:

{
  "sheet_name": "XXX Summary Northeast",
  "regions": [
    "A1:D15",
    "E3:Q20",
    "A25:M49"
  ],
  "images": [],
  "charts": [
    {
      "name": "Chart 1",
      "kind": "BarChart",
      "range_ref": "E3:Q20",
      "path": "charts/CHART001_E3_Q20_Chart_1.png",
      "sources": [
        {
          "role": "cat",
          "range_ref": "B25:M25",
          "cached_values": ["CCRX [Person_12]", "CCRX Newport", "..."]
        },
        {
          "role": "val",
          "range_ref": "B26:M26",
          "values": [["=AVERAGEIF(B27:B45,\">0\")", "..."]],
          "cached_values": [0.0, 2.0, 4.219298245614035, "..."]
        }
      ]
    }
  ]
}

regions is the list of detected Excel ranges. images records embedded image metadata. charts records chart metadata, source ranges, cached chart values when available, and a preview PNG path.
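A minimal sketch of consuming the sheet JSON, using only the Python standard library. It assumes the field names shown in the example above; the helper names col_index and range_shape are hypothetical and not part of the package. The JSON is inlined here in trimmed form so the snippet runs on its own.

```python
import json
import re

# Matches "A1" or "A1:D10"; absolute markers ($) and sheet prefixes are not handled.
REF = re.compile(r"^([A-Z]+)(\d+)(?::([A-Z]+)(\d+))?$")


def col_index(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def range_shape(range_ref: str) -> tuple[int, int]:
    """Return (rows, cols) covered by a range string."""
    match = REF.match(range_ref)
    if match is None:
        raise ValueError(f"unsupported range reference: {range_ref!r}")
    c1, r1, c2, r2 = match.groups()
    if c2 is None:  # single-cell reference such as "A1"
        c2, r2 = c1, r1
    return int(r2) - int(r1) + 1, col_index(c2) - col_index(c1) + 1


# Trimmed copy of the sheet JSON above. Real use would json.load() the
# info_regions.json file from the sheet's output directory instead.
sheet = json.loads("""
{
  "sheet_name": "XXX Summary Northeast",
  "regions": ["A1:D15", "E3:Q20", "A25:M49"],
  "charts": [
    {
      "name": "Chart 1",
      "range_ref": "E3:Q20",
      "sources": [
        {"role": "cat", "range_ref": "B25:M25"},
        {"role": "val", "range_ref": "B26:M26"}
      ]
    }
  ]
}
""")

# Size of each detected region as (rows, cols).
shapes = {ref: range_shape(ref) for ref in sheet["regions"]}

# Source range per role for each chart.
chart_sources = {
    chart["name"]: {src["role"]: src["range_ref"] for src in chart["sources"]}
    for chart in sheet["charts"]
}

print(shapes)
print(chart_sources)
```

Because regions are plain range strings, downstream code can compute sizes or filter regions without loading the workbook again.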

How It Works

Current extractor flow:

1. Calculate working bounds from non-empty cells, merged cells, and images
2. Collect non-empty cells as occupied cells
3. Find connected components from occupied cells
4. Convert each connected component to a rectangular bbox
5. Expand bboxes with border/table shell information
6. Merge some boxes that touch the same border component
7. Add images as separate regions
8. Output range refs such as A1:D10

Images are intentionally kept separate from cell connected components. This avoids over-merging drawings with nearby tables.
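Steps 2 to 4 and step 8 of the flow above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's implementation: it assumes 8-connectivity (diagonal neighbours count as connected), ignores borders, merged cells, and images, and the function names component_bboxes, to_range, and col_letter are hypothetical.

```python
from collections import deque


def col_letter(col: int) -> str:
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_range(bbox: tuple[int, int, int, int]) -> str:
    """Format (min_row, min_col, max_row, max_col) as a range string."""
    min_row, min_col, max_row, max_col = bbox
    return f"{col_letter(min_col)}{min_row}:{col_letter(max_col)}{max_row}"


def component_bboxes(occupied: set[tuple[int, int]]) -> list[tuple[int, int, int, int]]:
    """Group occupied (row, col) cells into connected components.

    Returns one bounding box per component, sorted by top-left corner.
    Assumes 8-connectivity; the real extractor may use a different rule.
    """
    remaining = set(occupied)
    boxes = []
    while remaining:
        start = remaining.pop()
        queue = deque([start])
        min_r = max_r = start[0]
        min_c = max_c = start[1]
        while queue:
            r, c = queue.popleft()
            # Grow the bounding box to include the current cell.
            min_r, max_r = min(min_r, r), max(max_r, r)
            min_c, max_c = min(min_c, c), max(max_c, c)
            # Visit every unvisited neighbour, including diagonals.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        queue.append(neighbour)
        boxes.append((min_r, min_c, max_r, max_c))
    return sorted(boxes)


# Two separate blocks of non-empty cells: A1:B3 and an L-shape inside D6:E7.
occupied = {(r, c) for r in range(1, 4) for c in range(1, 3)}
occupied |= {(6, 4), (6, 5), (7, 5)}

regions = [to_range(box) for box in component_bboxes(occupied)]
print(regions)
```

Note that the L-shaped component still yields a full rectangle, which is why a bounding box can cover cells that are themselves empty.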

Configuration

The packaged default config is loaded by:

load_config()

Common options:

{
  "include_values": true,
  "include_merged_cells": true,
  "include_images": true,
  "include_grouped_drawing_images": true,
  "include_charts": true,
  "include_chart_source_values": true,
  "respect_hidden_rows_cols": false,
  "use_print_area_bounds": false,
  "use_borders": true,
  "strong_borders_only": true,
  "use_border_contact_merge": true,
  "extract_embedded_images": true,
  "embedded_image_dir": "images",
  "extract_chart_images": true,
  "chart_image_dir": "charts"
}

If text renders incorrectly in overlay PNGs (for example, missing glyphs for non-Latin characters), set a font path:

{
  "visualization": {
    "font_path": "C:/Windows/Fonts/malgun.ttf"
  }
}
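Because options such as visualization.font_path are nested, overriding a single key calls for a recursive merge rather than a flat dict update. The sketch below shows one way to layer overrides onto defaults. It is an assumption-laden illustration: merge_config is a hypothetical helper, the defaults dict is a stand-in rather than the packaged file, and whether the object returned by load_config() is a plain dict is not stated here.

```python
import copy


def merge_config(base: dict, overrides: dict) -> dict:
    """Return a new dict with overrides applied on top of base, recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in defaults using option names from this README (values illustrative).
defaults = {
    "use_borders": True,
    "strong_borders_only": True,
    "visualization": {"font_path": None},
}

overrides = {
    "strong_borders_only": False,
    "visualization": {"font_path": "C:/Windows/Fonts/malgun.ttf"},
}

config = merge_config(defaults, overrides)
print(config)
```

The merge leaves the defaults untouched, so the same base can be reused for several workbooks with different overrides.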

--no-overlay skips overlay PNG generation. Embedded image extraction still runs when extract_embedded_images is true.

Project Structure

src/excel_info_region/
  cli.py             console entrypoint
  runner.py          writes JSON, overlay PNG, extracted images
  extractor.py       workbook/sheet orchestration
  cells.py           cell and merged-cell occupied logic
  borders.py         border expansion and border-contact merge
  components.py      connected components and bbox helpers
  image_regions.py   image anchors to region boxes
  image_export.py    embedded image extraction
  raw_drawing.py     raw xlsx DrawingML parsing
  visualize.py       overlay PNG renderer

Development

pytest
excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo --no-overlay

Run without --no-overlay when changing visualization or image extraction.

Private/local Excel samples are ignored:

examples/sample.xlsx
examples/sample2.xlsx

Notes

openpyxl does not evaluate formulas. Overlay rendering loads the workbook with data_only=True, so a formula cell shows its calculated result only if the file contains a cached value saved by Excel; without one, no value is available for that cell.
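The behaviour can be reproduced with openpyxl alone. The sketch below builds a workbook in memory, saves it without ever opening it in Excel, and reloads it both ways. It requires openpyxl to be installed; the cell addresses and values are illustrative.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a small workbook containing one formula cell.
wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = 3
ws["A3"] = "=SUM(A1:A2)"

# Save to memory; openpyxl writes the formula text but no cached result.
buffer = BytesIO()
wb.save(buffer)
data = buffer.getvalue()

# Default load: formula cells return the formula string.
formula = load_workbook(BytesIO(data)).active["A3"].value

# data_only=True: formula cells return the cached value, which is missing
# here because Excel never calculated and saved this file.
cached = load_workbook(BytesIO(data), data_only=True).active["A3"].value

print(formula)
print(cached)
```

Opening and saving the workbook in Excel stores the calculated results, after which a data_only=True load returns them.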

License

MIT
