
Extract information-region bounding boxes from Excel sheets without semantic labels


Excel Region Extractor

Extract rectangular information regions from Excel workbooks and return them as range strings such as A1:D10.

The extractor uses cell values, merged cells, borders, embedded image anchors, and chart anchors. It can write sheet JSON, workbook summary JSON, optional overlay PNGs, extracted embedded image files, and simple chart preview PNGs.

Install

pip install excel-region-extractor

Install directly from GitHub:

pip install git+https://github.com/LampSeeker/ExcelRegionExtractor.git

For local development:

pip install -e .

CLI Usage

Run on your workbook:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions

Run one sheet:

excel-regions --workbook path/to/workbook.xlsx --sheet "Sheet1" --out outputs/sheet1

Skip overlay PNG generation:

excel-regions --workbook path/to/workbook.xlsx --out outputs/regions --no-overlay

excel-info-regions is kept as a backward-compatible alias.
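To script the CLI over several workbooks, the commands above can be assembled programmatically. The sketch below uses only the flags shown in this section (`--workbook`, `--sheet`, `--out`, `--no-overlay`); the helper names `build_command` and `run_folder` and the one-output-folder-per-workbook layout are assumptions made for illustration, not part of the package.

```python
import subprocess
from pathlib import Path


def build_command(workbook, out_dir, sheet=None, overlay=True):
    """Assemble an excel-regions command line from the documented flags."""
    cmd = ["excel-regions", "--workbook", str(workbook), "--out", str(out_dir)]
    if sheet is not None:
        cmd += ["--sheet", sheet]
    if not overlay:
        cmd.append("--no-overlay")
    return cmd


def run_folder(folder, out_root):
    """Run the extractor once per .xlsx file (hypothetical layout:
    one output directory per workbook, named after the file stem)."""
    for workbook in sorted(Path(folder).glob("*.xlsx")):
        out_dir = Path(out_root) / workbook.stem
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_command(workbook, out_dir, overlay=False), check=True)


# Show the command that would be run for a single sheet.
print(build_command("path/to/workbook.xlsx", "outputs/sheet1", sheet="Sheet1"))
```

Passing the command as a list avoids shell quoting problems with sheet names that contain spaces.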

Python API

from excel_info_region import extract_workbook_info_regions
from excel_info_region.config import load_config

config = load_config()
result = extract_workbook_info_regions("path/to/workbook.xlsx", config=config)

For writing JSON, overlay PNGs, and extracted images:

from excel_info_region import run_and_write

run_and_write("path/to/workbook.xlsx", out_dir="outputs/regions")

Demo

The source repository includes a synthetic, non-sensitive workbook:

excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo

Example overlay:

[Image: Synthetic Excel region overlay]

Chart demo:

excel-regions --workbook examples/chart.xlsx --sheet "XXX Summary Northeast" --out outputs/chart_demo

Chart overlay:

[Image: Chart Excel region overlay]

Extracted chart preview:

[Image: Chart preview]

Output

outputs/chart_demo/
  info_regions_full.json
  info_regions_summary.json

  XXX Summary Northeast/
    info_regions.json
    info_regions.png
    charts/
      CHART001_E3_Q20_Chart_1.png

Sheet JSON:

{
  "sheet_name": "XXX Summary Northeast",
  "regions": [
    "A1:D15",
    "E3:Q20",
    "A25:M49"
  ],
  "images": [],
  "charts": [
    {
      "name": "Chart 1",
      "kind": "BarChart",
      "range_ref": "E3:Q20",
      "path": "charts/CHART001_E3_Q20_Chart_1.png",
      "sources": [
        {
          "role": "cat",
          "range_ref": "B25:M25",
          "cached_values": ["CCRX [Person_12]", "CCRX Newport", "..."]
        },
        {
          "role": "val",
          "range_ref": "B26:M26",
          "values": [["=AVERAGEIF(B27:B45,\">0\")", "..."]],
          "cached_values": [0.0, 2.0, 4.219298245614035, "..."]
        }
      ]
    }
  ]
}

regions is the list of detected Excel ranges. images records embedded image metadata. charts records chart metadata, source ranges, cached chart values when available, and a preview PNG path.
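Downstream code usually needs numeric bounds rather than range strings. The following is a minimal sketch, using only the standard library, that reads a sheet JSON of the shape shown above and converts each range into 1-based `(min_col, min_row, max_col, max_row)` bounds. The helper names `column_index` and `range_to_bounds` are hypothetical, and the inline JSON is a trimmed copy of the sample above, not real extractor output.

```python
import json
import re

_CELL = re.compile(r"^([A-Z]+)([0-9]+)$")


def column_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def range_to_bounds(range_ref):
    """Return (min_col, min_row, max_col, max_row) for a ref such as 'E3:Q20'."""
    start, _, end = range_ref.partition(":")
    end = end or start  # a single-cell ref covers exactly one cell
    corners = []
    for part in (start, end):
        match = _CELL.match(part)
        if match is None:
            raise ValueError(f"not an A1-style cell reference: {part!r}")
        corners.append((column_index(match.group(1)), int(match.group(2))))
    (c1, r1), (c2, r2) = corners
    return c1, r1, c2, r2


# Trimmed copy of the sample sheet JSON from this section.
SHEET_JSON = """
{
  "sheet_name": "XXX Summary Northeast",
  "regions": ["A1:D15", "E3:Q20", "A25:M49"],
  "images": [],
  "charts": [
    {"name": "Chart 1", "kind": "BarChart", "range_ref": "E3:Q20",
     "sources": [{"role": "cat", "range_ref": "B25:M25"},
                 {"role": "val", "range_ref": "B26:M26"}]}
  ]
}
"""

sheet = json.loads(SHEET_JSON)
bounds = {ref: range_to_bounds(ref) for ref in sheet["regions"]}
chart_sources = {
    chart["name"]: [(src["role"], src["range_ref"]) for src in chart["sources"]]
    for chart in sheet["charts"]
}
print(bounds)
print(chart_sources)
```

In this sample the chart's `range_ref` (E3:Q20) also appears in `regions`, so the chart anchor can be matched to its region by comparing the two strings.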

How It Works

Current extractor flow:

1. Calculate working bounds from non-empty cells, merged cells, and images
2. Collect non-empty cells as occupied cells
3. Find connected components from occupied cells
4. Convert each connected component to a rectangular bbox
5. Expand bboxes with border/table shell information
6. Merge some boxes that touch the same border component
7. Add images as separate regions
8. Output range refs such as A1:D10

Images are intentionally kept separate from cell connected components. This avoids over-merging drawings with nearby tables.
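Steps 2 to 4 and step 8 of the flow above can be sketched with the standard library alone. This is an illustration of the general technique, not the package's implementation: it assumes 4-connectivity (cells touch horizontally or vertically), which the description above does not specify, and it leaves out border expansion, border-contact merging, and image regions. All function names are hypothetical.

```python
from collections import deque


def column_letters(index):
    """Convert a 1-based column index to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def connected_components(occupied):
    """Group (row, col) cells that touch horizontally or vertically.

    Assumption: 4-connectivity. The real extractor may use a different rule.
    """
    remaining = set(occupied)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed cell
            row, col = queue.popleft()
            for neighbor in ((row - 1, col), (row + 1, col),
                             (row, col - 1), (row, col + 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def bbox_to_range(component):
    """Return the bounding box of a component as a range ref such as A1:D10."""
    rows = [r for r, _ in component]
    cols = [c for _, c in component]
    return (f"{column_letters(min(cols))}{min(rows)}:"
            f"{column_letters(max(cols))}{max(rows)}")


# Two separate blocks of occupied (row, col) cells: A1:B2 and D5:D6.
occupied = {(1, 1), (1, 2), (2, 1), (2, 2), (5, 4), (6, 4)}
regions = sorted(bbox_to_range(c) for c in connected_components(occupied))
print(regions)
```

Because each component is reduced to its bounding rectangle, an L-shaped group of cells yields a range that also covers the empty corner.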

Configuration

The packaged default config is loaded by:

load_config()

Common options:

{
  "include_values": true,
  "include_merged_cells": true,
  "include_images": true,
  "include_grouped_drawing_images": true,
  "include_charts": true,
  "include_chart_source_values": true,
  "respect_hidden_rows_cols": false,
  "use_print_area_bounds": false,
  "use_borders": true,
  "strong_borders_only": true,
  "use_border_contact_merge": true,
  "extract_embedded_images": true,
  "embedded_image_dir": "images",
  "extract_chart_images": true,
  "chart_image_dir": "charts"
}

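Set a font path if text in overlay PNGs renders as missing-glyph boxes or garbled characters, for example when cells contain Korean or other non-Latin text: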

{
  "visualization": {
    "font_path": "C:/Windows/Fonts/malgun.ttf"
  }
}

--no-overlay skips overlay PNG generation. Embedded image extraction still runs when extract_embedded_images is true.
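When only a few options differ from the defaults, a nested merge keeps the rest intact. The sketch below shows one generic way to apply overrides such as the font path to a dictionary of options. It is an assumption that the configuration can be handled as a plain nested dict; how `load_config()` accepts overrides is not documented here. `merge_config` is a hypothetical helper, and `DEFAULTS` is an illustrative subset, not the packaged defaults.

```python
import copy

# Illustrative subset of the options listed above. The "visualization"
# default of None is a placeholder assumption, not the packaged value.
DEFAULTS = {
    "use_borders": True,
    "strong_borders_only": True,
    "include_charts": True,
    "visualization": {"font_path": None},
}


def merge_config(base, overrides):
    """Return a copy of base with overrides applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys inside the nested dict are preserved.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


config = merge_config(
    DEFAULTS,
    {
        "include_charts": False,
        "visualization": {"font_path": "C:/Windows/Fonts/malgun.ttf"},
    },
)
print(config)
```

The merge returns a new dictionary and leaves `DEFAULTS` unchanged, so the same defaults can be reused for several runs.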

Project Structure

src/excel_info_region/
  cli.py             console entrypoint
  runner.py          writes JSON, overlay PNG, extracted images
  extractor.py       workbook/sheet orchestration
  cells.py           cell and merged-cell occupied logic
  borders.py         border expansion and border-contact merge
  components.py      connected components and bbox helpers
  image_regions.py   image anchors to region boxes
  image_export.py    embedded image extraction
  raw_drawing.py     raw xlsx DrawingML parsing
  visualize.py       overlay PNG renderer

Development

pytest
excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo --no-overlay

Run without --no-overlay when changing visualization or image extraction.

Private/local Excel samples are ignored:

examples/sample.xlsx
examples/sample2.xlsx

Notes

openpyxl does not calculate formulas. Overlay rendering loads the workbook with data_only=True, so a formula cell shows its calculated result only if Excel saved a cached value for it; without a cached value, openpyxl returns no value for that cell.

License

MIT
