Extract information-region bounding boxes from Excel sheets without semantic labels
Excel Region Extractor
Extract rectangular information regions from Excel workbooks and return them as range strings such as A1:D10.
The extractor uses cell values, merged cells, borders, embedded image anchors, and chart anchors. It can write sheet JSON, workbook summary JSON, optional overlay PNGs, extracted embedded image files, and simple chart preview PNGs.
Install
pip install excel-region-extractor
Install directly from GitHub:
pip install git+https://github.com/LampSeeker/ExcelRegionExtractor.git
For local development:
pip install -e .
CLI Usage
Run on your workbook:
excel-regions --workbook path/to/workbook.xlsx --out outputs/regions
Run one sheet:
excel-regions --workbook path/to/workbook.xlsx --sheet "Sheet1" --out outputs/sheet1
Skip overlay PNG generation:
excel-regions --workbook path/to/workbook.xlsx --out outputs/regions --no-overlay
The excel-info-regions command is kept as a backward-compatible alias for excel-regions.
Python API
from excel_info_region import extract_workbook_info_regions
from excel_info_region.config import load_config
config = load_config()
result = extract_workbook_info_regions("path/to/workbook.xlsx", config=config)
For writing JSON, overlay PNGs, and extracted images:
from excel_info_region import run_and_write
run_and_write("path/to/workbook.xlsx", out_dir="outputs/regions")
Demo
The source repository includes a synthetic, non-sensitive workbook:
excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo
Example overlay:
Chart demo:
excel-regions --workbook examples/chart.xlsx --sheet "XXX Summary Northeast" --out outputs/chart_demo
Chart overlay:
Extracted chart preview:
Output
outputs/chart_demo/
  info_regions_full.json
  info_regions_summary.json
  XXX Summary Northeast/
    info_regions.json
    info_regions.png
    charts/
      CHART001_E3_Q20_Chart_1.png
Sheet JSON:
{
  "sheet_name": "XXX Summary Northeast",
  "regions": [
    "A1:D15",
    "E3:Q20",
    "A25:M49"
  ],
  "images": [],
  "charts": [
    {
      "name": "Chart 1",
      "kind": "BarChart",
      "range_ref": "E3:Q20",
      "path": "charts/CHART001_E3_Q20_Chart_1.png",
      "sources": [
        {
          "role": "cat",
          "range_ref": "B25:M25",
          "cached_values": ["CCRX [Person_12]", "CCRX Newport", "..."]
        },
        {
          "role": "val",
          "range_ref": "B26:M26",
          "values": [["=AVERAGEIF(B27:B45,\">0\")", "..."]],
          "cached_values": [0.0, 2.0, 4.219298245614035, "..."]
        }
      ]
    }
  ]
}
regions is the list of detected Excel ranges. images records embedded image metadata. charts records chart metadata, source ranges, cached chart values when available, and a preview PNG path.
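As a rough illustration of consuming this output, the sketch below parses a trimmed copy of the sheet JSON and converts each range string into numeric bounds. The helper names (col_to_index, range_to_bounds, summarize) are hypothetical and not part of the package. The sketch also assumes that a chart's range_ref appears in regions, as it does in the sample above.

```python
import json
import re

# Trimmed sample in the shape shown above; real files are written under --out.
SHEET_JSON = """
{
  "sheet_name": "XXX Summary Northeast",
  "regions": ["A1:D15", "E3:Q20", "A25:M49"],
  "images": [],
  "charts": [{"name": "Chart 1", "kind": "BarChart", "range_ref": "E3:Q20"}]
}
"""

_CELL = re.compile(r"^([A-Z]+)([0-9]+)$")


def col_to_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def range_to_bounds(range_ref):
    """Return (min_col, min_row, max_col, max_row) for a ref such as 'A1:D10'."""
    start, _, end = range_ref.upper().partition(":")
    end = end or start  # a single-cell ref such as 'B2'
    corners = []
    for cell in (start, end):
        match = _CELL.match(cell)
        if match is None:
            raise ValueError(f"not an A1-style cell reference: {cell!r}")
        corners.append((col_to_index(match.group(1)), int(match.group(2))))
    (c1, r1), (c2, r2) = corners
    return (min(c1, c2), min(r1, r2), max(c1, c2), max(r1, r2))


def summarize(sheet):
    """List each region with its size and whether a chart is anchored on it."""
    chart_refs = {chart["range_ref"] for chart in sheet.get("charts", [])}
    rows = []
    for ref in sheet["regions"]:
        min_col, min_row, max_col, max_row = range_to_bounds(ref)
        rows.append({
            "range": ref,
            "width": max_col - min_col + 1,
            "height": max_row - min_row + 1,
            "is_chart": ref in chart_refs,
        })
    return rows


sheet = json.loads(SHEET_JSON)
for row in summarize(sheet):
    print(row)
```

If openpyxl is already installed, openpyxl.utils.range_boundaries does the same range-to-bounds conversion.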
How It Works
Current extractor flow:
1. Calculate working bounds from non-empty cells, merged cells, and images
2. Collect non-empty cells as occupied cells
3. Find connected components from occupied cells
4. Convert each connected component to a rectangular bbox
5. Expand bboxes with border/table shell information
6. Merge some boxes that touch the same border component
7. Add images as separate regions
8. Output range refs such as A1:D10
Images are intentionally kept separate from cell connected components. This avoids over-merging drawings with nearby tables.
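Steps 2 to 4 and step 8 can be sketched in a few lines. This is a minimal illustration of the general technique, not the package's code: it assumes 8-neighbour connectivity (the actual adjacency rule is not stated here), leaves out border expansion, merging, and images, and uses hypothetical function names.

```python
from collections import deque


def col_letters(index):
    """Convert a 1-based column index to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_range_ref(bbox):
    """Format (min_row, min_col, max_row, max_col) as a ref such as 'A1:D10'."""
    min_row, min_col, max_row, max_col = bbox
    return f"{col_letters(min_col)}{min_row}:{col_letters(max_col)}{max_row}"


def connected_bboxes(occupied):
    """Group occupied (row, col) cells into components and return their bboxes.

    Uses breadth-first search with 8-neighbour adjacency (an assumption).
    """
    remaining = set(occupied)
    boxes = []
    while remaining:
        seed = remaining.pop()
        queue = deque([seed])
        min_row = max_row = seed[0]
        min_col = max_col = seed[1]
        while queue:
            row, col = queue.popleft()
            # Grow the bounding box to cover the cell being visited.
            min_row, max_row = min(min_row, row), max(max_row, row)
            min_col, max_col = min(min_col, col), max(max_col, col)
            for d_row in (-1, 0, 1):
                for d_col in (-1, 0, 1):
                    neighbour = (row + d_row, col + d_col)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        queue.append(neighbour)
        boxes.append((min_row, min_col, max_row, max_col))
    return sorted(boxes)


# Two separate blocks of occupied cells: A1:B3 and an L-shape inside D6:E7.
occupied = {(r, c) for r in range(1, 4) for c in range(1, 3)}
occupied |= {(6, 4), (6, 5), (7, 5)}

print([to_range_ref(box) for box in connected_bboxes(occupied)])
# prints ['A1:B3', 'D6:E7']
```

The L-shaped component still yields a full rectangle, which is why the output ranges can contain empty cells.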
Configuration
The packaged default config is loaded by:
load_config()
Common options:
{
  "include_values": true,
  "include_merged_cells": true,
  "include_images": true,
  "include_grouped_drawing_images": true,
  "include_charts": true,
  "include_chart_source_values": true,
  "respect_hidden_rows_cols": false,
  "use_print_area_bounds": false,
  "use_borders": true,
  "strong_borders_only": true,
  "use_border_contact_merge": true,
  "extract_embedded_images": true,
  "embedded_image_dir": "images",
  "extract_chart_images": true,
  "chart_image_dir": "charts"
}
Set a font path if text is broken in overlay PNGs:
{
  "visualization": {
    "font_path": "C:/Windows/Fonts/malgun.ttf"
  }
}
--no-overlay skips overlay PNG generation. Embedded image extraction still runs when extract_embedded_images is true.
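One way to combine a few overrides with a set of defaults is a small recursive merge, sketched below. It works on plain dictionaries only. Whether load_config() returns a plain dict, and how overrides are meant to be passed to it, is not stated here, so merge_config and the trimmed DEFAULTS mapping are hypothetical stand-ins rather than package API.

```python
import copy
import json

# A trimmed copy of the options listed above, used as stand-in defaults.
DEFAULTS = {
    "include_images": True,
    "include_charts": True,
    "use_borders": True,
    "strong_borders_only": True,
    "extract_embedded_images": True,
    "embedded_image_dir": "images",
}


def merge_config(base, overrides):
    """Return a new dict with overrides applied on top of base.

    Nested dicts are merged key by key; any other value replaces the default.
    Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


overrides = {
    "use_borders": False,
    "visualization": {"font_path": "C:/Windows/Fonts/malgun.ttf"},
}

config = merge_config(DEFAULTS, overrides)
print(json.dumps(config, indent=2))
```

Merging into a copy keeps the defaults unchanged, so the same base can be reused for several workbooks with different overrides.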
Project Structure
src/excel_info_region/
  cli.py            console entrypoint
  runner.py         writes JSON, overlay PNG, extracted images
  extractor.py      workbook/sheet orchestration
  cells.py          cell and merged-cell occupied logic
  borders.py        border expansion and border-contact merge
  components.py     connected components and bbox helpers
  image_regions.py  image anchors to region boxes
  image_export.py   embedded image extraction
  raw_drawing.py    raw xlsx DrawingML parsing
  visualize.py      overlay PNG renderer
Development
pytest
excel-regions --workbook examples/synthetic_demo.xlsx --out outputs/demo --no-overlay
Run without --no-overlay when changing visualization or image extraction.
Private/local Excel samples are ignored:
examples/sample.xlsx
examples/sample2.xlsx
Notes
openpyxl does not calculate formulas. Overlay rendering uses data_only=True, so formula cells need cached values saved by Excel to show calculated results.
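To see what a cached value is, it helps to look at the worksheet XML inside an .xlsx file. A formula cell stores the formula text in an f element and the last result computed by Excel in a v element. With data_only=True, openpyxl returns the stored result, or None when no result was saved. The standard-library sketch below reads a small hand-written worksheet fragment; the sample XML and the formula_cells helper are illustrative only.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by worksheet parts in .xlsx files.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hand-written fragment: A1 was saved with a cached result, A2 was not.
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><f>SUM(B1:C1)</f><v>7</v></c>
      <c r="B1"><v>3</v></c>
      <c r="C1"><v>4</v></c>
    </row>
    <row r="2">
      <c r="A2"><f>SUM(B2:C2)</f></c>
    </row>
  </sheetData>
</worksheet>"""


def formula_cells(sheet_xml):
    """Map each formula cell ref to (formula text, cached value text or None)."""
    root = ET.fromstring(sheet_xml)
    found = {}
    for cell in root.iterfind(".//m:c", NS):
        formula = cell.find("m:f", NS)
        if formula is None:
            continue  # plain value cell, not a formula
        cached = cell.find("m:v", NS)
        found[cell.get("r")] = (
            formula.text,
            None if cached is None else cached.text,
        )
    return found


for ref, (formula, cached) in formula_cells(SHEET_XML).items():
    status = cached if cached is not None else "no cached value"
    print(f"{ref}: ={formula} -> {status}")
```

Workbooks written by a library that does not calculate formulas typically look like cell A2 until they are opened and saved in Excel.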
License
MIT
Download files
File details
Details for the file excel_region_extractor-0.1.3.tar.gz.
File metadata
- Download URL: excel_region_extractor-0.1.3.tar.gz
- Upload date:
- Size: 32.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d79ff8ca7cef4ca2dd41d8916619febaaa383fd3a9c52fde3a5489cdd4b2432 |
| MD5 | 79e33d7c45dc5630b45167a47eade2c8 |
| BLAKE2b-256 | 563986864cdf53a3932cf7ffafa5bcc59e80a0692d07bd8e4f5dcac32965cf11 |
File details
Details for the file excel_region_extractor-0.1.3-py3-none-any.whl.
File metadata
- Download URL: excel_region_extractor-0.1.3-py3-none-any.whl
- Upload date:
- Size: 35.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a98dbf28fb7dc9b442c113111dea39ded9de9a4911871a5870e9e88166ee545b |
| MD5 | 9460452e525965791b41b4434e8a0d3c |
| BLAKE2b-256 | b4bf7c0cd6ff321f643edb04bd9bb3a5b4eb091a133eb700ccf40ec98f5de9f8 |