Extract images from markdown files and highlight text chunks with bounding boxes
Markitdown Reference Image
A Python package for extracting images from markdown files and highlighting specific text chunks with bounding boxes.
Features
- Find specific text chunks in markdown content
- Convert markdown to HTML and capture it as an image
- Draw bounding boxes around found text chunks
- Optionally add scores to the bounding boxes
- Save processed images to specified paths or temporary files
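How the package locates a chunk internally is not documented here. The sketch below shows one plausible approach to the text-finding step: a whitespace-insensitive, case-insensitive search. The helper name `find_chunk` is hypothetical and is not part of the package's API.

```python
import re
from typing import Optional, Tuple


def find_chunk(markdown_text: str, chunk_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of chunk_text in markdown_text, or None.

    Hypothetical helper: any run of whitespace in the chunk matches any run
    of whitespace in the document, so line wrapping does not break the match.
    """
    words = chunk_text.split()
    if not words:
        return None
    # Escape each word, then allow any whitespace run between words.
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, markdown_text, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None


document = "# Notes\n\nThis is important\ninformation that we want to highlight.\n"
span = find_chunk(document, "important information")
```

Matching on normalized whitespace is a design choice for this sketch: chunks returned by a retriever often differ from the source file only in line breaks.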
Installation
pip install markitdown-reference-image
Usage
Basic Usage
Extract and highlight text from a markdown file:
from markitdown_reference_image import MarkitdownImageExtractor
# Initialize the extractor
extractor = MarkitdownImageExtractor()
# Extract image with highlighted text
image_path = extractor.extract_with_highlight(
    markdown_file="test_document.md",
    chunk_text="important information that we want to highlight",
    output_path="assets/example_basic.png",
    score=0.95,
)
print(f"Image saved to: {image_path}")
With Score Display
Add a similarity score to show retrieval confidence (great for RAG systems):
image_path = extractor.extract_with_highlight(
    markdown_file="test_document.md",
    chunk_text="Text Finding: Find specific text chunks in markdown content",
    output_path="assets/example_with_score.png",
    score=0.88,  # RAG similarity score
)
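In a RAG pipeline the same call is typically repeated for each retrieved chunk. The sketch below is one way to do that, using only the keyword arguments shown above. The helper name `highlight_top_hits`, the `hit_<rank>.png` naming scheme, and the `min_score` threshold are illustrative assumptions, not part of the package.

```python
from typing import Any, Iterable, List, Tuple


def highlight_top_hits(
    extractor: Any,
    hits: Iterable[Tuple[str, str, float]],
    output_dir: str = "assets",
    min_score: float = 0.5,
) -> List[str]:
    """Render one highlighted image per retrieval hit at or above min_score.

    `extractor` is expected to be a MarkitdownImageExtractor; `hits` holds
    (markdown_file, chunk_text, score) tuples from a retriever.
    """
    image_paths = []
    ranked = sorted(hits, key=lambda hit: hit[2], reverse=True)
    for rank, (markdown_file, chunk_text, score) in enumerate(ranked, start=1):
        if score < min_score:
            break  # hits are sorted, so every later score is lower too
        image_paths.append(
            extractor.extract_with_highlight(
                markdown_file=markdown_file,
                chunk_text=chunk_text,
                output_path=f"{output_dir}/hit_{rank}.png",
                score=score,
            )
        )
    return image_paths
```

The extractor is passed in as a parameter so that one instance can be reused across all hits.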
Custom Styling
Customize colors and appearance to match your brand:
image_path = extractor.extract_with_highlight(
    markdown_file="test_document.md",
    chunk_text="Image Extraction",
    output_path="assets/example_custom_styling.png",
    score=0.92,
    box_color=(0, 255, 0),       # Green box
    box_width=5,                 # Thicker border
    score_color=(255, 255, 0),   # Yellow text
    score_bg_color=(0, 128, 0),  # Dark green background
)
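The styling arguments can also be derived from the score itself, so that weak matches stand out visually. The sketch below builds a dictionary of the four styling keyword arguments shown above. The helper name `style_for_score`, the thresholds, and the colors are arbitrary illustrative choices.

```python
from typing import Dict, Tuple, Union

Color = Tuple[int, int, int]


def style_for_score(score: float) -> Dict[str, Union[Color, int]]:
    """Map a similarity score to styling keyword arguments (RGB tuples)."""
    if score >= 0.8:
        box, background = (0, 255, 0), (0, 128, 0)     # green: strong match
    elif score >= 0.5:
        box, background = (255, 165, 0), (128, 82, 0)  # orange: moderate match
    else:
        box, background = (255, 0, 0), (128, 0, 0)     # red: weak match
    return {
        "box_color": box,
        "box_width": 5,
        "score_color": (255, 255, 255),  # white score text
        "score_bg_color": background,
    }
```

The result can be unpacked into the call, for example `extractor.extract_with_highlight(..., score=0.92, **style_for_score(0.92))`.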
Using Temporary Output
If output_path is omitted, the image is written to a temporary file and its path is returned:
# Temporary file will be created automatically
image_path = extractor.extract_with_highlight(
    markdown_file="test_document.md",
    chunk_text="visual references from markdown",
    # No output_path specified - uses temporary file
)
print(f"Temporary image: {image_path}")
# Example: /tmp/tmp8x3k9m2p.png
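How the package creates its temporary file is not shown here. A minimal sketch of such a fallback, using the standard library's `tempfile` module, could look like the following. The helper name `resolve_output_path` is hypothetical.

```python
import os
import tempfile
from typing import Optional


def resolve_output_path(output_path: Optional[str] = None) -> str:
    """Return output_path, or the path of a fresh temporary .png file."""
    if output_path is not None:
        return output_path
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it later.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as handle:
        return handle.name


temp_path = resolve_output_path()
explicit_path = resolve_output_path("assets/example.png")
os.remove(temp_path)  # clean up the temporary file when done
```

Whatever mechanism the package uses, a temporary image that persists after the call is not cleaned up automatically, so long-running services may want to delete it once it has been displayed or uploaded.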
Command Line Interface
Quick testing and automation via CLI:
# Basic usage
markitdown-extract test_document.md "important information" -o output.png
# With score
markitdown-extract test_document.md "Text Finding" -o output.png -s 0.88
# Custom styling
markitdown-extract test_document.md "Image Extraction" \
    -o output.png \
    -s 0.92 \
    --box-color 0 255 0 \
    --box-width 5 \
    --score-color 255 255 0 \
    --score-bg-color 0 128 0
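To make the flag shapes in the commands above concrete, here is a sketch of an `argparse` parser that accepts the same arguments. It is a reconstruction from the examples only, not the package's actual parser: the destination names, defaults, and the helper names `build_parser` and `parse_cli` are assumptions.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="markitdown-extract")
    parser.add_argument("markdown_file", help="path to the markdown file")
    parser.add_argument("chunk_text", help="text chunk to highlight")
    parser.add_argument("-o", dest="output_path", help="output image path")
    parser.add_argument("-s", dest="score", type=float, help="score to display")
    # Each color flag takes exactly three integers.
    for flag in ("--box-color", "--score-color", "--score-bg-color"):
        parser.add_argument(flag, type=int, nargs=3, metavar=("R", "G", "B"))
    parser.add_argument("--box-width", type=int, help="border width")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Parse argv into keyword arguments, dropping options that were not set."""
    options = vars(build_parser().parse_args(argv))
    return {
        key: tuple(value) if isinstance(value, list) else value
        for key, value in options.items()
        if value is not None
    }
```

Because the keys mirror the keyword arguments of `extract_with_highlight`, the resulting dictionary could be unpacked straight into that call.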
Examples
The package includes comprehensive examples in the markitdown_reference_image/examples/ directory:
Quick Example Runner
python run_examples.py
Individual Examples
# Basic usage examples
python -m markitdown_reference_image.examples.basic_extraction
python -m markitdown_reference_image.examples.with_score
python -m markitdown_reference_image.examples.custom_styling
# Advanced usage examples
python -m markitdown_reference_image.examples.batch_processing
python -m markitdown_reference_image.examples.component_usage
python -m markitdown_reference_image.examples.error_handling
# Command-line examples
python -m markitdown_reference_image.examples.cli_basic
python -m markitdown_reference_image.examples.cli_with_score
python -m markitdown_reference_image.examples.cli_custom_styling
Available Examples:
Basic Usage:
- basic_extraction.py - Basic image extraction
- with_score.py - Adding scores to bounding boxes
- custom_styling.py - Custom styling options
Advanced Usage:
- batch_processing.py - Batch processing multiple files
- component_usage.py - Using individual components
- error_handling.py - Proper error handling
Command Line:
- cli_basic.py - Basic CLI usage
- cli_with_score.py - CLI with score display
- cli_custom_styling.py - CLI with custom styling
Development
Setup
- Clone the repository:
git clone https://github.com/Naveenkumarar/markitdown-reference-image.git
cd markitdown-reference-image
- Install in development mode:
pip install -e .
- Install development dependencies:
pip install -r requirements-dev.txt
Running Tests
pytest
Code Formatting
black .
Linting
flake8 .
mypy .
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for your changes
- Run the test suite
- Submit a pull request
Changelog
See CHANGELOG.md for a list of changes.