A package for processing and analyzing raw document formats
Raw DOCX
A Python library that extends python-docx to convert Word documents into structured, traversable Python objects with export to dictionary, HTML, and plain text formats.
Installation
pip install raw_docx
Features
- Document hierarchy - Automatic section numbering with multi-level headings (1-6)
- Rich text - Colors, highlighting, bold, italic, superscript, and subscript
- Tables - Full support for merged cells (row and column spans)
- Nested lists - Arbitrary nesting depth with level tracking and numId boundary detection
- Indentation hierarchy - Infers nesting from indentation when all items share the same level
- Bookmarks and cross-references - Bookmark anchors and field-based references
- Image extraction - Extracts embedded images with base64 HTML embedding
- Multiple export formats - Dictionary, HTML, and plain text
- Search - Find text across sections, tables, and the full document
- Error tracking - Integrated logging via simple_error_log
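To make the indentation hierarchy feature more concrete, here is a minimal, standalone sketch of how nesting can be inferred from indentation alone. It is not raw_docx's actual implementation: the `Node` class, the `nest_by_indent` function, and the `(indent, text)` input shape are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical list item; not a raw_docx class."""
    text: str
    indent: float
    children: list["Node"] = field(default_factory=list)


def nest_by_indent(items: list[tuple[float, str]]) -> list[Node]:
    """Nest (indent, text) pairs so that a deeper item becomes a child
    of the nearest preceding item with a smaller indent."""
    roots: list[Node] = []
    stack: list[Node] = []  # path from a root down to the most recent item
    for indent, text in items:
        node = Node(text, indent)
        # Discard items that are at the same depth or deeper than this one.
        while stack and stack[-1].indent >= indent:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


# Indent values are arbitrary units (for example, points).
items = [(0, "Fruit"), (18, "Apple"), (18, "Pear"), (36, "Conference"), (0, "Vegetables")]
tree = nest_by_indent(items)
```

In this example `tree` holds two top-level nodes, "Fruit" and "Vegetables"; "Apple" and "Pear" are children of "Fruit", and "Conference" is a child of "Pear".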
Quick Start
from raw_docx import RawDocx
# Load and process a document
docx = RawDocx("path/to/document.docx", work_dir="/tmp/output")
# Disable indentation-based hierarchy inference if needed
# docx = RawDocx("path/to/document.docx", infer_indent_hierarchy=False)
# Access the structured document
document = docx.target_document
# Export to dictionary
data = docx.to_dict()
# Work with sections
section = document.section_by_title("Introduction")
paragraphs = section.paragraphs()
tables = section.tables()
lists = section.lists()
# Search for content
results = section.find("keyword")
# Generate HTML
html = section.to_html()
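The section calls above can be wrapped in a small helper. The sketch below assumes that `paragraphs()`, `tables()`, `lists()`, and `find()` each return a sized collection; the README does not state their return types, so treat that as a guess. `summarize_section` and `FakeSection` are hypothetical names, and `FakeSection` is only a stand-in so the example runs without a .docx file.

```python
def summarize_section(section, keyword: str) -> dict:
    """Collect simple counts from a section-like object.

    Assumption: each method returns something that supports len().
    """
    return {
        "paragraphs": len(section.paragraphs()),
        "tables": len(section.tables()),
        "lists": len(section.lists()),
        "matches": len(section.find(keyword)),
    }


class FakeSection:
    """Stand-in that mimics the method names shown in Quick Start."""

    def __init__(self, paragraphs, tables=(), lists=()):
        self._paragraphs = list(paragraphs)
        self._tables = list(tables)
        self._lists = list(lists)

    def paragraphs(self):
        return self._paragraphs

    def tables(self):
        return self._tables

    def lists(self):
        return self._lists

    def find(self, text):
        # Plain substring match; the real search behaviour may differ.
        return [p for p in self._paragraphs if text in p]


section = FakeSection(["A keyword appears here.", "Nothing to see."], tables=["t1"])
summary = summarize_section(section, "keyword")
```

With a real document, the object returned by `document.section_by_title("Introduction")` would take the place of `FakeSection`, provided the assumption about return types holds.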
Key Classes
| Class | Description |
|---|---|
| RawDocx | Main entry point; loads and processes a .docx file |
| RawDocument | Top-level container managing sections and hierarchy |
| RawSection | A document section/heading with its content |
| RawParagraph | A paragraph containing runs and bookmarks |
| RawRun | A text run with formatting attributes |
| RawTable / RawTableRow / RawTableCell | Table structure with merged cell support |
| RawList / RawListItem | Nested list structure |
| RawImage | Embedded image handling |
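For readers unfamiliar with base64 HTML embedding of images, the following standalone sketch shows the general technique using only the standard library. It does not use or describe RawImage's actual API; `image_to_img_tag` is a hypothetical helper written for illustration.

```python
import base64
import html
import mimetypes


def image_to_img_tag(data: bytes, filename: str, alt: str = "") -> str:
    """Build an <img> tag whose src is a base64 data URI."""
    # Guess the MIME type from the file extension, with a generic fallback.
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}" alt="{html.escape(alt)}">'


# The first four bytes of a PNG signature, used here as dummy image data.
tag = image_to_img_tag(b"\x89PNG", "logo.png", alt="Logo")
```

Embedding images this way keeps the exported HTML self-contained, at the cost of roughly a one-third increase in size over the raw image bytes.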
Requirements
- Python >= 3.12
- python-docx >= 1.1.2
- simple_error_log >= 0.6.0
License
MIT - see LICENSE for details.
Build and Release
pytest
ruff format
ruff check
python3 -m build --sdist --wheel
twine upload dist/*