A package for processing and analyzing raw document formats
Raw DOCX
A Python library that extends python-docx to convert Word documents into structured, traversable Python objects with export to dictionary, HTML, and plain text formats.
Installation

    pip install raw_docx
Features
- Document hierarchy - Automatic section numbering with multi-level headings (1-6)
- Rich text - Colors, highlighting, bold, italic, superscript, and subscript
- Tables - Full support for merged cells (row and column spans)
- Nested lists - Arbitrary nesting depth with level tracking
- Bookmarks and cross-references - Bookmark anchors and field-based references
- Image extraction - Extracts embedded images with base64 HTML embedding
- Multiple export formats - Dictionary, HTML, and plain text
- Search - Find text across sections, tables, and the full document
- Error tracking - Integrated logging via simple_error_log
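
To make the "automatic section numbering" feature concrete, here is a minimal sketch of how dotted section numbers can be derived from heading levels 1-6. It is an illustration only and is not taken from the raw_docx source; the function name `number_headings` and the `(level, title)` input shape are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def number_headings(headings: Iterable[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Assign dotted section numbers (such as "1.2") to (level, title) pairs.

    Hypothetical helper for illustration; not part of the raw_docx API.
    """
    counters = [0] * 6  # one counter per heading level 1-6
    numbered = []
    for level, title in headings:
        if not 1 <= level <= 6:
            raise ValueError(f"heading level must be 1-6, got {level}")
        # Advance the counter for this level.
        counters[level - 1] += 1
        # A new heading restarts numbering for every deeper level.
        for deeper in range(level, 6):
            counters[deeper] = 0
        number = ".".join(str(count) for count in counters[:level])
        numbered.append((number, title))
    return numbered


outline = [(1, "Introduction"), (2, "Scope"), (2, "Terms"), (1, "Methods"), (2, "Design")]
for number, title in number_headings(outline):
    print(number, title)
```

Note that a skipped level (a level 3 heading directly under level 1) produces a zero component such as `1.0.1` in this sketch; how raw_docx itself handles that case is not stated here.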
Quick Start

    from raw_docx import RawDocx

    # Load and process a document
    docx = RawDocx("path/to/document.docx")

    # Access the structured document
    document = docx.target_document

    # Export to dictionary
    data = docx.to_dict()

    # Work with sections
    section = document.section_by_title("Introduction")
    paragraphs = section.paragraphs()
    tables = section.tables()
    lists = section.lists()

    # Search for content
    results = section.find("keyword")

    # Generate HTML
    html = section.to_html()
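
The Quick Start calls can be wrapped in a small helper that summarizes one section. This is a sketch under stated assumptions: it uses only the methods shown above, `summarize_section` is a hypothetical name, and it assumes that `section_by_title` returns `None` when no section matches and that `paragraphs()`, `tables()`, and `lists()` return iterables. Check those assumptions against the installed version before relying on them.

```python
from typing import Any, Dict, Optional


def summarize_section(document: Any, title: str) -> Optional[Dict[str, Any]]:
    """Collect item counts and HTML for one section (hypothetical helper)."""
    section = document.section_by_title(title)
    if section is None:  # assumption: a missing title yields None
        return None
    return {
        "title": title,
        # list() makes no assumption beyond the results being iterable
        "paragraphs": len(list(section.paragraphs())),
        "tables": len(list(section.tables())),
        "lists": len(list(section.lists())),
        "html": section.to_html(),
    }
```

With a loaded document, the call would look like `summarize_section(RawDocx("path/to/document.docx").target_document, "Introduction")`.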
Key Classes
| Class | Description |
|---|---|
| RawDocx | Main entry point; loads and processes a .docx file |
| RawDocument | Top-level container managing sections and hierarchy |
| RawSection | A document section/heading with its content |
| RawParagraph | A paragraph containing runs and bookmarks |
| RawRun | A text run with formatting attributes |
| RawTable / RawTableRow / RawTableCell | Table structure with merged cell support |
| RawList / RawListItem | Nested list structure |
| RawImage | Embedded image handling |
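
As an illustration of what "merged cell support" involves, the sketch below computes row and column spans from a grid in which every position covered by a merged cell carries the same identifier. It is not the raw_docx implementation; `cell_spans` and the grid representation are assumptions for this example, and each unmerged cell is assumed to have a unique identifier.

```python
from typing import Dict, Hashable, List, Tuple


def cell_spans(grid: List[List[Hashable]]) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map the top-left (row, col) of each cell to its (rowspan, colspan).

    Hypothetical helper: assumes merged regions are rectangular and that
    distinct cells never share an identifier.
    """
    seen = set()
    spans = {}
    for r, row in enumerate(grid):
        for c, key in enumerate(row):
            if (r, c) in seen:
                continue  # already covered by an earlier merged region
            # Extend to the right while the identifier repeats.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == key:
                colspan += 1
            # Extend downward while the whole width repeats in the next row.
            rowspan = 1
            while r + rowspan < len(grid) and all(
                c + dc < len(grid[r + rowspan]) and grid[r + rowspan][c + dc] == key
                for dc in range(colspan)
            ):
                rowspan += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    seen.add((r + dr, c + dc))
            spans[(r, c)] = (rowspan, colspan)
    return spans


# "A" spans two columns in the first row; "B" spans two rows in the last column.
grid = [["A", "A", "B"], ["C", "D", "B"]]
print(cell_spans(grid))
```

The resulting spans correspond directly to the `rowspan` and `colspan` attributes an HTML export would need on each `<td>`.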
Requirements
- Python >= 3.8
- python-docx >= 1.1.2
- simple_error_log >= 0.6.0
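
The minimum versions above can be checked at runtime with the standard library. This is a minimal sketch: `parse_version` and `check_requirements` are hypothetical helpers, the version parsing is deliberately simplified (it reads only the leading digits of the first three components), and it assumes the installed distribution names match the names listed above.

```python
import sys
from importlib import metadata
from itertools import takewhile
from typing import List, Tuple

# Minimums copied from the Requirements list above.
MINIMUMS = {"python-docx": (1, 1, 2), "simple_error_log": (0, 6, 0)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "1.1.2" (or "0.6.0rc1") into a three-part integer tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so "1.2" compares as (1, 2, 0)
        parts.append(0)
    return tuple(parts)


def check_requirements() -> List[str]:
    """Return a list of human-readable problems; empty means all minimums are met."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} >= {wanted} is required")
    return problems


for problem in check_requirements():
    print(problem)
```

For full PEP 440 semantics (pre-releases, epochs, local versions), a dedicated version parser is a better fit than this simplified tuple comparison.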
License
MIT - see LICENSE for details.
Build and Release

    pytest
    ruff format
    ruff check
    python3 -m build --sdist --wheel
    twine upload dist/*