A package for processing and analyzing raw document formats

Raw DOCX

A Python library that extends python-docx to convert Word documents into structured, traversable Python objects with export to dictionary, HTML, and plain text formats.

Installation

pip install raw_docx

Features

  • Document hierarchy - Automatic section numbering with multi-level headings (1-6)
  • Rich text - Colors, highlighting, bold, italic, superscript, and subscript
  • Tables - Full support for merged cells (row and column spans)
  • Nested lists - Arbitrary nesting depth with level tracking and numId boundary detection
  • Indentation hierarchy - Infers nesting from indentation when all items share the same level
  • Bookmarks and cross-references - Bookmark anchors and field-based references
  • Image extraction - Extracts embedded images with base64 HTML embedding
  • Multiple export formats - Dictionary, HTML, and plain text
  • Search - Find text across sections, tables, and the full document
  • Error tracking - Integrated logging via simple_error_log
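
The indentation-hierarchy feature can be pictured with a short sketch. This is not raw_docx's implementation; it is a minimal illustration, assuming each list item has been reduced to a (text, left indent) pair and that a larger indent means deeper nesting. The function name infer_levels and the input shape are hypothetical.

```python
from typing import List, Tuple


def infer_levels(items: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Assign a nesting level to each (text, indent) pair.

    Hypothetical helper: the distinct indent values are ranked, and an
    item's level is the rank of its indent (smallest indent -> level 0).
    """
    # Rank the distinct indents so each one maps to a stable level.
    distinct_indents = sorted({indent for _, indent in items})
    ranks = {indent: level for level, indent in enumerate(distinct_indents)}
    return [(text, ranks[indent]) for text, indent in items]


# Example input: indents are arbitrary units; only their ordering matters.
items = [("Fruit", 0), ("Apple", 360), ("Pear", 360), ("Conference", 720), ("Veg", 0)]
levels = infer_levels(items)
```

Ranking distinct indents, rather than dividing by a fixed step, keeps the sketch independent of whatever indent unit a document happens to use.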

Quick Start

from raw_docx import RawDocx

# Load and process a document
docx = RawDocx("path/to/document.docx", work_dir="/tmp/output")

# Disable indentation-based hierarchy inference if needed
# docx = RawDocx("path/to/document.docx", infer_indent_hierarchy=False)

# Access the structured document
document = docx.target_document

# Export to dictionary
data = docx.to_dict()

# Work with sections
section = document.section_by_title("Introduction")
paragraphs = section.paragraphs()
tables = section.tables()
lists = section.lists()

# Search for content
results = section.find("keyword")

# Generate HTML
html = section.to_html()
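
The section calls above can be gathered into one small helper. This is a sketch under stated assumptions: summarize_section is a hypothetical name, not part of raw_docx, and it assumes that paragraphs(), tables(), lists(), and find() each return a sized collection, which the README does not state.

```python
from typing import Any, Dict


def summarize_section(section: Any, keyword: str) -> Dict[str, Any]:
    """Collect content counts and an HTML rendering for one section.

    `section` is expected to expose paragraphs(), tables(), lists(),
    find() and to_html(), as used in the Quick Start. Treating the first
    four results as sized collections is an assumption.
    """
    return {
        "paragraphs": len(section.paragraphs()),
        "tables": len(section.tables()),
        "lists": len(section.lists()),
        "matches": len(section.find(keyword)),
        "html": section.to_html(),
    }
```

With the Quick Start objects, it could be called as summarize_section(section, "keyword").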

Key Classes

Class                                  Description
RawDocx                                Main entry point; loads and processes a .docx file
RawDocument                            Top-level container managing sections and hierarchy
RawSection                             A document section/heading with its content
RawParagraph                           A paragraph containing runs and bookmarks
RawRun                                 A text run with formatting attributes
RawTable / RawTableRow / RawTableCell  Table structure with merged cell support
RawList / RawListItem                  Nested list structure
RawImage                               Embedded image handling
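
The base64 HTML embedding listed under Features can be illustrated with the standard library alone. This is a sketch of the general technique (a data: URI inside an img tag), not RawImage's actual code; the function names and the default MIME type are assumptions.

```python
import base64
from html import escape


def image_to_data_uri(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data: URI (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_to_img_tag(data: bytes, mime_type: str = "image/png", alt: str = "") -> str:
    """Wrap the data URI in an <img> element, escaping the alt text."""
    src = image_to_data_uri(data, mime_type)
    return f'<img src="{src}" alt="{escape(alt, quote=True)}"/>'
```

Embedding the bytes inline keeps the exported HTML self-contained, at the cost of roughly a one-third size increase from base64 encoding.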

Requirements

License

MIT - see LICENSE for details.

Build and Release

pytest
ruff format
ruff check
python3 -m build --sdist --wheel
twine upload dist/*
