A package for processing and analyzing raw document formats

Raw DOCX

A Python library that extends python-docx to convert Word documents into structured, traversable Python objects with export to dictionary, HTML, and plain text formats.

Installation

pip install raw_docx

Features

  • Document hierarchy - Automatic section numbering across heading levels 1-6
  • Rich text - Colors, highlighting, bold, italic, superscript, and subscript
  • Tables - Full support for merged cells (row and column spans)
  • Nested lists - Arbitrary nesting depth with level tracking
  • Bookmarks and cross-references - Bookmark anchors and field-based references
  • Image extraction - Extracts embedded images and embeds them in HTML output as base64
  • Multiple export formats - Dictionary, HTML, and plain text
  • Search - Find text across sections, tables, and the full document
  • Error tracking - Integrated logging via simple_error_log
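
To make the "automatic section numbering" feature concrete, here is a minimal standalone sketch of how dotted section numbers can be derived from a sequence of heading levels. It is an illustration of the general technique only, not the library's implementation; the function name number_headings and the (level, title) input shape are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def number_headings(headings: Iterable[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Assign dotted section numbers (e.g. "2.1") to (level, title) pairs."""
    counters = [0] * 6  # one counter per heading level, 1 through 6
    numbered = []
    for level, title in headings:
        if not 1 <= level <= 6:
            raise ValueError(f"heading level must be 1-6, got {level}")
        counters[level - 1] += 1
        # A new heading resets the counters of all deeper levels.
        for deeper in range(level, 6):
            counters[deeper] = 0
        number = ".".join(str(n) for n in counters[:level])
        numbered.append((number, title))
    return numbered


outline = [(1, "Introduction"), (2, "Scope"), (2, "Audience"), (1, "Methods"), (2, "Design")]
for number, title in number_headings(outline):
    print(number, title)
```

In this sketch a heading that skips a level (for example a level 3 directly under a level 1) gets a zero in the skipped position, such as 1.0.1. How raw_docx treats that case is not stated here.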

Quick Start

from raw_docx import RawDocx

# Load and process a document
docx = RawDocx("path/to/document.docx")

# Access the structured document
document = docx.target_document

# Export to dictionary
data = docx.to_dict()

# Work with sections
section = document.section_by_title("Introduction")
paragraphs = section.paragraphs()
tables = section.tables()
lists = section.lists()

# Search for content
results = section.find("keyword")

# Generate HTML
html = section.to_html()
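
The Quick Start assumes that a section titled "Introduction" exists. A small wrapper like the one below can make the lookup safer. This is a sketch under one stated assumption: that section_by_title returns None when no section matches, which is not confirmed here. The helper name section_html_or_none is hypothetical; it only calls section_by_title and to_html, both shown above.

```python
from typing import Any, Optional


def section_html_or_none(document: Any, title: str) -> Optional[str]:
    """Return the HTML for the section titled `title`, or None if not found.

    Assumption: `section_by_title` returns None when no section matches.
    """
    section = document.section_by_title(title)
    if section is None:
        return None
    return section.to_html()
```

With a loaded document this would be called as section_html_or_none(docx.target_document, "Introduction"). If the library raises an exception for a missing title instead, a try/except around the lookup would be needed.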

Key Classes

Class                                  Description
RawDocx                                Main entry point; loads and processes a .docx file
RawDocument                            Top-level container managing sections and hierarchy
RawSection                             A document section/heading with its content
RawParagraph                           A paragraph containing runs and bookmarks
RawRun                                 A text run with formatting attributes
RawTable / RawTableRow / RawTableCell  Table structure with merged cell support
RawList / RawListItem                  Nested list structure
RawImage                               Embedded image handling
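
The table classes are described as supporting merged cells. As an illustration of what that involves when exporting to HTML, the sketch below renders cells that carry row and column spans. The Cell dataclass and table_to_html function are hypothetical stand-ins written for this example; they are not the library's RawTableCell or its export code.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Cell:
    """Hypothetical cell model; not the library's RawTableCell."""
    text: str
    row_span: int = 1
    col_span: int = 1


def table_to_html(rows: List[List[Cell]]) -> str:
    """Render rows of cells as an HTML table, emitting span attributes only when > 1."""
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            attrs = ""
            if cell.row_span > 1:
                attrs += f' rowspan="{cell.row_span}"'
            if cell.col_span > 1:
                attrs += f' colspan="{cell.col_span}"'
            # Escape cell text so characters like & and < stay valid HTML.
            parts.append(f"<td{attrs}>{escape(cell.text)}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


rows = [[Cell("Header", col_span=2)], [Cell("a"), Cell("b & c")]]
print(table_to_html(rows))
```

The sketch expects each row to list only the cells that start in that row; positions covered by a span from an earlier cell are simply omitted, which is how HTML tables represent merges.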

Requirements

License

MIT - see LICENSE for details.

Build and Release

pytest
ruff format
ruff check
python3 -m build --sdist --wheel
twine upload dist/*
