🏗️ Doxstrux

Document structure extraction tool for markdown, with extensibility to PDF and HTML.

Extract hierarchical structure, metadata, and content from documents without semantic analysis. Built for RAG pipelines, documentation analysis, and AI preprocessing.

✨ Features

Zero-regex parsing: Token-based extraction using markdown-it-py
Security-first design: Three security profiles (strict/moderate/permissive)
Document IR: Clean intermediate representation for RAG chunking
Structure extraction: Headings, lists, tables, code blocks, links, images
Content integrity: Parse without mutation, fail-closed security
Extensible architecture: Ready for PDF and HTML support

📦 Installation

pip install doxstrux

🚀 Quick Start

from doxstrux.markdown_parser_core import MarkdownParserCore

# Basic usage
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()

# Access structure
print(result['structure']['headings'])
print(result['metadata']['security']['statistics'])

# With security profile
parser = MarkdownParserCore(content, security_profile='strict')
result = parser.parse()

# With custom config
parser = MarkdownParserCore(
    content,
    config={
        'preset': 'gfm',
        'plugins': ['table', 'strikethrough'],
        'allows_html': False
    },
    security_profile='moderate'
)
result = parser.parse()

🏗️ Architecture

Core Principles

Extract everything, analyze nothing: Focus on structural extraction, not semantics
No file I/O in core: Parser accepts content strings, not paths
Plain dict outputs: Lightweight, no heavy dependencies
Security layered throughout: Size limits, plugin validation, content sanitization
Modular extractors (Phase 7): 11 specialized modules with dependency injection
Single responsibility: Each extractor handles one markdown element type

Security Profiles

Profile	Max Size	Max Lines	Recursion Depth	Use Case
strict	100KB	2K	50	Untrusted input
moderate	1MB	10K	100	Standard use (default)
permissive	10MB	50K	150	Trusted documents
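
The limits in the table can be read as a simple pre-check on input. The sketch below is illustrative only: `Limits`, `PROFILE_LIMITS`, and `fits_profile` are hypothetical names, not doxstrux APIs. It assumes 1KB = 1024 bytes and that size is measured on the UTF-8 encoded content, neither of which the table specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_bytes: int
    max_lines: int
    max_depth: int


# Values copied from the table; treating 1KB as 1024 bytes is an assumption.
PROFILE_LIMITS = {
    "strict": Limits(100 * 1024, 2_000, 50),
    "moderate": Limits(1024 * 1024, 10_000, 100),
    "permissive": Limits(10 * 1024 * 1024, 50_000, 150),
}


def fits_profile(content: str, profile: str = "moderate") -> bool:
    """Return True if content is within the size and line limits of a profile.

    Recursion depth is not checked here: it depends on the parsed structure,
    not on the raw text.
    """
    limits = PROFILE_LIMITS[profile]
    size = len(content.encode("utf-8"))
    lines = len(content.splitlines())
    return size <= limits.max_bytes and lines <= limits.max_lines
```

A caller could use such a check to pick the tightest profile an input fits before constructing the parser.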

Document IR

Clean intermediate representation for RAG pipelines and chunking:

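from doxstrux.markdown_parser_core import MarkdownParserCore
from doxstrux.markdown.ir import DocumentIR, ChunkPolicy

# Parse to IR
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()
doc_ir = DocumentIR.from_parse_result(result)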

# Apply chunking policy
policy = ChunkPolicy(
    max_chunk_tokens=512,
    overlap_tokens=50,
    respect_boundaries=['heading', 'section']
)
chunks = doc_ir.chunk(policy)
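
To make `max_chunk_tokens` and `overlap_tokens` concrete, here is a minimal sliding-window sketch. It is not the doxstrux chunker: it assumes the caller has already split the text into tokens (for example on whitespace), it ignores `respect_boundaries`, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(
    tokens: list[str],
    max_chunk_tokens: int = 512,
    overlap_tokens: int = 50,
) -> list[list[str]]:
    """Split tokens into windows that share overlap_tokens with the previous one."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    if not 0 <= overlap_tokens < max_chunk_tokens:
        raise ValueError("overlap_tokens must be in [0, max_chunk_tokens)")

    step = max_chunk_tokens - overlap_tokens  # how far each window advances
    chunks: list[list[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_chunk_tokens])
        if start + max_chunk_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

With the values from the policy above, each window advances by 512 - 50 = 462 tokens, so the second chunk starts at token 462 and repeats the last 50 tokens of the first.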

🧪 Testing

# Run all tests
pytest

# With coverage
pytest --cov=src/doxstrux

# Type checking
mypy src/doxstrux

# Linting
ruff check src/ tests/

📊 Project Status

Version: 0.2.1 ✅ Published on PyPI
Python: 3.12+
Test Coverage: 69% (working toward 80% target)
Tests: 95/95 pytest passing + 542/542 baseline tests passing
Regex Count: 0 (zero-regex architecture)
Core Parser: 1944 lines (reduced from 2900, -33%)
PyPI: https://pypi.org/project/doxstrux/

Phase 7: Modular Architecture ✅ COMPLETE

Completed: Full modularization of parser into 11 specialized extractors

✅ 7.0.5: Rename from docpipe to doxstrux
✅ 7.1: Create namespace structure
✅ 7.2: Move existing modules to new namespace
✅ 7.3: Extract line & text utilities
✅ 7.4: Extract configuration & budgets
✅ 7.5: Extract simple extractors (media, footnotes, blockquotes, html)
✅ 7.6: Extract complex extractors (lists, codeblocks, tables, links, sections, paragraphs)

Achievements:

Core parser reduced by 33% (2900 → 1944 lines)
11 specialized extractor modules created
100% baseline test parity maintained
Clean dependency injection pattern throughout
Zero behavioral changes (byte-for-byte output identical)

🗺️ Roadmap

Phase 7: Modular architecture ✅ COMPLETE
Phase 8: Enhanced testing & documentation
PDF support: Extract structure from PDF documents
HTML support: Parse HTML with same IR
Enhanced chunking: Semantic-aware chunking strategies
Performance: Cython optimization for hot paths

📚 Documentation

Architecture: See CLAUDE.md for detailed architecture notes
Phase 7 Plan: See regex_refactor_docs/DETAILED_TASK_LIST.md
Testing: See regex_refactor_docs/REGEX_REFACTOR_POLICY_GATES.md

🤝 Contributing

This project follows a phased refactoring methodology with comprehensive test gates.

All changes must pass the full pytest suite (95/95 passing as of 0.2.1)
All changes must maintain byte-for-byte output parity (542 baseline tests)
Security-first: No untrusted regex, validated links, sanitized HTML
Type-safe: Full mypy strict mode compliance

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built on:

markdown-it-py - CommonMark compliant parser
mdit-py-plugins - Extended markdown features

Previous name: docpipe (renamed to doxstrux in v0.2.0 for extensibility to PDF/HTML)

Release history

0.2.1 (this version): Oct 12, 2025
0.2.0: Oct 12, 2025

Download files

Source Distribution: doxstrux-0.2.1.tar.gz (65.3 kB), uploaded Oct 12, 2025, Source
Built Distribution: doxstrux-0.2.1-py3-none-any.whl (62.6 kB), uploaded Oct 12, 2025, Python 3

File details

Details for the file doxstrux-0.2.1.tar.gz.

File metadata

Download URL: doxstrux-0.2.1.tar.gz
Upload date: Oct 12, 2025
Size: 65.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for doxstrux-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`c8a1add602a1fea6de6ca20871138ae705828c2bee8422db3f22bfa4a70a0ebc`
MD5	`55890223ec42c543641df77b02932df2`
BLAKE2b-256	`7700ddadb0915e13e097a4ce1f8aca49345c3c6c4485dd53bde149e1e011186a`

File details

Details for the file doxstrux-0.2.1-py3-none-any.whl.

File metadata

Download URL: doxstrux-0.2.1-py3-none-any.whl
Upload date: Oct 12, 2025
Size: 62.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for doxstrux-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`666c8cf971a0fcc44a793717345278bd4fc6b0f52177dcb3d18b9e8a45f9e5c0`
MD5	`77bd9913c7a76bfd73fa631ccf931393`
BLAKE2b-256	`0f003973b010842e70cfac6c8c9e4249690f4c2727db39d45207813c9a811cdb`
