🏗️ Doxstrux

Document structure extraction tool for markdown, with extensibility to PDF and HTML.

Extract hierarchical structure, metadata, and content from documents without semantic analysis. Built for RAG pipelines, documentation analysis, and AI preprocessing.

✨ Features

  • Zero-regex parsing: Token-based extraction using markdown-it-py
  • Security-first design: Three security profiles (strict/moderate/permissive)
  • Document IR: Clean intermediate representation for RAG chunking
  • Structure extraction: Headings, lists, tables, code blocks, links, images
  • Content integrity: Parse without mutation, fail-closed security
  • Extensible architecture: Ready for PDF and HTML support

📦 Installation

pip install doxstrux

🚀 Quick Start

from doxstrux.markdown_parser_core import MarkdownParserCore

# Basic usage
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()

# Access structure
print(result['structure']['headings'])
print(result['metadata']['security']['statistics'])

# With security profile
parser = MarkdownParserCore(content, security_profile='strict')
result = parser.parse()

# With custom config
parser = MarkdownParserCore(
    content,
    config={
        'preset': 'gfm',
        'plugins': ['table', 'strikethrough'],
        'allows_html': False
    },
    security_profile='moderate'
)
result = parser.parse()

🏗️ Architecture

Core Principles

  • Extract everything, analyze nothing: Focus on structural extraction, not semantics
  • No file I/O in core: Parser accepts content strings, not paths
  • Plain dict outputs: Lightweight, no heavy dependencies
  • Security layered throughout: Size limits, plugin validation, content sanitization
  • Modular extractors (Phase 7): 11 specialized modules with dependency injection
  • Single responsibility: Each extractor handles one markdown element type
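
The last two principles can be pictured with a small, self-contained sketch. Everything in it (the `Token` class, `extract_headings`, `extract_code_blocks`, `run_extractors`) is hypothetical and is not doxstrux's internal API; it only illustrates the idea of one extractor per element type, with the extractors passed in rather than hard-wired.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a parser token (not doxstrux's real type)."""
    type: str
    content: str = ""
    level: int = 0


# An extractor takes the shared token list and returns plain dicts.
Extractor = Callable[[list[Token]], list[dict]]


def extract_headings(tokens: list[Token]) -> list[dict]:
    """Handle exactly one element type: headings."""
    return [{"level": t.level, "text": t.content} for t in tokens if t.type == "heading"]


def extract_code_blocks(tokens: list[Token]) -> list[dict]:
    """Handle exactly one element type: code blocks."""
    return [{"code": t.content} for t in tokens if t.type == "code_block"]


def run_extractors(tokens: list[Token], extractors: dict[str, Extractor]) -> dict[str, list[dict]]:
    """Run each injected extractor over the same token list."""
    return {name: extract(tokens) for name, extract in extractors.items()}


tokens = [
    Token("heading", "Hello", level=1),
    Token("paragraph", "This is markdown."),
    Token("code_block", "print('hi')"),
]
structure = run_extractors(
    tokens,
    {"headings": extract_headings, "code_blocks": extract_code_blocks},
)
print(structure)
```

Because the extractors are supplied as arguments, a test can swap one out or run a single extractor in isolation without touching the others.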

Security Profiles

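| Profile    | Max Size | Max Lines | Recursion Depth | Use Case               |
|------------|----------|-----------|-----------------|------------------------|
| strict     | 100KB    | 2K        | 50              | Untrusted input        |
| moderate   | 1MB      | 10K       | 100             | Standard use (default) |
| permissive | 10MB     | 50K       | 150             | Trusted documents      |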
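
As a rough illustration of what these limits mean in practice, the sketch below checks a content string against the size and line limits of a profile and rejects it when either is exceeded. The names `Limits`, `PROFILES`, and `check_limits` are hypothetical, the conversion of 1 KB to 1024 bytes is an assumption, and the recursion-depth value is recorded but not enforced here; this is not how doxstrux implements its checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_bytes: int
    max_lines: int
    max_depth: int  # recorded for completeness; not enforced in this sketch


# Values taken from the profile table; 1 KB = 1024 bytes is an assumption.
PROFILES = {
    "strict": Limits(100 * 1024, 2_000, 50),
    "moderate": Limits(1024 * 1024, 10_000, 100),
    "permissive": Limits(10 * 1024 * 1024, 50_000, 150),
}


def check_limits(content: str, profile: str = "moderate") -> None:
    """Raise ValueError if content exceeds the profile's size or line limit."""
    limits = PROFILES[profile]  # an unknown profile raises KeyError (fail closed)
    size = len(content.encode("utf-8"))
    if size > limits.max_bytes:
        raise ValueError(f"{size} bytes exceeds {profile} limit of {limits.max_bytes}")
    lines = content.count("\n") + 1
    if lines > limits.max_lines:
        raise ValueError(f"{lines} lines exceeds {profile} limit of {limits.max_lines}")


check_limits("# Hello\n\nThis is **markdown**.", "strict")  # within limits, no error

oversized = "x\n" * 3_000  # 3001 lines by the count used above
try:
    check_limits(oversized, "strict")
except ValueError as exc:
    print(f"rejected: {exc}")
```

The same `oversized` input passes under `moderate`, which is the practical difference between the profiles: the stricter the profile, the earlier large input is refused.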

Document IR

Clean intermediate representation for RAG pipelines and chunking:

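from doxstrux.markdown_parser_core import MarkdownParserCore
from doxstrux.markdown.ir import DocumentIR, ChunkPolicy

content = "# Hello\n\nThis is **markdown**."
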
# Parse to IR
parser = MarkdownParserCore(content)
result = parser.parse()
doc_ir = DocumentIR.from_parse_result(result)

# Apply chunking policy
policy = ChunkPolicy(
    max_chunk_tokens=512,
    overlap_tokens=50,
    respect_boundaries=['heading', 'section']
)
chunks = doc_ir.chunk(policy)
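
To make `max_chunk_tokens` and `overlap_tokens` concrete, here is a minimal sliding-window sketch. It treats whitespace-separated words as tokens and ignores heading and section boundaries, so it is not doxstrux's chunker; the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(
    tokens: list[str], max_chunk_tokens: int, overlap_tokens: int
) -> list[list[str]]:
    """Split tokens into windows of at most max_chunk_tokens tokens, where each
    window repeats the last overlap_tokens tokens of the previous one."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    if not 0 <= overlap_tokens < max_chunk_tokens:
        raise ValueError("overlap_tokens must be in [0, max_chunk_tokens)")
    step = max_chunk_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_tokens])
        if start + max_chunk_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = sliding_chunks(words, max_chunk_tokens=4, overlap_tokens=1)
for chunk in chunks:
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight nine ten
```

Each chunk starts with the last token of the previous one, which is the context carried across chunk boundaries by the overlap setting.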

🧪 Testing

# Run all tests
pytest

# With coverage
pytest --cov=src/doxstrux

# Type checking
mypy src/doxstrux

# Linting
ruff check src/ tests/

📊 Project Status

  • Version: 0.2.1 ✅ Published on PyPI
  • Python: 3.12+
  • Test Coverage: 69% (working toward 80% target)
  • Tests: 95/95 pytest passing + 542/542 baseline tests passing
  • Regex Count: 0 (zero-regex architecture)
  • Core Parser: 1944 lines (reduced from 2900, -33%)
  • PyPI: https://pypi.org/project/doxstrux/

Phase 7: Modular Architecture ✅ COMPLETE

Completed: Full modularization of parser into 11 specialized extractors

  • 7.0.5: Rename from docpipe to doxstrux
  • 7.1: Create namespace structure
  • 7.2: Move existing modules to new namespace
  • 7.3: Extract line & text utilities
  • 7.4: Extract configuration & budgets
  • 7.5: Extract simple extractors (media, footnotes, blockquotes, html)
  • 7.6: Extract complex extractors (lists, codeblocks, tables, links, sections, paragraphs)

Achievements:

  • Core parser reduced by 33% (2900 → 1944 lines)
  • 11 specialized extractor modules created
  • 100% baseline test parity maintained
  • Clean dependency injection pattern throughout
  • Zero behavioral changes (byte-for-byte output identical)

🗺️ Roadmap

  • Phase 7: Modular architecture ✅ COMPLETE
  • Phase 8: Enhanced testing & documentation
  • PDF support: Extract structure from PDF documents
  • HTML support: Parse HTML with same IR
  • Enhanced chunking: Semantic-aware chunking strategies
  • Performance: Cython optimization for hot paths

📚 Documentation

  • Architecture: See CLAUDE.md for detailed architecture notes
  • Phase 7 Plan: See regex_refactor_docs/DETAILED_TASK_LIST.md
  • Testing: See regex_refactor_docs/REGEX_REFACTOR_POLICY_GATES.md

🤝 Contributing

This project follows a phased refactoring methodology with comprehensive test gates.

  1. All changes must pass the full pytest suite (95/95 at the time of the Project Status above)
  2. All changes must maintain byte-for-byte output parity (542 baseline tests)
  3. Security-first: No untrusted regex, validated links, sanitized HTML
  4. Type-safe: Full mypy strict mode compliance
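
The byte-for-byte parity gate can be sketched as follows, assuming parse results are plain dicts that get serialized to canonical JSON and compared with a stored baseline. The helpers `canonical_bytes` and `check_parity` and the shape of the example dict are hypothetical; the project's actual baseline runner and file layout are not shown here.

```python
import json


def canonical_bytes(result: dict) -> bytes:
    """Serialize a plain-dict parse result deterministically."""
    text = json.dumps(result, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")


def check_parity(result: dict, baseline: bytes) -> bool:
    """True only if the result serializes to exactly the baseline bytes."""
    return canonical_bytes(result) == baseline


# Stand-in for a stored baseline; the dict shape is illustrative only.
baseline = canonical_bytes({"structure": {"headings": [{"level": 1, "text": "Hello"}]}})

same = {"structure": {"headings": [{"text": "Hello", "level": 1}]}}  # key order differs
changed = {"structure": {"headings": [{"level": 2, "text": "Hello"}]}}

print(check_parity(same, baseline))     # True
print(check_parity(changed, baseline))  # False
```

Sorting keys before comparing is a design choice in this sketch: it makes the check insensitive to dict insertion order while still failing on any change in values.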

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built on markdown-it-py.


Previous name: docpipe (renamed to doxstrux in v0.2.0 for extensibility to PDF/HTML)
