Powerful Python library to convert documents (PDF, DOCX, TXT) into structured JSON trees for legal, institutional, and NLP applications.

📚 doc23

Convert documents into structured JSON effortlessly.
A Python library for extracting text from various document formats and structuring it hierarchically into JSON.


📌 Features

  • ✅ Extract text from PDFs, DOCX, TXT, RTF, ODT, MD, and images.
  • 🖼️ OCR support for scanned documents and images.
  • ⚙️ Flexible configuration using regex patterns and field mapping.
  • 🌳 Nested hierarchical structure output in JSON.
  • ✨ Explicit leaf-level control using is_leaf=True.
  • 🔍 Built-in validations to catch config mistakes (regex, hierarchy, field conflicts).
  • 🧪 Comprehensive pytest suite with coverage reporting.

📦 Installation

pip install doc23

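To enable OCR, install the Tesseract engine (the command below is for Debian/Ubuntu) and the pytesseract bindings: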

sudo apt install tesseract-ocr
pip install pytesseract

🚀 Quickstart Example

Basic Text Extraction

from doc23 import extract_text

# Extract text from any supported document
text = extract_text("document.pdf", scan_or_image="auto")
print(text)

Structured Document Parsing

from doc23 import Doc23, Config, LevelConfig

config = Config(
    root_name="art_of_war",
    sections_field="chapters",
    description_field="description",
    levels={
        "chapter": LevelConfig(
            pattern=r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$",
            name="chapter",
            title_field="title",
            description_field="description",
            sections_field="paragraphs"
        ),
        "paragraph": LevelConfig(
            pattern=r"^(\d+)\.\s+(.+)$",
            name="paragraph",
            title_field="number",
            description_field="text",
            is_leaf=True
        )
    }
)

# Read the source text; an explicit encoding avoids platform-dependent defaults
with open("art_of_war.txt", encoding="utf-8") as f:
    text = f.read()

doc = Doc23(text, config)
structure = doc.prune()

print(structure["chapters"][0]["title"])  # → I

🧾 Output Example

{
  "description": "",
  "chapters": [
    {
      "type": "chapter",
      "title": "I",
      "description": "Laying Plans",
      "paragraphs": [
        {
          "type": "paragraph",
          "number": "1",
          "text": "Sun Tzu said: The art of war is of vital importance to the State."
        }
      ]
    }
  ]
}
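
Because the result is a plain nested dictionary, it can be walked with ordinary Python. The sketch below hard-codes the output shown above rather than calling doc23, so it runs on its own; the helper name `iter_paragraphs` is hypothetical and not part of the library.

```python
import json

# The structure from the output example above, copied by hand so the
# snippet runs without doc23 installed.
structure = {
    "description": "",
    "chapters": [
        {
            "type": "chapter",
            "title": "I",
            "description": "Laying Plans",
            "paragraphs": [
                {
                    "type": "paragraph",
                    "number": "1",
                    "text": "Sun Tzu said: The art of war is of vital importance to the State.",
                }
            ],
        }
    ],
}


def iter_paragraphs(tree):
    """Yield (chapter title, paragraph number, text) for every paragraph."""
    for chapter in tree.get("chapters", []):
        for paragraph in chapter.get("paragraphs", []):
            yield chapter["title"], paragraph["number"], paragraph["text"]


rows = list(iter_paragraphs(structure))
for title, number, text in rows:
    print(f"Chapter {title}, paragraph {number}: {text}")

# The tree contains only dicts, lists, and strings, so it serializes directly.
serialized = json.dumps(structure, indent=2, ensure_ascii=False)
```

The field names used here ("chapters", "paragraphs", "number", "text") come from the sections_field, title_field, and description_field values in the quickstart config, so a different config would need different keys.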

🛠️ Document Configuration

Use Config and LevelConfig to define how your document is parsed:

Field              Purpose
pattern            Regex to match each level
title_field        Field to assign the first regex group
description_field  (Optional) Field for second group
sections_field     (Optional) Where sublevels go
paragraph_field    (Optional) Where text/nodes go if leaf
is_leaf            (Optional) Forces this level to be terminal

Capture Group Rules

Fields Defined                   Required Groups in Regex
title_field only                 ≥1
title_field + description_field  ≥2
title_field + paragraph_field    ≥1 (second group optional)
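
The group counts in the table can be checked before building a config. The following is a minimal sketch that uses only the standard `re` module; `required_groups` and `check_pattern` are hypothetical helpers written for illustration, and the library's own validation may apply different or additional rules.

```python
import re


def required_groups(has_description: bool) -> int:
    """Minimum capture groups: 1 for title_field, 2 when description_field is also set."""
    return 2 if has_description else 1


def check_pattern(pattern: str, has_description: bool = False) -> None:
    """Raise ValueError when the pattern has too few capture groups."""
    found = re.compile(pattern).groups  # counts capturing groups only
    needed = required_groups(has_description)
    if found < needed:
        raise ValueError(
            f"pattern {pattern!r} has {found} capture group(s), needs at least {needed}"
        )


# Two groups, so it satisfies title_field + description_field.
check_pattern(r"^(\d+)\.\s+(.+)$", has_description=True)

# A made-up pattern with one group fails the same requirement.
try:
    check_pattern(r"^SECTION\s+(\d+)$", has_description=True)
except ValueError as exc:
    error = str(exc)
    print(error)
```

Non-capturing groups such as `(?:...)` are not counted by `Pattern.groups`, which is usually what you want when a pattern needs grouping without adding a field.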

🏗️ Architecture Overview

doc23 consists of several key components:

Doc23 (core.py)
├── Extractors (extractors/)
│   ├── PDFExtractor
│   ├── DocxExtractor
│   ├── TextExtractor
│   └── ...
├── Config (config_tree.py)
│   └── LevelConfig
└── Gardener (gardener.py)

  1. Doc23: Main entry point, handles file detection and orchestration
  2. Extractors: Convert various document types to plain text
  3. Config: Defines how to structure the document hierarchy
  4. Gardener: Parses text and builds the JSON structure
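
To make the parsing step concrete, here is a toy sketch of how a line-by-line pass could turn text into a nested structure. It is not the Gardener implementation: the `LEVELS` table, the `build_tree` function, and the fixed "title"/"description" keys are all assumptions made for illustration.

```python
import re

# Hypothetical level table, outermost first: (name, pattern, child-list field).
LEVELS = [
    ("book", re.compile(r"^BOOK\s+(.+)$"), "sections"),
    ("article", re.compile(r"^ARTICLE\s+(\d+)\.\s*(.*)$"), "paragraphs"),
]
LEAF_DEPTH = len(LEVELS) - 1


def build_tree(text: str) -> dict:
    """Build a nested dict from text, one line at a time."""
    root = {"description": "", "sections": []}
    # Each stack entry is (depth, node, name of the node's child list).
    stack = [(-1, root, "sections")]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        for depth, (name, pattern, child_field) in enumerate(LEVELS):
            match = pattern.match(line)
            if match is None:
                continue
            # A new heading closes every open node at the same depth or deeper.
            while stack[-1][0] >= depth:
                stack.pop()
            description = match.group(2) if pattern.groups >= 2 else ""
            node = {
                "type": name,
                "title": match.group(1),
                "description": description,
                child_field: [],
            }
            _, parent, parent_children = stack[-1]
            parent[parent_children].append(node)
            stack.append((depth, node, child_field))
            break
        else:
            # Not a heading: free text belongs to the innermost open node.
            depth, node, child_field = stack[-1]
            if depth == LEAF_DEPTH:
                node[child_field].append(line)
            else:
                node["description"] = (node["description"] + "\n" + line).strip()
    return root


sample = """BOOK First Book
This is a description
ARTICLE 1. First article
This is article content
More content"""
tree = build_tree(sample)
```

The stack keeps track of which node is currently open, so a heading at a shallower level automatically closes the deeper ones before it is attached to its parent.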

✅ Built-in Validation

The library validates your config when creating Doc23:

  • ✋ Ensures all parents exist.
  • 🔁 Detects circular relationships.
  • ⚠️ Checks for conflicting (reused) field names.
  • 🧪 Verifies that each pattern has enough capture groups for the fields defined.

If any check fails, a ValueError is raised immediately.
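
The first two checks can be illustrated without the library. The sketch below works on a plain mapping of level name to parent name; `validate_parents` is a hypothetical function, and the real validation inside doc23 may be structured differently.

```python
from typing import Dict, Optional


def validate_parents(parents: Dict[str, Optional[str]]) -> None:
    """Raise ValueError for a missing parent or a circular parent chain.

    `parents` maps each level name to its parent's name (None for a top level).
    """
    for name, parent in parents.items():
        if parent is not None and parent not in parents:
            raise ValueError(f"level {name!r} refers to unknown parent {parent!r}")
    for name in parents:
        seen = {name}
        current = parents[name]
        # Follow the parent chain upward; revisiting a level means a cycle.
        while current is not None:
            if current in seen:
                raise ValueError(f"circular parent relationship involving {current!r}")
            seen.add(current)
            current = parents[current]


validate_parents({"book": None, "article": "book"})  # passes silently

errors = []
for bad in ({"article": "chapter"}, {"a": "b", "b": "a"}):
    try:
        validate_parents(bad)
    except ValueError as exc:
        errors.append(str(exc))
```

Failing at construction time, as the library does with ValueError, surfaces config mistakes before any document is parsed.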


🧪 Testing

The test suite covers initialization, structure parsing, edge cases, and free-text handling. Representative tests:

Basic Initialization

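from doc23 import Config, LevelConfig
# Assumed import path, based on gardener.py in the architecture overview
from doc23.gardener import Gardener

def test_gardener_initialization():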
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    assert gardener.leaf == "article"

Document Structure

def test_prune_basic_structure():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    text = """BOOK First Book
This is a description
ARTICLE 1. First article
This is article content
More content"""
    result = gardener.prune(text)
    assert result["sections"][0]["title"] == "First Book"
    assert result["sections"][0]["sections"][0]["paragraphs"] == ["This is article content", "More content"]

Edge Cases

def test_prune_empty_document():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={}
    )
    gardener = Gardener(config)
    result = gardener.prune("")
    assert result["sections"] == []

Free Text Handling

def test_prune_with_free_text():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "title": LevelConfig(
                pattern=r"^TITLE\s+(.+)$",
                name="title",
                title_field="title",
                description_field="description",
                sections_field="sections"
            )
        }
    )
    gardener = Gardener(config)
    text = """This is free text at the top level
TITLE First Title
Title description"""
    result = gardener.prune(text)
    assert result["description"] == "This is free text at the top level"

Run tests with:

python -m pytest tests/

❓ Troubleshooting FAQ

OCR not working

Make sure Tesseract is installed and accessible in your PATH.
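
A quick way to confirm that the executable is visible to Python is to look it up on PATH with the standard library. This is a small diagnostic sketch, independent of doc23; `find_tesseract` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it or fix PATH before using OCR")
else:
    print(f"tesseract found at {path}")
```

If the binary is installed somewhere that is not on PATH, pytesseract itself allows pointing at it explicitly through `pytesseract.pytesseract.tesseract_cmd`.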

Text extraction issues

Each format relies on its own extraction library. Check that the ones you need are installed:

  • PDF: pdfplumber, pdf2image
  • DOCX: docx2txt
  • ODT: odf

Regex pattern not matching

Test your patterns with tools like regex101.com and ensure you have the correct number of capture groups.
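
Patterns can also be checked locally against a sample of your own text. The sketch below lists every match with its line number and captured groups; `explain_matches` is a hypothetical helper, and the use of `re.MULTILINE` is an assumption about how `^` and `$` should behave, not a statement about the flags doc23 uses internally.

```python
import re


def explain_matches(pattern: str, text: str, flags: int = re.MULTILINE):
    """Return (line number, captured groups) for each match of pattern in text."""
    compiled = re.compile(pattern, flags)
    results = []
    for match in compiled.finditer(text):
        # Line number = newlines before the match start, plus one.
        line_no = text.count("\n", 0, match.start()) + 1
        results.append((line_no, match.groups()))
    return results


sample = (
    "CHAPTER I\n"
    "Laying Plans\n"
    "1. Sun Tzu said: The art of war is of vital importance to the State.\n"
)

# Patterns taken from the quickstart config.
chapter_hits = explain_matches(r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$", sample)
paragraph_hits = explain_matches(r"^(\d+)\.\s+(.+)$", sample)

print(chapter_hits)
print(paragraph_hits)
```

An empty result list means the pattern never matched, which often points to unexpected whitespace or line breaks in the extracted text.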


🔄 Compatibility

  • Python 3.8+
  • Tested on Linux, macOS, and Windows

👥 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

MIT



🧠 Advanced Usage

For advanced patterns, dynamic configs, exception handling and OCR examples, see:

📄 ADVANCED_USAGE_doc23.md
