Powerful Python library to convert documents (PDF, DOCX, TXT) into structured JSON trees for legal, institutional, and NLP applications.
📚 doc23
Convert documents into structured JSON effortlessly.
A Python library for extracting text from various document formats and structuring it hierarchically into JSON.
📌 Features
- ✅ Extract text from PDFs, DOCX, TXT, RTF, ODT, MD, and images.
- 🖼️ OCR support for scanned documents and images.
- ⚙️ Flexible configuration using regex patterns and field mapping.
- 🌳 Nested hierarchical structure output in JSON.
- ✨ Explicit leaf-level control using is_leaf=True.
- 🔍 Built-in validations to catch config mistakes (regex, hierarchy, field conflicts).
- 🧪 Comprehensive pytest suite with coverage reporting.
📦 Installation
pip install doc23
To enable OCR:
sudo apt install tesseract-ocr
pip install pytesseract
🚀 Quickstart Example
Basic Text Extraction
from doc23 import extract_text
# Extract text from any supported document
text = extract_text("document.pdf", scan_or_image="auto")
print(text)
Structured Document Parsing
from doc23 import Doc23, Config, LevelConfig

config = Config(
    root_name="art_of_war",
    sections_field="chapters",
    description_field="description",
    levels={
        "chapter": LevelConfig(
            pattern=r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$",
            name="chapter",
            title_field="title",
            description_field="description",
            sections_field="paragraphs"
        ),
        "paragraph": LevelConfig(
            pattern=r"^(\d+)\.\s+(.+)$",
            name="paragraph",
            title_field="number",
            description_field="text",
            is_leaf=True
        )
    }
)

with open("art_of_war.txt") as f:
    text = f.read()

doc = Doc23(text, config)
structure = doc.prune()
print(structure["chapters"][0]["title"])  # → I
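To see what the two patterns in the config capture, it can help to run them directly with the standard `re` module. The sketch below uses a made-up two-paragraph sample and assumes the patterns are applied in multiline mode; how doc23 compiles and applies them internally is not stated here, so treat this only as a way to check your regexes in isolation.

```python
import re

# Same patterns as in the config above. re.MULTILINE is an assumption
# about how the patterns are applied, not a documented doc23 detail.
CHAPTER = re.compile(r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$", re.MULTILINE)
PARAGRAPH = re.compile(r"^(\d+)\.\s+(.+)$", re.MULTILINE)

# Hypothetical sample text shaped like the quickstart input.
sample = (
    "CHAPTER I\n"
    "Laying Plans\n"
    "1. Sun Tzu said: The art of war is of vital importance to the State.\n"
    "2. It is a matter of life and death.\n"
)

# First group -> title_field, second group -> description_field.
chapter = CHAPTER.search(sample)
chapter_fields = {"title": chapter.group(1), "description": chapter.group(2)}

# Each paragraph match yields a number and its text.
paragraphs = [
    {"number": m.group(1), "text": m.group(2)}
    for m in PARAGRAPH.finditer(sample)
]

print(chapter_fields)   # {'title': 'I', 'description': 'Laying Plans'}
print(len(paragraphs))  # 2
```

The first capture group feeds the field named by title_field and the second feeds description_field, which is why the chapter title comes out as the Roman numeral.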
🧾 Output Example
{
  "description": "",
  "chapters": [
    {
      "type": "chapter",
      "title": "I",
      "description": "Laying Plans",
      "paragraphs": [
        {
          "type": "paragraph",
          "number": "1",
          "text": "Sun Tzu said: The art of war is of vital importance to the State."
        }
      ]
    }
  ]
}
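Because the result is plain nested JSON, it can be consumed with ordinary dictionary and list access. The following sketch loads a copy of the output shown above and flattens it into rows; the helper name `iter_leaves` is hypothetical and not part of doc23.

```python
import json

# A copy of the example output, loaded as a plain dict.
structure = json.loads("""
{
  "description": "",
  "chapters": [
    {
      "type": "chapter",
      "title": "I",
      "description": "Laying Plans",
      "paragraphs": [
        {"type": "paragraph", "number": "1",
         "text": "Sun Tzu said: The art of war is of vital importance to the State."}
      ]
    }
  ]
}
""")


def iter_leaves(structure):
    """Yield (chapter title, paragraph number, text) for every paragraph."""
    for chapter in structure.get("chapters", []):
        for paragraph in chapter.get("paragraphs", []):
            yield chapter["title"], paragraph["number"], paragraph["text"]


rows = list(iter_leaves(structure))
for title, number, text in rows:
    print(f"Chapter {title}, paragraph {number}: {text}")
```

The key names ("chapters", "paragraphs", "number", "text") come straight from the sections_field, title_field, and description_field values in the config, so a different config needs a matching traversal.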
🛠️ Document Configuration
Use Config and LevelConfig to define how your document is parsed:
| Field | Purpose |
|---|---|
| pattern | Regex to match each level |
| title_field | Field to assign the first regex group |
| description_field | (Optional) Field for second group |
| sections_field | (Optional) Where sublevels go |
| paragraph_field | (Optional) Where text/nodes go if leaf |
| is_leaf | (Optional) Forces this level to be terminal |
Capture Group Rules
| Fields Defined | Required Groups in Regex |
|---|---|
| title_field only | ≥1 |
| title_field + description_field | ≥2 |
| title_field + paragraph_field | ≥1 (second group optional) |
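The rules in the table can be checked before handing a pattern to the library, since a compiled regex exposes its capture-group count as `.groups`. This is a minimal sketch of that check as the table states it; `required_groups` and `check_pattern` are hypothetical helpers, and the library's own validation may apply the rules differently.

```python
import re


def required_groups(title_field, description_field=None, paragraph_field=None):
    """Minimum number of capture groups, read off the table above."""
    if title_field and description_field:
        return 2
    # title_field alone, or title_field + paragraph_field
    return 1


def check_pattern(pattern, **fields):
    """Return (ok, found, needed) for a pattern and its configured fields."""
    found = re.compile(pattern).groups  # number of capturing groups
    needed = required_groups(**fields)
    return found >= needed, found, needed


# Two groups for title + description: satisfies the rule.
print(check_pattern(r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                    title_field="title", description_field="content"))
# (True, 2, 2)

# Hypothetical pattern with a single group but two fields: too few groups.
print(check_pattern(r"^SECTION\s+(\d+)$",
                    title_field="title", description_field="description"))
# (False, 1, 2)
```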
🏗️ Architecture Overview
doc23 consists of several key components:
Doc23 (core.py)
├── Extractors (extractors/)
│ ├── PDFExtractor
│ ├── DocxExtractor
│ ├── TextExtractor
│ └── ...
├── Config (config_tree.py)
│ └── LevelConfig
└── Gardener (gardener.py)
- Doc23: Main entry point, handles file detection and orchestration
- Extractors: Convert various document types to plain text
- Config: Defines how to structure the document hierarchy
- Gardener: Parses text and builds the JSON structure
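The file-detection step can be pictured as a lookup from file extension to an extractor. The sketch below is only an illustration of that idea: the registry, the function names, and the choice to dispatch on the suffix are all assumptions, and the real extractor classes are not reproduced here.

```python
from pathlib import Path


def extract_plain_text(path):
    """Hypothetical stand-in for a text extractor: read the file as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def not_implemented(path):
    """Placeholder for formats this sketch does not handle."""
    raise NotImplementedError(f"no extractor wired up for {path}")


# Hypothetical registry mapping lowercase suffixes to extractor callables.
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".md": extract_plain_text,
    ".pdf": not_implemented,
    ".docx": not_implemented,
}


def pick_extractor(path):
    """Choose an extractor from the file suffix, case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or path!r}") from None


print(pick_extractor("notes.TXT").__name__)  # extract_plain_text
```

Keeping detection separate from extraction means the parsing stage only ever sees plain text, whatever the input format was.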
✅ Built-in Validation
The library validates your config when creating Doc23:
- ✋ Ensures all parents exist.
- 🔁 Detects circular relationships.
- ⚠️ Checks field name reuse.
- 🧪 Verifies group counts match pattern.
If any issue is found, a ValueError will be raised immediately.
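The first two checks in the list (missing parents and circular relationships) amount to walking each level's parent chain. Here is a minimal sketch over a plain `{level: parent}` mapping; `validate_parents` is a hypothetical helper written for illustration, not the library's implementation.

```python
def validate_parents(parents):
    """Check a {level: parent-or-None} mapping for missing parents and cycles.

    Raises ValueError on the first problem found.
    """
    # Every named parent must itself be a defined level.
    for level, parent in parents.items():
        if parent is not None and parent not in parents:
            raise ValueError(f"level {level!r} has unknown parent {parent!r}")

    # Follow each parent chain; revisiting a level means there is a cycle.
    for start in parents:
        seen = {start}
        current = parents[start]
        while current is not None:
            if current in seen:
                raise ValueError(f"circular relationship involving {current!r}")
            seen.add(current)
            current = parents[current]


validate_parents({"book": None, "article": "book"})  # passes silently

try:
    validate_parents({"a": "b", "b": "a"})
except ValueError as exc:
    error = str(exc)
    print(error)  # circular relationship involving 'a'
```

Raising on the first problem mirrors the fail-fast behaviour described above: a bad config is rejected before any text is parsed.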
🧪 Testing
The library includes a comprehensive test suite covering various scenarios:
Basic Initialization
def test_gardener_initialization():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    assert gardener.leaf == "article"
Document Structure
def test_prune_basic_structure():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    text = """BOOK First Book
This is a description
ARTICLE 1. First article
This is article content
More content"""
    result = gardener.prune(text)
    assert result["sections"][0]["title"] == "First Book"
    assert result["sections"][0]["sections"][0]["paragraphs"] == ["This is article content", "More content"]
Edge Cases
def test_prune_empty_document():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={}
    )
    gardener = Gardener(config)
    result = gardener.prune("")
    assert result["sections"] == []
Free Text Handling
def test_prune_with_free_text():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "title": LevelConfig(
                pattern=r"^TITLE\s+(.+)$",
                name="title",
                title_field="title",
                description_field="description",
                sections_field="sections"
            )
        }
    )
    gardener = Gardener(config)
    text = """This is free text at the top level
TITLE First Title
Title description"""
    result = gardener.prune(text)
    assert result["description"] == "This is free text at the top level"
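The free-text behaviour checked by the last test can be pictured as splitting the input at the first line that matches a level pattern. The sketch below reproduces that split with the standard `re` module; `split_free_text` is a hypothetical helper and says nothing about how Gardener is actually implemented.

```python
import re


def split_free_text(text, pattern):
    """Split text into (free text before the first match, remainder)."""
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        # No structured level found: everything is free text.
        return text.strip(), ""
    return text[:match.start()].strip(), text[match.start():]


text = """This is free text at the top level
TITLE First Title
Title description"""

free_text, remainder = split_free_text(text, r"^TITLE\s+(.+)$")
print(free_text)  # This is free text at the top level
```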
Run tests with:
python -m pytest tests/
❓ Troubleshooting FAQ
OCR not working
Make sure Tesseract is installed and accessible in your PATH.
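A quick way to confirm that the Tesseract binary is reachable is to look it up on PATH from Python. This sketch uses only the standard library; the helper name `find_tesseract` is made up for the example.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it or add its directory to PATH")
else:
    print(f"tesseract found at {path}")
```

If the binary is installed somewhere outside PATH, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.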
Text extraction issues
Different document formats may require specific libraries. Check your dependencies:
- PDF: pdfplumber, pdf2image
- DOCX: docx2txt
- ODT: odf
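To find out which of the modules listed above are missing from the current environment, you can ask the import system without importing anything. The sketch below is one way to do that; the `DEPENDENCIES` mapping simply restates the list, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names taken from the list above; "odf" is the module
# installed by the odfpy package.
DEPENDENCIES = {
    "PDF": ["pdfplumber", "pdf2image"],
    "DOCX": ["docx2txt"],
    "ODT": ["odf"],
}


def missing_modules(dependencies):
    """Map each format to the modules that cannot be found in this environment."""
    report = {}
    for fmt, modules in dependencies.items():
        missing = [name for name in modules if find_spec(name) is None]
        if missing:
            report[fmt] = missing
    return report


report = missing_modules(DEPENDENCIES)
for fmt, names in report.items():
    print(f"{fmt}: missing {', '.join(names)}")
```

An empty report means every listed module can be located; it does not prove that system tools such as Tesseract are installed.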
Regex pattern not matching
Test your patterns with tools like regex101.com and ensure you have the correct number of capture groups.
🔄 Compatibility
- Python 3.8+
- Tested on Linux, macOS, and Windows
👥 Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
MIT
🔗 Resources
🧠 Advanced Usage
For advanced patterns, dynamic configs, exception handling and OCR examples, see: