A Python framework for normalizing PDFs, Word files, CSV, Excel, JPG, and PNG into AI-ready document chunks.
DocFrame
DocFrame is a Python framework for turning messy enterprise documents into structured, AI-ready chunks.
It gives developers one API and one CLI for PDFs, Word files, CSVs, Excel workbooks, JPGs, and PNGs.
docframe process contract.pdf --format markdown
docframe process ./inbox --recursive --out normalized.json
python3 -m docframe process report.xlsx
Why DocFrame
Document workflows usually start with format chaos: text PDFs, scanned images, spreadsheets, Word files, CSV exports, and ad hoc attachments. DocFrame gives you a normalized result model so downstream systems can search, extract, summarize, validate, and route documents without rewriting parsers for every file type.
Install
From PyPI:
python3 -m pip install docframe-ai
Local development:
python3 -m pip install -e .
Then:
docframe formats
docframe process examples/sample.csv --format markdown
The PyPI distribution is docframe-ai; the Python import remains docframe.
See the repository's PyPI publishing guide for the GitHub Trusted Publishing setup.
Python API
import docframe as df
framework = df.DocFrame()
result = framework.process_sync("examples/sample.csv")
print(result.metadata.document_type)
print(result.chunks[0].rows)
Async processing:
import asyncio
import docframe as df

async def main():
    framework = df.DocFrame()
    return await framework.process_many(["contract.pdf", "report.xlsx"])

results = asyncio.run(main())
Safe corpus processing:
import asyncio
import docframe as df

async def main():
    framework = df.DocFrame()
    results = await framework.process_many(
        ["good.pdf", "malformed.pdf"],
        continue_on_error=True,
    )
    # Report every document that came back with errors.
    for result in results:
        if result.errors:
            print(result.metadata.filename, result.errors)

asyncio.run(main())
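After a continue_on_error run it is often useful to separate clean results from failed ones. The sketch below is only an illustration: partition_results is a hypothetical helper, not part of DocFrame, and it assumes nothing beyond what the snippet above shows, namely that each result exposes errors and metadata.filename. Stand-in objects replace real DocFrame results so the example runs without any documents.

```python
from types import SimpleNamespace


def partition_results(results):
    """Split results into (clean, failed) based on a truthy `errors` attribute."""
    clean, failed = [], []
    for result in results:
        (failed if result.errors else clean).append(result)
    return clean, failed


# Stand-ins that mimic only the attributes used above (an assumption,
# not DocFrame's real result class).
fake_results = [
    SimpleNamespace(metadata=SimpleNamespace(filename="good.pdf"), errors=[]),
    SimpleNamespace(
        metadata=SimpleNamespace(filename="malformed.pdf"),
        errors=["parse failure"],
    ),
]

clean, failed = partition_results(fake_results)
print("clean:", [r.metadata.filename for r in clean])
print("failed:", [r.metadata.filename for r in failed])
```

With real results, the failed list could feed a retry queue or a report while the clean list moves on to indexing.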
Supported Formats
- PDF: text and page metadata via pypdf
- DOCX: paragraphs and tables via direct OOXML package parsing
- DOC: OOXML extraction when possible, metadata-only fallback for legacy binary Word files
- CSV: table chunks via Python's standard csv parser
- XLSX/XLSM: worksheet tables via openpyxl
- JPG/JPEG/PNG: image metadata via Pillow
Images currently emit image chunks and metadata. OCR is intentionally a provider extension point so teams can choose local OCR, cloud OCR, or multimodal AI.
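One way to picture a provider extension point is a small interface that any OCR backend can satisfy. The sketch below is an assumption for illustration only: OcrProvider, FixedTextOcr, and image_chunk_text are hypothetical names, and DocFrame's actual extension API may look different.

```python
from typing import Protocol


class OcrProvider(Protocol):
    """Hypothetical interface: anything that turns image bytes into text."""

    def extract_text(self, image_bytes: bytes) -> str: ...


class FixedTextOcr:
    """Toy provider for tests; a real one would call local OCR or a cloud API."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self, image_bytes: bytes) -> str:
        return self._text


def image_chunk_text(image_bytes: bytes, provider: OcrProvider) -> str:
    # The caller depends only on the protocol, so providers are swappable.
    return provider.extract_text(image_bytes).strip()


text = image_chunk_text(b"placeholder image bytes", FixedTextOcr("  Invoice 42  "))
print(text)
```

Because the caller only needs an extract_text method, swapping a local engine for a cloud service would not change the calling code.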
Many real corpora contain OOXML Word documents with a .doc extension. DocFrame
extracts those with the Word adapter and emits a warning. True legacy binary
.doc files emit metadata and a warning; convert them to .docx or register a
custom adapter when full text extraction is required.
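To check ahead of time which kind of .doc file you have, you can inspect the file contents instead of the extension. The following is a minimal sketch that uses only the standard library; sniff_word_container is a hypothetical helper and is not necessarily how DocFrame itself decides. It relies on two well-known facts: OOXML Word packages are ZIP archives containing word/document.xml, and legacy binary Word files are OLE2 compound files that start with a fixed 8-byte signature.

```python
import zipfile
from pathlib import Path

# Signature that opens every OLE2 compound file, the container used by
# legacy binary .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_container(path):
    """Return "ooxml", "legacy-binary", or "unknown" from file contents."""
    path = Path(path)
    with path.open("rb") as handle:
        header = handle.read(8)
    if header == OLE2_MAGIC:
        return "legacy-binary"
    if zipfile.is_zipfile(path):
        # OOXML Word packages keep the main body in word/document.xml.
        with zipfile.ZipFile(path) as package:
            if "word/document.xml" in package.namelist():
                return "ooxml"
    return "unknown"
```

Calling sniff_word_container("report.doc") on a mislabeled OOXML file would return "ooxml", which signals that renaming or normal extraction should work, while "legacy-binary" signals that conversion to .docx is needed for full text.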
Core Concepts
- DocFrame: framework object for processing documents
- DocumentAdapter: parser for a file family
- AdapterRegistry: maps file extensions to adapters
- DocumentResult: normalized output for one document
- DocumentChunk: text, table, image, or metadata unit
- Pipeline: ordered post-processing steps
- ProcessingOptions: runtime limits and concurrency controls for large files
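The relationship between a registry, adapters, and normalized results can be sketched in a few lines. Everything below is a simplified stand-in written for illustration: Chunk, Result, Registry, and metadata_adapter are hypothetical names and do not reproduce DocFrame's real classes or signatures.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Chunk:
    kind: str  # "text", "table", "image", or "metadata"
    content: object


@dataclass
class Result:
    filename: str
    chunks: List[Chunk] = field(default_factory=list)


# In this sketch an adapter is just a callable from a path to a Result.
Adapter = Callable[[Path], Result]


class Registry:
    """Maps lowercase file extensions to adapter callables."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, extension: str, adapter: Adapter) -> None:
        self._adapters[extension.lower().lstrip(".")] = adapter

    def resolve(self, path: Path) -> Adapter:
        extension = path.suffix.lower().lstrip(".")
        try:
            return self._adapters[extension]
        except KeyError:
            raise ValueError(f"no adapter registered for .{extension}") from None


def metadata_adapter(path: Path) -> Result:
    # Toy adapter: emits one metadata chunk without opening the file.
    return Result(path.name, [Chunk("metadata", {"suffix": path.suffix})])


registry = Registry()
registry.register(".TXT", metadata_adapter)

target = Path("notes.txt")
result = registry.resolve(target)(target)
print(result)
```

Keeping extension lookup separate from parsing is what lets a new file family be supported by registering one more adapter, without touching downstream code that consumes results.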
CLI
docframe process FILE_OR_DIRECTORY
docframe process FILE_OR_DIRECTORY --format markdown
docframe process FILE_OR_DIRECTORY --recursive --out normalized.json
docframe formats
Status
DocFrame is public alpha software. The core API, adapters, CLI, tests, MIT license, and landing site are in place. See PUBLIC_ALPHA.md for the production-readiness checklist.
Verify
python3 -m unittest discover -s tests
python3 -m compileall docframe tests
Website
The static site lives in site/index.html.
Run it locally:
python3 -m http.server 8080 -d site
Then open:
http://127.0.0.1:8080/
Deploy Static Site On Render
The repository includes a Render Blueprint in render.yaml. It
publishes the static site from site/ as docframe-site.
After pushing the repository to GitHub, GitLab, or Bitbucket:
git push -u origin main
Then create the Blueprint from the Render Dashboard:
https://dashboard.render.com/blueprint/new
Connect the repository and Render will use render.yaml from the repo root.
Corpus Utilities
Validate a private corpus before a release:
python3 scripts/validate_corpus.py test_corpus --out corpus-report.json
The validator exits nonzero if any supported file produces a structured error.
Use --allow-errors for exploratory runs where malformed files are expected.
Collect any supported corpus files by extension:
python3 scripts/collect_files.py "/path/to/archive" "/path/to/all_csv" --ext csv --dry-run --quiet
python3 scripts/collect_files.py "/path/to/archive" "/path/to/all_csv" --ext csv --quiet
python3 scripts/collect_files.py "/path/to/archive" "/path/to/all_images" --ext jpg --ext png --quiet
Collect PDFs from a deeply nested archive into one flat folder:
python3 scripts/collect_pdfs.py "/path/to/archive" "/path/to/all_pdfs" --dry-run --quiet
python3 scripts/collect_pdfs.py "/path/to/archive" "/path/to/all_pdfs" --quiet
The collector copies by default, avoids overwriting existing files, and gives duplicate basenames a stable hash suffix.
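The naming behavior described above can be approximated as follows. This is a sketch under stated assumptions, not the code in scripts/collect_pdfs.py: flat_destination is a hypothetical helper, the suffix is assumed to be the first 8 hex characters of a SHA-256 over the source path, and the real scripts may derive or format the suffix differently.

```python
import hashlib
import tempfile
from pathlib import Path


def flat_destination(source: Path, dest_dir: Path, taken: set) -> Path:
    """Pick a name in dest_dir for source, adding a hash suffix on clashes."""
    candidate = dest_dir / source.name
    if candidate.name in taken or candidate.exists():
        # Hash the full source path so the suffix is identical on every run.
        digest = hashlib.sha256(source.as_posix().encode("utf-8")).hexdigest()[:8]
        candidate = dest_dir / f"{source.stem}-{digest}{source.suffix}"
    taken.add(candidate.name)
    return candidate


dest = Path(tempfile.mkdtemp())  # empty scratch folder for the demo
taken = set()
first = flat_destination(Path("archive/2023/report.pdf"), dest, taken)
second = flat_destination(Path("archive/2024/report.pdf"), dest, taken)
print(first.name)
print(second.name)
```

Deriving the suffix from the source path, not from a counter, means rerunning the collection over the same archive yields the same file names.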