Validate OOXML and ODF files in pure Python — no .NET required
OpenXML Audit
Validate OOXML (PPTX/DOCX/XLSX) and ODF files in pure Python — no .NET required.
A Python port of Microsoft's Open XML SDK validation logic. Check whether generated or modified Office files will open cleanly, directly from Python scripts, CI pipelines, or anywhere .NET isn't practical.
Also supports OASIS OpenDocument Format (ODT/ODS/ODP) with staged conformance levels.
Features
- OOXML Validation: Package structure, schema, semantic, and format-specific checks for PPTX/DOCX/XLSX — matching the Open XML SDK's validation without the .NET dependency
- ODF Validation: Staged conformance levels — foundation, schema-core (Relax NG), semantic-core, and security-core for ODT/ODS/ODP
- Multiple Output Formats: Text, JSON, and XML output
- Performance Tooling: Per-phase timing breakdown, benchmark scripts for both OOXML and ODF
- Flexible Integration: Context managers, decorators, and pytest fixtures
Installation
pip install openxml-audit
Or install from source:
git clone https://github.com/BramAlkema/openxml-audit.git
cd openxml-audit
pip install -e .
Quick Start
Command Line
# Validate a single file
openxml-audit presentation.pptx
# Validate an OASIS OpenDocument file
openxml-audit document.odt
# Validate with JSON output
openxml-audit presentation.pptx --output json
# Validate with XML output
openxml-audit presentation.pptx --output xml
# Validate all matching files in a directory
openxml-audit ./presentations/ --recursive
# Validate against a specific Office version
openxml-audit presentation.pptx --format Office2007
# Limit maximum errors reported
openxml-audit presentation.pptx --max-errors 10
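When the CLI is driven from a Python script or a CI step, it can help to assemble the argument list in one place. The sketch below only combines the flags shown above; `build_audit_command` is a hypothetical helper, not part of openxml-audit, and it makes no assumption about the tool's exit codes or output schema.

```python
from __future__ import annotations


def build_audit_command(
    target: str,
    output: str | None = None,
    max_errors: int | None = None,
    recursive: bool = False,
    office_format: str | None = None,
) -> list[str]:
    """Assemble an argv list for the openxml-audit CLI (hypothetical helper)."""
    cmd = ["openxml-audit", target]
    if output is not None:
        cmd += ["--output", output]  # text, json, or xml
    if office_format is not None:
        cmd += ["--format", office_format]  # e.g. Office2007
    if max_errors is not None:
        cmd += ["--max-errors", str(max_errors)]
    if recursive:
        cmd.append("--recursive")
    return cmd


print(build_audit_command("presentation.pptx", output="json", max_errors=10))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with file names that contain spaces.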
Python API
from openxml_audit import validate_pptx, is_valid_pptx, OpenXmlValidator
# Quick check
if is_valid_pptx("presentation.pptx"):
    print("File is valid!")
# Detailed validation
result = validate_pptx("presentation.pptx")
if not result.is_valid:
    print(f"Found {result.error_count} errors, {result.warning_count} warnings")
    for error in result.errors:
        print(f"  [{error.severity.value}] {error.description}")
# With custom options
from openxml_audit import FileFormat
validator = OpenXmlValidator(
    file_format=FileFormat.OFFICE_2019,
    max_errors=100,
    schema_validation=True,
    semantic_validation=True,
)
result = validator.validate("presentation.pptx")
ODF Validation Depth
ODF validation is staged by explicit conformance level.
| Level | Includes | Does not include |
|---|---|---|
| foundation | package/manifest integrity + XML parse sweep | Relax NG schema-core routing, semantic-core rules, security-core checks |
| schema-core | foundation + Relax NG validation for routed XML members | semantic-core and security-core checks |
| semantic-core | foundation + semantic-core rule families (ODFSEM*) | Relax NG schema-core routing, security-core checks |
| security-core | semantic-core + signature/encryption structural checks (ODFSEC*) | full cryptographic trust guarantees unless crypto verification backend is configured |
Rule registry and policy references:
- semantic rule IDs: openxml_audit.odf.get_odf_semantic_rules()
- security policy: docs/odf_security_policy.md
- reference calibration/drift contract: docs/odf_validation_contract.md
CLI Conformance Selection
Use --odf-level when validating ODF files:
# foundation
openxml-audit file.odt --validator odf --odf-level foundation
# semantic-core (default)
openxml-audit file.odt --validator odf --odf-level semantic-core
# security-core
openxml-audit file.odt --validator odf --odf-level security-core
Schema-core requires a schema-route JSON file:
openxml-audit file.odt \
--validator odf \
--odf-level schema-core \
--odf-schema-routes schemas/odf/routes.json
--odf-schema-routes accepts either shape:
- versioned mapping:
{"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}}
- flat legacy mapping:
{"content.xml": "schemas/odf/content.rng"}
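Code that prepares routes for the validator may want to handle both shapes uniformly. The following is a minimal sketch that converts either shape into the versioned form; `normalize_routes` and its `default_version` parameter are hypothetical, and how openxml-audit itself resolves a flat legacy mapping is not documented here.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any


def normalize_routes(
    routes: Mapping[str, Any],
    default_version: str = "1.3",
) -> dict[str, dict[str, str]]:
    """Return routes in the versioned shape: {version: {member: schema_path}}."""
    if not routes:
        return {}
    values = list(routes.values())
    if all(isinstance(v, Mapping) for v in values):
        # Already versioned: copy so the caller's mapping is not mutated later.
        return {version: dict(members) for version, members in routes.items()}
    if all(isinstance(v, str) for v in values):
        # Assumption: a flat mapping applies to one version chosen by the caller.
        return {default_version: dict(routes)}
    raise ValueError("schema routes mix versioned and flat entries")


print(normalize_routes({"content.xml": "schemas/odf/content.rng"}))
```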
Security-core crypto verification hook:
openxml-audit file.odt \
--validator odf \
--odf-level security-core \
--odf-verify-cryptography
API Conformance Selection
from openxml_audit import FileFormat
from openxml_audit.odf import OdfValidator
# foundation
foundation = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=False,
    semantic_validation=False,
    security_validation=False,
)
# schema-core (routes required)
schema_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
    schema_routes={"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}},
)
# semantic-core
semantic_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=False,
)
# security-core
security_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=True,
    verify_cryptography=False,  # set True when crypto backend is available
)
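If the conformance level arrives as a string (for example from a config file), the four configurations above can be kept in a lookup table. This is a sketch rather than a library feature: `ODF_LEVEL_OPTIONS` and `odf_validator_kwargs` are hypothetical names, and the keyword values are copied from the examples above.

```python
# Keyword arguments per conformance level, mirroring the constructor calls above.
ODF_LEVEL_OPTIONS = {
    "foundation": dict(
        schema_validation=False, semantic_validation=False, security_validation=False
    ),
    "schema-core": dict(
        schema_validation=True, semantic_validation=False,
        security_validation=False, relaxng_validation=True,
    ),
    "semantic-core": dict(
        schema_validation=True, semantic_validation=True, security_validation=False
    ),
    "security-core": dict(
        schema_validation=True, semantic_validation=True,
        security_validation=True, verify_cryptography=False,
    ),
}


def odf_validator_kwargs(level, schema_routes=None):
    """Return a fresh kwargs dict for the given level (hypothetical helper)."""
    if level not in ODF_LEVEL_OPTIONS:
        raise ValueError(f"unknown ODF level: {level!r}")
    kwargs = dict(ODF_LEVEL_OPTIONS[level])
    if level == "schema-core":
        # schema-core needs caller-provided Relax NG routes.
        if not schema_routes:
            raise ValueError("schema-core requires schema_routes")
        kwargs["schema_routes"] = schema_routes
    return kwargs


print(odf_validator_kwargs("semantic-core"))
```

The result would be used as `OdfValidator(file_format=FileFormat.ODF_1_3, **odf_validator_kwargs("semantic-core"))`.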
ODF Benchmarking
# Benchmark an ODF file (5 iterations by default)
python scripts/odf/benchmark_validation.py document.odt
# More iterations, with security checks
python scripts/odf/benchmark_validation.py document.odt --iterations 20 --security
# Foundation-only (skip schema/semantic)
python scripts/odf/benchmark_validation.py document.odt --no-schema --no-semantic
Reports avg/min/max/P95 with per-phase breakdown (package_structure, xml_parse, schema, semantic, security).
OOXML benchmark: python scripts/benchmark_validation.py presentation.pptx
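The avg/min/max/P95 summary can be reproduced from raw per-iteration timings. The sketch below uses the nearest-rank definition of P95, which is an assumption; the benchmark scripts may compute the percentile differently. `summarize_timings` is a hypothetical name.

```python
import statistics


def summarize_timings(samples):
    """Return avg/min/max/P95 for a non-empty list of durations."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank P95: ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


print(summarize_timings([0.2, 0.1, 0.4, 0.3, 0.5]))
```

With only a handful of iterations, P95 under this definition equals the maximum, so more iterations are needed before the two numbers diverge.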
Known ODF Limitations
- Full OASIS conformance parity is not yet complete.
- Schema-core requires caller-provided Relax NG routes (schema_routes).
- Security-core validates structure/policy, not full cryptographic trust by default.
- CLI --odf-level only applies when the selected/auto-detected validator is ODF.
ODF Reference Calibration
Compare Python results against external validators (ODF Toolkit, OPF) using the scripts in scripts/odf/:
| Script | Purpose |
|---|---|
| run_reference_validators.py | Run Python + external validators on pinned corpus |
| compare_reference_results.py | Diff results into mismatch families |
| check_reference_drift.py | Enforce drift policy against baseline |
| bootstrap_reference_validators.py | Auto-build external validator commands |
CI workflow: .github/workflows/odf-reference-calibration.yml — builds ODF Toolkit and OPF at runtime via Maven/Docker.
Set command templates via --odf-toolkit-cmd / --opf-cmd or env vars ODF_TOOLKIT_CMD / OPF_ODF_VALIDATOR_CMD. Placeholders: {file}, {file_dir}, {file_name}, {file_stem}, {file_suffix}.
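To illustrate how such placeholders can be filled, here is a minimal sketch using `pathlib` and `str.format`. It is not the scripts' actual expansion logic: `expand_command_template` is a hypothetical name, the `validator-cli` command is made up for the example, and whether `{file_suffix}` includes the leading dot is an assumption.

```python
import shlex
from pathlib import Path


def expand_command_template(template: str, file_path: str) -> list[str]:
    """Expand {file}-style placeholders and return an argv list."""
    path = Path(file_path)
    values = {
        "file": str(path),
        "file_dir": str(path.parent),
        "file_name": path.name,
        "file_stem": path.stem,
        "file_suffix": path.suffix,  # assumption: includes the leading dot
    }
    # Split first, then substitute, so paths with spaces stay one argument.
    return [part.format(**values) for part in shlex.split(template)]


# "validator-cli" is a placeholder command, not a real tool.
print(expand_command_template("validator-cli --input {file}", "corpus/my doc.odt"))
```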
Open XML SDK (Standalone)
Run the .NET SDK validator separately (requires .NET SDK 8.x or Docker):
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /path/to/file.pptx
dotnet run --project scripts/sdk_compare/OpenXmlSdkValidator.csproj -- /path/to/file.pptx # JSON
# Via Docker
docker run --rm -v "$PWD:/work" -w /work mcr.microsoft.com/dotnet/sdk:8.0 \
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /work/path/to/file.pptx
Supports PPTX/DOCX/XLSX and variants. Configured for Office 2019.
CI Workflows
| Workflow | Trigger | Purpose |
|---|---|---|
| parity-gate.yml | PR / push | Enforce OOXML parity + perf budget against SDK baseline |
| calibrate-parity.yml | Weekly / dispatch | Calibrate against Open XML SDK upstream |
| odf-reference-calibration.yml | Dispatch | Run ODF reference validators and drift checks |
| validate-inputs.yml | Push to inputs/ | Validate dropped files with both Python and .NET SDK |
OOXML parity details: docs/parity_contract.md. ODF reference contract: docs/odf_validation_contract.md.
Integration Helpers
# Context manager
from openxml_audit import validation_context
with validation_context(raise_on_invalid=True) as validator:
    result = validator.validate("presentation.pptx")
# Decorator: validate after save
from pptx import Presentation  # python-pptx, needed only for this example
from openxml_audit import validate_on_save
@validate_on_save(raise_on_invalid=True)
def create_presentation(output_path: str) -> None:
    Presentation().save(output_path)
# Decorator: require valid input
from openxml_audit import require_valid_pptx
@require_valid_pptx()
def process(input_path: str) -> dict: ...
# pytest fixtures (add to conftest.py)
from openxml_audit.helpers import pytest_openxml_audit, pytest_assert_valid_pptx
openxml_audit = pytest_openxml_audit()
assert_valid_pptx = pytest_assert_valid_pptx()
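For readers curious about the validate-after-save pattern, the sketch below shows one way such a decorator could be structured. It is not openxml-audit's implementation: `validate_after_save`, `InvalidFileError`, and the injected `check` callable are hypothetical, and the sketch assumes the output path is the first positional argument.

```python
import functools
from typing import Callable


class InvalidFileError(Exception):
    """Raised by this sketch when the injected check rejects a file."""


def validate_after_save(check: Callable[[str], bool], raise_on_invalid: bool = True):
    """Decorator factory: run check(output_path) after the wrapped function returns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(output_path: str, *args, **kwargs):
            result = func(output_path, *args, **kwargs)
            # Validate only after the save has completed.
            if not check(output_path) and raise_on_invalid:
                raise InvalidFileError(f"{output_path} failed validation")
            return result
        return wrapper
    return decorator


# Stand-in check for demonstration; a real check could be is_valid_pptx.
@validate_after_save(lambda path: path.endswith(".pptx"))
def save_deck(output_path: str) -> str:
    return f"saved {output_path}"


print(save_deck("deck.pptx"))
```

Injecting the check keeps the sketch independent of any particular validator and makes it easy to exercise in tests.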
API Reference
OpenXmlValidator / OdfValidator
OpenXmlValidator(file_format=FileFormat.OFFICE_2019, max_errors=1000,
                 schema_validation=True, semantic_validation=True)
OdfValidator(file_format=FileFormat.ODF_1_3, max_errors=1000,
             schema_validation=True, semantic_validation=True,
             security_validation=False, strict=True)
Both expose:
- validate(path) -> ValidationResult
- validate_with_timings(path) -> (ValidationResult, dict[str, float])
- is_valid(path) -> bool
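The timing dictionary returned by `validate_with_timings` can be turned into a quick "where did the time go" view. The sketch below assumes the dictionary maps phase names to durations; `phase_shares` is a hypothetical helper and the sample numbers are made up.

```python
from __future__ import annotations


def phase_shares(timings: dict[str, float]) -> list[tuple[str, float]]:
    """Return (phase, percent_of_total) pairs, slowest phase first."""
    total = sum(timings.values())
    if total <= 0:
        return [(name, 0.0) for name in timings]
    shares = [(name, 100.0 * duration / total) for name, duration in timings.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Made-up timings for illustration only.
for phase, percent in phase_shares({"xml_parse": 1.0, "schema": 3.0}):
    print(f"{phase}: {percent:.1f}%")
```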
ValidationResult
| Property | Type | Description |
|---|---|---|
| is_valid | bool | No ERROR-severity issues |
| errors | list[ValidationError] | All errors and warnings |
| error_count / warning_count | int | Counts by severity |
| file_path | str | Validated file path |
| file_format | FileFormat | Version validated against |
ValidationError
| Property | Type | Description |
|---|---|---|
| error_type | ValidationErrorType | PACKAGE, SCHEMA, SEMANTIC, RELATIONSHIP |
| severity | ValidationSeverity | ERROR, WARNING, INFO |
| description | str | Human-readable message |
| part_uri | str \| None | Affected part URI |
| path | str \| None | XPath to affected element |
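The fields above are enough to build a compact report from `result.errors`. The sketch below counts issues by severity and groups descriptions by affected part. It runs on stand-in objects so that it is self-contained; `summarize_issues` is a hypothetical helper, and the severity strings and part URIs in the sample are assumptions, not values confirmed by the library.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def summarize_issues(errors):
    """Count issues by severity and group descriptions by affected part."""
    by_severity = Counter(e.severity.value for e in errors)
    by_part = defaultdict(list)
    for e in errors:
        # part_uri may be None, e.g. for package-level problems.
        by_part[e.part_uri or "(package)"].append(e.description)
    return dict(by_severity), dict(by_part)


def _issue(severity, description, part_uri):
    # Stand-in shaped like ValidationError; real objects come from validate().
    return SimpleNamespace(
        severity=SimpleNamespace(value=severity),
        description=description,
        part_uri=part_uri,
    )


sample = [
    _issue("error", "Missing relationship target", "/ppt/slides/slide1.xml"),
    _issue("warning", "Unused layout", "/ppt/slides/slide1.xml"),
    _issue("error", "Content types part is malformed", None),
]
print(summarize_issues(sample))
```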
Supported Formats
| OOXML | ODF |
|---|---|
| OFFICE_2007 through MICROSOFT_365 (default: OFFICE_2019) | ODF_1_2, ODF_1_3 (default: ODF_1_3) |
Convenience Functions
- validate_pptx(path) -> ValidationResult
- is_valid_pptx(path) -> bool
Works Well With
These libraries create Office files — openxml-audit checks them:
| Library | Format | Purpose |
|---|---|---|
| python-pptx | PPTX | Create and update PowerPoint files |
| python-docx | DOCX | Create and update Word files |
| openpyxl | XLSX | Create and update Excel files |
from pptx import Presentation
from openxml_audit import validate_pptx
Presentation().save("output.pptx")
result = validate_pptx("output.pptx")
if not result.is_valid:
    print(f"{result.error_count} issues found")
Contributing
Contributions are welcome! See CONTRIBUTING.md for dev setup and guidelines.
Looking for Maintainers
This project is actively looking for co-maintainers — especially people working with:
- Office file generation pipelines (python-pptx, python-docx, openpyxl)
- ODF tooling and OASIS conformance
- Open XML SDK internals
If you're interested, open an issue or reach out.
Funding
If this project saves you time, consider sponsoring its development:
License
Acknowledgments
Based on the validation logic from Microsoft's Open XML SDK for .NET.