Convert documents to semantic HTML optimized for LLM context, reducing token overhead
MakeContextSimple
Convert documents to semantic HTML optimized for LLM context consumption.
Overview
MakeContextSimple is a Python utility that converts various document formats into clean, semantic HTML optimized for large language model (LLM) consumption. Unlike Markdown-based converters, MakeContextSimple produces HTML that is:
- Token-efficient: Less syntax overhead than Markdown for complex structures
- Semantically rich: HTML tags convey meaning without extra markers
- Machine-parseable: Standard HTML parsers work reliably
- Browser-viewable: Output can be directly viewed in any browser
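As a rough illustration of the "machine-parseable" point, the sketch below reads a small HTML table back into Python lists using only the standard library's `html.parser`. The `TableExtractor` class, the `extract_rows` helper, and the sample markup are hypothetical and written for this example; they are not part of MakeContextSimple.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed input


def extract_rows(html_text):
    """Return the table rows found in html_text as lists of cell strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Sample markup invented for this example
sample = (
    "<table>"
    "<tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Alice</td><td>30</td></tr>"
    "</table>"
)
print(extract_rows(sample))  # [['Name', 'Age'], ['Alice', '30']]
```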
Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, Markdown |
| Office | PPTX, XLSX |
| Web | HTML, XML, RSS |
| Data | CSV, JSON |
| Text | Plain text, Code files, Config files |
| Images | JPG, PNG, GIF, WebP, BMP |
Installation
Basic Installation
pip install makecontextsimple
With Optional Dependencies
# For PDF support
pip install makecontextsimple[pdf]
# For Office document support
pip install makecontextsimple[docx,pptx,xlsx]
# For image support
pip install makecontextsimple[image]
# For all formats
pip install makecontextsimple[all]
From Source
git clone https://github.com/makecontextsimple/makecontextsimple.git
cd makecontextsimple
pip install -e ".[all]"
Docker
# Build image
docker build -t makecontextsimple .
# Convert a file
docker run --rm -v $(pwd):/data makecontextsimple document.pdf -o /data/output.html
# LLM-optimized output
docker run --rm -v $(pwd):/data makecontextsimple document.pdf --llm -o /data/context.html
Docker Compose
# Single file conversion
docker compose run convert
# LLM-optimized conversion
docker compose run convert-llm
# Batch convert all PDFs in input/ folder
docker compose run batch
Usage
Command Line
# Convert a file to HTML (output to stdout)
makecontextsimple document.pdf
# Convert with custom output file
makecontextsimple document.pdf -o output.html
# Generate minimal HTML for LLM context
makecontextsimple document.pdf --llm
# List supported formats
makecontextsimple --list-formats
Python API
from makecontextsimple import MakeContextSimple
# Initialize converter
converter = MakeContextSimple()
# Convert a file
result = converter.convert("document.pdf")
# Get full HTML document
html = result.to_full_document()
print(html)
# Get minimal HTML for LLM context
llm_context = result.to_llm_context()
# Save directly to file
converter.convert_to_file("document.pdf", "output.html")
# Convert URL content
import requests
response = requests.get("https://example.com/page.html")
result = converter.convert(response)
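The `convert_to_file` call shown above can be looped over a folder. The following is a minimal sketch, assuming `convert_to_file(input_path, output_path)` accepts string paths as in the example; the `batch_convert` helper and its parameters are hypothetical and not part of the package.

```python
from pathlib import Path


def batch_convert(converter, input_dir, output_dir, pattern="*.pdf"):
    """Convert every file in input_dir matching pattern; return the output paths."""
    in_dir = Path(input_dir)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for source in sorted(in_dir.glob(pattern)):
        target = out_dir / (source.stem + ".html")
        # Assumption: convert_to_file takes (input path, output path) as strings
        converter.convert_to_file(str(source), str(target))
        written.append(target)
    return written
```

With the library installed, this would presumably be called as `batch_convert(MakeContextSimple(), "input", "output")`. Any object that exposes a compatible `convert_to_file` method works, which keeps the helper easy to test without real documents.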
Custom Styles
# Use custom CSS
custom_css = """
body { font-family: Arial; max-width: 800px; margin: 0 auto; }
h1 { color: #333; }
"""
result = converter.convert("document.pdf")
html = result.to_full_document(styles=custom_css)
Custom Converters
from html import escape
from makecontextsimple import HTMLConverter, HTMLResult
class MyCustomConverter(HTMLConverter):
    def accepts(self, file_stream, mimetype=None, extension=None, **kwargs):
        # Claim only files with the custom extension
        return extension == ".myformat"
    def convert(self, file_stream, mimetype=None, extension=None, **kwargs):
        content = file_stream.read().decode("utf-8")
        # Custom conversion logic; escape so "<" and "&" in the input stay literal
        html = f"<pre>{escape(content)}</pre>"
        return HTMLResult(html=html, title="Custom Format")
# Register custom converter
converter = MakeContextSimple()
converter.register_converter(MyCustomConverter(), priority=0)
Architecture
MakeContextSimple follows a plugin-based converter architecture:
MakeContextSimple (orchestrator)
├── HTMLConverter (abstract base)
│ ├── PDFConverter
│ ├── DOCXConverter
│ ├── PPTXConverter
│ ├── XLSXConverter
│ ├── ImageConverter
│ ├── CSVConverter
│ ├── JSONConverter
│ ├── XMLConverter
│ ├── HTMLConverter_Builtin
│ ├── MarkdownConverter
│ └── PlainTextConverter
├── HTMLBuilder (utilities)
└── HTMLResult (output container)
Key Components
- MakeContextSimple: Main orchestrator that manages converters and I/O
- HTMLConverter: Abstract base class for all format converters
- HTMLBuilder: Utility class for constructing semantic HTML
- HTMLResult: Container for conversion output with metadata
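To make the orchestrator's role concrete, here is a minimal sketch of how a priority-ordered dispatcher of this kind could work. It is not the library's implementation: `ConverterRegistry` and `UpperTextConverter` are hypothetical names, and the rule that lower priority values are tried first is an assumption the source does not state.

```python
import io


class ConverterRegistry:
    """Hypothetical dispatcher: try converters in priority order until one accepts."""

    def __init__(self):
        self._entries = []  # (priority, registration order, converter)

    def register_converter(self, converter, priority=10):
        self._entries.append((priority, len(self._entries), converter))
        # Assumption: lower priority values are tried first; ties keep registration order
        self._entries.sort(key=lambda entry: entry[:2])

    def convert(self, file_stream, extension=None, **kwargs):
        for _, _, converter in self._entries:
            if converter.accepts(file_stream, extension=extension, **kwargs):
                file_stream.seek(0)  # rewind in case accepts() peeked at the stream
                return converter.convert(file_stream, extension=extension, **kwargs)
        raise ValueError(f"no converter accepts extension {extension!r}")


class UpperTextConverter:
    """Toy converter used only to exercise the registry."""

    def accepts(self, file_stream, extension=None, **kwargs):
        return extension == ".txt"

    def convert(self, file_stream, extension=None, **kwargs):
        return "<pre>" + file_stream.read().decode("utf-8").upper() + "</pre>"


registry = ConverterRegistry()
registry.register_converter(UpperTextConverter(), priority=0)
print(registry.convert(io.BytesIO(b"hello"), extension=".txt"))  # <pre>HELLO</pre>
```

Keeping `accepts` separate from `convert` lets the dispatcher probe each converter cheaply before committing to one, which is what allows new formats to be added without touching the orchestrator.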
Why HTML Over Markdown?
| Aspect | Markdown | HTML |
|---|---|---|
| Token Efficiency | Good | Better (15-20% fewer) |
| Table Syntax | \|---\| separators | <table> tags |
| Semantic Meaning | Relies on conventions | Explicit tags |
| Parsing | Regex/string ops | Standard parsers |
| Preview | Needs rendering | Native browser |
Token Comparison Example
Markdown (180 tokens):
| Name | Age | City |
|-------|-----|----------|
| Alice | 30 | New York |
HTML (150 tokens):
<table>
<tr><td>Name</td><td>Age</td><td>City</td></tr>
<tr><td>Alice</td><td>30</td><td>New York</td></tr>
</table>
Plugin System
MakeContextSimple supports third-party plugins via Python's entry_points:
# In your plugin's pyproject.toml:
[project.entry-points."makecontextsimple.plugin"]
my_plugin = "my_package:register"
# In your plugin:
def register(converter_instance):
    converter_instance.register_converter(MyConverter(), priority=5)
Development
Setup
git clone https://github.com/makecontextsimple/makecontextsimple.git
cd makecontextsimple
pip install -e ".[dev]"
Running Tests
pytest tests/
Code Style
ruff check src/
ruff format src/
Docker Development
# Build development image
docker build -t makecontextsimple:dev .
# Run tests in container
docker run --rm makecontextsimple:dev python -m pytest tests/
# Interactive shell
docker run --rm -it makecontextsimple:dev /bin/bash
CI/CD
This project uses GitHub Actions for:
- CI (.github/workflows/ci.yml): Runs tests on push/PR
- Publish (.github/workflows/publish.yml): Publishes to PyPI and Docker Hub on release
Required Secrets
For publishing, add these secrets in GitHub Settings:
| Secret | Description |
|---|---|
| PYPI_API_TOKEN | PyPI API token |
| DOCKERHUB_USERNAME | Docker Hub username |
| DOCKERHUB_TOKEN | Docker Hub access token |
Publishing
Manual Publishing
# Build distribution
python -m build
# Check distribution
twine check dist/*
# Upload to PyPI
twine upload dist/*
Automated Publishing
Create a GitHub release to automatically publish to PyPI and Docker Hub.
# Create tag
git tag -a v0.1.0 -m "Release 0.1.0"
git push origin v0.1.0
# Create release on GitHub or use:
gh release create v0.1.0
License
MIT License
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.