MakeContextSimple

Convert documents to semantic HTML optimized for LLM context consumption.

Overview

MakeContextSimple is a Python utility that converts various document formats into clean, semantic HTML optimized for large language model (LLM) consumption. Unlike Markdown-based converters, MakeContextSimple produces HTML that is:

  • Token-efficient: Less syntax overhead than Markdown for complex structures
  • Semantically rich: HTML tags convey meaning without extra markers
  • Machine-parseable: Standard HTML parsers work reliably
  • Browser-viewable: Output can be directly viewed in any browser
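As a small illustration of the "machine-parseable" point, the sketch below pulls a heading outline out of converted HTML using only Python's standard-library html.parser. The sample markup and the HeadingOutline and extract_outline names are hypothetical; the tags MakeContextSimple actually emits may differ.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for every <h1>..<h6> element."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (level, text) tuples
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open heading.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._parts).strip()))
            self._level = None


def extract_outline(html_text):
    parser = HeadingOutline()
    parser.feed(html_text)
    parser.close()
    return parser.outline


# Hypothetical converter output, used only for illustration.
sample = "<h1>Report</h1><p>Intro</p><h2>Results <em>2024</em></h2>"
print(extract_outline(sample))  # [(1, 'Report'), (2, 'Results 2024')]
```

Because the structure is carried by tags rather than by line-prefix conventions, the same parser works regardless of how the text inside each heading is formatted.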

Supported Formats

Category    Formats
Documents   PDF, DOCX, Markdown
Office      PPTX, XLSX
Web         HTML, XML, RSS
Data        CSV, JSON
Text        Plain text, Code files, Config files
Images      JPG, PNG, GIF, WebP, BMP

Installation

Basic Installation

pip install makecontextsimple

With Optional Dependencies

# For PDF support
pip install makecontextsimple[pdf]

# For Office document support
pip install makecontextsimple[docx,pptx,xlsx]

# For image support
pip install makecontextsimple[image]

# For all formats
pip install makecontextsimple[all]

From Source

git clone https://github.com/makecontextsimple/makecontextsimple.git
cd makecontextsimple
pip install -e ".[all]"

Docker

# Build image
docker build -t makecontextsimple .

# Convert a file
docker run --rm -v "$(pwd)":/data makecontextsimple document.pdf -o /data/output.html

# LLM-optimized output
docker run --rm -v "$(pwd)":/data makecontextsimple document.pdf --llm -o /data/context.html

Docker Compose

# Single file conversion
docker compose run convert

# LLM-optimized conversion
docker compose run convert-llm

# Batch convert all PDFs in input/ folder
docker compose run batch

Usage

Command Line

# Convert a file to HTML (output to stdout)
makecontextsimple document.pdf

# Convert with custom output file
makecontextsimple document.pdf -o output.html

# Generate minimal HTML for LLM context
makecontextsimple document.pdf --llm

# List supported formats
makecontextsimple --list-formats

Python API

from makecontextsimple import MakeContextSimple

# Initialize converter
converter = MakeContextSimple()

# Convert a file
result = converter.convert("document.pdf")

# Get full HTML document
html = result.to_full_document()
print(html)

# Get minimal HTML for LLM context
llm_context = result.to_llm_context()

# Save directly to file
converter.convert_to_file("document.pdf", "output.html")

# Convert URL content
import requests
response = requests.get("https://example.com/page.html")
result = converter.convert(response)
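
For converting a whole folder from Python, one possible approach is a small helper built around convert_to_file. This is a sketch, not part of the package: batch_convert, its parameters, and the directory names are hypothetical, and the only thing it assumes about the library is the convert_to_file(source, output) call shown above.

```python
from pathlib import Path


def batch_convert(convert_to_file, input_dir, output_dir, pattern="*.pdf"):
    """Convert every file matching `pattern` and return the output paths.

    `convert_to_file` is any callable taking (source_path, output_path),
    for example the bound method MakeContextSimple().convert_to_file.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for source in sorted(input_dir.glob(pattern)):
        # report.pdf -> <output_dir>/report.html
        target = output_dir / (source.stem + ".html")
        convert_to_file(str(source), str(target))
        written.append(target)
    return written


# Usage, assuming the package is installed and an input/ folder exists:
#   from makecontextsimple import MakeContextSimple
#   converter = MakeContextSimple()
#   batch_convert(converter.convert_to_file, "input", "output")
```

Passing the conversion function in as an argument keeps the helper easy to test with a stand-in callable.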

Custom Styles

# Use custom CSS
custom_css = """
body { font-family: Arial; max-width: 800px; margin: 0 auto; }
h1 { color: #333; }
"""
result = converter.convert("document.pdf")
html = result.to_full_document(styles=custom_css)

Custom Converters

from html import escape

from makecontextsimple import HTMLConverter, HTMLResult

class MyCustomConverter(HTMLConverter):
    def accepts(self, file_stream, mimetype=None, extension=None, **kwargs):
        # Claim only files with the custom extension
        return extension == ".myformat"

    def convert(self, file_stream, mimetype=None, extension=None, **kwargs):
        content = file_stream.read().decode("utf-8")
        # Custom conversion logic; escape the text so "<" and "&" stay literal
        html = f"<pre>{escape(content)}</pre>"
        return HTMLResult(html=html, title="Custom Format")

# Register custom converter
converter = MakeContextSimple()
converter.register_converter(MyCustomConverter(), priority=0)

Architecture

MakeContextSimple follows a plugin-based converter architecture:

MakeContextSimple (orchestrator)
    ├── HTMLConverter (abstract base)
    │   ├── PDFConverter
    │   ├── DOCXConverter
    │   ├── PPTXConverter
    │   ├── XLSXConverter
    │   ├── ImageConverter
    │   ├── CSVConverter
    │   ├── JSONConverter
    │   ├── XMLConverter
    │   ├── HTMLConverter_Builtin
    │   ├── MarkdownConverter
    │   └── PlainTextConverter
    ├── HTMLBuilder (utilities)
    └── HTMLResult (output container)

Key Components

  • MakeContextSimple: Main orchestrator that manages converters and I/O
  • HTMLConverter: Abstract base class for all format converters
  • HTMLBuilder: Utility class for constructing semantic HTML
  • HTMLResult: Container for conversion output with metadata
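
The relationship between the orchestrator and its converters can be pictured with a toy dispatcher. This is a sketch under stated assumptions, not the library's implementation: it assumes converters are tried in ascending priority order and that the first one whose accepts() returns true handles the input, which this README does not confirm. ToyOrchestrator, MyFormatConverter, and PlainTextFallback are hypothetical stand-ins that return plain strings instead of HTMLResult objects.

```python
import io
from html import escape


class PlainTextFallback:
    """Catch-all converter: accepts anything and wraps it in <pre>."""

    def accepts(self, file_stream, mimetype=None, extension=None, **kwargs):
        return True

    def convert(self, file_stream, mimetype=None, extension=None, **kwargs):
        return f"<pre>{escape(file_stream.read().decode('utf-8'))}</pre>"


class MyFormatConverter:
    """Format-specific converter: only claims the .myformat extension."""

    def accepts(self, file_stream, mimetype=None, extension=None, **kwargs):
        return extension == ".myformat"

    def convert(self, file_stream, mimetype=None, extension=None, **kwargs):
        return f"<article>{escape(file_stream.read().decode('utf-8'))}</article>"


class ToyOrchestrator:
    def __init__(self):
        self._converters = []  # list of (priority, converter) pairs

    def register_converter(self, converter, priority=0):
        self._converters.append((priority, converter))
        # ASSUMPTION: lower priority values are tried first.
        # The sort is stable, so equal priorities keep registration order.
        self._converters.sort(key=lambda pair: pair[0])

    def convert(self, data, extension=None):
        for _, converter in self._converters:
            stream = io.BytesIO(data)  # fresh stream for every attempt
            if converter.accepts(stream, extension=extension):
                stream.seek(0)  # accepts() may have peeked at the content
                return converter.convert(stream, extension=extension)
        raise ValueError(f"no converter accepts extension {extension!r}")


orchestrator = ToyOrchestrator()
orchestrator.register_converter(PlainTextFallback(), priority=10)
orchestrator.register_converter(MyFormatConverter(), priority=0)
print(orchestrator.convert(b"a < b", extension=".myformat"))  # <article>a &lt; b</article>
print(orchestrator.convert(b"a < b", extension=".txt"))       # <pre>a &lt; b</pre>
```

Under this assumed ordering, a catch-all converter registered with a high priority value only runs when no more specific converter claims the file.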

Why HTML Over Markdown?

Aspect            Markdown               HTML
Token Efficiency  Good                   Better (15-20% fewer)
Table Syntax      |---| separators       <table> tags
Semantic Meaning  Relies on conventions  Explicit tags
Parsing           Regex/string ops       Standard parsers
Preview           Needs rendering        Native browser

Token Comparison Example

Markdown (180 tokens):

| Name  | Age | City     |
|-------|-----|----------|
| Alice | 30  | New York |

HTML (150 tokens):

<table>
<tr><td>Name</td><td>Age</td><td>City</td></tr>
<tr><td>Alice</td><td>30</td><td>New York</td></tr>
</table>
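
To show how tabular data maps onto the compact table form above, here is a minimal standard-library sketch that renders CSV text as one <tr> per row. It is an illustration only: csv_to_html_table is a hypothetical helper, and the project's own CSV converter may produce different markup (for example <th> cells for the header row).

```python
import csv
import io
from html import escape


def csv_to_html_table(csv_text):
    """Render CSV text as a compact <table>, one <tr> per row."""
    lines = ["<table>"]
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines in the input
        # Escape each cell so characters like "<" and "&" stay literal.
        cells = "".join(f"<td>{escape(cell.strip())}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(csv_to_html_table("Name,Age,City\nAlice,30,New York\n"))
```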

Plugin System

MakeContextSimple supports third-party plugins via Python's entry_points:

# In your plugin's pyproject.toml:
[project.entry-points."makecontextsimple.plugin"]
my_plugin = "my_package:register"

# In your plugin:
def register(converter_instance):
    converter_instance.register_converter(MyConverter(), priority=5)
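
On the host side, entry-point discovery could look roughly like the sketch below. It uses importlib.metadata.entry_points from the standard library (the group= keyword requires Python 3.10 or later). The README does not show how MakeContextSimple actually loads plugins, so load_plugins, its discovered parameter, and the error handling are assumptions made for illustration.

```python
from importlib.metadata import entry_points

PLUGIN_GROUP = "makecontextsimple.plugin"  # group name from the pyproject.toml snippet


def load_plugins(converter_instance, discovered=None):
    """Call each plugin's register() hook and return the names that loaded.

    `discovered` lets callers inject entry points (useful in tests);
    by default they are looked up from installed package metadata.
    """
    if discovered is None:
        discovered = entry_points(group=PLUGIN_GROUP)  # Python 3.10+

    loaded = []
    for entry_point in discovered:
        try:
            register = entry_point.load()  # resolves e.g. "my_package:register"
            register(converter_instance)
        except Exception as exc:
            # ASSUMPTION: one broken plugin should not block the others.
            print(f"skipping plugin {entry_point.name}: {exc}")
            continue
        loaded.append(entry_point.name)
    return loaded
```

Each plugin's register function receives the converter instance and is free to call register_converter on it, as in the snippet above.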

Development

Setup

git clone https://github.com/makecontextsimple/makecontextsimple.git
cd makecontextsimple
pip install -e ".[dev]"

Running Tests

pytest tests/

Code Style

ruff check src/
ruff format src/

Docker Development

# Build development image
docker build -t makecontextsimple:dev .

# Run tests in container
docker run --rm makecontextsimple:dev python -m pytest tests/

# Interactive shell
docker run --rm -it makecontextsimple:dev /bin/bash

CI/CD

This project uses GitHub Actions for:

  • CI (.github/workflows/ci.yml): Runs tests on push/PR
  • Publish (.github/workflows/publish.yml): Publishes to PyPI and Docker Hub on release

Required Secrets

For publishing, add these secrets in GitHub Settings:

Secret              Description
PYPI_API_TOKEN      PyPI API token
DOCKERHUB_USERNAME  Docker Hub username
DOCKERHUB_TOKEN     Docker Hub access token

Publishing

Manual Publishing

# Build distribution
python -m build

# Check distribution
twine check dist/*

# Upload to PyPI
twine upload dist/*

Automated Publishing

Create a GitHub release to automatically publish to PyPI and Docker Hub.

# Create tag
git tag -a v0.1.0 -m "Release 0.1.0"
git push origin v0.1.0

# Create release on GitHub or use:
gh release create v0.1.0

License

MIT License

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

