MakeContextSimple

License: MIT | Python 3.10+

Convert documents to semantic HTML optimized for LLM context consumption.

Author: Ashish Sharma
Email: ashishsharma12549@gmail.com
LinkedIn: linkedin.com/in/ashish-sharma-3613a82b8

Overview

MakeContextSimple is a Python utility that converts various document formats into clean, semantic HTML optimized for large language model (LLM) consumption. Unlike Markdown-based converters, MakeContextSimple produces HTML that is:

  • Token-efficient: Less syntax overhead than Markdown for complex structures
  • Semantically rich: HTML tags convey meaning without extra markers
  • Machine-parseable: Standard HTML parsers work reliably
  • Browser-viewable: Output can be directly viewed in any browser
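
To illustrate the machine-parseable point, here is a minimal sketch that reads table rows out of an HTML fragment using only Python's standard-library html.parser. The TableRows class and the sample fragment are illustrative assumptions, not part of MakeContextSimple or its actual output.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html_text):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical fragment, similar in shape to a converted table.
sample = (
    "<table>"
    "<tr><td>Name</td><td>Age</td><td>City</td></tr>"
    "<tr><td>Alice</td><td>30</td><td>New York</td></tr>"
    "</table>"
)
rows = table_rows(sample)
```

The same approach extends to headings or lists by handling more tag names in handle_starttag and handle_endtag.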

Supported Formats

Category    Formats
Documents   PDF, DOCX, Markdown
Office      PPTX, XLSX
Web         HTML, XML, RSS
Data        CSV, JSON
Text        Plain text, Code files, Config files
Images      JPG, PNG, GIF, WebP, BMP

Installation

Basic Installation

pip install makecontextsimple

With Optional Dependencies

# For PDF support (the quotes keep shells such as zsh from expanding the brackets)
pip install "makecontextsimple[pdf]"

# For Office document support
pip install "makecontextsimple[docx,pptx,xlsx]"

# For image support
pip install "makecontextsimple[image]"

# For all formats
pip install "makecontextsimple[all]"

From Source

git clone https://github.com/Ashish813213/MakeContextSimple.git
cd MakeContextSimple
pip install -e ".[all]"

Docker

# Build image
docker build -t makecontextsimple .

# Convert a file (the current directory is mounted at /data, so paths use that prefix)
docker run --rm -v "$(pwd)":/data makecontextsimple /data/document.pdf -o /data/output.html

# LLM-optimized output
docker run --rm -v "$(pwd)":/data makecontextsimple /data/document.pdf --llm -o /data/context.html

Docker Compose

# Single file conversion
docker compose run convert

# LLM-optimized conversion
docker compose run convert-llm

# Batch convert all PDFs in input/ folder
docker compose run batch

Usage

Command Line

# Convert a file to HTML (output to stdout)
makecontextsimple document.pdf

# Convert with custom output file
makecontextsimple document.pdf -o output.html

# Generate minimal HTML for LLM context
makecontextsimple document.pdf --llm

# List supported formats
makecontextsimple --list-formats
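
For scripted use, the CLI can also be driven from Python. The sketch below is one possible wrapper, assuming the makecontextsimple command is on PATH; it uses only the -o and --llm flags shown above. The helper names build_command and convert_folder are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(source, output=None, llm=False):
    """Build the argument list for one makecontextsimple CLI call."""
    cmd = ["makecontextsimple", str(source)]
    if llm:
        cmd.append("--llm")
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def convert_folder(folder, llm=False, dry_run=False):
    """Convert every PDF in `folder`, writing <name>.html next to each one.

    With dry_run=True the commands are returned without being executed.
    """
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = build_command(pdf, pdf.with_suffix(".html"), llm=llm)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if the CLI exits non-zero
            subprocess.run(cmd, check=True)
    return commands
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.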

Python API

from makecontextsimple import MakeContextSimple

# Initialize converter
converter = MakeContextSimple()

# Convert a file
result = converter.convert("document.pdf")

# Get full HTML document
html = result.to_full_document()
print(html)

# Get minimal HTML for LLM context
llm_context = result.to_llm_context()

# Save directly to file
converter.convert_to_file("document.pdf", "output.html")

# Convert URL content
import requests
response = requests.get("https://example.com/page.html", timeout=30)
response.raise_for_status()  # fail early on HTTP errors
result = converter.convert(response)
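
Converting a whole folder is a common next step. The following is a minimal sketch that relies only on the convert_to_file(src, dst) call shown above; the convert_tree helper is hypothetical. Since this README does not document which exceptions the library raises, the sketch catches Exception broadly and reports failures to the caller.

```python
from pathlib import Path


def convert_tree(converter, source_dir, output_dir, pattern="*.pdf"):
    """Convert every file in source_dir matching pattern into output_dir.

    `converter` is any object exposing convert_to_file(src, dst).
    Returns (converted, failed): output paths, and (source, exception) pairs.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for src in sorted(source_dir.glob(pattern)):
        dst = output_dir / (src.stem + ".html")
        try:
            converter.convert_to_file(str(src), str(dst))
        except Exception as exc:  # exception types are an unknown here
            failed.append((src, exc))
        else:
            converted.append(dst)
    return converted, failed
```

With the package installed, a call would look like convert_tree(MakeContextSimple(), "input", "output").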

Custom Styles

# Use custom CSS
custom_css = """
body { font-family: Arial; max-width: 800px; margin: 0 auto; }
h1 { color: #333; }
"""
result = converter.convert("document.pdf")
html = result.to_full_document(styles=custom_css)

Custom Converters

from html import escape

from makecontextsimple import HTMLConverter, HTMLResult

class MyCustomConverter(HTMLConverter):
    def accepts(self, file_stream, mimetype=None, extension=None, **kwargs):
        # extension may be None or differently cased, so normalize first
        return (extension or "").lower() == ".myformat"

    def convert(self, file_stream, mimetype=None, extension=None, **kwargs):
        content = file_stream.read().decode("utf-8")
        # Custom conversion logic; escape the text so "<" and "&" stay literal
        body = f"<pre>{escape(content)}</pre>"
        return HTMLResult(html=body, title="Custom Format")

# Register custom converter
converter = MakeContextSimple()
converter.register_converter(MyCustomConverter(), priority=0)

Architecture

MakeContextSimple follows a plugin-based converter architecture:

MakeContextSimple (orchestrator)
    ├── HTMLConverter (abstract base)
    │   ├── PDFConverter
    │   ├── DOCXConverter
    │   ├── PPTXConverter
    │   ├── XLSXConverter
    │   ├── ImageConverter
    │   ├── CSVConverter
    │   ├── JSONConverter
    │   ├── XMLConverter
    │   ├── HTMLConverter_Builtin
    │   ├── MarkdownConverter
    │   └── PlainTextConverter
    ├── HTMLBuilder (utilities)
    └── HTMLResult (output container)

Key Components

  • MakeContextSimple: Main orchestrator that manages converters and I/O
  • HTMLConverter: Abstract base class for all format converters
  • HTMLBuilder: Utility class for constructing semantic HTML
  • HTMLResult: Container for conversion output with metadata
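
The orchestrator-plus-converters pattern can be sketched in a few lines. The toy registry below is not the library's implementation: it assumes that lower priority values are tried first and that the first converter whose accepts() returns True handles the input. ConverterRegistry and UpperTextConverter are hypothetical names used only for illustration.

```python
import io


class ConverterRegistry:
    """Toy orchestrator: try converters in priority order until one accepts."""

    def __init__(self):
        self._entries = []  # (priority, insertion order, converter)

    def register_converter(self, converter, priority=10):
        self._entries.append((priority, len(self._entries), converter))
        # Assumption: lower priority values are tried first.
        self._entries.sort(key=lambda entry: entry[:2])

    def convert(self, data, extension=None):
        for _, _, converter in self._entries:
            stream = io.BytesIO(data)  # fresh stream for each attempt
            if converter.accepts(stream, extension=extension):
                stream.seek(0)  # accepts() may have consumed some bytes
                return converter.convert(stream, extension=extension)
        raise ValueError(f"no converter accepts extension {extension!r}")


class UpperTextConverter:
    """Hypothetical converter used only to exercise the registry."""

    def accepts(self, file_stream, mimetype=None, extension=None, **kwargs):
        return extension == ".txt"

    def convert(self, file_stream, mimetype=None, extension=None, **kwargs):
        return "<pre>" + file_stream.read().decode("utf-8").upper() + "</pre>"


registry = ConverterRegistry()
registry.register_converter(UpperTextConverter(), priority=0)
result = registry.convert(b"hello", extension=".txt")
```

Keeping the accept check separate from the conversion lets the orchestrator probe several converters cheaply before committing to one.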

Why HTML Over Markdown?

Aspect            Markdown               HTML
Token Efficiency  Good                   Better (15-20% fewer)
Table Syntax      |---| separators       <table> tags
Semantic Meaning  Relies on conventions  Explicit tags
Parsing           Regex/string ops       Standard parsers
Preview           Needs rendering        Native browser

Token Comparison Example

Markdown (180 tokens):

| Name  | Age | City     |
|-------|-----|----------|
| Alice | 30  | New York |

HTML (150 tokens):

<table>
<tr><td>Name</td><td>Age</td><td>City</td></tr>
<tr><td>Alice</td><td>30</td><td>New York</td></tr>
</table>
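
Token counts vary by tokenizer, so a comparison like this is worth repeating with the tokenizer of the model you target. The sketch below is tokenizer-agnostic: it assumes you pass in a callable that turns text into a sequence of tokens. The compare_token_counts helper is hypothetical and not part of MakeContextSimple.

```python
def compare_token_counts(markdown_text, html_text, tokenize):
    """Return (markdown_count, html_count, relative_change) for one tokenizer.

    `tokenize` is any callable mapping a string to a sequence of tokens.
    relative_change is (html - markdown) / markdown, so a negative value
    means the HTML form used fewer tokens.
    """
    md_count = len(tokenize(markdown_text))
    html_count = len(tokenize(html_text))
    change = (html_count - md_count) / md_count if md_count else 0.0
    return md_count, html_count, change
```

In practice, tokenize would be the encode function of the tokenizer used by your model; running the comparison on representative documents gives more reliable numbers than a single small table.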

Plugin System

MakeContextSimple supports third-party plugins through Python entry points registered under the makecontextsimple.plugin group:

# In your plugin's pyproject.toml:
[project.entry-points."makecontextsimple.plugin"]
my_plugin = "my_package:register"

# In your plugin:
def register(converter_instance):
    converter_instance.register_converter(MyConverter(), priority=5)

Development

Setup

git clone https://github.com/Ashish813213/MakeContextSimple.git
cd MakeContextSimple
pip install -e ".[dev]"

Running Tests

pytest tests/

Code Style

ruff check src/
ruff format src/

Docker Development

# Build development image
docker build -t makecontextsimple:dev .

# Run tests in container
docker run --rm makecontextsimple:dev python -m pytest tests/

# Interactive shell
docker run --rm -it makecontextsimple:dev /bin/bash

CI/CD

This project uses GitHub Actions for:

  • CI (.github/workflows/ci.yml): Runs tests on push/PR
  • Publish (.github/workflows/publish.yml): Publishes to PyPI and Docker Hub on release

Required Secrets

For publishing, add these secrets in GitHub Settings:

Secret              Description
PYPI_API_TOKEN      PyPI API token
DOCKERHUB_USERNAME  Docker Hub username
DOCKERHUB_TOKEN     Docker Hub access token

Publishing

Manual Publishing

# Build distribution
python -m build

# Check distribution
twine check dist/*

# Upload to PyPI
twine upload dist/*

Automated Publishing

Create a GitHub release to automatically publish to PyPI and Docker Hub.

# Create tag
git tag -a v0.1.0 -m "Release 0.1.0"
git push origin v0.1.0

# Create release on GitHub or use:
gh release create v0.1.0

License

MIT License

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
