Convert PDF, Office, data, and markup files into clean, self-contained HTML — for humans and for LLMs.
everythingtohtml
Convert (almost) any file into clean, self-contained HTML — a universal file reader for your browser and scripts.
everythingtohtml is the spiritual inverse of tools like markitdown: instead of flattening rich documents down to Markdown, it lifts a wide range of formats up into clean, styled, standalone HTML you can open in a browser, embed in a page, or feed to a workflow that wants structured markup.
One small API. One CLI. A pluggable converter registry. No browser, no network required for local files.
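In brief: everythingtohtml is a universal file reader for the browser, as well as a Python package and a CLI. It converts common files such as PDF, Office, Markdown, CSV, JSON, and EPUB into clean, self-contained HTML that is easy to read, share, and process automatically.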
from everythingtohtml import EverythingToHtml
eth = EverythingToHtml()
result = eth.convert("quarterly-report.docx")
print(result.html) # a complete <!DOCTYPE html> document
print(result.title) # best-effort document title
$ everythingtohtml notes.md -o notes.html
$ everythingtohtml data.csv > data.html
$ everythingtohtml https://example.com/feed.rss > feed.html
Why HTML (and not Markdown)?
Markdown is lossy: tables get flattened, styling vanishes, slide structure disappears, and nested data becomes ambiguous. HTML keeps the structure that matters — headings, tables, lists, sections, links, images — while staying:
- Human-friendly: open the output in any browser, no toolchain needed.
- Restyleable: every document ships with a small, overridable stylesheet.
- Structure-preserving: explicit <table>/<section> markup keeps tables, sections, and nested content easy to inspect and process.
- Self-contained: one file, valid HTML5, dark-mode aware.
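To illustrate the structure-preserving point: because tables survive as explicit <table> markup, a script can recover rows without guessing at delimiters. The following is a minimal sketch using only the standard library's html.parser. The sample string is invented for illustration and is not actual everythingtohtml output.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> list[list[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical input, shaped like a converted CSV file.
sample = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apples</td><td>3</td></tr></table>"
)
print(extract_rows(sample))
```

The same rows flattened to Markdown would need a dialect-specific table parser, which is the loss the section above describes.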
Supported formats
| Format | Extensions | Extra needed |
|---|---|---|
| Plain text | .txt, anything textual | none (built in) |
| Markdown | .md, .markdown, .mkd | none (built in) |
| HTML (clean/normalize) | .html, .htm, .xhtml | none (built in) |
| CSV / TSV | .csv, .tsv | none (built in) |
| JSON / JSONL | .json, .jsonl, .ndjson | none (built in) |
| Jupyter notebook | .ipynb | none (built in) |
| RSS / Atom feeds | .rss, .atom | none (built in) |
| EPUB e-books | .epub | none (built in) |
| Email | .eml | none (built in) |
| OpenDocument Text | .odt | none (built in) |
| YAML | .yaml, .yml | pip install everythingtohtml[yaml] |
| reStructuredText | .rst | pip install everythingtohtml[rst] |
| Word | .docx | pip install everythingtohtml[docx] |
| Word (legacy) | .doc | pip install everythingtohtml[doc] (LibreOffice recommended) |
| Excel | .xlsx, .xlsm | pip install everythingtohtml[xlsx] |
| PowerPoint | .pptx | pip install everythingtohtml[pptx] |
| PDF | .pdf | pip install everythingtohtml[pdf] |
Legacy .doc: best results come from having LibreOffice installed (used headlessly for high-fidelity conversion). Without it, a pure-Python olefile fallback recovers the text content.
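If you want to know ahead of time which path a legacy .doc file is likely to take, you can check whether a LibreOffice launcher is on your PATH. This is a rough sketch, not the library's own detection logic: the launcher names soffice and libreoffice are common ones, and the helper names and strategy labels are made up for this example.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice launcher on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def doc_strategy(launcher):
    """Pick a (hypothetical) strategy label for a legacy .doc file."""
    return "libreoffice-headless" if launcher else "olefile-text-fallback"


print(doc_strategy(find_libreoffice()))
```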
Want everything?
pip install everythingtohtml[all]
New formats are just a small class away — see Writing a converter.
Installation
# core formats only (tiny dependency footprint)
pip install everythingtohtml
# pull in Office + data formats
pip install "everythingtohtml[all]"
# or cherry-pick
pip install "everythingtohtml[docx,xlsx]"
Requires Python 3.10+.
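After installing, a quick way to see which extras are usable is to test whether their backing modules can be imported. The sketch below is only an illustration. The mapping from extra to module is an assumption: mammoth for docx comes from the architecture notes in this README, while yaml (PyYAML) for the yaml extra is a guess, so adjust the mapping to match the package metadata.

```python
from importlib.util import find_spec


def missing_extras(extra_modules: dict[str, str]) -> list[str]:
    """Return the extras whose backing top-level module cannot be imported."""
    return [
        extra
        for extra, module in extra_modules.items()
        if find_spec(module) is None
    ]


# Assumed extra -> module mapping; only "mammoth" is named in this README.
assumed = {"docx": "mammoth", "yaml": "yaml"}
print(missing_extras(assumed))
```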
Usage
Library
from everythingtohtml import EverythingToHtml
eth = EverythingToHtml()
# From a path
result = eth.convert("slides.pptx")
# From bytes or an open stream
with open("data.csv", "rb") as f:
    result = eth.convert(f)
# From a URL (http/https/file/data URIs)
result = eth.convert("https://example.com/posts.atom")
# Give hints when the source is ambiguous (e.g. stdin)
from everythingtohtml import StreamInfo
result = eth.convert(raw_bytes, stream_info=StreamInfo(extension=".md"))
result.html # the full HTML document (str)
result.title # detected title, or None
result.text_content # alias for .html (drop-in for markdown-style code)
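A common next step is converting a whole folder. Here is a small sketch of a batch helper that takes any callable returning an object with an .html attribute, so it works with eth.convert without importing the package itself. The helper name convert_tree and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Callable


def convert_tree(
    convert: Callable, src_dir: str, out_dir: str, pattern: str = "*"
) -> list[Path]:
    """Convert every matching file in src_dir and write <name>.html to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob(pattern)):
        if not src.is_file():
            continue
        result = convert(str(src))  # expected to expose a .html string
        # Keep the original suffix so a.md and a.csv do not collide.
        target = out / (src.name + ".html")
        target.write_text(result.html, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, a call would look like convert_tree(EverythingToHtml().convert, "docs", "site", "*.docx").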
Command line
everythingtohtml SOURCE [-o OUTPUT] [--extension .md] [--mimetype text/markdown]
# convert a file to a file
everythingtohtml report.docx -o report.html
# pipe through stdin (give it a hint)
cat notes.md | everythingtohtml --extension .md > notes.html
# fetch and convert a remote feed
everythingtohtml https://hnrss.org/frontpage > hn.html
The CLI is also available as e2h for the impatient.
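When driving the CLI from another Python program, it helps to build the argument list explicitly rather than formatting a shell string. This sketch uses only the flags shown in the usage line above (-o, --extension, --mimetype). The helper name e2h_argv is made up for this example.

```python
import shlex


def e2h_argv(source=None, output=None, extension=None, mimetype=None):
    """Build an argv list for the everythingtohtml CLI from the documented flags."""
    argv = ["everythingtohtml"]
    if source is not None:
        argv.append(source)
    if output is not None:
        argv += ["-o", output]
    if extension is not None:
        argv += ["--extension", extension]
    if mimetype is not None:
        argv += ["--mimetype", mimetype]
    return argv


# Print a copy-pasteable shell command for inspection.
print(shlex.join(e2h_argv("report.docx", output="report.html")))
```

To execute the command, pass the list to subprocess.run(argv, check=True). When no source is given, supply the document on stdin together with an extension hint.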
Merging and comparing documents
Need to collate a stack of Word files into one page, or see exactly what changed between two revisions? everythingtohtml does both — for any supported format.
eth = EverythingToHtml()
# Merge several documents into one HTML page (each becomes a section, with a TOC)
merged = eth.merge(["intro.docx", "chapter1.doc", "appendix.pdf"])
# Place them side by side for visual comparison
columns = eth.merge(["draft-v1.docx", "draft-v2.docx"], layout="columns")
# Produce a highlighted, line-by-line diff of two documents' text
changes = eth.diff("spec-old.docx", "spec-new.docx")
with open("changes.html", "w", encoding="utf-8") as f:
    f.write(changes.html)
From the CLI:
# two or more sources are merged automatically
everythingtohtml intro.docx chapter1.doc appendix.pdf -o handbook.html
# side-by-side layout
everythingtohtml old.docx new.docx --columns -o compare.html
# highlighted diff of exactly two documents
everythingtohtml spec-old.docx spec-new.docx --diff -o changes.html
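For a sense of what a highlighted, line-by-line diff involves, the standard library's difflib.HtmlDiff produces a comparable result from two plain-text strings. This is a stand-in for illustration only; nothing here describes how everythingtohtml implements diff(), and the function name html_diff is hypothetical.

```python
import difflib


def html_diff(old_text: str, new_text: str, old_name="old", new_name="new") -> str:
    """Return a standalone HTML page with a highlighted line-by-line diff."""
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        old_text.splitlines(),
        new_text.splitlines(),
        fromdesc=old_name,
        todesc=new_name,
        context=True,  # show only changed regions plus nearby lines
        numlines=2,    # number of context lines around each change
    )


page = html_diff("Title\nfirst draft\n", "Title\nsecond draft\n")
print(len(page) > 0)
```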
Architecture
everythingtohtml borrows the proven shape of markitdown:
EverythingToHtml # engine: detection + dispatch + plugins
├─ StreamInfo # immutable bag of hints (ext, mime, charset, …)
├─ DocumentConverter # base class: accepts() + convert()
│ ├─ MarkdownConverter
│ ├─ CsvConverter
│ ├─ DocxConverter (mammoth)
│ └─ … one small class per format
└─ DocumentConverterResult # { html, title, metadata }
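The "immutable bag of hints" idea can be pictured as a frozen dataclass whose updates return new copies instead of mutating in place. The class below is a hypothetical stand-in named Hints. It does not reproduce the real StreamInfo, whose fields and methods may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Hints:
    """Illustrative stand-in for an immutable set of detection hints."""

    extension: Optional[str] = None
    mimetype: Optional[str] = None
    charset: Optional[str] = None

    def merged_with(self, **updates) -> "Hints":
        """Return a copy where non-None updates override existing fields."""
        known = {k: v for k, v in updates.items() if v is not None}
        return replace(self, **known)


base = Hints(extension=".md")
refined = base.merged_with(mimetype="text/markdown", charset=None)
print(refined)
```

Because instances are frozen, each detection step can refine the hints without affecting what an earlier step saw.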
When you call convert(), the engine:
- Detects the stream: extension, mimetype, declared charset, and magic-byte sniffing via puremagic fill in a StreamInfo.
- Dispatches: converters are tried in priority order; each accepts() is a cheap, non-destructive check. Specific formats win over the plain-text catch-all.
- Converts: the winning converter returns a DocumentConverterResult. If a converter accepts but raises, the engine records it and tries the next one, so one greedy converter can't sink the whole conversion.
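The dispatch-and-fallback behaviour described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the engine's code: the Converter base class, the numeric priority attribute (lower runs first), and the dispatch function are all invented for this example.

```python
import html
import io


class Converter:
    priority = 10.0  # assumption: lower values are tried first

    def accepts(self, stream, ext):
        raise NotImplementedError

    def convert(self, stream, ext):
        raise NotImplementedError


class BrokenMarkdown(Converter):
    """A specific converter that accepts .md but fails while converting."""
    priority = 0.0

    def accepts(self, stream, ext):
        return ext == ".md"

    def convert(self, stream, ext):
        raise ValueError("parser blew up")


class PlainText(Converter):
    """Catch-all that accepts anything and escapes it into <pre>."""
    priority = 10.0

    def accepts(self, stream, ext):
        return True

    def convert(self, stream, ext):
        return "<pre>" + html.escape(stream.read().decode("utf-8")) + "</pre>"


def dispatch(converters, data: bytes, ext: str):
    """Try converters in priority order; record failures and move on."""
    failures = []
    for conv in sorted(converters, key=lambda c: c.priority):
        stream = io.BytesIO(data)  # fresh stream for every attempt
        if not conv.accepts(stream, ext):
            continue
        stream.seek(0)  # undo any peeking done by accepts()
        try:
            return conv.convert(stream, ext), failures
        except Exception as exc:
            failures.append((type(conv).__name__, exc))
    raise RuntimeError(f"no converter succeeded: {failures}")


out, failures = dispatch([PlainText(), BrokenMarkdown()], b"# hi <b>", ".md")
print(out)
print([name for name, _ in failures])
```

Here the Markdown converter wins the priority race, fails, and the plain-text catch-all still produces output, which mirrors the "one greedy converter can't sink the whole conversion" guarantee.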
Writing a converter
from everythingtohtml import DocumentConverter, DocumentConverterResult, EverythingToHtml, StreamInfo
from everythingtohtml._html_builder import wrap_document, escape_text

class UpperTextConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info: StreamInfo, **kwargs) -> bool:
        # Claim only files with the .loud extension.
        return stream_info.normalized_extension() == ".loud"

    def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
        # Upper-case the text, escape it, and wrap it in a full HTML document.
        text = file_stream.read().decode("utf-8").upper()
        return DocumentConverterResult(wrap_document(f"<pre>{escape_text(text)}</pre>"))

eth = EverythingToHtml()
eth.register_converter(UpperTextConverter())
Ship it as a package and expose it as a plugin via entry points, so that any user who constructs EverythingToHtml(enable_plugins=True) picks it up automatically; see docs/PLUGINS.md.
Contributing
Contributions are very welcome — new converters especially. See CONTRIBUTING.md and our Code of Conduct. Found a security issue? See SECURITY.md.
Acknowledgements
The converter-registry design is directly inspired by Microsoft's excellent markitdown. everythingtohtml aims to be its mirror image for teams that want structure-preserving HTML instead of Markdown.
License
MIT © everythingtohtml contributors