Universal eBook → Markdown converter and cleaner
allmark
Universal eBook → Markdown converter and cleaner. Handles all formats, all artifacts, all chapter styles automatically.
Transform your entire eBook library into clean, readable Markdown files with a single command. allmark intelligently strips away the cruft—frontmatter, backmatter, headers, footers, page numbers, and metadata—leaving only the pure narrative content.
✨ Features
Core Capabilities
- 📚 Universal Format Support: Convert 40+ formats to clean Markdown (10 verified: EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
- 🧹 Intelligent Cleaning: Automatically removes frontmatter, backmatter, headers, footers, page numbers
- 🔧 OCR Repair: Fixes broken hyphenation, ligatures, and common OCR artifacts
- 📖 Chapter Detection: Standardizes chapter markers across different formats
- 🎯 Artifact Removal: Strips ebook metadata, CSS classes, Calibre IDs, and other cruft
- 🛡️ Safety First: Never removes more than 50% of content (built-in safety check)
- 📊 Progress Tracking: SQLite database logs all conversions with statistics
- 📄 JSONL Export: Token-based text chunking for ML/AI training datasets
- 🎛️ Flexible Splitting: Paragraph-aware or strict token boundary splitting
- 🏷️ Custom Metadata: Add arbitrary metadata to JSONL records
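The 50% safety check can be pictured as a guard wrapped around a cleaning pass. This is a minimal sketch, not allmark's actual code: the names `apply_with_safety` and `strip_short_lines` are hypothetical, and measuring removal by character count is an assumption (the real check may count words or lines).

```python
from typing import Callable


def apply_with_safety(original: str, cleaner: Callable[[str], str],
                      max_removed: float = 0.5) -> str:
    """Run `cleaner`, but keep the original text if too much was removed."""
    cleaned = cleaner(original)
    if not original:
        return cleaned
    # Fraction of characters removed (assumption: characters, not words).
    removed = 1 - len(cleaned) / len(original)
    # Reject the pass only when it strips MORE than the allowed fraction.
    return original if removed > max_removed else cleaned


def strip_short_lines(text: str) -> str:
    """Toy cleaner for demonstration: drop lines of three characters or fewer."""
    return "\n".join(line for line in text.splitlines() if len(line) > 3)
```

With this guard, a cleaner that misfires and deletes most of a book leaves the text untouched instead of producing a near-empty file.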
What Makes allmark Different?
- Statistical Analysis: Uses document structure analysis to intelligently identify and remove non-content sections
- Dialogue-Aware: Preserves paragraph breaks in dialogue while merging broken narrative paragraphs
- Format Agnostic: Same great results whether your source is a scanned PDF or a modern EPUB
- Zero Configuration: Works out of the box with sensible defaults
- Batch Processing: Convert entire libraries with a single command
- ML-Ready Output: Direct JSONL export with configurable chunk sizes for training datasets
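One way to read "dialogue-aware" merging: rejoin a paragraph with the previous one only when the previous one ends mid-sentence and the new one does not open with a quotation mark. The sketch below illustrates that heuristic under those assumptions; `merge_paragraphs` and both patterns are hypothetical and are not taken from allmark's source.

```python
import re

# A paragraph "ends cleanly" if it finishes with sentence punctuation,
# optionally followed by closing quotes or a bracket.
TERMINAL = re.compile(r'[.!?…]["\'”’)]*$')
OPENING_QUOTES = ('"', "“", "‘", "'")


def merge_paragraphs(text: str) -> str:
    """Rejoin paragraphs broken mid-sentence, leaving dialogue lines separate."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    merged = []
    for para in paragraphs:
        para = " ".join(para.split())  # collapse hard wraps inside the paragraph
        prev_incomplete = bool(merged) and not TERMINAL.search(merged[-1])
        if prev_incomplete and not para.startswith(OPENING_QUOTES):
            merged[-1] = f"{merged[-1]} {para}"
        else:
            merged.append(para)
    return "\n\n".join(merged)
```

The heuristic is deliberately conservative: a paragraph that starts with a quote is never merged, so short exchanges of dialogue keep their breaks even when the preceding line lacks final punctuation.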
📦 Installation
Quick Install (pip)
pip install git+https://github.com/dcondrey/allmark.git
Development Install
Using pip:
git clone https://github.com/dcondrey/allmark.git
cd allmark
pip install -e .
Using Poetry:
git clone https://github.com/dcondrey/allmark.git
cd allmark
poetry install
poetry shell
Using Conda:
git clone https://github.com/dcondrey/allmark.git
cd allmark
conda env create -f environment.yml
conda activate allmark
🔧 Requirements
allmark has zero Python dependencies - uses only Python stdlib!
External Tools
| Tool | Purpose | Required? |
|---|---|---|
| pandoc | EPUB, DOCX converter | ✅ Yes |
| pdftotext (poppler) | PDF text extraction | ✅ Yes |
| ebook-convert (Calibre) | FB2, MOBI fallback | ⚠️ Optional |
PDF Extraction:
- Uses pdftotext with -layout mode (preserves formatting)
- Falls back to -raw mode if layout fails
- Final fallback to ebook-convert if both fail
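The fallback order above can be sketched as a loop over candidate commands. This is an illustration only: `extract_pdf`, `run_tool`, and the success criterion (exit code 0 plus a non-empty output file) are assumptions, not allmark's implementation. The `pdftotext -layout` and `-raw` flags and the `ebook-convert input output` form are the tools' standard command lines.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], bool]


def run_tool(cmd: Sequence[str]) -> bool:
    """Run an external command and report success instead of raising."""
    try:
        return subprocess.run(list(cmd), capture_output=True, timeout=300).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False  # tool missing from PATH, not executable, or hung


def extract_pdf(pdf: Path, out: Path, run: Runner = run_tool) -> Optional[str]:
    """Try each extractor in order; return the name of the one that worked."""
    attempts = [
        ("layout", ["pdftotext", "-layout", str(pdf), str(out)]),
        ("raw", ["pdftotext", "-raw", str(pdf), str(out)]),
        ("ebook-convert", ["ebook-convert", str(pdf), str(out)]),
    ]
    for name, cmd in attempts:
        out.unlink(missing_ok=True)  # discard partial output from a failed attempt
        if run(cmd) and out.is_file() and out.stat().st_size > 0:
            return name
    return None
```

A call such as `extract_pdf(Path("book.pdf"), Path("book.txt"))` returns `"layout"`, `"raw"`, `"ebook-convert"`, or `None`. The `run` parameter exists so the chain can be exercised without the external tools installed.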
Installing External Dependencies
macOS (Homebrew)
brew install pandoc poppler
brew install --cask calibre # optional
Ubuntu/Debian
sudo apt-get install pandoc poppler-utils
sudo apt-get install calibre # optional
Windows (Chocolatey)
choco install pandoc poppler
choco install calibre # optional
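After installing, it can help to confirm the tools are actually on your PATH. The following sketch uses only `shutil.which` from the standard library; the function name `missing_tools` and the two tuples are hypothetical and simply mirror the External Tools table.

```python
import shutil
from typing import Callable, Iterable, List, Optional

REQUIRED = ("pandoc", "pdftotext")
OPTIONAL = ("ebook-convert",)


def missing_tools(names: Iterable[str] = REQUIRED,
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the tool names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]
```

For example, `missing_tools()` returns an empty list when both required tools are installed, and `missing_tools(OPTIONAL)` tells you whether the Calibre fallback is available.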
🚀 Quick Start
Get Help
allmark
# or
allmark --help
Basic Conversion
# Convert all ebooks in a directory (with intelligent cleaning)
allmark --in /path/to/ebooks
# Output goes to same directory by default
# Verified formats: .epub, .html, .docx, .pdf, .txt, .md, .rtf, .odt, .tex, .rst
# Additional (with Calibre): .mobi, .azw3, .kf8, .fb2, .djvu
Common Use Cases
📚 Convert entire library to Markdown
allmark --in ~/Books --out ~/Books-Markdown
🤖 Create ML training dataset with JSONL
# Convert to JSONL with 1024 token chunks
allmark --in ./books --jsonl --token-size 1024
# With custom metadata for training
allmark --in ./books --jsonl --metadata ./book_info.json
Example book_info.json:
{
"genre": "science_fiction",
"language": "en",
"dataset": "training_v1"
}
📄 Convert without cleaning (preserve everything)
allmark --in ./books --no-strip
# Keeps: frontmatter, backmatter, headers, footers, page numbers, metadata
⚡ Strict token splitting for exact chunk sizes
allmark --in ./books --jsonl --token-size 512 --strict-split
# Splits at exact token boundaries, ignoring paragraph breaks
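The difference between the two splitting modes can be illustrated with a small sketch. It assumes whitespace-separated words stand in for tokens, which may not match allmark's real tokenizer, and the function names are hypothetical.

```python
from typing import List


def strict_chunks(text: str, token_size: int) -> List[str]:
    """Cut at exact token counts, ignoring paragraph breaks."""
    tokens = text.split()
    return [" ".join(tokens[i:i + token_size])
            for i in range(0, len(tokens), token_size)]


def paragraph_chunks(text: str, token_size: int) -> List[str]:
    """Pack whole paragraphs into chunks of at most `token_size` tokens."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > token_size:
            chunks.append("\n\n".join(current))  # flush before overflowing
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In this sketch a single paragraph longer than `token_size` is kept whole and therefore exceeds the limit; a real implementation would need a rule for that case. Strict mode trades readable boundaries for predictable chunk sizes.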
📖 Usage
Command-Line Options
| Option | Description | Default |
|---|---|---|
| --in, --input <dir> | Input directory containing ebook files | Required |
| --out, --output <dir> | Output directory for markdown files | Same as --in |
| --no-strip | Skip cleaning (preserve all content) | Cleaning enabled |
| --force | Force reconversion of existing files | Skip existing |
| --no-clean-md | Skip cleaning existing .md files | Clean .md files |
| --db <path> | Conversion log database path | ./conversion_log.db |
| --jsonl | Also create JSONL output with chunks | Markdown only |
| --token-size <n> | Max tokens per JSONL chunk | 512 |
| --strict-split | Split at exact token boundaries | Paragraph-aware |
| --metadata <file> | JSON file with custom metadata for JSONL | None |
Examples by Use Case
# Example 1: Basic conversion with cleaning
allmark --in ./ebooks
# Example 2: Separate output directory
allmark --in ./source-books --out ./clean-markdown
# Example 3: Raw conversion (no cleaning)
allmark --in ./books --no-strip
# Example 4: Force reconversion
allmark --in ./books --force
# Example 5: Create ML training dataset
allmark --in ./books --jsonl --token-size 1024 --metadata ./metadata.json
# Example 6: Custom everything
allmark --in ./books --out ./md --db ~/conversion.db --force
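The conversion log is a SQLite file, so it can be inspected with Python's built-in `sqlite3` module. This README does not document the schema, so the sketch below assumes nothing about table or column names and just lists whatever tables exist with their row counts. `describe_log` is a hypothetical helper.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Dict


def describe_log(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    if not Path(db_path).is_file():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
        return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
```

Running `describe_log("conversion_log.db")` after a batch gives a quick view of what was recorded, and the table names it reports are a starting point for more specific queries.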
JSONL Output Format
When using --jsonl, each record contains:
{
"text": "Chunk of narrative text...",
"chunk_index": 0,
"total_chunks": 25,
"token_count": 487,
"source_file": "book.epub",
"markdown_file": "book.md",
"split_mode": "paragraph_aware",
// ... plus any custom metadata from --metadata file
"genre": "fiction",
"language": "en"
}
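A consumer of the JSONL output might load the records, group them by source file, and check that no chunk is missing. The sketch below uses only the field names shown in the record above. It assumes `chunk_index` is zero-based and runs up to `total_chunks - 1`, which the sample suggests but does not state. The helper names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def load_records(lines: Iterable[str]) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def chunks_by_source(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by source_file, ordered by chunk_index."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["source_file"]].append(rec)
    for recs in grouped.values():
        recs.sort(key=lambda r: r["chunk_index"])
    return dict(grouped)


def is_complete(recs: List[dict]) -> bool:
    """True if the sorted chunks cover 0 .. total_chunks - 1 with no gaps."""
    return bool(recs) and (
        [r["chunk_index"] for r in recs] == list(range(recs[0]["total_chunks"])))
```

Typical use is `chunks_by_source(load_records(open(path, encoding="utf-8")))`, followed by `is_complete` on each group before the data goes into a training run.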
How It Works
allmark processes files through a comprehensive pipeline:
- Format Conversion: Uses pandoc/pdftotext to convert to markdown
- OCR Repair: Fixes broken hyphens, ligatures, soft hyphens
- Artifact Removal: Strips images, links, CSS classes, ebook metadata
- Code Block Detection: Removes non-literary code/markup blocks
- Header/Footer Removal: Statistical detection of repeating elements
- Page Number Removal: Multiple pattern matching
- TOC Removal: Detects and removes table of contents
- Document Analysis: Understands prose density and narrative structure
- Frontmatter/Backmatter Trimming: Removes copyright pages, author bios, etc.
- Chapter Standardization: Normalizes chapter markers to # Chapter N
- Typography Normalization: Fixes quotes, dashes, ellipses
- Markdown Validation: Ensures proper markdown formatting
- Paragraph Merging: Intelligently rejoins broken paragraphs
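Two of the stages above, OCR repair and chapter standardization, lend themselves to a short illustration. This is a sketch of the general technique under simple assumptions, not allmark's actual rules: the function names and regular expressions are hypothetical, and only numeric chapter markers on a line of their own are handled.

```python
import re

LIGATURES = {"ﬀ": "ff", "ﬁ": "fi", "ﬂ": "fl", "ﬃ": "ffi", "ﬄ": "ffl"}

# A line containing only a chapter marker, e.g. "CHAPTER 3" or "## Chapter 04:".
CHAPTER = re.compile(
    r"^[ \t]*(?:#+[ \t]*)?chapter[ \t]+(\d+)[ \t]*[.:]?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def repair_ocr(text: str) -> str:
    """Remove soft hyphens, expand ligatures, rejoin words split across lines."""
    text = text.replace("\u00ad", "")  # soft hyphen
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # "conver-\nsion" -> "conversion" (lowercase letters on both sides only)
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def standardize_chapters(text: str) -> str:
    """Rewrite numeric chapter markers as '# Chapter N'."""
    return CHAPTER.sub(lambda m: f"# Chapter {int(m.group(1))}", text)
```

The hyphen rule will also join genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so a production version would likely consult a word list. Chapter markers written as words or Roman numerals are left unchanged here.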
Project Structure
allmark/
├── src/
│ └── allmark/
│ ├── __init__.py # Package initialization
│ ├── __main__.py # CLI entry point
│ ├── cli.py # Command-line interface
│ ├── converter.py # Main conversion logic
│ ├── cleaners.py # Text cleaning functions
│ ├── analyzers.py # Document analysis
│ ├── ocr.py # OCR artifact repair
│ └── utils.py # Utility functions
├── setup.py # pip installation
├── pyproject.toml # Modern Python packaging
├── environment.yml # Conda environment
└── README.md # This file
Development
Setting up Development Environment
# Clone the repository
git clone https://github.com/dcondrey/allmark.git
cd allmark
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# OR: Install with pinned dev dependencies for reproducible environment
pip install -r requirements-dev.txt
pip install -e .
Running Tests
pytest
pytest --cov=allmark # with coverage
Code Formatting
black src/
Linting
flake8 src/
mypy src/
🤝 Contributing
Contributions are welcome! Here's how you can help:
- Report bugs: Open an issue with details and reproduction steps
- Suggest features: Share your ideas via GitHub issues
- Submit PRs: Fork, create a feature branch, and submit a pull request
- Improve docs: Help make the documentation clearer
See Development Guide for setup instructions.
📝 License
MIT License - see LICENSE file for details.
Copyright (c) 2025 David Condrey
💬 Support & Community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code documentation
🙏 Acknowledgments
Built with:
- Pandoc - Universal document converter
- Poppler - PDF rendering and text extraction
- Python standard library - Zero Python dependencies!
📊 Project Stats
- Python Dependencies: 0 (pure stdlib!)
- Verified Formats: 10 formats (EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
- Additional Formats: 30+ with Calibre (MOBI, AZW3, KF8, DjVu, legacy formats)
- Cleaning Stages: 17-stage intelligent pipeline
- Safety Checks: Never removes >50% of content
- Output Formats: Markdown, JSONL
- Test Coverage: Coming soon!
📚 Format Support
Tier 1: Verified & Tested ✅
These formats work out-of-the-box with just Pandoc + poppler-utils:
- EPUB (.epub, .epub3) - Modern ebooks
- HTML (.html, .htm, .xhtml) - Web pages
- DOCX (.docx) - Microsoft Word 2007+
- PDF (.pdf) - Portable documents
- TXT/MD (.txt, .text, .md) - Plain text
- RTF (.rtf) - Rich text format
- ODT (.odt) - LibreOffice documents
- LaTeX (.tex, .latex) - Academic documents
- RST (.rst) - Python documentation
Tier 2: With Calibre 🟡
Requires Calibre (brew install --cask calibre on macOS, or sudo apt-get install calibre on Ubuntu/Debian):
- MOBI (.mobi) - Mobipocket/Kindle
- AZW3/KF8 (.azw3, .kf8) - Amazon Kindle
- FB2 (.fb2) - FictionBook (Russian format)
- DjVu (.djvu) - Scanned documents (also needs djvulibre)
Tier 3: Legacy Formats ⚠️
Implemented but untested (require Calibre):
- Microsoft Reader (.lit), Sony Reader (.lrf), Palm (.pdb, .pml, .prc)
- RocketBook (.rb), TomeRaider (.tcr), XPS (.xps)
- And 15+ other obsolete formats from the 2000s
Total: 40+ formats supported in code, 10 verified working, 15 example files
See examples/ directory for test files in 15 different formats!
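To check which tier a file falls into before running a batch, the lists above can be turned into a lookup table. The sketch below is a hypothetical helper built only from the extensions named in this section. It does not cover the "15+ other obsolete formats" that are not listed, and it is not part of allmark's API.

```python
from pathlib import Path
from typing import Optional

TIERS = {
    1: {".epub", ".epub3", ".html", ".htm", ".xhtml", ".docx", ".pdf",
        ".txt", ".text", ".md", ".rtf", ".odt", ".tex", ".latex", ".rst"},
    2: {".mobi", ".azw3", ".kf8", ".fb2", ".djvu"},
    3: {".lit", ".lrf", ".pdb", ".pml", ".prc", ".rb", ".tcr", ".xps"},
}


def support_tier(filename: str) -> Optional[int]:
    """Return 1, 2, or 3 for a listed extension, or None if it is not listed."""
    ext = Path(filename).suffix.lower()
    for tier, extensions in TIERS.items():
        if ext in extensions:
            return tier
    return None
```

A result of 2 or 3 means the file needs Calibre installed, and a result of `None` means the extension is not among those named here.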
Made with ❤️ for book lovers and data scientists