Universal eBook → Markdown converter and cleaner

allmark

License: MIT · Python 3.7+

Universal eBook → Markdown converter and cleaner. Handles 40+ input formats (10 verified) and cleans up common extraction artifacts and chapter styles automatically.

Transform your entire eBook library into clean, readable Markdown files with a single command. allmark strips away the cruft (frontmatter, backmatter, headers, footers, page numbers, and metadata) and keeps only the narrative content.

✨ Features

Core Capabilities

  • 📚 Universal Format Support: Convert 40+ formats to clean Markdown (10 verified: EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
  • 🧹 Intelligent Cleaning: Automatically removes frontmatter, backmatter, headers, footers, page numbers
  • 🔧 OCR Repair: Fixes broken hyphenation, ligatures, and common OCR artifacts
  • 📖 Chapter Detection: Standardizes chapter markers across different formats
  • 🎯 Artifact Removal: Strips ebook metadata, CSS classes, Calibre IDs, and other cruft
  • 🛡️ Safety First: Never removes more than 50% of content (built-in safety check)
  • 📊 Progress Tracking: SQLite database logs all conversions with statistics
  • 📄 JSONL Export: Token-based text chunking for ML/AI training datasets
  • 🎛️ Flexible Splitting: Paragraph-aware or strict token boundary splitting
  • 🏷️ Custom Metadata: Add arbitrary metadata to JSONL records
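
As an illustration of the OCR Repair feature, here is a minimal sketch of the three repairs named above (soft hyphens, ligatures, words hyphenated across a line break). The function name repair_ocr and the exact rules are assumptions made for this example, not allmark's actual implementation.

```python
import re

# Typographic ligatures that OCR and PDF extraction often leave in the text.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def repair_ocr(text: str) -> str:
    """Hypothetical helper: undo a few common OCR artifacts."""
    # Soft hyphens (U+00AD) are invisible break hints; drop them.
    text = text.replace("\u00ad", "")
    # Expand ligature code points into plain letters.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # Rejoin words split across a line break, e.g. "implemen-\ntation".
    # Requiring lowercase letters on both sides leaves "Anglo-\nSaxon" alone.
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    return text
```

A known limitation of this simple rule: a genuine compound broken at its hyphen ("well-" / "known") is joined too, so a real implementation would likely need a dictionary or frequency check.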

What Makes allmark Different?

  • Statistical Analysis: Uses document structure analysis to intelligently identify and remove non-content sections
  • Dialogue-Aware: Preserves paragraph breaks in dialogue while merging broken narrative paragraphs
  • Format Agnostic: Same great results whether your source is a scanned PDF or a modern EPUB
  • Zero Configuration: Works out of the box with sensible defaults
  • Batch Processing: Convert entire libraries with a single command
  • ML-Ready Output: Direct JSONL export with configurable chunk sizes for training datasets
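
To make the Dialogue-Aware point concrete, the sketch below merges a paragraph into the previous one only when the previous one looks unfinished and the new one does not open with a quotation mark. The helper name merge_paragraphs and both punctuation sets are assumptions for illustration; allmark's own heuristics may differ.

```python
# Characters that plausibly end a finished paragraph (assumed set).
TERMINAL = (".", "!", "?", ":", '"', "\u201d", "\u2026")
# Characters that plausibly open a line of dialogue (assumed set).
OPENING_QUOTES = ('"', "'", "\u201c", "\u2018")


def merge_paragraphs(paragraphs):
    """Rejoin paragraphs broken mid-sentence, but never swallow dialogue."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        previous_is_broken = bool(merged) and not merged[-1].endswith(TERMINAL)
        starts_dialogue = para.startswith(OPENING_QUOTES)
        if previous_is_broken and not starts_dialogue:
            # Narrative text that was split by a page or line break.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```

For example, ["The rain had not stopped for", "three days.", "“Not yet.”"] becomes two paragraphs: the rejoined sentence and the untouched line of dialogue.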

📦 Installation

Quick Install (pip)

pip install git+https://github.com/dcondrey/allmark.git

Development Install

Using pip:

git clone https://github.com/dcondrey/allmark.git
cd allmark
pip install -e .

Using Poetry:

git clone https://github.com/dcondrey/allmark.git
cd allmark
poetry install
poetry shell

Using Conda:

git clone https://github.com/dcondrey/allmark.git
cd allmark
conda env create -f environment.yml
conda activate allmark

🔧 Requirements

allmark has no third-party Python dependencies; it uses only the Python standard library.

External Tools

Tool                      Purpose                 Required?
pandoc                    EPUB, DOCX converter    ✅ Yes
pdftotext (poppler)       PDF text extraction     ✅ Yes
ebook-convert (Calibre)   FB2, MOBI fallback      ⚠️ Optional

PDF Extraction:

  • Uses pdftotext with -layout mode (preserves formatting)
  • Falls back to -raw mode if layout fails
  • Final fallback to ebook-convert if both fail
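
The fallback order above could be sketched roughly as follows. The function names run_tool and extract_pdf_text are hypothetical, and treating empty output as a failure is an assumption; only the pdftotext -layout / -raw options and the ebook-convert input/output form are taken from the tools themselves.

```python
import subprocess
import tempfile
from pathlib import Path


def run_tool(args):
    """Run an external command; return its stdout, or None on any failure."""
    try:
        result = subprocess.run(args, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


def extract_pdf_text(pdf_path, run=run_tool):
    """Try pdftotext -layout, then -raw, then Calibre's ebook-convert."""
    pdf = str(pdf_path)
    # pdftotext writes to stdout when the output file is given as "-".
    for mode in ("-layout", "-raw"):
        text = run(["pdftotext", mode, pdf, "-"])
        if text and text.strip():
            return text
    # Last resort: ebook-convert needs a real output file, so use a temp dir.
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "out.txt"
        if run(["ebook-convert", pdf, str(out_path)]) is not None and out_path.exists():
            return out_path.read_text(encoding="utf-8", errors="replace")
    return None
```

The run parameter exists only so the chain can be exercised without the external tools installed, for example by passing a stub that fails for -layout and succeeds for -raw.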

Installing External Dependencies

macOS (Homebrew)

brew install pandoc poppler
brew install --cask calibre  # optional

Ubuntu/Debian

sudo apt-get install pandoc poppler-utils
sudo apt-get install calibre  # optional

Windows (Chocolatey)

choco install pandoc poppler
choco install calibre  # optional

🚀 Quick Start

Get Help

allmark
# or
allmark --help

Basic Conversion

# Convert all ebooks in a directory (with intelligent cleaning)
allmark --in /path/to/ebooks

# Output goes to same directory by default
# Verified formats: .epub, .html, .docx, .pdf, .txt, .md, .rtf, .odt, .tex, .rst
# Additional (with Calibre): .mobi, .azw3, .kf8, .fb2, .djvu

Common Use Cases

📚 Convert entire library to Markdown

allmark --in ~/Books --out ~/Books-Markdown

🤖 Create ML training dataset with JSONL

# Convert to JSONL with 1024 token chunks
allmark --in ./books --jsonl --token-size 1024

# With custom metadata for training
allmark --in ./books --jsonl --metadata ./book_info.json

Example book_info.json:

{
  "genre": "science_fiction",
  "language": "en",
  "dataset": "training_v1"
}

📄 Convert without cleaning (preserve everything)

allmark --in ./books --no-strip
# Keeps: frontmatter, backmatter, headers, footers, page numbers, metadata

⚡ Strict token splitting for exact chunk sizes

allmark --in ./books --jsonl --token-size 512 --strict-split
# Splits at exact token boundaries, ignoring paragraph breaks
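
The difference between the two splitting modes can be illustrated with a small sketch. It assumes a "token" is a whitespace-separated word, which may not match how allmark counts tokens, and the function names chunk_strict and chunk_paragraph_aware are invented for this example.

```python
def chunk_strict(text, token_size):
    """Cut at exact token boundaries, ignoring paragraph breaks."""
    tokens = text.split()  # assumption: whitespace-separated tokens
    return [
        " ".join(tokens[i:i + token_size])
        for i in range(0, len(tokens), token_size)
    ]


def chunk_paragraph_aware(text, token_size):
    """Pack whole paragraphs into chunks of at most token_size tokens."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        if n > token_size:
            # A single oversized paragraph cannot fit: flush, then cut strictly.
            if current:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            chunks.extend(chunk_strict(para, token_size))
            continue
        if current and count + n > token_size:
            # Adding this paragraph would overflow, so start a new chunk.
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With the text "a b c\n\nd e\n\nf g h i" and a size of 4, the strict mode yields "a b c d", "e f g h", "i", while the paragraph-aware mode keeps the three paragraphs as three separate chunks.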

📖 Usage

Command-Line Options

Option                   Description                                  Default
--in, --input <dir>      Input directory containing ebook files       Required
--out, --output <dir>    Output directory for markdown files          Same as --in
--no-strip               Skip cleaning (preserve all content)         Cleaning enabled
--force                  Force reconversion of existing files         Skip existing
--no-clean-md            Skip cleaning existing .md files             Clean .md files
--db <path>              Conversion log database path                 ./conversion_log.db
--jsonl                  Also create JSONL output with chunks         Markdown only
--token-size <n>         Max tokens per JSONL chunk                   512
--strict-split           Split at exact token boundaries              Paragraph-aware
--metadata <file>        JSON file with custom metadata for JSONL     None

Examples by Use Case

# Example 1: Basic conversion with cleaning
allmark --in ./ebooks

# Example 2: Separate output directory
allmark --in ./source-books --out ./clean-markdown

# Example 3: Raw conversion (no cleaning)
allmark --in ./books --no-strip

# Example 4: Force reconversion
allmark --in ./books --force

# Example 5: Create ML training dataset
allmark --in ./books --jsonl --token-size 1024 --metadata ./metadata.json

# Example 6: Custom everything
allmark --in ./books --out ./md --db ~/conversion.db --force
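
How a conversion log and the --force behaviour might fit together is sketched below using the standard-library sqlite3 module. The table name, columns, and helper functions are all hypothetical; the actual schema of conversion_log.db is not documented here, and allmark may decide what to skip by checking output files rather than the database.

```python
import sqlite3


def open_log(db_path):
    """Open (or create) a conversion log with an assumed, minimal schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS conversions (
               source    TEXT PRIMARY KEY,
               output    TEXT NOT NULL,
               chars_in  INTEGER NOT NULL,
               chars_out INTEGER NOT NULL
           )"""
    )
    return conn


def should_convert(conn, source, force=False):
    """Skip files that are already logged unless force is set."""
    if force:
        return True
    row = conn.execute(
        "SELECT 1 FROM conversions WHERE source = ?", (source,)
    ).fetchone()
    return row is None


def record_conversion(conn, source, output, chars_in, chars_out):
    """Insert or update the log entry for one converted file."""
    conn.execute(
        "INSERT OR REPLACE INTO conversions VALUES (?, ?, ?, ?)",
        (source, output, chars_in, chars_out),
    )
    conn.commit()
```

Passing ":memory:" as db_path gives a throwaway database, which is convenient for trying the helpers out.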

JSONL Output Format

When using --jsonl, each record contains the fields below. Keys from the --metadata file (here genre and language) are added to every record. The record is pretty-printed here for readability; in the output file each record occupies a single line.

{
  "text": "Chunk of narrative text...",
  "chunk_index": 0,
  "total_chunks": 25,
  "token_count": 487,
  "source_file": "book.epub",
  "markdown_file": "book.md",
  "split_mode": "paragraph_aware",
  "genre": "fiction",
  "language": "en"
}
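
A record of this shape could be assembled along the following lines. The helper names build_records and to_jsonl are hypothetical, the whitespace token count is an assumption, and the choice to let built-in fields win over clashing metadata keys is a design decision of this sketch rather than documented allmark behaviour.

```python
import json


def build_records(chunks, source_file, markdown_file, split_mode, metadata=None):
    """Turn a list of text chunks into JSONL-ready dictionaries."""
    total = len(chunks)
    records = []
    for index, text in enumerate(chunks):
        record = {
            "text": text,
            "chunk_index": index,
            "total_chunks": total,
            "token_count": len(text.split()),  # assumption: whitespace tokens
            "source_file": source_file,
            "markdown_file": markdown_file,
            "split_mode": split_mode,
        }
        # Custom metadata is added last and never overwrites a built-in field.
        for key, value in (metadata or {}).items():
            record.setdefault(key, value)
        records.append(record)
    return records


def to_jsonl(records):
    """Serialize one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Reading the file back is then a matter of calling json.loads on each line.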

How It Works

allmark processes files through a comprehensive pipeline:

  1. Format Conversion: Uses pandoc/pdftotext to convert to markdown
  2. OCR Repair: Fixes broken hyphens, ligatures, soft hyphens
  3. Artifact Removal: Strips images, links, CSS classes, ebook metadata
  4. Code Block Detection: Removes non-literary code/markup blocks
  5. Header/Footer Removal: Statistical detection of repeating elements
  6. Page Number Removal: Multiple pattern matching
  7. TOC Removal: Detects and removes table of contents
  8. Document Analysis: Understands prose density and narrative structure
  9. Frontmatter/Backmatter Trimming: Removes copyright pages, author bios, etc.
  10. Chapter Standardization: Normalizes chapter markers to # Chapter N
  11. Typography Normalization: Fixes quotes, dashes, ellipses
  12. Markdown Validation: Ensures proper markdown formatting
  13. Paragraph Merging: Intelligently rejoins broken paragraphs
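
Three of the stages above (header/footer removal, page number removal, chapter standardization) can be sketched as small line filters chained together. Everything here is an illustrative assumption: the function names, the min_repeats and max_length thresholds, and the regular expressions are not taken from allmark's source.

```python
import re
from collections import Counter

# A line holding only a number, optionally prefixed with "Page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
# A line holding only "Chapter N", in any case, with or without leading #.
CHAPTER = re.compile(r"^\s*(?:#+\s*)?chapter\s+(\d+)\s*$", re.IGNORECASE)


def strip_running_lines(lines, min_repeats=3, max_length=60):
    """Drop short lines that repeat often: likely running headers/footers."""
    counts = Counter(
        line.strip() for line in lines if 0 < len(line.strip()) <= max_length
    )
    repeated = {line for line, n in counts.items() if n >= min_repeats}
    return [line for line in lines if line.strip() not in repeated]


def strip_page_numbers(lines):
    """Drop lines that consist solely of a page number."""
    return [line for line in lines if not PAGE_NUMBER.match(line)]


def standardize_chapters(lines):
    """Rewrite bare chapter markers as '# Chapter N'."""
    return [CHAPTER.sub(r"# Chapter \1", line) for line in lines]


def clean(lines):
    """Apply the stages in order; each stage maps a list of lines to a list."""
    for stage in (strip_running_lines, strip_page_numbers, standardize_chapters):
        lines = stage(lines)
    return lines
```

Simple frequency rules like these have obvious false positives (a short line of dialogue repeated three times, or a paragraph consisting of the year "1984"), which is one reason a real pipeline would combine them with broader document analysis.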

Project Structure

allmark/
├── src/
│   └── allmark/
│       ├── __init__.py       # Package initialization
│       ├── __main__.py       # CLI entry point
│       ├── cli.py            # Command-line interface
│       ├── converter.py      # Main conversion logic
│       ├── cleaners.py       # Text cleaning functions
│       ├── analyzers.py      # Document analysis
│       ├── ocr.py            # OCR artifact repair
│       └── utils.py          # Utility functions
├── setup.py                  # pip installation
├── pyproject.toml           # Modern Python packaging
├── environment.yml          # Conda environment
└── README.md                # This file

Development

Setting up Development Environment

# Clone the repository
git clone https://github.com/dcondrey/allmark.git
cd allmark

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# OR: Install with pinned dev dependencies for reproducible environment
pip install -r requirements-dev.txt
pip install -e .

Running Tests

pytest
pytest --cov=allmark  # with coverage

Code Formatting

black src/

Linting

flake8 src/
mypy src/

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Report bugs: Open an issue with details and reproduction steps
  2. Suggest features: Share your ideas via GitHub issues
  3. Submit PRs: Fork, create a feature branch, and submit a pull request
  4. Improve docs: Help make the documentation clearer

See the Development section above for setup instructions.

📝 License

MIT License - see LICENSE file for details.

Copyright (c) 2025 David Condrey

🙏 Acknowledgments

Built with:

  • Pandoc - Universal document converter
  • Poppler - PDF rendering and text extraction
  • Python standard library - Zero Python dependencies!

📊 Project Stats

  • Python Dependencies: 0 (pure stdlib!)
  • Verified Formats: 10 formats (EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
  • Additional Formats: 30+ with Calibre (MOBI, AZW3, KF8, DjVu, legacy formats)
  • Cleaning Stages: multi-stage pipeline (13 stages are listed under How It Works)
  • Safety Checks: Never removes >50% of content
  • Output Formats: Markdown, JSONL
  • Test Coverage: Coming soon!
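
The 50% safety check mentioned above can be pictured as a guard around any cleaning step. This is a minimal sketch under stated assumptions: content is measured in characters, and a cleaning result that removes too much is discarded in favour of the original. The name apply_with_safety is hypothetical, and allmark may measure or react differently.

```python
def apply_with_safety(original, cleaner, max_removed=0.5):
    """Run cleaner, but keep the original if it removed too much text."""
    cleaned = cleaner(original)
    if not original:
        return cleaned
    # Fraction of characters removed by this cleaning step.
    removed = 1 - len(cleaned) / len(original)
    if removed > max_removed:
        return original
    return cleaned
```

For a ten-character input, a cleaner that keeps eight characters is accepted, while one that keeps only four (60% removed) is rejected and the original text is returned.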

📚 Format Support

Tier 1: Verified & Tested ✅

These formats work out-of-the-box with just Pandoc + poppler-utils:

  • EPUB (.epub, .epub3) - Modern ebooks
  • HTML (.html, .htm, .xhtml) - Web pages
  • DOCX (.docx) - Microsoft Word 2007+
  • PDF (.pdf) - Portable documents
  • TXT/MD (.txt, .text, .md) - Plain text
  • RTF (.rtf) - Rich text format
  • ODT (.odt) - LibreOffice documents
  • LaTeX (.tex, .latex) - Academic documents
  • RST (.rst) - Python documentation

Tier 2: With Calibre 🟡

Requires Calibre (brew install --cask calibre or sudo apt-get install calibre):

  • MOBI (.mobi) - Mobipocket/Kindle
  • AZW3/KF8 (.azw3, .kf8) - Amazon Kindle
  • FB2 (.fb2) - FictionBook (Russian format)
  • DjVu (.djvu) - Scanned documents (also needs djvulibre)

Tier 3: Legacy Formats ⚠️

Implemented but untested (require Calibre):

  • Microsoft Reader (.lit), Sony Reader (.lrf), Palm (.pdb, .pml, .prc)
  • RocketBook (.rb), TomeRaider (.tcr), XPS (.xps)
  • And 15+ other obsolete formats from the 2000s

Total: 40+ formats supported in code, 10 verified working, 15 example files

See examples/ directory for test files in 15 different formats!
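
One way to picture the tiers is as a lookup from file extension to external tool. The mapping below is a sketch assembled from the lists above; routing every non-PDF Tier 1 format through pandoc and reading plain text directly are assumptions, and pick_converter is a hypothetical name.

```python
from pathlib import Path

# Tier 1: assumed to go through pandoc (PDF and plain text handled separately).
PANDOC_EXTENSIONS = {
    ".epub", ".epub3", ".html", ".htm", ".xhtml", ".docx",
    ".rtf", ".odt", ".tex", ".latex", ".rst",
}
PDF_EXTENSIONS = {".pdf"}
PLAIN_TEXT_EXTENSIONS = {".txt", ".text", ".md"}
# Tiers 2 and 3: need Calibre's ebook-convert.
CALIBRE_EXTENSIONS = {
    ".mobi", ".azw3", ".kf8", ".fb2", ".djvu",
    ".lit", ".lrf", ".pdb", ".pml", ".prc", ".rb", ".tcr", ".xps",
}


def pick_converter(path):
    """Return the tool assumed to handle this file, or None if unsupported."""
    ext = Path(path).suffix.lower()
    if ext in PANDOC_EXTENSIONS:
        return "pandoc"
    if ext in PDF_EXTENSIONS:
        return "pdftotext"
    if ext in PLAIN_TEXT_EXTENSIONS:
        return "plain"
    if ext in CALIBRE_EXTENSIONS:
        return "ebook-convert"
    return None
```

Lower-casing the suffix means "Dune.EPUB" and "dune.epub" are treated the same way.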


Made with ❤️ for book lovers and data scientists
