Universal eBook → Markdown converter and cleaner

Maintainers

dcondrey

allmark

Universal eBook → Markdown converter and cleaner. Handles all formats, all artifacts, all chapter styles automatically.

Transform your entire eBook library into clean, readable Markdown files with a single command. allmark strips away the cruft (frontmatter, backmatter, headers, footers, page numbers, and metadata) and leaves only the narrative content.

✨ Features

Core Capabilities

📚 Universal Format Support: Convert 40+ formats to clean Markdown (10 verified: EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
🧹 Intelligent Cleaning: Automatically removes frontmatter, backmatter, headers, footers, page numbers
🔧 OCR Repair: Fixes broken hyphenation, ligatures, and common OCR artifacts
📖 Chapter Detection: Standardizes chapter markers across different formats
🎯 Artifact Removal: Strips ebook metadata, CSS classes, Calibre IDs, and other cruft
🛡️ Safety First: Never removes more than 50% of content (built-in safety check)
📊 Progress Tracking: SQLite database logs all conversions with statistics
📄 JSONL Export: Token-based text chunking for ML/AI training datasets
🎛️ Flexible Splitting: Paragraph-aware or strict token boundary splitting
🏷️ Custom Metadata: Add arbitrary metadata to JSONL records
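
To make the OCR Repair item concrete, here is a minimal sketch of the kind of fix it describes: expanding ligatures, dropping soft hyphens, and rejoining words split across a line break. The helper name `repair_ocr` and the exact rules are assumptions for illustration, not allmark's actual implementation.

```python
import re

# Common typographic ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# A lowercase letter, a hyphen at the end of a line, then a lowercase letter.
_BROKEN_HYPHEN = re.compile(r"([a-z])-\n([a-z])")


def repair_ocr(text: str) -> str:
    """Hypothetical helper: undo a few common OCR/extraction artifacts."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    text = text.replace("\u00ad", "")         # drop soft hyphens
    return _BROKEN_HYPHEN.sub(r"\1\2", text)  # rejoin words split across lines
```

A rule this simple also joins genuine compounds that happen to break at the hyphen (for example "well-" / "known"), so a real implementation would likely need a dictionary check or similar.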

What Makes allmark Different?

Statistical Analysis: Uses document structure analysis to intelligently identify and remove non-content sections
Dialogue-Aware: Preserves paragraph breaks in dialogue while merging broken narrative paragraphs
Format Agnostic: Same great results whether your source is a scanned PDF or a modern EPUB
Zero Configuration: Works out of the box with sensible defaults
Batch Processing: Convert entire libraries with a single command
ML-Ready Output: Direct JSONL export with configurable chunk sizes for training datasets
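
The Dialogue-Aware behaviour can be sketched roughly as follows. This is an illustrative heuristic under stated assumptions (a paragraph is "broken" if it lacks terminal punctuation, and a paragraph that opens with a quote mark always starts a new one); `merge_broken_paragraphs` is a hypothetical name, not allmark's API.

```python
# Punctuation that plausibly ends a complete paragraph.
TERMINAL = (".", "!", "?", ":", '"', "\u201d")
# Characters that plausibly open a line of dialogue.
OPENING_QUOTES = ('"', "\u201c", "'", "\u2018")


def merge_broken_paragraphs(paragraphs: list[str]) -> list[str]:
    """Rejoin narrative paragraphs split mid-sentence; keep dialogue separate."""
    merged: list[str] = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        previous_is_open = bool(merged) and not merged[-1].endswith(TERMINAL)
        if previous_is_open and not para.startswith(OPENING_QUOTES):
            merged[-1] = merged[-1] + " " + para  # continue the broken paragraph
        else:
            merged.append(para)                   # start a new paragraph
    return merged
```

For example, `["She walked to the", "door and stopped.", '"Who is there?"']` collapses to two paragraphs: the rejoined sentence and the untouched line of dialogue.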

📦 Installation

Quick Install (pip)

pip install allmark  # from PyPI
pip install git+https://github.com/dcondrey/allmark.git  # or the latest code from GitHub

Development Install

Using pip:

git clone https://github.com/dcondrey/allmark.git
cd allmark
pip install -e .

Using Poetry:

git clone https://github.com/dcondrey/allmark.git
cd allmark
poetry install
poetry shell

Using Conda:

git clone https://github.com/dcondrey/allmark.git
cd allmark
conda env create -f environment.yml
conda activate allmark

🔧 Requirements

allmark has zero Python dependencies: it uses only the Python standard library.

External Tools

Tool	Purpose	Required?
pandoc	EPUB, DOCX converter	✅ Yes
pdftotext (poppler)	PDF text extraction	✅ Yes
ebook-convert (Calibre)	FB2, MOBI fallback	⚠️ Optional

PDF Extraction:

Uses pdftotext with -layout mode (preserves formatting)
Falls back to -raw mode if layout fails
Final fallback to ebook-convert if both fail
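
The fallback order above can be sketched as a small loop over commands. This is a sketch, not allmark's code: `extract_pdf` and `run_tool` are hypothetical names, and "fails" is assumed to mean a missing tool or a non-zero exit code.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence


def run_tool(cmd: Sequence[str]) -> bool:
    """Run an external command; True if it exists and exits with code 0."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        return False


def extract_pdf(pdf: Path, out: Path,
                run: Callable[[Sequence[str]], bool] = run_tool) -> Optional[str]:
    """Try each extractor in order; return the name of the one that worked."""
    attempts = [
        ("layout", ["pdftotext", "-layout", str(pdf), str(out)]),
        ("raw", ["pdftotext", "-raw", str(pdf), str(out)]),
        ("ebook-convert", ["ebook-convert", str(pdf), str(out)]),
    ]
    for name, cmd in attempts:
        if run(cmd):
            return name
    return None  # every extractor failed
```

Passing `run` as a parameter keeps the fallback logic testable without the external tools installed.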

Installing External Dependencies

macOS (Homebrew)

brew install pandoc poppler
brew install --cask calibre  # optional

Ubuntu/Debian

sudo apt-get install pandoc poppler-utils
sudo apt-get install calibre  # optional

Windows (Chocolatey)

choco install pandoc poppler
choco install calibre  # optional
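
After installing, you can confirm that the tools are on your PATH. The following is a small stand-alone check using only the standard library; `missing_tools` is a hypothetical helper and not part of allmark.

```python
import shutil

REQUIRED = ("pandoc", "pdftotext")
OPTIONAL = ("ebook-convert",)


def missing_tools(names, which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]
```

Calling `missing_tools(REQUIRED)` returns an empty list when both required tools are installed; `missing_tools(OPTIONAL)` tells you whether Calibre's `ebook-convert` is available.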

🚀 Quick Start

Get Help

allmark
# or
allmark --help

Basic Conversion

# Convert all ebooks in a directory (with intelligent cleaning)
allmark --in /path/to/ebooks

# Output goes to same directory by default
# Verified formats: .epub, .html, .docx, .pdf, .txt, .md, .rtf, .odt, .tex, .rst
# Additional (with Calibre): .mobi, .azw3, .kf8, .fb2, .djvu

Common Use Cases

📚 Convert entire library to Markdown

allmark --in ~/Books --out ~/Books-Markdown

🤖 Create ML training dataset with JSONL

# Convert to JSONL with 1024 token chunks
allmark --in ./books --jsonl --token-size 1024

# With custom metadata for training
allmark --in ./books --jsonl --metadata ./book_info.json

Example book_info.json:

{
  "genre": "science_fiction",
  "language": "en",
  "dataset": "training_v1"
}

📄 Convert without cleaning (preserve everything)

allmark --in ./books --no-strip
# Keeps: frontmatter, backmatter, headers, footers, page numbers, metadata

⚡ Strict token splitting for exact chunk sizes

allmark --in ./books --jsonl --token-size 512 --strict-split
# Splits at exact token boundaries, ignoring paragraph breaks
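
The difference between the two splitting modes can be illustrated with a short sketch. It assumes whitespace-separated words stand in for tokens (allmark's actual tokenizer is not described here), and both function names are hypothetical.

```python
def chunk_strict(text: str, token_size: int) -> list[str]:
    """Cut every `token_size` tokens, ignoring paragraph breaks."""
    tokens = text.split()
    return [" ".join(tokens[i:i + token_size])
            for i in range(0, len(tokens), token_size)]


def chunk_paragraph_aware(text: str, token_size: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most `token_size` tokens."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > token_size:
            chunks.append("\n\n".join(current))  # close the chunk before it overflows
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In this sketch a single paragraph longer than `token_size` becomes one oversized chunk; a real implementation would need a rule for that case.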

📖 Usage

Command-Line Options

Option	Description	Default
`--in, --input <dir>`	Input directory containing ebook files	Required
`--out, --output <dir>`	Output directory for markdown files	Same as `--in`
`--no-strip`	Skip cleaning (preserve all content)	Cleaning enabled
`--force`	Force reconversion of existing files	Skip existing
`--no-clean-md`	Skip cleaning existing .md files	Clean .md files
`--db <path>`	Conversion log database path	`./conversion_log.db`
`--jsonl`	Also create JSONL output with chunks	Markdown only
`--token-size <n>`	Max tokens per JSONL chunk	512
`--strict-split`	Split at exact token boundaries	Paragraph-aware
`--metadata <file>`	JSON file with custom metadata for JSONL	None

Examples by Use Case

# Example 1: Basic conversion with cleaning
allmark --in ./ebooks

# Example 2: Separate output directory
allmark --in ./source-books --out ./clean-markdown

# Example 3: Raw conversion (no cleaning)
allmark --in ./books --no-strip

# Example 4: Force reconversion
allmark --in ./books --force

# Example 5: Create ML training dataset
allmark --in ./books --jsonl --token-size 1024 --metadata ./metadata.json

# Example 6: Custom everything
allmark --in ./books --out ./md --db ~/conversion.db --force
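
The conversion log is an ordinary SQLite file, so it can be inspected with Python's `sqlite3` module. The table layout is not documented here, so this sketch makes no assumption about it and only lists whatever tables exist with their row counts. `list_tables` is a hypothetical helper, not part of allmark.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path: str) -> dict[str, int]:
    """Return {table_name: row_count} for every table in the database."""
    with closing(sqlite3.connect(db_path)) as conn:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

Call it with the path you passed to `--db` (or `./conversion_log.db`, the default). Note that `sqlite3.connect` creates an empty database file if the path does not exist.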

JSONL Output Format

When using --jsonl, each record contains the fields below. Keys from the --metadata file (here, genre and language) are added alongside the built-in fields:

{
  "text": "Chunk of narrative text...",
  "chunk_index": 0,
  "total_chunks": 25,
  "token_count": 487,
  "source_file": "book.epub",
  "markdown_file": "book.md",
  "split_mode": "paragraph_aware",
  "genre": "fiction",
  "language": "en"
}
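
Since JSONL is one JSON object per line, the output can be read back with the standard `json` module. The sketch below relies only on the field names shown above; the helper names are hypothetical, and the output file name is not specified here, so the reader takes any iterable of lines.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def oversized(records: Iterable[dict], token_size: int) -> list[int]:
    """Return the chunk_index of every record whose token_count exceeds the limit."""
    return [r["chunk_index"] for r in records if r["token_count"] > token_size]
```

With a file on disk, pass the open file handle directly, for example `list(read_jsonl(open(path, encoding="utf-8")))`.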

How It Works

allmark processes files through a comprehensive pipeline:

Format Conversion: Uses pandoc/pdftotext to convert to markdown
OCR Repair: Fixes broken hyphens, ligatures, soft hyphens
Artifact Removal: Strips images, links, CSS classes, ebook metadata
Code Block Detection: Removes non-literary code/markup blocks
Header/Footer Removal: Statistical detection of repeating elements
Page Number Removal: Multiple pattern matching
TOC Removal: Detects and removes table of contents
Document Analysis: Understands prose density and narrative structure
Frontmatter/Backmatter Trimming: Removes copyright pages, author bios, etc.
Chapter Standardization: Normalizes chapter markers to # Chapter N
Typography Normalization: Fixes quotes, dashes, ellipses
Markdown Validation: Ensures proper markdown formatting
Paragraph Merging: Intelligently rejoins broken paragraphs
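
As an illustration of the Header/Footer Removal and Page Number Removal stages, here is a minimal sketch of frequency-based detection. It is an assumption about how such a stage could work, not allmark's implementation: a line counts as a running header or footer if it is the first or last non-empty line on at least half of the pages (and on at least two), and a page number is a line holding only a short number, optionally prefixed by "page".

```python
import math
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def strip_running_lines(pages: list[list[str]],
                        min_share: float = 0.5) -> list[list[str]]:
    """Drop lines that repeat at page edges, plus bare page numbers."""
    edge_counts: Counter = Counter()
    for lines in pages:
        non_empty = [ln.strip() for ln in lines if ln.strip()]
        if non_empty:
            # A set, so a one-line page is only counted once.
            edge_counts.update({non_empty[0], non_empty[-1]})
    threshold = max(2, math.ceil(len(pages) * min_share))
    repeated = {ln for ln, n in edge_counts.items() if n >= threshold}
    return [
        [ln for ln in lines
         if ln.strip() not in repeated and not PAGE_NUMBER.match(ln)]
        for lines in pages
    ]
```

Every occurrence of a detected line is dropped, not only the ones at page edges. For PDF input, pages could be obtained by splitting the `pdftotext` output on form feed characters, which it emits between pages by default.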

Project Structure

allmark/
├── src/
│   └── allmark/
│       ├── __init__.py       # Package initialization
│       ├── __main__.py       # CLI entry point
│       ├── cli.py            # Command-line interface
│       ├── converter.py      # Main conversion logic
│       ├── cleaners.py       # Text cleaning functions
│       ├── analyzers.py      # Document analysis
│       ├── ocr.py            # OCR artifact repair
│       └── utils.py          # Utility functions
├── setup.py                  # pip installation
├── pyproject.toml            # Modern Python packaging
├── environment.yml           # Conda environment
└── README.md                 # This file

Development

Setting up Development Environment

# Clone the repository
git clone https://github.com/dcondrey/allmark.git
cd allmark

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# OR: Install with pinned dev dependencies for reproducible environment
pip install -r requirements-dev.txt
pip install -e .

Running Tests

pytest
pytest --cov=allmark  # with coverage

Code Formatting

black src/

Linting

flake8 src/
mypy src/

🤝 Contributing

Contributions are welcome! Here's how you can help:

Report bugs: Open an issue with details and reproduction steps
Suggest features: Share your ideas via GitHub issues
Submit PRs: Fork, create a feature branch, and submit a pull request
Improve docs: Help make the documentation clearer

See the Development section above for setup instructions.

📝 License

MIT License - see LICENSE file for details.

💬 Support & Community

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: This README and inline code documentation

🙏 Acknowledgments

Built with:

Pandoc - Universal document converter
Poppler - PDF rendering and text extraction
Python standard library - Zero Python dependencies!

📊 Project Stats

Python Dependencies: 0 (pure stdlib!)
Verified Formats: 10 formats (EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
Additional Formats: 30+ with Calibre (MOBI, AZW3, KF8, DjVu, legacy formats)
Cleaning Stages: 17-stage intelligent pipeline
Safety Checks: Never removes >50% of content
Output Formats: Markdown, JSONL
Test Coverage: Coming soon!
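
The "never removes >50% of content" safety check could look something like the sketch below. It assumes content is measured in characters and that the original text is kept when the limit is exceeded; both are assumptions, and `apply_with_safety` is a hypothetical name.

```python
def apply_with_safety(original: str, cleaned: str,
                      max_removed: float = 0.5) -> str:
    """Return `cleaned` unless cleaning removed more than `max_removed` of the text."""
    if not original:
        return cleaned
    removed = 1 - len(cleaned) / len(original)
    return original if removed > max_removed else cleaned
```

With the default limit, a cleaning pass that shrinks a 100-character text to 60 characters is accepted, while one that shrinks it to 40 characters is rejected and the original is returned.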

📚 Format Support

Tier 1: Verified & Tested ✅

These formats work out-of-the-box with just Pandoc + poppler-utils:

EPUB (.epub, .epub3) - Modern ebooks
HTML (.html, .htm, .xhtml) - Web pages
DOCX (.docx) - Microsoft Word 2007+
PDF (.pdf) - Portable documents
TXT/MD (.txt, .text, .md) - Plain text
RTF (.rtf) - Rich text format
ODT (.odt) - LibreOffice documents
LaTeX (.tex, .latex) - Academic documents
RST (.rst) - reStructuredText, common in Python documentation

Tier 2: With Calibre 🟡

Requires Calibre (brew install --cask calibre or apt install calibre):

MOBI (.mobi) - Mobipocket/Kindle
AZW3/KF8 (.azw3, .kf8) - Amazon Kindle
FB2 (.fb2) - FictionBook (Russian format)
DjVu (.djvu) - Scanned documents (also needs djvulibre)

Tier 3: Legacy Formats ⚠️

Implemented but untested (require Calibre):

Microsoft Reader (.lit), Sony Reader (.lrf), Palm (.pdb, .pml, .prc)
RocketBook (.rb), TomeRaider (.tcr), XPS (.xps)
And 15+ other obsolete formats from the 2000s

Total: 40+ formats supported in code, 10 verified working, 15 example files

See examples/ directory for test files in 15 different formats!

Made with ❤️ for book lovers and data scientists

Release history

0.6.0 (this version) - Nov 9, 2025
0.5.0 - Nov 9, 2025
0.4.0 - Nov 8, 2025

Download files

Download the file for your platform.

Source Distribution

allmark-0.6.0.tar.gz (3.2 MB)

Uploaded Nov 9, 2025 Source

Built Distribution

allmark-0.6.0-py3-none-any.whl (39.6 kB)

Uploaded Nov 9, 2025 Python 3

File details

Details for the file allmark-0.6.0.tar.gz.

File metadata

Download URL: allmark-0.6.0.tar.gz
Upload date: Nov 9, 2025
Size: 3.2 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for allmark-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`abd953dfd46036b68edb84f721a6228e82780d70ae876db094a95ccb9f50d54e`
MD5	`d7db2e6efec135d35e50e57eb9f8efa0`
BLAKE2b-256	`1099dcee7bc72376f1137103aef56221ed34e1f407de4360c6ec2618ca9dd1d2`
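
A downloaded file can be checked against the SHA256 digest above with Python's `hashlib`. This is a generic sketch; `sha256_of` is a hypothetical helper name, and the file path is whatever location you downloaded to.

```python
import hashlib

# SHA256 digest listed above for allmark-0.6.0.tar.gz.
EXPECTED_SDIST_SHA256 = "abd953dfd46036b68edb84f721a6228e82780d70ae876db094a95ccb9f50d54e"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The download is intact if `sha256_of("allmark-0.6.0.tar.gz") == EXPECTED_SDIST_SHA256`.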

Provenance

The following attestation bundles were made for allmark-0.6.0.tar.gz:

Publisher: publish.yml on dcondrey/allmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: allmark-0.6.0.tar.gz
- Subject digest: abd953dfd46036b68edb84f721a6228e82780d70ae876db094a95ccb9f50d54e
- Sigstore transparency entry: 685568888
- Sigstore integration time: Nov 9, 2025
Source repository:
- Permalink: dcondrey/allmark@7afd3402fb7943a476071fbecc636c3aa4fdc1fd
- Branch / Tag: refs/tags/v0.6.0
- Owner: https://github.com/dcondrey
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7afd3402fb7943a476071fbecc636c3aa4fdc1fd
- Trigger Event: release

File details

Details for the file allmark-0.6.0-py3-none-any.whl.

File metadata

Download URL: allmark-0.6.0-py3-none-any.whl
Upload date: Nov 9, 2025
Size: 39.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for allmark-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ebe3bf5545878e45bc8c2866a3a205f3182f13200a22930e6d726b1011a0c152`
MD5	`5e5c5359797cfc99a3187403f5a69c3d`
BLAKE2b-256	`b5bc500669bde16eb243bb275cf0028438fed420b8c909f691dc53f21c4231cf`

Provenance

The following attestation bundles were made for allmark-0.6.0-py3-none-any.whl:

Publisher: publish.yml on dcondrey/allmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: allmark-0.6.0-py3-none-any.whl
- Subject digest: ebe3bf5545878e45bc8c2866a3a205f3182f13200a22930e6d726b1011a0c152
- Sigstore transparency entry: 685568895
- Sigstore integration time: Nov 9, 2025
Source repository:
- Permalink: dcondrey/allmark@7afd3402fb7943a476071fbecc636c3aa4fdc1fd
- Branch / Tag: refs/tags/v0.6.0
- Owner: https://github.com/dcondrey
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7afd3402fb7943a476071fbecc636c3aa4fdc1fd
- Trigger Event: release
