

Book Data Maker

A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.

Table of Contents

🚀 Getting Started

📖 User Guide

🔧 Advanced

📚 Reference


Features

  • 📄 Multi-Format Support: PDF, EPUB, and images
  • 🏠 Self-Hosted OCR: Local transformers for DeepSeek-OCR (no API costs)
  • 🤖 Parallel Generation: Multiple LLM threads explore documents simultaneously
  • 🎯 Smart Distribution: Control thread starting positions
  • 💾 SQLite Storage: Real-time dataset storage with flexible export
  • 📊 Multiple Formats: JSONL, Parquet, CSV, JSON
  • 🌐 Flexible Modes: API or self-hosted for both stages
  • 📈 Progress Tracking: Real-time progress bars

Installation

From PyPI (Recommended)

pip install bookdatamaker

From Source

git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker
pip install -r requirements.txt
pip install -e .

Optional: Local Inference Support

# For self-hosted OCR and LLM generation
pip install bookdatamaker[local]  # From PyPI
# OR
pip install -e ".[local]"  # From source - installs transformers==4.46.3, torch, flash-attn, etc.

Note: The project requires transformers==4.46.3 for optimal compatibility with DeepSeek-OCR. A warning will be displayed if a different version is detected.

System Requirements

For API Mode:

  • Python 3.10+
  • API keys (OpenAI, DeepSeek, etc.)

For Local Mode:

  • Python 3.10-3.12 (3.13 not supported due to vLLM compatibility)
  • NVIDIA GPU with CUDA support (or CPU, though slower)
  • 16GB+ VRAM recommended for GPU
  • transformers==4.46.3
  • Linux or WSL2 (recommended)

Quick Start

Prerequisites

# Set API keys (choose one based on your mode)
export OPENAI_API_KEY=your_openai_key        # For API mode
export DEEPSEEK_API_KEY=your_deepseek_key    # For API OCR mode

Option 1: API Mode (Fastest Setup)

# 1. Install
pip install bookdatamaker

# 2. Extract → Generate → Export
bookdatamaker extract book.pdf -o ./extracted
bookdatamaker generate ./extracted -d dataset.db --distribution "10,10,20,30,20,10"
bookdatamaker export-dataset dataset.db -o output.parquet

Option 2: Self-Hosted Mode (Free, Private)

# 1. Install with local dependencies
pip install bookdatamaker[local]

# 2. Extract with local OCR
bookdatamaker extract book.pdf --mode local --batch-size 8 -o ./extracted

# 3. Generate with vLLM
bookdatamaker generate ./extracted \
  --mode vllm \
  --vllm-model-path meta-llama/Llama-3-8B-Instruct \
  --distribution "25,25,25,25" \
  -d dataset.db

# 4. Export
bookdatamaker export-dataset dataset.db -o output.parquet


Extract Text (Stage 1)

Extract text from documents using DeepSeek OCR.

Supported Formats

  • PDF: Text extraction or OCR from rendered pages
  • EPUB: E-book text extraction
  • Images: JPG, PNG, BMP, TIFF, WebP
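Internally, each input has to be routed to the right extraction path by file type. A minimal sketch of such extension-based dispatch (the helper name and exact rules are illustrative, not bookdatamaker's actual code):

```python
from pathlib import Path

# Illustrative routing of inputs by extension; bookdatamaker's real document
# parser may use different rules.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"}

def classify_input(path: str) -> str:
    """Return the extraction path ('pdf', 'epub', or 'image') for a file."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "pdf"
    if ext == ".epub":
        return "epub"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported format: {ext}")

print(classify_input("book.pdf"))  # pdf
```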

API Mode

# Basic usage
bookdatamaker extract book.pdf -o ./extracted

# Custom API endpoint
bookdatamaker extract book.pdf \
  --deepseek-api-url https://custom-api.example.com/v1 \
  -o ./extracted

Local Mode (Transformers)

Use a local transformers model for OCR (DeepSeek-OCR, no API calls):

# Basic usage - uses transformers AutoModel with flash_attention_2
bookdatamaker extract book.pdf --mode local -o ./extracted

# With custom batch size (adjust based on GPU memory)
bookdatamaker extract book.pdf --mode local --batch-size 12 -o ./extracted

# Use CPU instead of GPU
bookdatamaker extract book.pdf --mode local --device cpu -o ./extracted

# Use specific GPU
bookdatamaker extract book.pdf --mode local --device cuda:1 -o ./extracted

# Process directory of images
bookdatamaker extract ./images/ --mode local -o ./extracted

Device Options:

  • cuda (default): Use default CUDA GPU
  • cuda:0, cuda:1, etc.: Use specific GPU
  • cpu: Use CPU (slower, no GPU required)
  • xpu: Use Intel XPU

Plain Text Mode (No OCR)

For PDFs with embedded text, skip OCR and extract the text directly (much faster):

# Extract plain text from PDF without OCR
bookdatamaker extract book.pdf --plain-text -o ./extracted

Note: EPUB files are automatically extracted as plain text (no OCR needed, no --plain-text flag required):

# EPUB always uses plain text extraction
bookdatamaker extract book.epub -o ./extracted

When to use --plain-text (for PDF):

  • ✅ PDFs with embedded text (e.g., born-digital documents)
  • ✅ Fast extraction without GPU/API requirements
  • ✅ Text-only documents

When NOT to use --plain-text:

  • ❌ Scanned PDFs (images of text)
  • ❌ PDFs with complex layouts requiring OCR
  • ❌ Documents where text extraction quality is poor

Output Structure

./extracted/
├── page_001/
│   ├── page_001.png      # Page image
│   └── result.mmd        # Extracted text in markdown
├── page_002/
│   ├── page_002.png
│   └── result.mmd
└── ...

Note: Each page is stored in its own subdirectory, with the extracted text saved as result.mmd (markdown).
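This layout is easy to consume programmatically. A small illustrative reader (a hypothetical helper, not part of bookdatamaker's API) that maps page numbers to their extracted text:

```python
import tempfile
from pathlib import Path

def load_pages(extracted_dir) -> dict[int, str]:
    """Map page number -> extracted markdown text from page_*/result.mmd."""
    pages = {}
    for page_dir in sorted(Path(extracted_dir).glob("page_*")):
        mmd = page_dir / "result.mmd"
        if mmd.is_file():
            pages[int(page_dir.name.split("_")[1])] = mmd.read_text(encoding="utf-8")
    return pages

# Build a tiny fake extraction to demonstrate the directory contract.
root = Path(tempfile.mkdtemp())
for n, text in [(1, "# Chapter 1"), (2, "# Chapter 2")]:
    d = root / f"page_{n:03d}"
    d.mkdir()
    (d / "result.mmd").write_text(text, encoding="utf-8")

pages = load_pages(root)
```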


Generate Dataset (Stage 2)

Generate Q&A datasets using parallel LLM threads with page-based navigation.

Navigation Model

The system uses page navigation:

  • LLM threads navigate through document pages
  • Tools available: get_current_page, next_page, previous_page, jump_to_page, get_page_context
  • Each thread starts at a specific page based on distribution
  • Threads can move forward/backward through pages to explore content
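The navigation tools can be pictured as a cursor over the page list. A minimal sketch (method names mirror the tool names above, but this is an illustration, not the actual implementation):

```python
class PageNavigator:
    """Illustrative cursor over extracted pages, mimicking the MCP tools."""

    def __init__(self, pages: list[str], start: int = 0):
        self.pages = pages
        # Clamp the starting position into the valid page range.
        self.pos = max(0, min(start, len(pages) - 1))

    def get_current_page(self) -> str:
        return self.pages[self.pos]

    def next_page(self) -> str:
        self.pos = min(self.pos + 1, len(self.pages) - 1)
        return self.pages[self.pos]

    def previous_page(self) -> str:
        self.pos = max(self.pos - 1, 0)
        return self.pages[self.pos]

    def jump_to_page(self, n: int) -> str:
        if not 0 <= n < len(self.pages):
            raise IndexError(f"page {n} out of range")
        self.pos = n
        return self.pages[self.pos]
```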

Checkpoint & Resume

The generation process automatically saves checkpoints to the database:

  • Thread state is saved after each successful Q&A submission
  • If interrupted (Ctrl+C, crash, etc.), simply rerun the same command
  • You'll be prompted to resume from checkpoint or start fresh
# First run (interrupted at 50%)
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ^C (interrupted)

# Resume from checkpoint
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ⚠️  Found 4 incomplete thread(s) in database:
#   Thread 0: 8/20 pairs, last updated 2024-01-15 10:30:45
#   Thread 1: 10/20 pairs, last updated 2024-01-15 10:30:48
#   Thread 2: 12/20 pairs, last updated 2024-01-15 10:30:50
#   Thread 3: 7/20 pairs, last updated 2024-01-15 10:30:43
# 
# Do you want to resume from checkpoint? [Y/n]: y
# ✓ Resuming from checkpoint...

Features:

  • 💾 Automatic checkpoint after each Q&A pair submission
  • 🔄 Resume from last position in document
  • 💬 Preserves conversation history
  • 🎯 Tracks progress per thread
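A per-thread checkpoint like this can be modeled as an upsert into SQLite. The table and column names below are assumptions for illustration, not bookdatamaker's real schema:

```python
import sqlite3

def save_checkpoint(conn, thread_id: int, page: int, pairs_done: int) -> None:
    """Insert or overwrite the latest state for one generation thread."""
    conn.execute(
        """INSERT INTO checkpoints (thread_id, page, pairs_done)
           VALUES (?, ?, ?)
           ON CONFLICT(thread_id) DO UPDATE SET
             page = excluded.page, pairs_done = excluded.pairs_done""",
        (thread_id, page, pairs_done),
    )
    conn.commit()

def incomplete_threads(conn, target_pairs: int):
    """Threads that still have fewer Q&A pairs than the target."""
    return conn.execute(
        "SELECT thread_id, pairs_done FROM checkpoints WHERE pairs_done < ?",
        (target_pairs,),
    ).fetchall()

# Demo on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints (thread_id INTEGER PRIMARY KEY, page INTEGER, pairs_done INTEGER)"
)
save_checkpoint(conn, 0, 8, 8)
save_checkpoint(conn, 0, 12, 10)  # a later checkpoint overwrites the earlier one
print(incomplete_threads(conn, 20))  # [(0, 10)]
```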

Basic Usage

# 6 threads (from distribution), 20 Q&A pairs per thread
bookdatamaker generate ./extracted \
  -d dataset.db \
  --distribution "10,10,20,30,20,10" \
  --datasets-per-thread 20

Key Concept: Thread count is determined by the number of comma-separated values in --distribution.

API Mode Examples

# OpenAI/Azure
bookdatamaker generate ./extracted \
  -d dataset.db \
  --openai-api-url https://api.openai.com/v1 \
  --model gpt-4 \
  --distribution "10,10,20,30,20,10"

# Custom API endpoint
bookdatamaker generate ./extracted \
  --openai-api-url http://localhost:8000/v1 \
  --model your-model-name \
  --distribution "25,25,25,25"

vLLM Direct Mode (Self-Hosted)

Use vLLM directly without API server:

# Single GPU
bookdatamaker generate ./extracted \
  --mode vllm \
  --vllm-model-path meta-llama/Llama-3-8B-Instruct \
  --distribution "25,25,25,25" \
  -d dataset.db

# Multi-GPU (4 GPUs, 6 threads)
bookdatamaker generate ./extracted \
  --mode vllm \
  --vllm-model-path meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --distribution "10,10,20,30,20,10" \
  -d dataset.db

Custom Prompts

Add specific instructions to guide LLM behavior:

# Language specification
bookdatamaker generate ./extracted \
  --custom-prompt "Generate all Q&A in Chinese with simplified characters"

# Format specification
bookdatamaker generate ./extracted \
  --custom-prompt "Questions should be multiple-choice with 4 options"

# Multiple requirements
bookdatamaker generate ./extracted \
  --custom-prompt "Requirements:
1. Generate questions in English
2. Focus on practical applications
3. Include code examples
4. Answer length: 50-150 words
5. Difficulty: intermediate"

Export Dataset

Export from SQLite database to your preferred format:

# Parquet (recommended for data analysis)
bookdatamaker export-dataset dataset.db -o output.parquet

# JSON Lines (easy to stream)
bookdatamaker export-dataset dataset.db -o output.jsonl -f jsonl

# CSV (Excel-friendly)
bookdatamaker export-dataset dataset.db -o output.csv -f csv

# JSON with metadata
bookdatamaker export-dataset dataset.db -o output.json -f json --include-metadata
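Under the hood, any of these exports is just a read of the SQLite tables. A minimal JSONL export sketch, assuming a hypothetical qa_pairs(question, answer) table (bookdatamaker's real schema and export pipeline may differ):

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def export_jsonl(db_path, out_path) -> int:
    """Write each Q&A row as one JSON object per line; return the row count."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT question, answer FROM qa_pairs").fetchall()
    conn.close()
    with open(out_path, "w", encoding="utf-8") as f:
        for q, a in rows:
            f.write(json.dumps({"question": q, "answer": a}, ensure_ascii=False) + "\n")
    return len(rows)

# Demo on a throwaway database.
tmp = Path(tempfile.mkdtemp())
db = sqlite3.connect(tmp / "dataset.db")
db.execute("CREATE TABLE qa_pairs (question TEXT, answer TEXT)")
db.execute("INSERT INTO qa_pairs VALUES (?, ?)",
           ("What is OCR?", "Optical character recognition."))
db.commit()
db.close()
count = export_jsonl(tmp / "dataset.db", tmp / "out.jsonl")
```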

Position Distribution

Control where threads start in the document using distribution percentages.

How It Works

Document: 100 pages
Distribution: "10,10,20,30,20,10" (6 threads)

Thread 0: Start at 0%   → Page 1
Thread 1: Start at 10%  → Page 10
Thread 2: Start at 20%  → Page 20
Thread 3: Start at 40%  → Page 40
Thread 4: Start at 70%  → Page 70
Thread 5: Start at 90%  → Page 90
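Judging from the strategy examples below, each thread starts at the cumulative sum of the percentages before it. A sketch of that arithmetic (an assumption about the CLI's behavior, not its actual code):

```python
def start_pages(distribution: str, total_pages: int) -> list[int]:
    """Thread i starts at the cumulative sum of weights 0..i-1, as a page number."""
    weights = [int(w) for w in distribution.split(",")]
    if sum(weights) != 100:
        raise ValueError("distribution must sum to 100")
    starts, cumulative = [], 0
    for w in weights:
        starts.append(max(1, cumulative * total_pages // 100))  # pages are 1-indexed
        cumulative += w
    return starts

print(start_pages("10,10,20,30,20,10", 100))  # [1, 10, 20, 40, 70, 90]
```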

Distribution Strategies

# Even distribution (4 threads)
--distribution "25,25,25,25"
# Start at: 0%, 25%, 50%, 75%

# Front-heavy (4 threads) - focus on beginning
--distribution "40,30,20,10"
# Start at: 0%, 40%, 70%, 90%

# Middle-heavy (5 threads) - focus on middle
--distribution "10,20,40,20,10"
# Start at: 0%, 10%, 30%, 70%, 90%

# Dense sampling (10 threads) - fine-grained coverage
--distribution "10,10,10,10,10,10,10,10,10,10"

Thread Count Guidelines

  • Small documents (<50 pages): 2-4 threads
  • Medium documents (50-200 pages): 4-8 threads
  • Large documents (>200 pages): 8-16 threads
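These guidelines can be folded into a small helper that picks a thread count and builds an even --distribution string for it. Both functions are illustrative conveniences, not CLI features:

```python
def suggested_thread_count(pages: int) -> int:
    """Midpoints of the rule-of-thumb ranges above (illustrative only)."""
    if pages < 50:
        return 3
    if pages <= 200:
        return 6
    return 12

def even_distribution(n: int) -> str:
    """Build an even --distribution string for n threads, summing to 100."""
    base, rem = divmod(100, n)
    return ",".join(str(base + (1 if i < rem else 0)) for i in range(n))

print(even_distribution(4))  # 25,25,25,25
print(even_distribution(suggested_thread_count(120)))
```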

Interactive Chat

Chat with an LLM that can access your document through MCP tools. Perfect for exploring documents interactively or testing Q&A generation.

Start Chat Session

# Basic chat with GPT-4
bookdatamaker chat ./extracted

# With vLLM server
bookdatamaker chat ./extracted \
  --openai-api-url http://localhost:8000/v1 \
  --model Qwen/Qwen3-4B-Thinking-2507

# With custom database
bookdatamaker chat ./extracted --db my_dataset.db

Debug Mode

Set environment variable for verbose logging:

export LOG_LEVEL=DEBUG
bookdatamaker generate ./extracted -d dataset.db

Development

Project Structure

bookdatamaker/
├── src/bookdatamaker/
│   ├── cli.py                    # CLI interface
│   ├── ocr/
│   │   ├── extractor.py          # OCR extraction
│   │   └── document_parser.py    # Document parsing
│   ├── mcp/
│   │   └── server.py             # MCP server
│   ├── llm/
│   │   └── parallel_generator.py # Parallel generation
│   ├── dataset/
│   │   ├── builder.py            # Dataset building
│   │   └── dataset_manager.py    # SQLite management
│   └── utils/
│       ├── page_manager.py       # Page navigation
│       └── status.py             # Progress indicators
└── tests/                        # Test files

Development Setup

# Clone repository
git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Code formatting
black src/
ruff check src/

# Type checking
mypy src/

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Ensure all tests pass
  5. Submit a pull request

Testing

# Run all tests
pytest

# Run specific test file
pytest tests/test_ocr.py

# Run with coverage
pytest --cov=bookdatamaker tests/

License

MIT License - see LICENSE file for details.
