Book Data Maker
A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.
Table of Contents
- Getting Started
- User Guide
- Advanced
Features
- Multi-Format Support: PDF, EPUB, and images
- Self-Hosted OCR: Local transformers for DeepSeek-OCR (no API costs)
- Parallel Generation: Multiple LLM threads explore documents simultaneously
- Smart Distribution: Control thread starting positions
- SQLite Storage: Real-time dataset storage with flexible export
- Multiple Formats: JSONL, Parquet, CSV, JSON
- Flexible Modes: API or self-hosted for both stages
- Progress Tracking: Real-time progress bars
Installation
From PyPI (Recommended)
pip install bookdatamaker
From Source
git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker
pip install -r requirements.txt
pip install -e .
Optional: Local Inference Support
# For self-hosted OCR and LLM generation
pip install bookdatamaker[local] # From PyPI
# OR
pip install -e ".[local]" # From source - installs transformers==4.46.3, torch, flash-attn, etc.
Note: The project requires transformers==4.46.3 for optimal compatibility with DeepSeek-OCR. A warning will be displayed if a different version is detected.
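The version warning mentioned above can be reproduced with a small check. This is an illustrative sketch (the helper name and exact warning text are assumptions, not the package's actual code):

```python
# Sketch: compare an installed transformers version string against the
# pinned 4.46.3 and emit a warning on mismatch, as the note describes.
import warnings

PINNED = "4.46.3"

def check_transformers_version(installed: str, pinned: str = PINNED) -> bool:
    """Return True if the installed version matches the pin exactly."""
    if installed != pinned:
        warnings.warn(
            f"transformers {installed} detected; {pinned} is recommended "
            "for DeepSeek-OCR compatibility."
        )
        return False
    return True
```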
System Requirements
For API Mode:
- Python 3.10+
- API keys (OpenAI, DeepSeek, etc.)
For Local Mode:
- Python 3.10-3.12 (3.13 not supported due to vLLM compatibility)
- NVIDIA GPU with CUDA support (or CPU, though slower)
- 16GB+ VRAM recommended for GPU
- transformers==4.46.3
- Linux or WSL2 (recommended)
Quick Start
Prerequisites
# Set API keys (choose one based on your mode)
export OPENAI_API_KEY=your_openai_key # For API mode
export DEEPSEEK_API_KEY=your_deepseek_key # For API OCR mode
Option 1: API Mode (Fastest Setup)
# 1. Install
pip install bookdatamaker
# 2. Extract → Generate → Export
bookdatamaker extract book.pdf -o ./extracted
bookdatamaker generate ./extracted -d dataset.db --distribution "10,10,20,30,20,10"
bookdatamaker export-dataset dataset.db -o output.parquet
Option 2: Self-Hosted Mode (Free, Private)
# 1. Install with local dependencies
pip install bookdatamaker[local]
# 2. Extract with local OCR
bookdatamaker extract book.pdf --mode local --batch-size 8 -o ./extracted
# 3. Generate with vLLM
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-8B-Instruct \
--distribution "25,25,25,25" \
-d dataset.db
# 4. Export
bookdatamaker export-dataset dataset.db -o output.parquet
Extract Text (Stage 1)
Extract text from documents using DeepSeek OCR.
Supported Formats
- PDF: Text extraction or OCR from rendered pages
- EPUB: E-book text extraction
- Images: JPG, PNG, BMP, TIFF, WebP
API Mode
Note: DeepSeek does not provide an official OCR API. You need to self-host DeepSeek-OCR using vLLM.
Setup vLLM OCR Server
Follow the vLLM DeepSeek-OCR recipe to set up your server
Use the API
Once your vLLM server is running:
# Basic usage (default: http://localhost:8000/v1)
bookdatamaker extract book.pdf -o ./extracted
# Custom vLLM endpoint
bookdatamaker extract book.pdf \
--deepseek-api-url http://your-server:8000/v1 \
-o ./extracted
# Adjust concurrency for faster processing
bookdatamaker extract book.pdf \
--api-concurrency 8 \
-o ./extracted
Performance Options:
--api-concurrency N: Number of concurrent API requests (default: 4)
- Higher values = faster processing (if your server can handle it)
- Adjust based on your vLLM server capacity and network bandwidth
- Example: 8-16 for powerful servers, 2-4 for smaller setups
Local Mode (Transformers)
Use local transformers model for OCR (DeepSeek-OCR, no API calls):
# Basic usage - uses transformers AutoModel with flash_attention_2
bookdatamaker extract book.pdf --mode local -o ./extracted
# With custom batch size (adjust based on GPU memory)
bookdatamaker extract book.pdf --mode local --batch-size 12 -o ./extracted
# Use CPU instead of GPU
bookdatamaker extract book.pdf --mode local --device cpu -o ./extracted
# Use specific GPU
bookdatamaker extract book.pdf --mode local --device cuda:1 -o ./extracted
# Process directory of images
bookdatamaker extract ./images/ --mode local -o ./extracted
Performance Options:
--batch-size N: Number of images to process in parallel (default: 8)
- Higher values = faster processing but more GPU memory
- Adjust based on available VRAM
- Example: 4 for 8GB VRAM, 8-16 for 24GB+ VRAM
Device Options:
- cuda (default): Use default CUDA GPU
- cuda:0, cuda:1, etc.: Use specific GPU
- cpu: Use CPU (slower, no GPU required)
- xpu: Use Intel XPU
Plain Text Mode (No OCR)
For PDF with embedded text, skip OCR and extract text directly (much faster):
# Extract plain text from PDF without OCR
bookdatamaker extract book.pdf --plain-text -o ./extracted
Note: EPUB files are automatically extracted as plain text (no OCR needed, no --plain-text flag required):
# EPUB always uses plain text extraction
bookdatamaker extract book.epub -o ./extracted
When to use --plain-text (for PDF):
- ✅ PDF with embedded text (e.g., born-digital documents)
- ✅ Fast extraction without GPU/API requirements
- ✅ Text-only documents
When NOT to use --plain-text:
- ❌ Scanned PDFs (images of text)
- ❌ PDFs with complex layouts requiring OCR
- ❌ Documents where text extraction quality is poor
Output Structure
./extracted/
├── page_001/
│   ├── page_001.png   # Page image
│   └── result.mmd     # Extracted text in markdown
├── page_002/
│   ├── page_002.png
│   └── result.mmd
└── ...
Note: Each page is stored in its own subdirectory with the extracted text in result.mmd format.
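Given that layout, stitching the per-page text back into one document is straightforward. A minimal sketch (not part of the CLI, assuming the page_NNN/result.mmd layout shown above):

```python
# Concatenate result.mmd files from an extraction directory in page order.
from pathlib import Path

def collect_pages(extracted_dir: str) -> str:
    """Join the extracted text of all pages, sorted by directory name."""
    root = Path(extracted_dir)
    parts = []
    for page_dir in sorted(root.glob("page_*")):
        mmd = page_dir / "result.mmd"
        if mmd.is_file():
            parts.append(mmd.read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```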
Generate Dataset (Stage 2)
Generate Q&A datasets using parallel LLM threads with page-based navigation.
Navigation Model
The system uses page navigation:
- LLM threads navigate through document pages
- Tools available: get_current_page, next_page, previous_page, jump_to_page, get_page_context
- Each thread starts at a specific page based on distribution
- Threads can move forward/backward through pages to explore content
Checkpoint & Resume
The generation process automatically saves checkpoints to the database:
- Thread state is saved after each successful Q&A submission
- If interrupted (Ctrl+C, crash, etc.), simply rerun the same command
- You'll be prompted to resume from checkpoint or start fresh
# First run (interrupted at 50%)
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ^C (interrupted)
# Resume from checkpoint
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ⚠️ Found 4 incomplete thread(s) in database:
# Thread 0: 8/20 pairs, last updated 2024-01-15 10:30:45
# Thread 1: 10/20 pairs, last updated 2024-01-15 10:30:48
# Thread 2: 12/20 pairs, last updated 2024-01-15 10:30:50
# Thread 3: 7/20 pairs, last updated 2024-01-15 10:30:43
#
# Do you want to resume from checkpoint? [Y/n]: y
# ✓ Resuming from checkpoint...
Features:
- Automatic checkpoint after each Q&A pair submission
- Resume from last position in document
- Preserves conversation history
- Tracks progress per thread
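The checkpoint idea can be sketched with plain SQLite. The table and column names below are assumptions for illustration, not the tool's actual schema:

```python
# Persist each thread's position after every submission so an interrupted
# run can pick up where it left off (upsert keyed by thread_id).
import sqlite3

def init_checkpoints(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS thread_checkpoints ("
        "thread_id INTEGER PRIMARY KEY, current_page INTEGER, pairs_done INTEGER)"
    )

def save_checkpoint(conn, thread_id: int, page: int, pairs_done: int) -> None:
    conn.execute(
        "INSERT INTO thread_checkpoints VALUES (?, ?, ?) "
        "ON CONFLICT(thread_id) DO UPDATE SET "
        "current_page=excluded.current_page, pairs_done=excluded.pairs_done",
        (thread_id, page, pairs_done),
    )
    conn.commit()

def load_checkpoint(conn, thread_id: int):
    """Return (current_page, pairs_done), or None if no saved state."""
    return conn.execute(
        "SELECT current_page, pairs_done FROM thread_checkpoints WHERE thread_id=?",
        (thread_id,),
    ).fetchone()
```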
Basic Usage
# 6 threads (from distribution), 20 Q&A pairs per thread
bookdatamaker generate ./extracted \
-d dataset.db \
--distribution "10,10,20,30,20,10" \
--datasets-per-thread 20
Key Concept: Thread count is determined by the number of comma-separated values in --distribution.
API Mode Examples
# OpenAI/Azure
bookdatamaker generate ./extracted \
-d dataset.db \
--openai-api-url https://api.openai.com/v1 \
--model gpt-4 \
--distribution "10,10,20,30,20,10"
# Custom API endpoint
bookdatamaker generate ./extracted \
--openai-api-url http://localhost:8000/v1 \
--model your-model-name \
--distribution "25,25,25,25"
vLLM Direct Mode (Self-Hosted)
Use vLLM directly without API server:
# Single GPU
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-8B-Instruct \
--distribution "25,25,25,25" \
-d dataset.db
# Multi-GPU (4 GPUs, 6 threads)
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--distribution "10,10,20,30,20,10" \
-d dataset.db
Custom Prompts
Add specific instructions to guide LLM behavior:
# Language specification
bookdatamaker generate ./extracted \
--custom-prompt "Generate all Q&A in Chinese with simplified characters"
# Format specification
bookdatamaker generate ./extracted \
--custom-prompt "Questions should be multiple-choice with 4 options"
# Multiple requirements
bookdatamaker generate ./extracted \
--custom-prompt "Requirements:
1. Generate questions in English
2. Focus on practical applications
3. Include code examples
4. Answer length: 50-150 words
5. Difficulty: intermediate"
Message History Management
Control conversation history to prevent token overflow:
# Limit conversation to 50 messages (keeps system prompt + last 10 when exceeded)
bookdatamaker generate ./extracted \
--max-messages 50 \
-d dataset.db
# For models with limited context windows
bookdatamaker generate ./extracted \
--max-messages 30 \
--model gpt-3.5-turbo
How it works:
- When message count exceeds --max-messages, history is pruned automatically
- System prompt is always preserved
- Last 10 messages are kept for continuity
- Prevents token overflow errors during long generation sessions
- Useful for models with limited context windows (e.g., 4K, 8K tokens)
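The pruning rule above can be sketched in a few lines. This is an assumed reimplementation of the described behavior, not the tool's exact code:

```python
# When history exceeds max_messages, keep the system prompt plus the
# 10 most recent messages; otherwise leave the history untouched.
def prune_history(messages: list[dict], max_messages: int,
                  keep_last: int = 10) -> list[dict]:
    if len(messages) <= max_messages:
        return messages
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    return system + rest[-keep_last:]
```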
Export Dataset
Export from SQLite database to your preferred format:
# Parquet (recommended for data analysis, default: zstd compression)
bookdatamaker export-dataset dataset.db -o output.parquet
# Parquet with different compression methods
bookdatamaker export-dataset dataset.db -o output.parquet -c snappy # Faster, larger files
bookdatamaker export-dataset dataset.db -o output.parquet -c gzip # Smaller, slower
bookdatamaker export-dataset dataset.db -o output.parquet -c brotli # Best compression
bookdatamaker export-dataset dataset.db -o output.parquet -c none # No compression
# JSON Lines (easy to stream)
bookdatamaker export-dataset dataset.db -o output.jsonl -f jsonl
# CSV (Excel-friendly)
bookdatamaker export-dataset dataset.db -o output.csv -f csv
# JSON with metadata
bookdatamaker export-dataset dataset.db -o output.json -f json --include-metadata
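A JSONL export is easy to consume downstream. This sketch streams records one line at a time without loading the whole file (the field names in the usage example are hypothetical):

```python
# Stream a JSON Lines export: one JSON object per non-empty line.
import json

def iter_jsonl(path: str):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```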
Compression Comparison
For Parquet files:
| Method | Speed | Size | Use Case |
|---|---|---|---|
| zstd (default) | Fast | Small | Best balance, recommended |
| snappy | Fastest | Larger | Real-time processing |
| gzip | Medium | Smaller | Network transfer |
| brotli | Slowest | Smallest | Archival storage |
| none | Instant | Largest | Debug/testing only |
Position Distribution
Control where threads start in the document using distribution percentages.
How It Works
Document: 100 pages
Distribution: "10,10,20,30,20,10" (6 threads)
Thread 0: Start at 0%  → Page 1
Thread 1: Start at 10% → Page 10
Thread 2: Start at 20% → Page 20
Thread 3: Start at 40% → Page 40
Thread 4: Start at 70% → Page 70
Thread 5: Start at 90% → Page 90
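The mapping from a distribution string to starting pages is a cumulative sum: thread i starts at the total percentage of all earlier slots. An illustrative reimplementation (not the tool's own code):

```python
# Map a comma-separated percentage string to per-thread starting pages.
# Thread 0 always starts at page 1 (0% of the document).
def start_pages(distribution: str, total_pages: int) -> list[int]:
    percents = [float(p) for p in distribution.split(",")]
    pages = []
    cumulative = 0.0
    for p in percents:
        pages.append(max(1, int(cumulative / 100 * total_pages)))
        cumulative += p
    return pages
```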
Distribution Strategies
# Even distribution (4 threads)
--distribution "25,25,25,25"
# Start at: 0%, 25%, 50%, 75%
# Front-heavy (4 threads) - focus on beginning
--distribution "40,30,20,10"
# Start at: 0%, 40%, 70%, 90%
# Middle-heavy (5 threads) - focus on middle
--distribution "10,20,40,20,10"
# Start at: 0%, 10%, 30%, 70%, 90%
# Dense sampling (10 threads) - fine-grained coverage
--distribution "10,10,10,10,10,10,10,10,10,10"
Thread Count Guidelines
- Small documents (<50 pages): 2-4 threads
- Medium documents (50-200 pages): 4-8 threads
- Large documents (>200 pages): 8-16 threads
Performance Tuning
Optimize extraction and generation speeds based on your hardware and requirements.
Stage 1: OCR Extraction
API Mode (vLLM):
# Increase concurrent requests (default: 4)
bookdatamaker extract book.pdf --api-concurrency 8
# Guidelines:
# - 2-4: Small vLLM server (1-2 GPUs)
# - 4-8: Medium server (2-4 GPUs)
# - 8-16: Large server (4+ GPUs)
# - Monitor server load and adjust accordingly
Local Mode (Transformers):
# Increase batch size (default: 8)
bookdatamaker extract book.pdf --mode local --batch-size 16
# Guidelines based on GPU VRAM:
# - 8GB VRAM: batch-size 2-4
# - 16GB VRAM: batch-size 4-8
# - 24GB VRAM: batch-size 8-12
# - 40GB+ VRAM: batch-size 12-16
Stage 2: Dataset Generation
Thread Count:
# More threads = faster generation (if LLM server can handle it)
bookdatamaker generate ./extracted \
--distribution "10,10,10,10,10,10,10,10,10,10"
# Guidelines:
# - API mode: 4-16 threads (based on rate limits)
# - vLLM mode: 4-8 threads (based on GPU capacity)
# - Local mode: 2-4 threads (memory intensive)
Message History Management:
# Limit conversation history to prevent memory issues
bookdatamaker generate ./extracted \
--max-messages 20 \
-d dataset.db
# Default: 20 messages (system message + last 10 exchanges)
# Lower values = less memory, potentially less context
# Higher values = more memory, better context retention
Duplicate Detection:
- Automatically enabled with 95% similarity threshold
- Uses rapidfuzz for efficient fuzzy matching
- Prevents redundant Q&A pairs in the dataset
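The tool uses rapidfuzz for this; the same idea can be shown with the standard library's difflib and the 95% threshold (a dependency-free sketch, not the actual implementation):

```python
# Flag a candidate question as a duplicate if it is at least 95% similar
# to any previously accepted question (case-insensitive comparison).
from difflib import SequenceMatcher

def is_duplicate(question: str, existing: list[str],
                 threshold: float = 95.0) -> bool:
    q = question.lower().strip()
    for prev in existing:
        score = SequenceMatcher(None, q, prev.lower().strip()).ratio() * 100
        if score >= threshold:
            return True
    return False
```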
Performance Tips
- Start Small: Test with small concurrency/batch sizes first
- Monitor Resources: Watch GPU memory, CPU usage, and network
- Balance Quality vs Speed: Higher concurrency may reduce quality
- Network Bandwidth: API mode performance depends on network speed
- vLLM Configuration: Use tensor parallelism for multi-GPU setups
Interactive Chat
Chat with an LLM that can access your document through MCP tools. Perfect for exploring documents interactively or testing Q&A generation.
Start Chat Session
# Basic chat with GPT-4
bookdatamaker chat ./extracted
# With vLLM server
bookdatamaker chat ./extracted \
--openai-api-url http://localhost:8000/v1 \
--model Qwen/Qwen3-4B-Thinking-2507
# With custom database
bookdatamaker chat ./extracted --db my_dataset.db
Debug Mode
Set environment variable for verbose logging:
export LOG_LEVEL=DEBUG
bookdatamaker generate ./extracted -d dataset.db
Development
Project Structure
bookdatamaker/
├── src/bookdatamaker/
│   ├── cli.py                    # CLI interface
│   ├── ocr/
│   │   ├── extractor.py          # OCR extraction
│   │   └── document_parser.py    # Document parsing
│   ├── mcp/
│   │   └── server.py             # MCP server
│   ├── llm/
│   │   └── parallel_generator.py # Parallel generation
│   ├── dataset/
│   │   ├── builder.py            # Dataset building
│   │   └── dataset_manager.py    # SQLite management
│   └── utils/
│       ├── page_manager.py       # Page navigation
│       └── status.py             # Progress indicators
└── tests/                        # Test files
Development Setup
# Clone repository
git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Code formatting
black src/
ruff check src/
# Type checking
mypy src/
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Ensure all tests pass
- Submit a pull request
Testing
# Run all tests
pytest
# Run specific test file
pytest tests/test_ocr.py
# Run with coverage
pytest --cov=bookdatamaker tests/
License
MIT License - see LICENSE file for details.