CLI tool for extracting text with DeepSeek OCR and generating datasets

Book Data Maker

A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.

Features

📄 Multi-Format Support: PDF, EPUB, and images
🏠 Self-Hosted OCR: Local transformers for DeepSeek-OCR (no API costs)
🤖 Parallel Generation: Multiple LLM threads explore documents simultaneously
🎯 Smart Distribution: Control thread starting positions
💾 SQLite Storage: Real-time dataset storage with flexible export
📊 Multiple Formats: JSONL, Parquet, CSV, JSON
🌐 Flexible Modes: API or self-hosted for both stages
📈 Progress Tracking: Real-time progress bars
⚡ Resume Support: Continue interrupted sessions

Quick Start

Prerequisites

# Set API keys (choose one based on your mode)
export OPENAI_API_KEY=your_openai_key        # For API mode
export DEEPSEEK_API_KEY=your_deepseek_key    # For API OCR mode

Option 1: API Mode (Fastest Setup)

# 1. Install (run from a clone of the repository; see Installation)
pip install -r requirements.txt && pip install -e .

# 2. Extract → Generate → Export
bookdatamaker extract book.pdf -o ./extracted
bookdatamaker generate ./extracted/combined.txt -d dataset.db --distribution "10,10,20,30,20,10"
bookdatamaker export-dataset dataset.db -o output.parquet

Option 2: Self-Hosted Mode (Free, Private)

# 1. Install with local dependencies (run from a clone of the repository; see Installation)
pip install -r requirements.txt && pip install -e ".[local]"

# 2. Extract with local OCR
bookdatamaker extract book.pdf --mode local --batch-size 8 -o ./extracted

# 3. Generate with vLLM
bookdatamaker generate ./extracted/combined.txt \
  --mode vllm \
  --vllm-model-path meta-llama/Llama-3-8B-Instruct \
  --distribution "25,25,25,25" \
  -d dataset.db

# 4. Export
bookdatamaker export-dataset dataset.db -o output.parquet

Installation

Basic Installation

git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker
pip install -r requirements.txt
pip install -e .

Optional: Local Inference Support

# For self-hosted OCR and LLM generation
pip install -e ".[local]"  # Installs transformers==4.46.3, torch, flash-attn, etc.

Note: The project expects transformers==4.46.3 for compatibility with DeepSeek-OCR and displays a warning if a different version is detected.

System Requirements

For API Mode:

Python 3.10+
API keys (OpenAI, DeepSeek, etc.)

For Local Mode:

Python 3.10-3.12 (3.13 not supported due to vLLM compatibility)
NVIDIA GPU with CUDA support (or CPU, though slower)
16GB+ VRAM recommended for GPU
transformers==4.46.3
Linux or WSL2 (recommended)

Extract Text (Stage 1)

Extract text from documents using DeepSeek OCR.

Supported Formats

PDF: Text extraction or OCR from rendered pages
EPUB: E-book text extraction
Images: JPG, PNG, BMP, TIFF, WebP

API Mode

# Basic usage
bookdatamaker extract book.pdf -o ./extracted

# Custom API endpoint
bookdatamaker extract book.pdf \
  --deepseek-api-url https://custom-api.example.com/v1 \
  -o ./extracted

Local Mode (Transformers)

Run DeepSeek-OCR locally through transformers (no API calls):

# Basic usage - uses transformers AutoModel with flash_attention_2
bookdatamaker extract book.pdf --mode local -o ./extracted

# With custom batch size (adjust based on GPU memory)
bookdatamaker extract book.pdf --mode local --batch-size 12 -o ./extracted

# Use CPU instead of GPU
bookdatamaker extract book.pdf --mode local --device cpu -o ./extracted

# Use specific GPU
bookdatamaker extract book.pdf --mode local --device cuda:1 -o ./extracted

# Process directory of images
bookdatamaker extract ./images/ --mode local -o ./extracted

Batch Size Guidelines:

12-16: GPUs with 24GB+ VRAM
8-12: GPUs with 16GB+ VRAM (default: 8)
4-8: GPUs with 8-12GB VRAM
1-4: GPUs with <8GB VRAM
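
The guidelines above can be written as a small lookup. This is a minimal sketch, not part of bookdatamaker: `suggest_batch_size` is a hypothetical helper, and placing 12-16GB cards in the 4-8 range is an assumption, since the list does not cover that interval explicitly.

```python
def suggest_batch_size(vram_gb: float) -> tuple[int, int]:
    """Return the (min, max) --batch-size range for a given amount of VRAM."""
    if vram_gb >= 24:
        return (12, 16)
    if vram_gb >= 16:
        return (8, 12)
    if vram_gb >= 8:
        # Assumption: 12-16GB cards are treated like 8-12GB cards.
        return (4, 8)
    return (1, 4)


low, high = suggest_batch_size(16)
print(f"Try --batch-size between {low} and {high}")
```

Starting at the low end of the range and increasing until memory runs out is the safer direction.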

Device Options:

cuda (default): Use default CUDA GPU
cuda:0, cuda:1, etc.: Use specific GPU
cpu: Use CPU (slower, no GPU required)

Output Structure

./extracted/
├── page_001.txt
├── page_002.txt
├── ...
└── combined.txt    # All pages with [PAGE_XXX] markers
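
The combined file can be split back into pages by scanning for the markers. This is a minimal sketch, assuming the markers look like `[PAGE_001]`; the exact marker format is not specified here, so adjust the regular expression to match your output. `split_pages` is a hypothetical helper, not part of bookdatamaker.

```python
import re

# Assumed marker format: "[PAGE_001]", "[PAGE_002]", ...
PAGE_MARKER = re.compile(r"\[PAGE_(\d+)\]")


def split_pages(combined_text: str) -> dict[int, str]:
    """Map page number -> page text, based on [PAGE_XXX] markers."""
    pages: dict[int, str] = {}
    # With one capture group, re.split returns
    # [text_before_first_marker, number_1, text_1, number_2, text_2, ...]
    parts = PAGE_MARKER.split(combined_text)
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = text.strip()
    return pages


sample = "[PAGE_001]\nFirst page.\n[PAGE_002]\nSecond page.\n"
for number, text in split_pages(sample).items():
    print(number, repr(text))
```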

Generate Dataset (Stage 2)

Generate Q&A datasets using parallel LLM threads.

Basic Usage

# 6 threads (from distribution), 20 Q&A pairs per thread
bookdatamaker generate combined.txt \
  -d dataset.db \
  --distribution "10,10,20,30,20,10" \
  --datasets-per-thread 20

Key Concept: Thread count is determined by the number of comma-separated values in --distribution.
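
Because each value in --distribution starts one thread, the overall target is the thread count multiplied by --datasets-per-thread. A small sketch of that arithmetic follows; `planned_totals` is a hypothetical helper, and the total is a target rather than a guarantee.

```python
def planned_totals(distribution: str, datasets_per_thread: int) -> tuple[int, int]:
    """Return (thread_count, target_pair_count) for a generate run."""
    values = [v for v in distribution.split(",") if v.strip()]
    thread_count = len(values)
    return thread_count, thread_count * datasets_per_thread


threads, target = planned_totals("10,10,20,30,20,10", 20)
print(f"{threads} threads, about {target} Q&A pairs targeted")
```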

API Mode Examples

# OpenAI/Azure
bookdatamaker generate combined.txt \
  -d dataset.db \
  --openai-api-url https://api.openai.com/v1 \
  --model gpt-4 \
  --distribution "10,10,20,30,20,10"

# Custom API endpoint
bookdatamaker generate combined.txt \
  --openai-api-url http://localhost:8000/v1 \
  --model your-model-name \
  --distribution "25,25,25,25"

vLLM Direct Mode (Self-Hosted)

Use vLLM directly, without an API server:

# Single GPU
bookdatamaker generate combined.txt \
  --mode vllm \
  --vllm-model-path meta-llama/Llama-3-8B-Instruct \
  --distribution "25,25,25,25" \
  -d dataset.db

# Multi-GPU (4 GPUs, 6 threads)
bookdatamaker generate combined.txt \
  --mode vllm \
  --vllm-model-path meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --distribution "10,10,20,30,20,10" \
  -d dataset.db

Benefits of vLLM Mode:

No API costs
Full privacy (local processing)
Optimized inference
Thread-safe parallel processing
Automatic batching

Custom Prompts

Add specific instructions to guide LLM behavior:

# Language specification
bookdatamaker generate combined.txt \
  --custom-prompt "Generate all Q&A in Chinese with simplified characters"

# Format specification
bookdatamaker generate combined.txt \
  --custom-prompt "Questions should be multiple-choice with 4 options"

# Multiple requirements
bookdatamaker generate combined.txt \
  --custom-prompt "Requirements:
1. Generate questions in English
2. Focus on practical applications
3. Include code examples
4. Answer length: 50-150 words
5. Difficulty: intermediate"

Export Dataset

Export from SQLite database to your preferred format:

# Parquet (recommended for data analysis)
bookdatamaker export-dataset dataset.db -o output.parquet

# JSON Lines (easy to stream)
bookdatamaker export-dataset dataset.db -o output.jsonl -f jsonl

# CSV (Excel-friendly)
bookdatamaker export-dataset dataset.db -o output.csv -f csv

# JSON with metadata
bookdatamaker export-dataset dataset.db -o output.json -f json --include-metadata

Format Comparison:

| Format | Best For | Size | Load Speed |
|--------|----------|------|------------|
| Parquet | Data analysis, ML | Smallest | Fastest |
| JSONL | Streaming, processing | Medium | Fast |
| CSV | Excel, spreadsheets | Largest | Medium |
| JSON | API responses | Large | Slow |

Position Distribution

Control where threads start in the document using distribution percentages.

How It Works

Document: 500 paragraphs
Distribution: "10,10,20,30,20,10" (6 threads)

Each thread starts at the running total of the values before it:

Thread 0: Start at 0%   → Paragraph 1
Thread 1: Start at 10%  → Paragraph 50
Thread 2: Start at 20%  → Paragraph 100
Thread 3: Start at 40%  → Paragraph 200
Thread 4: Start at 70%  → Paragraph 350
Thread 5: Start at 90%  → Paragraph 450

Distribution Strategies

# Even distribution (4 threads)
--distribution "25,25,25,25"
# Start at: 0%, 25%, 50%, 75%

# Front-heavy (4 threads) - focus on beginning
--distribution "40,30,20,10"
# Start at: 0%, 40%, 70%, 90%

# Middle-heavy (5 threads) - focus on middle
--distribution "10,20,40,20,10"
# Start at: 0%, 10%, 30%, 70%, 90%

# Dense sampling (10 threads) - fine-grained coverage
--distribution "10,10,10,10,10,10,10,10,10,10"
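
The start offsets in the comments above match a running sum of the preceding values. Here is a minimal sketch of that arithmetic, assuming this cumulative rule and 1-based paragraph numbering; both are inferred from the examples in this section, not taken from the bookdatamaker source. `start_positions` is a hypothetical helper.

```python
def start_positions(distribution: str, total_paragraphs: int) -> list[tuple[int, int]]:
    """Return (start_percent, start_paragraph) for each thread."""
    weights = [int(value) for value in distribution.split(",")]
    positions = []
    cumulative = 0
    for weight in weights:
        # Paragraphs are numbered from 1, so 0% maps to paragraph 1.
        paragraph = max(1, total_paragraphs * cumulative // 100)
        positions.append((cumulative, paragraph))
        cumulative += weight
    return positions


for thread_id, (percent, paragraph) in enumerate(start_positions("40,30,20,10", 500)):
    print(f"Thread {thread_id}: start at {percent}% -> paragraph {paragraph}")
```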

Thread Count Guidelines

Small documents (<100 paragraphs): 2-4 threads
Medium documents (100-500 paragraphs): 4-8 threads
Large documents (>500 paragraphs): 8-16 threads

Performance Tuning

Extraction (Stage 1)

Batch Size Optimization (Transformers):

# Maximum speed (24GB+ VRAM) - uses transformers with DeepSeek-OCR
bookdatamaker extract book.pdf --mode local --batch-size 16

# Balanced (16GB VRAM) - transformers default batch size
bookdatamaker extract book.pdf --mode local --batch-size 8

# Conservative (<8GB VRAM) - smaller batches for limited VRAM
bookdatamaker extract book.pdf --mode local --batch-size 4

# Use CPU if no GPU available (slower)
bookdatamaker extract book.pdf --mode local --device cpu --batch-size 2

Multi-GPU Setup:

# Use specific GPU in multi-GPU system
bookdatamaker extract book.pdf --mode local --device cuda:0
bookdatamaker extract book.pdf --mode local --device cuda:1

# Run multiple processes on different GPUs simultaneously
bookdatamaker extract book1.pdf --mode local --device cuda:0 &
bookdatamaker extract book2.pdf --mode local --device cuda:1 &
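
The same fan-out can be scripted. This is a minimal sketch that uses only the `extract` flags shown in this document; `build_extract_commands`, `run_all`, and the per-book output directories are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path


def build_extract_commands(books: list[str], gpu_count: int) -> list[list[str]]:
    """Assign books to GPUs round-robin and build one CLI call per book."""
    commands = []
    for index, book in enumerate(books):
        device = f"cuda:{index % gpu_count}"
        # Assumption: one output directory per book to avoid overwriting pages.
        out_dir = f"./extracted/{Path(book).stem}"
        commands.append(
            ["bookdatamaker", "extract", book,
             "--mode", "local", "--device", device, "-o", out_dir]
        )
    return commands


def run_all(commands: list[list[str]]) -> list[int]:
    """Start every command at once and return the exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


commands = build_extract_commands(["book1.pdf", "book2.pdf"], gpu_count=2)
for command in commands:
    print(" ".join(command))
# run_all(commands)  # uncomment to actually launch the processes
```

`run_all` starts everything at once, so passing more books than GPUs puts several processes on the same device; batch the list if memory is tight.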

Generation (Stage 2)

Optimal Configurations:

# Maximum throughput (multi-GPU, 12 threads)
bookdatamaker generate text.txt --mode vllm \
  --vllm-model-path meta-llama/Llama-3-70B \
  --tensor-parallel-size 4 \
  --distribution "5,5,10,10,15,15,15,10,5,5,2,3" \
  --datasets-per-thread 50

# Balanced (single GPU, 6 threads)
bookdatamaker generate text.txt --mode vllm \
  --vllm-model-path meta-llama/Llama-3-8B \
  --distribution "10,10,20,30,20,10" \
  --datasets-per-thread 20

# Conservative (2 threads)
bookdatamaker generate text.txt --mode vllm \
  --vllm-model-path meta-llama/Llama-3-8B \
  --distribution "50,50" \
  --datasets-per-thread 10

Interactive Chat

Chat with an LLM that can access your document through MCP tools. Perfect for exploring documents interactively or testing Q&A generation.

Start Chat Session

# Basic chat with GPT-4
bookdatamaker chat combined.txt

# With vLLM server
bookdatamaker chat combined.txt \
  --openai-api-url http://localhost:8000/v1 \
  --model Qwen/Qwen3-4B-Thinking-2507

# With custom database
bookdatamaker chat combined.txt --db my_dataset.db

Example Interaction

📚 Document: combined.txt
📊 Paragraphs: 578
🤖 Model: gpt-4

You: What's in paragraph 100?
- `-f, --format`: Format: `jsonl`, `parquet`, `csv`, `json` (default: `parquet`)
- `--include-metadata`: Include timestamps

### Parameter Tables

#### extract Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | required | - | Input file or directory |
| `--output-dir` | optional | `extracted_text` | Output directory |
| `--mode` | optional | `api` | OCR mode: `api` or `local` |
| `--batch-size` | optional | `8` | Batch size for local mode |
| `--device` | optional | `cuda` | Torch device for local mode: `cuda`, `cuda:0`, `cpu` |
| `--deepseek-api-key` | optional | env var | DeepSeek API key |
| `--deepseek-api-url` | optional | `https://api.deepseek.com/v1` | DeepSeek API URL |
| `--local-model-path` | optional | `deepseek-ai/DeepSeek-OCR` | Local model path |

#### generate Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text_file` | required | - | Combined text file |
| `--db` | optional | `dataset.db` | Database file path |
| `--mode` | optional | `api` | LLM mode: `api` or `vllm` |
| `--distribution` | optional | `10,10,20,30,20,10` | Position distribution (determines threads) |
| `--datasets-per-thread` | optional | `10` | Target Q&A pairs per thread |
| `--openai-api-key` | optional | env var | OpenAI API key |
| `--openai-api-url` | optional | `https://api.openai.com/v1` | API URL |
| `--model` | optional | `gpt-4` | Model name |
| `--vllm-model-path` | optional | - | vLLM model path |
| `--tensor-parallel-size` | optional | `1` | Number of GPUs |
| `--custom-prompt` | optional | - | Additional instructions |

---

## Troubleshooting

### Common Issues

**Problem: Threads not completing**
- Reduce `--datasets-per-thread`
- Check API rate limits
- Verify API keys
- Ensure document has enough content

**Problem: Out of memory (OCR)**
- Reduce `--batch-size`
- Use `--device cpu` to run on CPU instead of GPU
- Use API mode instead of local
- Use specific GPU with `--device cuda:0` if you have multiple GPUs

**Problem: Out of memory (Generation)**
- Reduce thread count (fewer distribution values)
- Use smaller model
- Reduce `--tensor-parallel-size`

**Problem: Low quality Q&A pairs**
- Adjust distribution to focus on content-rich sections
- Use higher-quality model (e.g., GPT-4)
- Add specific `--custom-prompt` instructions
- Check OCR quality

**Problem: SQLite errors**
- Ensure database path is writable
- Don't modify database during generation
- Delete and regenerate if corrupted
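
Before deleting a database, it can help to confirm that it is actually damaged. Here is a minimal sketch using Python's built-in `sqlite3` module; it opens the file read-only and lists tables with row counts, without assuming anything about the bookdatamaker schema. `inspect_db` is a hypothetical helper.

```python
import sqlite3
from pathlib import Path


def inspect_db(path: str) -> tuple[str, dict[str, int]]:
    """Return the integrity_check result and a row count per table."""
    # Open read-only so the check cannot modify a database that is in use.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # SQLite reports "ok" when it finds no corruption.
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        counts = {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
        return status, counts
    finally:
        conn.close()


if Path("dataset.db").exists():
    status, counts = inspect_db("dataset.db")
    print(status, counts)
```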

### Debug Mode

Set environment variable for verbose logging:

```bash
export LOG_LEVEL=DEBUG
bookdatamaker generate combined.txt -d dataset.db
```

Development

Project Structure

bookdatamaker/
├── src/bookdatamaker/
│   ├── cli.py                    # CLI interface
│   ├── ocr/
│   │   ├── extractor.py          # OCR extraction
│   │   └── document_parser.py    # Document parsing
│   ├── mcp/
│   │   └── server.py             # MCP server
│   ├── llm/
│   │   └── parallel_generator.py # Parallel generation
│   ├── dataset/
│   │   ├── builder.py            # Dataset building
│   │   └── dataset_manager.py    # SQLite management
│   └── utils/
│       ├── page_manager.py       # Page navigation
│       └── status.py             # Progress indicators
└── tests/                        # Test files

Development Setup

# Clone repository
git clone https://github.com/yourusername/bookdatamaker.git
cd bookdatamaker

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Code formatting
black src/
ruff check src/

# Type checking
mypy src/

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new features
Ensure all tests pass
Submit a pull request

Testing

# Run all tests
pytest

# Run specific test file
pytest tests/test_ocr.py

# Run with coverage
pytest --cov=bookdatamaker tests/

License

MIT License - see LICENSE file for details.

Download files

Source Distribution

bookdatamaker-0.2.0.tar.gz (45.1 kB)

Uploaded Nov 13, 2025 Source

Built Distribution

bookdatamaker-0.2.0-py3-none-any.whl (41.3 kB)

Uploaded Nov 13, 2025 Python 3

File details

Details for the file bookdatamaker-0.2.0.tar.gz.

File metadata

Download URL: bookdatamaker-0.2.0.tar.gz
Upload date: Nov 13, 2025
Size: 45.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bookdatamaker-0.2.0.tar.gz

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | `ad55231838e9988ae5286276e7a9f361684b33acbc0be448924bd1bb6a2d9ab8` |
| MD5 | `06bff2ae1874855da3a51c433e4a4a80` |
| BLAKE2b-256 | `7eb5867909eb71e69421aa64c25348b5597d7a40c9e6d6a279045eba19e3325e` |

Provenance

The following attestation bundles were made for bookdatamaker-0.2.0.tar.gz:

Publisher: python-publish.yml on zwh20081/bookdatamaker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bookdatamaker-0.2.0.tar.gz
- Subject digest: ad55231838e9988ae5286276e7a9f361684b33acbc0be448924bd1bb6a2d9ab8
- Sigstore transparency entry: 699282521
- Sigstore integration time: Nov 13, 2025
Source repository:
- Permalink: zwh20081/bookdatamaker@21d7edc5e4f9ac597df28b8e490d4ea15319f1aa
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/zwh20081
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@21d7edc5e4f9ac597df28b8e490d4ea15319f1aa
- Trigger Event: release

File details

Details for the file bookdatamaker-0.2.0-py3-none-any.whl.

File metadata

Download URL: bookdatamaker-0.2.0-py3-none-any.whl
Upload date: Nov 13, 2025
Size: 41.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bookdatamaker-0.2.0-py3-none-any.whl

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | `7bd3bcd10b7131f115514713b215c1f66389013e12b199d4517cfb6d136008e4` |
| MD5 | `80b6d3577d25e90d96be7c15f845dc3d` |
| BLAKE2b-256 | `172563426b4c277042cd7b2d90c0ff018fe7535a043966a07741028d4a65caa7` |

Provenance

The following attestation bundles were made for bookdatamaker-0.2.0-py3-none-any.whl:

Publisher: python-publish.yml on zwh20081/bookdatamaker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bookdatamaker-0.2.0-py3-none-any.whl
- Subject digest: 7bd3bcd10b7131f115514713b215c1f66389013e12b199d4517cfb6d136008e4
- Sigstore transparency entry: 699282526
- Sigstore integration time: Nov 13, 2025
Source repository:
- Permalink: zwh20081/bookdatamaker@21d7edc5e4f9ac597df28b8e490d4ea15319f1aa
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/zwh20081
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@21d7edc5e4f9ac597df28b8e490d4ea15319f1aa
- Trigger Event: release
