A client library for BookWyrm

Development Status
- 3 - Alpha
Intended Audience
- Developers

bookwyrm

A Python client library and CLI for building RAG (Retrieval-Augmented Generation) systems and AI agents. BookWyrm exposes its text processing capabilities through a single API, so document analysis and citation systems can be built without writing the ingestion pipeline from scratch.

Documentation

📖 Full Documentation

Key Capabilities

BookWyrm simplifies RAG and agent development by providing these core endpoints:

Citation Finding - Automatically find and extract relevant citations from text chunks based on questions or queries
Text Processing - Break down large documents into meaningful phrases and chunks with configurable sizing
Document Classification - Intelligently classify files and content by format, type, and structure
PDF Structure Extraction - Extract structured text data from PDF files using OCR with bounding box coordinates
Summarization - Generate concise summaries from collections of text phrases or documents
Streaming Support - Real-time processing with progress updates for all major operations

These capabilities work together to provide a complete pipeline for document ingestion, processing, and retrieval - the foundation of any RAG system.

Installation

Using uv (recommended for development)

# Clone the repository
git clone https://github.com/scidonia/bookwyrm-client.git
cd bookwyrm-client

# Install dependencies and create virtual environment
uv sync

# Install in development mode
uv pip install -e .

Using pip

# Install from PyPI
pip install bookwyrm

Getting an API Key

To use the BookWyrm client, you'll need an API key from bookwyrm.ai:

Visit bookwyrm.ai
Click on "Sign up for beta" to create an account
Once registered, you can create an API key in the dashboard.
Set your API key as an environment variable or pass it directly to the client

export BOOKWYRM_API_KEY="your-api-key-here"
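
A minimal sketch of reading the key from the environment before constructing a client. `load_api_key` is a hypothetical helper, not part of the bookwyrm package; only the `BOOKWYRM_API_KEY` variable name comes from the step above.

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper; `env` can be any mapping, which makes testing easy.
    """
    env = os.environ if env is None else env
    key = env.get("BOOKWYRM_API_KEY", "").strip()
    if not key:
        raise RuntimeError("BOOKWYRM_API_KEY is not set; create a key at bookwyrm.ai")
    return key


# Example with an explicit mapping instead of the real environment.
api_key = load_api_key({"BOOKWYRM_API_KEY": "your-api-key-here"})
print("API key loaded:", bool(api_key))
```

The returned value can then be passed as `api_key=...` when creating the client, as in the usage examples.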

Usage

Python Library

The BookWyrm client provides both synchronous and asynchronous interfaces for text processing, citation finding, summarization, and phrasal analysis.

Synchronous Client

from bookwyrm import BookWyrmClient
from bookwyrm.models import TextSpan

# Initialize client
client = BookWyrmClient(base_url="https://api.bookwyrm.ai:443", api_key="your-key")

# Citation finding using function interface
chunks = [
    TextSpan(text="This is the first chunk.", start_char=0, end_char=24),
    TextSpan(text="This is the second chunk.", start_char=25, end_char=50),
]

# Stream citations (real-time results) - function interface
citations = []
for stream_response in client.stream_citations(
    chunks=chunks,
    question="What are the chunks about?"
):
    if hasattr(stream_response, 'citation'):
        citations.append(stream_response.citation)
        print(f"New citation: {stream_response.citation.text}")
    elif hasattr(stream_response, 'message'):
        print(f"Progress: {stream_response.message}")
    elif hasattr(stream_response, 'total_citations'):
        print(f"Found {stream_response.total_citations} citations total")

for citation in citations:
    print(f"Quality: {citation.quality}/4")
    print(f"Text: {citation.text}")
    print(f"Reasoning: {citation.reasoning}")

# Phrasal text processing with boolean flags
for response in client.stream_process_text(
    text_url="https://www.gutenberg.org/cache/epub/32706/pg32706.txt",  # Triplanetary by E. E. Smith
    chunk_size=1000,
    offsets=True  # Boolean flag for WITH_OFFSETS
):
    if hasattr(response, 'text'):
        print(f"Phrase: {response.text[:100]}...")
    elif hasattr(response, 'message'):
        print(f"Progress: {response.message}")

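# File classification using function interface
# Read the file inside a with-block so the handle is closed promptly
with open("alice_wonderland.epub", "rb") as f:
    epub_bytes = f.read()

classification_response = client.classify(
    content_bytes=epub_bytes,
    filename="alice_wonderland.epub"  # Optional hint
)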
print(f"Format: {classification_response.classification.format_type}")
print(f"Content Type: {classification_response.classification.content_type}")
print(f"MIME Type: {classification_response.classification.mime_type}")
print(f"Confidence: {classification_response.classification.confidence:.2%}")
print(f"File Size: {classification_response.file_size:,} bytes")

# Classify local text content
with open("document.txt", "r") as f:
    content = f.read()

local_response = client.classify(
    content=content,
    filename="document.txt"
)
print(f"Local file classified as: {local_response.classification.content_type}")

# Classify binary content using raw bytes
with open("image.jpg", "rb") as f:
    binary_content = f.read()

binary_response = client.classify(
    content_bytes=binary_content,
    filename="image.jpg"
)
print(f"Binary file classified as: {binary_response.classification.content_type}")

# Streaming PDF extraction with progress
pages = []
for stream_response in client.stream_extract_pdf(
    pdf_url="https://example.com/document.pdf",
    start_page=1,
    num_pages=5
):
    if hasattr(stream_response, 'page_data'):
        pages.append(stream_response.page_data)
        print(f"Processed page {stream_response.document_page}: {len(stream_response.page_data.text_blocks)} elements")
    elif hasattr(stream_response, 'total_pages'):
        print(f"Starting extraction of {stream_response.total_pages} pages")

print(f"Extracted {len(pages)} pages")
print(f"Found {sum(len(page.text_blocks) for page in pages)} text elements")

# Extract from local PDF file using raw bytes
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

local_pages = []
for stream_response in client.stream_extract_pdf(
    pdf_bytes=pdf_bytes,
    filename="document.pdf",
    start_page=10,
    num_pages=5
):
    if hasattr(stream_response, 'page_data'):
        local_pages.append(stream_response.page_data)

print(f"Extracted pages 10-14: {len(local_pages)} pages processed")

# Streaming summarization
final_summary = None
for response in client.stream_summarize(
    content="Long text content to summarize...",
    max_tokens=5000,
    debug=True
):
    if hasattr(response, 'summary'):
        final_summary = response
        break
    elif hasattr(response, 'message'):
        print(f"Progress: {response.message}")

if final_summary:
    print(f"Summary: {final_summary.summary}")
    print(f"Used {final_summary.levels_used} levels")

client.close()
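
The streaming examples above all branch on `hasattr()` and end with an explicit `client.close()`. The sketch below factors that pattern into one function and uses `contextlib.closing` so `close()` runs even if the loop raises. `FakeClient` and `split_stream` are stand-ins invented for illustration so the code runs without network access; they are not part of the library.

```python
from contextlib import closing
from types import SimpleNamespace


def split_stream(stream):
    """Separate streamed items into citations and progress messages.

    Mirrors the hasattr() checks used in the examples above.
    """
    citations, messages = [], []
    for item in stream:
        if hasattr(item, "citation"):
            citations.append(item.citation)
        elif hasattr(item, "message"):
            messages.append(item.message)
    return citations, messages


class FakeClient:
    """Stand-in client that yields canned responses (illustration only)."""

    def __init__(self):
        self.closed = False

    def stream_citations(self, chunks, question):
        yield SimpleNamespace(message="starting")
        yield SimpleNamespace(citation=SimpleNamespace(text=chunks[0], quality=3))
        yield SimpleNamespace(total_citations=1)

    def close(self):
        self.closed = True


# closing() calls client.close() on exit, including when an exception is raised.
with closing(FakeClient()) as client:
    citations, messages = split_stream(
        client.stream_citations(chunks=["first chunk"], question="What is this about?")
    )

print(len(citations), messages, client.closed)
```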

Asynchronous Client

import asyncio
from bookwyrm import AsyncBookWyrmClient

async def main():
    # Initialize async client
    async with AsyncBookWyrmClient(base_url="https://api.bookwyrm.ai:443", api_key="your-key") as client:
        
        # Stream citations
        citations = []
        async for stream_response in client.stream_citations(
            jsonl_url="https://example.com/chunks.jsonl",
            question="What is the main topic?"
        ):
            if hasattr(stream_response, 'citation'):
                citations.append(stream_response.citation)
                print(f"New citation: {stream_response.citation.text}")
            elif hasattr(stream_response, 'total_citations'):
                print(f"Found {stream_response.total_citations} citations")

        # Phrasal text processing with boolean flags
        async for response in client.stream_process_text(
            text_url="https://www.gutenberg.org/cache/epub/32706/pg32706.txt",  # Triplanetary by E. E. Smith
            chunk_size=500,
            text_only=True  # Boolean flag for TEXT_ONLY
        ):
            if hasattr(response, 'text'):
                print(f"Phrase: {response.text[:100]}...")
            elif hasattr(response, 'message'):
                print(f"Progress: {response.message}")

        # File classification using function interface
        with open("alice_wonderland.epub", "rb") as f:
            epub_bytes = f.read()
        classification = await client.classify(content_bytes=epub_bytes)
        print(f"Classified as: {classification.classification.content_type}")
        print(f"Confidence: {classification.classification.confidence:.2%}")

        # Streaming PDF extraction
        pages = []
        async for stream_response in client.stream_extract_pdf(
            pdf_url="https://example.com/document.pdf",
            start_page=1,
            num_pages=10
        ):
            if hasattr(stream_response, 'page_data'):
                pages.append(stream_response.page_data)
                print(f"Page {stream_response.document_page}: {len(stream_response.page_data.text_blocks)} elements")
            elif hasattr(stream_response, 'total_pages'):
                print(f"Processing {stream_response.total_pages} pages...")

asyncio.run(main())
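
The async example above consumes its streams one after another. If independent streams need to run at the same time, `asyncio.gather` is one option. The sketch below shows the idea with `fake_stream`, a stand-in async generator written for this example; whether the real client supports several concurrent streams on one instance is an assumption to verify.

```python
import asyncio
from types import SimpleNamespace


async def fake_stream(name, count):
    """Stand-in for an async streaming call: progress items, then a final result."""
    for i in range(count):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield SimpleNamespace(message=f"{name}: step {i + 1}/{count}")
    yield SimpleNamespace(summary=f"{name} done")


async def last_result(stream):
    """Drain a stream and return the last item that has a `summary` attribute."""
    final = None
    async for item in stream:
        if hasattr(item, "summary"):
            final = item
    return final


async def run_both():
    # gather() drains both streams concurrently and preserves argument order.
    return await asyncio.gather(
        last_result(fake_stream("a", 2)),
        last_result(fake_stream("b", 3)),
    )


results = asyncio.run(run_both())
print([r.summary for r in results])
```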

Command Line Interface

The CLI provides a rich, interactive interface for text processing operations:

Citation Finding

# Find citations in a JSONL file
bookwyrm cite "What is the main theme?" chunks.jsonl

# Save results to JSON
bookwyrm cite "What is the main theme?" chunks.jsonl --output results.json

# Use a URL as source
bookwyrm cite "What is the main theme?" --url https://example.com/chunks.jsonl

# Use --file option instead of positional argument
bookwyrm cite "What is the main theme?" --file chunks.jsonl

# Process only a subset of chunks
bookwyrm cite "What is the main theme?" chunks.jsonl --start 10 --limit 100

# Use non-streaming mode
bookwyrm cite "What is the main theme?" chunks.jsonl --no-stream
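
The commands above read chunks from a JSONL file. A minimal sketch of producing and slicing such a file follows. The field names (`text`, `start_char`, `end_char`) are an assumption based on the `TextSpan` fields in the Python examples, and treating `--start` as a zero-based offset is also an assumption; check the full documentation for the exact schema.

```python
import json
import tempfile
from pathlib import Path


def write_chunks_jsonl(path, chunks):
    """Write one JSON object per line (field names are an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")


def read_chunks_jsonl(path, start=0, limit=None):
    """Read chunks, skipping `start` rows and returning at most `limit` rows."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    end = None if limit is None else start + limit
    return rows[start:end]


chunks = [
    {"text": "This is the first chunk.", "start_char": 0, "end_char": 24},
    {"text": "This is the second chunk.", "start_char": 25, "end_char": 50},
]
path = Path(tempfile.mkdtemp()) / "chunks.jsonl"
write_chunks_jsonl(path, chunks)
print(read_chunks_jsonl(path, start=1, limit=1))
```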

Phrasal Text Processing

# Process text from a URL (Triplanetary by E. E. Smith from Project Gutenberg)
bookwyrm phrasal --url "https://www.gutenberg.org/cache/epub/32706/pg32706.txt" --chunk-size 1000 --offsets --output triplanetary_phrases.jsonl

# Process text from a file using boolean flags
bookwyrm phrasal --file document.txt --offsets --output phrases.jsonl

# Process text directly with text-only output
bookwyrm phrasal "This is some text to analyze for phrases." --text-only

# Traditional format option still works
bookwyrm phrasal --file document.txt --format with_offsets --output phrases.jsonl

# Use different SpaCy models
bookwyrm phrasal --file document.txt --spacy-model en_core_web_lg --offsets

File Classification

# Classify a URL resource (EPUB from Project Gutenberg)
bookwyrm classify --url "https://www.gutenberg.org/ebooks/18857.epub3.images" --output classification.json

# Classify a local file
bookwyrm classify --file document.pdf --output results.json

# Classify text content directly
bookwyrm classify "import pandas as pd\ndf = pd.DataFrame()" --filename "script.py"

# Classify with filename hint for better accuracy
bookwyrm classify --url "https://example.com/data" --filename "data.json"

# Note: Binary files are automatically detected and base64-encoded when using --file option

PDF Structure Extraction

# Extract structured data from a local PDF file (with streaming progress)
bookwyrm extract-pdf document.pdf --output extracted_data.json

# Extract from a PDF URL with streaming progress
bookwyrm extract-pdf --url "https://example.com/document.pdf" --output results.json

# Use --file option instead of positional argument
bookwyrm extract-pdf --file document.pdf --output data.json

# Extract specific page ranges
bookwyrm extract-pdf document.pdf --start-page 5 --num-pages 10 --output pages_5_to_14.json

# Extract from page 10 to end of document
bookwyrm extract-pdf document.pdf --start-page 10 --output from_page_10.json

# Use non-streaming mode (no progress bar)
bookwyrm extract-pdf document.pdf --no-stream --output results.json

# Show detailed extraction results with verbose output
bookwyrm extract-pdf document.pdf --verbose --output detailed_results.json

# Use custom PDF extraction API endpoint
bookwyrm extract-pdf document.pdf --base-url "http://localhost:8000" --output results.json

# Auto-save with generated filename (no --output needed)
bookwyrm extract-pdf my_document.pdf --start-page 5 --num-pages 3
# Saves to: my_document_pages_5-7_extracted.json
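
The auto-generated filename above can be reproduced with a few lines of Python. This is a sketch inferred from that single example, not the CLI's actual code; `auto_output_name` is a hypothetical name, and the naming used when no page range is given is not covered.

```python
from pathlib import Path


def auto_output_name(pdf_path, start_page, num_pages):
    """Build '<stem>_pages_<first>-<last>_extracted.json' for a page range."""
    last_page = start_page + num_pages - 1  # the range is inclusive
    return f"{Path(pdf_path).stem}_pages_{start_page}-{last_page}_extracted.json"


print(auto_output_name("my_document.pdf", start_page=5, num_pages=3))
```

The same arithmetic explains the earlier example: `--start-page 5 --num-pages 10` covers pages 5 through 14.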

Summarization

# Summarize a JSONL file of phrases
bookwyrm summarize phrases.jsonl --output summary.json

# Include debug information
bookwyrm summarize phrases.jsonl --debug --max-tokens 5000

Global Options

All commands support these options:

# Set API key and base URL for individual commands
bookwyrm phrasal --api-key YOUR_KEY --base-url https://api.bookwyrm.ai:443 --url "https://example.com/text.txt"

# Enable verbose output (per command)
bookwyrm cite --verbose "Question?" chunks.jsonl

# Use environment variables (recommended)
export BOOKWYRM_API_URL="https://api.bookwyrm.ai:443"
export BOOKWYRM_API_KEY="your-api-key"
export BOOKWYRM_PDF_API_URL="https://pdf-api.bookwyrm.ai:443"  # Optional: separate PDF API endpoint
bookwyrm phrasal --url "https://example.com/text.txt"

Note: API key and base URL options are available on each command individually, not as global app-level options. Using environment variables is the recommended approach for setting these values across all commands.

Environment Variables

Set these environment variables for convenience:

export BOOKWYRM_API_KEY="your-api-key"
export BOOKWYRM_API_URL="https://api.bookwyrm.ai:443"
export BOOKWYRM_PDF_API_URL="https://pdf-api.bookwyrm.ai:443"  # Optional: separate PDF API endpoint
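
For scripts that need the same endpoints as the CLI, the variables above can be read directly. The sketch below is a hypothetical helper, not library code. It assumes that PDF requests fall back to the main API URL when `BOOKWYRM_PDF_API_URL` is unset, which the "optional" comment suggests but does not state.

```python
import os

DEFAULT_API_URL = "https://api.bookwyrm.ai:443"  # URL used in the examples above


def resolve_endpoints(env=None):
    """Return (api_url, pdf_api_url) from environment-style settings.

    Assumption: without BOOKWYRM_PDF_API_URL, the main API URL is used for PDFs.
    """
    env = os.environ if env is None else env
    api_url = env.get("BOOKWYRM_API_URL", DEFAULT_API_URL).rstrip("/")
    pdf_api_url = env.get("BOOKWYRM_PDF_API_URL", api_url).rstrip("/")
    return api_url, pdf_api_url


print(resolve_endpoints({}))
print(resolve_endpoints({"BOOKWYRM_PDF_API_URL": "https://pdf-api.bookwyrm.ai:443"}))
```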

Development

This project supports both uv and pip for development:

# With uv
uv sync
uv run pytest integration/
uv run bookwyrm --help

# With pip
pip install -r requirements-integration.txt
pytest integration/
bookwyrm --help

Running Tests

# Run all integration tests
pytest integration/

# Run specific test suites
pytest integration/ -k test_cli
pytest integration/ -k test_library

# Run specific features
pytest integration/ -m cite
pytest integration/ -m summarize

# Run with tox (recommended)
tox -e dev-local
tox -e dev-local-cli-cite

API Reference

Models

TextSpan: Represents a text span with start/end character positions
CitationRequest: Request model for citation processing
Citation: A found citation with quality score and reasoning
CitationResponse: Response containing multiple citations
UsageInfo: Token usage and cost information
ClassifyRequest: Request model for file classification
ClassifyResponse: Response containing classification results
FileClassification: Detailed classification information
PDFExtractRequest: Request model for PDF structure extraction
PDFExtractResponse: Response containing extracted PDF data
PDFPage: Individual page data with text elements
PDFTextElement: Text element with position and confidence
StreamingPDFResponse: Union type for streaming PDF responses
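
Since `TextSpan` carries character positions, offsets are easy to get wrong when written by hand. The sketch below computes them from a list of chunks. `Span` is a stand-in dataclass with the same three fields, used so the code runs without the package installed, and it assumes `end_char` is exclusive (so `document[start_char:end_char]` equals the span text); confirm that convention against the full documentation.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in with the same three fields as TextSpan in the usage examples."""

    text: str
    start_char: int
    end_char: int


def spans_from_chunks(chunks, separator=" "):
    """Assign offsets as if the chunks were joined by `separator`.

    Assumption: end_char is exclusive.
    """
    spans, position = [], 0
    for chunk in chunks:
        spans.append(Span(chunk, position, position + len(chunk)))
        position += len(chunk) + len(separator)
    return spans


chunks = ["This is the first chunk.", "This is the second chunk."]
document = " ".join(chunks)
spans = spans_from_chunks(chunks)
for span in spans:
    # Each span must slice back to its own text.
    assert document[span.start_char:span.end_char] == span.text
    print(span.start_char, span.end_char)
```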

Clients

BookWyrmClient: Synchronous client with get_citations(), stream_citations(), classify(), extract_pdf(), stream_extract_pdf(), and other methods
AsyncBookWyrmClient: Asynchronous client with async versions of the same methods

Exceptions

BookWyrmClientError: Base exception class
BookWyrmAPIError: API-specific errors with status codes
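
Because API errors carry status codes, callers can retry transient server failures and fail fast on everything else. The sketch below shows one way to do that. `ClientError` and `APIError` are stand-ins for the two exception classes listed above, and the `status_code` attribute name is an assumption; check the real class before relying on it.

```python
import time


class ClientError(Exception):
    """Stand-in for the library's base exception."""


class APIError(ClientError):
    """Stand-in for an API error; the `status_code` attribute name is assumed."""

    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry on 5xx API errors with exponential backoff; re-raise anything else."""
    for attempt in range(attempts):
        try:
            return func()
        except APIError as exc:
            retryable = 500 <= exc.status_code < 600
            if not retryable or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


calls = []


def flaky():
    """Fail twice with a 503, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise APIError("server busy", status_code=503)
    return "ok"


# A no-op sleep keeps the demonstration instant.
print(call_with_retry(flaky, sleep=lambda seconds: None))
```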

License

See LICENSE file for details.

Release history

- 0.1.21 (Dec 17, 2025)
- 0.1.20 (Dec 17, 2025)
- 0.1.19 (Nov 13, 2025)
- 0.1.18 (Nov 13, 2025)
- 0.1.17 (Nov 12, 2025)
- 0.1.16 (Nov 12, 2025)
- 0.1.15 (Nov 11, 2025)
- 0.1.13 (Nov 11, 2025)
- 0.1.12 (Oct 20, 2025)
- 0.1.11 (Oct 16, 2025)
- 0.1.10 (Oct 15, 2025)
- 0.1.9 (Oct 8, 2025)
- 0.1.8 (Oct 4, 2025), this version
- 0.1.7 (Oct 3, 2025)
- 0.1.6 (Oct 3, 2025)
- 0.1.5 (Oct 3, 2025)
- 0.1.4 (Oct 1, 2025)
- 0.1.3 (Sep 30, 2025)
- 0.1.2 (Sep 29, 2025)
- 0.1.1 (Sep 27, 2025)
- 0.1.0 (Sep 25, 2025)

Download files

Source Distribution

bookwyrm-0.1.8.tar.gz (147.1 kB)

Uploaded Oct 4, 2025 Source

Built Distribution

bookwyrm-0.1.8-py3-none-any.whl (48.7 kB)

Uploaded Oct 4, 2025 Python 3

File details

Details for the file bookwyrm-0.1.8.tar.gz.

File metadata

Download URL: bookwyrm-0.1.8.tar.gz
Upload date: Oct 4, 2025
Size: 147.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bookwyrm-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`11983049696eefd20898ddf133692cb47424f51cc80f55d75d34d8b2af4db5f1`
MD5	`9947658dd49c6761cad92c2a4924aead`
BLAKE2b-256	`49fa5cc79899b4edf8cb8f01c1668b858588297d9921c2578506133f11c31ab4`

Provenance

The following attestation bundles were made for bookwyrm-0.1.8.tar.gz:

Publisher: publish-to-pypi.yml on scidonia/bookwyrm-client

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bookwyrm-0.1.8.tar.gz
- Subject digest: 11983049696eefd20898ddf133692cb47424f51cc80f55d75d34d8b2af4db5f1
- Sigstore transparency entry: 584023582
- Sigstore integration time: Oct 4, 2025
Source repository:
- Permalink: scidonia/bookwyrm-client@7ba8601616f63fdc6d2ac8a280a7b7393a05101e
- Branch / Tag: refs/tags/v0.1.8
- Owner: https://github.com/scidonia
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@7ba8601616f63fdc6d2ac8a280a7b7393a05101e
- Trigger Event: push

File details

Details for the file bookwyrm-0.1.8-py3-none-any.whl.

File metadata

Download URL: bookwyrm-0.1.8-py3-none-any.whl
Upload date: Oct 4, 2025
Size: 48.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bookwyrm-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2429baeca48ef4955a72d017809bdcec150735a991f14ae3ce0a7802f1229a53`
MD5	`15c35a0d375ae5b28995be8f20506440`
BLAKE2b-256	`fd72869401efdc49761abfd689fb41d1fb2910ffe6ef40a67a489e14eacf6207`

Provenance

The following attestation bundles were made for bookwyrm-0.1.8-py3-none-any.whl:

Publisher: publish-to-pypi.yml on scidonia/bookwyrm-client

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bookwyrm-0.1.8-py3-none-any.whl
- Subject digest: 2429baeca48ef4955a72d017809bdcec150735a991f14ae3ce0a7802f1229a53
- Sigstore transparency entry: 584023583
- Sigstore integration time: Oct 4, 2025
Source repository:
- Permalink: scidonia/bookwyrm-client@7ba8601616f63fdc6d2ac8a280a7b7393a05101e
- Branch / Tag: refs/tags/v0.1.8
- Owner: https://github.com/scidonia
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@7ba8601616f63fdc6d2ac8a280a7b7393a05101e
- Trigger Event: push
