A client library for BookWyrm
bookwyrm
A Python client library and CLI designed to accelerate the development of RAG (Retrieval Augmented Generation) systems and AI agents. BookWyrm provides powerful text processing capabilities through a simple API, making it easy to build sophisticated document analysis and citation systems.
Key Capabilities
BookWyrm simplifies RAG and agent development by providing these core endpoints:
- Citation Finding - Automatically find and extract relevant citations from text chunks based on questions or queries
- Text Processing - Break down large documents into meaningful phrases and chunks with configurable sizing
- Document Classification - Intelligently classify files and content by format, type, and structure
- PDF Structure Extraction - Extract structured text data from PDF files using OCR with bounding box coordinates
- Summarization - Generate concise summaries from collections of text phrases or documents
- Streaming Support - Real-time processing with progress updates for all major operations
These capabilities work together to provide a complete pipeline for document ingestion, processing, and retrieval - the foundation of any RAG system.
Installation
Using uv (recommended for development)
# Clone the repository
git clone https://github.com/yourusername/bookwyrm.git
cd bookwyrm
# Install dependencies and create virtual environment
uv sync
# Install in development mode
uv pip install -e .
Using pip
# Install from PyPI
pip install bookwyrm
Getting an API Key
To use the BookWyrm client, you'll need an API key from bookwyrm.ai:
- Visit bookwyrm.ai
- Click on "Sign up for beta" to create an account
- Once registered, create an API key in the dashboard
- Set your API key as an environment variable or pass it directly to the client
export BOOKWYRM_API_KEY="your-api-key-here"
Usage
Python Library
The BookWyrm client provides both synchronous and asynchronous interfaces for citation finding, phrasal text processing, file classification, and summarization.
Synchronous Client
from bookwyrm import BookWyrmClient, CitationRequest, TextChunk, ProcessTextRequest, ResponseFormat, ClassifyRequest, SummarizeRequest
# Initialize client
client = BookWyrmClient(base_url="https://api.bookwyrm.ai:443", api_key="your-key")
# Citation finding
chunks = [
    TextChunk(text="This is the first chunk.", start_char=0, end_char=25),
    TextChunk(text="This is the second chunk.", start_char=26, end_char=52),
]
request = CitationRequest(
    chunks=chunks,
    question="What are the chunks about?",
    max_tokens_per_chunk=1000
)
# Get citations (non-streaming)
response = client.get_citations(request)
print(f"Found {response.total_citations} citations")
for citation in response.citations:
    print(f"Quality: {citation.quality}/4")
    print(f"Text: {citation.text}")
    print(f"Reasoning: {citation.reasoning}")
# Stream citations (real-time results)
for stream_response in client.stream_citations(request):
    if hasattr(stream_response, 'citation'):
        print(f"New citation: {stream_response.citation.text}")
    elif hasattr(stream_response, 'message'):
        print(f"Progress: {stream_response.message}")
# Phrasal text processing
phrasal_request = ProcessTextRequest(
    text_url="https://www.gutenberg.org/cache/epub/32706/pg32706.txt",  # Triplanetary by E. E. Smith
    chunk_size=1000,
    response_format=ResponseFormat.WITH_OFFSETS
)
for response in client.process_text(phrasal_request):
    if hasattr(response, 'text'):
        print(f"Phrase: {response.text[:100]}...")
    elif hasattr(response, 'message'):
        print(f"Progress: {response.message}")
# File classification
classify_request = ClassifyRequest(
    url="https://www.gutenberg.org/ebooks/18857.epub3.images",
    filename="alice_wonderland.epub"  # Optional hint
)
classification_response = client.classify(classify_request)
print(f"Format: {classification_response.classification.format_type}")
print(f"Content Type: {classification_response.classification.content_type}")
print(f"MIME Type: {classification_response.classification.mime_type}")
print(f"Confidence: {classification_response.classification.confidence:.2%}")
print(f"File Size: {classification_response.file_size:,} bytes")
# Classify local content
with open("document.txt", "r") as f:
    content = f.read()
local_classify_request = ClassifyRequest(
    content=content,
    filename="document.txt"
)
local_response = client.classify(local_classify_request)
print(f"Local file classified as: {local_response.classification.content_type}")
# Classify binary content (base64-encode it before sending)
import base64
with open("image.jpg", "rb") as f:
    binary_content = f.read()
encoded_content = base64.b64encode(binary_content).decode("ascii")
binary_classify_request = ClassifyRequest(
    content=encoded_content,
    content_encoding="base64",
    filename="image.jpg"
)
binary_response = client.classify(binary_classify_request)
print(f"Binary file classified as: {binary_response.classification.content_type}")
client.close()
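The streaming loop above tells response types apart with `hasattr`. As a minimal sketch, the same checks can be wrapped in a helper that separates citations from progress messages and drops low-quality citations. The helper name `collect_stream` and the `min_quality` parameter are hypothetical, the 0 to 4 quality scale is inferred from the `/4` in the printed output, and `SimpleNamespace` stand-ins replace real client responses so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def collect_stream(stream: Iterable[Any], min_quality: int = 0) -> Tuple[List[Any], List[str]]:
    """Split a citation stream into (citations, progress messages).

    Hypothetical helper: it mirrors the hasattr checks used in the loop above.
    """
    citations: List[Any] = []
    messages: List[str] = []
    for item in stream:
        if hasattr(item, "citation"):
            # Keep only citations at or above the requested quality.
            if item.citation.quality >= min_quality:
                citations.append(item.citation)
        elif hasattr(item, "message"):
            messages.append(item.message)
    return citations, messages


# Stand-in objects shaped like the streamed responses (assumed shape).
fake_stream = [
    SimpleNamespace(message="Processing chunk 1"),
    SimpleNamespace(citation=SimpleNamespace(text="First chunk.", quality=3)),
    SimpleNamespace(citation=SimpleNamespace(text="Weak match.", quality=1)),
]

citations, messages = collect_stream(fake_stream, min_quality=2)
print([c.text for c in citations])
print(messages)
```

With a real client, the first argument would be `client.stream_citations(request)`.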
Asynchronous Client
import asyncio
from bookwyrm import AsyncBookWyrmClient, CitationRequest, ProcessTextRequest, ResponseFormat, ClassifyRequest, SummarizeRequest
async def main():
    # Initialize async client
    async with AsyncBookWyrmClient(base_url="https://api.bookwyrm.ai:443", api_key="your-key") as client:
        # Citation finding
        request = CitationRequest(
            jsonl_url="https://example.com/chunks.jsonl",
            question="What is the main topic?",
        )
        response = await client.get_citations(request)
        print(f"Found {response.total_citations} citations")
        # Stream citations
        async for stream_response in client.stream_citations(request):
            if hasattr(stream_response, 'citation'):
                print(f"New citation: {stream_response.citation.text}")
        # Phrasal text processing
        phrasal_request = ProcessTextRequest(
            text_url="https://www.gutenberg.org/cache/epub/32706/pg32706.txt",  # Triplanetary by E. E. Smith
            chunk_size=500,
            response_format=ResponseFormat.TEXT_ONLY
        )
        async for response in client.process_text(phrasal_request):
            if hasattr(response, 'text'):
                print(f"Phrase: {response.text[:100]}...")
            elif hasattr(response, 'message'):
                print(f"Progress: {response.message}")
        # File classification
        classify_request = ClassifyRequest(
            url="https://www.gutenberg.org/ebooks/18857.epub3.images"
        )
        classification = await client.classify(classify_request)
        print(f"Classified as: {classification.classification.content_type}")
        print(f"Confidence: {classification.classification.confidence:.2%}")

asyncio.run(main())
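One reason to prefer the asynchronous client is that several requests can be in flight at once. The sketch below shows a bounded fan-out with `asyncio.gather` and a semaphore. `classify_many`, its `limit` parameter, and `FakeAsyncClient` are hypothetical names used for illustration, and whether a single `AsyncBookWyrmClient` instance tolerates concurrent calls is an assumption the README does not confirm.

```python
import asyncio
from typing import Any, List, Sequence


async def classify_many(client: Any, requests: Sequence[Any], limit: int = 4) -> List[Any]:
    """Run client.classify() on several requests, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def classify_one(request: Any) -> Any:
        async with semaphore:
            return await client.classify(request)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(classify_one(r) for r in requests))


class FakeAsyncClient:
    """Stand-in for AsyncBookWyrmClient so the sketch runs offline."""

    async def classify(self, request: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, like a network call would
        return f"classified:{request}"


results = asyncio.run(classify_many(FakeAsyncClient(), ["a.pdf", "b.epub", "c.txt"], limit=2))
print(results)
```

The semaphore keeps the number of simultaneous requests small, which may matter if the service applies rate limits.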
Command Line Interface
The CLI provides a rich, interactive interface for text processing operations:
Citation Finding
# Find citations in a JSONL file
bookwyrm cite "What is the main theme?" chunks.jsonl
# Save results to JSON
bookwyrm cite "What is the main theme?" chunks.jsonl --output results.json
# Use a URL as source
bookwyrm cite "What is the main theme?" --url https://example.com/chunks.jsonl
# Use --file option instead of positional argument
bookwyrm cite "What is the main theme?" --file chunks.jsonl
# Process only a subset of chunks
bookwyrm cite "What is the main theme?" chunks.jsonl --start 10 --limit 100
# Use non-streaming mode
bookwyrm cite "What is the main theme?" chunks.jsonl --no-stream
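The `cite` command reads chunks from a JSONL file, but the README does not spell out the schema. The sketch below assumes each line is a JSON object carrying the same fields as `TextChunk` (`text`, `start_char`, `end_char`), and that `end_char` is exclusive; both are assumptions to check against the actual service. `build_chunks` and `write_jsonl` are hypothetical helper names.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List


def build_chunks(texts: Iterable[str], separator: str = "\n") -> List[Dict[str, object]]:
    """Assign start_char/end_char offsets as if the texts were joined by separator.

    end_char is exclusive here; that convention is an assumption.
    """
    chunks: List[Dict[str, object]] = []
    position = 0
    for text in texts:
        chunks.append({"text": text, "start_char": position, "end_char": position + len(text)})
        position += len(text) + len(separator)
    return chunks


def write_jsonl(chunks: Iterable[Dict[str, object]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for chunk in chunks:
            handle.write(json.dumps(chunk, ensure_ascii=False) + "\n")


chunks = build_chunks(["This is the first chunk.", "This is the second chunk."])
path = os.path.join(tempfile.mkdtemp(), "chunks.jsonl")
write_jsonl(chunks, path)
print(path)
```

The resulting file can then be passed as the positional argument or via `--file`.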
Phrasal Text Processing
# Process text from a URL (Triplanetary by E. E. Smith from Project Gutenberg)
bookwyrm phrasal --url "https://www.gutenberg.org/cache/epub/32706/pg32706.txt" --chunk-size 1000 --output triplanetary_phrases.jsonl
# Process text from a file
bookwyrm phrasal --file document.txt --format with_offsets --output phrases.jsonl
# Process text directly
bookwyrm phrasal "This is some text to analyze for phrases." --format text_only
# Use a different spaCy model
bookwyrm phrasal --file document.txt --spacy-model en_core_web_lg
File Classification
# Classify a URL resource (EPUB from Project Gutenberg)
bookwyrm classify --url "https://www.gutenberg.org/ebooks/18857.epub3.images" --output classification.json
# Classify a local file
bookwyrm classify --file document.pdf --output results.json
# Classify text content directly
bookwyrm classify "import pandas as pd\ndf = pd.DataFrame()" --filename "script.py"
# Classify with filename hint for better accuracy
bookwyrm classify --url "https://example.com/data" --filename "data.json"
# Note: Binary files are automatically detected and base64-encoded when using --file option
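When calling the library directly, the same text-or-binary decision has to be made by hand. A minimal sketch of one way to do it: try to decode the bytes as UTF-8, and fall back to base64 otherwise. This heuristic is an assumption for illustration and is not necessarily how the CLI detects binary files; `build_classify_payload` is a hypothetical helper whose keys match the `ClassifyRequest` keyword arguments used in the library examples.

```python
import base64
from typing import Dict


def build_classify_payload(data: bytes, filename: str) -> Dict[str, str]:
    """Return keyword arguments shaped like the ClassifyRequest examples."""
    try:
        # Valid UTF-8 is sent as plain text.
        return {"content": data.decode("utf-8"), "filename": filename}
    except UnicodeDecodeError:
        # Anything else is treated as binary and base64-encoded.
        return {
            "content": base64.b64encode(data).decode("ascii"),
            "content_encoding": "base64",
            "filename": filename,
        }


text_payload = build_classify_payload(b"import pandas as pd\n", "script.py")
binary_payload = build_classify_payload(b"\xff\xd8\xff\xe0", "image.jpg")
print(text_payload)
print(binary_payload)
```

Under these assumptions, the result could be passed on as `ClassifyRequest(**payload)`.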
PDF Structure Extraction
# Extract structured data from a local PDF file
bookwyrm extract-pdf document.pdf --output extracted_data.json
# Extract from a PDF URL
bookwyrm extract-pdf --url "https://example.com/document.pdf" --output results.json
# Use --file option instead of positional argument
bookwyrm extract-pdf --file document.pdf --output data.json
# Show detailed extraction results
bookwyrm extract-pdf document.pdf --verbose --output detailed_results.json
# Use custom PDF extraction API endpoint
bookwyrm extract-pdf document.pdf --base-url "http://localhost:8000" --output results.json
Summarization
# Summarize a JSONL file of phrases
bookwyrm summarize phrases.jsonl --output summary.json
# Include debug information
bookwyrm summarize phrases.jsonl --debug --max-tokens 5000
Global Options
Every command accepts these options individually:
# Set API key and base URL for individual commands
bookwyrm phrasal --api-key YOUR_KEY --base-url https://api.bookwyrm.ai:443 --url "https://example.com/text.txt"
# Enable verbose output (per command)
bookwyrm cite --verbose "Question?" chunks.jsonl
# Use environment variables (recommended)
export BOOKWYRM_API_URL="https://api.bookwyrm.ai:443"
export BOOKWYRM_API_KEY="your-api-key"
bookwyrm phrasal --url "https://example.com/text.txt"
Note: API key and base URL options are available on each command individually, not as global app-level options. Using environment variables is the recommended approach for setting these values across all commands.
Environment Variables
Set these environment variables for convenience:
export BOOKWYRM_API_KEY="your-api-key"
export BOOKWYRM_API_URL="https://api.bookwyrm.ai:443"
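In application code, the same two variables can be read once and handed to the client constructor. The sketch below resolves settings with explicit arguments taking precedence over the environment. `resolve_settings` and `Settings` are hypothetical names, and falling back to `https://api.bookwyrm.ai:443` when `BOOKWYRM_API_URL` is unset is an assumption based on the URL used throughout this README; whether the library applies such a default itself is not stated.

```python
import os
from typing import Mapping, NamedTuple, Optional

# Assumed default, taken from the URL used in the examples above.
DEFAULT_API_URL = "https://api.bookwyrm.ai:443"


class Settings(NamedTuple):
    api_key: str
    base_url: str


def resolve_settings(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Settings:
    """Explicit arguments win; otherwise fall back to environment variables."""
    env = os.environ if env is None else env
    key = api_key or env.get("BOOKWYRM_API_KEY")
    if not key:
        raise ValueError("Set BOOKWYRM_API_KEY or pass api_key explicitly")
    url = base_url or env.get("BOOKWYRM_API_URL") or DEFAULT_API_URL
    return Settings(api_key=key, base_url=url)


# A plain dict stands in for os.environ so the example is reproducible.
settings = resolve_settings(env={"BOOKWYRM_API_KEY": "your-api-key"})
print(settings.base_url)
```

The result maps directly onto the constructor shown earlier: `BookWyrmClient(base_url=settings.base_url, api_key=settings.api_key)`.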
Development
This project supports both uv and pip for development:
# With uv
uv sync
uv run pytest
uv run bookwyrm --help
# With pip
pip install -r requirements-dev.txt
pytest
bookwyrm --help
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=bookwyrm
# Run async tests specifically
pytest -k "async"
API Reference
Models
- TextChunk: Represents a text chunk with start/end character positions
- CitationRequest: Request model for citation processing
- Citation: A found citation with quality score and reasoning
- CitationResponse: Response containing multiple citations
- UsageInfo: Token usage and cost information
- ClassifyRequest: Request model for file classification
- ClassifyResponse: Response containing classification results
- FileClassification: Detailed classification information
Clients
- BookWyrmClient: Synchronous client with get_citations(), stream_citations(), classify(), and other methods
- AsyncBookWyrmClient: Asynchronous client with async versions of the same methods
Exceptions
- BookWyrmClientError: Base exception class
- BookWyrmAPIError: API-specific errors with status codes
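Because API errors carry status codes, callers can separate transient failures from permanent ones. The following sketch retries on 429 and 5xx responses with exponential backoff. It defines stand-in exception classes named after the library's so that it runs without the package installed; the `status_code` attribute name, the choice of retryable codes, and the `call_with_retry` helper are all assumptions for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BookWyrmClientError(Exception):
    """Stand-in for the library's base exception."""


class BookWyrmAPIError(BookWyrmClientError):
    """Stand-in; the status_code attribute name is an assumption."""

    def __init__(self, message: str, status_code: Optional[int] = None) -> None:
        super().__init__(message)
        self.status_code = status_code


def call_with_retry(call: Callable[[], T], attempts: int = 3, delay: float = 0.5) -> T:
    """Retry on 429 and 5xx API errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except BookWyrmAPIError as error:
            retryable = error.status_code == 429 or (error.status_code or 0) >= 500
            # Re-raise permanent errors and the final failed attempt.
            if not retryable or attempt == attempts - 1:
                raise
            time.sleep(delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")


# Simulate one transient failure followed by success.
outcomes = iter([BookWyrmAPIError("busy", status_code=503), "ok"])


def flaky() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_retry(flaky, delay=0)
print(result)
```

With the real package, the stand-in classes would be replaced by `from bookwyrm import BookWyrmAPIError`, assuming that import path, and `call` would wrap something like `lambda: client.get_citations(request)`.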
License
See LICENSE file for details.