Multi-modal knowledge library with vector and full-text search for text, code, images, and PDFs
Project description
Librarian
A personal knowledge library for AI agents, built on Arcade for the Model Context Protocol (MCP).
Overview
Librarian provides AI agents with persistent storage for text, documents, and knowledge. Agents can store information and retrieve it later through semantic and keyword search, maintaining context across conversations.
graph LR
A[Agent Stores Info] --> B[Parser]
B --> C[Chunker]
C --> D[Embedder]
D --> E[(SQLite + vec)]
F[Agent Queries] --> G[Hybrid Search]
E --> G
G --> H[Relevant Context]
Features
- Persistent knowledge storage for AI agents
- SQLite storage with
sqlite-vecfor vector search - Full-text search using FTS5 with BM25 ranking
- Hybrid search combining semantic and keyword matching
- Max Marginal Relevance (MMR) for diverse results
- Configurable embedding models (local or OpenAI-compatible API)
- Header-aware text chunking with overlap
- Time-bounded search filters
- CLI and MCP server interfaces
Installation
git clone https://github.com/ArcadeAI/librarian.git
cd librarian
./setup.sh
Or install manually:
uv pip install -e ".[dev]"
CLI Usage
# Add files to the library
libr add ~/notes
# Search the library
libr search "machine learning concepts"
# List sources
libr list
# View library statistics
libr index
# Rebuild the index
libr index build
MCP Server
Start the server for AI assistant integration:
# stdio transport (Claude Desktop, CLI)
libr serve stdio
# HTTP transport (Cursor, VS Code)
libr serve http --port 8000
See the Arcade MCP documentation for integration details.
Available Tools
| Tool | Description |
|---|---|
Librarian_SearchLibrary |
Search the library with hybrid vector + keyword search |
Librarian_SemanticSearchLibrary |
Find content by meaning (semantic similarity) |
Librarian_KeywordSearchLibrary |
Find content by exact keywords |
Librarian_SearchLibraryByDates |
Search within a date range |
Librarian_AddToLibrary |
Store new content in the library |
Librarian_UpdateLibraryDoc |
Update existing content |
Librarian_ReadFromLibrary |
Read full document content |
Librarian_RemoveFromLibrary |
Remove content from the library |
Librarian_ListLibraryContents |
List all stored content |
Librarian_IndexDirectoryToLibrary |
Bulk import files |
Librarian_GetLibrarySources |
List sources with document/chunk counts |
Librarian_GetLibraryStats |
Overall library statistics |
Configuration
Set via environment variables:
| Variable | Default | Description |
|---|---|---|
DOCUMENTS_PATH |
./documents |
Root directory for files |
DATABASE_PATH |
~/.librarian/index.db |
SQLite database location |
EMBEDDING_PROVIDER |
openai |
local or openai |
EMBEDDING_MODEL |
all-MiniLM-L6-v2 |
Local model name |
OPENAI_API_BASE |
http://localhost:7171/v1 |
OpenAI-compatible API URL |
OPENAI_EMBEDDING_MODEL |
qwen3-embedding-06b |
API model name |
CHUNK_SIZE |
512 |
Max characters per chunk |
CHUNK_OVERLAP |
50 |
Overlap between chunks |
SEARCH_LIMIT |
10 |
Default results limit |
MMR_LAMBDA |
0.5 |
MMR diversity (0=diverse, 1=relevant) |
HYBRID_ALPHA |
0.7 |
Vector vs keyword weight (1=vector only) |
Project Structure
librarian/
├── cli.py # Command-line interface
├── server.py # MCP server and tool definitions
├── config.py # Configuration management
├── indexing.py # Document indexing service
├── types.py # Shared type definitions
├── storage/
│ ├── database.py # SQLite operations
│ ├── vector_store.py # sqlite-vec search
│ └── fts_store.py # FTS5 search
├── processing/
│ ├── embed/ # Embedding providers
│ ├── parsers/ # Document parsers
│ └── transform/ # Text chunking
├── retrieval/
│ └── search.py # Hybrid search + MMR
└── utils/
└── timeframe.py # Time filter utilities
Development
make install # Install dependencies
make test # Run tests
make lint # Run linter
make format # Format code
make typecheck # Type checking
make check # All checks
make evals # Run evaluations
Resources
- Arcade.dev - Build AI-native applications
- Arcade Documentation - Integration guides and API reference
License
MIT License - see LICENSE for details.
Contact
- Email: contact@arcade.dev
- Website: arcade.dev
Current Limitations & Roadmap
Image Search Limitations
Images are currently indexed by metadata only (filename, format, dimensions, EXIF data). The system does not yet understand visual content.
What works now:
- Search by filename:
search("diagram.png") - Search by format:
search("PNG") - Filter results by asset type
What doesn't work yet:
- Visual content search:
search("architecture diagram")won't understand what's IN the image - Text within images: Can't find text that appears inside screenshots or diagrams
- Image-to-image similarity: Can't find visually similar images
Multi-Modal Roadmap
| Phase | Feature | Status | Impact | Effort | ETA |
|---|---|---|---|---|---|
| 1 | Documentation & Config | In Progress | Set expectations | Low | v0.6.0 |
| Document current limitations | Complete | Users understand metadata-only indexing | - | - | |
| Add configuration structure | Planned | Prepare for future embedding models | - | - | |
| 2 | OCR for Images | Planned | Extract text FROM images | High | v0.6.0 |
| Add pytesseract integration | Planned | Search text in screenshots | Low | 2-3 days | |
| Enable text extraction from diagrams | Planned | Find labels, annotations in images | - | - | |
| Search scanned documents | Planned | Index PDF images and photos | - | - | |
| 3 | CLIP Visual Embeddings | Planned | True visual understanding | Very High | v0.7.0 |
| Add CLIP model integration | Planned | Text-to-image semantic search | Medium | 5-7 days | |
| Create vision vector table | Planned | Separate 512-dim embeddings | - | - | |
| Implement search_images tool | Planned | Find images by visual content | - | - | |
| 4 | CodeBERT for Code | Planned | Better code search | Medium | v0.8.0 |
| Add CodeBERT embeddings | Planned | Improved semantic code search | Medium | 4-5 days | |
| Cross-language similarity | Planned | Find similar algorithms across languages | - | - | |
| 5 | Cross-Modal Search | Planned | Unified search experience | High | v1.0.0 |
| Merge results across modalities | Planned | Single query finds all asset types | High | 3-4 days | |
| Score normalization | Planned | Fair ranking across embedding spaces | - | - |
Next Steps
Immediate (v0.6.0 - This Month):
- Add OCR support with pytesseract
- Enable text extraction from images
- Document installation and configuration
- Test with screenshots and diagrams
Short-term (v0.7.0 - Next Month):
- Evaluate OCR adoption and usage patterns
- Decide on CLIP investment based on image search demand
- If validated: Implement CLIP visual embeddings
- Add text-to-image semantic search
Long-term (v0.8.0+):
- CodeBERT for improved code search (if needed)
- Cross-modal unified search
- Audio transcription (Whisper)
- Video frame extraction
Decision Points:
- After OCR: Measure adoption before investing in CLIP
- After CLIP: Assess if CodeBERT adds value over text embeddings
- After individual modalities: Evaluate need for unified cross-modal search
Installing Optional Features
# OCR support (v0.6.0+)
# Enabled by default - requires Tesseract
uv pip install -e ".[ocr]"
brew install tesseract # macOS
# To disable: export ENABLE_OCR=false
# Vision support with CLIP (v0.7.0+)
uv pip install -e ".[vision]"
export ENABLE_VISION_EMBEDDINGS=true
# Code embeddings with CodeBERT (v0.8.0+)
uv pip install -e ".[code]"
export ENABLE_CODE_EMBEDDINGS=true
# All features
uv pip install -e ".[all]"
Vision and code embeddings are opt-in and disabled by default. OCR is enabled by default (v0.6.0+).
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agent_library-0.8.0.tar.gz.
File metadata
- Download URL: agent_library-0.8.0.tar.gz
- Upload date:
- Size: 133.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c7d563f1f76108dce626533b2e9c1416d31c3f338d0e31988e608b376e8dc4f4
|
|
| MD5 |
efc7fdc713ee1982044458392b3291a2
|
|
| BLAKE2b-256 |
dbdf7514fbd87b2a6f5b97ed32246bf9d8f45bc43da05542edbe922c60a2c63e
|
Provenance
The following attestation bundles were made for agent_library-0.8.0.tar.gz:
Publisher:
release.yml on arcadeai-labs/librarian
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
agent_library-0.8.0.tar.gz -
Subject digest:
c7d563f1f76108dce626533b2e9c1416d31c3f338d0e31988e608b376e8dc4f4 - Sigstore transparency entry: 855528115
- Sigstore integration time:
-
Permalink:
arcadeai-labs/librarian@8ce41c77acec9ee93f8836d9bc501f6e6a779475 -
Branch / Tag:
refs/tags/v0.8.0 - Owner: https://github.com/arcadeai-labs
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@8ce41c77acec9ee93f8836d9bc501f6e6a779475 -
Trigger Event:
push
-
Statement type:
File details
Details for the file agent_library-0.8.0-py3-none-any.whl.
File metadata
- Download URL: agent_library-0.8.0-py3-none-any.whl
- Upload date:
- Size: 94.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4c700186e7e7374793ee1fa77ef4d747fe1430ca88f93c0a31f619f2ef620472
|
|
| MD5 |
9f78cf4856f4e88d4d81dbdaf26c303a
|
|
| BLAKE2b-256 |
808ea830689a7b83077f79ac88a0c632d79394fb3b97675fc39add9d039423fa
|
Provenance
The following attestation bundles were made for agent_library-0.8.0-py3-none-any.whl:
Publisher:
release.yml on arcadeai-labs/librarian
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
agent_library-0.8.0-py3-none-any.whl -
Subject digest:
4c700186e7e7374793ee1fa77ef4d747fe1430ca88f93c0a31f619f2ef620472 - Sigstore transparency entry: 855528164
- Sigstore integration time:
-
Permalink:
arcadeai-labs/librarian@8ce41c77acec9ee93f8836d9bc501f6e6a779475 -
Branch / Tag:
refs/tags/v0.8.0 - Owner: https://github.com/arcadeai-labs
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@8ce41c77acec9ee93f8836d9bc501f6e6a779475 -
Trigger Event:
push
-
Statement type: