Multi-modal knowledge library with vector and full-text search for text, code, images, and PDFs

Librarian

A personal knowledge library for AI agents, built on Arcade for the Model Context Protocol (MCP).

Overview

Librarian provides AI agents with persistent storage for text, documents, and knowledge. Agents can store information and retrieve it later through semantic and keyword search, maintaining context across conversations.

graph LR
    A[Agent Stores Info] --> B[Parser]
    B --> C[Chunker]
    C --> D[Embedder]
    D --> E[(SQLite + vec)]
    F[Agent Queries] --> G[Hybrid Search]
    E --> G
    G --> H[Relevant Context]
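
The Chunker stage in the diagram can be sketched roughly as follows. This is a minimal illustration, not Librarian's actual implementation: the function name `chunk_markdown`, the Markdown-header split, and the returned dictionary shape are all assumptions, and the defaults simply mirror the documented `CHUNK_SIZE` and `CHUNK_OVERLAP` values.

```python
import re


def chunk_markdown(text: str, chunk_size: int = 512, overlap: int = 50) -> list[dict]:
    """Split text at Markdown headers, then window each section with overlap.

    Hypothetical helper for illustration; sizes are measured in characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Split just before each header line so a header stays with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    step = chunk_size - overlap
    chunks = []

    for section in sections:
        section = section.strip()
        if not section:
            continue
        first_line = section.splitlines()[0]
        header = first_line.lstrip("#").strip() if first_line.startswith("#") else ""

        # Slide a fixed-size window; consecutive windows share `overlap` characters.
        for start in range(0, len(section), step):
            chunks.append({"header": header, "text": section[start:start + chunk_size]})
            if start + chunk_size >= len(section):
                break

    return chunks


if __name__ == "__main__":
    doc = "# Intro\n" + "a" * 600 + "\n## Usage\nshort"
    for chunk in chunk_markdown(doc):
        print(chunk["header"], len(chunk["text"]))
```

Keeping the header with every chunk is one way to preserve context for the embedder when a long section is split into several windows.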

Features

Persistent knowledge storage for AI agents
SQLite storage with sqlite-vec for vector search
Full-text search using FTS5 with BM25 ranking
Hybrid search combining semantic and keyword matching
Maximal Marginal Relevance (MMR) for diverse results
Configurable embedding models (local or OpenAI-compatible API)
Header-aware text chunking with overlap
Time-bounded search filters
CLI and MCP server interfaces
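
To make the MMR feature concrete, here is a small sketch of the standard Maximal Marginal Relevance selection loop. It is not taken from Librarian's code: the names `cosine` and `mmr` and the `(id, vector)` candidate format are assumptions, and `lam` plays the role of the documented `MMR_LAMBDA` setting (0 = diverse, 1 = relevant).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, candidates, k=10, lam=0.5):
    """Pick up to k ids from (id, vector) pairs, trading relevance against redundancy."""
    remaining = list(candidates)
    selected = []

    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)

    return [doc_id for doc_id, _ in selected]


if __name__ == "__main__":
    query = [1.0, 0.2]
    docs = [("a", [1.0, 0.0]), ("b", [1.0, 0.05]), ("c", [0.6, 0.8])]
    print(mmr(query, docs, k=2, lam=1.0))  # pure relevance
    print(mmr(query, docs, k=2, lam=0.5))  # relevance balanced with diversity
```

In this toy data `a` and `b` are near-duplicates, so lowering `lam` lets the less similar `c` displace one of them in the top two.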

Installation

git clone https://github.com/ArcadeAI/librarian.git
cd librarian
./setup.sh

Or install manually:

uv pip install -e ".[dev]"

CLI Usage

# Add files to the library
libr add ~/notes

# Search the library
libr search "machine learning concepts"

# List sources
libr list

# View library statistics
libr index

# Rebuild the index
libr index build

MCP Server

Start the server for AI assistant integration:

# stdio transport (Claude Desktop, CLI)
libr serve stdio

# HTTP transport (Cursor, VS Code)
libr serve http --port 8000

See the Arcade MCP documentation for integration details.

Available Tools

| Tool | Description |
| --- | --- |
| `Librarian_SearchLibrary` | Search the library with hybrid vector + keyword search |
| `Librarian_SemanticSearchLibrary` | Find content by meaning (semantic similarity) |
| `Librarian_KeywordSearchLibrary` | Find content by exact keywords |
| `Librarian_SearchLibraryByDates` | Search within a date range |
| `Librarian_AddToLibrary` | Store new content in the library |
| `Librarian_UpdateLibraryDoc` | Update existing content |
| `Librarian_ReadFromLibrary` | Read full document content |
| `Librarian_RemoveFromLibrary` | Remove content from the library |
| `Librarian_ListLibraryContents` | List all stored content |
| `Librarian_IndexDirectoryToLibrary` | Bulk import files |
| `Librarian_GetLibrarySources` | List sources with document/chunk counts |
| `Librarian_GetLibraryStats` | Overall library statistics |

Configuration

Set via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `DOCUMENTS_PATH` | `./documents` | Root directory for files |
| `DATABASE_PATH` | `~/.librarian/index.db` | SQLite database location |
| `EMBEDDING_PROVIDER` | `openai` | `local` or `openai` |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Local model name |
| `OPENAI_API_BASE` | `http://localhost:7171/v1` | OpenAI-compatible API URL |
| `OPENAI_EMBEDDING_MODEL` | `qwen3-embedding-06b` | API model name |
| `CHUNK_SIZE` | `512` | Max characters per chunk |
| `CHUNK_OVERLAP` | `50` | Overlap between chunks |
| `SEARCH_LIMIT` | `10` | Default results limit |
| `MMR_LAMBDA` | `0.5` | MMR diversity (0=diverse, 1=relevant) |
| `HYBRID_ALPHA` | `0.7` | Vector vs keyword weight (1=vector only) |
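
As an illustration of how `HYBRID_ALPHA` and `SEARCH_LIMIT` might be applied, the sketch below reads both from the environment and blends per-document scores linearly. The helper names `load_search_settings` and `hybrid_rank`, the linear formula, and the assumption that both score sets are already scaled to the 0..1 range are hypothetical; the README only states that a value of 1 means vector only.

```python
import os


def load_search_settings(env=os.environ):
    """Read search settings, falling back to the documented defaults."""
    return {
        "alpha": float(env.get("HYBRID_ALPHA", "0.7")),
        "limit": int(env.get("SEARCH_LIMIT", "10")),
    }


def hybrid_rank(vector_scores, keyword_scores, alpha, limit):
    """Blend scores as alpha * vector + (1 - alpha) * keyword (assumed formula).

    Both inputs map document id -> score in [0, 1]; a missing score counts as 0.
    """
    doc_ids = set(vector_scores) | set(keyword_scores)
    blended = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest blended score first; ties broken by id for stable output.
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


if __name__ == "__main__":
    settings = load_search_settings()
    vector = {"a": 0.9, "b": 0.2}
    keyword = {"b": 1.0, "c": 0.5}
    for doc_id, score in hybrid_rank(vector, keyword, settings["alpha"], settings["limit"]):
        print(doc_id, round(score, 3))
```

Setting `HYBRID_ALPHA=1` in this sketch ignores keyword scores entirely, which matches the "1=vector only" note in the table.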

Project Structure

librarian/
├── cli.py           # Command-line interface
├── server.py        # MCP server and tool definitions
├── config.py        # Configuration management
├── indexing.py      # Document indexing service
├── types.py         # Shared type definitions
├── storage/
│   ├── database.py  # SQLite operations
│   ├── vector_store.py  # sqlite-vec search
│   └── fts_store.py     # FTS5 search
├── processing/
│   ├── embed/       # Embedding providers
│   ├── parsers/     # Document parsers
│   └── transform/   # Text chunking
├── retrieval/
│   └── search.py    # Hybrid search + MMR
└── utils/
    └── timeframe.py # Time filter utilities

Development

make install    # Install dependencies
make test       # Run tests
make lint       # Run linter
make format     # Format code
make typecheck  # Type checking
make check      # All checks
make evals      # Run evaluations

Resources

Arcade.dev - Build AI-native applications
Arcade Documentation - Integration guides and API reference

License

MIT License - see LICENSE for details.

Contact

Email: contact@arcade.dev
Website: arcade.dev

Current Limitations & Roadmap

Image Search Limitations

Images are currently indexed by metadata only (filename, format, dimensions, EXIF data). The system does not yet understand visual content.

What works now:

Search by filename: search("diagram.png")
Search by format: search("PNG")
Filter results by asset type

What doesn't work yet:

Visual content search: search("architecture diagram") does not match images by what they depict
Text within images: text inside screenshots or diagrams is not searchable
Image-to-image similarity: visually similar images cannot be found

Multi-Modal Roadmap

| Phase | Feature | Status | Impact | Effort | ETA |
| --- | --- | --- | --- | --- | --- |
| 1 | Documentation & Config | In Progress | Set expectations | Low | v0.6.0 |
| | Document current limitations | Complete | Users understand metadata-only indexing | - | - |
| | Add configuration structure | Planned | Prepare for future embedding models | - | - |
| 2 | OCR for Images | Planned | Extract text FROM images | High | v0.6.0 |
| | Add pytesseract integration | Planned | Search text in screenshots | Low | 2-3 days |
| | Enable text extraction from diagrams | Planned | Find labels, annotations in images | - | - |
| | Search scanned documents | Planned | Index PDF images and photos | - | - |
| 3 | CLIP Visual Embeddings | Planned | True visual understanding | Very High | v0.7.0 |
| | Add CLIP model integration | Planned | Text-to-image semantic search | Medium | 5-7 days |
| | Create vision vector table | Planned | Separate 512-dim embeddings | - | - |
| | Implement search_images tool | Planned | Find images by visual content | - | - |
| 4 | CodeBERT for Code | Planned | Better code search | Medium | v0.8.0 |
| | Add CodeBERT embeddings | Planned | Improved semantic code search | Medium | 4-5 days |
| | Cross-language similarity | Planned | Find similar algorithms across languages | - | - |
| 5 | Cross-Modal Search | Planned | Unified search experience | High | v1.0.0 |
| | Merge results across modalities | Planned | Single query finds all asset types | High | 3-4 days |
| | Score normalization | Planned | Fair ranking across embedding spaces | - | - |
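
The Phase 5 items on merging results and normalizing scores could look something like the sketch below. The roadmap does not say which normalization is planned, so min-max scaling per modality is an assumption here, as are the function names `min_max_normalize` and `merge_modalities` and the input shape.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every result as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def merge_modalities(results_by_modality, limit=10):
    """Normalize each modality separately, then rank all results together."""
    merged = []
    for modality, scores in results_by_modality.items():
        for doc_id, score in min_max_normalize(scores).items():
            merged.append((doc_id, modality, score))
    # Highest normalized score first; ties broken by modality, then id.
    merged.sort(key=lambda row: (-row[2], row[1], row[0]))
    return merged[:limit]


if __name__ == "__main__":
    results = {
        "text": {"t1": 0.82, "t2": 0.80},   # e.g. cosine similarities
        "image": {"i1": 12.0, "i2": 3.0},   # e.g. scores on a different scale
    }
    for row in merge_modalities(results):
        print(row)
```

Normalizing within each modality first keeps a model with a wider raw score range from dominating the merged ranking, though it also discards information about how strong the best match in each modality really was.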

Next Steps

Immediate (v0.6.0 - This Month):

Add OCR support with pytesseract
Enable text extraction from images
Document installation and configuration
Test with screenshots and diagrams

Short-term (v0.7.0 - Next Month):

Evaluate OCR adoption and usage patterns
Decide on CLIP investment based on image search demand
If validated: Implement CLIP visual embeddings
Add text-to-image semantic search

Long-term (v0.8.0+):

CodeBERT for improved code search (if needed)
Cross-modal unified search
Audio transcription (Whisper)
Video frame extraction

Decision Points:

After OCR: Measure adoption before investing in CLIP
After CLIP: Assess if CodeBERT adds value over text embeddings
After individual modalities: Evaluate need for unified cross-modal search

Installing Optional Features

# OCR support (v0.6.0+)
# Enabled by default - requires Tesseract
uv pip install -e ".[ocr]"
brew install tesseract  # macOS
# To disable: export ENABLE_OCR=false

# Vision support with CLIP (v0.7.0+)
uv pip install -e ".[vision]"
export ENABLE_VISION_EMBEDDINGS=true

# Code embeddings with CodeBERT (v0.8.0+)
uv pip install -e ".[code]"
export ENABLE_CODE_EMBEDDINGS=true

# All features
uv pip install -e ".[all]"

Vision and code embeddings are opt-in and disabled by default. OCR is enabled by default (v0.6.0+).

Release history

0.13.0

May 5, 2026

0.12.0

Apr 24, 2026

0.11.0

Mar 27, 2026

0.10.0

Mar 18, 2026

0.9.0

Feb 22, 2026

This version

0.8.0

Jan 26, 2026

0.7.0

Jan 23, 2026
