Document fetching and management system with vector search and MCP integration
DocVault
A document management system with vector search and MCP integration for AI assistants.
CLI Command Structure Updated!
- Canonical commands now have user-friendly aliases.
- search is the default command (run dv <query> to search).
- Library lookup is now dv search lib <library> or dv search --library <library>.
- See below for updated usage examples and troubleshooting tips.
Purpose
DocVault is designed to help AI assistants and developers access up-to-date documentation for libraries, frameworks, and tools. It solves key challenges:
- Accessing documentation beyond AI training cutoff dates
- Centralizing technical documentation in a searchable format
- Providing AI agents with structured access to library documentation
- Supporting offline documentation access
Features
- Web Scraper: Fetch and store documentation from URLs
- Document Storage: Store HTML and Markdown versions
- Vector Search: Semantic search using document embeddings
- Section Navigation: Hierarchical document sections with parent-child relationships
- MCP Server: Expose functionality to AI assistants through Model Context Protocol
- Library Manager: Automatically fetch library documentation
- CLI Interface: Command-line tool for document management
- Database Migrations: Automatic schema updates with versioning
Installation
Using UV (Recommended)
DocVault uses uv as the preferred installation method for its speed and reliability. If you don't have uv installed, you can get it with:
pip install uv
# or with pipx for isolated installation
pipx install uv
Then clone and install DocVault:
git clone https://github.com/azmaveth/docvault.git
cd docvault
# Create virtual environment
uv venv .venv
# Install DocVault (this installs all dependencies including sqlite-vec)
uv pip install -e .
# Set up the 'dv' command for easy access
./scripts/install-dv.sh
# Initialize the database
dv init-db
Note: The install-dv.sh script will help you set up the dv command to work directly from your terminal without environment activation or bytecode compilation messages.
Using Traditional Pip
If you prefer, you can also use traditional pip:
git clone https://github.com/azmaveth/docvault.git
cd docvault
python -m venv .venv
source .venv/bin/activate
pip install -e .
# Set up the 'dv' command
./scripts/install-dv.sh
Setting up the dv Command
After installation, run the installation helper to set up easy access to the dv command:
./scripts/install-dv.sh
This script offers several options:
- Add an alias to your shell configuration (recommended)
- Create a wrapper script in ~/bin or /usr/local/bin
- Show manual instructions for custom setups
The alias method is recommended as it's the simplest and doesn't require additional files in your PATH.
Required Packages
DocVault automatically installs all required dependencies, including:
- sqlite-vec - Vector search extension for SQLite
- modelcontextprotocol - Model Context Protocol for AI assistant integration
- Various other libraries for web scraping, document processing, etc.
Quick Start
After installation and running ./scripts/install-dv.sh, you can use DocVault with the dv command.
Verifying Your Installation
To verify that DocVault is installed and working correctly, you can run the following test script:
#!/bin/bash
# Test basic CLI functionality
echo "Testing DocVault installation..."
# Check if dv command is available
if ! command -v dv &> /dev/null; then
    echo "❌ 'dv' command not found. Please run './scripts/install-dv.sh' to set up the command"
    exit 1
fi
# Test --version flag
echo -n "Checking version... "
dv --version
# Test database initialization
# Note: --force recreates the database if it already exists
echo -n "Initializing database... "
dv init-db --force
# Test search with no documents
echo -n "Testing search (no documents yet)... "
if ! dv search "test" &> /dev/null; then
    echo "❌ Search test failed"
    exit 1
else
    echo "✅ Search working"
fi
# Test adding a test document
echo -n "Adding test document... "
TEST_URL="https://raw.githubusercontent.com/azmaveth/docvault/main/README.md"
if ! dv add "$TEST_URL" &> /dev/null; then
    echo "❌ Failed to add test document"
    exit 1
else
    echo "✅ Test document added"
fi
# Test search with the added document
echo -n "Testing search with documents... "
if ! dv search "DocVault" &> /dev/null; then
    echo "❌ Search with documents failed"
    exit 1
else
    echo "✅ Search with documents working"
fi
# printf is used here because plain echo does not expand \n in bash
printf '\n🎉 All tests passed! DocVault is installed and working correctly.\n'
echo "Try running 'dv search \"your query\"' to search your documents."
Save this script as test_docvault.sh, make it executable with chmod +x test_docvault.sh, and run it to verify your installation.
Troubleshooting
Command Not Found: dv
If you get a "command not found" error when running dv, try these solutions:
- Run the installation helper:
./scripts/install-dv.sh
- Source your shell configuration (if you chose the alias option):
source ~/.bashrc # or ~/.zshrc for zsh users
- Use the direct path:
/path/to/docvault/.venv/bin/dv --help
- Install with pipx for global access:
pipx install git+https://github.com/azmaveth/docvault.git
Database Connection Issues
If you encounter database-related errors:
- Check file permissions:
ls -la ~/.docvault/docvault.db
- Rebuild the database:
dv init-db --force
Missing Dependencies
If you see import errors or missing modules:
- Reinstall dependencies:
uv pip install -r requirements.txt
- Check your Python version (the Requirements section lists Python 3.12+):
python --version
Vector Search Not Working
If vector search fails or falls back to text search:
- Verify sqlite-vec installation:
python -c "import sqlite_vec; print('sqlite-vec version:', sqlite_vec.__version__)"
- Rebuild the vector index:
dv init-db --force
Network Issues
If you experience timeouts or connection errors:
- Check your internet connection
- Set HTTP proxy if needed:
export HTTP_PROXY=http://your-proxy:port
export HTTPS_PROXY=http://your-proxy:port
Getting Help
If you're still having issues:
- Check the GitHub Issues for similar problems
- Run with the --debug flag for more detailed error messages:
dv --debug <command>
- Create a new issue with your error message and environment details
Database Initialization
Before using DocVault, you need to initialize the database:
dv init-db --force
This will create a new SQLite database at ~/.docvault/docvault.db with the necessary tables and vector index.
Note: Use the --force flag to recreate the database if it already exists.
Vector Search Setup
DocVault uses vector embeddings for semantic search. For optimal search functionality, you'll need to ensure the sqlite-vec extension is properly installed.
Verifying Vector Search
To check if vector search is working:
dv search "your search query" --debug
If you see a warning about sqlite-vec not being loaded, you'll need to install it.
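A similar check can be made outside the CLI. The sketch below tries to load sqlite-vec into an in-memory SQLite connection using the `sqlite_vec.load()` helper and the `vec_version()` SQL function from the sqlite-vec Python package. The function name `check_sqlite_vec` is hypothetical and is not part of DocVault; DocVault's own loading logic may differ.

```python
import sqlite3


def check_sqlite_vec():
    """Return (ok, detail) describing whether sqlite-vec can be loaded."""
    try:
        import sqlite_vec  # provided by `pip install sqlite-vec`
    except ImportError:
        return False, "sqlite-vec Python package is not installed"

    conn = sqlite3.connect(":memory:")
    try:
        # Some Python builds ship sqlite3 without extension loading support,
        # in which case enable_load_extension is missing (AttributeError).
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        (version,) = conn.execute("SELECT vec_version()").fetchone()
        return True, f"sqlite-vec {version} loaded"
    except (AttributeError, sqlite3.Error) as exc:
        return False, f"sqlite-vec could not be loaded: {exc}"
    finally:
        conn.close()


ok, detail = check_sqlite_vec()
print(detail)
```

If the function reports a failure, the installation steps in the next section are the place to start.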
Installing sqlite-vec
- Install the Python package:
pip install sqlite-vec
- Or install from source:
git clone https://github.com/asg017/sqlite-vec
cd sqlite-vec
make
make loadable
- Ensure the extension is in your LD_LIBRARY_PATH or provide the full path when loading.
Common Issues
- Missing Extension: If you see "sqlite-vec extension cannot be loaded", ensure the package is installed in your Python environment.
- Vector Table Not Found: If you get errors about missing vector tables, try recreating the database with dv init-db --force.
- Performance: For large document collections, consider increasing SQLite's cache size:
export SQLITE_CACHE_SIZE=1000000 # 1GB cache
- Text-Only Fallback: If vector search isn't available, DocVault will automatically fall back to text search. You can force text-only search with:
dv search "query" --text-only
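The fallback behaviour described above can be pictured roughly as follows. This is a minimal sketch of the general pattern, not DocVault's implementation: the `content` column, the `search` and `text_search` helpers, and the `text_only` parameter are assumptions made for illustration, and only the table names `document_segments` and `document_segments_vec` come from this README.

```python
import sqlite3


def text_search(conn, query):
    """Plain substring match; used when vector search is unavailable."""
    return conn.execute(
        "SELECT id, content FROM document_segments "
        "WHERE content LIKE ? ORDER BY id",
        (f"%{query}%",),
    ).fetchall()


def search(conn, query, vector_search=None, text_only=False):
    """Try vector search first and fall back to text search on failure."""
    if vector_search is not None and not text_only:
        try:
            return vector_search(conn, query)
        except sqlite3.OperationalError:
            pass  # e.g. "no such table: document_segments_vec"
    return text_search(conn, query)


def broken_vector_search(conn, query):
    # Stand-in for a vector query against an index that does not exist.
    return conn.execute("SELECT rowid FROM document_segments_vec").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document_segments (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO document_segments (content) VALUES (?)",
    [("Open a sqlite connection",), ("Unrelated text",)],
)
results = search(conn, "sqlite", vector_search=broken_vector_search)
print(results)
```

Here the vector query fails because the index table is missing, so the call quietly returns the text-search result instead of raising.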
Adding Documents
From a URL
To add a document from a URL:
dv add https://example.com/document
From a Local File
To add a document from a local file:
dv add /path/to/document.pdf
From a Directory
To add all documents from a directory:
dv add /path/to/documents/
- Initialize the database (recommended for a fresh start):
dv init --force
# or using the alias
dv init-db --force
If you want to keep existing data, you can omit --force.
- Import your first document:
dv import https://docs.python.org/3/library/sqlite3.html
# or use an alias:
dv add https://docs.python.org/3/library/sqlite3.html
dv scrape https://docs.python.org/3/library/sqlite3.html
dv fetch https://docs.python.org/3/library/sqlite3.html
- Search for content (or just type your query, since 'search' is default):
dv search "sqlite connection"
# or simply
dv "sqlite connection"
# or use alias
dv find "sqlite connection"
- Start the MCP server for AI assistant integration:
dv serve --transport sse
This will start a server at http://127.0.0.1:8000 that AI assistants can interact with.
CLI Commands
Import Dependencies from Project Files
DocVault can automatically detect and import documentation for all dependencies in your project. This works with various project types including Python, Node.js, Rust, Go, Ruby, and PHP.
# Import dependencies from the current directory
dv import-deps
# Import dependencies from a specific directory
dv import-deps /path/to/project
# Force re-import of all dependencies (even if they exist)
dv import-deps --force
# Include development dependencies (if supported by project type)
dv import-deps --include-dev
# Specify project type (auto-detected by default)
dv import-deps --project-type python
# Output results in JSON format
dv import-deps --format json
Supported Project Types
- Python: requirements.txt, pyproject.toml, setup.py, Pipfile, setup.cfg
- Node.js: package.json, yarn.lock, package-lock.json
- Rust: Cargo.toml
- Go: go.mod
- Ruby: Gemfile, Gemfile.lock
- PHP: composer.json, composer.lock
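Auto-detection of this kind usually comes down to looking for marker files. The sketch below illustrates the idea using the file names listed above; the `detect_project_types` function and the type keys (for example `nodejs`) are hypothetical, and DocVault's actual detection logic may work differently.

```python
import tempfile
from pathlib import Path

# Marker files per project type, taken from the supported project types list.
PROJECT_MARKERS = {
    "python": ["requirements.txt", "pyproject.toml", "setup.py", "Pipfile", "setup.cfg"],
    "nodejs": ["package.json", "yarn.lock", "package-lock.json"],
    "rust": ["Cargo.toml"],
    "go": ["go.mod"],
    "ruby": ["Gemfile", "Gemfile.lock"],
    "php": ["composer.json", "composer.lock"],
}


def detect_project_types(project_dir):
    """Map each detected project type to the marker files found for it."""
    root = Path(project_dir)
    found = {}
    for project_type, markers in PROJECT_MARKERS.items():
        present = [name for name in markers if (root / name).is_file()]
        if present:
            found[project_type] = present
    return found


# Demonstrate on a throwaway directory containing two marker files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pyproject.toml").write_text("")
    (Path(tmp) / "package.json").write_text("{}")
    detected = detect_project_types(tmp)
print(detected)
```

A directory can match more than one type, which is one reason an explicit --project-type option is useful.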
Pre-commit Hooks
To ensure code and documentation quality, DocVault uses pre-commit hooks for Python formatting, linting, markdown linting, YAML linting, and secret detection.
Setup
- Install pre-commit (once per system):
pip install pre-commit
- Install the hooks (once per clone):
pre-commit install
This will automatically run checks on staged files before each commit. To manually run all hooks on all files:
pre-commit run --all-files
- dv import <url> - Import documentation from a URL (aliases: add, scrape, fetch)
- dv remove <id1> [id2...] - Remove documents from the vault (alias: rm)
- dv list - List all documents in the vault (alias: ls)
- dv read <id> - Read a document (alias: cat)
- dv search <query> - Search documents with semantic search (alias: find, default command)
- dv search lib <library> [--version <version>] - Lookup and fetch library documentation
- dv backup [destination] - Backup the vault to a zip file
- dv import-backup <file> - Import a backup file
- dv config - Manage configuration
- dv init [--wipe] - Initialize the database (alias: init-db, use --wipe to clear all data)
- dv serve - Start the MCP server
- dv index - Index or re-index documents for vector search
Library Lookup Example
# Lookup latest version of a library
dv search lib pandas
# Lookup specific version
dv search lib tensorflow --version 2.0.0
# Alternate syntax (option flag):
dv search --library pandas
## Connecting DocVault to AI Assistants
### What is MCP?
The [Model Context Protocol](https://modelcontextprotocol.io) (MCP) is a standardized interface for AI assistants to interact with external tools and data sources. DocVault implements MCP to allow AI assistants to search for and retrieve documentation.
### Starting the MCP Server
DocVault supports two transport methods:
1. **stdio** - Used when running DocVault directly from an AI assistant
2. **SSE (Server-Sent Events)** - Used when running DocVault as a standalone server
#### Option 1: Using stdio Transport (Recommended for Claude Desktop)
For Claude Desktop, use stdio transport which is the most secure option and recommended by the MCP specification. Claude Desktop will launch DocVault as a subprocess and communicate directly with it:
1. In Claude Desktop, navigate to Settings > External Tools
2. Click "Add Tool"
3. Fill in the form:
   - **Name**: DocVault
   - **Description**: Documentation search and retrieval tool
   - **Command**: The full path to your DocVault executable, e.g., `/usr/local/bin/dv` or the full path to your Python executable plus the path to the DocVault script
   - **Arguments**: `serve`
This will start DocVault in stdio mode, where Claude Desktop will send commands directly to DocVault's stdin and receive responses from stdout.
### Claude Desktop Configuration Example
You can configure DocVault in Claude Desktop by adding it to your configuration file. Here's a JSON example you can copy and paste:
```json
{
  "mcpServers": {
    "docvault": {
      "command": "dv",
      "args": ["serve"]
    }
  }
}
```
> **Note:** If `dv` is not in your PATH, you need to use the full path to the executable, e.g.:
> ```json
> {
>   "mcpServers": {
>     "docvault": {
>       "command": "/usr/local/bin/dv",
>       "args": ["serve"]
>     }
>   }
> }
> ```
> You can find the full path by running `which dv` in your terminal.
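
If you would rather generate the configuration than edit it by hand, a small script can resolve the path for you. This is only a sketch: `build_mcp_config` is a hypothetical helper, and it simply mirrors the JSON structure shown above, falling back to the bare command name when `dv` is not found on your PATH.

```python
import json
import shutil


def build_mcp_config(command="dv"):
    """Build the mcpServers entry, resolving the command to a full path if possible."""
    resolved = shutil.which(command) or command
    return {"mcpServers": {"docvault": {"command": resolved, "args": ["serve"]}}}


config = build_mcp_config()
print(json.dumps(config, indent=2))
```

The printed JSON can then be merged into your existing configuration file.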
#### Option 2: Using SSE Transport (For Web-Based AI Assistants)
For web-based AI assistants or when you want to run DocVault as a persistent server:
1. Start the DocVault MCP server with SSE transport:
```bash
dv serve --transport sse --host 127.0.0.1 --port 8000
```
2. The server will start on the specified host and port (defaults to 127.0.0.1:8000).
3. For AI assistants that support connecting to MCP servers via SSE:
- Configure the MCP client with the URL: `http://127.0.0.1:8000`
- The AI assistant will connect to the SSE endpoint and receive the message endpoint in the initial handshake
> **Security Note**: When using SSE transport, bind to localhost (127.0.0.1) to prevent external access to your DocVault server. The MCP protocol recommends stdio transport for desktop applications due to potential security concerns with network-accessible endpoints.
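
Before pointing an MCP client at the server, it can help to confirm that the address is loopback-only and that something is actually listening. The sketch below uses only the Python standard library; the helper names are hypothetical, and the host and port are the defaults mentioned above. It checks TCP reachability only and does not speak the MCP protocol.

```python
import ipaddress
import socket


def is_loopback(host):
    """True if host is a loopback IP literal such as 127.0.0.1 or ::1."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return host == "localhost"  # a hostname rather than an IP literal


def is_listening(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = "127.0.0.1", 8000
print(f"loopback only: {is_loopback(host)}")
print(f"server reachable: {is_listening(host, port)}")
```

A host such as 0.0.0.0 fails the loopback check, which is the situation the security note warns about.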
### Example: Using DocVault with mcp-inspector
For testing and debugging, you can use the [mcp-inspector](https://github.com/modelcontextprotocol/inspector) tool:
1. Start DocVault with SSE transport:
```bash
dv serve --transport sse
```
2. Install and run mcp-inspector:
```bash
npx @modelcontextprotocol/inspector
```
3. In the inspector interface, connect to `http://localhost:8000`
4. You'll be able to explore available tools, resources, and test interactions with your DocVault server.
## Document Sections
DocVault now supports hierarchical document sections, making it easier to navigate and reference specific parts of your documents. This feature is particularly useful for large documentation sets.
### Key Features
- **Section Hierarchy**: Documents are automatically divided into sections with parent-child relationships
- **Automatic Section Detection**: Headings (h1, h2, etc.) are automatically detected and used to create the section structure
- **Section Metadata**: Each section includes:
- Title
- Level (1-6, corresponding to HTML heading levels)
- Path (e.g., "1.2.3" for the third subsection of the second section)
- Parent section reference
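
To make the path and parent metadata concrete, here is one way such a hierarchy could be derived from a flat list of headings. This is an illustrative sketch under stated assumptions, not DocVault's parser: the `Section` class and `build_sections` function are hypothetical, and headings are assumed to arrive as `(level, title)` pairs in document order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    id: int
    title: str
    level: int
    path: str
    parent_id: Optional[int]


def build_sections(headings):
    """Assign a dotted path and a parent to each (level, title) heading."""
    sections = []
    stack = []         # currently open ancestors, outermost first
    child_counts = {}  # parent id (None for top level) -> children seen so far
    for idx, (level, title) in enumerate(headings, start=1):
        # Close every open section at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop()
        parent = stack[-1] if stack else None
        parent_id = parent.id if parent else None
        child_counts[parent_id] = child_counts.get(parent_id, 0) + 1
        number = child_counts[parent_id]
        path = f"{parent.path}.{number}" if parent else str(number)
        section = Section(idx, title, level, path, parent_id)
        sections.append(section)
        stack.append(section)
    return sections


headings = [(1, "Intro"), (2, "Install"), (2, "Usage"), (3, "CLI"), (1, "API")]
sections = build_sections(headings)
for s in sections:
    print(s.path, s.title, "parent:", s.parent_id)
```

The stack keeps track of which headings are still open, so a level-3 heading attaches to the nearest preceding heading with a smaller level.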
### Using Sections in Queries
When searching documents, you can now include section information in your results:
```python
# Get all sections for a document
sections = get_document_sections(document_id)
# Each section includes:
# - id: Unique identifier
# - document_id: Parent document ID
# - section_title: The section title
# - section_level: The heading level (1-6)
# - section_path: The hierarchical path (e.g., "1.2.3")
# - parent_segment_id: ID of the parent section (None for top-level sections)
```
Database Schema
The section information is stored in the document_segments table with these additional columns:
- section_title: The title of the section (usually from the heading text)
- section_level: The heading level (1-6)
- section_path: A path-like string representing the section's position in the hierarchy
- parent_segment_id: Foreign key to the parent segment (for nested sections)
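
As a rough illustration of how these columns fit together, the sketch below builds a cut-down `document_segments` table in an in-memory SQLite database and reads the sections back in order. The column types, the omission of any other columns, and the `fetch_sections` helper are assumptions; the real schema is created by DocVault's migrations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE document_segments (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL,
        section_title TEXT,
        section_level INTEGER,
        section_path TEXT,
        parent_segment_id INTEGER REFERENCES document_segments(id)
    )
    """
)
conn.executemany(
    "INSERT INTO document_segments VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "Guide", 1, "1", None),
        (2, 1, "Later section", 2, "1.10", 1),
        (3, 1, "Early section", 2, "1.2", 1),
    ],
)


def fetch_sections(conn, document_id):
    """Return a document's sections in reading order."""
    rows = conn.execute(
        "SELECT id, section_title, section_level, section_path, parent_segment_id "
        "FROM document_segments WHERE document_id = ?",
        (document_id,),
    ).fetchall()
    # Sort numerically so that "1.2" comes before "1.10".
    return sorted(rows, key=lambda row: [int(part) for part in row[3].split(".")])


for row in fetch_sections(conn, 1):
    print(row)
```

Sorting on the parsed path rather than the raw string matters once a section has ten or more siblings.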
Available MCP Tools
DocVault exposes the following tools via MCP:
- scrape_document - Add documentation from a URL to the vault
- search_documents - Search documents using semantic search
- read_document - Retrieve document content
- lookup_library_docs - Get documentation for a library
- list_documents - List available documents
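
On the wire, an MCP client invokes one of these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `search_documents`. The `tool_call_request` helper is hypothetical, and the `query` argument name is a guess: the tool's real input schema should be read from the server's `tools/list` response.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request in the shape MCP uses."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" is an assumed argument name; check the tool's input schema first.
request = tool_call_request(1, "search_documents", {"query": "sqlite connection"})
print(json.dumps(request, indent=2))
```

In practice an MCP client library or the AI assistant builds these messages for you, so this is mainly useful when debugging.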
For detailed instructions for AI assistants using DocVault, see CLAUDE.md.
Known Limitations and Troubleshooting
- Vector Search Issues: If you encounter "no such table: document_segments_vec" errors, run dv index to rebuild the search index.
- GitHub Scraping: DocVault may have difficulty scraping GitHub repositories. Try using specific documentation URLs instead of repository root URLs.
- Documentation Websites: Some documentation websites with complex structures may not be scraped correctly. Try adjusting the depth parameter (--depth).
- Embedding Model: The default embedding model is nomic-embed-text via Ollama. Ensure Ollama is running and has this model available.
- dv command not found: If dv is not recognized, use uv run dv or run ./scripts/install-dv.sh to set up the command. Some shells may require you to activate your virtual environment. See troubleshooting above.
- Failed to fetch URL: If you see errors like 'Failed to fetch URL' when adding documents, verify the URL is accessible and check your network connection. Some sites may block automated scraping.
Requirements
- Python 3.12+
- Ollama for embeddings (using the nomic-embed-text model by default)
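
One way to confirm the Ollama requirement is to ask Ollama which models it has pulled. The sketch below queries Ollama's `/api/tags` endpoint and looks for the embedding model. The helper names are hypothetical, and the base URL `http://localhost:11434` is Ollama's usual default rather than something this README specifies; adjust it if your OLLAMA_URL differs.

```python
import json
import urllib.error
import urllib.request


def model_available(tags_payload, model="nomic-embed-text"):
    """True if the /api/tags payload lists the model, with or without a tag."""
    names = [entry.get("name", "") for entry in tags_payload.get("models", [])]
    return any(name == model or name.startswith(model + ":") for name in names)


def fetch_ollama_tags(base_url="http://localhost:11434", timeout=3.0):
    """Return the parsed /api/tags response, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


# Offline demonstration with a payload shaped like Ollama's response.
sample = {"models": [{"name": "nomic-embed-text:latest"}]}
print(model_available(sample))
```

Against a live server, call `fetch_ollama_tags()` first and pass its result to `model_available` when it is not None.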
Configuration
DocVault can be configured using environment variables or a .env file in ~/.docvault/:
dv config --init
This will create a .env file with default settings. You can then edit this file to customize DocVault.
Available configuration options include:
- DOCVAULT_DB_PATH - Path to SQLite database
- BRAVE_SEARCH_API_KEY - API key for Brave Search (optional)
- OLLAMA_URL - URL for Ollama API
- EMBEDDING_MODEL - Embedding model to use
- STORAGE_PATH - Path for document storage
- HOST - MCP server host (for SSE/web mode, required by Uvicorn)
- PORT - MCP server port (for SSE/web mode, required by Uvicorn)
- SERVER_HOST - [legacy/stdio mode only] MCP server host (not used by Uvicorn)
- SERVER_PORT - [legacy/stdio mode only] MCP server port (not used by Uvicorn)
- LOG_LEVEL - Logging level (INFO, DEBUG, etc.)
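
For scripts that need to read the same settings, the sketch below shows one way to load a few of these variables with fallbacks. The `Settings` class and `load_settings` function are hypothetical and are not DocVault's configuration code. The database path, host, port, and embedding model defaults follow values mentioned elsewhere in this README, while the Ollama URL and log level defaults are assumptions.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Settings:
    db_path: Path
    ollama_url: str
    embedding_model: str
    host: str
    port: int
    log_level: str


def load_settings(env=None):
    """Read DocVault-style settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    default_db = Path.home() / ".docvault" / "docvault.db"
    return Settings(
        db_path=Path(env.get("DOCVAULT_DB_PATH", default_db)),
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434"),  # assumed default
        embedding_model=env.get("EMBEDDING_MODEL", "nomic-embed-text"),
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8000")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),  # assumed default
    )


settings = load_settings({"PORT": "9000", "LOG_LEVEL": "debug"})
print(settings)
```

Passing a plain dictionary instead of the real environment makes the loader easy to test.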
Development
We welcome contributions to DocVault! Check out the TASKS.md file for planned improvements and tasks you can help with.
We provide a convenient script to set up a development environment using UV:
# Make the script executable if needed
chmod +x scripts/dev-setup.sh
# Run the setup script
./scripts/dev-setup.sh
This script creates a virtual environment, installs dependencies with UV, and checks for the sqlite-vec extension.
Running Tests
DocVault includes a comprehensive test suite. You can run tests using the provided test runner script or make commands:
# Run all tests
./scripts/run-tests.sh
# or
make test
# Run specific test suites
./scripts/run-tests.sh unit # Unit tests only
./scripts/run-tests.sh cli # CLI tests only
./scripts/run-tests.sh mcp # MCP server tests only
./scripts/run-tests.sh quick # Quick smoke tests
# Run with coverage
./scripts/run-tests.sh -c all
# or
make test-coverage
# Run with verbose output
./scripts/run-tests.sh -v all
# See all options
./scripts/run-tests.sh --help
Code Quality
# Run linting
make lint
# Format code
make format
Continuous Integration
DocVault uses GitHub Actions for CI/CD. Tests are automatically run on:
- Push to main/master/develop branches
- Pull requests
- Multiple OS (Ubuntu, macOS, Windows)
- Multiple Python versions (3.11, 3.12)
License
MIT