stfo-colbert
Straightforward ColBERT indexing and serving via PyLate (for when you need a development ColBERT server)
Design Goals
- Straightforward: Single-command usage via CLI (stfo is short for "straightforward")
- Minimal: Readable, functional code with minimal default dependencies
- Simple: One HTTP endpoint only:
GET /search - for development use: suitable for anyone who needs an ad-hoc semantic search server
When to Use
Use stfo-colbert when you:
- Have a small-to-medium collection and want a simple way to build a ColBERT-style index (via PyLate) and query it over HTTP
- Prefer a one-shot CLI to index and serve, without additional orchestration
Installation
From PyPI
pip install stfo-colbert
From source (development)
git clone <repository-url>
cd stfo_colbert
pip install -e .
Quickstart
1. Install the package
pip install stfo-colbert
2. Run the CLI (index and serve)
stfo-colbert \
--dataset-path /path/to/dataset.txt
3. Query the API
curl "http://127.0.0.1:8889/search?query=hello&k=2"
4. Example response
{
"query": "hello",
"topk": [
{
"pid": "1",
"rank": 0,
"score": 0.92,
"text": "Hello world! This is a sample document.",
"prob": 0.51
},
{
"pid": "2",
"rank": 1,
"score": 0.87,
"text": "A friendly hello from another document.",
"prob": 0.49
}
]
}
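The response above can be consumed with nothing beyond the Python standard library; a minimal sketch that parses the example payload (against a live server you would fetch this JSON from the /search endpoint instead of embedding it):

```python
import json

# The example /search response from the quickstart, embedded for illustration
payload = """
{
  "query": "hello",
  "topk": [
    {"pid": "1", "rank": 0, "score": 0.92,
     "text": "Hello world! This is a sample document.", "prob": 0.51},
    {"pid": "2", "rank": 1, "score": 0.87,
     "text": "A friendly hello from another document.", "prob": 0.49}
  ]
}
"""

results = json.loads(payload)
for hit in results["topk"]:
    print(f'{hit["rank"]}. pid={hit["pid"]} score={hit["score"]:.2f} {hit["text"]}')
```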
CLI Reference
stfo-colbert [options]
Options
| Option | Description | Default |
|---|---|---|
| --port | Port to serve on | 8889 |
| --model-name | Hugging Face model id/name | mixedbread-ai/mxbai-edge-colbert-v0-17m |
| --index-path | Path to an existing PyLate index directory; mutually exclusive with --dataset-path | - |
| --dataset-path | Path to dataset for index creation (file or directory); mutually exclusive with --index-path | - |
| --batch-size | Batch size for encoding | 64 |
| --chunk-size | Number of documents to accumulate before encoding | 10000 |
Usage Patterns
Serve an existing index:
stfo-colbert --index-path ./experiments/my_index --port 8889
Build from a delimited TXT, then serve:
stfo-colbert --dataset-path ./data/my_corpus.txt --port 8889
Build from a directory of docs, then serve:
stfo-colbert --dataset-path ./docs_dir --port 8889
Dataset Formats
1. Delimited text file (default)
A plain text file where each document is separated by the delimiter: \n\n--------\n\n
Example:
Document one text
--------
Document two text
Note: Any occurrences of the delimiter inside documents are removed during preprocessing to avoid boundary confusion.
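To produce such a file programmatically, a small helper can join documents with the delimiter and strip any in-document occurrences, mirroring the preprocessing described above (the helper name is illustrative, not part of the package):

```python
DELIMITER = "\n\n--------\n\n"

def write_delimited_corpus(docs, path, delimiter=DELIMITER):
    """Write documents to a delimited text file, removing any
    in-document occurrences of the delimiter to avoid boundary confusion."""
    cleaned = [doc.replace(delimiter, " ") for doc in docs]
    with open(path, "w", encoding="utf-8") as f:
        f.write(delimiter.join(cleaned))

write_delimited_corpus(
    ["Document one text", "Document two text"],
    "my_corpus.txt",
)
```

The resulting file can be passed directly to --dataset-path.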
2. Directory of document files
When --dataset-path points to a directory, stfo-colbert will scan for files and create a compressed cache file (.stfo_colbert_cache.txt.xz) in that directory. On later runs, this cache is reused instead of re-parsing all files, significantly speeding up initialization.
Supported file types:
.txt, .md, .pdf
Cache behavior:
- The cache file is automatically created after the first directory scan
- To force a re-scan, delete the .stfo_colbert_cache.txt.xz file from the dataset directory
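Forcing a re-scan can also be scripted; a small sketch (the function name is illustrative, the cache filename matches the one documented above):

```python
from pathlib import Path

def clear_stfo_cache(dataset_dir):
    """Delete the directory-scan cache so the next run re-parses all files.
    Returns True if a cache file was removed, False if none existed."""
    cache = Path(dataset_dir) / ".stfo_colbert_cache.txt.xz"
    if cache.exists():
        cache.unlink()
        return True
    return False
```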
Index Format
stfo-colbert uses PyLate's PLAID index under the hood:
- Loads the model (default: mixedbread-ai/mxbai-edge-colbert-v0-17m)
- Encodes documents in chunks and builds an index incrementally
- Serves top-k retrieval via a simple HTTP API
The index directory contains:
- PLAID index files: The core PyLate index structure
- collection.db: A SQLite database mapping document IDs to their text content
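The exact collection.db schema is not documented here, but a pid-to-text mapping in SQLite can be sketched as follows (the table and column names are assumptions for illustration, not the package's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in practice: <index_dir>/collection.db
# Hypothetical schema: a simple pid -> text mapping
conn.execute("CREATE TABLE collection (pid TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO collection VALUES (?, ?)",
    [("1", "Hello world! This is a sample document."),
     ("2", "A friendly hello from another document.")],
)

# Look up the text for a retrieved document id
row = conn.execute("SELECT text FROM collection WHERE pid = ?", ("1",)).fetchone()
print(row[0])
```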
Streaming and Chunked Processing
To handle large datasets efficiently, stfo-colbert processes documents in chunks:
- Documents are streamed from the dataset (not loaded entirely into memory)
- Each chunk is encoded and added to the index incrementally
- The collection mapping is saved to SQLite progressively during indexing
- Default chunk size is 10,000 documents (configurable via --chunk-size)
This approach enables indexing of large datasets (e.g., entire Wikipedia) without running out of memory.
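The chunking step can be sketched as a generator that never materializes the full corpus (a minimal illustration, not the package's internal code):

```python
from itertools import islice

def iter_chunks(doc_stream, chunk_size=10_000):
    """Yield lists of up to chunk_size documents from a (possibly huge)
    iterator, so the whole corpus never has to sit in memory."""
    it = iter(doc_stream)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

# With a real corpus, each chunk would be encoded and added to the index;
# here we just show the chunk sizes for a 25-document stream.
sizes = [len(c) for c in iter_chunks(range(25), chunk_size=10)]
print(sizes)  # [10, 10, 5]
```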
When you build an index from documents, stfo-colbert automatically creates the collection.db file to enable text retrieval in search results. If you pass --index-path with an existing index, search results will include text snippets only if collection.db is present in the index directory.
HTTP API
GET /search
Parameters:
- query (string, required): the search string
- k (integer, optional): number of top results to return (default: 10, max: 100)
Response:
{
"query": "...",
"topk": [
{
"pid": "<document_id>",
"score": 0.95,
"text": "...",
"prob": 0.87
}
]
}
Note: The text field is included only if the collection mapping is available (e.g., built from a delimited TXT, or a collection.db present in the index directory).
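From Python, the endpoint can be called with the standard library alone; a minimal client sketch (the helper name is illustrative):

```python
from urllib.parse import urlencode

def search_url(base, query, k=10):
    """Build a /search URL; the server caps k at 100."""
    return f"{base}/search?" + urlencode({"query": query, "k": min(k, 100)})

url = search_url("http://127.0.0.1:8889", "hello", k=2)
print(url)  # http://127.0.0.1:8889/search?query=hello&k=2

# Against a running server:
#   import json
#   from urllib.request import urlopen
#   with urlopen(url) as resp:
#       hits = json.load(resp)["topk"]
```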
Design Notes
- Functional approach: Modules expose pure functions; the CLI composes them
- Minimal dependencies: FastAPI for the web layer, Uvicorn ASGI server, PyLate for model+index, PyMuPDF for PDF parsing
- Persistent caching: When processing directories, a compressed cache file (.stfo_colbert_cache.txt.xz) is saved in the dataset directory for faster subsequent runs
Development
Install in editable mode:
pip install -e .
Run tests:
pip install pytest
pytest
Examples
Using the included example data
Index Wikipedia summaries and query for specific topics:
# Start the server with Wikipedia summaries
stfo-colbert --dataset-path example_data/wikipedia_summaries.txt
# Query for movies
curl "http://127.0.0.1:8889/search?query=Disney%20animated%20movies&k=3"
# Query for sports
curl "http://127.0.0.1:8889/search?query=Olympic%20track%20and%20field%20events&k=5"
Index arXiv PDFs and search research papers:
# Start the server with PDF directory
stfo-colbert --dataset-path example_data/arxiv_sample
# Search for AI/ML topics
curl "http://127.0.0.1:8889/search?query=machine%20learning%20transformers&k=5"
# Search for specific research areas
curl "http://127.0.0.1:8889/search?query=neural%20network%20architecture&k=3"
Index large Wikipedia dataset:
# First, download and prepare the Wikipedia 20231101.en dataset
# Note: This is a large dataset (~20 GB) and will take time to download
python example_data/wikipedia_20231101_en.py
# Index the Wikipedia dataset with streaming (handles large datasets efficiently)
# The data is processed in chunks to avoid memory issues, though indexing a corpus of this size will still take considerable time
stfo-colbert --dataset-path wikipedia_20231101_en_shuffled.txt --chunk-size 10000
# Search for topics in Wikipedia
curl "http://127.0.0.1:8889/search?query=machine%20learning%20history&k=5"
The wikipedia_20231101_en.py script:
- Downloads the Wikipedia 20231101.en dataset from Hugging Face
- Shuffles it with a buffer size of 100,000 (good for building index centroids)
- Formats it as a delimited text file compatible with stfo-colbert
- Uses streaming to avoid loading the entire dataset into memory
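The buffered shuffle the script relies on can be sketched independently of Hugging Face datasets: keep a fixed-size buffer and, once full, swap each incoming item with a randomly chosen buffered one (names and logic are illustrative, similar in spirit to datasets' shuffle(buffer_size=...)):

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Approximately shuffle an iterator using a fixed-size buffer,
    so the full dataset never has to fit in memory."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
print(shuffled)
```

A larger buffer gives a better approximation of a true shuffle, which is why the script uses 100,000 for building index centroids.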
General usage examples
Index directory of Markdown notes and serve on port 7777:
stfo-colbert --dataset-path ~/notes --port 7777
Serve existing index folder:
stfo-colbert --index-path ./experiments/wiki_index --port 8889