stfo-colbert

Straightforward ColBERT indexing and serving (if you need a development ColBERT server)

Design Goals

  • Straightforward: Single-command usage via the CLI (stfo is short for "straightforward")
  • Minimal: Readable, functional code with minimal default dependencies
  • Simple: One HTTP endpoint only: GET /search
  • For development usage: Suitable for anyone who needs an ad hoc semantic search server

When to Use

Use stfo-colbert when you:

  • Have a small-to-medium collection and want a simple way to build a ColBERT-style index (via PyLate) and query it over HTTP
  • Prefer a one-shot CLI to index and serve, without additional orchestration

Installation

From PyPI

pip install stfo-colbert

From source (development)

git clone <repository-url>
cd stfo_colbert
pip install -e .

Quickstart

1. Install the package

pip install stfo-colbert

2. Run the CLI (index and serve)

stfo-colbert \
  --dataset-path /path/to/dataset.txt

3. Query the API

curl "http://127.0.0.1:8889/search?query=hello&k=2"

4. Example response

{
  "query": "hello",
  "topk": [
    {
      "pid": "1",
      "rank": 0,
      "score": 0.92,
      "text": "Hello world! This is a sample document.",
      "prob": 0.51
    },
    {
      "pid": "2",
      "rank": 1,
      "score": 0.87,
      "text": "A friendly hello from another document.",
      "prob": 0.49
    }
  ]
}
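
For programmatic access, a small Python client along the lines of the sketch below is enough; it relies only on the GET /search endpoint and the query/k parameters documented on this page, and assumes the default host and port.

import json
import urllib.parse
import urllib.request

def search(query, k=10, base_url="http://127.0.0.1:8889"):
    # Query the stfo-colbert /search endpoint and return the parsed JSON response.
    params = urllib.parse.urlencode({"query": query, "k": k})
    with urllib.request.urlopen(f"{base_url}/search?{params}") as resp:
        return json.load(resp)

results = search("hello", k=2)
for hit in results["topk"]:
    # "text" is only present when a collection mapping is available (see below)
    print(hit["score"], hit.get("text", ""))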

CLI Reference

stfo-colbert [options]

Options

  • --port: Port to serve on (default: 8889)
  • --model-name: Hugging Face model id/name (default: mixedbread-ai/mxbai-edge-colbert-v0-17m)
  • --index-path: Path to an existing PyLate index directory; mutually exclusive with --dataset-path (no default)
  • --dataset-path: Path to a dataset for index creation, either a file or a directory; mutually exclusive with --index-path (no default)
  • --batch-size: Batch size for encoding (default: 64)
  • --chunk-size: Number of documents to accumulate before encoding (default: 10000)

Usage Patterns

Serve an existing index:

stfo-colbert --index-path ./experiments/my_index --port 8889

Build from a delimited TXT, then serve:

stfo-colbert --dataset-path ./data/my_corpus.txt --port 8889

Build from a directory of docs, then serve:

stfo-colbert --dataset-path ./docs_dir --port 8889

Dataset Formats

1. Delimited text file (default)

A plain text file in which documents are separated by the delimiter: \n\n--------\n\n

Example:

Document one text

--------

Document two text

Note: Any occurrences of the delimiter inside documents are removed during preprocessing to avoid boundary confusion.
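
To illustrate, the short Python sketch below writes a list of documents in this delimited format; the filename and document list are just placeholders.

# Write documents to a delimited text file that stfo-colbert can index.
DELIMITER = "\n\n--------\n\n"

docs = [
    "Document one text",
    "Document two text",
]

with open("my_corpus.txt", "w", encoding="utf-8") as f:
    # Mirror the preprocessing note above: strip any delimiter occurrences inside documents.
    f.write(DELIMITER.join(doc.replace(DELIMITER, " ") for doc in docs))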

2. Directory of document files

When --dataset-path points to a directory, stfo-colbert will scan for files and create a compressed cache file (.stfo_colbert_cache.txt.xz) in that directory. On later runs, this cache is reused instead of re-parsing all files, significantly speeding up initialization.

Supported file types:

  • .txt, .md
  • .pdf

Cache behavior:

  • The cache file is automatically created after the first directory scan
  • To force a re-scan, delete the .stfo_colbert_cache.txt.xz file from the dataset directory
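
For example, a re-scan can be forced from Python by deleting the cache file; the dataset directory below is a placeholder.

from pathlib import Path

# Remove the directory-scan cache so the next run re-parses all files.
dataset_dir = Path("./docs_dir")  # placeholder path
cache_file = dataset_dir / ".stfo_colbert_cache.txt.xz"
if cache_file.exists():
    cache_file.unlink()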

Index Format

stfo-colbert uses PyLate's PLAID index under the hood:

  • Loads the model (default: mixedbread-ai/mxbai-edge-colbert-v0-17m)
  • Encodes documents in chunks and builds an index incrementally
  • Serves top-k retrieval via a simple HTTP API

The index directory contains:

  • PLAID index files: The core PyLate index structure
  • collection.db: A SQLite database mapping document IDs to their text content
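
If you want to peek inside collection.db, the sqlite3 sketch below lists its tables and schema first, since the exact table and column names are not documented here; the index path is a placeholder.

import sqlite3

# Inspect the SQLite collection mapping stored alongside the PLAID index files.
# The schema is not documented here, so list the tables before querying them.
con = sqlite3.connect("experiments/my_index/collection.db")
for name, sql in con.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)
    print(sql)
con.close()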

Streaming and Chunked Processing

To handle large datasets efficiently, stfo-colbert processes documents in chunks:

  • Documents are streamed from the dataset (not loaded entirely into memory)
  • Each chunk is encoded and added to the index incrementally
  • The collection mapping is saved to SQLite progressively during indexing
  • Default chunk size is 10,000 documents (configurable via --chunk-size)

This approach enables indexing of large datasets (e.g., entire Wikipedia) without running out of memory.
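
Conceptually, the chunked flow resembles the sketch below. It only illustrates the streaming pattern: stream_documents and index_in_chunks are hypothetical names, and the encode-and-index step is passed in as a callable rather than guessing at PyLate's API.

DELIMITER = "\n\n--------\n\n"

def stream_documents(path):
    # Yield documents one at a time from a delimited text file,
    # without reading the whole file into memory.
    with open(path, encoding="utf-8") as f:
        buffer = ""
        for line in f:
            buffer += line
            while DELIMITER in buffer:
                doc, buffer = buffer.split(DELIMITER, 1)
                if doc.strip():
                    yield doc.strip()
        if buffer.strip():
            yield buffer.strip()

def index_in_chunks(path, encode_and_add, chunk_size=10_000):
    # encode_and_add is a callable that encodes a list of documents and adds
    # them to the index; it is left abstract here rather than assumed.
    chunk = []
    for doc in stream_documents(path):
        chunk.append(doc)
        if len(chunk) >= chunk_size:
            encode_and_add(chunk)
            chunk = []
    if chunk:
        encode_and_add(chunk)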

When you build an index from documents, stfo-colbert automatically creates the collection.db file to enable text retrieval in search results. If you pass --index-path with an existing index, search results will include text snippets only if collection.db is present in the index directory.

HTTP API

GET /search

Parameters:

  • query (string, required): The search string
  • k (integer, optional): Top-k results (default: 10, max: 100)

Response:

{
  "query": "...",
  "topk": [
    {
      "pid": "<document_id>",
      "score": 0.95,
      "text": "...",
      "prob": 0.87
    }
  ]
}

Note: The text field is included if the collection mapping is available (e.g., from a delimited TXT or collection.db).

Design Notes

  • Functional approach: Modules expose pure functions; the CLI composes them
  • Minimal dependencies: FastAPI for the web layer, Uvicorn as the ASGI server, PyLate for the model and index, PyMuPDF for PDF parsing
  • Persistent caching: When processing directories, a compressed cache file (.stfo_colbert_cache.txt.xz) is saved in the dataset directory for faster subsequent runs

Development

Install in editable mode:

pip install -e .

Run tests:

pip install pytest
pytest

Examples

Using the included example data

Index Wikipedia summaries and query for specific topics:

# Start the server with Wikipedia summaries
stfo-colbert --dataset-path example_data/wikipedia_summaries.txt

# Query for movies
curl "http://127.0.0.1:8889/search?query=Disney%20animated%20movies&k=3"

# Query for sports
curl "http://127.0.0.1:8889/search?query=Olympic%20track%20and%20field%20events&k=5"

Index arXiv PDFs and search research papers:

# Start the server with PDF directory
stfo-colbert --dataset-path example_data/arxiv_sample

# Search for AI/ML topics
curl "http://127.0.0.1:8889/search?query=machine%20learning%20transformers&k=5"

# Search for specific research areas
curl "http://127.0.0.1:8889/search?query=neural%20network%20architecture&k=3"

Index large Wikipedia dataset:

# First, download and prepare the Wikipedia 20231101.en dataset
# Note: This is a large dataset (~20 GB) and will take time to download
python example_data/wikipedia_20231101_en.py

# Index the Wikipedia dataset with streaming (handles large datasets efficiently)
# The data is processed in chunks to avoid memory issues, but indexing will still take a long time
stfo-colbert --dataset-path wikipedia_20231101_en_shuffled.txt --chunk-size 10000

# Search for topics in Wikipedia
curl "http://127.0.0.1:8889/search?query=machine%20learning%20history&k=5"

The wikipedia_20231101_en.py script:

  • Downloads the Wikipedia 20231101.en dataset from Hugging Face
  • Shuffles it with a buffer size of 100,000 (good for building index centroids)
  • Formats it as a delimited text file compatible with stfo-colbert
  • Uses streaming to avoid loading the entire dataset into memory
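
In outline, such a preparation script could look like the sketch below. The Hugging Face dataset id (wikimedia/wikipedia), the "text" field name, and the output filename are assumptions inferred from the description above, not a copy of the bundled script.

# Sketch: prepare the Wikipedia 20231101.en dump as a delimited corpus.
# Dataset id, field name, and output path are assumptions; see
# example_data/wikipedia_20231101_en.py for the real version.
from datasets import load_dataset

DELIMITER = "\n\n--------\n\n"

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
ds = ds.shuffle(buffer_size=100_000, seed=42)  # shuffling helps build representative index centroids

with open("wikipedia_20231101_en_shuffled.txt", "w", encoding="utf-8") as f:
    first = True
    for article in ds:
        text = article["text"].replace(DELIMITER, " ")
        if not first:
            f.write(DELIMITER)
        f.write(text)
        first = False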

General usage examples

Index directory of Markdown notes and serve on port 7777:

stfo-colbert --dataset-path ~/notes --port 7777

Serve existing index folder:

stfo-colbert --index-path ./experiments/wiki_index --port 8889

