stfo-colbert

Straightforward ColBERT indexing and serving (if you need a development ColBERT server)

Design Goals

  • Straightforward: Single-command usage via the CLI (stfo is short for "straightforward")
  • Minimal: Readable, functional code with minimal default dependencies
  • Simple: One HTTP endpoint only: GET /search
  • For development usage: Suitable for anyone who needs an ad hoc semantic search server

When to Use

Use stfo-colbert when you:

  • Have a small-to-medium collection and want a simple way to build a ColBERT-style index (via PyLate) and query it over HTTP
  • Prefer a one-shot CLI to index and serve, without additional orchestration

Installation

From PyPI

pip install stfo-colbert

From source (development)

git clone <repository-url>
cd stfo_colbert
pip install -e .

Quickstart

1. Install the package

pip install stfo-colbert

2. Run the CLI (index and serve)

stfo-colbert \
  --dataset-path /path/to/dataset.txt

3. Query the API

curl "http://127.0.0.1:8889/search?query=hello&k=2"

4. Example response

{
  "query": "hello",
  "topk": [
    {
      "pid": "1",
      "rank": 0,
      "score": 0.92,
      "text": "Hello world! This is a sample document.",
      "prob": 0.51
    },
    {
      "pid": "2",
      "rank": 1,
      "score": 0.87,
      "text": "A friendly hello from another document.",
      "prob": 0.49
    }
  ]
}
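
For programmatic access, a small Python client along the lines of the sketch below is enough; it relies only on the GET /search endpoint and the query/k parameters documented on this page, and assumes the default host and port.

import json
import urllib.parse
import urllib.request

def search(query, k=10, base_url="http://127.0.0.1:8889"):
    # Query the stfo-colbert /search endpoint and return the parsed JSON response.
    params = urllib.parse.urlencode({"query": query, "k": k})
    with urllib.request.urlopen(f"{base_url}/search?{params}") as resp:
        return json.load(resp)

results = search("hello", k=2)
for hit in results["topk"]:
    # "text" is only present when a collection mapping is available (see below)
    print(hit["score"], hit.get("text", ""))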

CLI Reference

stfo-colbert [options]

Options

  • --port: Port to serve on (default: 8889)
  • --model-name: Hugging Face model id/name (default: mixedbread-ai/mxbai-edge-colbert-v0-17m)
  • --index-path: Path to an existing PyLate index directory; mutually exclusive with --dataset-path (no default)
  • --dataset-path: Path to a dataset for index creation, either a file or a directory; mutually exclusive with --index-path (no default)
  • --batch-size: Batch size for encoding (default: 64)
  • --chunk-size: Number of documents to accumulate before encoding (default: 10000)

Usage Patterns

Serve an existing index:

stfo-colbert --index-path ./experiments/my_index --port 8889

Build from a delimited TXT, then serve:

stfo-colbert --dataset-path ./data/my_corpus.txt --port 8889

Build from a directory of docs, then serve:

stfo-colbert --dataset-path ./docs_dir --port 8889

Dataset Formats

1. Delimited text file (default)

A plain text file in which documents are separated by the delimiter: \n\n--------\n\n

Example:

Document one text

--------

Document two text

Note: Any occurrences of the delimiter inside documents are removed during preprocessing to avoid boundary confusion.
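
To illustrate, the short Python sketch below writes a list of documents in this delimited format; the filename and document list are just placeholders.

# Write documents to a delimited text file that stfo-colbert can index.
DELIMITER = "\n\n--------\n\n"

docs = [
    "Document one text",
    "Document two text",
]

with open("my_corpus.txt", "w", encoding="utf-8") as f:
    # Mirror the preprocessing note above: strip any delimiter occurrences inside documents.
    f.write(DELIMITER.join(doc.replace(DELIMITER, " ") for doc in docs))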

2. Directory of document files

When --dataset-path points to a directory, stfo-colbert will scan for files and create a compressed cache file (.stfo_colbert_cache.txt.xz) in that directory. On later runs, this cache is reused instead of re-parsing all files, significantly speeding up initialization.

Supported file types:

  • .txt, .md
  • .pdf

Cache behavior:

  • The cache file is automatically created after the first directory scan
  • To force a re-scan, delete the .stfo_colbert_cache.txt.xz file from the dataset directory
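
For example, a re-scan can be forced from Python by deleting the cache file; the dataset directory below is a placeholder.

from pathlib import Path

# Remove the directory-scan cache so the next run re-parses all files.
dataset_dir = Path("./docs_dir")  # placeholder path
cache_file = dataset_dir / ".stfo_colbert_cache.txt.xz"
if cache_file.exists():
    cache_file.unlink()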

Index Format

stfo-colbert uses PyLate's PLAID index under the hood:

  • Loads the model (default: mixedbread-ai/mxbai-edge-colbert-v0-17m)
  • Encodes documents in chunks and builds an index incrementally
  • Serves top-k retrieval via a simple HTTP API

The index directory contains:

  • PLAID index files: The core PyLate index structure
  • collection.db: A SQLite database mapping document IDs to their text content
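
If you want to peek inside collection.db, the sqlite3 sketch below lists its tables and schema first, since the exact table and column names are not documented here; the index path is a placeholder.

import sqlite3

# Inspect the SQLite collection mapping stored alongside the PLAID index files.
# The schema is not documented here, so list the tables before querying them.
con = sqlite3.connect("experiments/my_index/collection.db")
for name, sql in con.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)
    print(sql)
con.close()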

Streaming and Chunked Processing

To handle large datasets efficiently, stfo-colbert processes documents in chunks:

  • Documents are streamed from the dataset (not loaded entirely into memory)
  • Each chunk is encoded and added to the index incrementally
  • The collection mapping is saved to SQLite progressively during indexing
  • Default chunk size is 10,000 documents (configurable via --chunk-size)

This approach enables indexing of large datasets (e.g., entire Wikipedia) without running out of memory.
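
Conceptually, the chunked flow resembles the sketch below. It only illustrates the streaming pattern: stream_documents and index_in_chunks are hypothetical names, and the encode-and-index step is passed in as a callable rather than guessing at PyLate's API.

DELIMITER = "\n\n--------\n\n"

def stream_documents(path):
    # Yield documents one at a time from a delimited text file,
    # without reading the whole file into memory.
    with open(path, encoding="utf-8") as f:
        buffer = ""
        for line in f:
            buffer += line
            while DELIMITER in buffer:
                doc, buffer = buffer.split(DELIMITER, 1)
                if doc.strip():
                    yield doc.strip()
        if buffer.strip():
            yield buffer.strip()

def index_in_chunks(path, encode_and_add, chunk_size=10_000):
    # encode_and_add is a callable that encodes a list of documents and adds
    # them to the index; it is left abstract here rather than assumed.
    chunk = []
    for doc in stream_documents(path):
        chunk.append(doc)
        if len(chunk) >= chunk_size:
            encode_and_add(chunk)
            chunk = []
    if chunk:
        encode_and_add(chunk)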

When you build an index from documents, stfo-colbert automatically creates the collection.db file to enable text retrieval in search results. If you pass --index-path with an existing index, search results will include text snippets only if collection.db is present in the index directory.

HTTP API

GET /search

Parameters:

  • query (string, required): The search string
  • k (integer, optional): Top-k results (default: 10, max: 100)

Response:

{
  "query": "...",
  "topk": [
    {
      "pid": "<document_id>",
      "score": 0.95,
      "text": "...",
      "prob": 0.87
    }
  ]
}

Note: The text field is included if the collection mapping is available (e.g., from a delimited TXT or collection.db).

Design Notes

  • Functional approach: Modules expose pure functions; the CLI composes them
  • Minimal dependencies: FastAPI for the web layer, Uvicorn as the ASGI server, PyLate for the model and index, PyMuPDF for PDF parsing
  • Persistent caching: When processing directories, a compressed cache file (.stfo_colbert_cache.txt.xz) is saved in the dataset directory for faster subsequent runs

Development

Install in editable mode:

pip install -e .

Run tests:

pip install pytest
pytest

Examples

Using the included example data

Index Wikipedia summaries and query for specific topics:

# Start the server with Wikipedia summaries
stfo-colbert --dataset-path example_data/wikipedia_summaries.txt

# Query for movies
curl "http://127.0.0.1:8889/search?query=Disney%20animated%20movies&k=3"

# Query for sports
curl "http://127.0.0.1:8889/search?query=Olympic%20track%20and%20field%20events&k=5"

Index arXiv PDFs and search research papers:

# Start the server with PDF directory
stfo-colbert --dataset-path example_data/arxiv_sample

# Search for AI/ML topics
curl "http://127.0.0.1:8889/search?query=machine%20learning%20transformers&k=5"

# Search for specific research areas
curl "http://127.0.0.1:8889/search?query=neural%20network%20architecture&k=3"

Index large Wikipedia dataset:

# First, download and prepare the Wikipedia 20231101.en dataset
# Note: This is a large dataset (~20 GB) and will take time to download
python example_data/wikipedia_20231101_en.py

# Index the Wikipedia dataset with streaming (handles large datasets efficiently)
# The data is processed in chunks to avoid memory issues, but indexing will still take a long time
stfo-colbert --dataset-path wikipedia_20231101_en_shuffled.txt --chunk-size 10000

# Search for topics in Wikipedia
curl "http://127.0.0.1:8889/search?query=machine%20learning%20history&k=5"

The wikipedia_20231101_en.py script:

  • Downloads the Wikipedia 20231101.en dataset from Hugging Face
  • Shuffles it with a buffer size of 100,000 (good for building index centroids)
  • Formats it as a delimited text file compatible with stfo-colbert
  • Uses streaming to avoid loading the entire dataset into memory
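
In outline, such a preparation script could look like the sketch below. The Hugging Face dataset id (wikimedia/wikipedia), the "text" field name, and the output filename are assumptions inferred from the description above, not a copy of the bundled script.

# Sketch: prepare the Wikipedia 20231101.en dump as a delimited corpus.
# Dataset id, field name, and output path are assumptions; see
# example_data/wikipedia_20231101_en.py for the real version.
from datasets import load_dataset

DELIMITER = "\n\n--------\n\n"

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
ds = ds.shuffle(buffer_size=100_000, seed=42)  # shuffling helps build representative index centroids

with open("wikipedia_20231101_en_shuffled.txt", "w", encoding="utf-8") as f:
    first = True
    for article in ds:
        text = article["text"].replace(DELIMITER, " ")
        if not first:
            f.write(DELIMITER)
        f.write(text)
        first = False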

General usage examples

Index directory of Markdown notes and serve on port 7777:

stfo-colbert --dataset-path ~/notes --port 7777

Serve existing index folder:

stfo-colbert --index-path ./experiments/wiki_index --port 8889

