Search indices (mainly to be combined with RDF query engines) backed by Rust

Maintainers

ad-freiburg

Search RDF

Rust library with a restricted Python interface for building and querying search indices, primarily intended for use with RDF query engines.

Getting Started

Installation

Build from source using Cargo:

cargo build --release

The binary will be available at target/release/search-rdf.

CLI Overview

The search-rdf CLI provides commands to build and serve search indices. All commands require a YAML configuration file.

search-rdf [OPTIONS] [CONFIG] [COMMAND]

Commands:
  data    Download and prepare data
  embed   Generate embeddings for data
  index   Build search indices
  serve   Serve indices via HTTP

Options:
      --force    Force rebuild even if output exists
  -v, --verbose  Enable verbose/debug logging
  -q, --quiet    Suppress info messages (errors and warnings only)
  -h, --help     Print help
  -V, --version  Print version

Running All Steps

To run the complete pipeline (data → embed → index → serve):

search-rdf config.yaml

Running Individual Steps

# Step 1: Download/prepare data
search-rdf data config.yaml

# Step 2: Generate embeddings
search-rdf embed config.yaml

# Step 3: Build indices
search-rdf index config.yaml

# Step 4: Start HTTP server
search-rdf serve config.yaml

Use --force to rebuild outputs even if they already exist:

search-rdf index config.yaml --force
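
When the steps are driven from a script, the command lines above can be assembled programmatically. The following Python sketch only builds the argument lists shown above and optionally runs them in order. The helper names `pipeline_commands` and `run_pipeline` are assumptions made for this illustration and are not part of search-rdf.

```python
import subprocess

# Step names as listed in the CLI overview, in pipeline order.
STEPS = ("data", "embed", "index", "serve")


def pipeline_commands(config, steps=STEPS, force=False):
    """Return one argv list per step, mirroring the commands shown above."""
    commands = []
    for step in steps:
        argv = ["search-rdf", step, config]
        if force:
            # Same placement as in `search-rdf index config.yaml --force`.
            argv.append("--force")
        commands.append(argv)
    return commands


def run_pipeline(config, steps=STEPS, force=False):
    """Run the steps in order and stop at the first failing command."""
    for argv in pipeline_commands(config, steps, force):
        subprocess.run(argv, check=True)
```

For example, `pipeline_commands("config.yaml", ("data", "embed", "index"))` yields the three build steps without starting the server.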

Configuration File Format

The configuration file is written in YAML and has five main sections: datasets, models, embeddings, indices, and server.

Datasets

Defines data sources to be indexed. Each dataset produces a data directory used by indices.

datasets:
  - name: my-dataset           # Unique identifier
    output: data/              # Output directory for processed data
    source:
      # Option 1: SPARQL query against an endpoint
      type: sparql-query
      endpoint: https://query.wikidata.org/sparql
      query: |
        SELECT ?item ?label WHERE {
          ?item rdfs:label ?label .
        }
        LIMIT 1000
      format: json             # json, xml, or tsv
      default_field_type: text # text, image, or image-inline
      headers:                 # Optional HTTP headers
        User-Agent: MyApp/1.0

      # Option 2: Local SPARQL results file (use instead of the keys above)
      # type: sparql
      # path: results.json
      # format: json
      # default_field_type: text

      # Option 3: JSONL file (use instead of the keys above)
      # type: jsonl
      # path: data.jsonl

SPARQL queries must return exactly two columns: an identifier (first column) and a field value (second column). Multiple rows with the same identifier add multiple fields to that item.
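
The two-column rule can be pictured as a simple grouping step. The sketch below is only an illustration of that rule, not the library's actual loader; the function name `group_fields` and the sample rows are made up for this example.

```python
from collections import defaultdict


def group_fields(rows):
    """Group (identifier, value) rows into identifier -> list of field values."""
    items = defaultdict(list)
    for row in rows:
        if len(row) != 2:
            raise ValueError(f"expected exactly 2 columns, got {len(row)}")
        identifier, value = row
        # Rows sharing an identifier contribute additional fields to one item.
        items[identifier].append(value)
    return dict(items)


# Illustrative rows only: first column is the identifier, second the field value.
rows = [
    ("http://www.wikidata.org/entity/Q42", "Douglas Adams"),
    ("http://www.wikidata.org/entity/Q42", "Douglas Noel Adams"),
    ("http://www.wikidata.org/entity/Q5", "human"),
]
items = group_fields(rows)
```

Here `items` has two entries, and the entry for Q42 carries two fields.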

Models

Defines embedding models used to generate vector representations.

models:
  # vLLM server (recommended for large-scale embedding)
  - name: my-vllm-model
    type: vllm
    endpoint: http://localhost:8000
    model_name: mixedbread-ai/mxbai-embed-large-v1

  # Sentence Transformers (local inference)
  - name: my-local-model
    type: sentence-transformer
    model_name: sentence-transformers/all-MiniLM-L6-v2
    device: cuda                # cpu, cuda, or mps (default: cpu)
    batch_size: 16              # Inference batch size (default: 16)

  # HuggingFace image models
  - name: my-image-model
    type: huggingface-image
    model_name: openai/clip-vit-base-patch32
    device: cuda
    batch_size: 16

  # OpenCLIP multimodal models (text + image in shared space)
  - name: my-clip-model
    type: open-clip
    model: hf-hub:timm/ViT-B-16-SigLIP2
    device: cuda
    batch_size: 32

Optional embedding parameters can be added to any model:

models:
  - name: my-model
    type: vllm
    endpoint: http://localhost:8000
    model_name: mixedbread-ai/mxbai-embed-large-v1
    params:
      num_dimensions: 512      # Truncate embeddings (for MRL models)
      normalize: true          # L2 normalize embeddings (default: true)

Embeddings

Defines embedding generation jobs that use models to embed dataset fields.

embeddings:
  - name: my-embeddings
    model: my-vllm-model       # Reference to model name
    data: data/                # Input data directory
    output: data/embeddings.safetensors
    batch_size: 64             # Processing batch size (default: 64)

Indices

Defines search indices to build from data and embeddings.

indices:
  # Keyword index (exact token matching with BM25 scoring)
  - name: keyword-index
    type: keyword
    data: data/
    output: index/keyword/

  # Full-text index (Tantivy-based with stemming/tokenization)
  - name: fulltext-index
    type: full-text
    data: data/
    output: index/fulltext/

  # Embedding index with data (semantic search)
  - name: embedding-index
    type: embedding-with-data
    data: data/
    embedding_data: data/embeddings.safetensors
    output: index/embedding/
    model: my-vllm-model       # For query embedding at search time

  # Embedding-only index (no associated text data)
  - name: embedding-only
    type: embedding
    embedding_data: data/embeddings.safetensors
    output: index/embedding-only/

Embedding index parameters:

indices:
  - name: embedding-index
    type: embedding-with-data
    data: data/
    embedding_data: data/embeddings.safetensors
    output: index/embedding/
    model: my-model
    params:
      metric: cosine-normalized  # cosine-normalized, cosine, inner-product, l2, hamming
      precision: bfloat16        # float32, float16, bfloat16, int8, binary
      connectivity: 16           # HNSW M parameter (default: 16)
      expansion_add: 128         # HNSW efConstruction (default: 128)
      expansion_search: 64       # HNSW ef (default: 64)
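
For orientation, the metric names correspond to well-known similarity and distance functions. The sketch below gives their textbook definitions in plain Python. How search-rdf turns distances into scores is not described here, and reading `cosine-normalized` as "cosine on vectors that are already unit length" (where cosine equals the inner product) is an assumption.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    denom = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / denom if denom else 0.0


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))
```

For unit-length inputs, `cosine(a, b)` and `inner_product(a, b)` agree up to rounding, which is why normalized embeddings allow the cheaper inner product to be used.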

Server

Configures the HTTP server for serving indices.

server:
  host: 0.0.0.0                 # Bind address (default: 127.0.0.1)
  port: 8080                    # Port (default: 8080)
  cors: true                    # Enable CORS (default: false)
  max_input_size: 100MB         # Max request size (default: 100MB)
  indices:                      # Indices to serve
    - keyword-index
    - embedding-index
  sparql:                       # Optional: Enable SPARQL service endpoints
    prefix: "http://example.org/"

HTTP API

When the server is running, the following endpoints are available:

Health Check

GET /health

Returns 200 OK if the server is running.

List Indices

GET /indices

Returns a list of available index names.
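
A small client for these two endpoints might look like the following Python sketch, which uses only the standard library. The base URL is whatever host and port the server section configures. Parsing the `/indices` response as JSON is an assumption, since the exact response body is not specified above.

```python
import json
import urllib.request


def get_request(base_url, path):
    """Build a GET request for an endpoint path such as /health."""
    return urllib.request.Request(base_url.rstrip("/") + path, method="GET")


def is_healthy(base_url, timeout=5.0):
    """Return True if GET /health answers with 200 OK."""
    try:
        with urllib.request.urlopen(get_request(base_url, "/health"), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and HTTP error statuses
        return False


def list_indices(base_url, timeout=5.0):
    """Return the parsed body of GET /indices (assumed to be JSON)."""
    with urllib.request.urlopen(get_request(base_url, "/indices"), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the example server settings, `is_healthy("http://localhost:8080")` would be the call to poll after starting `search-rdf serve`.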

Search

POST /search/{index_name}
Content-Type: application/json

The request body contains a queries array and search parameters.

Value queries (text, image URL, or base64 image):

{
  "queries": [{"type": "value", "value": "search query"}],
  "k": 10
}

An optional modality field controls how the value is interpreted:

"text" — embed as text (default for text-only models)
"image" — load as image from URL and embed with vision encoder
"image-base64" — decode base64 image data and embed with vision encoder
"iri" — treat as an identifier for neighbor search

When modality is omitted, it is inferred from the model and value content:

Text-only models (vLLM, sentence-transformer): always text
Image-only models (huggingface-image): image URL or base64
Multimodal models (open-clip): image if value looks like a URL, otherwise text

{"queries": [{"type": "value", "value": "https://example.com/image.jpg", "modality": "image"}], "k": 10}
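
The inference rules above can be summarized as a small decision function. This Python sketch is a reading of those rules, not the server's code. In particular, treating "looks like a URL" as "starts with http:// or https://" is an assumption, and `infer_modality` is a hypothetical name.

```python
TEXT_ONLY = {"vllm", "sentence-transformer"}
IMAGE_ONLY = {"huggingface-image"}
MULTIMODAL = {"open-clip"}


def looks_like_url(value):
    # Assumption: the real check may be stricter or looser than this.
    return value.startswith(("http://", "https://"))


def infer_modality(model_type, value):
    """Pick a modality for a value query when none is given explicitly."""
    if model_type in TEXT_ONLY:
        return "text"
    if model_type in IMAGE_ONLY:
        return "image" if looks_like_url(value) else "image-base64"
    if model_type in MULTIMODAL:
        return "image" if looks_like_url(value) else "text"
    raise ValueError(f"unknown model type: {model_type!r}")
```

Under these rules a multimodal model never infers `image-base64`, so base64 image queries against such a model need the explicit `modality` field.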

Identifier queries (neighbor search by known IRI):

{
  "queries": [{"type": "identifier", "value": "http://www.wikidata.org/entity/Q42"}],
  "k": 10
}

Pre-computed embedding queries:

{
  "queries": [{"type": "embedding", "value": [0.1, 0.2, 0.3]}],
  "k": 10
}
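
The three query shapes can be produced by small builder functions, which helps avoid typos in the `type` field. The following Python sketch mirrors the JSON bodies shown above; the function names are assumptions for this illustration.

```python
import json


def value_query(value, modality=None):
    """Text, image URL, or base64 image query; modality is optional."""
    query = {"type": "value", "value": value}
    if modality is not None:
        query["modality"] = modality
    return query


def identifier_query(iri):
    """Neighbor search for an item already in the index."""
    return {"type": "identifier", "value": iri}


def embedding_query(vector):
    """Query with a pre-computed embedding."""
    return {"type": "embedding", "value": [float(x) for x in vector]}


def search_body(queries, k=10, **params):
    """Assemble the request body: a queries array plus search parameters."""
    body = {"queries": list(queries), "k": k}
    body.update(params)
    return body


payload = json.dumps(search_body([identifier_query("http://www.wikidata.org/entity/Q42")], k=10))
```

The resulting `payload` string is what would be sent as the body of `POST /search/{index_name}`.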

Search parameters vary by index type:

Keyword/Full-text indices:

k - Number of results (default: 10)

Embedding indices:

k - Number of results (default: 10)
min-score - Minimum similarity score filter
exact - Use exact search instead of approximate (default: false)
rerank - Reranking factor (retrieves k*rerank candidates, then reranks)
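
The interaction of `k`, `rerank`, and `min-score` can be illustrated with a toy example. This Python sketch assumes that reranking means re-scoring the `k*rerank` approximate candidates with a more exact score and that the minimum score is applied to the re-scored results; the source does not spell out either detail, so treat this as one plausible reading.

```python
def rerank_search(candidates, exact_score, k=10, rerank=1, min_score=None):
    """Re-score the top k*rerank approximate candidates and keep the best k.

    candidates: list of (id, approx_score), sorted by approx_score descending.
    exact_score: callable mapping an id to its exact score.
    """
    pool = candidates[: k * rerank]
    rescored = [(item_id, exact_score(item_id)) for item_id, _ in pool]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        rescored = [pair for pair in rescored if pair[1] >= min_score]
    return rescored[:k]


# Toy data: approximate scores rank item 1 first, exact scores prefer item 2.
approx = [(1, 0.9), (2, 0.8), (3, 0.7), (4, 0.6)]
exact = {1: 0.5, 2: 0.95, 3: 0.85, 4: 0.99}
top = rerank_search(approx, exact.get, k=1, rerank=3)
```

With `rerank=1` the same call would return item 1, because only the single best approximate candidate is re-scored.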

Response format:

{
  "matches": [
    [
      {"id": 42, "score": 0.95},
      {"id": 17, "score": 0.87}
    ]
  ]
}
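
Reading the response is mostly a matter of walking the nested `matches` list. The Python sketch below assumes the outer list holds one entry per query in request order, which is what the nesting suggests but is not stated explicitly; `best_matches` is a hypothetical helper.

```python
import json


def best_matches(response_text):
    """Return the top (id, score) pair per query, or None if a query has no matches."""
    response = json.loads(response_text)
    best = []
    for matches in response["matches"]:
        if matches:
            top = max(matches, key=lambda m: m["score"])
            best.append((top["id"], top["score"]))
        else:
            best.append(None)
    return best


sample = '{"matches": [[{"id": 42, "score": 0.95}, {"id": 17, "score": 0.87}]]}'
result = best_matches(sample)
```

For the sample response above, `result` contains a single pair for the single query.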

SPARQL Service (optional)

When sparql is configured in the server section:

POST /sparql/{index_name}
POST /sparql/qlproxy/{index_name}

These endpoints enable integration with SPARQL engines that support federated queries.

Example Configuration

Here's a complete example that sets up keyword and semantic search over the English labels of Wikidata entities that are instances of human (Q5):

datasets:
  - name: wikidata-humans
    output: data/
    source:
      type: sparql-query
      endpoint: https://query.wikidata.org/sparql
      query: |
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX wd: <http://www.wikidata.org/entity/>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>
        SELECT ?item ?label WHERE {
          ?item wdt:P31 wd:Q5 .
          ?item rdfs:label ?label .
          FILTER(LANG(?label) = "en")
        }
        LIMIT 10000
      format: json
      default_field_type: text

models:
  - name: text-embedding
    type: vllm
    endpoint: http://localhost:8000
    model_name: mixedbread-ai/mxbai-embed-xsmall-v1

embeddings:
  - name: wikidata-embeddings
    model: text-embedding
    data: data/
    output: data/embeddings.safetensors
    batch_size: 128

indices:
  - name: keyword
    type: keyword
    data: data/
    output: index/keyword/

  - name: semantic
    type: embedding-with-data
    data: data/
    embedding_data: data/embeddings.safetensors
    output: index/semantic/
    model: text-embedding
    params:
      metric: cosine-normalized
      precision: bfloat16

server:
  host: 0.0.0.0
  port: 8080
  cors: true
  indices:
    - keyword
    - semantic

Run with:

# Build everything and start serving
search-rdf config.yaml

# Or run steps individually
search-rdf data config.yaml
search-rdf embed config.yaml
search-rdf index config.yaml
search-rdf serve config.yaml

Test with curl:

# Keyword search
curl -X POST http://localhost:8080/search/keyword \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"type": "value", "value": "Albert Einstein"}], "k": 5}'

# Semantic search
curl -X POST http://localhost:8080/search/semantic \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"type": "value", "value": "famous physicist"}], "k": 5}'
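
The same requests can be issued from Python with the standard library. This sketch mirrors the curl calls above; `search_request` and `search` are hypothetical helper names, and the server address is the one from the example configuration.

```python
import json
import urllib.parse
import urllib.request


def search_request(base_url, index_name, query_text, k=5):
    """Build the POST request for a single value query."""
    body = {"queries": [{"type": "value", "value": query_text}], "k": k}
    url = f"{base_url.rstrip('/')}/search/{urllib.parse.quote(index_name, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url, index_name, query_text, k=5, timeout=30.0):
    """Send the request and return the parsed JSON response."""
    request = search_request(base_url, index_name, query_text, k)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `search("http://localhost:8080", "semantic", "famous physicist")` corresponds to the second curl command.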

Release history

- 0.5.3: Mar 23, 2026
- 0.5.2: Mar 13, 2026
- 0.5.1: Mar 13, 2026
- 0.5.0: Mar 12, 2026
- 0.4.0: Mar 10, 2026
- 0.3.1: Mar 5, 2026
- 0.3.0: Mar 4, 2026 (this version)
- 0.2.1: Mar 2, 2026
- 0.2.0: Feb 19, 2026
- 0.1.1: Feb 19, 2026
- 0.1.0: Feb 19, 2026

Download files

Source Distribution

search_rdf-0.3.0.tar.gz (3.9 MB)

Uploaded Mar 4, 2026. Tags: Source

Built Distributions

search_rdf-0.3.0-cp312-abi3-win_amd64.whl (2.4 MB)

Uploaded Mar 4, 2026. Tags: CPython 3.12+, Windows x86-64

search_rdf-0.3.0-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (5.0 MB)

Uploaded Mar 4, 2026. Tags: CPython 3.12+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

search_rdf-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (2.9 MB)

Uploaded Mar 4, 2026. Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64

File details

Details for the file search_rdf-0.3.0.tar.gz.

File metadata

Download URL: search_rdf-0.3.0.tar.gz
Upload date: Mar 4, 2026
Size: 3.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for search_rdf-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`754abe820a66b27c1e59e1c34736490c5bc1bd44e76a4a5dff1ad1312764474b`
MD5	`cad0ec77e614fb49c79ceaa0fcc13a84`
BLAKE2b-256	`e511926374c240b8727021ccafd7fa24c2d7a1dc6808f5aa994fb5686d628251`
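
A downloaded file can be checked against the SHA256 digest listed above with a few lines of Python. This is a generic sketch using the standard `hashlib` module; the helper names and the local file path passed in are up to the caller.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256):
    """Return True if the file's digest matches the expected hex string."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

For the source distribution, `verify("search_rdf-0.3.0.tar.gz", "754abe820a66b27c1e59e1c34736490c5bc1bd44e76a4a5dff1ad1312764474b")` should return `True` for an intact download.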

Provenance

The following attestation bundles were made for search_rdf-0.3.0.tar.gz:

Publisher: release.yml on bastiscode/search-rdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: search_rdf-0.3.0.tar.gz
- Subject digest: 754abe820a66b27c1e59e1c34736490c5bc1bd44e76a4a5dff1ad1312764474b
- Sigstore transparency entry: 1030053317
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: bastiscode/search-rdf@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bastiscode
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Trigger Event: push

File details

Details for the file search_rdf-0.3.0-cp312-abi3-win_amd64.whl.

File metadata

Download URL: search_rdf-0.3.0-cp312-abi3-win_amd64.whl
Upload date: Mar 4, 2026
Size: 2.4 MB
Tags: CPython 3.12+, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for search_rdf-0.3.0-cp312-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`4813be5755782ee072be3afdc3d570a58f8b18b738dbbfb8f2d286d1eea0a7b1`
MD5	`1db5b927c9c1d8b29c1cb9e465b79cab`
BLAKE2b-256	`c41b00b5c416296bd0283df4730727183ac0c964bc4d2a9ba27a02d1be03bcbb`

Provenance

The following attestation bundles were made for search_rdf-0.3.0-cp312-abi3-win_amd64.whl:

Publisher: release.yml on bastiscode/search-rdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: search_rdf-0.3.0-cp312-abi3-win_amd64.whl
- Subject digest: 4813be5755782ee072be3afdc3d570a58f8b18b738dbbfb8f2d286d1eea0a7b1
- Sigstore transparency entry: 1030053452
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: bastiscode/search-rdf@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bastiscode
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Trigger Event: push

File details

Details for the file search_rdf-0.3.0-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File metadata

Download URL: search_rdf-0.3.0-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Upload date: Mar 4, 2026
Size: 5.0 MB
Tags: CPython 3.12+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for search_rdf-0.3.0-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm	Hash digest
SHA256	`6e3d6e30fc9cc5a189642a70cf68b9bdf97d10a692330951ab557a62e514a5cf`
MD5	`f4126efb750d7e0fc84b9a911d26f4cc`
BLAKE2b-256	`b7e486c374e94524b2cb7b6209798d2d913c1b756500e7c3eed606192bded0f0`

Provenance

The following attestation bundles were made for search_rdf-0.3.0-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:

Publisher: release.yml on bastiscode/search-rdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: search_rdf-0.3.0-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
- Subject digest: 6e3d6e30fc9cc5a189642a70cf68b9bdf97d10a692330951ab557a62e514a5cf
- Sigstore transparency entry: 1030053395
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: bastiscode/search-rdf@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bastiscode
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Trigger Event: push

File details

Details for the file search_rdf-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

Download URL: search_rdf-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl
Upload date: Mar 4, 2026
Size: 2.9 MB
Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for search_rdf-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`d100bcddd6b979a8d5add754ef5871f9c66fc034e9e423baee561d2b9b9c96aa`
MD5	`007cd3b5ba7d6961a446d43e0da73cee`
BLAKE2b-256	`36f621aa5027b571d0fc8a5967ffd5fd060b855d71c5e33670fcbf46cc4fa3cf`

Provenance

The following attestation bundles were made for search_rdf-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl:

Publisher: release.yml on bastiscode/search-rdf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: search_rdf-0.3.0-cp311-cp311-manylinux_2_28_x86_64.whl
- Subject digest: d100bcddd6b979a8d5add754ef5871f9c66fc034e9e423baee561d2b9b9c96aa
- Sigstore transparency entry: 1030053354
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: bastiscode/search-rdf@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bastiscode
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d3fb58ed4f5c6aa175314b3590b31d483b0e73ab
- Trigger Event: push
