
scibite-toolkit - Python library for calling SciBite applications: TERMite, TExpress, SciBite Search, CENtree and Workbench. The library also supports processing the JSON results returned by those requests.


SciBite Toolkit

Python library for making API calls to SciBite's suite of products and processing the JSON responses.

Supported Products

  • TERMite - Entity recognition and semantic enrichment (version 6.x)
  • TERMite 7 - Next-generation entity recognition with modern OAuth2 authentication
  • TExpress - Pattern-based entity relationship extraction
  • SciBite Search - Semantic search, document and entity analytics
  • CENtree - Ontology management, navigation, and integration
  • CENtree VectorDB Uploader - Upload ontology embedding CSVs from S3 or local files to Qdrant
  • CENtree Vector Generator - End-to-end ontology→embedding CSV pipeline
  • CENtree Ontology ML - OWL→sentence corpus, embedding generation, and Qdrant indexing
  • Workbench - Dataset annotation and management

Installation

pip install scibite-toolkit

See versions on PyPI

Quick Start Examples


TERMite 7 Examples

TERMite 7 is the modern version, with enhanced OAuth2 authentication and an improved API.

OAuth2 Client Credentials (SaaS - Recommended)

For modern SaaS deployments using a separate authentication server:

from scibite_toolkit import termite7

# Initialize with context manager for automatic cleanup
with termite7.Termite7RequestBuilder() as t:
    # Set URLs
    t.set_url('https://termite.saas.scibite.com')
    t.set_token_url('https://auth.saas.scibite.com')

    # Authenticate with OAuth2 client credentials
    if not t.set_oauth2('your_client_id', 'your_client_secret'):
        print("Authentication failed!")
        raise SystemExit(1)  # exit() is meant for interactive sessions

    # Annotate text
    t.set_entities('DRUG,INDICATION')
    t.set_subsume(True)
    t.set_text('Aspirin is used to treat headaches and reduce inflammation.')

    response = t.annotate_text()

    # Process the response
    df = termite7.process_annotation_output(response)
    print(df.head())

OAuth2 Password Grant (Legacy)

For on-premise deployments using username/password authentication:

from scibite_toolkit import termite7

t = termite7.Termite7RequestBuilder()

# Set main TERMite URL and token URL (same server for legacy)
t.set_url('https://termite.example.com')
t.set_token_url('https://termite.example.com')

# Authenticate with username and password
if not t.set_oauth2_legacy('client_id', 'username', 'password'):
    print("Authentication failed!")
    raise SystemExit(1)  # exit() is meant for interactive sessions

# Annotate a document
t.set_entities('INDICATION,DRUG')
t.set_parser_id('generic')
t.set_file('path/to/document.pdf')

response = t.annotate_document()

# Process the response
df = termite7.process_annotation_output(response)
print(df)

# Clean up file handles
t.close()

Get System Status

from scibite_toolkit import termite7

t = termite7.Termite7RequestBuilder()
t.set_url('https://termite.example.com')
t.set_token_url('https://auth.example.com')
t.set_oauth2('client_id', 'client_secret')

# Get system status
status = termite7.get_system_status(t.url, t.headers)
print(f"Server Version: {status['data']['serverVersion']}")

# Get available vocabularies
vocabs = termite7.get_vocabs(t.url, t.headers)
print(f"Available vocabularies: {len(vocabs['data'])}")

# Get runtime options
rtos = termite7.get_runtime_options(t.url, t.headers)
print(rtos)

TERMite 6 Examples

For legacy TERMite 6.x deployments.

SciBite Hosted (SaaS)

from scibite_toolkit import termite

# Initialize
t = termite.TermiteRequestBuilder()

# Configure
t.set_url('https://termite.saas.scibite.com')
t.set_saas_login_url('https://login.saas.scibite.com')

# Authenticate
t.set_auth_saas('username', 'password')

# Set runtime options
t.set_entities('INDICATION')
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
t.set_subsume(True)

# Execute and process
response = t.execute()
df = termite.get_termite_dataframe(response)
print(df.head(3))

Local Instance (Customer Hosted)

from scibite_toolkit import termite

t = termite.TermiteRequestBuilder()
t.set_url('https://termite.local.example.com')

# Basic authentication for local instances
t.set_basic_auth('username', 'password')

# Configure and execute
t.set_entities('INDICATION')
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
t.set_subsume(True)

response = t.execute()
df = termite.get_termite_dataframe(response)
print(df.head(3))

TExpress Examples

Pattern-based entity relationship extraction.

SciBite Hosted

from scibite_toolkit import texpress

t = texpress.TexpressRequestBuilder()

t.set_url('https://texpress.saas.scibite.com')
t.set_saas_login_url('https://login.saas.scibite.com')
t.set_auth_saas('username', 'password')

# Set pattern to find relationships
t.set_entities('INDICATION,DRUG')
t.set_pattern(':(DRUG):{0,5}:(INDICATION)')  # Find DRUG within 5 words of INDICATION
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')

response = t.execute()
df = texpress.get_texpress_dataframe(response)
print(df.head())

Local Instance

from scibite_toolkit import texpress

t = texpress.TexpressRequestBuilder()
t.set_url('https://texpress.local.example.com')
t.set_basic_auth('username', 'password')

t.set_entities('INDICATION,DRUG')
t.set_pattern(':(INDICATION):{0,5}:(INDICATION)')
t.set_input_format('pdf')
t.set_output_format('json')
t.set_binary_content('/path/to/file.pdf')

response = t.execute()
df = texpress.get_texpress_dataframe(response)
print(df.head())

SciBite Search Example

Semantic search with entity-based queries and aggregations.

from scibite_toolkit import scibite_search

# Configure
s = scibite_search.SBSRequestBuilder()
s.set_url('https://yourdomain-search.saas.scibite.com/')
s.set_auth_url('https://yourdomain.saas.scibite.com/')

# Authenticate with OAuth2
s.set_oauth2('your_client_id', 'your_client_secret')

# Search documents
query = 'schema_id="clinical_trial" AND (title~INDICATION$D011565 AND DRUG$*)'
# Preferred: request specific fields using the new 'fields' parameter (legacy: 'additional_fields')
response = s.get_docs(query=query, markup=True, limit=100, fields=['*'])

# Get co-occurrence aggregations: top 50 genes co-occurring with psoriasis
# (stored separately so the document results above are not overwritten)
aggregates = s.get_aggregates(
    query='INDICATION$D011565',
    vocabs=['HGNCGENE'],
    limit=50
)

Note: The preferred parameter name is fields. The legacy additional_fields parameter is still supported for backward compatibility. When both are provided, fields takes precedence.
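The precedence rule in the note can be pictured with a small stand-alone sketch. The helper resolve_fields below is hypothetical and is not part of scibite-toolkit; it only mirrors the documented behaviour that fields wins when both arguments are supplied, and the toolkit's internals may differ.

```python
from typing import List, Optional


def resolve_fields(
    fields: Optional[List[str]] = None,
    additional_fields: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Hypothetical helper: pick the field list that would be sent.

    Mirrors the documented rule only; not the toolkit's implementation.
    """
    if fields is not None:
        # The preferred parameter takes precedence when both are given.
        return list(fields)
    if additional_fields is not None:
        # The legacy parameter is honoured when 'fields' is absent.
        return list(additional_fields)
    return None


print(resolve_fields(fields=["title"], additional_fields=["abstract"]))
print(resolve_fields(additional_fields=["abstract"]))
```

In practice this means code that still passes additional_fields should keep working, and it can be migrated to fields one call at a time.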


CENtree Examples

Ontology navigation and search.

Modern Client (Recommended)

The modern centree_clients module provides better error handling, retries, and context manager support.

from scibite_toolkit.centree_clients import CENtreeReaderClient

# Use context manager for automatic cleanup
with CENtreeReaderClient(
    base_url="https://centree.example.com",
    bearer_token="your_token",
    timeout=(3.0, None)  # Quick connect, unlimited read
) as reader:

    # Search by exact label
    hits = reader.get_classes_by_exact_label("efo", "neuron")
    print(f"Found {len(hits)} matches")

    # Get ontology roots
    roots = reader.get_root_entities("efo", "classes", size=10)

    # Get paths from root to target (great for LLM grounding)
    paths = reader.get_paths_from_root("efo", "MONDO_0007739", as_="labels")
    for path in paths:
        print(" → ".join(path))

# Or authenticate with OAuth2
from scibite_toolkit.centree_clients import CENtreeReaderClient

reader = CENtreeReaderClient(base_url="https://centree.example.com")
if reader.set_oauth2(client_id="...", client_secret="..."):
    hits = reader.get_classes_by_exact_label("efo", "lung")
    print(hits)

CENtree VectorDB Uploader Examples

Upload ontology embedding CSVs from S3 or local files to Qdrant for vector search.

Qdrant version compatibility: The qdrant-client Python package must be within one minor version of your Qdrant server (e.g. client 1.7.x with server 1.7.x or 1.8.x). A mismatch may cause silent data corruption or connection errors. Pin the client version to match your server, for example: pip install qdrant-client==1.7.0
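A pre-flight check for this rule could look like the sketch below. It assumes one reading of the compatibility note (same major version, minor versions differing by at most one) and works on plain version strings; the function names are hypothetical and nothing here queries a real Qdrant server.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.7.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(client: str, server: str) -> bool:
    """Assumed rule: same major version, minor differs by at most one."""
    c_major, c_minor = parse_major_minor(client)
    s_major, s_minor = parse_major_minor(server)
    return c_major == s_major and abs(c_minor - s_minor) <= 1


print(versions_compatible("1.7.0", "1.8.2"))
print(versions_compatible("1.7.0", "1.10.0"))
```

The two version strings would have to come from your own environment, for example from the installed package metadata and from whatever your server reports.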

CLI Usage

# Upload all datasets under the configured S3 prefix
centree2vec-upload --config config.yaml

# Upload only specific ontologies
centree2vec-upload --config config.yaml --ontology efo mondo

# Upload local embedding files directly (no S3 required)
centree2vec-upload --config config.yaml --local efo_embeddings.csv.gz

# Replace existing vectors for each ontology before uploading
centree2vec-upload --config config.yaml --replace

# Combine --local and --replace to re-upload a single ontology
centree2vec-upload --config config.yaml --local efo_embeddings.csv.gz --replace

# Dry-run to preview which files would be processed
centree2vec-upload --config config.yaml --dry-run

# Public S3 bucket with anonymous access
centree2vec-upload --config config.yaml --anonymous

Python API

from scibite_toolkit.centree_vectordb_uploader import run, load_config

# Load YAML configuration
cfg = load_config("config.yaml")

# Run the upload pipeline
results = run(cfg)
for r in results:
    print(f"{r['ontology']}: {r['total_rows']} vectors uploaded")

# Replace existing vectors for each ontology before uploading
results = run(cfg, replace=True)

# Dry-run to inspect what would be uploaded
results = run(cfg, dry_run=True)

Generate a Starter Config

# Write the bundled example config to the current directory
centree2vec-upload --init

# Or specify a custom path
centree2vec-upload --init my-config.yaml

Configuration Reference

| Key | Type | Default | Description |
|---|---|---|---|
| qdrant.url | str | | Required. Qdrant server URL |
| qdrant.collection_name | str | | Required. Target collection name |
| qdrant.distance | str | cosine | Distance metric: cosine, euclid, dot, manhattan |
| qdrant.api_key_env | str | | Env var name holding the Qdrant API key |
| qdrant.hnsw_config.m | int | 32 | HNSW graph connectivity |
| qdrant.hnsw_config.ef_construct | int | 256 | HNSW index build search depth |
| qdrant.hnsw_config.full_scan_threshold | int | 10000 | Point count below which brute-force is used |
| s3.bucket | str | | Required (S3 mode). S3 bucket name |
| s3.prefix | str | | Required (S3 mode). S3 key prefix for embedding files |
| s3.anonymous | bool | false | Use unsigned requests for public buckets |
| s3.endpoint_url | str | | Custom S3-compatible endpoint URL |
| s3.region | str | eu-west-2 | AWS region |
| ingest.vector_size | int | 384 | Embedding dimension |
| ingest.batch_size | int | 1024 | Points per Qdrant upload batch |
| ingest.chunk_size | int | 500000 | Rows per pandas read chunk |
| ingest.parallel_uploads | int | 4 | Parallel upload threads |
| ingest.build_indices_after_upload | bool | true | Build payload indexes after upload |
| ingest.payload_index_fields | list | [metadata.iri, metadata.id, metadata.ontology] | Fields to index |
| selection.ontologies | list | | Ontology names to ingest (all if omitted) |
| selection.include_files | list | | S3 keys to force-include |
| selection.exclude_files | list | | S3 keys to always skip (highest priority) |
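To show how the defaults and required keys in this reference fit together, here is a stand-alone sketch that merges a partial configuration over the documented defaults and checks the two always-required Qdrant keys. The names DEFAULTS, REQUIRED, deep_merge and build_config are hypothetical; this is not how load_config is implemented, and the S3-mode requirements are left out.

```python
import copy
from typing import Any, Dict

# Defaults copied from the configuration reference.
DEFAULTS: Dict[str, Any] = {
    "qdrant": {
        "distance": "cosine",
        "hnsw_config": {"m": 32, "ef_construct": 256, "full_scan_threshold": 10000},
    },
    "s3": {"anonymous": False, "region": "eu-west-2"},
    "ingest": {
        "vector_size": 384,
        "batch_size": 1024,
        "chunk_size": 500000,
        "parallel_uploads": 4,
        "build_indices_after_upload": True,
        "payload_index_fields": ["metadata.iri", "metadata.id", "metadata.ontology"],
    },
}

# Keys marked "Required" regardless of mode.
REQUIRED = [("qdrant", "url"), ("qdrant", "collection_name")]


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_config(user_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Apply defaults, then fail early if a required key is missing."""
    cfg = deep_merge(DEFAULTS, user_cfg)
    missing = [f"{s}.{k}" for s, k in REQUIRED if not cfg.get(s, {}).get(k)]
    if missing:
        raise ValueError(f"Missing required keys: {', '.join(missing)}")
    return cfg


cfg = build_config({
    "qdrant": {"url": "http://localhost:6333", "collection_name": "my_ontology"},
    "ingest": {"batch_size": 256},
})
print(cfg["ingest"]["batch_size"], cfg["ingest"]["vector_size"], cfg["qdrant"]["distance"])
```

Only the overridden value changes; every other key keeps the default listed in the reference.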

CENtree Vector Generator Examples

End-to-end pipeline that takes a local ontology file, generates a sentence corpus via Owl2Sentence, encodes embeddings with sentence-transformers, and writes a gzipped CSV ready for Qdrant upload. Requires the oml extras:

pip install scibite-toolkit[oml]

CLI Usage

# Generate embeddings from an OWL file (outputs <name>_embeddings.csv.gz)
centree2vec-generate ontology.owl

# Custom output path and model
centree2vec-generate ontology.owl -o output.csv.gz --model all-MiniLM-L6-v2

# With debug logging and custom batch size
centree2vec-generate ontology.owl --debug --batch-size 64

Python API

import argparse
from scibite_toolkit.centree_vector_generator import (
    validate_format,
    derive_ontology_name,
    generate_corpus,
    generate_embeddings,
    write_output,
    run,
)

# Use the full pipeline via run()
args = argparse.Namespace(
    input_file="ontology.owl",
    output="embeddings.csv.gz",
    model="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=128,
    debug=False,
    include_sentences=False,
)
run(args)

# Or use individual stages
fmt = validate_format("ontology.owl")       # "xml"
name = derive_ontology_name("ontology.owl")  # "ontology"
df = generate_corpus("ontology.owl", name)
df = generate_embeddings(df, "sentence-transformers/all-MiniLM-L6-v2", batch_size=128)
write_output(df, "ontology_embeddings.csv.gz")

Arguments

| Argument | Default | Description |
|---|---|---|
| input_file | (required) | Path to the ontology file |
| --output, -o | <name>_embeddings.csv.gz | Output file path |
| --model | sentence-transformers/all-MiniLM-L6-v2 | Sentence-transformers model |
| --batch-size | 128 | Encoding batch size |
| --debug | false | Enable verbose Owl2Sentence logging |

Output Format

Gzipped CSV with columns:

| Column | Description |
|---|---|
| id | Unique identifier for the sentence |
| iri | IRI of the ontology class |
| label | Human-readable class label |
| ontology | Ontology name (derived from filename) |
| content | Generated sentence text |
| embeddings | JSON-encoded 384-dimensional float array |
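A file in this format can be read back with the standard library alone. The sketch below writes a tiny stand-in file with the documented column names and then decodes the JSON-encoded vector column. The row values are made up, the vector is shortened to three dimensions for readability, and the file name is hypothetical.

```python
import csv
import gzip
import json
import os
import tempfile

COLUMNS = ["id", "iri", "label", "ontology", "content", "embeddings"]

# Write a one-row stand-in file (illustrative values, 3-d vector instead of 384-d).
path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.csv.gz")
with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "id": "demo-1",
        "iri": "http://example.org/onto#Class1",
        "label": "example class",
        "ontology": "demo",
        "content": "An example class is a placeholder.",
        "embeddings": json.dumps([0.1, 0.2, 0.3]),
    })

# Read it back and turn the JSON string into a list of floats.
with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
    rows = [
        {**row, "embeddings": json.loads(row["embeddings"])}
        for row in csv.DictReader(fh)
    ]

print(rows[0]["label"], len(rows[0]["embeddings"]))
```

For real outputs, a check that each decoded vector has the expected length (384 for the default model) is a cheap way to catch a model or config mismatch before uploading.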

Pipeline

ontology.owl ──▶ Owl2Sentence ──▶ corpus (DataFrame) ──▶ SentenceTransformer ──▶ embeddings.csv.gz
                 (parse & generate     (id, iri, label,     (encode content         (ready for
                  sentences)            ontology, content)    column)                 Qdrant upload)

The output is directly compatible with centree2vec_qdrant_uploader.py.


CENtree Ontology ML Examples

Convert OWL ontologies to natural-language corpora, generate sentence embeddings, and index them in Qdrant. Requires the oml extras:

pip install scibite-toolkit[oml]

Python API

from scibite_toolkit.centree_ontology_ml import Owl2Sentence, generate_embeddings

# Load ontology and generate sentence corpus
o2s = Owl2Sentence(owl_file="ontology.owl")
documents = o2s.run()

# Generate embeddings
texts = [doc.content for doc in documents]
embeddings = generate_embeddings(texts, model_name="sentence-transformers/all-MiniLM-L6-v2")

CLI Usage

The owl2sentence command exposes three pipeline stages:

# 1. Convert OWL to sentence corpus
owl2sentence corpus -i ontology.owl -o corpus.csv

# 2. Generate embeddings
owl2sentence embed -i corpus.csv -o embeddings.csv -m sentence-transformers/all-MiniLM-L6-v2

# 3. Index in Qdrant
owl2sentence index -i embeddings.csv --url http://localhost:6333 --collection my_ontology

# Pipeline chaining via stdout/stdin
owl2sentence corpus -i ontology.owl -o - | owl2sentence embed -i - -o - | owl2sentence index -i - --url http://localhost:6333 --collection my_ontology

Workbench Example

Dataset management and annotation.

from scibite_toolkit import workbench

# Initialize
wb = workbench.WorkbenchRequestBuilder()
wb.set_url('https://workbench.example.com')

# Authenticate
wb.set_oauth2('client_id', 'username', 'password')

# Create dataset
wb.set_dataset_name('My Analysis Dataset')
wb.set_dataset_desc('Dataset for clinical trial analysis')
wb.create_dataset()

# Upload file
wb.set_file_input('path/to/data.xlsx')
wb.upload_file_to_dataset()

# Configure and run annotation
vocabs = [[5, 6], [8, 9]]  # Vocabulary IDs
attrs = [200, 201]  # Attribute IDs
wb.set_termite_config('', vocabs, attrs)
wb.auto_annotate_dataset()

Key Features

Context Manager Support (TERMite 7, CENtree Clients)

Modern clients support context managers for automatic resource cleanup:

with termite7.Termite7RequestBuilder() as t:
    t.set_url('...')
    # ... work with client ...
# File handles automatically closed
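The cleanup behaviour can be illustrated with a minimal stand-alone class. DemoBuilder below is a hypothetical stand-in, not the toolkit's implementation; it only shows the general pattern of a client that tracks open handles and closes them when the with block exits, including when an exception is raised inside it.

```python
import io


class DemoBuilder:
    """Hypothetical stand-in for a client that owns open file handles."""

    def __init__(self):
        self._handles = []

    def open_resource(self, text):
        # io.StringIO stands in for a real file opened for upload.
        handle = io.StringIO(text)
        self._handles.append(handle)
        return handle

    def close(self):
        # Close every tracked handle and forget it.
        for handle in self._handles:
            handle.close()
        self._handles.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised in the block


with DemoBuilder() as builder:
    handle = builder.open_resource("payload")
    print(handle.closed)  # still open inside the block

print(handle.closed)  # closed once the block has exited
```

Without the with statement, the equivalent is an explicit close() call, ideally in a try/finally block so it also runs on errors.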

Error Handling

All OAuth2 methods return boolean status for easy error handling:

if not t.set_oauth2(client_id, client_secret):
    print("Authentication failed - check credentials")
    raise SystemExit(1)  # exit() is meant for interactive sessions
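In larger programs it can be more convenient to turn the boolean status into an exception. The sketch below shows one way to do that; require_auth and AuthenticationError are hypothetical names, and fake_set_oauth2 is a stand-in for a boolean-returning method such as set_oauth2 so that the example runs without a server.

```python
class AuthenticationError(RuntimeError):
    """Raised when an authentication call reports failure."""


def require_auth(auth_call, *args, **kwargs):
    """Run a boolean-returning auth call and raise if it reports failure."""
    if not auth_call(*args, **kwargs):
        raise AuthenticationError("Authentication failed - check credentials")


def fake_set_oauth2(client_id, client_secret):
    # Stand-in for a real client method; accepts one known pair only.
    return (client_id, client_secret) == ("demo_id", "demo_secret")


require_auth(fake_set_oauth2, "demo_id", "demo_secret")  # passes silently

try:
    require_auth(fake_set_oauth2, "demo_id", "wrong")
except AuthenticationError as err:
    print(err)
```

With a real client the call would look like require_auth(t.set_oauth2, client_id, client_secret), which lets callers handle failures with try/except instead of checking each return value.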

Logging

Enable detailed logging for debugging:

import logging

logging.basicConfig(level=logging.DEBUG)

# Or set per-client
t = termite7.Termite7RequestBuilder(log_level='DEBUG')

Session Management

All clients use requests.Session() for efficient connection pooling and automatic retry handling.


License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
