scibite-toolkit: a Python library for calling SciBite applications (TERMite, TExpress, SciBite Search, CENtree and Workbench) and for processing the JSON results of those requests.
SciBite Toolkit
Python library for making API calls to SciBite's suite of products and processing the JSON responses.
Supported Products
- TERMite - Entity recognition and semantic enrichment (version 6.x)
- TERMite 7 - Next-generation entity recognition with modern OAuth2 authentication
- TExpress - Pattern-based entity relationship extraction
- SciBite Search - Semantic search, document and entity analytics
- CENtree - Ontology management, navigation, and integration
- CENtree VectorDB Uploader - Upload ontology embedding CSVs from S3 or local files to Qdrant
- CENtree Vector Generator - End-to-end ontology→embedding CSV pipeline
- CENtree Ontology ML - OWL→sentence corpus, embedding generation, and Qdrant indexing
- Workbench - Dataset annotation and management
Installation
pip install scibite-toolkit
Quick Start Examples
- TERMite 7 - Modern client with OAuth2
- TERMite 6 - Legacy client
- TExpress - Pattern matching
- SciBite Search
- CENtree - Ontology navigation
- CENtree VectorDB Uploader - S3/local→Qdrant upload
- CENtree Vector Generator - Ontology→embedding CSV
- CENtree Ontology ML - OWL→embeddings pipeline
- Workbench
TERMite 7 Examples
TERMite 7 is the current version, with OAuth2 authentication and an improved API.
OAuth2 Client Credentials (SaaS - Recommended)
For modern SaaS deployments using a separate authentication server:
from scibite_toolkit import termite7
# Initialize with context manager for automatic cleanup
with termite7.Termite7RequestBuilder() as t:
    # Set URLs
    t.set_url('https://termite.saas.scibite.com')
    t.set_token_url('https://auth.saas.scibite.com')
    # Authenticate with OAuth2 client credentials
    if not t.set_oauth2('your_client_id', 'your_client_secret'):
        print("Authentication failed!")
        raise SystemExit(1)
    # Annotate text
    t.set_entities('DRUG,INDICATION')
    t.set_subsume(True)
    t.set_text('Aspirin is used to treat headaches and reduce inflammation.')
    response = t.annotate_text()
    # Process the response
    df = termite7.process_annotation_output(response)
    print(df.head())
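For readers unfamiliar with the grant type, the sketch below shows roughly what a client-credentials token request looks like, following RFC 6749 section 4.4. It is not the toolkit's implementation: the `/token` path, the `build_token_request` and `bearer_headers` helpers, and the sample token are assumptions for illustration, and no network call is made.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) an OAuth2 client-credentials token request."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    # Hypothetical endpoint path; the real path depends on the auth server.
    return Request(
        token_url.rstrip("/") + "/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def bearer_headers(access_token: str) -> dict:
    """Headers a client would attach to later API calls."""
    return {"Authorization": f"Bearer {access_token}"}


req = build_token_request(
    "https://auth.saas.scibite.com", "your_client_id", "your_client_secret"
)
print(req.get_method(), req.full_url)
print(bearer_headers("example-token"))
```

In the toolkit this exchange is hidden behind set_oauth2, which reports success or failure as a boolean.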
OAuth2 Password Grant (Legacy)
For on-premise deployments using username/password authentication:
from scibite_toolkit import termite7
t = termite7.Termite7RequestBuilder()
# Set main TERMite URL and token URL (same server for legacy)
t.set_url('https://termite.example.com')
t.set_token_url('https://termite.example.com')
# Authenticate with username and password
if not t.set_oauth2_legacy('client_id', 'username', 'password'):
    print("Authentication failed!")
    raise SystemExit(1)
# Annotate a document
t.set_entities('INDICATION,DRUG')
t.set_parser_id('generic')
t.set_file('path/to/document.pdf')
response = t.annotate_document()
# Process the response
df = termite7.process_annotation_output(response)
print(df)
# Clean up file handles
t.close()
Get System Status
from scibite_toolkit import termite7
t = termite7.Termite7RequestBuilder()
t.set_url('https://termite.example.com')
t.set_token_url('https://auth.example.com')
t.set_oauth2('client_id', 'client_secret')
# Get system status
status = termite7.get_system_status(t.url, t.headers)
print(f"Server Version: {status['data']['serverVersion']}")
# Get available vocabularies
vocabs = termite7.get_vocabs(t.url, t.headers)
print(f"Available vocabularies: {len(vocabs['data'])}")
# Get runtime options
rtos = termite7.get_runtime_options(t.url, t.headers)
print(rtos)
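Indexing straight into the response, as in status['data']['serverVersion'], raises KeyError if the server returns a different shape (for example an error body). A minimal sketch of a defensive lookup follows; the `dig` helper and the sample payload are hypothetical and only mirror the keys used above.

```python
from typing import Any


def dig(payload: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts, returning `default` if any key is missing."""
    current = payload
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical payload shaped like the status response used above.
sample_status = {"data": {"serverVersion": "7.0.0"}}

print(dig(sample_status, "data", "serverVersion", default="unknown"))
print(dig({}, "data", "serverVersion", default="unknown"))
```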
TERMite 6 Examples
For legacy TERMite 6.x deployments.
SciBite Hosted (SaaS)
from scibite_toolkit import termite
# Initialize
t = termite.TermiteRequestBuilder()
# Configure
t.set_url('https://termite.saas.scibite.com')
t.set_saas_login_url('https://login.saas.scibite.com')
# Authenticate
t.set_auth_saas('username', 'password')
# Set runtime options
t.set_entities('INDICATION')
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
t.set_subsume(True)
# Execute and process
response = t.execute()
df = termite.get_termite_dataframe(response)
print(df.head(3))
Local Instance (Customer Hosted)
from scibite_toolkit import termite
t = termite.TermiteRequestBuilder()
t.set_url('https://termite.local.example.com')
# Basic authentication for local instances
t.set_basic_auth('username', 'password')
# Configure and execute
t.set_entities('INDICATION')
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
t.set_subsume(True)
response = t.execute()
df = termite.get_termite_dataframe(response)
print(df.head(3))
TExpress Examples
Pattern-based entity relationship extraction.
SciBite Hosted
from scibite_toolkit import texpress
t = texpress.TexpressRequestBuilder()
t.set_url('https://texpress.saas.scibite.com')
t.set_saas_login_url('https://login.saas.scibite.com')
t.set_auth_saas('username', 'password')
# Set pattern to find relationships
t.set_entities('INDICATION,DRUG')
t.set_pattern(':(DRUG):{0,5}:(INDICATION)') # Find DRUG within 5 words of INDICATION
t.set_input_format('medline.xml')
t.set_output_format('json')
t.set_binary_content('path/to/file.xml')
response = t.execute()
df = texpress.get_texpress_dataframe(response)
print(df.head())
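The pattern string above is easy to mistype by hand. The helper below is a small sketch that assembles the one pattern shape shown in this section (two entity types separated by a bounded gap); `build_proximity_pattern` is a hypothetical name, and TExpress pattern syntax covers far more than this single form.

```python
def build_proximity_pattern(left: str, right: str, max_gap: int) -> str:
    """Return a pattern matching `left`, then up to `max_gap` tokens, then `right`."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    # Doubled braces produce literal '{' and '}' in an f-string.
    return f":({left}):{{0,{max_gap}}}:({right})"


pattern = build_proximity_pattern("DRUG", "INDICATION", 5)
print(pattern)
```

The resulting string can be passed to set_pattern in place of the literal.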
Local Instance
from scibite_toolkit import texpress
t = texpress.TexpressRequestBuilder()
t.set_url('https://texpress.local.example.com')
t.set_basic_auth('username', 'password')
t.set_entities('INDICATION,DRUG')
t.set_pattern(':(INDICATION):{0,5}:(INDICATION)')
t.set_input_format('pdf')
t.set_output_format('json')
t.set_binary_content('/path/to/file.pdf')
response = t.execute()
df = texpress.get_texpress_dataframe(response)
print(df.head())
SciBite Search Example
Semantic search with entity-based queries and aggregations.
from scibite_toolkit import scibite_search
# Configure
s = scibite_search.SBSRequestBuilder()
s.set_url('https://yourdomain-search.saas.scibite.com/')
s.set_auth_url('https://yourdomain.saas.scibite.com/')
# Authenticate with OAuth2
s.set_oauth2('your_client_id', 'your_client_secret')
# Search documents
query = 'schema_id="clinical_trial" AND (title~INDICATION$D011565 AND DRUG$*)'
# Preferred: request specific fields using the new 'fields' parameter (legacy: 'additional_fields')
response = s.get_docs(query=query, markup=True, limit=100, fields=['*'])
# Get co-occurrence aggregations
# Find top 50 genes co-occurring with psoriasis
response = s.get_aggregates(
    query='INDICATION$D011565',
    vocabs=['HGNCGENE'],
    limit=50
)
Note: The preferred parameter name is fields. The legacy additional_fields is still supported for backward compatibility. When both are provided, fields takes precedence.
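The precedence rule in the note can be sketched as a small function. This is an illustration of the documented behaviour, not the library's code: `resolve_fields` is a hypothetical name, and returning an empty list when neither argument is given is an assumption.

```python
from typing import List, Optional, Sequence


def resolve_fields(
    fields: Optional[Sequence[str]] = None,
    additional_fields: Optional[Sequence[str]] = None,
) -> List[str]:
    """Pick the field list to request: `fields` wins over `additional_fields`."""
    if fields is not None:
        return list(fields)
    if additional_fields is not None:
        return list(additional_fields)
    # Assumption: with neither argument, no extra fields are requested.
    return []


print(resolve_fields(fields=["*"], additional_fields=["title"]))
print(resolve_fields(additional_fields=["title"]))
```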
CENtree Examples
Ontology navigation and search.
Modern Client (Recommended)
The modern centree_clients module provides better error handling, retries, and context manager support.
from scibite_toolkit.centree_clients import CENtreeReaderClient
# Use context manager for automatic cleanup
with CENtreeReaderClient(
    base_url="https://centree.example.com",
    bearer_token="your_token",
    timeout=(3.0, None)  # Quick connect, unlimited read
) as reader:
    # Search by exact label
    hits = reader.get_classes_by_exact_label("efo", "neuron")
    print(f"Found {len(hits)} matches")
    # Get ontology roots
    roots = reader.get_root_entities("efo", "classes", size=10)
    # Get paths from root to target (great for LLM grounding)
    paths = reader.get_paths_from_root("efo", "MONDO_0007739", as_="labels")
    for path in paths:
        print(" → ".join(path))
# Or authenticate with OAuth2
from scibite_toolkit.centree_clients import CENtreeReaderClient
reader = CENtreeReaderClient(base_url="https://centree.example.com")
if reader.set_oauth2(client_id="...", client_secret="..."):
    hits = reader.get_classes_by_exact_label("efo", "lung")
    print(hits)
CENtree VectorDB Uploader Examples
Upload ontology embedding CSVs from S3 or local files to Qdrant for vector search.
Qdrant version compatibility: The qdrant-client Python package must match your Qdrant server version within one minor version (e.g. client 1.7.x for server 1.7.x or 1.8.x). A mismatch may cause silent data corruption or connection errors. Pin the client version to match your server: pip install qdrant-client==1.7.0
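The compatibility rule can be checked before an upload. The sketch below compares two version strings under the rule stated above (same major version, minor versions at most one apart); `versions_compatible` is a hypothetical helper and obtaining the server version is left out.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.7.0' or 'v1.8.2'."""
    parts = version.strip().lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def versions_compatible(client_version: str, server_version: str) -> bool:
    """True if both share a major version and minors differ by at most one."""
    c_major, c_minor = parse_major_minor(client_version)
    s_major, s_minor = parse_major_minor(server_version)
    return c_major == s_major and abs(c_minor - s_minor) <= 1


# The installed client version could be read with
# importlib.metadata.version("qdrant-client"); fixed strings are used here.
print(versions_compatible("1.7.0", "1.8.2"))
print(versions_compatible("1.7.0", "1.9.0"))
```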
CLI Usage
# Upload all datasets under the configured S3 prefix
centree2vec-upload --config config.yaml
# Upload only specific ontologies
centree2vec-upload --config config.yaml --ontology efo mondo
# Upload local embedding files directly (no S3 required)
centree2vec-upload --config config.yaml --local efo_embeddings.csv.gz
# Replace existing vectors for each ontology before uploading
centree2vec-upload --config config.yaml --replace
# Combine --local and --replace to re-upload a single ontology
centree2vec-upload --config config.yaml --local efo_embeddings.csv.gz --replace
# Dry-run to preview which files would be processed
centree2vec-upload --config config.yaml --dry-run
# Public S3 bucket with anonymous access
centree2vec-upload --config config.yaml --anonymous
Python API
from scibite_toolkit.centree_vectordb_uploader import run, load_config
# Load YAML configuration
cfg = load_config("config.yaml")
# Run the upload pipeline
results = run(cfg)
for r in results:
    print(f"{r['ontology']}: {r['total_rows']} vectors uploaded")
# Replace existing vectors for each ontology before uploading
results = run(cfg, replace=True)
# Dry-run to inspect what would be uploaded
results = run(cfg, dry_run=True)
Generate a Starter Config
# Write the bundled example config to the current directory
centree2vec-upload --init
# Or specify a custom path
centree2vec-upload --init my-config.yaml
Configuration Reference
| Key | Type | Default | Description |
|---|---|---|---|
| qdrant.url | str | — | Required. Qdrant server URL |
| qdrant.collection_name | str | — | Required. Target collection name |
| qdrant.distance | str | cosine | Distance metric: cosine, euclid, dot, manhattan |
| qdrant.api_key_env | str | — | Env var name holding the Qdrant API key |
| qdrant.hnsw_config.m | int | 32 | HNSW graph connectivity |
| qdrant.hnsw_config.ef_construct | int | 256 | HNSW index build search depth |
| qdrant.hnsw_config.full_scan_threshold | int | 10000 | Point count below which brute-force is used |
| s3.bucket | str | — | Required (S3 mode). S3 bucket name |
| s3.prefix | str | — | Required (S3 mode). S3 key prefix for embedding files |
| s3.anonymous | bool | false | Use unsigned requests for public buckets |
| s3.endpoint_url | str | — | Custom S3-compatible endpoint URL |
| s3.region | str | eu-west-2 | AWS region |
| ingest.vector_size | int | 384 | Embedding dimension |
| ingest.batch_size | int | 1024 | Points per Qdrant upload batch |
| ingest.chunk_size | int | 500000 | Rows per pandas read chunk |
| ingest.parallel_uploads | int | 4 | Parallel upload threads |
| ingest.build_indices_after_upload | bool | true | Build payload indexes after upload |
| ingest.payload_index_fields | list | [metadata.iri, metadata.id, metadata.ontology] | Fields to index |
| selection.ontologies | list | — | Ontology names to ingest (all if omitted) |
| selection.include_files | list | — | S3 keys to force-include |
| selection.exclude_files | list | — | S3 keys to always skip (highest priority) |
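To make the defaults concrete, the sketch below merges a partial configuration with the default values listed in the reference and checks the two keys that are always required. It is an illustration only: `DEFAULTS`, `deep_merge`, and `with_defaults` are hypothetical names, and the toolkit's own load_config may behave differently.

```python
import copy

# Defaults copied from the configuration reference.
DEFAULTS = {
    "qdrant": {
        "distance": "cosine",
        "hnsw_config": {"m": 32, "ef_construct": 256, "full_scan_threshold": 10000},
    },
    "s3": {"anonymous": False, "region": "eu-west-2"},
    "ingest": {
        "vector_size": 384,
        "batch_size": 1024,
        "chunk_size": 500000,
        "parallel_uploads": 4,
        "build_indices_after_upload": True,
        "payload_index_fields": ["metadata.iri", "metadata.id", "metadata.ontology"],
    },
}
REQUIRED = [("qdrant", "url"), ("qdrant", "collection_name")]


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` on a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def with_defaults(user_cfg: dict) -> dict:
    """Fill in defaults and fail early when a required key is absent."""
    cfg = deep_merge(DEFAULTS, user_cfg)
    missing = [f"{s}.{k}" for s, k in REQUIRED if not cfg.get(s, {}).get(k)]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return cfg


cfg = with_defaults({
    "qdrant": {
        "url": "http://localhost:6333",
        "collection_name": "my_ontology",
        "hnsw_config": {"m": 16},
    }
})
print(cfg["qdrant"]["hnsw_config"])
print(cfg["ingest"]["batch_size"])
```

Overriding only hnsw_config.m leaves the other HNSW settings at their defaults, which is the main reason for a recursive merge rather than a plain dict update.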
CENtree Vector Generator Examples
End-to-end pipeline that takes a local ontology file, generates a sentence corpus via Owl2Sentence, encodes embeddings with sentence-transformers, and writes a gzipped CSV ready for Qdrant upload. Requires the oml extras:
pip install scibite-toolkit[oml]
CLI Usage
# Generate embeddings from an OWL file (outputs <name>_embeddings.csv.gz)
centree2vec-generate ontology.owl
# Custom output path and model
centree2vec-generate ontology.owl -o output.csv.gz --model all-MiniLM-L6-v2
# With debug logging and custom batch size
centree2vec-generate ontology.owl --debug --batch-size 64
Python API
import argparse
from scibite_toolkit.centree_vector_generator import (
    validate_format,
    derive_ontology_name,
    generate_corpus,
    generate_embeddings,
    write_output,
    run,
)
# Use the full pipeline via run()
args = argparse.Namespace(
    input_file="ontology.owl",
    output="embeddings.csv.gz",
    model="sentence-transformers/all-MiniLM-L6-v2",
    batch_size=128,
    debug=False,
    include_sentences=False,
)
run(args)
# Or use individual stages
fmt = validate_format("ontology.owl") # "xml"
name = derive_ontology_name("ontology.owl") # "ontology"
df = generate_corpus("ontology.owl", name)
df = generate_embeddings(df, "sentence-transformers/all-MiniLM-L6-v2", batch_size=128)
write_output(df, "ontology_embeddings.csv.gz")
Arguments
| Argument | Default | Description |
|---|---|---|
| input_file | (required) | Path to the ontology file |
| --output, -o | <name>_embeddings.csv.gz | Output file path |
| --model | sentence-transformers/all-MiniLM-L6-v2 | Sentence-transformers model |
| --batch-size | 128 | Encoding batch size |
| --debug | false | Enable verbose Owl2Sentence logging |
Output Format
Gzipped CSV with columns:
| Column | Description |
|---|---|
| id | Unique identifier for the sentence |
| iri | IRI of the ontology class |
| label | Human-readable class label |
| ontology | Ontology name (derived from filename) |
| content | Generated sentence text |
| embeddings | JSON-encoded 384-dimensional float array |
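A file in this format can be read back with the standard library alone. The sketch below writes one made-up row to an in-memory gzip stream and parses it again; the row values are invented, the vector has 3 dimensions instead of 384 to keep the example short, and `read_embeddings` is a hypothetical helper rather than part of the toolkit.

```python
import csv
import gzip
import io
import json

COLUMNS = ["id", "iri", "label", "ontology", "content", "embeddings"]


def read_embeddings(fileobj):
    """Yield (row, vector) pairs from a gzipped CSV file object."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            # The embeddings column holds a JSON-encoded list of floats.
            yield row, json.loads(row["embeddings"])


# Build a tiny in-memory file with one invented row.
buffer = io.BytesIO()
with gzip.open(buffer, mode="wt", encoding="utf-8", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "id": "example-1",
        "iri": "http://example.org/onto#Neuron",
        "label": "neuron",
        "ontology": "ontology",
        "content": "A neuron is a kind of cell.",
        "embeddings": json.dumps([0.1, 0.2, 0.3]),
    })
buffer.seek(0)

for row, vector in read_embeddings(buffer):
    print(row["label"], len(vector))
```

For a real file, pass a path such as ontology_embeddings.csv.gz to read_embeddings instead of the buffer.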
Pipeline
ontology.owl
  ──▶ Owl2Sentence (parse & generate sentences)
  ──▶ corpus DataFrame (id, iri, label, ontology, content)
  ──▶ SentenceTransformer (encode content column)
  ──▶ embeddings.csv.gz (ready for Qdrant upload)
The output is directly compatible with the CENtree VectorDB Uploader (centree2vec-upload), for example via its --local option.
CENtree Ontology ML Examples
Convert OWL ontologies to natural-language corpora, generate sentence embeddings, and index them in Qdrant. Requires the oml extras:
pip install scibite-toolkit[oml]
Python API
from scibite_toolkit.centree_ontology_ml import Owl2Sentence, generate_embeddings
# Load ontology and generate sentence corpus
o2s = Owl2Sentence(owl_file="ontology.owl")
documents = o2s.run()
# Generate embeddings
texts = [doc.content for doc in documents]
embeddings = generate_embeddings(texts, model_name="sentence-transformers/all-MiniLM-L6-v2")
CLI Usage
The owl2sentence command exposes three pipeline stages:
# 1. Convert OWL to sentence corpus
owl2sentence corpus -i ontology.owl -o corpus.csv
# 2. Generate embeddings
owl2sentence embed -i corpus.csv -o embeddings.csv -m sentence-transformers/all-MiniLM-L6-v2
# 3. Index in Qdrant
owl2sentence index -i embeddings.csv --url http://localhost:6333 --collection my_ontology
# Pipeline chaining via stdout/stdin
owl2sentence corpus -i ontology.owl -o - | owl2sentence embed -i - -o - | owl2sentence index -i - --url http://localhost:6333 --collection my_ontology
Workbench Example
Dataset management and annotation.
from scibite_toolkit import workbench
# Initialize
wb = workbench.WorkbenchRequestBuilder()
wb.set_url('https://workbench.example.com')
# Authenticate
wb.set_oauth2('client_id', 'username', 'password')
# Create dataset
wb.set_dataset_name('My Analysis Dataset')
wb.set_dataset_desc('Dataset for clinical trial analysis')
wb.create_dataset()
# Upload file
wb.set_file_input('path/to/data.xlsx')
wb.upload_file_to_dataset()
# Configure and run annotation
vocabs = [[5, 6], [8, 9]] # Vocabulary IDs
attrs = [200, 201] # Attribute IDs
wb.set_termite_config('', vocabs, attrs)
wb.auto_annotate_dataset()
Key Features
Context Manager Support (TERMite 7, CENtree Clients)
Modern clients support context managers for automatic resource cleanup:
with termite7.Termite7RequestBuilder() as t:
    t.set_url('...')
    # ... work with client ...
# File handles automatically closed
Error Handling
All OAuth2 methods return boolean status for easy error handling:
if not t.set_oauth2(client_id, client_secret):
    print("Authentication failed - check credentials")
    raise SystemExit(1)
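A boolean return value is easy to ignore by accident. In larger programs it can help to turn a failed authentication into an exception. The sketch below shows one way to do that; `AuthenticationError`, `require_auth`, and the stand-in `fake_set_oauth2` are hypothetical and exist only so the example runs without a server.

```python
class AuthenticationError(RuntimeError):
    """Raised when a boolean-returning auth call reports failure."""


def require_auth(authenticate, *args, **kwargs) -> None:
    """Call `authenticate` and raise if it returns a falsy value."""
    if not authenticate(*args, **kwargs):
        raise AuthenticationError("Authentication failed - check credentials")


def fake_set_oauth2(client_id: str, client_secret: str) -> bool:
    """Stand-in for a client's set_oauth2 method (illustration only)."""
    return client_secret == "correct-secret"


require_auth(fake_set_oauth2, "my_client_id", "correct-secret")
print("authenticated")
```

With a real client, the bound method would be passed instead, for example require_auth(t.set_oauth2, client_id, client_secret).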
Logging
Enable detailed logging for debugging:
import logging
logging.basicConfig(level=logging.DEBUG)
# Or set per-client
t = termite7.Termite7RequestBuilder(log_level='DEBUG')
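Global DEBUG logging also switches on verbose output from the HTTP stack. A possible refinement, sketched below with the standard logging module, keeps DEBUG for application code while limiting urllib3 (used by requests) to warnings. The logger name "scibite_toolkit" is an assumption; adjust it if the toolkit logs under a different name.

```python
import logging

# Send everything at DEBUG and above to stderr.
logging.basicConfig(level=logging.DEBUG)

# urllib3 is chatty at DEBUG; keep only its warnings and errors.
logging.getLogger("urllib3").setLevel(logging.WARNING)

# Hypothetical logger name for the toolkit's own messages.
logging.getLogger("scibite_toolkit").setLevel(logging.DEBUG)
```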
Session Management
All clients use requests.Session() for efficient connection pooling and automatic retry handling.
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
File details
Details for the file scibite_toolkit-1.5.0a2.tar.gz.
File metadata
- Download URL: scibite_toolkit-1.5.0a2.tar.gz
- Upload date:
- Size: 169.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | af422edc7bb9c2c9ee3c8b635af452925ce6f3c05bcf786dee3fb47d41d9378d |
| MD5 | f85501785162a73a8c3bb099fdd199b7 |
| BLAKE2b-256 | 38828cc735df162f17c9f9d71861d72385b5ae8175ca9af233eae896e2b00f3c |
File details
Details for the file scibite_toolkit-1.5.0a2-py3-none-any.whl.
File metadata
- Download URL: scibite_toolkit-1.5.0a2-py3-none-any.whl
- Upload date:
- Size: 181.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a07b76d10b30208dea49b3e3714f67b8a2cb0512d43fcbc808dfd4eac9975398 |
| MD5 | c9ff1278599d243db7dec8f90295857f |
| BLAKE2b-256 | e4cc1cb3aede037697f2200055d1b0c80a401c5481b1020b4f691cc818be2e34 |