ontology_loader
A suite of tools to configure and load an ontology from the OBO Foundry into the data object for OntologyClass as specified by the NMDC schema.
Architecture: MongoDB access patterns
MongoDBLoader reaches MongoDB through two paths simultaneously — a deliberate hybrid, not an oversight.
linkml-store. Used for schema-aware setup and for any path where per-document work is acceptable:
- Client(handle=...) / attach_database(...): declarative connection that integrates with NMDC's LinkML schema tooling.
- db.create_collection(name, recreate_if_exists=False): idempotent collection setup.
- collection.index(...): idempotent index declaration on id, name (class collection) and (subject, predicate, object) (relation collection).
- _handle_obsolete_terms: per-item processing of the small obsolete subset.
Raw pymongo. Used only for the bulk-upsert phase, exposed via the lazy MongoDBLoader._py_db property:
py_collection.bulk_write([UpdateOne(...upsert=True), ...], ordered=False).
Why both?
linkml_store.api.stores.mongodb.mongodb_collection.upsert (as of the version pinned here) iterates per-item with find_one followed by update_one/insert_one. That is two round trips to the server per document: fine for small ontologies, too slow for larger ones.
The pymongo path is a bypass, not a permanent split. Upstream issue linkml/linkml-store#77 tracks adding bulk_write support to linkml-store.
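As an illustration of the bulk path described above, here is a minimal sketch of batching upserts into bulk_write calls. It is not the loader's actual implementation: the helper names (chunked, upsert_spec, bulk_upsert), the batch size, and the choice of id as the match key are assumptions made for this example.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def upsert_spec(doc, key="id"):
    """Return the (filter, update) pair for one document (assumes `key` is unique)."""
    return {key: doc[key]}, {"$set": doc}


def bulk_upsert(collection, docs, batch_size=1000):
    """Upsert `docs` into a pymongo collection, one bulk_write call per batch."""
    from pymongo import UpdateOne  # imported lazily so the helpers above need only the stdlib

    touched = 0
    for batch in chunked(docs, batch_size):
        ops = [UpdateOne(*upsert_spec(doc), upsert=True) for doc in batch]
        # ordered=False lets the server continue past individual failures
        result = collection.bulk_write(ops, ordered=False)
        touched += result.upserted_count + result.modified_count
    return touched
```

With this shape, a batch of 1000 documents costs one server call instead of the two calls per document that a find_one plus update_one/insert_one loop needs.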
Development Environment
Pre-requisites
- Python >= 3.10
- Poetry
- Docker
- MongoDB
- NMDC materialized schema
- ENV variable for MONGO_PASSWORD (or pass it in via the cli/runner itself directly)
% docker pull mongo
% docker run -d --name mongodb-container -p 27018:27017 mongo
MongoDB Connection Settings
When connecting to MongoDB, you need to set the correct environment variables depending on where your code is running:
- When running from your local machine (CLI or tests):
  export MONGO_HOST=localhost
  export MONGO_PORT=27018
  export ENABLE_DB_TESTS=true
  export MONGO_PASSWORD="your_valid_password"
- When running inside Docker containers:
  export MONGO_HOST=mongo
  export MONGO_PORT=27017
The Docker container networking uses container names (like 'mongo') for internal communication, while your host machine must use 'localhost' with the mapped port (27018).
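To make the host/port distinction concrete, the following sketch assembles a connection URI from the MONGO_* environment variables. The function name mongo_uri_from_env and the fallback defaults are hypothetical; the loader's own configuration code may resolve these values differently. The authSource=admin suffix mirrors the URIs used elsewhere in this README.

```python
import os
from urllib.parse import quote_plus


def mongo_uri_from_env(env=None):
    """Build a MongoDB URI from MONGO_* variables (defaults are assumptions)."""
    env = os.environ if env is None else env
    host = env.get("MONGO_HOST", "localhost")
    port = int(env.get("MONGO_PORT", "27018"))
    user = quote_plus(env.get("MONGO_USERNAME", "admin"))
    # No default for the password: a missing MONGO_PASSWORD raises KeyError
    password = quote_plus(env["MONGO_PASSWORD"])
    db = env.get("MONGO_DB", "nmdc")
    return f"mongodb://{user}:{password}@{host}:{port}/{db}?authSource=admin"
```

Passing a plain dict instead of os.environ makes it easy to check both the local and the in-container settings without touching the real environment.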
Basic mongosh commands
% docker ps
% docker exec -it [mongodb-container-id] bash
% mongosh mongodb://admin:root@mongo:27017/nmdc?authSource=admin
% show dbs
% use nmdc
% db.ontology_class_set.find().pretty()
% db.ontology_relation_set.find().pretty()
% db.ontology_class_set.find( { id: { $regex: /^PO/ } } ).pretty()
% db.ontology_class_set.find( { id: { $regex: /^UBERON/ } } ).pretty()
% db.ontology_class_set.find( { id: { $regex: /^ENVO/ } } ).pretty()
Command line
% poetry install
% poetry run ontology_loader --help
% poetry run ontology_loader --source-ontology envo
% poetry run ontology_loader --source-ontology envo --source-ontology po --source-ontology uberon
Four flags:
- --source-ontology <name>: required, repeatable. Lowercase prefix (envo, po, uberon, ncbitaxon, …). Multiple ontologies are processed sequentially in the given order.
- --report-directory <dir>: TSV report destination (only used in meticulous mode). Defaults to a fresh temp directory.
- --mode {meticulous|fast-initial}: default meticulous. See "Modes" below.
- --closure {combined|isa|partof|all|none}: default combined. Repeatable; values combine. all and none are exclusive.
Modes
- meticulous (default): Preserves 0.2.x behavior: pure linkml-store, per-item upsert, force-refresh of the pystow cache on every run, TSV reports (ontology_updates.tsv, ontology_inserts.tsv, ontology_relation_inserts.tsv) written to --report-directory. Use this for incremental updates of an already-loaded ontology.
- fast-initial: Maximum-throughput first-time install. Raw pymongo insert_many(ordered=False), no upsert, no pre-read, no report tracking, no TSV writes. Reuses the pystow cache if present (downloads only when missing). Use this when the target collections are empty or duplicate-key errors are acceptable. Expected ~3-5x faster than meticulous on large ontologies (e.g. NCBITaxon's 2.7M classes + 54.7M relations).
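What "duplicate-key errors are acceptable" can mean in practice is sketched below: an unordered insert_many whose duplicate-key failures are tolerated while any other write error is re-raised. This is an illustration only, not the loader's code; the helper names are hypothetical, and whether the loader surfaces or swallows these errors is not stated here.

```python
DUPLICATE_KEY = 11000  # MongoDB server error code for a unique-index violation


def split_write_errors(details):
    """Split the writeErrors of a bulk-write error report into (duplicates, others)."""
    errors = details.get("writeErrors", [])
    dupes = [e for e in errors if e.get("code") == DUPLICATE_KEY]
    others = [e for e in errors if e.get("code") != DUPLICATE_KEY]
    return dupes, others


def insert_ignoring_duplicates(collection, docs):
    """Insert docs unordered; tolerate duplicate keys, re-raise anything else."""
    from pymongo.errors import BulkWriteError  # lazy import: only needed with a real collection

    try:
        return len(collection.insert_many(docs, ordered=False).inserted_ids)
    except BulkWriteError as exc:
        _, others = split_write_errors(exc.details)
        if others:
            raise
        # Only duplicates failed; report how many documents were actually inserted
        return exc.details.get("nInserted", 0)
```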
Closure shorthands
- --closure combined (default): emits entailed_isa_partof_closure (rdfs:subClassOf ∪ BFO:0000050).
- --closure isa: emits entailed_isa_closure (rdfs:subClassOf only).
- --closure partof: emits entailed_partof_closure (BFO:0000050 only).
- --closure all: shorthand for --closure combined --closure isa --closure partof. Exclusive.
- --closure none: emit no ancestry closure, only direct relationships. Exclusive.
Repeat the flag to combine specific closures: --closure isa --closure partof emits both entailed_isa_closure and entailed_partof_closure.
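The combination rules above can be sketched as a small resolver. The function name resolve_closures, the mapping constant, and the use of ValueError are assumptions made for this example; only the flag values and the emitted closure names come from the list above.

```python
CLOSURE_FIELDS = {
    "combined": "entailed_isa_partof_closure",
    "isa": "entailed_isa_closure",
    "partof": "entailed_partof_closure",
}


def resolve_closures(values=("combined",)):
    """Map --closure values to the closure names that would be emitted."""
    values = list(values)
    if "all" in values or "none" in values:
        # 'all' and 'none' are exclusive: they cannot be mixed with other values
        if len(values) > 1:
            raise ValueError("'all' and 'none' cannot be combined with other closures")
        return list(CLOSURE_FIELDS.values()) if values == ["all"] else []
    unknown = [v for v in values if v not in CLOSURE_FIELDS]
    if unknown:
        raise ValueError(f"unknown closure value(s): {unknown}")
    # dict.fromkeys de-duplicates repeated flags while keeping their order
    return [CLOSURE_FIELDS[v] for v in dict.fromkeys(values)]
```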
Running the tests
% make test
Running the linter
% make lint
Python API
pip install nmdc-ontology-loader
from ontology_loader.ontology_load_controller import OntologyLoaderController
import tempfile
# Default: pure linkml-store + TSV reports (preserves 0.2.x behavior)
OntologyLoaderController(
    source_ontology="envo",                  # str or list[str]
    report_directory=tempfile.gettempdir(),  # only used in 'meticulous' mode
    mode="meticulous",                       # or 'fast-initial'
    closure="combined",                      # str or list[str]
).run_ontology_loader()
Fast first-time install of a large ontology
OntologyLoaderController(
    source_ontology="ncbitaxon",
    mode="fast-initial",  # raw pymongo insert_many, no upsert, no reports
    closure="isa",        # is_a only; combined closure is too large for NCBITaxon
).run_ontology_loader()
Multiple ontologies in one invocation
OntologyLoaderController(
    source_ontology=["envo", "po", "uberon"],  # processed sequentially in given order
    mode="meticulous",
).run_ontology_loader()
Using with an existing MongoDB connection
If you already have a MongoDB connection (e.g., in a Dagster/Dagit job), pass it directly:
from pymongo import MongoClient
from ontology_loader.ontology_load_controller import OntologyLoaderController
mongo_client = MongoClient("mongodb://admin:password@localhost:27018/nmdc?authSource=admin")
OntologyLoaderController(
    source_ontology="envo",
    mode="meticulous",
    mongo_client=mongo_client,  # Pass the existing client
    db_name="nmdc",             # Required when passing an existing client
).run_ontology_loader()
Note: When passing an existing MongoDB client, you must also provide db_name. The database name cannot be auto-determined from a MongoClient instance.
Migrating from 0.2.x
The 0.2.x constructor signature (source_ontology, output_directory, generate_reports, mongo_client, db_name) continues to work, with the renamed and removed keyword arguments accepted as deprecated aliases. The exact call site in nmdc-runtime's Dagster load_ontology op runs unchanged under 0.3.0; two DeprecationWarning lines appear in the logs as a nudge.
| old kwarg | new kwarg | behavior |
|---|---|---|
| source_ontology=<str> | source_ontology=<str \| list[str]> | unchanged; now also accepts a list |
| output_directory=<str> | report_directory=<str> | renamed; old kwarg works as alias with DeprecationWarning. Passing both raises. |
| generate_reports=True | (gone; implicit) | no-op with DeprecationWarning (True was always the default) |
| generate_reports=False | mode='fast-initial' | mapped with DeprecationWarning. If mode was also passed and isn't 'meticulous', raises. |
| (none) | mode='meticulous' (default) | new; default preserves 0.2.x write path |
| (none) | closure='combined' (default) | new; default preserves 0.2.x ancestry behavior |
See CHANGELOG.md for the full release note and a side-by-side migration code sample.
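For readers who want to see the alias rules in code form, here is a minimal sketch that follows the table literally. It is not the controller's constructor: the function name normalize_kwargs and the exception types (TypeError, ValueError) are assumptions, since the table only says that the conflicting cases raise.

```python
import warnings


def normalize_kwargs(report_directory=None, mode=None,
                     output_directory=None, generate_reports=None):
    """Translate 0.2.x keyword arguments into their 0.3.0 equivalents."""
    if output_directory is not None:
        if report_directory is not None:
            raise TypeError("pass report_directory or output_directory, not both")
        warnings.warn("output_directory is deprecated; use report_directory",
                      DeprecationWarning, stacklevel=2)
        report_directory = output_directory
    if generate_reports is not None:
        warnings.warn("generate_reports is deprecated; use mode",
                      DeprecationWarning, stacklevel=2)
        if generate_reports is False:
            # Per the table: an explicit mode other than 'meticulous' is a conflict
            if mode not in (None, "meticulous"):
                raise ValueError("generate_reports=False conflicts with the given mode")
            mode = "fast-initial"
    return {"report_directory": report_directory, "mode": mode or "meticulous"}
```

A 0.2.x-style call that passes output_directory together with generate_reports=True yields the two warnings mentioned above and resolves to meticulous mode.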
Testing CRUD operations in a live MongoDB
The test suite follows a single convention: tests that need MongoDB run automatically when MongoDB and credentials are available; they skip gracefully when not.
In practice:
- Mock-only tests (e.g. tests/test_mock_mongodb_loader.py) run unconditionally: no MongoDB or credentials needed.
- Tests that exercise a live MongoDB are gated by MONGO_PASSWORD (and a few additionally require ENABLE_DB_TESTS=true as an extra safety check against accidental writes against unintended databases). When the gating env vars are unset, those tests skip with a clear reason; when they are set, the tests connect to the MongoDB pointed at by the rest of the MONGO_* env vars.
Required env vars when running the live-DB tests:
export MONGO_HOST=localhost
export MONGO_PORT=27017 # or whatever your local Mongo listens on
export MONGO_USERNAME=admin
export MONGO_PASSWORD="your_valid_password"
export MONGO_DB=nmdc # read by the loader (see src/ontology_loader/mongo_db_config.py)
export MONGO_DBNAME=nmdc # read by tests/test_ontology_class_null_values.py — currently a separate name from MONGO_DB
export ENABLE_DB_TESTS=true # required by tests/test_ontology_load_controller.py
Then:
make test
The same command also runs without the env vars; the DB-gated tests just skip, and mock-only tests still run either way. This is intended both to prevent accidental writes against a live database when the env vars aren't deliberately set, and to make sure MONGO_PASSWORD is never hardcoded in the codebase.
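The gating convention can be sketched with the standard library's unittest skip decorator. This is an illustration under assumptions: the project's tests may use a different mechanism, and the names live_db_enabled and LiveMongoTest are hypothetical. Only the environment variable names come from this README.

```python
import os
import unittest


def live_db_enabled(env=None):
    """True only when both gating variables are deliberately set."""
    env = os.environ if env is None else env
    has_password = bool(env.get("MONGO_PASSWORD"))
    opted_in = env.get("ENABLE_DB_TESTS", "").lower() == "true"
    return has_password and opted_in


@unittest.skipUnless(
    live_db_enabled(),
    "set MONGO_PASSWORD and ENABLE_DB_TESTS=true to run live-DB tests",
)
class LiveMongoTest(unittest.TestCase):
    def test_gate_is_open(self):
        # Placeholder body; a real test would connect using the MONGO_* settings
        self.assertTrue(live_db_enabled())
```

When the variables are unset, the test runner reports the class as skipped along with the reason string, rather than failing or silently passing.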
Safety rules for DB-writing tests
Any test that writes or modifies MongoDB documents must follow these rules:
- Use a dedicated scratch database or collection name, never the production names (nmdc, ontology_class_set, ontology_relation_set). The scratch name should be specific enough that it can't collide with real data (e.g. ontology_loader_smoke_test).
- Verify the target does not already exist before writing. If it does, the test must fail loudly with a clear message so the developer investigates rather than silently overwriting unrelated data.
- Clean up unconditionally at the end: wrap the test in try/finally so the cleanup runs even when assertions fail.
The smoke test tests/test_cli_smoke.py::test_controller_end_to_end_against_live_mongo shows the pattern.
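The three rules can also be expressed as a small context manager. This is a sketch under assumptions, not the code of the smoke test named above: scratch_collection is a hypothetical helper, and db is expected to behave like a pymongo Database (list_collection_names, item access, drop_collection).

```python
from contextlib import contextmanager


@contextmanager
def scratch_collection(db, name="ontology_loader_smoke_test"):
    """Yield a scratch collection that is always dropped afterwards."""
    # Rule 2: refuse to touch a collection that already exists
    if name in db.list_collection_names():
        raise RuntimeError(
            f"collection {name!r} already exists; refusing to overwrite it"
        )
    try:
        # Rule 1: the caller only ever sees the dedicated scratch name
        yield db[name]
    finally:
        # Rule 3: cleanup runs even when the test body raises
        db.drop_collection(name)
```

A test would wrap its writes and assertions in a with scratch_collection(db) as coll: block, so a failing assertion still leaves the database as it was found.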
What each live-DB test does
| File | What it touches |
|---|---|
| tests/test_linkml_store_client_connections.py | Verifies that both raw pymongo and linkml-store's Client can establish a connection. |
| tests/test_ontology_class_null_values.py | Inserts and reads ontology class docs to confirm boolean/text fields don't store null. |
| tests/test_ontology_load_controller.py | Runs OntologyLoaderController.run_ontology_loader() against a small live ENVO load. |
| tests/test_cli_smoke.py::test_controller_end_to_end_against_live_mongo | Stubs the heavy semsql step, runs the controller end-to-end against MongoDB, and verifies the expected documents. |
Reset collections in dev
docker exec -it nmdc-runtime-test-mongo-1 bash
mongosh mongodb://admin:root@mongo:27017/nmdc?authSource=admin
db.ontology_class_set.find({}).pretty()
db.ontology_relation_set.find({}).pretty()
db.biosample_set.find({}).pretty()
db.ontology_class_set.drop()
db.ontology_relation_set.drop()
db.ontology_class_set.countDocuments()
db.ontology_relation_set.countDocuments()