lx-ai
A PyTorch-based training framework for multi-label classification of gastroenterology endoscopic images. Designed for both research and production workflows, with strong emphasis on data integrity, reproducibility, and flexible data sourcing.
The framework provides an end-to-end pipeline that handles:
- Dataset construction
- Label processing
- Split generation
- Model training
- Evaluation
- Reporting
All stages run in a consistent and validated manner.
Core Capabilities
Multi-label Classification
- Supports multiple simultaneous labels per image (e.g., polyp, blood, instrument)
- Handles incomplete annotations using masking
Flexible Data Sources
- PostgreSQL database — production/service mode
- SQLite — local development mode
- Legacy JSONL + image directory — offline datasets
Robust Dataset Handling
- Label filtering by labelset version
- Explicit handling of:
- Known positives
- Known negatives
- Unknown labels
- Configurable semantics:
- Treat unknown as negative (closed-world)
- Ignore unknown (open-world)
Stable and Reproducible Data Splitting
- Bucket-based splitting with deterministic hashing
- Grouping by video_id or old_examination_id
- Persistent video bucket registry to ensure:
- No data leakage
- Stable splits across runs
- Reproducibility in experiments
Model Architecture Flexibility
- GastroNet ResNet50 (recommended for the medical domain)
- Standard backbones (ImageNet pretrained or random)
- Easy extension for new architectures
Training Features
- Focal loss with:
- Per-label class weighting
- Masking for unknown labels
- Separate learning rates for backbone and head
- Optional backbone freezing
- Cosine annealing scheduler with warmup
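A warmup-plus-cosine schedule of this kind can be sketched as follows. This is a minimal illustration, not the framework's actual scheduler; the `warmup_epochs` and `min_lr` parameters mirror the config fields of the same name described later:

```python
import math

def lr_at_epoch(epoch: int, num_epochs: int, base_lr: float,
                warmup_epochs: int, min_lr: float) -> float:
    """Linear warmup followed by cosine annealing down to min_lr."""
    if epoch < warmup_epochs:
        # Linear ramp up to base_lr over the warmup epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, num_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```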
Evaluation and Metrics
- Global metrics: precision, recall, F1-score, accuracy
- Per-label metrics
- Support for:
- Standard evaluation (with negatives)
- Positives-only evaluation (when negatives are unavailable)
Data Validation and Diagnostics
- Automatic dataset validation reports
- Label distribution analysis
- Split integrity checks
- Dataset imbalance detection
- Video and dataset-level diagnostics
Reproducibility and Traceability
- Full configuration captured in metadata
- Saved model weights and training history
- Persistent bucket assignments
- Deterministic dataset splits
Production and Development Compatibility
- Service mode using PostgreSQL (production)
- Local mode using SQLite (development)
- Frame path remapping for local debugging of production data
Testing and Reliability
- Extensive unit test coverage for:
- Configuration validation
- Dataset building
- Splitting logic
- Bucket hashing and allocation
- Loss functions and metrics
- Database loaders
Design Principles
lx-ai is built around a few key principles:
No Data Leakage
Group-based splitting ensures frames from the same examination or video never cross splits.
Reproducibility First
Persistent bucket assignment guarantees identical splits across runs and environments.
Explicit Label Semantics
Unknown labels are never silently treated as negatives unless explicitly configured.
Separation of Concerns
Data loading, splitting, training, and evaluation are modular and independently testable.
Production-Aware Design
The same pipeline works in both local development and service-based production environments.
Branches and Database Usage
gs02_sandbox_db
Used for sandbox database work and direct PostgreSQL access on the gs-02 server. Instructions for running it: https://github.com/wg-lux/lx-ai/wiki/AI-Model---Running-Commands#lx-ai---gs02_sandbox_db
prototype
Used for the service-compatible workflow and local development.
- Service mode: PostgreSQL
- Local mode: SQLite
Quick Start
Paths a new developer should configure
Main place:
.env
Recommended local values (write every variable out in full in the .env file, e.g. DATA_DIR=/home/admin/dev/lx-ai/data rather than DATA_DIR=${WORKING_DIR}/data):
# Runtime roots
# Root of your system (user-specific)
HOME_DIR=/home/<your-user>
# Path where lx-ai repository is cloned
WORKING_DIR=${HOME_DIR}/dev/lx-ai
# Main data directory (all runtime data)
DATA_DIR=${WORKING_DIR}/data
# Configuration directory (passwords, configs)
CONF_DIR=${WORKING_DIR}/conf
# Storage root (usually same as DATA_DIR)
STORAGE_DIR=${DATA_DIR}
# Frame storage (extracted images)
FRAME_DIR=${DATA_DIR}/frames
# Training outputs
# Root for all training artifacts
TRAINING_ROOT=${DATA_DIR}/model_training
# Pretrained and saved model checkpoints
CHECKPOINTS_DIR=${TRAINING_ROOT}/checkpoints
# Training outputs (models, logs, metadata)
RUNS_DIR=${TRAINING_ROOT}/runs
# Bucket snapshots (split reproducibility)
BUCKET_SNAPSHOT_DIR=${TRAINING_ROOT}/buckets
# Model checkpoint
BACKBONE_CHECKPOINT=${CHECKPOINTS_DIR}/RN50_GastroNet-1M_DINOv1.pth
# Training config (a relative path inside the repository)
TRAINING_CONFIG_PATH=lx_ai/ai_model_config/train_sandbox_postgres.yaml
# Optional JSONL mode - data_source: jsonl
LEGACY_IMAGE_DIR=${DATA_DIR}/legacy_images/images
LEGACY_JSONL_PATH=${DATA_DIR}/legacy_images/legacy_img_dicts.jsonl
# Optional CSV import - used by lx_ai/scripts/import_csv_sqlite.py
CSV_DIR=${DATA_DIR}/import/csv
# Local SQLite
SQLITE_DB_PATH=${WORKING_DIR}/dev_db.sqlite
# Frame path remap for local development
FRAME_PATH_REMAP_SOURCE=/var/endoreg-service-user/lx-annotate/data/frames
FRAME_PATH_REMAP_TARGET=${WORKING_DIR}/data/frames_mirror
Database variables
Local development with SQLite
DB_BACKEND=sqlite
DJANGO_SETTINGS_MODULE=${WORKING_DIR}.settings.settings_dev
DJANGO_DB_ENGINE=django.db.backends.sqlite3
SQLITE_DB_PATH=${WORKING_DIR}/dev_db.sqlite
For SQLite, these PostgreSQL-style values may exist but are not the active DB connection:
DJANGO_DB_HOST=
DJANGO_DB_PORT=
DJANGO_DB_NAME=
DJANGO_DB_USER=
Production or service with PostgreSQL
These are normally generated in .env.systemd by the Luxnix service:
DB_BACKEND=postgres
DJANGO_SETTINGS_MODULE=${WORKING_DIR}.settings.settings_prod
DJANGO_DB_ENGINE=django.db.backends.postgresql
DJANGO_DB_HOST=localhost
DJANGO_DB_PORT=5432
DJANGO_DB_NAME=<database_name> # e.g. endoregDbLocal
DJANGO_DB_USER=<database_user> # e.g. endoregDbLocal
DJANGO_DB_PASSWORD_FILE=${CONF_DIR}/db_pwd # e.g. /var/endoreg-service-user/lx-ai/conf/db_pwd
DJANGO_DB_SSLMODE=prefer
Files where these are used
| Path variable | Purpose | Used in |
|---|---|---|
| DATA_DIR | Main data root | secretspec.toml, devenv.nix, training YAML |
| CONF_DIR | Password and config files | secretspec.toml, service .env.systemd |
| FRAME_DIR | Default frame directory | Django settings and diagnostics |
| TRAINING_ROOT | Training artifact root | training config |
| CHECKPOINTS_DIR | Backbone checkpoint folder | training config |
| RUNS_DIR | Saved models, metadata, reports | training config |
| BUCKET_SNAPSHOT_DIR | Bucket snapshots | lx_ai/training/bucket_snapshot.py |
| BACKBONE_CHECKPOINT | GastroNet checkpoint path | train_sandbox_postgres.yaml |
| TRAINING_CONFIG_PATH | Which YAML file run_training.py loads | lx_ai/run_training.py |
| LEGACY_IMAGE_DIR | JSONL image folder | data_loader_for_model_input.py |
| LEGACY_JSONL_PATH | JSONL annotation file | data_loader_for_model_input.py |
| CSV_DIR | CSV import folder | scripts/import_csv_sqlite.py |
| SQLITE_DB_PATH | Local SQLite DB file | SQLite loaders |
| FRAME_PATH_REMAP_SOURCE | Original service frame path prefix | data_loader_for_model_training.py |
| FRAME_PATH_REMAP_TARGET | Local mirrored frame path prefix | data_loader_for_model_training.py |
Training configuration
Edit the following file:
lx_ai/ai_model_config/train_sandbox_postgres.yaml
Important fields to configure:
dataset_ids: [1, 2]
labelset_id: 5
labelset_version_to_train: 3
backbone_name: gastro_rn50
backbone_checkpoint: "$BACKBONE_CHECKPOINT"
base_dir: "$DATA_DIR"
training_root: "$TRAINING_ROOT"
checkpoints_dir: "$CHECKPOINTS_DIR"
runs_dir: "$RUNS_DIR"
For a new dataset, update:
dataset_ids: [1, 2, 3]
For a different labelset, update:
labelset_id: <your_labelset_id>
labelset_version_to_train: <your_version>
Verify your paths
Run the path diagnostics to validate your configuration:
secretspec run --provider env -- uv run python -c "from lx_ai.utils.path_diagnostics import print_runtime_path_diagnostics; print_runtime_path_diagnostics()"
Main entry point
python lx_ai/run_training.py
Or as a module:
python -m lx_ai.run_training
Recommended development workflow
cd /home/admin/dev/lx-ai
devenv shell
python lx_ai/run_training.py
Configuration
The primary training config file is:
lx_ai/ai_model_config/train_sandbox_postgres.yaml
It controls:
- dataset and labelset selection
- data source
- model backbone and checkpoint
- training hyperparameters
- scheduler settings
- device selection
- unknown-label behavior
Example fields:
dataset_uuid: sandbox_ds
data_source: postgres
dataset_ids: [1, 2]
labelset_id: 5
labelset_version_to_train: 3
treat_unlabeled_as_negative: false
backbone_name: gastro_rn50
backbone_checkpoint: /path/to/RN50_GastroNet-1M_DINOv1.pth
freeze_backbone: true
num_epochs: 20
batch_size: 16
lr_head: 0.001
lr_backbone: 0.0001
gamma_focal: 2.0
alpha_focal: 0.25
use_scheduler: true
warmup_epochs: 2
min_lr: 1.0e-6
device: cuda
random_seed: 42
bucket_policy:
num_buckets: 5
validation_buckets: [3]
test_buckets: [4]
save_bucket_snapshot: false
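Some fields shown earlier (e.g. backbone_checkpoint: "$BACKBONE_CHECKPOINT") use environment-variable placeholders. One plausible way such a config dict is resolved after loading the YAML, sketched here for illustration (this is not the framework's actual loader; the expand_config helper is hypothetical):

```python
import os

def expand_config(cfg):
    """Recursively expand $VAR / ${VAR} placeholders in string values."""
    if isinstance(cfg, str):
        return os.path.expandvars(cfg)
    if isinstance(cfg, dict):
        return {k: expand_config(v) for k, v in cfg.items()}
    if isinstance(cfg, list):
        return [expand_config(v) for v in cfg]
    return cfg

# Usage with an already-parsed YAML dict:
os.environ["BACKBONE_CHECKPOINT"] = "/tmp/RN50_GastroNet-1M_DINOv1.pth"
cfg = expand_config({"backbone_checkpoint": "$BACKBONE_CHECKPOINT", "num_epochs": 20})
```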
Supported Backbones
Supported backbone names:
- gastro_rn50
- resnet50_imagenet
- resnet50_random
- efficientnet_b0_imagenet
Backbones are implemented in lx_ai/ai_model/model_backbones.py.
Data Sources
PostgreSQL mode
Use data_source: postgres.
Database loaders are in:
- lx_ai/utils/db_loader_for_model_input.py
- lx_ai/utils/data_loader_for_model_input.py
Connection variables are resolved from:
- DEV_DB_* (checked first)
- DJANGO_DB_* (checked second)
Password resolution supports:
- *_PASSWORD
- *_PASSWORD_FILE
SQLite mode
For local development, set:
export DB_BACKEND=sqlite
The local loader supports SQLite through the same input pipeline.
JSONL mode
Use data_source: jsonl and provide:
jsonl_path: /path/to/data.jsonl
Expected JSONL format:
{"labels": ["polyp"], "old_examination_id": 1, "old_id": 10, "filename": "10.jpg"}
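Records in this format can be parsed with a few lines of standard-library code. This is a sketch of reading the fields shown above; how the actual loader maps them may differ:

```python
import io
import json

def load_jsonl(fh):
    """Parse one JSON record per line from a JSONL stream."""
    records = []
    for line in fh:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        records.append({
            "labels": rec.get("labels", []),
            "group": rec["old_examination_id"],  # grouping key for splits
            "frame_id": rec["old_id"],
            "filename": rec["filename"],
        })
    return records

sample = '{"labels": ["polyp"], "old_examination_id": 1, "old_id": 10, "filename": "10.jpg"}\n'
records = load_jsonl(io.StringIO(sample))
```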
Unknown Label Handling
Two modes are supported:
treat_unlabeled_as_negative: false
- Unknown labels are masked out
- Loss and metrics ignore unknown values
treat_unlabeled_as_negative: true
- Unknown labels are treated as negative
- Use only when missing labels imply negative examples
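The two modes can be illustrated by how a target vector and mask might be built for a single frame. This is an illustrative sketch, not the framework's code; the label names and helper are hypothetical:

```python
def build_target(all_labels, positives, negatives, treat_unlabeled_as_negative):
    """Return (targets, mask); mask[i] == 1.0 means label i contributes to the loss."""
    targets, mask = [], []
    for name in all_labels:
        if name in positives:
            targets.append(1.0); mask.append(1.0)   # known positive
        elif name in negatives or treat_unlabeled_as_negative:
            targets.append(0.0); mask.append(1.0)   # known (or assumed) negative
        else:
            targets.append(0.0); mask.append(0.0)   # unknown: masked out
    return targets, mask

labels = ["polyp", "blood", "instrument"]
t_open, m_open = build_target(labels, {"polyp"}, set(), False)     # open-world
t_closed, m_closed = build_target(labels, {"polyp"}, set(), True)  # closed-world
```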
Dataset and Bucket Splitting
The loader builds datasets with:
- frame-level multi-label vectors
- label masks for unknown annotations
- stable video/examination split assignment
- bucket policy support for train/validation/test
Bucket policy example:
bucket_policy:
num_buckets: 5
validation_buckets: [3]
test_buckets: [4]
Training buckets are all remaining buckets not assigned to validation or test.
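Deterministic, group-stable assignment of this kind is typically done by hashing the group key. The following is a sketch of the general technique; the framework's actual hashing and registry logic live under lx_ai/ai_model_split and may differ:

```python
import hashlib

def bucket_for_group(group_key: str, num_buckets: int) -> int:
    """Stable bucket index from a deterministic hash of the video/examination key."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Mirror the bucket_policy example above.
num_buckets = 5
validation_buckets, test_buckets = [3], [4]
# Training buckets are everything not claimed by validation or test.
train_buckets = [b for b in range(num_buckets)
                 if b not in validation_buckets + test_buckets]

b = bucket_for_group("video_0001", num_buckets)
```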
The split logic preserves:
- same-video grouping
- stable bucket assignments
- split exclusivity
- dataset integrity
Training Flow
Training is managed by:
lx_ai/ai_model_training/trainer_gastronet_multilabel.py
Typical steps:
- Load TrainingConfig
- Build dataset
- Validate labels and sources
- Create PyTorch datasets and loaders
- Create model and optimizer
- Train for configured epochs
- Validate and select best checkpoint
- Evaluate test split
- Save model weights and metadata
Loss and Metrics
Loss
Loss implementation:
lx_ai/ai_model/losses.py
Uses:
- focal_loss_with_mask
- compute_class_weights
Supports:
- multi-label logits
- per-label weights
- label masks
- focal alpha and gamma
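The combination of mask, per-label weights, and focal alpha/gamma can be sketched in NumPy to show how each term enters. This illustrates the technique only; the real focal_loss_with_mask operates on PyTorch tensors and its details may differ:

```python
import numpy as np

def masked_focal_loss(logits, targets, mask, alpha=0.25, gamma=2.0, label_weights=None):
    """Binary focal loss per label, averaged over unmasked entries only."""
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probabilities
    pt = np.where(targets == 1.0, p, 1.0 - p)    # probability of the true class
    alpha_t = np.where(targets == 1.0, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - pt) ** gamma * np.log(np.clip(pt, 1e-8, 1.0))
    if label_weights is not None:
        loss = loss * label_weights               # per-label class weighting
    loss = loss * mask                            # unknown labels contribute nothing
    denom = np.maximum(mask.sum(), 1.0)
    return float(loss.sum() / denom)
```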
Metrics
Metrics implementation:
lx_ai/ai_model_matrics/metrics.py
Supported metrics:
- precision
- recall
- F1
- accuracy
- TP / FP / TN / FN
- per-label metrics
- positives-only metrics when negatives are unavailable
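The masked confusion counts behind these metrics can be sketched in pure Python; the actual implementation lives in lx_ai/ai_model_matrics/metrics.py, so treat this as an illustration of the masking idea only:

```python
def masked_counts(preds, targets, mask):
    """TP/FP/TN/FN over entries where mask is truthy; unknown labels are skipped."""
    tp = fp = tn = fn = 0
    for p, t, m in zip(preds, targets, mask):
        if not m:
            continue  # unknown label: excluded from every count
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```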
Outputs
Trained model artifacts are saved under:
data/model_training/runs/
Output files:
- dataset_<dataset_uuid>_<backbone_name>_v<labelset_version>_multilabel.pth
- dataset_<dataset_uuid>_<backbone_name>_v<labelset_version>_multilabel_meta.json
Metadata includes:
- config
- labelset
- used labels
- training history
- final test metrics
- bucket policy and sizes
Troubleshooting
Missing labelset
Verify labelset_id and labelset_version_to_train exist in the database.
Empty dataset
Check dataset_ids and ensure annotations exist for those IDs.
Image file not found
For local development against service database paths, remap frame roots:
export FRAME_PATH_REMAP_SOURCE="/var/endoreg-service-user/lx-annotate/data/frames"
export FRAME_PATH_REMAP_TARGET="/home/admin/dev/lx-ai/data/frames_mirror"
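The remap is conceptually a prefix substitution on each stored frame path. A sketch of the idea (the remap_frame_path helper name is hypothetical; the real logic is in data_loader_for_model_training.py):

```python
import os

def remap_frame_path(path: str) -> str:
    """Rewrite the service frame prefix to its local mirror, if configured."""
    src = os.environ.get("FRAME_PATH_REMAP_SOURCE")
    dst = os.environ.get("FRAME_PATH_REMAP_TARGET")
    if src and dst and path.startswith(src):
        return dst + path[len(src):]
    return path  # unchanged when no remap applies

os.environ["FRAME_PATH_REMAP_SOURCE"] = "/var/endoreg-service-user/lx-annotate/data/frames"
os.environ["FRAME_PATH_REMAP_TARGET"] = "/home/admin/dev/lx-ai/data/frames_mirror"
p = remap_frame_path("/var/endoreg-service-user/lx-annotate/data/frames/v1/10.jpg")
```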
Missing GastroNet checkpoint
Verify the path for RN50_GastroNet-1M_DINOv1.pth.
PostgreSQL password errors
Use one of:
export DEV_DB_PASSWORD=your_password
or
export DEV_DB_PASSWORD_FILE=/path/to/password/file
Service mode uses the DJANGO_DB_* equivalents.
Testing
Run tests with:
pytest -q
Run a single file:
pytest tests/ai_model_config/test_training_config.py -q --no-cov
Run with coverage:
pytest --cov=lx_ai --cov-report=html
# then open the report
firefox htmlcov/index.html
Project Structure
lx-ai/
├── lx_ai/
│ ├── ai_model/
│ ├── ai_model_config/
│ ├── ai_model_dataset/
│ ├── ai_model_matrics/
│ ├── ai_model_split/
│ ├── ai_model_training/
│ ├── data_validation/
│ ├── scripts/
│ ├── utils/
│ └── run_training.py
├── tests/
├── data/
├── pyproject.toml
└── README.md
License
MIT License © 2025 AG-Lux
See LICENSE for details.