This package aims to offer helper functions that simplify model building and evaluation
aiqclib
aiqclib is a Python library that provides a configuration-driven workflow for machine learning, simplifying dataset preparation, model training, and data classification. It is a core component of the AIQC project that aims to enhance anomaly detection in CTD (Conductivity, Temperature, Depth) data.
ML Algorithms Supported by aiqclib
| Category | Algorithm | Short Name | Method |
|---|---|---|---|
| Tree-Based & Ensemble | XGBoost | XGB | Ensemble (Boosting) |
| Tree-Based & Ensemble | Random Forest | RF | Ensemble (Bagging) |
| Tree-Based & Ensemble | Decision Tree | DT | Tree |
| Linear & Geometric | Logistic Regression | Logit | Linear |
| Linear & Geometric | Linear Discriminant Analysis | LDA | Linear / Statistical |
| Linear & Geometric | Support Vector Machine | SVM | Geometric |
| Instance-Based | K-Nearest Neighbors | KNN | Distance-based |
| Probabilistic | Gaussian Naive Bayes | GNB | Probabilistic |
| Neural Network | Multilayer Perceptron | MLP | Neural Network |
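The short names above are convenient keys for referring to algorithms programmatically. As a plain illustration (this mapping is not part of aiqclib's public API), a lookup table might look like:

```python
# Illustrative lookup of the short names from the table above;
# this mapping is hypothetical, not part of aiqclib's API.
ALGORITHMS = {
    "XGB": "XGBoost",
    "RF": "Random Forest",
    "DT": "Decision Tree",
    "Logit": "Logistic Regression",
    "LDA": "Linear Discriminant Analysis",
    "SVM": "Support Vector Machine",
    "KNN": "K-Nearest Neighbors",
    "GNB": "Gaussian Naive Bayes",
    "MLP": "Multilayer Perceptron",
}

def full_name(short_name):
    """Resolve a short name to the algorithm's full name (raises KeyError if unknown)."""
    return ALGORITHMS[short_name]
```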
Installation
The package is available on PyPI and conda-forge.
Using pip:

```shell
pip install aiqclib
```

Using conda:

```shell
conda install -c conda-forge aiqclib
```
Documentation
Project documentation is hosted on Read the Docs.
Core Concepts
The library is designed around a three-stage workflow:
- Dataset Preparation: Prepare feature datasets from raw data and generate training, validation, and test data sets.
- Training & Evaluation: Train machine learning models and evaluate their performance using cross-validation.
- Classification: Apply a trained model to classify new, unseen data.
Each stage is controlled by a YAML configuration file, allowing you to define and reproduce your entire workflow with ease.
Usage
The general workflow for any task in aiqclib follows these steps:
- Generate a Configuration Template: Create a starter YAML file for the task (e.g., `prepare`, `train`, `classify`).
- Customize the Configuration: Edit the YAML file to specify paths, dataset names, and other parameters.
- Run the Task: Load the configuration and execute the main function for the task.
1. Dataset Preparation
This workflow processes your input data and creates training, validation, and test sets.
Step 1: Generate a configuration template.
```python
import aiqclib as aq

aq.write_config_template(file_name="/path/to/prepare_config.yaml", stage="prepare")
```
Step 2: Customize prepare_config.yaml.
You must edit the file to set the correct input/output paths and define your dataset. See the Configuration section for details.
Step 3: Run the preparation process.
```python
import aiqclib as aq

config = aq.read_config("/path/to/prepare_config.yaml")
aq.create_training_dataset(config)
```
This generates the following output folders:
- summary: Statistics of input data used for normalization.
- select: Profiles with bad observation flags (positive samples) and good profiles (negative samples).
- locate: Observation records for both positive and negative profiles.
- extract: Features extracted from the observation records.
- training: The final training, validation, and test datasets.
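After a preparation run, it can be handy to verify that all five stage folders were created. A minimal stdlib sketch (the helper below is hypothetical; only the folder names come from the list above):

```python
from pathlib import Path

# Stage folders produced by the preparation workflow (names from the docs).
STAGE_FOLDERS = ["summary", "select", "locate", "extract", "training"]

def missing_stage_folders(base_path):
    """Return the names of stage folders absent under base_path."""
    base = Path(base_path)
    return [name for name in STAGE_FOLDERS if not (base / name).is_dir()]
```

For example, `missing_stage_folders("/path/to/data")` returns an empty list when a run completed and lists the absent stages otherwise.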
2. Model Training and Evaluation
This workflow uses the prepared dataset to train a model and evaluate its performance.
Step 1: Generate a training configuration template.
```python
import aiqclib as aq

aq.write_config_template(file_name="/path/to/training_config.yaml", stage="train")
```
Step 2: Customize training_config.yaml.
Edit the file to point to your prepared dataset and define training parameters.
Step 3: Train and evaluate the model.
```python
import aiqclib as aq

config = aq.read_config("/path/to/training_config.yaml")
aq.train_and_evaluate(config)
```
This generates the following output folders:
- validate: Results from the cross-validation process.
- build: The final trained models and their evaluation results on the test dataset.
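Cross-validation, used in the validate step, partitions the training data into k folds and repeatedly trains on k−1 folds while validating on the remaining one. A generic stdlib sketch of fold-index generation (illustrating the technique, not aiqclib's internal implementation):

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        stop = start + fold_size + (1 if fold < remainder else 0)
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val
        start = stop
```

Every sample appears in exactly one validation fold, so the k validation scores together cover the whole training set.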
3. Data Classification
This workflow applies a trained model to classify all observations in a dataset.
Step 1: Generate a classification configuration template.
```python
import aiqclib as aq

aq.write_config_template(file_name="/path/to/classification_config.yaml", stage="classify")
```
Step 2: Customize classification_config.yaml.
Edit the file to point to the input data and the trained model.
Step 3: Run classification.
```python
import aiqclib as aq

config = aq.read_config("/path/to/classification_config.yaml")
aq.classify_dataset(config)
```
This workflow processes a dataset using a trained model and generates:
- classify: The final classification results and a summary report.
Configuration
Configuration is managed via YAML files. The write_config_template function provides a starting point that you must customize for each module.
1. Dataset Preparation (stage="prepare")
The preparation config requires you to modify two key sections:
- `path_info_sets`: Defines the location of input and output data.

  ```yaml
  path_info_sets:
    - name: data_set_1
      common:
        base_path: /path/to/data    # EDIT: Root output directory
      input:
        base_path: /path/to/input   # EDIT: Directory with input files
        step_folder_name: ""
      split:
        step_folder_name: training
  ```
- `data_sets`: Defines a specific dataset to be processed.

  ```yaml
  data_sets:
    - name: dataset_0001                      # EDIT: Your data set name
      dataset_folder_name: dataset_0001       # EDIT: Your output folder
      input_file_name: nrt_cora_bo_4.parquet  # EDIT: Your input filename
  ```
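Once loaded with `aq.read_config`, a prepare config like the one above becomes a nested mapping. The exact object type returned by `read_config` is not documented here, so the plain-dict stand-in below is an assumption for illustration only:

```python
# Hand-written stand-in for a parsed prepare config (assumed structure,
# mirroring the YAML keys shown in this section).
config = {
    "path_info_sets": [
        {
            "name": "data_set_1",
            "common": {"base_path": "/path/to/data"},
            "input": {"base_path": "/path/to/input", "step_folder_name": ""},
            "split": {"step_folder_name": "training"},
        }
    ],
    "data_sets": [
        {
            "name": "dataset_0001",
            "dataset_folder_name": "dataset_0001",
            "input_file_name": "nrt_cora_bo_4.parquet",
        }
    ],
}

# Fields can then be read with ordinary indexing.
input_base = config["path_info_sets"][0]["input"]["base_path"]
```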
2. Training and Evaluation (stage="train")
The training config links the prepared data to the model training process.
- `path_info_sets`: Defines where to find the prepared dataset and where to save model artifacts.

  ```yaml
  path_info_sets:
    - name: data_set_1
      common:
        base_path: /path/to/data  # EDIT: Root output directory
      input:
        step_folder_name: training
  ```
- `training_sets`: Links to a dataset prepared in the previous workflow.

  ```yaml
  training_sets:
    - name: training_0001                # EDIT: Your training name
      dataset_folder_name: dataset_0001  # EDIT: Your output folder
  ```
3. Classification (stage="classify")
The classification config uses a trained model to classify new data.
- `path_info_sets`: Defines paths for raw data, models, and classification results.

  ```yaml
  path_info_sets:
    - name: data_set_1
      common:
        base_path: /path/to/data   # EDIT: Root output directory
      input:
        base_path: /path/to/input  # EDIT: Directory with input files
        step_folder_name: ""
      model:
        base_path: /path/to/model  # EDIT: Directory with model files
        step_folder_name: model
      concat:
        step_folder_name: classification  # EDIT: Directory with classification results
  ```
- `classification_sets`: Defines a specific dataset to be classified.

  ```yaml
  classification_sets:
    - name: classification_0001               # EDIT: Your classification name
      dataset_folder_name: dataset_0001       # EDIT: Your output folder
      input_file_name: nrt_cora_bo_4.parquet  # EDIT: Your input filename
  ```
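The output location for a classification run plausibly composes from `base_path`, `step_folder_name`, and `dataset_folder_name`. The join convention below is an assumption, shown only to make the roles of these fields concrete:

```python
from pathlib import Path

def output_folder(base_path, step_folder_name, dataset_folder_name):
    """Assumed layout: <base_path>/<step_folder_name>/<dataset_folder_name>."""
    return Path(base_path) / step_folder_name / dataset_folder_name

# With the example values from the config above:
results_dir = output_folder("/path/to/data", "classification", "dataset_0001")
```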
Contributing & Development
We welcome contributions! Please use the following guidelines for development.
Environment Setup
We recommend using uv for managing the development environment.
- Install `uv`. We recommend installing `uv` into your base conda/mamba environment so the `uv` command is available globally. If you don't use conda/mamba, you can install it with pip instead.
  ```shell
  # Using mamba (recommended)
  mamba activate base
  mamba install -n base -c conda-forge uv

  # Or using conda
  conda activate base
  conda install -n base -c conda-forge uv

  # Or using pip
  pip install uv
  ```
Alternatively, the [standalone installer](https://docs.astral.sh/uv/getting-started/installation/) from Astral works on any platform without needing Python or conda preinstalled.
- Create and activate the project's virtual environment. From the project's root directory, run the following:
  ```shell
  # Create the virtual environment in a .venv folder
  uv venv

  # Activate the virtual environment
  source .venv/bin/activate
  ```
- Install the project and its dependencies.
`uv sync` resolves and installs all dependencies from `pyproject.toml`, and `uv pip install -e .` installs the library itself in "editable" mode (`-e`).

```shell
uv sync
uv pip install -e .
```
- Download the test data. The test fixtures (~15 MB of parquet, joblib, and YAML files) are not stored in the repository. They live as a GitHub release asset and need to be downloaded once before tests can run:
```shell
bash scripts/fetch_test_data.sh
```
This places the fixtures under `tests/data/`. The script requires the [`gh` CLI](https://cli.github.com) (authenticated via `gh auth login`) and `unzip`. To pin a specific data version or pull from a fork, override the defaults via environment variables:
```shell
TEST_DATA_VERSION=test-data-v1.0.1 bash scripts/fetch_test_data.sh
```
You only need to re-run this when the test data version changes.
Running Tests
With your environment activated and test data downloaded, you can run the test suite using pytest.
```shell
uv run pytest -v
```
Code Style (Linting & Formatting)
We use Ruff for linting and formatting.
Linting: Check the library and test code for style issues.
```shell
# Lint the library source code
uv run ruff check src

# Lint the test code
uv run ruff check tests
```
Formatting: Automatically format the code to match the project's style.
```shell
# Format the library source code
uv run ruff format src

# Format the test code
uv run ruff format tests
```
Documentation (for Maintainers)
Building Docs Locally
- Update Docstrings (requires a Google Gemini API key):

  ```shell
  # Update docstrings for source files
  python ./docs/scripts/update_docstrings.py src docs/scripts/prompt_main.txt

  # Update docstrings for test files
  python ./docs/scripts/update_docstrings.py tests docs/scripts/prompt_unittest.txt
  ```
- Review Docstrings: Manually review all modified files. Remove generated headers/footers and correct any sections marked with "Issues:".
- Update API Documents: From the project root, run:

  ```shell
  uv run sphinx-apidoc -f --remove-old --module-first -o docs/source/api src/aiqclib
  ```
- Build HTML: From the project root, run:

  ```shell
  cd docs; uv run make html; cd ..
  ```

  You can view the generated site by opening `docs/build/html/index.html` in a browser.
Deployment (for Maintainers)
PyPI
The package is published to PyPI automatically via a GitHub Action whenever a new release is created on GitHub.
conda-forge (Automatic)
The conda-forge bot automatically creates a pull request and merges it into the main branch when a new version of the package is published on PyPI.
conda-forge (Manual)
Bump version with new dependencies
When runtime dependencies change, the automated PR from the conda-forge bot may fail. In that case, you must manually update the feedstock by opening a pull request against the conda-forge/aiqclib-feedstock repository.
- Install build tools:

  ```shell
  mamba install -c conda-forge conda-build conda-smithy grayskull
  ```
- Fork and clone the `aiqclib-feedstock` repository.
- Sync with upstream (e.g., add `conda-forge/aiqclib-feedstock` as a remote named `upstream` and `git rebase upstream/main`).
- Update the forked repo:

  ```shell
  git checkout main             # Go to your local main branch
  git fetch upstream            # Get the latest changes from the original repo
  git rebase upstream/main      # Keep your local main linear with the original
  git push origin main --force  # Update your GitHub fork's main (optional but good practice)
  ```

- Create a new branch (e.g., `git checkout -b update_vX.Y.Z`).
- Generate a strict recipe (e.g., `grayskull pypi aiqclib --strict-conda-forge`).
- Review `recipes/meta.yaml` and ensure it meets conda-forge standards.
- Rerender the feedstock (e.g., `conda smithy rerender -c auto`).
- Commit, push, and open a pull request to the `aiqclib-feedstock` repository.
- Merge it after passing CI.
Initial upload
Submitting the package on conda-forge involves creating a pull request to the conda-forge/staged-recipes repository.
- Fork and clone the `staged-recipes` repository.
- Configure upstream: `git remote add upstream https://github.com/conda-forge/staged-recipes.git`
- Create a new branch (e.g., `git checkout -b aiqclib-recipe`).
- Generate a strict recipe: `grayskull pypi aiqclib --strict-conda-forge`.
- Review `recipes/aiqclib/meta.yaml` and ensure it meets conda-forge standards.
- Commit, push, and open a pull request to the `staged-recipes` repository.
Anaconda.org (Manual)
Publishing to the <username> channel on Anaconda.org is a manual process.
- Install build tools:

  ```shell
  mamba install -c conda-forge conda-build anaconda-client grayskull
  ```

- Generate Recipe: From the project root, run `grayskull pypi aiqclib`. This creates `aiqclib/meta.yaml`.
- Build Package:

  ```shell
  conda build aiqclib
  ```

- Upload Package:

  ```shell
  anaconda login
  anaconda upload /path/to/your/conda-bld/noarch/aiqclib-*.conda
  ```

- Cleanup: Copy `aiqclib/meta.yaml` to `conda/meta.yaml` for version control and remove the temporary `aiqclib` directory.
File details
Details for the file aiqclib-0.2.0.tar.gz.
File metadata
- Download URL: aiqclib-0.2.0.tar.gz
- Upload date:
- Size: 287.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `240b7589bdc09aaf5fc6016f31d7b57329c5cba3e8fac3df6f1475cf25674581` |
| MD5 | `856b7a260166c7787ccdd6cccdc58c11` |
| BLAKE2b-256 | `010e231ce915949326c4d333285f6ab85230821f2561a649adc26862460a3adc` |
Provenance

The following attestation bundles were made for aiqclib-0.2.0.tar.gz:

Publisher: publish_to_pypi.yml on AIQC-Hub/aiqclib

- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: aiqclib-0.2.0.tar.gz
  - Subject digest: 240b7589bdc09aaf5fc6016f31d7b57329c5cba3e8fac3df6f1475cf25674581
- Sigstore transparency entry: 1510026270
- Sigstore integration time:
- Permalink: AIQC-Hub/aiqclib@1cd3e4d0d49f162b7e5a1ac683a8bef80d6ed200
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AIQC-Hub
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@1cd3e4d0d49f162b7e5a1ac683a8bef80d6ed200
- Trigger Event: release
File details
Details for the file aiqclib-0.2.0-py3-none-any.whl.
File metadata
- Download URL: aiqclib-0.2.0-py3-none-any.whl
- Upload date:
- Size: 152.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `687fa7b388648a2a2c996eb5d4ea6075cc03dba40d4f98d999ba32394bcc972d` |
| MD5 | `80f956192835a6decc98519677197e7c` |
| BLAKE2b-256 | `aef5e046c9d6a25d17a00ccc647b81875b4e15aa789cb7faaa5833640c9127f3` |
Provenance

The following attestation bundles were made for aiqclib-0.2.0-py3-none-any.whl:

Publisher: publish_to_pypi.yml on AIQC-Hub/aiqclib

- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: aiqclib-0.2.0-py3-none-any.whl
  - Subject digest: 687fa7b388648a2a2c996eb5d4ea6075cc03dba40d4f98d999ba32394bcc972d
- Sigstore transparency entry: 1510026327
- Sigstore integration time:
- Permalink: AIQC-Hub/aiqclib@1cd3e4d0d49f162b7e5a1ac683a8bef80d6ed200
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AIQC-Hub
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@1cd3e4d0d49f162b7e5a1ac683a8bef80d6ed200
- Trigger Event: release