aiqclib
aiqclib is a Python library that provides a configuration-driven workflow for machine learning, simplifying dataset preparation, model training, and data classification. It is a core component of the AIQC project, which aims to enhance anomaly detection in CTD (Conductivity, Temperature, Depth) data.
ML Algorithms Supported by aiqclib
| Category | Algorithm | Short Name | Method |
|---|---|---|---|
| Tree-Based & Ensemble | XGBoost | XGB | Ensemble (Boosting) |
| Tree-Based & Ensemble | Random Forest | RF | Ensemble (Bagging) |
| Tree-Based & Ensemble | Decision Tree | DT | Tree |
| Linear & Geometric | Logistic Regression | Logit | Linear |
| Linear & Geometric | Linear Discriminant Analysis | LDA | Linear / Statistical |
| Linear & Geometric | Support Vector Machine | SVM | Geometric |
| Instance-Based | K-Nearest Neighbors | KNN | Distance-based |
| Probabilistic | Gaussian Naive Bayes | GNB | Probabilistic |
| Neural Network | Multilayer Perceptron | MLP | Neural Network |
Installation
The package is available on PyPI and conda-forge.
Using pip:

```shell
pip install aiqclib
```

Using conda:

```shell
conda install -c conda-forge aiqclib
```
Documentation
Project documentation is hosted on Read the Docs.
Core Concepts
The library is designed around a three-stage workflow:
- Dataset Preparation: Prepare feature datasets from raw data and generate training, validation, and test data sets.
- Training & Evaluation: Train machine learning models and evaluate their performance using cross-validation.
- Classification: Apply a trained model to classify new, unseen data.
Each stage is controlled by a YAML configuration file, allowing you to define and reproduce your entire workflow with ease.
Usage
The general workflow for any task in aiqclib follows these steps:
- Generate a Configuration Template: Create a starter YAML file for the task (e.g., `prepare`, `train`, `classify`).
- Customize the Configuration: Edit the YAML file to specify paths, dataset names, and other parameters.
- Run the Task: Load the configuration and execute the main function for the task.
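Taken together, these steps amount to a short driver script. The sketch below is hypothetical glue code (not part of aiqclib) that chains all three stages using only the functions documented in this README; the config paths are placeholders you would replace with your own:

```python
# Hypothetical driver sketch: chains the three documented aiqclib stages.
# The function is defined but not executed here; config paths are placeholders.
def run_full_pipeline(prepare_cfg, train_cfg, classify_cfg):
    import aiqclib as aq

    # Stage 1: build training/validation/test datasets from raw input
    aq.create_training_dataset(aq.read_config(prepare_cfg))
    # Stage 2: train models and evaluate them with cross-validation
    aq.train_and_evaluate(aq.read_config(train_cfg))
    # Stage 3: classify new data with the trained model
    aq.classify_dataset(aq.read_config(classify_cfg))
```

Each stage reads its own YAML file, so the three configs can be versioned and reused independently.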
1. Dataset Preparation
This workflow processes your input data and creates training, validation, and test sets.
Step 1: Generate a configuration template.
```python
import aiqclib as aq

aq.write_config_template(file_name="/path/to/prepare_config.yaml", stage="prepare")
```
Step 2: Customize prepare_config.yaml.
You must edit the file to set the correct input/output paths and define your dataset. See the Configuration section for details.
Step 3: Run the preparation process.
```python
import aiqclib as aq

config = aq.read_config("/path/to/prepare_config.yaml")
aq.create_training_dataset(config)
```
This generates the following output folders:
- summary: Statistics of input data used for normalization.
- select: Profiles with bad observation flags (positive samples) and good profiles (negative samples).
- locate: Observation records for both positive and negative profiles.
- extract: Features extracted from the observation records.
- training: The final training, validation, and test datasets.
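As a quick sanity check after a run, you can verify that all five folders were created under your configured base path. The helper below is hypothetical and stdlib-only (not part of aiqclib); the folder names are taken from the list above:

```python
from pathlib import Path
import tempfile

# Output folders that create_training_dataset is documented to produce.
STAGE_FOLDERS = ["summary", "select", "locate", "extract", "training"]

def missing_prepare_outputs(base_path):
    """Return the names of expected stage folders missing under base_path."""
    base = Path(base_path)
    return [name for name in STAGE_FOLDERS if not (base / name).is_dir()]

# Demo with a temporary directory standing in for the configured base_path:
with tempfile.TemporaryDirectory() as tmp:
    for name in STAGE_FOLDERS:
        Path(tmp, name).mkdir()
    missing = missing_prepare_outputs(tmp)  # -> [] when all folders exist
```

An empty return value means the preparation stage produced the full set of outputs.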
2. Model Training and Evaluation
This workflow uses the prepared dataset to train a model and evaluate its performance.
Step 1: Generate a training configuration template.
```python
import aiqclib as aq

aq.write_config_template(file_name="/path/to/training_config.yaml", stage="train")
```
Step 2: Customize training_config.yaml.
Edit the file to point to your prepared dataset and define training parameters.
Step 3: Train and evaluate the model.
```python
import aiqclib as aq

config = aq.read_config("/path/to/training_config.yaml")
aq.train_and_evaluate(config)
```
This generates the following output folders:
- validate: Results from the cross-validation process.
- build: The final trained models and their evaluation results on the test dataset.
3. Data Classification
This workflow applies a trained model to classify all observations in a dataset.
Step 1: Generate a classification configuration template.
```python
import aiqclib as aq

aq.write_config_template(file_name="/path/to/classification_config.yaml", stage="classify")
```
Step 2: Customize classification_config.yaml.
Edit the file to point to the input data and the trained model.
Step 3: Run classification.
```python
import aiqclib as aq

config = aq.read_config("/path/to/classification_config.yaml")
aq.classify_dataset(config)
```
This workflow processes a dataset using a trained model and generates:
- classify: The final classification results and a summary report.
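For downstream analysis you will typically want the results back in memory. The sketch below is hypothetical: the exact file layout and naming under the classify folder, and the use of Parquet for the result tables, are assumptions, so check your run's actual output before relying on it:

```python
# Hypothetical results loader; defined but not executed here.
# Assumes results are written as Parquet files under <base_path>/classify.
def load_classification_results(base_path):
    from pathlib import Path
    import pandas as pd  # assumption: results are pandas-readable Parquet

    result_dir = Path(base_path) / "classify"
    # Map each result file's stem to its DataFrame for inspection.
    return {p.stem: pd.read_parquet(p) for p in result_dir.glob("*.parquet")}
```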
Configuration
Configuration is managed via YAML files. The write_config_template function provides a starting point that you must customize for each module.
1. Dataset Preparation (stage="prepare")
The preparation config requires you to modify two key sections:
- `path_info_sets`: Defines the location of input and output data.

  ```yaml
  path_info_sets:
    - name: data_set_1
      common:
        base_path: /path/to/data     # EDIT: Root output directory
      input:
        base_path: /path/to/input    # EDIT: Directory with input files
        step_folder_name: ""
      split:
        step_folder_name: training
  ```

- `data_sets`: Defines a specific dataset to be processed.

  ```yaml
  data_sets:
    - name: dataset_0001                      # EDIT: Your data set name
      dataset_folder_name: dataset_0001       # EDIT: Your output folder
      input_file_name: nrt_cora_bo_4.parquet  # EDIT: Your input filename
  ```
2. Training and Evaluation (stage="train")
The training config links the prepared data to the model training process.
- `path_info_sets`: Defines where to find the prepared dataset and where to save model artifacts.

  ```yaml
  path_info_sets:
    - name: data_set_1
      common:
        base_path: /path/to/data  # EDIT: Root output directory
      input:
        step_folder_name: training
  ```

- `training_sets`: Links to a dataset prepared in the previous workflow.

  ```yaml
  training_sets:
    - name: training_0001                # EDIT: Your training name
      dataset_folder_name: dataset_0001  # EDIT: Your output folder
  ```
3. Classification (stage="classify")
The classification config uses a trained model to classify new data.
- `path_info_sets`: Defines paths for raw data, models, and classification results.

  ```yaml
  path_info_sets:
    - name: data_set_1
      common:
        base_path: /path/to/data    # EDIT: Root output directory
      input:
        base_path: /path/to/input   # EDIT: Directory with input files
        step_folder_name: ""
      model:
        base_path: /path/to/model   # EDIT: Directory with model files
        step_folder_name: model
      concat:
        step_folder_name: classification  # EDIT: Directory with classification results
  ```

- `classification_sets`: Defines a specific dataset to be classified.

  ```yaml
  classification_sets:
    - name: classification_0001               # EDIT: Your classification name
      dataset_folder_name: dataset_0001       # EDIT: Your output folder
      input_file_name: nrt_cora_bo_4.parquet  # EDIT: Your input filename
  ```
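Because every template ships with placeholder paths, it is easy to run a stage against a config that was never edited. A small stdlib-only check like the following hypothetical helper (not part of aiqclib) flags lines that still contain the `/path/to/` placeholders shown above:

```python
def unedited_placeholders(config_text):
    """Return 1-based line numbers that still carry a template placeholder path."""
    return [
        lineno
        for lineno, line in enumerate(config_text.splitlines(), start=1)
        if "/path/to/" in line
    ]

# Example: line 4 still carries the template's placeholder path.
sample = """\
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data
"""
flagged = unedited_placeholders(sample)  # -> [4]
```

Running such a check before `read_config` can save a failed stage caused by an unedited template.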
Contributing & Development
We welcome contributions! Please use the following guidelines for development.
Environment Setup
We recommend using uv for managing the development environment.
1. Install `uv` into your base conda/mamba environment. This makes the `uv` command available globally without cluttering your `base` environment.

   ```shell
   # Using mamba (recommended)
   mamba activate base
   mamba install -n base -c conda-forge uv

   # Or using conda
   conda activate base
   conda install -n base -c conda-forge uv
   ```

2. Create and activate the project's virtual environment. From the project's root directory, run the following:

   ```shell
   # Create the virtual environment in a .venv folder
   uv venv
   # Activate the virtual environment
   source .venv/bin/activate
   ```

3. Install the project and its dependencies. This installs the library in "editable" mode (`-e`) and pulls in all dependencies from `pyproject.toml`.

   ```shell
   uv sync
   uv pip install -e .
   ```
Running Tests
With your environment activated, you can run the test suite using pytest.
```shell
uv run pytest -v
```
Code Style (Linting & Formatting)
We use Ruff for linting and formatting.
Linting: Check the library and test code for style issues.

```shell
# Lint the library source code
ruff check src
# Lint the test code
ruff check tests
```

Formatting: Automatically format the code to match the project's style.

```shell
# Format the library source code
ruff format src
# Format the test code
ruff format tests
```
Documentation (for Maintainers)
Building Docs Locally
1. Update Docstrings (requires a Google Gemini API key):

   ```shell
   # Update docstrings for source files
   python ./docs/scripts/update_docstrings.py src docs/scripts/prompt_main.txt
   # Update docstrings for test files
   python ./docs/scripts/update_docstrings.py tests docs/scripts/prompt_unittest.txt
   ```

2. Review Docstrings: Manually review all modified files. Remove generated headers/footers and correct any sections marked with "Issues:".

3. Update API Documents: From the project root, run:

   ```shell
   uv run sphinx-apidoc -f --remove-old --module-first -o docs/source/api src/aiqclib
   ```

4. Build HTML: From the project root, run:

   ```shell
   cd docs; uv run make html; cd ..
   ```

   You can view the generated site by opening `docs/build/html/index.html` in a browser.
Deployment (for Maintainers)
PyPI
The package is published to PyPI automatically via a GitHub Action whenever a new release is created on GitHub.
conda-forge (Automatic)
The conda-forge bot automatically creates a pull request and merges it into the main branch when a new version of the package is published on PyPI.
conda-forge (Manual)
Bump version with new dependencies
When runtime dependencies change, the automated PR from the conda-forge bot may fail. In that case, manually update the feedstock by creating a pull request to the conda-forge/aiqclib-feedstock repository.
1. Install build tools:

   ```shell
   mamba install -c conda-forge conda-build conda-smithy grayskull
   ```

2. Fork and clone the `aiqclib-feedstock` repository.
3. Sync with upstream (e.g., add `conda-forge/aiqclib-feedstock` as a remote named `upstream` and `git rebase upstream/main`).
4. Update the forked repo:

   ```shell
   git checkout main             # Go to your local main branch
   git fetch upstream            # Get latest changes from the original repo
   git rebase upstream/main      # Make your local main perfectly linear with the original
   git push origin main --force  # Update your GitHub fork's main (optional but good practice)
   ```

5. Create a new branch (e.g., `git checkout -b update_vX.Y.Z`).
6. Generate a strict recipe (e.g., `grayskull pypi aiqclib --strict-conda-forge`).
7. Review `recipe/meta.yaml` and ensure it meets conda-forge standards.
8. Rerender the feedstock (e.g., `conda smithy rerender -c auto`).
9. Commit, push, and open a pull request to the `aiqclib-feedstock` repository.
10. Merge it after passing CI.
Initial upload
Submitting the package to conda-forge involves creating a pull request to the conda-forge/staged-recipes repository.

1. Fork and clone the `staged-recipes` repository.
2. Configure the upstream remote:

   ```shell
   git remote add upstream https://github.com/conda-forge/staged-recipes.git
   ```

3. Create a new branch (e.g., `git checkout -b aiqclib-recipe`).
4. Generate a strict recipe: `grayskull pypi aiqclib --strict-conda-forge`.
5. Review `recipes/aiqclib/meta.yaml` and ensure it meets conda-forge standards.
6. Commit, push, and open a pull request to the `staged-recipes` repository.
Anaconda.org (Manual)
Publishing to the <username> channel on Anaconda.org is a manual process.
1. Install build tools:

   ```shell
   mamba install -c conda-forge conda-build anaconda-client grayskull conda-smithy
   ```

2. Generate Recipe: From the project root, run `grayskull pypi aiqclib`. This creates `aiqclib/meta.yaml`.
3. Build Package:

   ```shell
   conda build aiqclib
   ```

4. Upload Package:

   ```shell
   anaconda login
   anaconda upload /path/to/your/conda-bld/noarch/aiqclib-*.conda
   ```

5. Cleanup: Copy `aiqclib/meta.yaml` to `conda/meta.yaml` for version control and remove the temporary `aiqclib` directory.