
Python client library for KWS Platform API - dataset splits, feature npz download, model/artifacts/metrics push


KWS Library (kwslib)

A Python library that consumes the API of the KWS_Server backend, for use on Google Colab, in Jupyter Notebook, or in plain Python scripts.

Single Source of Truth (Operations): all operational documentation, processes, and templates for the project live under Workspace/ (entry point: Workspace/README.md). This repo keeps only the technical documentation for the component.

Three core workflows (KWS_Lib ↔ KWS_Server)

KWS_Lib supports training models on Google Colab, in Jupyter Notebook, or in scripts; all data and metadata go through KWS_Server (DB + MinIO).

  1. Create a dataset_split (train, val, test)
     Fetch the feature file list (MFCC/npz) → split into train/val/test (pandas, sklearn) → push to the DB.
     API: dataset_splits.get_mfcc_files() → split → dataset_splits.create_split_from_list(); or DatasetPipeline.get_data() → split_and_push().
  2. Download npz files (features extracted to MinIO)
     Download the .npz files of a created split (the server reads from MinIO and streams them over the API).
     API: dataset_splits.download(split_id, output_path) (ZIP of npz); DatasetSplitFilesClient.download_all_npz(split_id, output_dir); list_files(split_id, file_type="npz") + download_file(...).
  3. Push model information (config, artifacts, metrics)
     Register a run, upload artifacts (model files to MinIO + DB), POST metrics.
     API: ModelManager.register_run(), push_artifact(), push_metrics(); or experiments.create_run(), artifacts.upload(), metrics.create(payload=...).

Metrics standard for POST: Accuracy, Precision, Recall, F1-Score, and Confusion Matrix are all required (use build_metrics_payload or metrics_from_sklearn).
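The five required values can be computed directly with scikit-learn before building the payload; a minimal sketch using only plain sklearn (independent of the kwslib helpers, with toy labels for illustration):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="weighted"),
    "recall": recall_score(y_true, y_pred, average="weighted"),
    "f1_score": f1_score(y_true, y_pred, average="weighted"),
    # Convert to a plain 2D list of ints so the payload stays JSON-serializable
    "confusion_matrix": confusion_matrix(y_true, y_pred).astype(int).tolist(),
}
print(metrics)
```

A dict with exactly these keys matches what build_metrics_payload expects in the examples below.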

Features

  • API coverage: wraps the full backend API (datasets, dataset_splits, models, experiments, metrics, mlflow)
  • Data splitting: get_mfcc_files, create_split_from_list, the create_dataset_split.py script + pandas
  • Standard metrics: build_metrics_payload, metrics_from_sklearn → payloads in the correct format for POST /api/v1/metrics
  • MinIO via API: download .wav / .npz files through the API (streamed); no direct MinIO connection needed
  • Telegram: optional notifications when a job finishes

Installation (PyPI)

pip install kwslib

Optional: TensorFlow (for training CNNs):

pip install kwslib[tensorflow]

From source:

git clone <repository>
cd KWS_Lib
pip install -e .

Quick Start

Smoke check (no login)

Verify KWS_Server is reachable:

python -m kwslib.smoke

Override base URL:

KWS_SERVER_URL=http://127.0.0.1:8000 python -m kwslib.smoke

Basic Usage

from kwslib import KWSClient

# Initialize client
client = KWSClient(base_url="http://localhost:8000")

# Login
client.login(username="admin", password="password")

# List datasets
datasets = client.datasets.list()
print(f"Found {datasets['total']} datasets")

# Get dataset details
dataset = client.datasets.get(dataset_id=1)
print(f"Dataset: {dataset['name']}")

Download Dataset Split Files for Training

from kwslib import KWSClient, DatasetSplitFilesClient

# Initialize API client
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Initialize files client (uses API, no direct MinIO connection)
files_client = DatasetSplitFilesClient(api)

# List all files in split
files_info = files_client.list_files(split_id=1, file_type="npz")
print(f"Found {files_info['total_files']} files")

# Download all .npz files
files_client.download_all_npz(
    split_id=1,
    output_dir="features"
)

# Download all .wav files
files_client.download_all_wav(
    split_id=1,
    output_dir="audio"
)

# Or download as ZIP
files_client.download_all_files_zip(
    split_id=1,
    file_type="npz",
    output_path="features.zip"
)

# Get presigned URLs (for Google Colab)
urls = files_client.get_file_urls(split_id=1, file_type="npz")
for file_info in urls["files"]:
    print(f"{file_info['file_name']}: {file_info['url']}")

With Telegram Notifications

from kwslib import KWSClient, TelegramNotifier

# Initialize
client = KWSClient(base_url="http://localhost:8000")
client.login(username="admin", password="password")

notifier = TelegramNotifier(
    bot_token="YOUR_BOT_TOKEN",
    chat_id="YOUR_CHAT_ID"
)

# Create experiment run (triggers background job)
run = client.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=1,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-training",
)

# Wait for completion
job_id = run.get("job_id")
status = client.jobs.wait_for_completion(job_id)

# Send notification
if status["status"] == "completed":
    notifier.send(f"Training completed! Results: {status['result']}")
else:
    notifier.send(f"Training failed: {status.get('error')}")

API Modules

Authentication

  • client.auth.login() - Login
  • client.auth.logout() - Logout
  • client.auth.get_me() - Get current user info

Datasets

  • client.datasets.list() - List datasets
  • client.datasets.get() - Get dataset
  • client.datasets.create() - Create dataset
  • client.datasets.update() - Update dataset
  • client.datasets.delete() - Delete dataset
  • client.datasets.list_versions() - List versions
  • client.datasets.create_version() - Create version

Models

  • client.models.list() - List models
  • client.models.get() - Get model
  • client.models.create() - Create model
  • client.models.list_model_inits() - List model architectures

Experiments

  • client.experiments.list() - List experiments
  • client.experiments.create() - Create experiment
  • client.experiments.create_run(experiment_id, model_id, dataset_split_id, config, git_commit) - Create a run (background job)
  • client.experiments.list_runs(experiment_id) - List the runs of one experiment
  • client.experiments.list_runs_global(experiment_id=..., model_id=...) - List all runs (with filters)
  • client.experiments.get_run(experiment_id, run_id) - Get run details

Dataset Splits

  • client.dataset_splits.list(dataset_version_id=..., config_name=..., name=...) - List splits
  • client.dataset_splits.get(split_id) - Get a split by ID
  • client.dataset_splits.create(dataset_version_id, name, config_name) - Create a split record (metadata only)
  • client.dataset_splits.create_split_from_list(...) - Create a split from a file list (after splitting with pandas)
  • client.dataset_splits.get_mfcc_files(dataset_version_id, ...) - Fetch the MFCC file list (DataFrame) for splitting
  • client.dataset_splits.download(split_id, output_path) - Download a split as a ZIP of npz files
  • client.dataset_splits.generate(split_id) - Trigger the split-generation job
  • client.dataset_splits.list_files(split_id, file_type) - List the .wav / .npz files in a split

Metrics (standard: Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

  • client.metrics.list(model_id=..., dataset_split_id=...) - List metrics
  • client.metrics.get(metric_id) - Get metric details
  • client.metrics.create(payload=payload) - POST a metric (use a payload from build_metrics_payload or metrics_from_sklearn)
  • client.metrics.compare_metrics(model_ids=[1,2,3], split_id=...) - Compare metrics across models

Audio

  • client.audio.list_keyword_samples() - List keyword audio
  • client.audio.upload_keyword_sample() - Upload audio
  • client.audio.get_keyword_sample_url() - Get presigned URL

Features

  • client.features.get_keyword_features() - Get features
  • client.features.extract_keyword_features() - Extract features

Jobs

  • client.jobs.get() - Get job status
  • client.jobs.list() - List jobs
  • client.jobs.wait_for_completion() - Wait for job completion
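wait_for_completion presumably polls the job endpoint until the job reaches a terminal state. The general pattern can be sketched as follows; the wait_for_job function, the status names, and the stubbed poll callback here are illustrative stand-ins, not the kwslib implementation:

```python
import time

def wait_for_job(get_status, poll_interval=0.01, timeout=5.0):
    """Poll get_status() until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")

# Demo with a fake job that completes on the third poll
states = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "result": "ok"},
])
final = wait_for_job(lambda: next(states))
print(final["status"])  # completed
```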

Dataset Split Files Client

  • files_client.list_files() - List all files in split
  • files_client.download_wav() - Download a .wav file
  • files_client.download_npz() - Download and load a .npz file
  • files_client.download_all_wav() - Download all .wav files
  • files_client.download_all_npz() - Download all .npz files
  • files_client.download_all_files_zip() - Download all files as ZIP
  • files_client.get_file_urls() - Get presigned URLs for all files

Telegram Notifier

  • notifier.send() - Send message
  • notifier.send_file() - Send file
  • notifier.send_photo() - Send photo
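Under the hood, sending a message maps onto the public Telegram Bot API sendMessage method. A sketch of how such a request is constructed (request building only, no network call; the token and chat ID are placeholders, and this is not necessarily how TelegramNotifier is implemented internally):

```python
from urllib.parse import urlencode

def build_send_message_request(bot_token: str, chat_id: str, text: str):
    """Build the URL and form body for Telegram's sendMessage endpoint."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = urlencode({"chat_id": chat_id, "text": text})
    return url, body

url, body = build_send_message_request("TOKEN", "12345", "Training completed!")
print(url)
```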

Examples

Data splitting (pandas + create_split_from_list)

from kwslib import KWSClient
from create_dataset_split import get_split_data, push_splits
from sklearn.model_selection import train_test_split

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# 1. Fetch the MFCC file list
df = get_split_data(api=api, dataset_version_id=48, feature_type_id=2)

# 2. Split train/test (stratified by label)
train_df, test_df = train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=42, stratify=df["derivative_label"]
)

# 3. Push to the DB (create splits + assign files)
created = push_splits(
    api=api,
    dataset_version_id=48,
    config_name="config_80_20",
    splits={"train": train_df, "test": test_df},
)
# created = {"train": 123, "test": 124}
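The same pattern extends to a three-way split such as config_70_15_15 by calling train_test_split twice; a sketch on a synthetic DataFrame (the column names mirror the example above, but the frame here is fabricated for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MFCC file list returned by get_split_data
df = pd.DataFrame({
    "file_name": [f"kw_{i}.npz" for i in range(100)],
    "derivative_label": [i % 2 for i in range(100)],
})

# 70% train, then split the remaining 30% evenly into val/test (15%/15%)
train_df, rest_df = train_test_split(
    df, train_size=0.7, random_state=42, stratify=df["derivative_label"]
)
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, random_state=42, stratify=rest_df["derivative_label"]
)
print(len(train_df), len(val_df), len(test_df))  # 70 15 15
```

The three frames can then be pushed in one call: push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df}).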

Standard metrics (Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

from kwslib import KWSClient, build_metrics_payload, metrics_from_sklearn
import numpy as np

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Option 1: from y_true, y_pred (sklearn)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
payload = metrics_from_sklearn(
    y_true, y_pred,
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    average="weighted",
)
api.metrics.create(payload=payload)

# Option 2: from a dict of precomputed metrics
metrics = {
    "accuracy": 0.92,
    "precision": 0.91,
    "recall": 0.90,
    "f1_score": 0.905,
    "confusion_matrix": [[50, 2], [3, 45]],  # 2D list int
}
payload = build_metrics_payload(
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    metrics=metrics,
)
api.metrics.create(payload=payload)

# Compare multiple models on one split
comparison = api.metrics.compare_metrics(model_ids=[1, 2, 3], split_id=1)
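One pitfall with Option 2: sklearn's confusion_matrix returns a NumPy array, which is not JSON-serializable, so convert it to a plain 2D list of Python ints before putting it in the metrics dict. A small sketch of that conversion:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)            # NumPy array
cm_list = [[int(v) for v in row] for row in cm]  # plain 2D list of ints
print(cm_list)  # [[2, 0], [1, 1]]
```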

Complete training workflow (summary)

# 1. Create the split (metadata), or use create_split_from_list after splitting with pandas
split = api.dataset_splits.create(dataset_version_id=1, name="train", config_name="config_70_15_15")
# Or: push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df})

# 2. Generate the split (job) if needed
# job = api.dataset_splits.generate(split_id)

# 3. Create an experiment run (background job)
run = api.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=split_id,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-training",
)

# 4. After training, POST metrics (standard: accuracy, precision, recall, f1_score, confusion_matrix)
# run_id = the experiment run ID (from list_runs after the create_run job completes)
payload = metrics_from_sklearn(y_true, y_pred, model_id=1, dataset_split_id=split_id, experiment_run_id=run_id)
api.metrics.create(payload=payload)

Google Colab Usage

# In Google Colab, use presigned URLs for direct download
from kwslib import KWSClient, DatasetSplitFilesClient

api = KWSClient(base_url="https://your-api.com")
api.login(username="admin", password="password")

files_client = DatasetSplitFilesClient(api)

# Get presigned URLs
urls = files_client.get_file_urls(split_id=1, file_type="npz")

# Download in Colab
import urllib.request
for file_info in urls["files"]:
    urllib.request.urlretrieve(
        file_info["url"],
        f"/content/{file_info['file_name']}"
    )

Configuration

Environment Variables

You can set default values using environment variables:

export KWS_BASE_URL="http://localhost:8000"
export KWS_USERNAME="admin"
export KWS_PASSWORD="password"
export MINIO_ENDPOINT="localhost:9000"
export MINIO_ACCESS_KEY="minioadmin"
export MINIO_SECRET_KEY="minioadmin"

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

Make sure to bump the version in pyproject.toml / setup.py before building.

License

MIT License

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
