
Python client library for KWS Platform API - dataset splits, feature npz download, model/artifacts/metrics push


KWS Library (kwslib)

A Python client library for the KWS_Server backend API, usable from Google Colab, Jupyter Notebook, or plain Python scripts.

Single Source of Truth (Operations): all project operations, process, and template documentation lives in Workspace/ (entry point: Workspace/README.md). This repo keeps only the technical documentation for this component.

Three main tasks (KWS_Lib ↔ KWS_Server)

KWS_Lib is a client library for working with data and metadata through KWS_Server (DB + MinIO); it works well on Google Colab, Jupyter Notebook, or in scripts.

  1. Create a dataset_split (train, val, test) — fetch the list of feature files (MFCC/npz), split into train/val/test (pandas, sklearn), then push to the DB.
     API: dataset_splits.get_mfcc_files() → split → dataset_splits.create_split_from_list(); or DatasetPipeline.get_data() → split_and_push().
  2. Download npz (features extracted from MinIO) — download the .npz files of a created split (the server reads from MinIO and streams them through the API).
     API: dataset_splits.download(split_id, output_path) (ZIP of npz); DatasetSplitFilesClient.download_all_npz(split_id, output_dir); list_files(split_id, file_type="npz") + download_file(...).
  3. Push model information (config, artifacts, metrics) — register a run, upload artifacts (model files to MinIO + DB), POST metrics.
     API: ModelManager.register_run(), push_artifact(), push_metrics(); or experiments.create_run(), artifacts.upload(), metrics.create(payload=...).

Metrics standard for POST: Accuracy, Precision, Recall, F1-Score, and Confusion Matrix are required (use build_metrics_payload or metrics_from_sklearn).
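As a minimal sketch of what those five required values look like, they can be computed from binary predictions in plain Python (no sklearn). The field names mirror the build_metrics_payload example further below; treat the helper itself as illustrative, not as the library's implementation:

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def binary_metrics(y_true, y_pred):
    """Compute the five required metric values from binary predictions."""
    cm = confusion_matrix_2x2(y_true, y_pred)
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "confusion_matrix": cm,
    }

metrics = binary_metrics([0, 1, 1, 0], [0, 1, 0, 0])
```

In practice metrics_from_sklearn does this work for you; the sketch just makes the expected shape of the values explicit.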

Features

  • API Coverage: wraps the full backend API (datasets, dataset_splits, models, experiments, metrics, mlflow)
  • Data splitting: get_mfcc_files, create_split_from_list, the create_dataset_split.py script + pandas
  • Metrics standard: build_metrics_payload, metrics_from_sklearn → payloads in the correct shape for POST /api/v1/metrics
  • MinIO via API: download .wav / .npz through the API (streamed), no direct MinIO connection required
  • Telegram: optional notifications when a job finishes

Installation (PyPI)

pip install kwslib

Environment requirements:

  • Python >=3.10
  • Works well on Google Colab (NumPy 2.x / SciPy 1.13+ / scikit-learn 1.6+)

From source:

git clone <repository>
cd KWS_Lib
pip install -e .

Quick Start

Smoke check (no login)

Verify KWS_Server is reachable:

python -m kwslib.smoke

Override base URL:

KWS_SERVER_URL=http://127.0.0.1:8000 python -m kwslib.smoke

Basic Usage

from kwslib import KWSClient

# Initialize client
client = KWSClient(base_url="http://localhost:8000")

# Login
client.login(username="admin", password="password")

# List datasets
datasets = client.datasets.list()
print(f"Found {datasets['total']} datasets")

# Get dataset details
dataset = client.datasets.get(dataset_id=1)
print(f"Dataset: {dataset['name']}")

Download Dataset Split Files for Training

from kwslib import KWSClient, DatasetSplitFilesClient

# Initialize API client
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Initialize files client (uses API, no direct MinIO connection)
files_client = DatasetSplitFilesClient(api)

# List all files in split
files_info = files_client.list_files(split_id=1, file_type="npz")
print(f"Found {files_info['total_files']} files")

# Download all .npz files
files_client.download_all_npz(
    split_id=1,
    output_dir="features"
)

# Download all .wav files
files_client.download_all_wav(
    split_id=1,
    output_dir="audio"
)

# Or download as ZIP
files_client.download_all_files_zip(
    split_id=1,
    file_type="npz",
    output_path="features.zip"
)

# Get file metadata (for Google Colab loop/download strategy)
urls = files_client.get_file_urls(split_id=1, file_type="npz")
for file_info in urls["files"]:
    print(file_info["file_name"])

With Telegram Notifications

from kwslib import KWSClient, TelegramNotifier

# Initialize
client = KWSClient(base_url="http://localhost:8000")
client.login(username="admin", password="password")

notifier = TelegramNotifier(
    bot_token="YOUR_BOT_TOKEN",
    chat_id="YOUR_CHAT_ID"
)

# Create experiment run (triggers background job)
run = client.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=1,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-run",
)

# Wait for completion
job_id = run.get("job_id")
status = client.jobs.wait_for_completion(job_id)

# Send notification
if status["status"] == "completed":
    notifier.send(f"Training completed! Results: {status['result']}")
else:
    notifier.send(f"Training failed: {status.get('error')}")

API Modules

Authentication

  • client.auth.login() - Login
  • client.auth.logout() - Logout
  • client.auth.get_me() - Get current user info

Datasets

  • client.datasets.list() - List datasets
  • client.datasets.get() - Get dataset
  • client.datasets.create() - Create dataset
  • client.datasets.update() - Update dataset
  • client.datasets.delete() - Delete dataset
  • client.datasets.list_versions() - List versions
  • client.datasets.create_version() - Create version

Models

  • client.models.list() - List models
  • client.models.get() - Get model
  • client.models.create() - Create model
  • client.models.list_model_inits() - List model architectures

Experiments

  • client.experiments.list() - List experiments
  • client.experiments.create() - Create experiment
  • client.experiments.create_run(experiment_id, model_id, dataset_split_id, config, git_commit) - Create a run (background job)
  • client.experiments.list_runs(experiment_id) - List the runs of one experiment
  • client.experiments.list_runs_global(experiment_id=..., model_id=...) - List all runs (with filters)
  • client.experiments.get_run(experiment_id, run_id) - Get run details

Dataset Splits

  • client.dataset_splits.list(dataset_version_id=..., config_name=..., name=...) - List splits
  • client.dataset_splits.get(split_id) - Get one split by ID
  • client.dataset_splits.create(dataset_version_id, name, config_name) - Create a split record (metadata only)
  • client.dataset_splits.create_split_from_list(...) - Create a split from a file list (after splitting with pandas)
  • client.dataset_splits.get_mfcc_files(dataset_version_id, ...) - Get the list of MFCC files (as a DataFrame) to split
  • client.dataset_splits.download(split_id, output_path) - Download a split as a ZIP (npz)
  • client.dataset_splits.generate(split_id) - Trigger a split-generation job
  • client.dataset_splits.list_files(split_id, file_type) - List the .wav / .npz files in a split

Metrics (standard: Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

  • client.metrics.list(model_id=..., dataset_split_id=...) - List metrics
  • client.metrics.get(metric_id) - Get metric details
  • client.metrics.create(payload=payload) - POST a metric (use a payload from build_metrics_payload or metrics_from_sklearn)
  • client.metrics.compare_metrics(model_ids=[1,2,3], split_id=...) - Compare metrics across models

Audio

  • client.audio.list_keyword_samples() - List keyword audio
  • client.audio.upload_keyword_sample() - Upload audio
  • client.audio.get_keyword_sample_url() - Get download URL from API

Features

  • client.features.get_keyword_features() - Get features
  • client.features.extract_keyword_features() - Extract features

Jobs

  • client.jobs.get() - Get job status
  • client.jobs.list() - List jobs
  • client.jobs.wait_for_completion() - Wait for job completion
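wait_for_completion presumably polls the job status until it reaches a terminal state. A hedged sketch of that pattern (the "completed"/"failed" status values match the Telegram example above; the callable, parameter names, and fake backend are illustrative, not the library's API):

```python
import time

def wait_for_completion(get_job, job_id, poll_interval=2.0, timeout=600.0):
    """Poll a job-status callable until the job finishes or times out.

    `get_job` stands in for a status lookup such as client.jobs.get();
    it must return a dict with at least a "status" key, where
    "completed" and "failed" are terminal states.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_job(job_id)
        if status["status"] in ("completed", "failed"):
            return status
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        time.sleep(poll_interval)

# Usage with a fake backend that completes on the third poll:
calls = {"n": 0}
def fake_get(job_id):
    calls["n"] += 1
    return {"status": "completed" if calls["n"] >= 3 else "running"}

result = wait_for_completion(fake_get, job_id=1, poll_interval=0.01)
```

The timeout guard matters in notebooks: without it, a stuck background job leaves the cell blocked indefinitely.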

Dataset Split Files Client

  • files_client.list_files() - List all files in split
  • files_client.download_wav() - Download a .wav file
  • files_client.download_npz() - Download and load a .npz file
  • files_client.download_all_wav() - Download all .wav files
  • files_client.download_all_npz() - Download all .npz files
  • files_client.download_all_files_zip() - Download all files as ZIP
  • files_client.get_file_urls() - Get file metadata for all files

Telegram Notifier

  • notifier.send() - Send message
  • notifier.send_file() - Send file
  • notifier.send_photo() - Send photo

Examples

Data splitting (pandas + create_split_from_list)

from kwslib import KWSClient
from create_dataset_split import get_split_data, push_splits
from sklearn.model_selection import train_test_split

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# 1. Fetch the list of MFCC files
df = get_split_data(api=api, dataset_version_id=48, feature_type_id=2)

# 2. Split train/test (stratified by label)
train_df, test_df = train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=42, stratify=df["derivative_label"]
)

# 3. Push to the DB (create splits + assign files)
created = push_splits(
    api=api,
    dataset_version_id=48,
    config_name="config_80_20",
    splits={"train": train_df, "test": test_df},
)
# created = {"train": 123, "test": 124}

Standard metrics (Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

from kwslib import KWSClient, build_metrics_payload, metrics_from_sklearn
import numpy as np

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Option 1: from y_true, y_pred (sklearn)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
payload = metrics_from_sklearn(
    y_true, y_pred,
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    average="weighted",
)
api.metrics.create(payload=payload)

# Option 2: from a dict of precomputed metrics
metrics = {
    "accuracy": 0.92,
    "precision": 0.91,
    "recall": 0.90,
    "f1_score": 0.905,
    "confusion_matrix": [[50, 2], [3, 45]],  # 2D list of ints
}
payload = build_metrics_payload(
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    metrics=metrics,
)
api.metrics.create(payload=payload)

# Compare multiple models on one split
comparison = api.metrics.compare_metrics(model_ids=[1, 2, 3], split_id=1)

Complete split/download/upload workflow (summary)

# 1. Create a split (metadata), or use create_split_from_list after splitting with pandas
split = api.dataset_splits.create(dataset_version_id=1, name="train", config_name="config_70_15_15")
# Or: push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df})

# 2. Generate the split (job) if needed
# job = api.dataset_splits.generate(split_id)

# 3. Create an experiment run (background job)
run = api.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=split_id,  # split_id = ID of the split created in step 1
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-run",
)

# 4. After training, POST metrics (standard: accuracy, precision, recall, f1_score, confusion_matrix)
# run_id = ID of the experiment run (taken from list_runs after the create_run job completes)
payload = metrics_from_sklearn(y_true, y_pred, model_id=1, dataset_split_id=split_id, experiment_run_id=run_id)
api.metrics.create(payload=payload)
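The config_70_15_15 naming suggests a three-way 70/15/15 split. One way to derive such a split (a plain-Python sketch, independent of whatever create_split_from_list actually does server-side) is to shuffle the file list once and slice it:

```python
import random

def split_70_15_15(items, seed=42):
    """Shuffle once with a fixed seed, then slice into ~70/15/15
    train/val/test lists (the remainder goes to test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * 0.70)
    n_val = int(n * 0.15)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }

splits = split_70_15_15(range(100))
```

Note that the earlier pandas example stratifies by label, which this sketch skips for brevity; for imbalanced keyword data, prefer the stratified train_test_split route.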

Google Colab Usage

# In Google Colab, iterate file metadata then call API download endpoints
from kwslib import KWSClient, DatasetSplitFilesClient

api = KWSClient(base_url="https://your-api.com")
api.login(username="admin", password="password")

files_client = DatasetSplitFilesClient(api)

# Get file metadata
urls = files_client.get_file_urls(split_id=1, file_type="npz")

# Download in Colab
import urllib.request
for file_info in urls["files"]:
    urllib.request.urlretrieve(
        file_info["url"],
        f"/content/{file_info['file_name']}"
    )

Configuration

Environment Variables

You can set default values using environment variables:

export KWS_BASE_URL="http://localhost:8000"
export KWS_USERNAME="admin"
export KWS_PASSWORD="password"
export MINIO_ENDPOINT="localhost:9000"
export MINIO_ACCESS_KEY="minioadmin"
export MINIO_SECRET_KEY="minioadmin"
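A hedged sketch of how a script might consume these variables before constructing the client (the variable names come from the list above; the helper itself and its fallback defaults are illustrative, not part of kwslib):

```python
import os

def client_config_from_env():
    """Read connection defaults from the environment variables above,
    falling back to illustrative local-dev values."""
    return {
        "base_url": os.getenv("KWS_BASE_URL", "http://localhost:8000"),
        "username": os.getenv("KWS_USERNAME", "admin"),
        "password": os.getenv("KWS_PASSWORD", ""),
    }

cfg = client_config_from_env()
# client = KWSClient(base_url=cfg["base_url"])
# client.login(username=cfg["username"], password=cfg["password"])
```

Keeping credentials in the environment (rather than hard-coded in notebooks) is especially useful on Colab, where notebooks are often shared.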

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

Make sure to bump the version in pyproject.toml before building.

License

MIT License

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
