
Python client library for KWS Platform API - dataset splits, feature npz download, model/artifacts/metrics push


KWS Library (kwslib)

A Python library that consumes the API of the KWS_Server backend, for use on Google Colab, in Jupyter Notebook, or in plain Python scripts.

Single Source of Truth (Operations): all operational docs, processes, and templates for the project live under Workspace/ (entry point: Workspace/README.md). This repo keeps only the technical documentation for the component.

Three main jobs (KWS_Lib ↔ KWS_Server)

KWS_Lib is a client library for working with data and metadata through KWS_Server (DB + MinIO); it runs well on Google Colab, in Jupyter Notebook, or in plain scripts.

Job — description — API / high-level:

1. Create dataset_split (train, val, test): fetch the list of feature files (MFCC/npz) → split into train/val/test (pandas, sklearn) → push to the DB. API: dataset_splits.get_mfcc_files() → split → dataset_splits.create_split_from_list(); or DatasetPipeline.get_data() → split_and_push().
2. Download npz (features extracted from MinIO): download the .npz files of a created split (the server reads from MinIO and streams them over the API). API: dataset_splits.download(split_id, output_path) (ZIP of npz); DatasetSplitFilesClient.download_all_npz(split_id, output_dir); list_files(split_id, file_type="npz") + download_file(...).
3. Push model info (config, artifacts, metrics): register a run, upload artifacts (model files to MinIO + DB), POST metrics. API: ModelManager.register_run(), push_artifact(), push_metrics(); or experiments.create_run(), artifacts.upload(), metrics.create(payload=...).

Metrics standard for POST: Accuracy, Precision, Recall, F1-Score, and Confusion Matrix are all required (use build_metrics_payload or metrics_from_sklearn).

Features

  • API coverage: wraps the entire backend API (datasets, dataset_splits, models, experiments, metrics, mlflow)
  • Data splitting: get_mfcc_files, create_split_from_list, the create_dataset_split.py script + pandas
  • Metrics standard: build_metrics_payload, metrics_from_sklearn → payloads in the right shape for POST /api/v1/metrics
  • MinIO via API: download .wav / .npz through the API (streamed); no direct MinIO connection needed
  • Telegram: optional notification when a job finishes

Installation (PyPI)

pip install kwslib

From source:

git clone <repository>
cd KWS_Lib
pip install -e .

Quick Start

Smoke check (no login)

Verify KWS_Server is reachable:

python -m kwslib.smoke

Override base URL:

KWS_SERVER_URL=http://127.0.0.1:8000 python -m kwslib.smoke

Basic Usage

from kwslib import KWSClient

# Initialize client
client = KWSClient(base_url="http://localhost:8000")

# Login
client.login(username="admin", password="password")

# List datasets
datasets = client.datasets.list()
print(f"Found {datasets['total']} datasets")

# Get dataset details
dataset = client.datasets.get(dataset_id=1)
print(f"Dataset: {dataset['name']}")

Download Dataset Split Files for Training

from kwslib import KWSClient, DatasetSplitFilesClient

# Initialize API client
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Initialize files client (uses API, no direct MinIO connection)
files_client = DatasetSplitFilesClient(api)

# List all files in split
files_info = files_client.list_files(split_id=1, file_type="npz")
print(f"Found {files_info['total_files']} files")

# Download all .npz files
files_client.download_all_npz(
    split_id=1,
    output_dir="features"
)

# Download all .wav files
files_client.download_all_wav(
    split_id=1,
    output_dir="audio"
)

# Or download as ZIP
files_client.download_all_files_zip(
    split_id=1,
    file_type="npz",
    output_path="features.zip"
)

# Get file metadata (for Google Colab loop/download strategy)
urls = files_client.get_file_urls(split_id=1, file_type="npz")
for file_info in urls["files"]:
    print(file_info["file_name"])
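Once the .npz archives are on disk (e.g. after download_all_npz), they can be loaded with plain numpy. A minimal sketch; the directory name, the "mfcc" key, and the array shape are illustrative assumptions — inspect archive.files on real downloads:

```python
import numpy as np
from pathlib import Path

# Demo stand-in: write one .npz shaped like a downloaded feature file.
# The "mfcc" key and the (40, 101) shape are assumptions for illustration.
out_dir = Path("features_demo")
out_dir.mkdir(exist_ok=True)
np.savez(out_dir / "kw_0001.npz", mfcc=np.zeros((40, 101), dtype=np.float32))

# Load every archive in the directory into memory for training.
features = {}
for path in sorted(out_dir.glob("*.npz")):
    with np.load(path) as archive:
        features[path.stem] = {name: archive[name] for name in archive.files}

print(features["kw_0001"]["mfcc"].shape)  # (40, 101)
```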

With Telegram Notifications

from kwslib import KWSClient, TelegramNotifier

# Initialize
client = KWSClient(base_url="http://localhost:8000")
client.login(username="admin", password="password")

notifier = TelegramNotifier(
    bot_token="YOUR_BOT_TOKEN",
    chat_id="YOUR_CHAT_ID"
)

# Create experiment run (triggers background job)
run = client.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=1,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-run",
)

# Wait for completion
job_id = run.get("job_id")
status = client.jobs.wait_for_completion(job_id)

# Send notification
if status["status"] == "completed":
    notifier.send(f"Training completed! Results: {status['result']}")
else:
    notifier.send(f"Training failed: {status.get('error')}")

API Modules

Authentication

  • client.auth.login() - Login
  • client.auth.logout() - Logout
  • client.auth.get_me() - Get current user info

Datasets

  • client.datasets.list() - List datasets
  • client.datasets.get() - Get dataset
  • client.datasets.create() - Create dataset
  • client.datasets.update() - Update dataset
  • client.datasets.delete() - Delete dataset
  • client.datasets.list_versions() - List versions
  • client.datasets.create_version() - Create version

Models

  • client.models.list() - List models
  • client.models.get() - Get model
  • client.models.create() - Create model
  • client.models.list_model_inits() - List model architectures

Experiments

  • client.experiments.list() - List experiments
  • client.experiments.create() - Create experiment
  • client.experiments.create_run(experiment_id, model_id, dataset_split_id, config, git_commit) - Create a run (background job)
  • client.experiments.list_runs(experiment_id) - List the runs of one experiment
  • client.experiments.list_runs_global(experiment_id=..., model_id=...) - List all runs (with filters)
  • client.experiments.get_run(experiment_id, run_id) - Get run details

Dataset Splits

  • client.dataset_splits.list(dataset_version_id=..., config_name=..., name=...) - List splits
  • client.dataset_splits.get(split_id) - Get one split by ID
  • client.dataset_splits.create(dataset_version_id, name, config_name) - Create a split record (metadata only)
  • client.dataset_splits.create_split_from_list(...) - Create a split from a file list (after splitting with pandas)
  • client.dataset_splits.get_mfcc_files(dataset_version_id, ...) - Get the list of MFCC files (as a DataFrame) to split
  • client.dataset_splits.download(split_id, output_path) - Download a split as a ZIP (npz)
  • client.dataset_splits.generate(split_id) - Trigger a split-generation job
  • client.dataset_splits.list_files(split_id, file_type) - List the .wav / .npz files in a split

Metrics (standard: Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

  • client.metrics.list(model_id=..., dataset_split_id=...) - List metrics
  • client.metrics.get(metric_id) - Get metric details
  • client.metrics.create(payload=payload) - POST a metric (use a payload built with build_metrics_payload or metrics_from_sklearn)
  • client.metrics.compare_metrics(model_ids=[1,2,3], split_id=...) - Compare metrics across models

Audio

  • client.audio.list_keyword_samples() - List keyword audio
  • client.audio.upload_keyword_sample() - Upload audio
  • client.audio.get_keyword_sample_url() - Get download URL from API

Features

  • client.features.get_keyword_features() - Get features
  • client.features.extract_keyword_features() - Extract features

Jobs

  • client.jobs.get() - Get job status
  • client.jobs.list() - List jobs
  • client.jobs.wait_for_completion() - Wait for job completion

Dataset Split Files Client

  • files_client.list_files() - List all files in split
  • files_client.download_wav() - Download a .wav file
  • files_client.download_npz() - Download and load a .npz file
  • files_client.download_all_wav() - Download all .wav files
  • files_client.download_all_npz() - Download all .npz files
  • files_client.download_all_files_zip() - Download all files as ZIP
  • files_client.get_file_urls() - Get file metadata for all files

Telegram Notifier

  • notifier.send() - Send message
  • notifier.send_file() - Send file
  • notifier.send_photo() - Send photo
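Under the hood, a notifier like this wraps the public Telegram Bot API: notifier.send() amounts to a sendMessage call. A standalone sketch of that call (the helper name is illustrative, not a kwslib internal):

```python
import json
import urllib.request

def build_send_message_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    # Telegram Bot API: POST https://api.telegram.org/bot<token>/sendMessage
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

# Build (but do not send) a request; urllib.request.urlopen(req) would deliver it.
req = build_send_message_request("123:ABC", "42", "Training completed!")
print(req.full_url)  # https://api.telegram.org/bot123:ABC/sendMessage
```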

Examples

Data splitting (pandas + create_split_from_list)

from kwslib import KWSClient
from create_dataset_split import get_split_data, push_splits
from sklearn.model_selection import train_test_split

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# 1. Fetch the list of MFCC files
df = get_split_data(api=api, dataset_version_id=48, feature_type_id=2)

# 2. Split train/test (stratified by label)
train_df, test_df = train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=42, stratify=df["derivative_label"]
)

# 3. Push to the DB (create splits + assign files)
created = push_splits(
    api=api,
    dataset_version_id=48,
    config_name="config_80_20",
    splits={"train": train_df, "test": test_df},
)
# created = {"train": 123, "test": 124}
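The same pattern extends to a three-way 70/15/15 split (matching a config name like config_70_15_15) with two successive train_test_split calls. A sketch on a synthetic DataFrame standing in for the get_split_data() result (column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MFCC file list
df = pd.DataFrame({
    "file_name": [f"kw_{i:04d}.npz" for i in range(100)],
    "derivative_label": [i % 2 for i in range(100)],
})

# First cut: 70% train, 30% holdout (stratified on the label)
train_df, holdout_df = train_test_split(
    df, train_size=0.7, random_state=42, stratify=df["derivative_label"]
)
# Second cut: split the holdout evenly -> 15% val, 15% test overall
val_df, test_df = train_test_split(
    holdout_df, train_size=0.5, random_state=42,
    stratify=holdout_df["derivative_label"],
)

print(len(train_df), len(val_df), len(test_df))  # 70 15 15
```

The resulting frames can then be pushed with push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df}).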

Standard metrics (Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

from kwslib import KWSClient, build_metrics_payload, metrics_from_sklearn
import numpy as np

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Option 1: from y_true, y_pred (sklearn)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
payload = metrics_from_sklearn(
    y_true, y_pred,
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    average="weighted",
)
api.metrics.create(payload=payload)

# Option 2: from a dict of precomputed metrics
# (values here are consistent with the confusion matrix below)
metrics = {
    "accuracy": 0.95,
    "precision": 0.957,
    "recall": 0.938,
    "f1_score": 0.947,
    "confusion_matrix": [[50, 2], [3, 45]],  # 2D list of ints
}
payload = build_metrics_payload(
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    metrics=metrics,
)
api.metrics.create(payload=payload)

# Compare several models on one split
comparison = api.metrics.compare_metrics(model_ids=[1, 2, 3], split_id=1)
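As a sanity check before POSTing, the headline numbers can be recomputed by hand from a binary confusion matrix (here assuming the [[TN, FP], [FN, TP]] layout; confirm against your own convention):

```python
# Binary confusion matrix, rows = true class, cols = predicted class
cm = [[50, 2], [3, 45]]
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # positive-class precision
recall = tp / (tp + fn)      # positive-class recall
f1_score = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1_score, 3))
# 0.95 0.957 0.938 0.947
```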

Complete split/download/upload workflow (summary)

# 1. Create a split (metadata only), or use create_split_from_list after splitting with pandas
split = api.dataset_splits.create(dataset_version_id=1, name="train", config_name="config_70_15_15")
split_id = split["id"]  # assuming the create() response includes the new split's ID
# Or: push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df})

# 2. Generate the split (a job) if needed
# job = api.dataset_splits.generate(split_id)

# 3. Create an experiment run (background job)
run = api.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=split_id,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-run",
)

# 4. After training, POST metrics (standard: accuracy, precision, recall, f1_score, confusion_matrix)
# run_id = the experiment-run ID (fetch it via list_runs once the create_run job completes)
payload = metrics_from_sklearn(y_true, y_pred, model_id=1, dataset_split_id=split_id, experiment_run_id=run_id)
api.metrics.create(payload=payload)

Google Colab Usage

# In Google Colab, iterate file metadata then call API download endpoints
from kwslib import KWSClient, DatasetSplitFilesClient

api = KWSClient(base_url="https://your-api.com")
api.login(username="admin", password="password")

files_client = DatasetSplitFilesClient(api)

# Get file metadata
urls = files_client.get_file_urls(split_id=1, file_type="npz")

# Download in Colab
import urllib.request
for file_info in urls["files"]:
    urllib.request.urlretrieve(
        file_info["url"],
        f"/content/{file_info['file_name']}"
    )

Configuration

Environment Variables

You can set default values using environment variables:

export KWS_BASE_URL="http://localhost:8000"
export KWS_USERNAME="admin"
export KWS_PASSWORD="password"
export MINIO_ENDPOINT="localhost:9000"
export MINIO_ACCESS_KEY="minioadmin"
export MINIO_SECRET_KEY="minioadmin"
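Whether the client picks these variables up automatically is not shown here; a common pattern is to read them explicitly when constructing the client (the fallback values below are assumptions for local development):

```python
import os

# Read connection settings from the environment, with local-dev fallbacks
base_url = os.getenv("KWS_BASE_URL", "http://localhost:8000")
username = os.getenv("KWS_USERNAME", "admin")
password = os.getenv("KWS_PASSWORD", "")

# Then hand them to the client:
# client = KWSClient(base_url=base_url)
# client.login(username=username, password=password)
```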

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

Make sure to bump the version in pyproject.toml before building.

License

MIT License

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
