Python client library for KWS Platform API - dataset splits, feature npz download, model/artifacts/metrics push
KWS Library (kwslib)
Python library that talks to the KWS_Server backend API; works on Google Colab, Jupyter Notebook, or in plain Python scripts.
Single Source of Truth (Operations): all operational docs, processes, and templates for the project live in
Workspace/ (entry: Workspace/README.md). This repo keeps only the technical documentation for the component.
Three main tasks (KWS_Lib ↔ KWS_Server)
KWS_Lib is a client library for working with data and metadata through KWS_Server (DB + MinIO); it works well on Google Colab, Jupyter Notebook, or in scripts.
| Task | Description | API / high-level |
|---|---|---|
| 1. Create dataset_split (train, val, test) | Fetch the list of feature files (MFCC/npz) → split into train/val/test (pandas, sklearn) → push to the DB. | dataset_splits.get_mfcc_files() → split → dataset_splits.create_split_from_list(); or DatasetPipeline.get_data() → split_and_push(). |
| 2. Download npz (features extracted to MinIO) | Download the .npz files of a created split (the server reads from MinIO and streams them through the API). | dataset_splits.download(split_id, output_path) (ZIP of npz); DatasetSplitFilesClient.download_all_npz(split_id, output_dir); list_files(split_id, file_type="npz") + download_file(...). |
| 3. Push model information (config, artifacts, metrics) | Register a run, upload artifacts (model files to MinIO + DB), POST metrics. | ModelManager.register_run(), push_artifact(), push_metrics(); or experiments.create_run(), artifacts.upload(), metrics.create(payload=...). |
Metrics standard when POSTing: Accuracy, Precision, Recall, F1-Score, and Confusion Matrix are required (use build_metrics_payload or metrics_from_sklearn).
Features
- API Coverage: wraps the full backend API (datasets, dataset_splits, models, experiments, metrics, mlflow)
- Data splitting: get_mfcc_files, create_split_from_list, the create_dataset_split.py script + pandas
- Standard metrics: build_metrics_payload, metrics_from_sklearn → a correctly shaped payload to POST to /api/v1/metrics
- MinIO via API: download .wav / .npz through the API (streamed), no direct MinIO connection needed
- Telegram: optional notification when a job finishes
Installation (PyPI)
pip install kwslib
Environment requirements:
- Python >= 3.10
- Works well on Google Colab (NumPy 2.x / SciPy 1.13+ / scikit-learn 1.6+)
From source:
git clone <repository>
cd KWS_Lib
pip install -e .
Quick Start
Smoke check (no login)
Verify KWS_Server is reachable:
python -m kwslib.smoke
Override base URL:
KWS_SERVER_URL=http://127.0.0.1:8000 python -m kwslib.smoke
Basic Usage
from kwslib import KWSClient
# Initialize client
client = KWSClient(base_url="http://localhost:8000")
# Login
client.login(username="admin", password="password")
# List datasets
datasets = client.datasets.list()
print(f"Found {datasets['total']} datasets")
# Get dataset details
dataset = client.datasets.get(dataset_id=1)
print(f"Dataset: {dataset['name']}")
Download Dataset Split Files for Training
from kwslib import KWSClient, DatasetSplitFilesClient
# Initialize API client
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")
# Initialize files client (uses API, no direct MinIO connection)
files_client = DatasetSplitFilesClient(api)
# List all files in split
files_info = files_client.list_files(split_id=1, file_type="npz")
print(f"Found {files_info['total_files']} files")
# Download all .npz files
files_client.download_all_npz(
split_id=1,
output_dir="features"
)
# Download all .wav files
files_client.download_all_wav(
split_id=1,
output_dir="audio"
)
# Or download as ZIP
files_client.download_all_files_zip(
split_id=1,
file_type="npz",
output_path="features.zip"
)
# Get file metadata (for Google Colab loop/download strategy)
urls = files_client.get_file_urls(split_id=1, file_type="npz")
for file_info in urls["files"]:
print(file_info["file_name"])
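Once the .npz archives are on disk (task 2 above), they can be loaded with NumPy for training. The sketch below is self-contained, so it first writes one synthetic archive in place of a real download; the key name mfcc and the (40, 98) shape are assumptions for illustration, so inspect archive.files to see what the extraction step actually stored.

```python
import numpy as np
from pathlib import Path

# Synthetic stand-in for files produced by download_all_npz(split_id=..., output_dir="features")
out = Path("features")
out.mkdir(exist_ok=True)
np.savez(out / "sample_0001.npz", mfcc=np.random.rand(40, 98).astype(np.float32))

# Load every archive in the output directory and collect its arrays
features = []
for npz_path in sorted(out.glob("*.npz")):
    with np.load(npz_path) as archive:
        # archive.files lists the keys stored by the feature-extraction step
        for key in archive.files:
            features.append(archive[key])

print(len(features), features[0].shape)
```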
With Telegram Notifications
from kwslib import KWSClient, TelegramNotifier
# Initialize
client = KWSClient(base_url="http://localhost:8000")
client.login(username="admin", password="password")
notifier = TelegramNotifier(
bot_token="YOUR_BOT_TOKEN",
chat_id="YOUR_CHAT_ID"
)
# Create experiment run (triggers background job)
run = client.experiments.create_run(
experiment_id=1,
model_id=1,
dataset_split_id=1,
config={"learning_rate": 0.001, "batch_size": 32},
git_commit="manual-run",
)
# Wait for completion
job_id = run.get("job_id")
status = client.jobs.wait_for_completion(job_id)
# Send notification
if status["status"] == "completed":
notifier.send(f"Training completed! Results: {status['result']}")
else:
notifier.send(f"Training failed: {status.get('error')}")
API Modules
Authentication
- client.auth.login() - Login
- client.auth.logout() - Logout
- client.auth.get_me() - Get current user info
Datasets
- client.datasets.list() - List datasets
- client.datasets.get() - Get dataset
- client.datasets.create() - Create dataset
- client.datasets.update() - Update dataset
- client.datasets.delete() - Delete dataset
- client.datasets.list_versions() - List versions
- client.datasets.create_version() - Create version
Models
- client.models.list() - List models
- client.models.get() - Get model
- client.models.create() - Create model
- client.models.list_model_inits() - List model architectures
Experiments
- client.experiments.list() - List experiments
- client.experiments.create() - Create experiment
- client.experiments.create_run(experiment_id, model_id, dataset_split_id, config, git_commit) - Create a run (background job)
- client.experiments.list_runs(experiment_id) - List runs of one experiment
- client.experiments.list_runs_global(experiment_id=..., model_id=...) - List all runs (with filters)
- client.experiments.get_run(experiment_id, run_id) - Run details
Dataset Splits
- client.dataset_splits.list(dataset_version_id=..., config_name=..., name=...) - List splits
- client.dataset_splits.get(split_id) - Get one split by ID
- client.dataset_splits.create(dataset_version_id, name, config_name) - Create a split record (metadata only)
- client.dataset_splits.create_split_from_list(...) - Create a split from a file list (after splitting with pandas)
- client.dataset_splits.get_mfcc_files(dataset_version_id, ...) - Get the list of MFCC files (DataFrame) to split
- client.dataset_splits.download(split_id, output_path) - Download a split as a ZIP (npz)
- client.dataset_splits.generate(split_id) - Trigger a split-generation job
- client.dataset_splits.list_files(split_id, file_type) - List .wav / .npz files in a split
Metrics (standard: Accuracy, Precision, Recall, F1-Score, Confusion Matrix)
- client.metrics.list(model_id=..., dataset_split_id=...) - List metrics
- client.metrics.get(metric_id) - Metric details
- client.metrics.create(payload=payload) - POST a metric (use the payload from build_metrics_payload or metrics_from_sklearn)
- client.metrics.compare_metrics(model_ids=[1,2,3], split_id=...) - Compare metrics across models
Audio
- client.audio.list_keyword_samples() - List keyword audio
- client.audio.upload_keyword_sample() - Upload audio
- client.audio.get_keyword_sample_url() - Get download URL from API
Features
- client.features.get_keyword_features() - Get features
- client.features.extract_keyword_features() - Extract features
Jobs
- client.jobs.get() - Get job status
- client.jobs.list() - List jobs
- client.jobs.wait_for_completion() - Wait for job completion
Dataset Split Files Client
- files_client.list_files() - List all files in a split
- files_client.download_wav() - Download a .wav file
- files_client.download_npz() - Download and load a .npz file
- files_client.download_all_wav() - Download all .wav files
- files_client.download_all_npz() - Download all .npz files
- files_client.download_all_files_zip() - Download all files as a ZIP
- files_client.get_file_urls() - Get file metadata for all files
Telegram Notifier
- notifier.send() - Send a message
- notifier.send_file() - Send a file
- notifier.send_photo() - Send a photo
Examples
Data splitting (pandas + create_split_from_list)
from kwslib import KWSClient
from create_dataset_split import get_split_data, push_splits
from sklearn.model_selection import train_test_split
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")
# 1. Fetch the list of MFCC feature files
df = get_split_data(api=api, dataset_version_id=48, feature_type_id=2)
# 2. Split train/test (stratified by label)
train_df, test_df = train_test_split(
df, train_size=0.8, test_size=0.2, random_state=42, stratify=df["derivative_label"]
)
# 3. Push to the DB (create splits + assign files)
created = push_splits(
api=api,
dataset_version_id=48,
config_name="config_80_20",
splits={"train": train_df, "test": test_df},
)
# created = {"train": 123, "test": 124}
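The same pattern extends to a three-way train/val/test split by chaining two train_test_split calls. A sketch on a stand-in DataFrame (the derivative_label column name matches the example above; the 70/15/15 ratios only mirror the config_70_15_15 naming used elsewhere in this README and are not a required convention):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the DataFrame returned by get_split_data(); real rows carry file metadata
df = pd.DataFrame({
    "file_name": [f"kw_{i}.npz" for i in range(100)],
    "derivative_label": [i % 2 for i in range(100)],
})

# 70% train, then split the remaining 30% in half -> 15% val / 15% test
train_df, rest_df = train_test_split(
    df, train_size=0.7, random_state=42, stratify=df["derivative_label"]
)
val_df, test_df = train_test_split(
    rest_df, train_size=0.5, random_state=42, stratify=rest_df["derivative_label"]
)

print(len(train_df), len(val_df), len(test_df))
```

The three DataFrames can then be pushed in one call: push_splits(api=api, dataset_version_id=..., config_name="config_70_15_15", splits={"train": train_df, "val": val_df, "test": test_df}).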
Standard metrics (Accuracy, Precision, Recall, F1-Score, Confusion Matrix)
from kwslib import KWSClient, build_metrics_payload, metrics_from_sklearn
import numpy as np
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")
# Option 1: from y_true, y_pred (sklearn)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
payload = metrics_from_sklearn(
y_true, y_pred,
model_id=1, dataset_split_id=1, experiment_run_id=1,
average="weighted",
)
api.metrics.create(payload=payload)
# Option 2: from a dict of precomputed metrics
metrics = {
"accuracy": 0.92,
"precision": 0.91,
"recall": 0.90,
"f1_score": 0.905,
    "confusion_matrix": [[50, 2], [3, 45]],  # 2D list of ints
}
payload = build_metrics_payload(
model_id=1, dataset_split_id=1, experiment_run_id=1,
metrics=metrics,
)
api.metrics.create(payload=payload)
# Compare several models on one split
comparison = api.metrics.compare_metrics(model_ids=[1, 2, 3], split_id=1)
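For reference, the five required fields can also be computed directly with scikit-learn. This is presumably close to what metrics_from_sklearn does internally (an assumption, not confirmed by this README), and the resulting dict has exactly the shape build_metrics_payload expects:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# The five required fields of the metrics standard
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="weighted"),
    "recall": recall_score(y_true, y_pred, average="weighted"),
    "f1_score": f1_score(y_true, y_pred, average="weighted"),
    # rows = true labels, cols = predicted labels; .tolist() gives a 2D list of ints
    "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
}
print(metrics["accuracy"], metrics["confusion_matrix"])
```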
Complete split/download/upload workflow (summary)
# 1. Create a split (metadata only), or use create_split_from_list after splitting with pandas
split = api.dataset_splits.create(dataset_version_id=1, name="train", config_name="config_70_15_15")
# Or: push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df})
# 2. Generate the split (job) if needed
# job = api.dataset_splits.generate(split_id)
# 3. Tạo experiment run (background job)
run = api.experiments.create_run(
experiment_id=1,
model_id=1,
dataset_split_id=split_id,
config={"learning_rate": 0.001, "batch_size": 32},
git_commit="manual-run",
)
# 4. After training, POST metrics (standard: accuracy, precision, recall, f1_score, confusion_matrix)
# run_id = ID of the experiment run (take it from list_runs after the create_run job completes)
payload = metrics_from_sklearn(y_true, y_pred, model_id=1, dataset_split_id=split_id, experiment_run_id=run_id)
api.metrics.create(payload=payload)
Google Colab Usage
# In Google Colab, iterate file metadata then call API download endpoints
from kwslib import KWSClient, DatasetSplitFilesClient
api = KWSClient(base_url="https://your-api.com")
api.login(username="admin", password="password")
files_client = DatasetSplitFilesClient(api)
# Get file metadata
urls = files_client.get_file_urls(split_id=1, file_type="npz")
# Download in Colab
import urllib.request
for file_info in urls["files"]:
urllib.request.urlretrieve(
file_info["url"],
f"/content/{file_info['file_name']}"
)
Configuration
Environment Variables
You can set default values using environment variables:
export KWS_BASE_URL="http://localhost:8000"
export KWS_USERNAME="admin"
export KWS_PASSWORD="password"
export MINIO_ENDPOINT="localhost:9000"
export MINIO_ACCESS_KEY="minioadmin"
export MINIO_SECRET_KEY="minioadmin"
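The smoke module reads KWS_SERVER_URL (shown earlier), but whether KWSClient picks up the variables above automatically is not stated here; a safe pattern is to wire them up explicitly. A sketch (the KWSClient lines are commented out so it runs without a server):

```python
import os

# Read connection settings with fallbacks matching the exports above;
# whether kwslib reads these variables itself is an assumption not made here.
base_url = os.environ.get("KWS_BASE_URL", "http://localhost:8000")
username = os.environ.get("KWS_USERNAME", "admin")
password = os.environ.get("KWS_PASSWORD", "password")

# from kwslib import KWSClient
# client = KWSClient(base_url=base_url)
# client.login(username=username, password=password)
print(base_url)
```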
Publishing to PyPI
pip install build twine
python -m build
twine upload dist/*
Make sure to bump the version in pyproject.toml before building.
License
MIT License
Contributing
Contributions are welcome! Please open an issue or submit a pull request.