Project description
Gyyre - Context-Aware Semantic Operators for Machine Learning Pipelines
Gyyre is a research project that extends Python-based machine learning scripts with semantic operators. It relies heavily on the awesome work of the skrub project!
Semantic Operators
- sem_choose(nl_prompt) -- a semantic drop-in alternative for skrub's choose_from to suggest hyperparameter ranges and other pipeline components
- sem_fillna(target_column, nl_prompt: str, impute_with_existing_values_only) -- missing value imputation
- with_sem_features(nl_prompt, how_many) -- automated generation of additional feature columns in dataframes
- sem_select(nl_prompt) -- a semantic drop-in alternative for skrub's selectors to select columns from dataframes
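The name of the impute_with_existing_values_only flag suggests that imputed values are restricted to categories already present in the column. The sketch below illustrates that reading in plain Python. It is an assumption about the intent, not Gyyre's implementation: fill_missing and propose are hypothetical names, and propose stands in for whatever model produces a candidate value.

```python
from typing import Callable, List, Optional, Sequence


def fill_missing(
    values: Sequence[Optional[str]],
    propose: Callable[[int], str],
    existing_values_only: bool = True,
) -> List[Optional[str]]:
    """Fill None entries with proposed values (hypothetical helper).

    When existing_values_only is True, a proposal is accepted only if the
    same value already occurs somewhere in the column.
    """
    observed = {v for v in values if v is not None}
    filled: List[Optional[str]] = []
    for index, value in enumerate(values):
        if value is not None:
            filled.append(value)
            continue
        candidate = propose(index)
        if existing_values_only and candidate not in observed:
            # Reject a category the column has never contained; stay missing.
            filled.append(None)
        else:
            filled.append(candidate)
    return filled


# Toy data: two missing manufacturers and two proposals, one of them unseen.
makes = ["Apple", None, "Samsung", None]
proposals = {1: "Samsung", 3: "Acme"}
result = fill_missing(makes, lambda i: proposals[i])
```

Here the proposal "Samsung" is accepted because it already occurs in the column, while "Acme" is rejected and the entry stays missing. Restricting imputation this way keeps the set of categories stable for downstream encoders.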
Example
import gyyre
import skrub
from sklearn.ensemble import HistGradientBoostingClassifier
from gyyre import sem_choose
dataset = skrub.datasets.fetch_credit_fraud()
products = skrub.var("products", dataset.products)
baskets = skrub.var("baskets", dataset.baskets)
baskets = baskets.skb.subsample(n=5000, how="random")
basket_ids = baskets[["ID"]].skb.mark_as_X()
fraud_flags = baskets["fraud_flag"].skb.mark_as_y()
# Impute missing values in your data
products = products.sem_fillna(
    target_column="make",
    nl_prompt="Infer the manufacturer from relevant product-related attributes like title or description.",
    impute_with_existing_values_only=True,
)
kept_products = products[products["basket_ID"].isin(basket_ids["ID"])]
# Generate new features for the model to train
kept_products = kept_products.with_sem_features(
    nl_prompt="""
        Generate additional brand- and manufacturer-related product features. Make sure that they can be
        efficiently computed on large datasets, and that they work across a large number of brands and
        manufacturers. Use your intrinsic knowledge about what products and brands fraudsters focus on
        to make sure that the new features are helpful for the prediction task at hand.
    """,
    name="brand_features",
    how_many=5,
)
vectorizer = skrub.TableVectorizer()
vectorized_products = kept_products.skb.apply_with_sem_choose(
    vectorizer,
    exclude_cols="basket_ID",
    # Choose encoders for your data
    choices=sem_choose(
        high_cardinality="""
            A fast encoder for messy columns with potentially invalid data that can scale to many unique
            values, can handle missing values and that outputs a pandas DataFrame as result.
        """
    ),
)
aggregated_products = vectorized_products.groupby("basket_ID").agg("mean").reset_index()
augmented_baskets = basket_ids.merge(aggregated_products, left_on="ID", right_on="basket_ID").drop(
    columns=["ID", "basket_ID"]
)
hgb = HistGradientBoostingClassifier()
fraud_detector = augmented_baskets.skb.apply_with_sem_choose(
    hgb,
    y=fraud_flags,
    # Get suggestions for hyperparameters
    choices=sem_choose(learning_rate="A range of reasonable learning rates to try"),
)
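The aggregation and join near the end of the example (groupby, mean, merge, drop) can be hard to picture without running the pipeline. The following sketch reproduces what those two steps compute on toy data in plain Python, so it runs without skrub or a language model. The rows and column names (price, is_phone) are made up for illustration and do not come from the credit fraud dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-ins for the vectorized product rows and the basket IDs.
product_rows = [
    {"basket_ID": 1, "price": 10.0, "is_phone": 1.0},
    {"basket_ID": 1, "price": 30.0, "is_phone": 0.0},
    {"basket_ID": 2, "price": 5.0, "is_phone": 0.0},
]
basket_ids = [1, 2, 3]

# Counterpart of groupby("basket_ID").agg("mean"):
# collect the rows of each basket, then average every feature column.
grouped = defaultdict(list)
for row in product_rows:
    grouped[row["basket_ID"]].append(row)

aggregated = {
    basket_id: {
        column: mean(r[column] for r in rows)
        for column in rows[0]
        if column != "basket_ID"
    }
    for basket_id, rows in grouped.items()
}

# Counterpart of the merge (an inner join by default in pandas) followed by
# dropping the ID columns: baskets without product rows are left out.
augmented = [aggregated[b] for b in basket_ids if b in aggregated]
```

Each basket ends up as a single feature row, which is the shape the classifier expects. Basket 3 has no products in the toy data, so the inner join drops it.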
Download files
Source Distribution
gyyre-0.0.1.tar.gz (16.2 kB)
Built Distribution
gyyre-0.0.1-py3-none-any.whl (19.9 kB)
File details
Details for the file gyyre-0.0.1.tar.gz.
File metadata
- Download URL: gyyre-0.0.1.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.9.19 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dfc6cc1f97fe531133afad5e98dbd2b12b119704f23027e9ba16dc634c992595 |
| MD5 | a61c83a111645e41db90b446cf571bba |
| BLAKE2b-256 | ba6f92d969c9f278015605fc90e4cfcce29afe7b0b123ff6b11425849ee56cb6 |
File details
Details for the file gyyre-0.0.1-py3-none-any.whl.
File metadata
- Download URL: gyyre-0.0.1-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.9.19 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f6bf543980e87eee65fde36fdea20fd4fb93edbcd403f40d3c62d9ba1779ac1 |
| MD5 | a9bd9a482d4e8e1ba0c0808104f40390 |
| BLAKE2b-256 | 06e9675684aa98cad0ffb16197cab40a70d7a86d6f148181fc0c6eb1ca6303b1 |
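The published digests can be used to check a downloaded file before installing it. The sketch below uses only Python's standard hashlib module. The helper names sha256_of and matches_published_digest are hypothetical, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_digest(Path("gyyre-0.0.1.tar.gz"), "dfc6cc1f97fe531133afad5e98dbd2b12b119704f23027e9ba16dc634c992595") should return True for an intact copy of the source distribution.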