LLM-powered estimators for scikit-learn pipelines

promptlearn

promptlearn brings large language models into your scikit-learn workflow. It inspects your data, reasons about what the inputs and outputs mean, and identifies relevant knowledge of the world. It then generates standalone, executable Python code that augments the relationships in the original data with materialized world knowledge about categorical variables.

📊 Outperforming Traditional Models with Built-In Knowledge

Consider a simple binary classification task: predicting whether an animal is a mammal from features such as its name, weight, and lifespan (python examples/quickstart.py --demo compare --dataset mammal).

Traditional models depend solely on the input features. But promptlearn models can use their internal understanding of zoology to form highly accurate rules, pulling in data about known mammals, and making that knowledge available in explicit reference tables for subsequent predictions.

model	accuracy (higher is better)	fit_time_sec	predict_time_sec
promptlearn_o3-mini	0.94	49.11	0.0028
promptlearn_o4-mini	0.86	60.96	0.0024
promptlearn_gpt-3.5-turbo	0.66	20.25	0.0027
promptlearn_gpt-4o	0.66	43.93	0.0023
logistic_regression	0.60	0.02	0.0010
decision_tree	0.53	0.0014	0.0005
gradient_boosting	0.53	0.02	0.0011
promptlearn_gpt-4	0.40	12.49	0.0022
dummy	0.34	0.0006	0.0001
random_forest	0.28	0.01	0.0017

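This type of semantic generalization is a powerful advantage for LLM-backed models: the best promptlearn runs (o3-mini at 0.94, o4-mini at 0.86) beat every classical baseline. Results do depend on the underlying LLM, though; promptlearn_gpt-4 scored 0.40, below logistic regression.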
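As a rough illustration of what an explicit reference table can look like in generated code, here is a minimal hand-written sketch. The table contents, the predict signature, and the fallback rule are all assumptions made for illustration; the code promptlearn actually generates for the mammal dataset is not shown here and may look quite different.

```python
# Hypothetical sketch of prediction code with a materialized reference table.
# All names and table contents are illustrative assumptions.

# World knowledge written out as plain data, so prediction needs no LLM call.
KNOWN_MAMMALS = {"dog", "cat", "whale", "bat", "elephant", "dolphin"}
KNOWN_NON_MAMMALS = {"eagle", "salmon", "frog", "python", "shark"}


def predict(name: str, weight_kg: float = 0.0, lifespan_years: float = 0.0) -> int:
    """Return 1 if the animal is judged to be a mammal, otherwise 0.

    weight_kg and lifespan_years mirror the dataset's columns but are
    unused in this sketch.
    """
    key = name.strip().lower()
    if key in KNOWN_MAMMALS:
        return 1
    if key in KNOWN_NON_MAMMALS:
        return 0
    # Unknown animal: default to "not a mammal" (an arbitrary choice for this sketch).
    return 0


rows = [
    {"name": "Whale", "weight_kg": 30000, "lifespan_years": 70},
    {"name": "Salmon", "weight_kg": 4, "lifespan_years": 5},
]
predictions = [predict(**row) for row in rows]
print(predictions)
```

Because the lookup is ordinary Python data, prediction is a dictionary-speed operation, which is consistent with the millisecond-scale predict times in the table above.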

Now compare performance on a regression task where the data contains samples of objects falling from different heights, under different gravity (python examples/quickstart.py --demo compare --dataset fall). This is a classic physics problem, with a well-known equation:

fall_time_s = sqrt((2 * height_m) / gravity_mps2)

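With the stronger models (gpt-4o, o3-mini, o4-mini), promptlearn estimators recover this exact formula from the dataframe alone and use it to make predictions with an MSE of 0.000 to three decimals. The gpt-3.5-turbo and gpt-4 runs did not recover it and scored worse than the dummy baseline: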

model	mse (lower is better)	fit_time_sec	predict_time_sec
promptlearn_gpt-4o	0.000	2.92	0.001
promptlearn_o3-mini	0.000	10.80	0.001
promptlearn_o4-mini	0.000	7.96	0.001
random_forest	0.028	0.01	0.002
gradient_boosting	0.035	0.01	0.001
decision_tree	0.067	0.001	0.000
linear_regression	0.498	0.001	0.000
dummy	5.273	0.001	0.000
promptlearn_gpt-3.5-turbo	18.193	3.01	0.002
promptlearn_gpt-4	855.445	2.43	0.001

No feature engineering was performed. No physics constants were added. The model discovered the rule and applied it directly. Classical regressors, by contrast, approximated a curve but missed the exact structure.
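To make the recovered rule concrete, the formula above can be written as a small standalone function. This is a sketch of the equation only, using the column names from the formula; it is not the code that promptlearn emitted, and the gravity values in the usage loop are approximate figures chosen for illustration.

```python
import math


def fall_time_s(height_m: float, gravity_mps2: float) -> float:
    """Time for an object dropped from rest to fall height_m, ignoring air resistance."""
    return math.sqrt((2 * height_m) / gravity_mps2)


# Same drop height under two different (approximate) surface gravities.
for label, gravity in [("Earth", 9.81), ("Moon", 1.62)]:
    print(label, round(fall_time_s(20.0, gravity), 3))
```

A tree ensemble can only approximate this square-root surface piecewise, which is why the classical regressors above land near, but not at, zero error.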

These results highlight the practical benefit of reasoning models: they learn compact, expressive heuristics and can outperform traditional systems when symbolic insight or background knowledge is essential.

🤖 Estimators Powered by Language

promptlearn provides scikit-learn-compatible estimators that use LLMs as the modeling engine:

PromptClassifier – for predicting classes through generalized reasoning
PromptRegressor – for modeling numeric relationships in data
PromptFeatureEngineer – a transformer that derives new, world-knowledge-rich features for a downstream classical model

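PromptClassifier and PromptRegressor follow the same API as other scikit-learn models (fit, predict, score), and PromptFeatureEngineer follows the transformer API (fit, transform). Under the hood they operate via dynamic prompt construction and few-shot abstraction.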

🧪 LLM Feature Engineering (PromptFeatureEngineer)

PromptFeatureEngineer is a scikit-learn transformer. At fit time it asks the LLM to write a standalone transform() function that derives new features from semantically meaningful columns (e.g. mapping a country to its GDP tier, parsing a date into is_weekend, bucketing ages). At transform time it simply runs that generated code, with no per-row LLM calls, and appends the engineered columns, so it drops straight into a Pipeline in front of any classical model:

from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from promptlearn import PromptFeatureEngineer

# PromptFeatureEngineer appends engineered columns to the original frame, so a
# downstream linear model still wants categoricals one-hot encoded.
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_exclude="number"))],
    remainder="passthrough",
)
pipe = Pipeline([
    ("features", PromptFeatureEngineer()),  # LLM-generated feature code
    ("encode", encode),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

This lets a fast, interpretable linear model benefit from the LLM's world knowledge while keeping inference cheap and serializable.
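For a sense of what a generated transform() might contain, here is a hand-written sketch covering the three feature types mentioned above. Everything in it is an assumption for illustration: the column names, the tier table (which is not real GDP data), and the age cut-offs. It also works on plain lists of dicts to stay dependency-free, whereas the real transformer appends columns to a dataframe.

```python
from datetime import date

# Hypothetical reference table; the tiers are illustrative, not real GDP data.
GDP_TIER = {"Norway": "high", "Brazil": "middle", "Ethiopia": "low"}


def transform(rows):
    """Return new rows with engineered columns appended to the originals."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the original columns are kept untouched
        # Categorical lookup backed by the reference table above.
        new_row["gdp_tier"] = GDP_TIER.get(row["country"], "unknown")
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        signup = date.fromisoformat(row["signup_date"])
        new_row["is_weekend"] = int(signup.weekday() >= 5)
        # Coarse age buckets (cut-offs are arbitrary for this sketch).
        age = row["age"]
        new_row["age_bucket"] = "minor" if age < 18 else "adult" if age < 65 else "senior"
        out.append(new_row)
    return out


rows = [
    {"country": "Norway", "signup_date": "2024-06-01", "age": 34},
    {"country": "Atlantis", "signup_date": "2024-06-03", "age": 70},
]
for engineered in transform(rows):
    print(engineered)
```

Note the explicit "unknown" fallback for unseen categories: since the function runs without the LLM at transform time, any value missing from its tables has to be handled by the code itself.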

🚀 Try It

Everything runnable lives in a single guided tour, examples/quickstart.py — a menu of self-contained demos. Each makes live LLM calls, so run them one at a time:

python examples/quickstart.py --list             # see all the demos
python examples/quickstart.py --demo zero_row     # fit on column names only
python examples/quickstart.py --demo titanic --dump artifacts/   # deep tour: generated code, explain(), joblib
python examples/quickstart.py --demo compare --dataset mammal    # promptlearn vs sklearn/XGBoost

The demos cover zero-row fitting, .sample(), joblib round-tripping, world-knowledge reasoning, linear/nonlinear/multi-output regression, XOR, GridSearchCV tuning, a large real OpenML dataset, the side-by-side model compare, and the deep titanic walkthrough (generated predict() code, explain(), and artifact dumping).

The compare demo is powered by the reusable promptlearn.compare_models(models, X_train, y_train, X_test, y_test) helper, which works with any mix of promptlearn and sklearn/XGBoost estimators.
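The sketch below shows roughly what a helper of this shape does: fit each model, time fit and predict, and score on the test split. It is not the library's implementation. Only the five positional arguments come from the signature quoted above; the function name compare_models_sketch, the assumption that models is a dict of name to estimator, the returned row format, and the two stand-in classifiers are all hypothetical.

```python
import time


class MajorityClassifier:
    """Tiny stand-in estimator with a scikit-learn-style fit/predict interface."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)  # most frequent training label
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClassifier:
    """Stand-in estimator: predicts 1 when the first feature is at least 18."""

    def fit(self, X, y):
        return self  # nothing to learn; the rule is fixed

    def predict(self, X):
        return [int(row[0] >= 18) for row in X]


def compare_models_sketch(models, X_train, y_train, X_test, y_test):
    """Fit and score each model; return result rows sorted by accuracy, best first."""
    results = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_time = time.perf_counter() - start

        start = time.perf_counter()
        predictions = model.predict(X_test)
        predict_time = time.perf_counter() - start

        correct = sum(p == t for p, t in zip(predictions, y_test))
        results.append({
            "model": name,
            "accuracy": correct / len(y_test),
            "fit_time_sec": fit_time,
            "predict_time_sec": predict_time,
        })
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)


X_train, y_train = [[10], [12], [15], [30]], [0, 0, 0, 1]
X_test, y_test = [[5], [20], [40], [17]], [0, 1, 1, 0]
models = {"majority": MajorityClassifier(), "threshold": ThresholdClassifier()}
results = compare_models_sketch(models, X_train, y_train, X_test, y_test)
for row in results:
    print(row["model"], row["accuracy"])
```

Because the loop relies only on fit and predict, any estimator that follows the scikit-learn interface can be dropped into the same comparison.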

📈 Benchmark: feature engineering across 10 OpenML datasets

Accuracy on a held-out test split for 10 OpenML classification datasets with semantically meaningful categoricals. promptlearn+FE is PromptFeatureEngineer → one-hot → LogisticRegression; the promptlearn contenders use gpt-5.4-mini. Reproduce with benchmarks/run_openml_benchmark.py.

dataset	promptlearn	promptlearn+FE	logreg	xgboost
adult	0.782	0.858	0.864	0.850
credit-g	0.680	0.748	0.724	0.728
bank-marketing	0.678	0.876	0.868	0.878
mushroom	0.970	1.000	1.000	1.000
car	0.417	0.963	0.910	0.988
nursery	0.492	0.944	0.932	0.974
vote	0.193	0.945	0.954	0.982
tic-tac-toe	0.979	1.000	0.979	0.983
kr-vs-kp	0.598	0.966	0.964	0.992
monks-2	0.616	0.623	0.583	0.874
mean	0.640	0.892	0.878	0.925

Takeaway: adding PromptFeatureEngineer in front of a plain logistic regression lifts mean accuracy from 0.878 to 0.892. It beats plain logistic regression on 7 of the 10 datasets (tying on mushroom and trailing slightly on adult and vote) and even edges out XGBoost on adult, credit-g, and tic-tac-toe, while keeping a fully interpretable linear model and cheap inference. XGBoost still has the highest mean accuracy overall (0.925). Using the LLM as a direct classifier (promptlearn alone) is weaker here: when the target's class encoding is arbitrary (e.g. party labels in vote), direct prediction has no semantic foothold, which is exactly the gap feature engineering closes.
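The comparisons in the takeaway can be checked directly against the table. The snippet below copies the accuracy values from the benchmark table and tallies them; it uses no data beyond what the table shows.

```python
# Accuracy values copied from the benchmark table above.
# Tuple order: (promptlearn, promptlearn+FE, logreg, xgboost)
RESULTS = {
    "adult": (0.782, 0.858, 0.864, 0.850),
    "credit-g": (0.680, 0.748, 0.724, 0.728),
    "bank-marketing": (0.678, 0.876, 0.868, 0.878),
    "mushroom": (0.970, 1.000, 1.000, 1.000),
    "car": (0.417, 0.963, 0.910, 0.988),
    "nursery": (0.492, 0.944, 0.932, 0.974),
    "vote": (0.193, 0.945, 0.954, 0.982),
    "tic-tac-toe": (0.979, 1.000, 0.979, 0.983),
    "kr-vs-kp": (0.598, 0.966, 0.964, 0.992),
    "monks-2": (0.616, 0.623, 0.583, 0.874),
}

# Head-to-head tally: promptlearn+FE against plain logistic regression.
fe_vs_logreg = {"win": 0, "tie": 0, "loss": 0}
for _direct, fe, logreg, _xgb in RESULTS.values():
    outcome = "win" if fe > logreg else "tie" if fe == logreg else "loss"
    fe_vs_logreg[outcome] += 1

# Datasets where promptlearn+FE scores strictly higher than XGBoost.
beats_xgboost = [name for name, (_d, fe, _l, xgb) in RESULTS.items() if fe > xgb]

# Column means, rounded to the table's three decimals.
mean_fe = round(sum(v[1] for v in RESULTS.values()) / len(RESULTS), 3)
mean_logreg = round(sum(v[2] for v in RESULTS.values()) / len(RESULTS), 3)

print(fe_vs_logreg)
print(beats_xgboost)
print(mean_fe, mean_logreg)
```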

🔌 Choose Your Provider

The LLM provider is selected by the model string and resolved via LiteLLM, so you are not locked into OpenAI:

PromptClassifier(model="gpt-5.5")            # OpenAI (the default)
PromptClassifier(model="claude-sonnet-4-6")  # Anthropic
PromptClassifier(model="ollama:llama3.1")    # local Ollama

API keys are read from the usual per-provider environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, …); local providers like Ollama need none.

To change the default model without touching code, set PROMPTLEARN_MODEL (e.g. export PROMPTLEARN_MODEL=gpt-5.4-mini for faster, cheaper runs). An explicit model= argument always takes precedence.
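The precedence described above (explicit argument, then PROMPTLEARN_MODEL, then the built-in default) can be sketched as follows. The function resolve_model and its environ parameter are hypothetical names used only for this illustration, and the default string is taken from the provider example above; promptlearn's internal resolution logic may differ.

```python
import os

DEFAULT_MODEL = "gpt-5.5"  # the default named in the provider examples above


def resolve_model(model=None, environ=None):
    """Pick a model string: explicit argument, then PROMPTLEARN_MODEL, then the default."""
    environ = os.environ if environ is None else environ
    if model:
        return model  # an explicit model= always wins
    return environ.get("PROMPTLEARN_MODEL") or DEFAULT_MODEL


# environ is passed explicitly here so the example does not depend on your shell.
print(resolve_model(environ={}))
print(resolve_model(environ={"PROMPTLEARN_MODEL": "gpt-5.4-mini"}))
print(resolve_model("claude-sonnet-4-6", environ={"PROMPTLEARN_MODEL": "gpt-5.4-mini"}))
```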

🕳 Zero-Example Learning

If you call .fit() with no rows — just column names — promptlearn will still return a working model.

This is possible because the LLM can hallucinate a plausible mapping based on:

Column names
Prior knowledge
Type hints or value patterns

This makes rapid prototyping and conceptual modeling trivial.

🧪 Native .sample() Support

You can generate synthetic rows directly from any trained model using .sample(n):

>>> model.sample(3)
fruit    is_citrus
Lime     1
Banana   0
Orange   1

This is useful for:

Understanding what the model believes
Creating test sets or bootstrapped data
Building readable examples from internal logic

🔎 Explain the Learned Rule

Call .explain() to get a plain-English description of the heuristic the model learned — useful for interpretability reporting:

>>> explanation = model.explain()
>>> print(explanation)
Predicts 1 (adult) when `age` is at least 18, otherwise 0.

>>> explanation.features_used
['age']

explain() returns an Explanation object with meta and data dicts (keys also reachable as attributes) that is JSON round-trippable via to_json() / Explanation.from_json(...). A bare explain() describes the whole model (global, and cached so it's deterministic); passing a single row, explain(X), describes that one prediction (local).
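To illustrate the shape described above (meta and data dicts, keys reachable as attributes, JSON round trip), here is a small stand-in class. ExplanationSketch is hypothetical and is not promptlearn's Explanation; the key names "scope" and "text" are assumptions, and only features_used, to_json, and from_json are taken from the text above.

```python
import json


class ExplanationSketch:
    """Hypothetical stand-in: meta and data dicts, attribute access, JSON round trip."""

    def __init__(self, meta, data):
        self.meta = dict(meta)
        self.data = dict(data)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: check data, then meta.
        for source in ("data", "meta"):
            values = self.__dict__.get(source, {})
            if name in values:
                return values[name]
        raise AttributeError(name)

    def __str__(self):
        return self.data.get("text", "")

    def to_json(self):
        return json.dumps({"meta": self.meta, "data": self.data})

    @classmethod
    def from_json(cls, payload):
        obj = json.loads(payload)
        return cls(obj["meta"], obj["data"])


explanation = ExplanationSketch(
    meta={"scope": "global"},
    data={
        "text": "Predicts 1 (adult) when `age` is at least 18, otherwise 0.",
        "features_used": ["age"],
    },
)
restored = ExplanationSketch.from_json(explanation.to_json())
print(restored)
print(restored.features_used)
```

Keeping the payload to plain dicts, lists, and strings is what makes the JSON round trip lossless in this sketch.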

💾 Save and Reload with joblib

Like any scikit-learn model, promptlearn estimators can be serialized:

import joblib

joblib.dump(model, "model.joblib")
model = joblib.load("model.joblib")

The compiled prediction function is excluded from the saved file and recompiled on load. The heuristic remains intact, interpretable, and ready to use.
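One common way to get this behavior in Python is to drop the compiled function in __getstate__ and rebuild it from the stored source in __setstate__, the hooks that pickle (and therefore joblib) call when saving and loading. The sketch below assumes that approach; RuleModelSketch and its attribute names are hypothetical, and promptlearn's actual mechanism is not shown in this document.

```python
import pickle


class RuleModelSketch:
    """Hypothetical model that keeps its rule as source text plus a compiled function."""

    def __init__(self, source):
        self.source = source
        self._predict_fn = self._compile(source)

    @staticmethod
    def _compile(source):
        # Execute the stored source and pull out its predict() function.
        # Only do this with source you trust.
        namespace = {}
        exec(source, namespace)
        return namespace["predict"]

    def predict(self, rows):
        return [self._predict_fn(**row) for row in rows]

    def __getstate__(self):
        # Called when saving: keep the source, drop the compiled function.
        state = self.__dict__.copy()
        state.pop("_predict_fn", None)
        return state

    def __setstate__(self, state):
        # Called when loading: restore attributes, then recompile from source.
        self.__dict__.update(state)
        self._predict_fn = self._compile(self.source)


SOURCE = "def predict(age, **_):\n    return 1 if age >= 18 else 0\n"
model = RuleModelSketch(SOURCE)

# Drive the hooks by hand to show exactly what would be written to disk.
saved_state = model.__getstate__()
saved_bytes = pickle.dumps(saved_state)

restored = RuleModelSketch.__new__(RuleModelSketch)
restored.__setstate__(pickle.loads(saved_bytes))
print(sorted(saved_state))
print(restored.predict([{"age": 30}, {"age": 12}]))
```

Storing the rule as text rather than as a function object keeps the saved file small and human-readable, at the cost of one recompilation per load.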

📚 Related Work

Scikit-LLM

Scikit-LLM provides zero- and few-shot classification through template-based prompting.
It is lightweight and NLP-focused.

promptlearn offers a broader modeling philosophy:

Capability	Scikit-LLM	promptlearn
Produces runnable Python code	❌ No	✅ Yes
Regression support	❌ No	✅ Yes

🛠 Development

Install the dev dependencies and enable the git hooks:

pip install -r requirements-dev.txt
pre-commit install

The pre-commit hooks run black and the full test suite, and both must pass before a commit is allowed. Note that the test suite makes live LLM calls, so it needs a provider API key (e.g. OPENAI_API_KEY). In an emergency, bypass the hooks with git commit --no-verify.

📁 License

Release history

version	release date
0.5.0 (this version)	Jun 24, 2026
0.4.1	Jun 24, 2026
0.3.0	Jul 13, 2025
0.2.3	Jul 12, 2025
0.2.2	Jul 10, 2025
0.2.1	Jul 10, 2025
0.2.0	Jul 9, 2025
0.1.3	Jul 8, 2025
0.1.2	Jul 8, 2025
0.1.1	Jul 6, 2025
0.1.0	Jul 4, 2025
