RandomizedSearchCV/GridSearchCV with pandas.DataFrame interface
sklearn-cv-pandas
Why do I want this?
- I usually prepare features as pandas.DataFrame.
- Scikit-learn input should be array-like (https://scikit-learn.org/stable/glossary.html#term-array-like).
- Although array-like includes pandas.DataFrame, there are some issues:
  - It does not support the Int64 data type.
  - The output model does not remember which columns should be used.
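The Int64 issue can be illustrated with a small sketch. It assumes a column stored with pandas' nullable Int64 dtype that contains a missing value, and shows one possible workaround: casting to float64 before handing the data to scikit-learn. The helper name to_float_array is hypothetical and is not part of sklearn_cv_pandas.

```python
import numpy as np
import pandas as pd


def to_float_array(df, columns):
    """Select `columns` and return them as a float64 numpy array (hypothetical helper)."""
    # astype("float64") turns pd.NA in nullable Int64 columns into np.nan
    return df[columns].astype("float64").to_numpy()


df = pd.DataFrame(
    {
        "x0": pd.array([1, 2, None], dtype="Int64"),  # nullable integer column
        "x1": [0.5, 1.5, 2.5],
    }
)
x = to_float_array(df, ["x0", "x1"])
print(x.dtype)  # float64
```

The missing value becomes NaN, so an estimator or imputer that handles NaN is still needed downstream.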
Solution
- Provide GridSearchCV / RandomizedSearchCV with a pandas.DataFrame interface
- Internally preprocess the DataFrame so that it is applicable for sklearn
- The output of the fit command is now the package's own Model object, which
  - stores column name information
  - provides a pandas.DataFrame interface for prediction
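The idea of a model object that remembers its columns can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the class name ColumnAwareModel and its methods are assumptions made up for this example, and LinearRegression is used only to keep the toy data easy to fit.

```python
import pandas as pd
from sklearn import linear_model


class ColumnAwareModel:
    """Hypothetical wrapper that remembers which columns an estimator was fitted on."""

    def __init__(self, estimator, feature_columns, target_column):
        self.estimator = estimator
        self.feature_columns = list(feature_columns)
        self.target_column = target_column

    def fit(self, df):
        # Convert the selected columns to plain float arrays for scikit-learn
        x = df[self.feature_columns].astype("float64").to_numpy()
        y = df[self.target_column].astype("float64").to_numpy()
        self.estimator.fit(x, y)
        return self

    def predict(self, df):
        # Columns are selected by name, so extra or reordered columns are harmless
        x = df[self.feature_columns].astype("float64").to_numpy()
        return self.estimator.predict(x)


train = pd.DataFrame(
    {
        "x0": [0.0, 1.0, 2.0, 3.0],
        "x1": [1.0, 0.0, 1.0, 0.0],
        "y": [0.0, 2.0, 4.0, 6.0],
    }
)
model = ColumnAwareModel(linear_model.LinearRegression(), ["x0", "x1"], "y").fit(train)

# Prediction works even if the DataFrame columns arrive in a different order
shuffled = train[["y", "x1", "x0"]]
pred = model.predict(shuffled)
```

Because the wrapper selects columns by name, the caller does not need to track column order between training and prediction.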
Installation
pip install sklearn_cv_pandas
Usage
Configure CV object
Instantiate the CV object in the same manner as the original scikit-learn classes.
from scipy import stats
from sklearn import linear_model
from sklearn_cv_pandas import RandomizedSearchCV
estimator = linear_model.Lasso()
param_dist = dict(alpha=stats.loguniform(1e-5, 10))
cv = RandomizedSearchCV(estimator, param_dist, scoring="mean_absolute_error")
fit with pandas.DataFrame
Our CV object has two new methods, fit_holdout_pandas and fit_cv_pandas.
The original fit methods require x and y as numpy.array.
Instead of numpy arrays, you can pass a single pandas.DataFrame together with the column names for x (feature_columns) and the column name of y (target_column).
model = cv.fit_cv_pandas(
df, target_column="y", feature_columns=["x{}".format(i) for i in range(100)], n_fold=5
)
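For comparison, here is a sketch of the manual steps that plain scikit-learn needs for a similar search. It uses sklearn.model_selection.RandomizedSearchCV directly rather than this package. The synthetic DataFrame, the five feature columns, and the n_iter, cv and random_state values are assumptions chosen to keep the example small. Note that scikit-learn itself names this scorer neg_mean_absolute_error.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import linear_model
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the user's DataFrame (assumption for this sketch)
rng = np.random.default_rng(0)
feature_columns = ["x{}".format(i) for i in range(5)]
df = pd.DataFrame(rng.normal(size=(60, 5)), columns=feature_columns)
df["y"] = df["x0"] * 2.0 + rng.normal(scale=0.1, size=60)

# With plain scikit-learn, x and y are extracted by hand
x = df[feature_columns].to_numpy()
y = df["y"].to_numpy()

search = RandomizedSearchCV(
    linear_model.Lasso(),
    dict(alpha=stats.loguniform(1e-5, 10)),
    scoring="neg_mean_absolute_error",
    n_iter=5,
    cv=5,
    random_state=0,
)
search.fit(x, y)

# The caller must also remember which columns to pass at prediction time
pred = search.predict(df[feature_columns].to_numpy())
```

The column bookkeeping in the last step is the part that a DataFrame-aware interface takes over.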
predict with pandas.DataFrame
You can run prediction with the pandas.DataFrame interface as well.
The output of fit_holdout_pandas and fit_cv_pandas stores feature_columns and target_column.
You can just pass a pandas.DataFrame to the predict method.
model.predict(df)