A helpful script to optimize a Pandas DataFrame.
Project description
pd-helper
A helpful package to streamline Pandas DataFrame optimization.
Save 50-75% on DataFrame memory usage by running the optimizer.
Autoconfigure dtypes for appropriate data types in each column with helper.
Generate a random DataFrame of controlled random variables for testing with maker.
Install
pip install pd-helper
Basic Usage to Iterate over DataFrame
from pd_helper.maker import MakeData
from pd_helper.helper import optimize
faker = MakeData()
if __name__ == "__main__":
# MakeData() generates a fake dataframe, convenient for testing
df = faker.make_df()
df = optimize(df)
Better Usage With Multiprocessing
from pd_helper.maker import MakeData
from pd_helper.helper import optimize
faker = MakeData()
if __name__ == "__main__":
# MakeData() generates a fake dataframe, convenient for testing
df = faker.make_df()
df = optimize(df, enable_mp=True)
Specify Special Mappings
from pd_helper.maker import MakeData
from pd_helper.helper import optimize
faker = MakeData()
if __name__ == "__main__":
# MakeData() generates a fake dataframe, convenient for testing
df = faker.make_df()
special_mappings = {'string': ['object_id'],
'category': ['item_name']}
# special mappings will be applied instead of by optimize ruleset, they will be returned.
df = optimize(df
, enable_mp=True,
special_mappings=special_mappings
)
Sample Results with Helper
Starting with 175.63 MB memory.
After optmization.
Ending with 65.33 MB memory.
Generating a Randomly Imperfect DataFrame with Maker
Maker provides a class, MakeData(), to generate a table of made-up records.
Each row is an event where an item was retrieved.
Options to make the table imperfectly random in various ways.
Sample table below:
Retrieved Date | Item Name | Retrieved | Condition | Sector | |
---|---|---|---|---|---|
Example | 2019-01-01, 2019-03-4 | Toaster, Lighter | True, False | Junk, Excellent | 1, 2 |
Data Type | String | String | String | String | Integer |
References
-
Pandas Categorical: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html
-
Pandas Pickle: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html
-
Pandas CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
-
Pandas Datetime: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
TODO
-
Improve efficiency of iterating on DataFrame.
-
Allow user to toggle logging.
-
Provide tools for imputing missing data.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pd_helper-1.0.0-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | ffd42252fa4c1f2c69d43567bba2e3526910d01882779c34c674302c1f3ce657 |
|
MD5 | 59e12406d6a08ea6f7b73a3f088e8640 |
|
BLAKE2b-256 | 9e23e71854e166a8a70f9918ab5fb9a7eecee5b1e954bc6d6165f8bbb7ff7c07 |