
Less memory usage - more speed

A Python package to reduce the memory usage of pandas DataFrames without changing the underlying data. It speeds up your workflow and reduces the risk of running out of memory.

Installation

pip install a-pandas-ex-less-memory-more-speed

from optimize_df import pd_add_less_memory_more_speed
pd_add_less_memory_more_speed()  # adds the ds_reduce_memory_size() method to pandas DataFrames
import pandas as pd

df = pd.read_csv(
    "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv",
)
df.ds_reduce_memory_size()
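
To double-check the effect with plain pandas, df.memory_usage(deep=True) reports the per-column byte counts. A minimal sketch, assuming ds_reduce_memory_size() returns the optimized frame (the functional optimize_dtypes used below does):

import pandas as pd
from optimize_df import pd_add_less_memory_more_speed

pd_add_less_memory_more_speed()
df = pd.read_csv(
    "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv",
)

before_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
df_small = df.ds_reduce_memory_size()  # assumption: the optimized frame is returned
after_mb = df_small.memory_usage(deep=True).sum() / 1024 ** 2
print(f"{before_mb:.3f} MB -> {after_mb:.3f} MB")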

Usage

df = pd.read_csv(
    "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv",
)
from random import choice

# Let's add some more data types
truefalse = lambda: choice([True, False])
df['truefalse'] = [truefalse() for x in range(len(df))]

df['onlynan'] = pd.NA

df['nestedlists'] = [[[1]*10]] * len(df)

mixedstuff = lambda: choice([True, False, 'right', 'wrong', 1, 2, 23, 343.555, 23.444, [442, 553, 44], [], ''])
df['mixedstuff'] = [mixedstuff() for x in range(len(df))]

floatnumbers = lambda: choice([33.44, 344.42424265, 15.0, 3222.33])
df['floatnumbers'] = [floatnumbers() for x in range(len(df))]

floatnumbers0 = lambda: choice([33.0, 344.0, 15.0, 3222.0])
df['floatnumbers0'] = [floatnumbers0() for x in range(len(df))]

intwithnan = lambda: choice([1, 2, 3, 4, 5, pd.NA])
df['intwithnan'] = [intwithnan() for x in range(len(df))]


from optimize_df import optimize_dtypes  # functional interface; assumed to live in the same module as pd_add_less_memory_more_speed

df2 = optimize_dtypes(
    dframe=df,
    point_zero_to_int=True,
    categorylimit=15,
    verbose=True,
    include_na_strings_in_pd_na=True,
    include_empty_iters_in_pd_na=True,
    include_0_len_string_in_pd_na=True,
    convert_float=True,
    check_float_difference=True,
    float_tolerance_negative=-0.1,
    float_tolerance_positive=0.1,
)
print(df)
print(df2)
print(df.dtypes)
print(df2.dtypes)


Memory usage of dataframe is: 0.12333202362060547 MB
█████████████████████████████
Analyzing df.PassengerId
----------------
df.PassengerId Is numeric!
df.PassengerId Max: 891
df.PassengerId Min: 1
df.PassengerId: Only .000 in columns -> Using int - Checking which size fits best ...
df.PassengerId: Using dtype: np.uint16
█████████████████████████████
Analyzing df.Survived
----------------
df.Survived Is numeric!
df.Survived Max: 1
df.Survived Min: 0
df.Survived: Only .000 in columns -> Using int - Checking which size fits best ...
df.Survived: Using dtype: np.uint8
█████████████████████████████
Analyzing df.Pclass
----------------
df.Pclass Is numeric!
df.Pclass Max: 3
df.Pclass Min: 1
df.Pclass: Only .000 in columns -> Using int - Checking which size fits best ...
df.Pclass: Using dtype: np.uint8
█████████████████████████████
Analyzing df.Name
----------------
df.Name: Using dtype: string
█████████████████████████████
Analyzing df.Sex
----------------
df.Sex: Using dtype: category
█████████████████████████████
Analyzing df.Age
----------------
df.Age Is numeric!
df.Age Max: 80.0
df.Age Min: 0.42
df.Age: Using dtype: Float64
█████████████████████████████
Analyzing df.SibSp
----------------
df.SibSp Is numeric!
df.SibSp Max: 8
df.SibSp Min: 0
df.SibSp: Only .000 in columns -> Using int - Checking which size fits best ...
df.SibSp: Using dtype: np.uint8
█████████████████████████████
Analyzing df.Parch
----------------
df.Parch Is numeric!
df.Parch Max: 6
df.Parch Min: 0
df.Parch: Only .000 in columns -> Using int - Checking which size fits best ...
df.Parch: Using dtype: np.uint8
█████████████████████████████
Analyzing df.Ticket
----------------
df.Ticket: Using dtype: string
█████████████████████████████
Analyzing df.Fare
----------------
df.Fare Is numeric!
df.Fare Max: 512.3292
df.Fare Min: 0.0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Max. positive difference - limit 0.1
498   -0.05
305   -0.05
708   -0.05
Max. negative difference - limit -0.1
679    0.1708
258    0.1708
737    0.1708
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
------------- <class 'numpy.float16'> ------------- not right for df.Fare
Checking next dtype...
True -> within the desired range: 0.1 / -0.1
False      5
True     886
-------------------
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Max. positive difference - limit 0.1
0      0.0
587    0.0
588    0.0
Max. negative difference - limit -0.1
0      0.0
598    0.0
587    0.0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+++++++++++++ <class 'numpy.float32'> +++++++++++++ right for df.Fare
True -> within the desired range: 0.1 / -0.1
True    891
-------------------
df.Fare: Using dtype: np.float32
█████████████████████████████
Analyzing df.Cabin
----------------
df.Cabin: Using dtype: string
█████████████████████████████
Analyzing df.Embarked
----------------
df.Embarked: Using dtype: category
█████████████████████████████
Analyzing df.truefalse
----------------
df.truefalse: Using dtype: np.bool_
█████████████████████████████
Analyzing df.onlynan
----------------
df.onlynan Is numeric!
df.onlynan Max: nan
df.onlynan Min: nan
df.onlynan: Only nan in column, continue ...
█████████████████████████████
Analyzing df.nestedlists
----------------
█████████████████████████████
Analyzing df.mixedstuff
----------------
█████████████████████████████
Analyzing df.floatnumbers
----------------
df.floatnumbers Is numeric!
df.floatnumbers Max: 3222.33
df.floatnumbers Min: 15.0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Max. positive difference - limit 0.1
890   -0.33
597   -0.33
592   -0.33
Max. negative difference - limit -0.1
527    0.075757
190    0.075757
171    0.075757
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
------------- <class 'numpy.float16'> ------------- not right for df.floatnumbers
Checking next dtype...
True -> within the desired range: 0.1 / -0.1
False    219
True     672
-------------------
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Max. positive difference - limit 0.1
0      0.0
587    0.0
588    0.0
Max. negative difference - limit -0.1
0      0.0
598    0.0
587    0.0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+++++++++++++ <class 'numpy.float32'> +++++++++++++ right for df.floatnumbers
True -> within the desired range: 0.1 / -0.1
True    891
-------------------
df.floatnumbers: Using dtype: np.float32
█████████████████████████████
Analyzing df.floatnumbers0
----------------
df.floatnumbers0 Is numeric!
df.floatnumbers0 Max: 3222.0
df.floatnumbers0 Min: 15.0
df.floatnumbers0: Only .000 in columns -> Using int - Checking which size fits best ...
df.floatnumbers0: Using dtype: np.uint16
█████████████████████████████
Analyzing df.intwithnan
----------------
df.intwithnan Is numeric!
df.intwithnan Max: 5
df.intwithnan Min: 1
df.intwithnan: Only .000 in columns -> Using int - Checking which size fits best ...
df.intwithnan: Using dtype: Int64
█████████████████████████████
Memory usage of dataframe was: 0.12333202362060547 MB
Memory usage of dataframe is now: 0.07259273529052734 MB
This is  58.85959960718511 % of the initial size
█████████████████████████████
█████████████████████████████
     PassengerId  Survived  Pclass  ... floatnumbers floatnumbers0  intwithnan
0              1         0       3  ...    33.440000          33.0           4
1              2         1       1  ...  3222.330000          15.0           5
2              3         1       3  ...    33.440000          33.0           3
3              4         1       1  ...    15.000000          33.0           1
4              5         0       3  ...    15.000000         344.0           2
..           ...       ...     ...  ...          ...           ...         ...
886          887         0       2  ...   344.424243         344.0           5
887          888         1       1  ...    15.000000          15.0           4
888          889         0       3  ...   344.424243        3222.0           2
889          890         1       1  ...   344.424243        3222.0           4
890          891         0       3  ...  3222.330000        3222.0        <NA>
[891 rows x 19 columns]
     PassengerId  Survived  Pclass  ... floatnumbers floatnumbers0  intwithnan
0              1         0       3  ...    33.439999            33           4
1              2         1       1  ...  3222.330078            15           5
2              3         1       3  ...    33.439999            33           3
3              4         1       1  ...    15.000000            33           1
4              5         0       3  ...    15.000000           344           2
..           ...       ...     ...  ...          ...           ...         ...
886          887         0       2  ...   344.424255           344           5
887          888         1       1  ...    15.000000            15           4
888          889         0       3  ...   344.424255          3222           2
889          890         1       1  ...   344.424255          3222           4
890          891         0       3  ...  3222.330078          3222        <NA>
[891 rows x 19 columns]
PassengerId        int64
Survived           int64
Pclass             int64
Name              object
Sex               object
Age              float64
SibSp              int64
Parch              int64
Ticket            object
Fare             float64
Cabin             object
Embarked          object
truefalse           bool
onlynan           object
nestedlists       object
mixedstuff        object
floatnumbers     float64
floatnumbers0    float64
intwithnan        object
dtype: object
PassengerId        uint16
Survived            uint8
Pclass              uint8
Name               string
Sex              category
Age               Float64
SibSp               uint8
Parch               uint8
Ticket             string
Fare              float32
Cabin              string
Embarked         category
truefalse            bool
onlynan            object
nestedlists        object
mixedstuff         object
floatnumbers      float32
floatnumbers0      uint16
intwithnan          Int64
dtype: object
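
The "Checking which size fits best" steps in the log above boil down to comparing a column's minimum and maximum against the limits of the candidate integer dtypes. A minimal sketch of that idea with plain numpy/pandas (illustrative only, not the package's actual implementation; smallest_uint_dtype is a made-up helper name):

import numpy as np
import pandas as pd

def smallest_uint_dtype(series: pd.Series) -> str:
    # pick the smallest unsigned integer dtype that can hold all values in the column
    lo, hi = series.min(), series.max()
    for dtype in ("uint8", "uint16", "uint32", "uint64"):
        info = np.iinfo(dtype)
        if lo >= info.min and hi <= info.max:
            return dtype
    return "uint64"

passenger_id = pd.Series(range(1, 892))   # same value range as df.PassengerId
print(smallest_uint_dtype(passenger_id))  # -> uint16, matching the log above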

    Parameters:
        dframe: Union[pd.Series, pd.DataFrame]
            The Series or DataFrame to optimize
        point_zero_to_int: bool
            Convert float columns to int when every value in the column has a fractional part of .0
            (default = True)
        categorylimit: int
            Convert string columns to category when the ratio len(df) / len(df.value_counts()) >= categorylimit
            (default = 4)
        verbose: bool
            Keep track of what is happening
            (default = True)
        include_na_strings_in_pd_na: bool
            When True -> treated as nan:

            [
            "<NA>",
            "<NAN>",
            "<nan>",
            "np.nan",
            "NoneType",
            "None",
            "-1.#IND",
            "1.#QNAN",
            "1.#IND",
            "-1.#QNAN",
            "#N/A N/A",
            "#N/A",
            "N/A",
            "n/a",
            "NA",
            "#NA",
            "NULL",
            "null",
            "NaN",
            "-NaN",
            "nan",
            "-nan",
            ]

            (default = True)
        include_empty_iters_in_pd_na: bool
            When True -> [], {} are treated as nan (default = False)

        include_0_len_string_in_pd_na: bool
            When True -> '' is treated as nan (default = False)
        convert_float: bool
            Convert columns containing float numbers to smaller float dtypes.
            Comparing the two dataframes from the example, one can see that the converted float
            numbers frequently don't have exactly the same value as the originals.
            If the decimal digits are important for your work, disable it (see the example after this list).
            (default = True)
        check_float_difference: bool
            If a small difference between the original and the downcast float values is fine for you,
            use True; the difference is then checked against the tolerances below.
            Ignored if convert_float=False
            (default = True)
        float_tolerance_negative: float
            The largest negative difference you can live with, e.g.
            3222.330000 - 3222.330078 = -0.000078 is fine for you
            Ignored if convert_float=False
            (default = -0.05)

        float_tolerance_positive: float
            The largest positive difference you can live with, e.g.
            3222.340078 - 3222.330000 = 0.010078 is fine for you
            Ignored if convert_float=False
            (default = 0.05)

    Returns:
        Union[pd.DataFrame, pd.Series]
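
If float precision matters for your data, you can either leave float columns untouched or tighten the tolerances. A minimal sketch reusing the df built in the Usage section (parameter names are taken from the list above; the optimize_df module name follows the import used earlier):

from optimize_df import optimize_dtypes

# keep float columns exactly as they are
df_exact = optimize_dtypes(dframe=df, convert_float=False)

# or allow only a tiny rounding error when downcasting floats
df_tight = optimize_dtypes(
    dframe=df,
    convert_float=True,
    check_float_difference=True,
    float_tolerance_negative=-0.001,
    float_tolerance_positive=0.001,
)

print(df_exact.dtypes)
print(df_tight.dtypes)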
