dynamo-pandas

Make working with pandas data and AWS DynamoDB easy.

Motivation

This package aims at making the transfer of data between pandas dataframes and DynamoDB as simple as possible. To meet this goal, the package offers two key features:

  1. Automatic conversion of pandas data types to DynamoDB supported data types.
  2. A simple, high-level interface to put data from a dataframe into a DynamoDB table and get all or selected items from a table into a dataframe.

Documentation

The project's documentation is available at https://dynamo-pandas.readthedocs.io/.

Installation

Install dynamo-pandas from PyPI using pip:

python -m pip install dynamo-pandas

This installs the package and its dependencies except for boto3, which is left out by default to avoid an unnecessary installation when building Lambda layers.

To include boto3 as part of the installation, add the boto3 "extra" this way:

python -m pip install "dynamo-pandas[boto3]"

Example Usage

Consider the pandas DataFrame below.

>>> print(players_df)

      player_id           last_play       play_time  rating  bonus_points
0    player_one 2021-01-18 22:47:23 2 days 17:41:55     4.3             3
1    player_two 2021-01-19 19:07:54 0 days 22:07:34     3.8             1
2  player_three 2021-01-21 10:22:43 1 days 14:01:19     2.5             4
3   player_four 2021-01-22 13:51:12 0 days 03:45:49     4.8          <NA>

The columns of the dataframe use different data types, some of which are not natively supported by DynamoDB, such as numpy's datetime64 and timedelta64 and pandas' nullable integers.

>>> players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
    #   Column        Non-Null Count  Dtype          
   ---  ------        --------------  -----          
    0   player_id     4 non-null      object         
    1   last_play     4 non-null      datetime64[ns] 
    2   play_time     4 non-null      timedelta64[ns]
    3   rating        4 non-null      float64        
    4   bonus_points  3 non-null      Int8           
dtypes: Int8(1), datetime64[ns](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 264.0+ bytes
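
For readers who want to follow along, a dataframe like the one above could be built as sketched below. This is only an illustration reconstructed from the printed values, not code taken from the project. The exact dtypes reported by info() (for example the datetime resolution or the string column dtype) may vary with the pandas version.

```python
import pandas as pd

# Rebuild the example data from the values printed above.
players_df = pd.DataFrame(
    {
        "player_id": ["player_one", "player_two", "player_three", "player_four"],
        "last_play": pd.to_datetime(
            [
                "2021-01-18 22:47:23",
                "2021-01-19 19:07:54",
                "2021-01-21 10:22:43",
                "2021-01-22 13:51:12",
            ]
        ),
        "play_time": pd.to_timedelta(
            [
                "2 days 17:41:55",
                "0 days 22:07:34",
                "1 days 14:01:19",
                "0 days 03:45:49",
            ]
        ),
        "rating": [4.3, 3.8, 2.5, 4.8],
        # Nullable integer column: the missing value is stored as <NA>.
        "bonus_points": pd.array([3, 1, 4, None], dtype="Int8"),
    }
)

print(players_df)
```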

Storing the rows of this dataframe to DynamoDB requires multiple data type conversions.
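
To give a sense of what such conversions involve, here is a minimal sketch of converting one row by hand. It is not the package's implementation. The helper name to_dynamodb_item is hypothetical, and the choices made here (timestamps and timedeltas as strings, floats as Decimal, missing values as None) are assumptions chosen for illustration.

```python
from decimal import Decimal

import pandas as pd


def to_dynamodb_item(row: dict) -> dict:
    """Hypothetical helper: convert one row's values to DynamoDB-friendly types."""
    item = {}
    for name, value in row.items():
        if pd.isna(value):
            # Missing values (NaN, NaT, <NA>) are represented as None here.
            item[name] = None
        elif isinstance(value, (pd.Timestamp, pd.Timedelta)):
            # Dates and durations are stored as their string representation.
            item[name] = str(value)
        elif isinstance(value, float):
            # boto3 does not accept Python floats for DynamoDB numbers; use Decimal.
            item[name] = Decimal(str(value))
        else:
            item[name] = value
    return item


row = {
    "player_id": "player_four",
    "last_play": pd.Timestamp("2021-01-22 13:51:12"),
    "play_time": pd.Timedelta("0 days 03:45:49"),
    "rating": 4.8,
    "bonus_points": pd.NA,
}
item = to_dynamodb_item(row)
print(item)
```

Doing this for every column and every row is the kind of work the functions described next are meant to take off your hands.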

>>> from dynamo_pandas import put_df, get_df, keys

The put_df function adds or updates the rows of a dataframe in the specified table, taking care of the required type conversions. The table must already exist, and the primary key column(s) must be present in the dataframe.

>>> put_df(players_df, table="players")

The get_df function retrieves the items matching the specified key(s) from the table into a dataframe.

>>> df = get_df(table="players", keys=[{"player_id": "player_three"}, {"player_id": "player_one"}])
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0             4  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1             3    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55

When the table uses only a partition key, the keys function simplifies the generation of the keys list.

>>> df = get_df(table="players", keys=keys(player_id=["player_two", "player_four"]))
>>> print(df)

   bonus_points    player_id            last_play  rating        play_time
0           1.0   player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
1           NaN  player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
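
As an illustration of what the keys helper saves you from writing, the sketch below builds the same kind of list of single-attribute dictionaries that was passed to get_df earlier. The function make_keys is a hypothetical stand-in written for this example. It is assumed, not confirmed, to produce the same structure as the package's keys function.

```python
def make_keys(**kwargs):
    """Hypothetical stand-in: build a list of partition-key dicts from one keyword."""
    if len(kwargs) != 1:
        raise ValueError("Expected exactly one partition key name")
    # Unpack the single (attribute name, list of values) pair.
    ((name, values),) = kwargs.items()
    return [{name: value} for value in values]


key_list = make_keys(player_id=["player_two", "player_four"])
print(key_list)
```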

The get_df function returns basic types and attempts no automatic type conversion, so dates and timedeltas come back as strings (object dtype).

>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
    #   Column        Non-Null Count  Dtype  
   ---  ------        --------------  -----  
    0   bonus_points  1 non-null      float64
    1   player_id     2 non-null      object 
    2   last_play     2 non-null      object 
    3   rating        2 non-null      float64
    4   play_time     2 non-null      object 
dtypes: float64(2), object(3)
memory usage: 208.0+ bytes

The dtype parameter of the get_df function allows specifying the desired data types.

>>> df = get_df(
...     table="players",
...     keys=keys(player_id=["player_two", "player_four"]),
...     dtype={
...         "bonus_points": "Int8",
...         "last_play": "datetime64[ns, UTC]",
...         "play_time": "timedelta64[ns]"  # See note below.
...     }
... )
>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
    #   Column        Non-Null Count  Dtype              
   ---  ------        --------------  -----              
    0   bonus_points  1 non-null      Int8               
    1   player_id     2 non-null      object             
    2   last_play     2 non-null      datetime64[ns, UTC]
    3   rating        2 non-null      float64            
    4   play_time     2 non-null      timedelta64[ns]    
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes

Note: Due to a known bug in pandas versions < 1.5, timedelta strings cannot be converted back to Timedelta type via this parameter (ref. https://github.com/pandas-dev/pandas/issues/38509). If using pandas < 1.5, use the pandas.to_timedelta function instead:

>>> import pandas as pd
>>> df.play_time = pd.to_timedelta(df.play_time)
>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
    #   Column        Non-Null Count  Dtype              
   ---  ------        --------------  -----              
    0   bonus_points  1 non-null      Int8               
    1   player_id     2 non-null      object             
    2   last_play     2 non-null      datetime64[ns, UTC]
    3   rating        2 non-null      float64            
    4   play_time     2 non-null      timedelta64[ns]    
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes

Omitting the keys parameter performs a scan of the table and returns all the items.

>>> df = get_df(table="players")
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0           4.0  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1           NaN   player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
2           3.0    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
3           1.0    player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
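
For context, a DynamoDB scan returns results in pages, so reading a whole table means following the LastEvaluatedKey of each response until none is returned. The sketch below shows that loop against any object exposing a boto3-style scan method. It is not the package's implementation, and scan_all and FakeTable are hypothetical names introduced here so the example runs without AWS access.

```python
def scan_all(table):
    """Collect every item from a table by following scan pagination."""
    items = []
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No more pages to read.
            return items
        # Resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


class FakeTable:
    """Hypothetical stand-in for a boto3 Table resource, for local illustration only."""

    def __init__(self, pages):
        self._pages = pages

    def scan(self, **kwargs):
        index = kwargs.get("ExclusiveStartKey", {}).get("page", 0)
        response = {"Items": self._pages[index]}
        if index + 1 < len(self._pages):
            response["LastEvaluatedKey"] = {"page": index + 1}
        return response


table = FakeTable(
    [
        [{"player_id": "player_one"}, {"player_id": "player_two"}],
        [{"player_id": "player_three"}],
    ]
)
all_items = scan_all(table)
print(all_items)
```

Because a scan reads every item in the table, it can be slow and costly on large tables. Passing keys is preferable when the items of interest are known.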

License

Released under the terms of the MIT License.
