pandabase links pandas DataFrames to SQL databases. Upsert, append, read, drop, describe...

pip install pandabase

DEPRECATION WARNING:

pandabase is OK, but you should probably look at Pangres instead

Pangres is a similar tool, but better written. It is generally faster. It doesn't have as many database management features, but they would be pretty easy to add...

Description

pandabase links pandas DataFrames to SQL databases, supporting read, append, upsert, and basic database management operations. If your project doesn't need a full-on ORM, it might need pandabase. If your project currently stores data as .csv, you definitely need pandabase.

By default, pandabase uses DataFrame.index as the primary key. Making an explicit primary key the default encourages better database schemas, and makes it easy to keep data clean even when it is updated frequently.

Designed for machine learning applications, especially time-series datasets that are updated over time and used in-memory for computation.

Tested under:

  • Python >= 3.6
  • Pandas >= 0.24, including 1.0
  • SQLAlchemy >= 1.3
  • SQLite
  • Postgres
    • requires psycopg2 and postgres >= 8

Features

  • pandabase.to_sql(df, ...) replaces df.to_sql(...)
  • pandabase.read_sql(...) replaces pd.read_sql(...)
  • primary key support:
    • by default, uses df.index as table PK (must have name != None)
    • filter results with lowest/highest kwargs: lowest <= results.pk <= highest
    • (new in 0.4): support for multi-indexes
    • optionally, generate integer index (with parameter auto_index=True)
  • multiple insert modes: how='create_only', 'upsert', or 'append'
  • datatypes (all nullable):
    • boolean
    • int
    • float
    • datetime (UTC only)
    • string

Bonus Features

  • moderately smart insertion automatically handles new records that 'almost correspond' to the existing database schema
  • to_sql can automatically add new columns to database as needed with kwarg: add_new_columns=True
  • supports arbitrary schemas in Postgres with kwarg: schema=name
  • comprehensive test suite (pytest)
  • companda(df1, df2) test tool: rich comparisons of DataFrames

Design Considerations

  • Minimal dependencies: Pandas (>= 0.24) & SQLAlchemy (>= 1.3, core only) are the only requirements
  • Database is the source of truth: pandabase will try to coerce incoming DataFrames to fit existing schema
    • also reasonably smart about how new tables are created from DataFrames
  • Not horrendously slow

License

MIT license

Thanks

Code partially stolen from: Dataset (nice, more general-purpose SQL interaction library) and pandas.sql

See also: Pangres, which is like pandabase, but a) faster on Postgres and b) has fewer features.

Installation

From your virtual environment of choice (including Conda):

~/$ pip install pandabase

For latest version:

~/$ git clone https://github.com/notsambeck/pandabase
~/$ cd pandabase
~/pandabase/$ pip install -r requirements.txt
~/pandabase/$ pip install .

Usage

# Python >= 3.6
>>> import pandas as pd
>>> import numpy as np
>>> import pandabase
>>> my_data = pd.DataFrame(index=range(7, 12), 
                           columns=['some_number'],
                           data=np.random.random((5,1)))
>>> my_data.index.name = 'my_index_name'        # index must be named to use as PK
>>> pandabase.to_sql(my_data, table_name='my_table', con='sqlite:///new_sqlite_db.sqlite', how='create_only')
Table('my_table', ...
>>> exit()

Your data is now persistently stored in a SQLite database, using my_data.index as the primary key. To append or update data, replace 'create_only' with 'append' or 'upsert'. To store records without an explicit index, use auto_index=True.
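Under the hood, an upsert inserts new rows and overwrites rows whose primary key already exists. A standalone sketch of the equivalent SQL, using the stdlib sqlite3 module (the raw SQL here is illustrative of the semantics, not pandabase's actual implementation; table and column names match the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE my_table (my_index_name INTEGER PRIMARY KEY, some_number REAL)"
)
con.executemany("INSERT INTO my_table VALUES (?, ?)", [(7, 0.1), (8, 0.2)])

# Upsert: key 8 already exists (its row is updated), key 9 is new (inserted)
con.executemany(
    "INSERT INTO my_table VALUES (?, ?) "
    "ON CONFLICT(my_index_name) DO UPDATE SET some_number = excluded.some_number",
    [(8, 0.9), (9, 0.3)],
)
rows = con.execute("SELECT * FROM my_table ORDER BY my_index_name").fetchall()
print(rows)  # [(7, 0.1), (8, 0.9), (9, 0.3)]
```

With how='append', the second insert would instead fail on the duplicate key 8; with how='create_only', writing to an existing table is refused entirely.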

~/pandabase$ ls
new_sqlite_db.sqlite
~/pandabase$ python
>>> import pandabase
>>> df = pandabase.read_sql('my_table', con='sqlite:///new_sqlite_db.sqlite')
>>> df
    some_number 
7   0.722416 
8   0.076045 
9   0.213118 
10  0.453716 
11  0.406995

Usage notes & recommendations:

Engines vs. strings

All methods accept either a string or sqlalchemy.Engine for argument 'con' (i.e. database connection). Using a string works, but the connection may not be returned to the connection pool at transaction end. Eventually, this may exhaust the connection pool. For applications, pass an engine object to pandabase.to_sql and pandabase.read_sql. Example:

>>> import pandabase
>>> engine = pandabase.engine_builder('postgresql+psycopg2://postgres:postgres@localhost:5432/testdb')
>>> pandabase.to_sql(df=df, con=engine, table_name='table0')   # to use default schema=None => 'public'
>>> pandabase.to_sql(df=df, con=engine, table_name='table0', schema='my_schema')   #  access my_schema.table

Caveat: safe names

pandabase.helpers.clean_name runs (silently) to clean all table and column names. It replaces spaces and punctuation with underscores, and uppercase letters with lowercase. If your incoming data has uppercase names, they will be changed; if your existing database has uppercase names, pandabase will not be able to access them.
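For illustration only, a rough re-implementation of this cleaning behavior (clean_name_sketch is a hypothetical stand-in, not pandabase's actual function — check pandabase.helpers.clean_name for the exact rules):

```python
import re

def clean_name_sketch(name: str) -> str:
    """Approximate the described behavior: lowercase everything,
    replace spaces and punctuation with underscores."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

print(clean_name_sketch("My Column (kW/h)"))  # my_column__kw_h_
print(clean_name_sketch("already_clean"))     # already_clean
```

The practical upshot: pick lowercase snake_case names up front, and the cleaning step becomes a no-op.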

Caveat: datatype parsing

pandabase.helpers.series_is_boolean tries to determine whether a series of (nominally) ints or floats might actually be boolean. This helps constrain data when it is correct; however, this function is very conservative to avoid e.g. making a column of all zeros boolean. Set the DataFrame's dtypes to avoid this potential pitfall.
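The pitfall can be avoided by declaring dtypes explicitly before calling to_sql, so no inference is needed. A minimal pandas sketch (the column name is_active is invented for illustration):

```python
import pandas as pd

# A column of 0.0/1.0 floats is ambiguous: it might be boolean, or it
# might be a float column that happens to contain only 0s and 1s.
flags = pd.Series([0.0, 1.0, 0.0], name="is_active")

as_bool = flags.astype(bool)        # declare: this really is boolean
as_float = flags.astype("float64")  # or declare: keep it as float
print(as_bool.dtype, as_float.dtype)  # bool float64
```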

Keyword arguments for pandabase.read_sql:

  • [lowest, highest]: minimum/maximum values for PK that will be retrieved.
    • Can be used independently of each other.
    • For multi-index tables, use a tuple of values in order.
    • e.g. pandabase.read_sql(con=con, table_name='multi_index_table', lowest=(min_value_for_pk0, min_value_for_pk1), highest=(max_value_for_pk0, max_value_for_pk1))
  • schema: string, schema for Postgres
    • e.g. pandabase.to_sql(con=con, table_name='bare_table', schema='my_schema')  # my_schema.bare_table
  • add_new_columns: bool, default False. if True, add columns to database as necessary to match incoming DataFrame
    • e.g. pandabase.to_sql(con=con, table_name='table0', add_new_columns=True)
  • how: ['create_only', 'append', or 'upsert']
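The lowest/highest kwargs amount to a range predicate on the primary key. A hypothetical stdlib sqlite3 sketch of the equivalent SQL (the table t and its contents are invented for illustration; pandabase builds this filter for you):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, val REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(i, i / 10) for i in range(7, 12)])

# read_sql(..., lowest=8, highest=10) restricts results to 8 <= pk <= 10
lowest, highest = 8, 10
rows = con.execute(
    "SELECT pk FROM t WHERE pk BETWEEN ? AND ?", (lowest, highest)
).fetchall()
print([r[0] for r in rows])  # [8, 9, 10]
```

Passing only lowest corresponds to dropping the upper bound, and vice versa.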

Usage note for releases up to 0.4.5: selecting an empty subset of data raises an error if type(lowest) != type(data), even when the types are comparable, e.g. float vs. int. (Newer behavior returns a DataFrame of length zero for an empty selection.)

Using Extra Features

Companda provides rich comparisons of DataFrames: call companda on two DataFrames and get a Companda object back that evaluates to True or False.

>>> from pandabase.companda import companda
>>> df = pandabase.read_sql('my_table', con='sqlite:///new_sqlite_db.sqlite')

>>> companda(df, df.copy())
Companda(True, message='DataFrames are equal')
>>> bool(companda(df, df.copy()))
True

>>> df2 = df.copy()
>>> df2.iloc[1, 0] = -1000

>>> companda(df, df2)
Companda(False, message='Columns and indices are equal, but unequal values in columns [some_number]...')
>>> bool(companda(df, df2))
False

Table utility functions:

For basic use cases, pandabase can handle simple database administration tasks. All of these support the schema=name kwarg in Postgres.

  • drop_db_table(table_name, con):
    • Drop table [table_name] from con - be careful with this!
  • get_db_table_names(con):
    • Get a list of table names from database.
  • get_table_column_names(con, table_name):
    • Get a list of column names from database, table.
  • describe_database(con):
    • Get a description of database content: {table_names: {table_info_dicts}}.
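These helpers roughly correspond to querying the database catalog (presumably via SQLAlchemy reflection, since SQLAlchemy core is pandabase's only database dependency). A hypothetical stdlib sqlite3 sketch of the same information, with invented tables table0/table1:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table0 (pk INTEGER PRIMARY KEY, x REAL)")
con.execute("CREATE TABLE table1 (pk INTEGER PRIMARY KEY, y TEXT)")

# Analogous to get_db_table_names(con): list tables from the catalog
names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(names)  # ['table0', 'table1']

# Analogous to get_table_column_names(con, 'table0'): list its columns
cols = [r[1] for r in con.execute("PRAGMA table_info(table0)")]
print(cols)  # ['pk', 'x']
```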
