pd_to_mssql : Quick upload of pandas dataframes to Microsoft SQL Server
Value Proposition
The incredible functionality afforded by pandas can make automating ETL tasks quick and painless, unless that task involves uploading data to a Microsoft SQL Server, as the standard to_sql function is painfully slow. This package uses only pandas and pyodbc to achieve upload speeds comparable to SSIS packages or SQLAlchemy + pyodbc's fast_executemany=True option, while also keeping the existing table structure intact (i.e. it does not drop the target table and recreate it with new column datatypes) and requiring only one line of code to do so.
Possible Alternatives to pd_to_mssql
I attempted many other solutions to SQL Server upload before writing this package. Here is a brief list of those options and my main issue with them:
Solution | Issue |
---|---|
Writing dataframes to csv and using the bcp utility | bcp cannot handle the delimiting character appearing within a field value, even if it is properly quoted. Also, too much tedious work outside of python is required for each upload. |
Writing dataframes to excel and using SSIS packages launched by subprocess.run() and the dtexec utility | Works well, but requires a SQL Server instance to be installed in the deployment environment. Again, way too much tedious development outside of python. |
pyodbc + SQLAlchemy's fast_executemany=True option | Definitely the best of these options. Should work for most people. Not a fan of the behavior of the if_exists='replace' parameter, but workarounds are available. I ran into some memory issues while attempting to upload some exceptionally large dataframes (1 million+ rows, 10+ columns), which prevented this from becoming my go-to solution (and no, I was not using the old "SQL Server" ODBC driver). |
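For reference, a minimal sketch of that last alternative (the server, database, and table names below are placeholders, not anything from this package):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details; adjust driver/server/database for your environment.
engine = create_engine(
    "mssql+pyodbc://MYSERVER/MyDatabase?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
    fast_executemany=True,
)

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df.to_sql("my_table", engine, schema="dbo", if_exists="append", index=False)
```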
How to install pd_to_mssql
```
pip install pd_to_mssql
```
Dependencies
- pandas
- pyodbc
How to use pd_to_mssql
```python
from pd_to_mssql import to_sql

to_sql(df_in, table_name, cnxn_string, schema='dbo', index=True, replace=False, chunk_size=1000, thread_count=5, ignore_truncation=False, ignore_missing=False)
```
Parameter | Required/Default Value | Description |
---|---|---|
df_in | Required | Dataframe which will be uploaded to the SQL Server. |
table_name | Required | Upload destination. Specify only the table name. Do not include the schema. |
cnxn_string | Required | ODBC connection string. See the Microsoft documentation on ODBC connection strings for more information. |
schema | 'dbo' | Specify the target table's schema if need be. |
index | True | Upload the index to the target table. Will only be included if the index name matches a column in the target table. |
replace | False | Truncate the target table before uploading the data contained within df_in |
chunk_size | 1000 | Number of rows included in each insert statement. 1000 is the maximum number of rows MS SQL Server allows in a single insert statement. |
thread_count | 5 | Number of concurrent connections established for insertion. Increasing this value will speed up performance as long as connection latency is the main bottleneck. |
ignore_truncation | False | Ignore string truncation when uploading string values with more characters than are allowed in the target column. This is accomplished by setting ANSI_WARNINGS OFF . |
ignore_missing | False | Instead of raising a MissingColumnsException when a target column is missing from the dataframe, attempt to insert null values into that column. |
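Putting it together, a typical call might look like this (the connection string and table name are placeholders for your own environment):

```python
import pandas as pd
from pd_to_mssql import to_sql

# Placeholder ODBC connection string and target table.
cnxn_string = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=MYSERVER;Database=MyDatabase;Trusted_Connection=yes;"
)

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Truncate dbo.my_table and reload it with the contents of df.
to_sql(df, "my_table", cnxn_string, schema="dbo", index=False, replace=True)
```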
How pd_to_mssql works
To start, all data contained within the dataframe is stringified to accommodate creation of the insert statements. Then a number of threads (from the threading module) are spawned in accordance with the thread_count parameter. Each of those threads then receives a separate pyodbc connection. A temporary table is created in each connection, and insertion into each temp table is conducted concurrently. Once temp table insertion is complete on all threads, the temp tables are unloaded one-by-one into the target table. This last step is only completed if all temp table insertions complete successfully.
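The snippet below is a minimal sketch of that pattern, not the package's actual source; error handling is omitted and the connection string, columns, and target table are made up:

```python
import threading
import pyodbc

CNXN_STRING = "Driver={ODBC Driver 17 for SQL Server};Server=MYSERVER;Database=MyDatabase;Trusted_Connection=yes;"

def load_chunk(rows, slot, results):
    """Insert one chunk of stringified rows into a per-connection temp table."""
    cnxn = pyodbc.connect(CNXN_STRING, autocommit=False)
    cursor = cnxn.cursor()
    cursor.execute("CREATE TABLE #staging (col1 NVARCHAR(255), col2 NVARCHAR(255))")
    cursor.executemany("INSERT INTO #staging (col1, col2) VALUES (?, ?)", rows)
    # Keep the connection open so #staging survives until the final unload.
    results[slot] = (cnxn, cursor)

# Chunks of stringified rows, one chunk per thread.
chunks = [[("1", "a"), ("2", "b")], [("3", "c")]]
results = [None] * len(chunks)
threads = [threading.Thread(target=load_chunk, args=(chunk, i, results))
           for i, chunk in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Unload each temp table into the target only after every thread has succeeded.
for cnxn, cursor in results:
    cursor.execute("INSERT INTO dbo.target_table (col1, col2) SELECT col1, col2 FROM #staging")
    cnxn.commit()
    cnxn.close()
```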
Column Mapping
For each non-identity column in the target table, a column with the same name must exist in the provided dataframe, but the data types of those columns in the dataframe are irrelevant, as they will be cast to the correct data type based on the column specification. If a column does not exist in the provided dataframe, then a MissingColumnsException will be raised, unless ignore_missing=True, in which case null values will be inserted into the column, provided it is nullable.
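For example, assuming a hypothetical target table dbo.my_table with a nullable created_at column that the dataframe lacks:

```python
import pandas as pd
from pd_to_mssql import to_sql

# Placeholder connection string.
cnxn_string = "Driver={ODBC Driver 17 for SQL Server};Server=MYSERVER;Database=MyDatabase;Trusted_Connection=yes;"

# df has no 'created_at' column, even though the hypothetical target table does.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Without ignore_missing=True this would raise MissingColumnsException;
# with it, null values are inserted into 'created_at' (which must be nullable).
to_sql(df, "my_table", cnxn_string, schema="dbo", index=False, ignore_missing=True)
```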
Debugging
Some errors make it through the initial validation checks and are only caught once the insertion statement is executed. As such, the generated SQLExceptions generally contain very little information useful for identifying the issue. To assist in debugging, the first failing insert statement in each thread will be written to disk in a directory called __pd_to_mssql_exception located in the current working directory at the time of the upload. From there, you will be able to run those scripts in SSMS to easily identify where and why they are failing. In my experience, the most common issues which make it to this stage stem from invalid data conversions, e.g. unrecognized string representations of dates (convert to datetime.date or datetime.datetime before upload) or string columns containing numeric data with a few instances of non-numeric characters.
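When that happens, casting the offending columns in pandas before the upload usually resolves it; for example (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2021-01-05", "not a date"],
                   "amount": ["10.50", "n/a"]})

# Parse strings into real datetimes/numerics so SQL Server is not asked to
# convert ambiguous text; unparseable values become NaT/NaN.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
```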
Miscellaneous Notes
- Insertion will only be committed to the database if there are no errors. If a SQLException is thrown, then no rows will be inserted into the target database.
- If replace=True, then table truncation will only occur if there are no errors during the temp table insertion process.