pd_to_mssql: Quick upload of pandas dataframes to Microsoft SQL Server
The incredible functionality afforded by pandas can make automating ETL tasks quick and painless, unless that task involves uploading data to a Microsoft SQL Server, as the standard
to_sql function is painfully slow. This package uses only pandas and pyodbc to achieve upload speeds comparable to SSIS packages or SQLAlchemy + pyodbc's
fast_executemany=True option, while keeping the existing table structure intact (rather than dropping the target table and recreating it with new column datatypes), and it requires only one line of code to do so.
Possible Alternatives to pd_to_mssql
I attempted many other solutions to SQL Server upload before writing this package. Here is a brief list of those options and my main issue with them:
| Alternative | Main issue |
| --- | --- |
| Writing dataframes to CSV and using the bcp utility | bcp cannot handle the delimiting character appearing within a field value, even if it is properly quoted. Also, too much tedious work outside of Python is required for each upload. |
| Writing dataframes to Excel and using SSIS packages | Works well, but requires a SQL Server instance to be installed in the deployment environment. Again, far too much tedious development outside of Python. |
| pyodbc + SQLAlchemy's fast_executemany=True option | Definitely the best of these options. Should work for most people. Not a fan of the behavior of to_sql's replace option, though, which drops the target table and recreates it with new column datatypes. |
How to install pd_to_mssql
```
pip install pd_to_mssql
```
How to use pd_to_mssql
```python
from pd_to_mssql import to_sql

to_sql(df_in, table_name, cnxn_string, schema='dbo', index=True, replace=False,
       chunk_size=1000, thread_count=5, ignore_truncation=False, ignore_missing=False)
```
| Parameter | Default | Description |
| --- | --- | --- |
| df_in | Required | Dataframe which will be uploaded to the SQL Server. |
| table_name | Required | Upload destination. Specify only the table name; do not include the schema. |
| cnxn_string | Required | ODBC connection string. |
| schema | 'dbo' | Specify the target table's schema if need be. |
| index | True | Upload the index to the target table. It will only be included if the index name matches a column in the target table. |
| replace | False | Truncate the target table before uploading the data contained within df_in. |
| chunk_size | 1000 | Number of rows included in each insert statement. 1000 is the maximum number of rows per insert statement allowed by MS SQL Server. |
| thread_count | 5 | Number of concurrent connections established for insertion. Increasing this value will improve performance as long as connection latency is the main bottleneck. |
| ignore_truncation | False | Ignore string truncation when uploading string values with more characters than are allowed in the target column. |
| ignore_missing | False | Instead of raising a MissingColumnsException when a target-table column is missing from the dataframe, insert null values into that column (provided it is nullable). |
How pd_to_mssql works
To start, all data contained within the dataframe is stringified to accommodate creation of the insert statements. Then a number of threads (from the threading module) are spawned in accordance with the thread_count parameter. Each of those threads receives a separate pyodbc connection. A temporary table is created in each connection, and insertion into each temp table is conducted concurrently. Once temp table insertion is complete on all threads, the temp tables are unloaded one-by-one into the target table. This last step only occurs if all temp table insertions complete successfully.
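The stringify-and-chunk step described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual internals; the function name and NULL/quoting handling here are assumptions.

```python
import pandas as pd

def build_insert_statements(df, table_name, chunk_size=1000):
    """Build multi-row INSERT statements, one per chunk of rows.
    Illustrative sketch: SQL Server allows at most 1000 rows per
    INSERT ... VALUES statement, hence the chunking."""
    cols = ", ".join(df.columns)

    def fmt(v):
        # Stringify every value; missing values become SQL NULL,
        # and single quotes are escaped by doubling them.
        if pd.isna(v):
            return "NULL"
        return "'" + str(v).replace("'", "''") + "'"

    rows = ["(" + ", ".join(fmt(v) for v in row) + ")"
            for row in df.itertuples(index=False)]
    statements = []
    for i in range(0, len(rows), chunk_size):
        values = ",\n".join(rows[i:i + chunk_size])
        statements.append(f"INSERT INTO {table_name} ({cols}) VALUES\n{values};")
    return statements

df = pd.DataFrame({"id": [1, 2, None], "name": ["a", "b's", "c"]})
stmts = build_insert_statements(df, "dbo.demo", chunk_size=2)
# With chunk_size=2 and 3 rows, this produces two statements.
```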
For each non-identity column in the target table, a column with the same name must exist in the provided dataframe. The data types of those columns in the dataframe are irrelevant, as they will be cast to the correct data type based on the column specification. If a column does not exist in the provided dataframe, then a MissingColumnsException will be raised, unless
ignore_missing=True, in which case null values will be inserted into that column, provided it is nullable.
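The column check above amounts to something like the following sketch. The target table's columns are hard-coded here for illustration (the package reads them from the database), and the helper name is an assumption; only MissingColumnsException is the package's own name.

```python
import pandas as pd

class MissingColumnsException(Exception):
    """Raised when the dataframe lacks a non-identity target-table column."""

def check_columns(df, target_columns, ignore_missing=False):
    # Find target-table columns absent from the dataframe.
    missing = [c for c in target_columns if c not in df.columns]
    if missing and not ignore_missing:
        raise MissingColumnsException(f"Missing columns: {missing}")
    # With ignore_missing=True, absent columns are filled with nulls.
    out = df.copy()
    for c in missing:
        out[c] = None
    return out[target_columns]

df = pd.DataFrame({"id": [1, 2]})
checked = check_columns(df, ["id", "name"], ignore_missing=True)
# checked now has both columns, with "name" entirely null.
```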
Some errors make it through the initial validation checks and are only caught once the insertion statement is executed. As such, the resulting SQLExceptions generally contain very little useful information for identifying the issue. To assist in debugging, the first failing insert statement in each thread will be written to disk in a directory called
__pd_to_mssql_exception, located in the current working directory at the time of the upload. From there, you can run those scripts in SSMS to easily identify where and why they are failing. In my experience, the most common issues which make it to this stage stem from invalid data conversions, e.g. unrecognized string representations of dates (convert to datetime.date or datetime.datetime before upload) or string columns containing numeric data with a few instances of non-numeric characters.
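Both failure modes above are usually easiest to fix by coercing the offending columns before upload. A small sketch with illustrative column names (errors="coerce" turns unparseable values into NaT/NaN rather than failing mid-upload, so decide first whether silent nulls are acceptable for your data):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-01-06", "not a date"],
    "amount": ["19.99", "n/a", "5"],
})

# Parse date strings into real datetimes; unparseable values become NaT.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Coerce a mostly-numeric string column; non-numeric entries become NaN.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
```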
- Insertion will only be committed to the database if there are no errors. If a SQLException is thrown, then no rows will be inserted into the target database.
- If replace=True, then table truncation will only occur if there are no errors during the temp table insertion process.
| Filename | Size | File type | Python version |
| --- | --- | --- | --- |
| pd_to_mssql-0.2.1.tar.gz | 7.2 kB | Source | None |