Skip to main content

High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!

Project description

bcpandas

Code style: black PyPI version Conda-Forge version GitHub license PyPI pyversions PyPI status Awesome Badges

High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!

Quickstart

In [1]: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: from bcpandas import SqlCreds, to_sql, read_sql

In [2]: creds = SqlCreds(
   ...:     'my_server',
   ...:     'my_db',
   ...:     'my_username',
   ...:     'my_password'
   ...: )

In [3]: df = pd.DataFrame(
   ...:         data=np.ndarray(shape=(10, 6), dtype=int), 
   ...:         columns=[f"col_{x}" for x in range(6)]
   ...:     )

In [4]: df
Out[4]: 
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

In [5]: to_sql(df, 'my_test_table', creds, index=False, if_exists='replace')

In [6]: df2 = read_sql('my_test_table', creds)

In [7]: df2
Out[7]: 
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

Benchmarks

# TODO

Requirements

Database

Any version of Microsoft SQL Server. Can be installed on-prem, in the cloud, on a VM, or the Azure SQL Database/Data Warehouse versions.

Python User

  • BCP Utility
  • SqlCmd Utility
  • Microsoft ODBC Driver 11, 13, 13.1, or 17 for SQL Server. See the pyodbc docs for details.
  • Python >= 3.6
  • pandas >= 0.19
  • sqlalchemy >= 1.1.4
  • pyodbc as the supported DBAPI
  • Windows as the client OS
    • Linux and MacOS are theoretically compatible, but never tested

Installation

Source Command
PyPI pip install bcpandas
Conda conda install -c conda-forge bcpandas

Usage

  1. Create creds (see next section)
  2. Replace any df.to_sql(...) in your code with bcpandas.to_sql(df, ...)
  3. Replace any pd.read_sql(...) with bcpandas.read_sql(...)

That's it!

Credential/Connection object

Bcpandas requires a bcpandas.SqlCreds object in order to use it, and also a sqlalchemy.Engine. The user has 2 options when constructing it.

  1. Create the bcpandas SqlCreds object with just the minimum attributes needed (server, database, username, password), and bcpandas will create a full Engine object from this. It will use pyodbc, sqlalchemy, and the Microsoft ODBC Driver for SQL Server, and will store it in the .engine attribute.

    In [1]: from bcpandas import SqlCreds
    
    In [2]: creds = SqlCreds('my_server', 'my_db', 'my_username', 'my_password')
    
    In [3]: creds.engine
    Out[3]: Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password)
    
  2. Pass a full Engine object to the bcpandas SqlCreds object, and bcpandas will attempt to parse out the server, database, username, and password to pass to the command line utilities. If a DSN is used, this will fail.

    (continuing example above)

    In [4]: creds2 = SqlCreds.from_engine(creds.engine)
    
    In [5]: creds2.engine
    Out[5]: Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password)
    
    In [6]: creds2
    Out[6]: SqlCreds(server='my_server', database='my_db', username='my_username', with_krb_auth=False, engine=Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password), password=[REDACTED])
    

Recommended Usage

# TODO When to use bcpandas vs. regular pandas.

Limitations

Here are some caveats and limitations of bcpandas. Hopefully they will be addressed in future releases

  • In the to_sql function:
    • If append is passed to the if_exists parameter, if the dataframe columns don't match the SQL table columns exactly by both name and order, it will fail.
    • If there is a NaN/Null in the last column of the dataframe it will throw an error. This is due to a BCP issue. See my issue with Microsoft about this here .
    • Bcpandas attempts to use a delimiter that is not present in the dataframe. This is because BCP does not ignore delimiter characters when surrounded by quotes, unlike CSVs - see here in the Microsoft docs. Therefore, if all possible delimiter characters are present in the dataframe and bcpandas cannot find a delimiter to use, it will throw an error.
      • Possible delimiters are specified in constants.py .

Background

Reading and writing data from pandas DataFrames to/from a SQL database is very slow using the built-in read_sql and to_sql methods, even with the newly introduced execute_many option. For Microsoft SQL Server, a far far faster method is to use the BCP utility provided by Microsoft. This utility is a command line tool that transfers data to/from the database and flat text files.

This package is a wrapper for seamlessly using the bcp utility from Python using a pandas DataFrame. Despite the IO hits, the fastest option by far is saving the data to a CSV file in the file system and using the bcp utility to transfer the CSV file to SQL Server. Best of all, you don't need to know anything about using BCP at all!

Existing Solutions

Much credit is due to bcpy for the original idea and for some of the code that was adopted and changed.

bcpy

bcpy has several flaws:

  • No support for reading from SQL, only writing to SQL
  • A convoluted, overly class-based internal design
  • Scope a bit too broad - deals with pandas as well as flat files This repository aims to fix and improve on bcpy and the above issues by making the design choices described earlier.
magical-sqlserver

magical-sqlserver is a library to make working with Python and SQL Server very easy. But it doesn't fit what I'm trying to do:

  • No built in support for pandas DataFrame
  • Larger codebase, I'm not fully comfortable with the dependency on the very heavy pymssql

Design and Scope

The only scope of bcpandas is to read and write between a pandas DataFrame and a Microsoft SQL Server database. That's it. We do not concern ourselves with reading existing flat files to/from SQL - that introduces way to much complexity in trying to parse and decode the various parts of the file, like delimiters, quote characters, and line endings. Instead, to read/write an exiting flat file, just import it via pandas into a DataFrame, and then use bcpandas.

The big benefit of this is that we get to precicely control all the finicky parts of the text file when we write/read it to a local file and then in the BCP utility. This lets us set library-wide defaults (maybe configurable in the future) and work with those.

For now, we are using the non-XML BCP format file type. In the future, XML format files may be added.

Testing

Testing uses pytest. A local SQL Server is spun up using Docker.

Contributing

Please, all contributions are very welcome!

I will attempt to use the pandas docstring style as detailed here.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bcpandas-0.2.2.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

bcpandas-0.2.2-py3-none-any.whl (14.2 kB view details)

Uploaded Python 3

File details

Details for the file bcpandas-0.2.2.tar.gz.

File metadata

  • Download URL: bcpandas-0.2.2.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.8.0

File hashes

Hashes for bcpandas-0.2.2.tar.gz
Algorithm Hash digest
SHA256 9c09b9295dc43fa60e4a5f7bc183077ed17cb69b95938941d3522b115064ff8a
MD5 bc0493ba0c2c22f4809544fd03bc87e7
BLAKE2b-256 c37dc5b7b1806b1590c204e129c4ca33a3067b9d01ec1559da1349a99bc57439

See more details on using hashes here.

Provenance

File details

Details for the file bcpandas-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: bcpandas-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 14.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.8.0

File hashes

Hashes for bcpandas-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 2a0950dd59a4ac7fa853dbb2f5e5efa9e4f21a8954548abcac691ae181e96621
MD5 e05e7956fe13f3239bc82ecbf64a239b
BLAKE2b-256 9bc841ffb964d668fa2fea5d64e32905b6a4cf3075fa9713f2ee8afb84ee1b8f

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page