High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!

Development Status
- 3 - Alpha
Intended Audience
- Developers
- Financial and Insurance Industry
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Database

Project description

bcpandas

High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!! (pronounced BEE-CEE-Pandas)

Quickstart

In [1]: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: from bcpandas import SqlCreds, to_sql, read_sql

In [2]: creds = SqlCreds(
   ...:     'my_server',
   ...:     'my_db',
   ...:     'my_username',
   ...:     'my_password'
   ...: )

In [3]: df = pd.DataFrame(
   ...:         data=np.ndarray(shape=(10, 6), dtype=int), 
   ...:         columns=[f"col_{x}" for x in range(6)]
   ...:     )

In [4]: df
Out[4]: 
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

In [5]: to_sql(df, 'my_test_table', creds, index=False, if_exists='replace')

In [6]: df2 = read_sql('my_test_table', creds)

In [7]: df2
Out[7]: 
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

IMPORTANT - Read vs. Write

The big speedup benefit of bcpandas is in the to_sql function, as the benchmarks below show. The read_sql function, however, is actually slower than the pandas equivalent, so don't use it. Use bcpandas only for to_sql, and use native pandas for read_sql.

Also, read_sql is not fully tested for this reason: it became apparent that fixing all of the edge cases is not worth the effort.

Q: So why do we even have a read_sql function?

A: To complete the API, and in order to discover that there is no speedup for it in bcpandas. Now that this is determined, it will be removed in a future release.
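
In practice, the recommendation above boils down to something like the following sketch. The function name `write_fast_read_native` is made up for illustration; the `to_sql` call mirrors the Quickstart, and `pandas.read_sql_table` is handed the engine stored on `creds.engine`. It needs pandas, bcpandas, and a reachable SQL Server to actually run.

```python
def write_fast_read_native(df, table_name, creds):
    """Write `df` with bcpandas, then read the table back with plain pandas."""
    # Imported lazily so that merely defining this function needs no packages.
    import pandas as pd
    from bcpandas import to_sql

    # Fast path: bulk-load through BCP.
    to_sql(df, table_name, creds, index=False, if_exists="replace")
    # Read path: native pandas over the SQLAlchemy engine held by the creds.
    return pd.read_sql_table(table_name, creds.engine)
```

Usage would look like `df2 = write_fast_read_native(df, 'my_test_table', creds)`, with `df` and `creds` built as in the Quickstart.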

Benchmarks

See figures below. All code is in the /benchmarks directory. To run the benchmarks, run python benchmark.py main and fill in the command line options that are presented.

Running this will output:

- PNG image of the graph
- JSON file of the benchmark data
- JSON file with the environment details of the machine that was used to generate it

to_sql

I didn't bother including the pandas non-multiinsert version here because it just takes way too long.

to_sql benchmark graph

Why not just use the new pandas `method='multi'`?

- Because it is still much slower.
- Because you are forced to set the chunksize parameter to a very small number for it to work, generally a bit less than 2100/<number of columns>. This is because SQL Server can only accept up to 2100 parameters in a query. Discussions of this limit elsewhere also recommend using a bulk insert tool such as BCP. It seems that SQL Server simply didn't design the regular INSERT statement to support huge amounts of data.
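
To make the chunksize ceiling concrete, here is a minimal sketch of the arithmetic. The helper name `max_multi_chunksize` is made up for illustration, and reserving one parameter as headroom is an assumption standing in for "a bit less than", not a documented rule.

```python
SQL_SERVER_MAX_PARAMS = 2100  # parameter limit per query, as stated above


def max_multi_chunksize(n_columns, headroom=1):
    """Largest row count per INSERT that stays under the parameter limit.

    Each row in a multi-row INSERT consumes one parameter per column.
    `headroom` is an assumed safety margin, not a documented value.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return (SQL_SERVER_MAX_PARAMS - headroom) // n_columns


# A 6-column frame like the one in the Quickstart:
chunksize = max_multi_chunksize(6)
# With pandas this would be passed as:
#   df.to_sql("my_test_table", engine, method="multi", chunksize=chunksize)
```

For the 6-column Quickstart frame, this caps each INSERT at a few hundred rows, which is why the multi-insert path needs so many round trips.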

read_sql

As you can see, pandas native clearly wins here

read_sql benchmark graph

Requirements

Database

Any version of Microsoft SQL Server. Can be installed on-prem, in the cloud, on a VM, or the Azure SQL Database/Data Warehouse versions.

Python User

- BCP Utility
- Microsoft ODBC Driver 11, 13, 13.1, or 17 for SQL Server. See the pyodbc docs for details.
- Python >= 3.6
- pandas >= 0.19
- sqlalchemy >= 1.1.4
- pyodbc as the supported DBAPI
- Windows as the client OS
  - Linux and macOS are theoretically compatible, but never tested
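
A quick way to sanity-check part of this list before installing is sketched below. The function `check_client_requirements` is hypothetical, and it only covers the Python version and whether a `bcp` executable is on the PATH; it does not verify the ODBC driver or package versions.

```python
import shutil
import sys


def check_client_requirements(min_python=(3, 6)):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    # shutil.which returns the full path of the executable, or None if absent.
    if shutil.which("bcp") is None:
        problems.append("the bcp utility was not found on PATH")
    return problems


problems = check_client_requirements()
print("OK" if not problems else "; ".join(problems))
```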

Installation

Source	Command
PyPI	`pip install bcpandas`
Conda	`conda install -c conda-forge bcpandas`

Usage

- Create creds (see next section)
- Replace any df.to_sql(...) in your code with bcpandas.to_sql(df, ...)

That's it!

Credential/Connection object

To use bcpandas you need a bcpandas.SqlCreds object, which also carries a sqlalchemy.Engine. There are 2 options for constructing it.

Create the bcpandas SqlCreds object with just the minimum attributes needed (server, database, username, password), and bcpandas will create a full Engine object from this. It will use pyodbc, sqlalchemy, and the Microsoft ODBC Driver for SQL Server, and will store it in the .engine attribute.

In [1]: from bcpandas import SqlCreds

In [2]: creds = SqlCreds('my_server', 'my_db', 'my_username', 'my_password')

In [3]: creds.engine
Out[3]: Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password)

Pass a full Engine object to the bcpandas SqlCreds object, and bcpandas will attempt to parse out the server, database, username, and password to pass to the command line utilities. If a DSN is used, this will fail.

(continuing example above)

In [4]: creds2 = SqlCreds.from_engine(creds.engine)

In [5]: creds2.engine
Out[5]: Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password)

In [6]: creds2
Out[6]: SqlCreds(server='my_server', database='my_db', username='my_username', with_krb_auth=False, engine=Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password), password=[REDACTED])
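
If you are building the Engine yourself for the second option, the connection URL can be assembled along these lines. This is a sketch, assuming the same `odbc_connect` layout that appears in the Engine repr above; `build_mssql_url` is a made-up helper, and it has not been checked here that `SqlCreds.from_engine` can parse every variant of such a URL.

```python
from urllib.parse import quote_plus


def build_mssql_url(server, database, username, password,
                    driver="ODBC Driver 17 for SQL Server", port=1433):
    """Build a mssql+pyodbc URL using a pass-through ODBC connection string."""
    odbc = (
        f"Driver={{{driver}}};"
        f"Server=tcp:{server},{port};"
        f"Database={database};"
        f"UID={username};"
        f"PWD={password}"
    )
    # The ODBC string must be URL-encoded before it goes into the query part.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


url = build_mssql_url("my_server", "my_db", "my_username", "my_password")

# With sqlalchemy, pyodbc and bcpandas installed, the URL would be used as:
#   from sqlalchemy import create_engine
#   from bcpandas import SqlCreds
#   creds = SqlCreds.from_engine(create_engine(url))
```

Note that this is not a DSN-based connection, which matters because parsing fails when a DSN is used.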

Recommended Usage

General

Feature	Pandas native	BCPandas
Super speed	:x:	:white_check_mark:
Good for simple data types like numbers and dates	:x:	:white_check_mark:
Handle edge cases	:white_check_mark:	:x:
Handle messy string data	:white_check_mark:	:x:

`to_sql` specific

Feature	Pandas native	BCPandas
Super speed	:x:	:white_check_mark:
Only write to some columns in the SQL table	:white_check_mark:	:x:

`read_sql` specific

Use pandas native! (See earlier section IMPORTANT - Read vs. Write)

Feature	Pandas native	BCPandas
Speed and accuracy (basically, everything)	:white_check_mark:	:x:

built with the help of https://www.tablesgenerator.com/markdown_tables# and https://gist.github.com/rxaviers/7360908

Known Issues

Here are some caveats and limitations of bcpandas. Hopefully they will be addressed in future releases.

In the to_sql function:
- Bcpandas has been tested with all ASCII characters 32-127. Unicode characters beyond that range have not been tested.
- For now, an empty string ("") in the dataframe becomes NULL in the SQL database instead of remaining an empty string. We will hopefully fix this soon.
- If append is passed to the if_exists parameter, if the dataframe columns don't match the SQL table columns exactly by both name and order, it will fail.
- ~~If there is a NaN/Null in the last column of the dataframe it will throw an error. This is due to a BCP issue. See my issue with Microsoft about this here.~~ This doesn't seem to be a problem based on the tests.
- Because bcpandas first outputs to CSV, it needs to use several specific characters to create the CSV, including a delimiter and a quote character. Bcpandas attempts to use characters that are not present in the dataframe for this, going through the possible delimiters and quote characters specified in constants.py. If all possible characters are present in the dataframe and bcpandas cannot find both a delimiter and quote character to use, it will throw an error.
  - The BCP utility does not ignore delimiter characters when surrounded by quotes, unlike CSVs - see here in the Microsoft docs.
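
The delimiter and quote character search described in the last item can be pictured roughly as follows. This is only a sketch of the idea, not the bcpandas code: `pick_unused_char` and both candidate lists are hypothetical, and the real candidates are whatever constants.py defines.

```python
def pick_unused_char(candidates, values):
    """Return the first candidate character that occurs in none of the values."""
    text_values = [str(v) for v in values]
    for char in candidates:
        if not any(char in v for v in text_values):
            return char
    raise ValueError("every candidate character appears in the data")


# Hypothetical candidate lists; the real ones live in bcpandas' constants.py.
DELIMITER_CANDIDATES = [",", "|", "\t"]
QUOTE_CANDIDATES = ['"', "'", "`", "~"]

cells = ["plain", "a,b", 'say "hi"', 42]
delimiter = pick_unused_char(DELIMITER_CANDIDATES, cells)
quotechar = pick_unused_char(QUOTE_CANDIDATES, cells)
```

Here the comma and the double quote both occur in the data, so the search falls through to the next candidate in each list. If every candidate were present, the `ValueError` corresponds to the error case described above.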

Background

Writing data from pandas DataFrames to a SQL database is very slow using the built-in to_sql method, even with the newly introduced execute_many option. For Microsoft SQL Server, a far faster method is to use the BCP utility provided by Microsoft. This utility is a command line tool that transfers data between the database and flat text files.

This package is a wrapper for seamlessly using the bcp utility from Python using a pandas DataFrame. Despite the IO hits, the fastest option by far is saving the data to a CSV file in the file system and using the bcp utility to transfer the CSV file to SQL Server. Best of all, you don't need to know anything about using BCP at all!
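
For the curious, the CSV-then-bcp approach looks roughly like the sketch below. It is an illustration rather than what bcpandas does internally: the helper names are made up, plain tuples stand in for a DataFrame to keep it dependency-free, and it uses bcp's character mode (`-c`) with a field terminator (`-t`) instead of a format file. The command is only assembled, not executed.

```python
import csv
import os
import tempfile


def write_rows_to_csv(rows, path, delimiter=","):
    """Write an iterable of row tuples to a delimited text file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter=delimiter).writerows(rows)


def build_bcp_in_command(table, data_file, server, database, username, password,
                         delimiter=","):
    """Assemble (but do not run) a `bcp ... in` command as an argument list."""
    return [
        "bcp", table, "in", data_file,
        "-S", server,
        "-d", database,
        "-U", username,
        "-P", password,
        "-c",             # character-mode transfer
        "-t", delimiter,  # field terminator
    ]


tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "my_test_table.csv")
write_rows_to_csv([(1, "a"), (2, "b")], csv_path)
cmd = build_bcp_in_command("my_test_table", csv_path,
                           "my_server", "my_db", "my_username", "my_password")
# To actually load the file: subprocess.run(cmd, check=True)
```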

Existing Solutions

Much credit is due to bcpy for the original idea and for some of the code that was adopted and changed.

bcpy

bcpy has several flaws:

- No support for reading from SQL, only writing to SQL
- A convoluted, overly class-based internal design
- Scope a bit too broad - deals with pandas as well as flat files

This repository aims to fix and improve on bcpy and the above issues by making the design choices described in Design and Scope.

Design and Scope

The only scope of bcpandas is to read and write between a pandas DataFrame and a Microsoft SQL Server database. That's it. We do not concern ourselves with reading existing flat files to/from SQL - that introduces way too much complexity in trying to parse and decode the various parts of the file, like delimiters, quote characters, and line endings. Instead, to read/write an existing flat file, just import it via pandas into a DataFrame, and then use bcpandas.

The big benefit of this is that we get to precisely control all the finicky parts of the text file when we write/read it to a local file and then in the BCP utility. This lets us set library-wide defaults (maybe configurable in the future) and work with those.

For now, we are using the non-XML BCP format file type. In the future, XML format files may be added.
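
As a rough picture of what a non-XML format file for a delimited text file contains, here is a sketch that generates one. The function `build_format_file` is hypothetical and is not the bcpandas implementation. The column layout follows my reading of Microsoft's non-XML format file documentation, and the version string, collation, and terminators are assumed defaults that you should check against your own bcp installation. Column names containing spaces are not handled.

```python
def build_format_file(column_names, delimiter=",", row_terminator="\\n",
                      version="9.0",
                      collation="SQL_Latin1_General_CP1_CI_AS"):
    """Return the text of a non-XML format file for a delimited data file."""
    lines = [version, str(len(column_names))]
    last = len(column_names)
    for i, name in enumerate(column_names, start=1):
        # The last field ends with the row terminator instead of the delimiter.
        terminator = row_terminator if i == last else delimiter
        fields = [
            str(i),              # host file field order
            "SQLCHAR",           # host file data type (character data)
            "0",                 # prefix length
            "0",                 # host file data length (0 for delimited text)
            f'"{terminator}"',   # field terminator, quoted
            str(i),              # server column order
            name,                # server column name
            collation,           # column collation
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


print(build_format_file(["col_0", "col_1"]))
```

The row terminator is written as the two characters backslash and n, since the format file expects the escape sequence spelled out rather than an actual newline.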

Testing

Testing uses pytest. A local SQL Server is spun up using Docker.

Contributing

Please, all contributions are very welcome!

I will attempt to use the pandas docstring style as detailed here.

Release history

2.6.3

Jul 8, 2024

2.6.2

May 26, 2024

2.6.1

Apr 4, 2024

2.6.0

Feb 25, 2024

2.5.0

Nov 12, 2023

2.4.2

Jul 21, 2023

2.4.1

Jul 12, 2023

2.4.0

Apr 8, 2023

2.3.0

Feb 10, 2023

2.2.1

Jan 24, 2023

2.2.0

Jan 24, 2023

2.1.0

Jan 1, 2023

2.0.0

Mar 18, 2022

1.4.0

Sep 13, 2021

1.3.0

Jul 1, 2021

1.2.0

May 31, 2021

1.1.0

May 19, 2021

1.0.1

Nov 20, 2020

1.0.0 yanked

Jun 29, 2020

Reason this release was yanked:

forgot to change version in setup.py, should be 0.7.1

0.7.1

Jun 29, 2020

0.6.0

May 26, 2020

0.5.0

May 6, 2020

0.4.1

May 6, 2020

This version

0.3.0

May 1, 2020

0.2.8

Apr 13, 2020

0.2.7

Feb 12, 2020

0.2.6

Jan 28, 2020

0.2.3

Nov 17, 2019

0.2.2

Nov 17, 2019

0.2.0

Nov 3, 2019

0.1.8

Sep 3, 2019

0.1.7

Aug 15, 2019

0.1.6

Aug 7, 2019

0.1.5

Aug 7, 2019

0.1.4

Aug 7, 2019

0.1.2

Aug 6, 2019

0.1.1

Aug 6, 2019

0.1.0

Aug 6, 2019

Download files

Source Distribution

bcpandas-0.3.0.tar.gz (25.4 kB)

Uploaded May 1, 2020 Source

Built Distribution

bcpandas-0.3.0-py3-none-any.whl (23.6 kB)

Uploaded May 1, 2020 Python 3

Hashes for bcpandas-0.3.0.tar.gz

Algorithm	Hash digest
SHA256	`2bda3a81ed502a64746af2bd7fdb32e5bce784987d82a634657377f7a0fab1a0`
MD5	`65c87e49ce1f27437b6e9aa06555b1bb`
BLAKE2b-256	`fc6f8a84cb9ffda7658496e704b8bec03526a5a496adf02012232c045b4dae4a`

Hashes for bcpandas-0.3.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`4a15bdbd51fab79797a907fd3180e24a66b6971331b2239eaff7825af01f7de6`
MD5	`e1a7139952f66c6af2ca694964051993`
BLAKE2b-256	`14ccfb17e5511c49e422cb60221ac65fef206a9a17182271e2b958557bf822fa`
