Skip to main content

Transfer data between pyarrow and SQL Server using Microsoft's Bulk Copy Program

Project description

arrow-bcp

pip install arrow-bcp

When loading data into SQL Server from Python, the common methods are based on the ODBC interface and driver. While this is a tried and tested approach, the performance is not optimal. Reasons for this include:

  • minimal logging cannot be used by regular insert statements unless they use table-valued parameters, which are cumbersome to use.
  • General overhead caused by the ODBC specification. Most databases other than SQL Server provide their own connector in addition to an ODBC driver for this reason

Microsoft's Bulk Copy Program (bcp) command-line utility seems to use additional techniques beyond the ODBC specification and achieves better performance.

The purpose of this library is to (de-)serialize pyarrow dataframes into SQL Server's native format so that it can be understood and ingested by bcp, or equivalently, SQL Server's BULK INSERT statement.

Some wrappers around the bcp CLI are provided as well to simplify interaction with it.

For example usage including a benchmark for inserts, see the notebooks in the examples directory.

Prerequisites

  • An ODBC Driver Manager. On Linux/Mac, you may need to install unixODBC. On Windows, a driver manager is preinstalled.
  • The ODBC Driver for SQL Server, which also bundles the bcp executable (Link).
  • If the ODBC Driver for SQL Server was not installed using Microsoft's install script (e.g., by copy-pasting the binaries onto a system), you may need to configure odbc.ini (located in /etc/ on Linux) such that "ODBC Driver 17 for SQL Server," or whichever version you installed, points to the ODBC driver executable. If bcp is not on the PATH, you will also need to specify its location using arrow_bcp.set_bcp_executable(path).

Type mapping

Arrow to SQL Server

Arrow SQL Server
utf8 varchar(max) (utf8)
binary varbinary(max)
fixedbinary varbinary(max)
decimal128(p, s) decimal(p, s)
boolean bit
int8 smallint
uint8 tinyint
int16 smallint
uint16 int
int32 int
uint32 bigint
int64 bigint
uint64 decimal(20, 0)
float16 real
float32 real
float64 float
date32 date
date64 datetime2
timestamp(s/ms/us/ns) datetime2
time32(s/ms) time
time64(us/ns) time
timestamp(s/ms/us/ns+tz) datetimeoffset
null varbinary(max)

SQL Server to Arrow

SQL Server Arrow
(n)(var)char utf8
(n)text utf8
variant utf8
(var)binary binary
user defined type binary
uniqueidentifier fixedbinary(16)
decimal(p, s) decimal128(p, s) or null
numeric(p, s) decimal128(p, s) or null
(small)money decimal128
bit boolean
tinyint uint8
smallint int16
int int32
bigint int64
real float32
float float64
date date32
(small)datetime timestamp(us)
datetime2 timestamp(us)
time time64(ns)
datetimeoffset timestamp(us+tz) or null

Limitations

Reading data from SQL Server

Decimals / timestamps with timezones

In both Arrow and SQL Server's format, decimal size/precision and timezone offset are part of the column's datatype. However, bcp does not provide this information as metadata and instead specifies it separately for each cell (even though all cells must match). As a consequence, the datatype of an Arrow decimal/timezone column is only known once the first non-null cell is read. If all cells are null, we don't get this information, so the entire datatype is set to null as a workaround. This may lead to issues when another datatype is expected.

Datetime2 / datetimeoffset

Both of these datatypes store time with an accuracy of 100ns. However Arrow only offers datatype with accuracy 1ns or 1000ns (1us) and both have tradeoffs. The 1ns datatype cannot represent larger values like '9999-12-31' and the 1000ns datatype truncates the last digit, i.e. '9999-12-31 11:11:11.1234567' turns to '9999-12-31 11:11:11.1234560'. This library chooses to truncate.

Acknowledgements

Thanks to Adam Serafini for figuring out the build process for Zig extensions and providing an example module. He explains that it was quite a journey in his talk, which I appreciate, as even just figuring out that targeting x86_64-windows instead of x86_64-windows-msvc works better for Windows compilation took me several hours.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arrow_bcp-0.9.2.tar.gz (24.1 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

arrow_bcp-0.9.2-cp313-cp313-manylinux_2_34_x86_64.whl (751.8 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

arrow_bcp-0.9.2-cp312-cp312-manylinux_2_34_x86_64.whl (751.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

arrow_bcp-0.9.2-cp311-cp311-manylinux_2_34_x86_64.whl (746.6 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

arrow_bcp-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl (748.0 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

File details

Details for the file arrow_bcp-0.9.2.tar.gz.

File metadata

  • Download URL: arrow_bcp-0.9.2.tar.gz
  • Upload date:
  • Size: 24.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for arrow_bcp-0.9.2.tar.gz
Algorithm Hash digest
SHA256 8c5daf2e26c9b379504358597b508888cb4d542715d9f2cf7c2e2589cc9b9ef4
MD5 298ff2ab4b20576089d6d0f2a8f93d5c
BLAKE2b-256 675dfc26f8a26b8dee7cb57020b95705dc667908c1ebfa15a9d4f6bb07ba00a2

See more details on using hashes here.

File details

Details for the file arrow_bcp-0.9.2-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for arrow_bcp-0.9.2-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 07f19976ff307501f8c0c8cf6463b9855519c4667000a12f109d67daec8b597d
MD5 b888addcbda5f2cf8161f4d9befe04fd
BLAKE2b-256 53dceac28e4c72b4c7932e4daa3719c2cac5999ba41812d982ffd3ae1f109793

See more details on using hashes here.

File details

Details for the file arrow_bcp-0.9.2-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for arrow_bcp-0.9.2-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 0ef35d44b59e8ff580954d7369938ca6d37f027f7fbda2a4fa632944e2e8c66d
MD5 767f2005c3bbab17a62a12053416f09f
BLAKE2b-256 ab3efdb621f25f6026199f8f2153f791af2bb2add480fb565ad7475697f9101d

See more details on using hashes here.

File details

Details for the file arrow_bcp-0.9.2-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for arrow_bcp-0.9.2-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 6771ca94f869f4f8b8b46bd33569d327c0e1c86a31d4b0797cc98a6665bd84d1
MD5 5c89a0e6165c61d173a19039dbf2fec9
BLAKE2b-256 ba990f5100982484885982f9bbd654b5a264e3385c04b7f9f1fae35e9bdb3fae

See more details on using hashes here.

File details

Details for the file arrow_bcp-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for arrow_bcp-0.9.2-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5fe434270a3f3789a18438c99faeaba722f24a961f511c5c765696317b84e625
MD5 c505fd4916fb628ee984b1d2844e6a90
BLAKE2b-256 ea9fcf7b532c86eeca1cc2fae9e6d4f150ce5a4c9040bd77965ec8b336e796ce

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page