Load data from databases to dataframes, the fastest way.

ConnectorX enables you to load data from databases into Python in the fastest and most memory-efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function partitions the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX assigns one thread to each partition to load and write data in parallel. Currently, we support partitioning on numerical columns (which must not contain NULLs) for SPJA queries.
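
As a rough sketch of this splitting logic (illustrative only; this is not ConnectorX's actual implementation, and the range values below are made up), an even split of a numerical column into N partitions could be derived like this:

import math

def split_partitions(query, column, min_val, max_val, partition_num):
    # Evenly split [min_val, max_val] into partition_num ranges and wrap the
    # original query with one WHERE clause per range.
    step = math.ceil((max_val - min_val + 1) / partition_num)
    partitioned_queries = []
    for i in range(partition_num):
        lower = min_val + i * step
        upper = min(lower + step, max_val + 1)
        partitioned_queries.append(
            f"SELECT * FROM ({query}) AS t WHERE {column} >= {lower} AND {column} < {upper}"
        )
    return partitioned_queries

# e.g. ten even partitions of l_orderkey over a made-up range [1, 60000000]
queries = split_partitions("SELECT * FROM lineitem", "l_orderkey", 1, 60000000, 10)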

Check out more detailed usage and examples here. A general introduction to the project can be found in this blog post.

Installation

pip install connectorx

Performance

We compared different solutions in Python that provide a read_sql function, by loading a 10x TPC-H lineitem table (8.6 GB) from Postgres into a DataFrame with 4 cores of parallelism.

[Time chart, lower is better.]

[Memory consumption chart, lower is better.]

In conclusion, ConnectorX uses up to 3x less memory and 21x less time (3x less memory and 13x less time compared with Pandas). More details can be found here.

How does ConnectorX achieve lightning speed while keeping the memory footprint low?

We observe that existing solutions more or less copy the data multiple times while downloading it. Additionally, implementing a data-intensive application in Python brings extra overhead.

ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures that the data is copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving a query, e.g. SELECT * FROM lineitem, ConnectorX first issues a LIMIT 1 query, SELECT * FROM lineitem LIMIT 1, to get the schema of the result set.

Then, if partition_on is specified, ConnectorX issues SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem) to learn the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000. ConnectorX then runs a count query to get the size of each partition (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000). If no partition is specified, the count query is simply SELECT COUNT(*) FROM (SELECT * FROM lineitem).

Finally, ConnectorX uses the schema and count information to allocate memory, and downloads the data by executing the queries normally.

Once downloading begins, one thread is spawned for each partition so that the data is downloaded in parallel at the partition level. Each thread issues its partition's query to the database and then writes the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
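
Putting the steps together, a hedged reconstruction of the queries issued for SELECT * FROM lineitem partitioned on l_orderkey might look as follows (the exact SQL ConnectorX generates may differ; the partition boundaries are the illustrative ones from above):

query = "SELECT * FROM lineitem"
partition_on = "l_orderkey"

# Step 1: infer the result schema.
schema_query = f"{query} LIMIT 1"

# Step 2: learn the range of the partition column.
range_query = f"SELECT MIN({partition_on}), MAX({partition_on}) FROM ({query}) AS t"

# Step 3: count the rows of one partition to size the allocation.
count_query = f"SELECT COUNT(*) FROM ({query}) AS t WHERE {partition_on} > 0 AND {partition_on} < 10000"

# Step 4: the partition query itself, executed by one thread.
partition_query = f"SELECT * FROM ({query}) AS t WHERE {partition_on} > 0 AND {partition_on} < 10000"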

How to specify the partition number?

partition_num determines how many queries the original query is split into and issued to the database. Under the hood, we use rayon as our parallel executor, which uses a pool of threads to handle the partitioned queries. The number of threads in the pool equals the number of logical cores on the machine. It is recommended to set partition_num to the number of available logical cores.
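
For example, assuming a setup with the same lineitem table as above, partition_num can be derived from the machine's logical core count:

import os

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

# Match the number of partitions to the number of logical cores (rayon's default pool size).
partition_num = os.cpu_count() or 4

df = cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=partition_num)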

How to choose the partition column?

partition_on specifies which column to partition on, following the procedure described above. To achieve the best performance, each partitioned query should ideally return the same number of rows. Since we split the column's value range evenly, it is recommended that the numerical partition_on column be evenly distributed. Whether a column has an index may also affect performance, depending on the source database, so give it a try if you have multiple candidates. You can also partition the query manually if our partition method does not match your needs; ConnectorX will still return a single dataframe containing the combined results of the list of queries you pass in.
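
For instance, if the natural split is a date column rather than an evenly distributed integer, the query can be partitioned by hand (a sketch; the date boundaries below are made up):

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Hand-written partitions on a date column; ConnectorX still returns one combined dataframe.
queries = [
    "SELECT * FROM lineitem WHERE l_shipdate <  DATE '1995-01-01'",
    "SELECT * FROM lineitem WHERE l_shipdate >= DATE '1995-01-01' AND l_shipdate < DATE '1996-01-01'",
    "SELECT * FROM lineitem WHERE l_shipdate >= DATE '1996-01-01'",
]

df = cx.read_sql(postgres_url, queries)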

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here. For more planned data sources, please check out our discussion.

Sources

  • Postgres
  • MySQL
  • MariaDB (through the MySQL protocol)
  • SQLite
  • Redshift (through the Postgres protocol)
  • ClickHouse (through the MySQL protocol)
  • SQL Server
  • Azure SQL Database (through the MSSQL protocol)
  • Oracle
  • BigQuery (experimental: needs docs, benchmarks, and more tests)
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin (through Pandas)
  • Dask (through Pandas)
  • Polars (through PyArrow)
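
For example, assuming the same connection as above, switching between the destinations listed here is just a matter of return_type (the corresponding packages, e.g. polars or pyarrow, must be installed):

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

pl_df = cx.read_sql(postgres_url, query, return_type="polars")  # polars.DataFrame
tbl = cx.read_sql(postgres_url, query, return_type="arrow")     # pyarrow.Table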

Detailed Usage and Examples

Rust docs: stable nightly

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None, index_col: Optional[str] = None)

Run the SQL query and download the data from the database into a dataframe (a Pandas DataFrame by default).

Parameters

  • conn: str: Connection string URI.
    • General supported URI scheme: (postgres|postgresql|mysql|mssql)://username:password@addr:port/dbname.
    • For now, SQLite only supports absolute paths, for example: sqlite:///home/user/path/test.db.
    • Google BigQuery requires the absolute path of the authentication JSON file, for example: bigquery:///home/user/path/auth.json.
    • Please check here for more connection URI parameters supported by each database (e.g. trusted_connection for MSSQL, sslmode for Postgres).
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
  • partition_on: Optional[str]: The column on which to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column.
  • partition_num: Optional[int]: The number of partitions to generate.
  • index_col: Optional[str]: The index column to set on the result dataframe. Only applicable when return_type is pandas, modin or dask.
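
The partition-related and index parameters above can be combined; a hypothetical call (the range values below are illustrative):

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

df = cx.read_sql(
    postgres_url,
    query,
    return_type="pandas",
    partition_on="l_orderkey",
    partition_range=(1, 60000000),  # skip the MIN/MAX query by providing the range up front
    partition_num=10,
    index_col="l_orderkey",         # set the pandas index of the result
)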

Examples

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
    
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL query on the partition column (partition_range will be queried automatically if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
    
  • Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
    
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = """
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
    

Next Plan

Check out our discussion to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Developer's Guide

Please see Developer's Guide for information about developing ConnectorX.

Support

You are always welcome to:

  1. Ask questions on Stack Overflow. Make sure to tag them with #connectorx.
  2. Ask questions & propose new ideas in our forum.
  3. Ask questions, join the discussion, and send us direct messages in our Discord (under the CONNECTOR category).

Organizations and Projects using ConnectorX

To add your project/organization here, reply to our post here.
