
Load data from databases to dataframes, the fastest way.

Project description

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function partitions the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX assigns one thread to each partition to load and write data in parallel. Currently, we support partitioning on numerical columns (which must not contain NULLs) for SPJA queries.
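
The even split described above can be sketched in a few lines; `split_range` is a hypothetical helper for illustration, not part of the ConnectorX API (the real splitting happens in Rust).

```python
# Illustrative sketch: evenly split the inclusive range [lo, hi] of the
# partition column into n near-equal sub-ranges, one per partition.
def split_range(lo, hi, n):
    size, rem = divmod(hi - lo + 1, n)
    parts, start = [], lo
    for i in range(n):
        # Spread any remainder across the first `rem` partitions.
        end = start + size + (1 if i < rem else 0) - 1
        parts.append((start, end))
        start = end + 1
    return parts

print(split_range(1, 6000000, 4))
```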

Check out more detailed usage and examples here. A general introduction of the project can be found in this blog post.

Installation

pip install connectorx

Performance

We compared different Python solutions that provide a read_sql function by loading a 10x TPC-H lineitem table (8.6 GB) from Postgres into a DataFrame with 4 cores of parallelism.

Time chart, lower is better.

Memory consumption chart, lower is better.

In conclusion, ConnectorX uses up to 3x less memory and 21x less time (3x less memory and 13x less time compared with Pandas). More on here.

How does ConnectorX achieve lightning speed while keeping the memory footprint low?

We observe that existing solutions copy the data multiple times when downloading it. Additionally, implementing a data-intensive application in Python incurs extra overhead.

ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures that the data is copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving the query, e.g. SELECT * FROM lineitem, ConnectorX will first issue a LIMIT 1 query SELECT * FROM lineitem LIMIT 1 to get the schema of the result set.

Then, if partition_on is specified, ConnectorX will issue SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem) to determine the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000. ConnectorX will then run a count query to get the size of each partition (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000). If partition_on is not specified, the count query will be SELECT COUNT(*) FROM (SELECT * FROM lineitem).
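
The helper queries above can be sketched as plain string construction; `plan_queries` is a hypothetical illustration of the query shapes described in the text (ConnectorX builds these internally in Rust).

```python
# Illustrative sketch of the metadata queries ConnectorX issues, as described
# above: a LIMIT 1 schema probe, a MIN/MAX range query, and a COUNT query.
def plan_queries(query, partition_on=None, lo=None, hi=None):
    plan = {"schema": f"SELECT * FROM ({query}) LIMIT 1"}
    if partition_on is not None:
        plan["range"] = f"SELECT MIN({partition_on}), MAX({partition_on}) FROM ({query})"
        plan["count"] = (
            f"SELECT COUNT(*) FROM ({query}) "
            f"WHERE {partition_on} > {lo} AND {partition_on} < {hi}"
        )
    else:
        plan["count"] = f"SELECT COUNT(*) FROM ({query})"
    return plan

plan = plan_queries("SELECT * FROM lineitem", "l_orderkey", 0, 10000)
```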

Finally, ConnectorX will use the schema info as well as the count info to allocate memory and download data by executing the queries normally.

Once the downloading begins, there will be one thread for each partition so that the data is downloaded in parallel at the partition level. Each thread will issue the query of its corresponding partition to the database and then write the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
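
A rough Python analogue of the per-partition step above (the real implementation uses Rust threads writing into a shared destination); `fetch_partition` and the queries are stand-ins for illustration only.

```python
# Sketch: one worker per partition, each issuing its own query "in parallel".
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(query):
    # Stand-in for issuing the query and streaming rows to the destination.
    return f"rows for: {query}"

queries = [
    f"SELECT * FROM lineitem WHERE l_orderkey > {lo} AND l_orderkey <= {lo + 10000}"
    for lo in range(0, 40000, 10000)
]

with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(fetch_partition, queries))
```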

How to specify the partition number?

partition_num determines how many queries we split the original one into and issue to the database. Under the hood, we use rayon as our parallel executor, which uses a pool of threads to handle the partitioned queries. The number of threads in the pool equals the number of logical cores on the machine. It is recommended to set partition_num to the number of available logical cores.
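
Applying the recommendation above in code; `recommended_partition_num` is a hypothetical helper, and the read_sql call is left commented out because it needs a reachable database.

```python
import os

# Use the number of logical cores as partition_num, per the recommendation.
def recommended_partition_num():
    return os.cpu_count() or 1  # os.cpu_count() can return None

kwargs = dict(partition_on="l_orderkey", partition_num=recommended_partition_num())
# cx.read_sql(postgres_url, "SELECT * FROM lineitem", **kwargs)
```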

How to choose the partition column?

partition_on specifies the column on which to partition, following the procedure above. To achieve the best performance, each partitioned query should ideally return the same number of rows; since we split the column evenly, it is recommended that the numerical partition_on column be evenly distributed. Whether the column is indexed may also affect performance, depending on the source database, so give it a try if you have multiple candidates. You can also partition the query manually if our partition method does not match your needs: ConnectorX will still return a single dataframe with the combined results of the list of queries you pass in.
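
Manual partitioning as described above amounts to building the query list yourself and passing it to read_sql; the boundaries below are illustrative, and the read_sql call is commented out since it needs a live database.

```python
# Build non-overlapping range predicates by hand instead of using partition_on.
bounds = [0, 15000000, 30000000, 45000000, 60000000]
queries = [
    f"SELECT * FROM lineitem WHERE l_orderkey > {lo} AND l_orderkey <= {hi}"
    for lo, hi in zip(bounds, bounds[1:])
]
# cx.read_sql(postgres_url, queries)  # one dataframe combining all results
```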

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here. For more planned data sources, please check out our discussion.

Sources

  • Postgres
  • MySQL
  • SQLite
  • Redshift (through Postgres protocol)
  • ClickHouse (through MySQL protocol)
  • SQL Server (no trusted_connection support yet)
  • Oracle
  • BigQuery - In Progress
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin (through Pandas)
  • Dask (through Pandas)
  • Polars (through PyArrow)

Detailed Usage and Examples

Rust docs: stable nightly

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query and download the data from the database into a dataframe (a Pandas DataFrame by default).

Parameters

  • conn: str: Connection string URI.
    • General supported URI scheme: (postgres|postgresql|mysql|mssql)://username:password@addr:port/dbname.
    • For now, SQLite only supports absolute paths, e.g. sqlite:///home/user/path/test.db.
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
  • partition_on: Optional[str]: The column to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column.
  • partition_num: Optional[int]: The number of partitions to generate.
  • index_col: Optional[str]: The index column to set for the result dataframe. Only applicable when return_type is pandas, modin or dask.
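
The keyword-only parameters from the signature above, assembled into one call; the URI, column name, and range are placeholders, and the read_sql line is commented out because it requires a reachable database.

```python
# Collect the keyword arguments documented above for a single read_sql call.
params = dict(
    return_type="pandas",
    protocol="binary",
    partition_on="l_orderkey",
    partition_range=(1, 60000000),
    partition_num=10,
)
# import connectorx as cx
# cx.read_sql("postgresql://username:password@server:port/database",
#             "SELECT * FROM lineitem", **params)
```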

Examples

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
    
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be queried automatically if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
    
  • Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results must be the same)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
    
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = f"""
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
    

Next Plan

Check out our discussion to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Developer's Guide

Please see Developer's Guide for information about developing ConnectorX.

Supports

You are always welcome to:

  1. Ask questions on Stack Overflow. Make sure to tag them with #connectorx.
  2. Ask questions & propose new ideas in our forum.
  3. Ask questions, join the discussion, and send us direct messages on our Discord (under the CONNECTOR category).

Organizations and Projects using ConnectorX

To add your project/organization here, reply to our post here.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

connectorx-0.2.2-cp310-cp310-win_amd64.whl (4.4 MB)

Uploaded CPython 3.10 Windows x86-64

connectorx-0.2.2-cp310-cp310-manylinux2014_x86_64.whl (4.5 MB)

Uploaded CPython 3.10

connectorx-0.2.2-cp310-cp310-macosx_10_15_intel.whl (5.7 MB)

Uploaded CPython 3.10 macOS 10.15+ intel

connectorx-0.2.2-cp39-cp39-win_amd64.whl (4.4 MB)

Uploaded CPython 3.9 Windows x86-64

connectorx-0.2.2-cp39-cp39-manylinux2014_x86_64.whl (4.5 MB)

Uploaded CPython 3.9

connectorx-0.2.2-cp39-cp39-macosx_10_15_intel.whl (5.7 MB)

Uploaded CPython 3.9 macOS 10.15+ intel

connectorx-0.2.2-cp38-cp38-win_amd64.whl (4.4 MB)

Uploaded CPython 3.8 Windows x86-64

connectorx-0.2.2-cp38-cp38-manylinux2014_x86_64.whl (4.5 MB)

Uploaded CPython 3.8

connectorx-0.2.2-cp38-cp38-macosx_10_15_intel.whl (5.7 MB)

Uploaded CPython 3.8 macOS 10.15+ intel

connectorx-0.2.2-cp37-cp37m-win_amd64.whl (4.4 MB)

Uploaded CPython 3.7m Windows x86-64

connectorx-0.2.2-cp37-cp37m-manylinux2014_x86_64.whl (4.5 MB)

Uploaded CPython 3.7m

connectorx-0.2.2-cp37-cp37m-macosx_10_15_intel.whl (5.7 MB)

Uploaded CPython 3.7m macOS 10.15+ intel

File details

Details for the file connectorx-0.2.2-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 77dca3b9838dcdd2e1156cbc5826ac06699fa913fa9a8a0af07ed81dd8d6157b
MD5 d9b5f3faa6323b7182201f18752838da
BLAKE2b-256 b3f113e5442b9c483ef86711345cfaf5453a0d0a48c48469091bb3b136bef2f0

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp310-cp310-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 4.5 MB
  • Tags: CPython 3.10
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 085b6d25f2a14448ad232e8af45edff2ee27873cd77e6ac9ece989e0e1f1a9e1
MD5 808a1dfd6b0f7b4639dd52921959bc08
BLAKE2b-256 f0b50c200434f7b743e988d1d12186642164f5755124967b245b6dad53e3d348

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp310-cp310-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp310-cp310-macosx_10_15_intel.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.10, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp310-cp310-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 b9c228e899facd2b3bcd56cd539897039ada36c39697b5f0a83e36b788b3622c
MD5 899f64fdaa34f12e812091de9bc9c71d
BLAKE2b-256 448de305c9309425e84180f89b3e70bf703aa3e1a8fc1b09402a540a830fb765

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 0f39a283385034297c8e543aad6bc5a908d2bd76470ee36bd6ad7b17803d947c
MD5 84d58e173fbbcbc752aec4fad89636c8
BLAKE2b-256 5a28f2b7bae432ec2b1a28788c4df868bf1e5ae4cbc9045485ca649cf8b66b19

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp39-cp39-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 4.5 MB
  • Tags: CPython 3.9
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fdbf5120dddf318b768e07610c1915c3bf66c681b743418d3ac279e27e7f1a02
MD5 2058b917fa64ccc01631e8daf0d4d21f
BLAKE2b-256 737857369f22c690d11900e0e225d92add56521824a973969554402a63042b60

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp39-cp39-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp39-cp39-macosx_10_15_intel.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.9, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp39-cp39-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 f51f1d863c7f76cb37768b5ad1d75d919ab2fd1fcf6e8adadaed9ebfd9a7e16a
MD5 6235a396946fd025dac89a13391a3177
BLAKE2b-256 6a51a18346cec05de82886ed82c66c2b478742ca57ffd50fccf87dd325f6fd79

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 bb97ebe4e185e54a6fe42d8719dd7ba1cdbf2e24f8b497eb8588457bc10d7d37
MD5 f11c327e41d9a88e4583616ed2b550f7
BLAKE2b-256 2eb255215708cfe1dc310d9c8a6468fec5c7c83bcad145c42cedd86bc5f516bf

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp38-cp38-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 4.5 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 10019998db0aab752929cb09cc1a525495d4eb1f4b6d6a4256c8f58ef5b45ab0
MD5 7495ced5fafe5a5393e77345a35d5847
BLAKE2b-256 706a07c89ae166cd58aaf24344e8b3564c52c38cb21195e804e4a5bf431078f7

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp38-cp38-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp38-cp38-macosx_10_15_intel.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.8, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp38-cp38-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 a848f238236c110549a1735cf8fc878dda913bf943672134bbb5a37c98cbb5b7
MD5 584021886d1f6aec4045235d53c2325b
BLAKE2b-256 0371961a7fb86b72c94d94d5ec46cc17afae3c3c69f9887993beb8716f98eb0a

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 704fb7638a2c1b80d87f36fec9388c94ac090e09dac36660a7f8527f7428eb7e
MD5 0335d6e02ccfa71a78f1693130920e7c
BLAKE2b-256 9e2e54030e2f4a0b601967dfedc7a850096a82ea2343493647e9104a542d3f37

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp37-cp37m-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp37-cp37m-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 4.5 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 636abf147fd7e381f89c500a1081afe7e18a98b97c4e7bbc4eef3ccda4262f3f
MD5 102bf4397b6ad1cd4f8fd81cc4f578c2
BLAKE2b-256 6b211eab45d698471219bad918bd846920d0ec9b9c028e6685348b6e6eafd302

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.2-cp37-cp37m-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.2-cp37-cp37m-macosx_10_15_intel.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.7m, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.2-cp37-cp37m-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 83cc4e81e5e70c82a4268ef33e894292af48f0cb5e4cd9ca72528414456defb6
MD5 ac95a74b0e8db37af1c75336898c41b6
BLAKE2b-256 3ee368e40203dc862577cb563d87c11fe61a8cba2db7a29d36523d9425fa09c0

See more details on using hashes here.

Provenance
