
Load data from databases to dataframes, the fastest way.


ConnectorX enables you to load data from databases into Python in the fastest and most memory-efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function will partition the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX will assign one thread to each partition to load and write data in parallel. Currently, we support partitioning on integer columns for SPJA queries.

Check out more detailed usage and examples here. A general introduction to the project can be found in this blog post.

Installation

pip install connectorx

Performance

We compared different Python solutions that provide a read_sql function by loading a 10x TPC-H lineitem table (8.6 GB) from Postgres into a DataFrame, using 4 cores of parallelism.

Time chart, lower is better.

Memory consumption chart, lower is better.

In conclusion, ConnectorX uses up to 3x less memory and 21x less time. More details can be found here.

How does ConnectorX achieve such lightning speed while keeping the memory footprint low?

We observe that existing solutions more or less copy the data multiple times when downloading it. Additionally, implementing a data-intensive application in Python brings additional overhead.

ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures that the data will be copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving the query, e.g. SELECT * FROM lineitem, ConnectorX will first issue a LIMIT 1 query, SELECT * FROM lineitem LIMIT 1, to get the schema of the result set.

Then, if partition_on is specified, ConnectorX will issue SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem) to determine the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000. ConnectorX will then run a count query to get the size of each partition (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000). If partition_on is not specified, the count query will be SELECT COUNT(*) FROM (SELECT * FROM lineitem).

Finally, ConnectorX will use the schema info as well as the count info to allocate memory and download data by executing the queries normally.

Once downloading begins, there will be one thread for each partition so that the data is downloaded in parallel at the partition level. Each thread will issue the query of its corresponding partition to the database and then write the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
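
To make this flow concrete, here is a rough Python sketch of the helper queries described above. It is only an illustration: the real logic lives in ConnectorX's Rust core, and the function names and exact range boundaries below are hypothetical.

def schema_query(query):
    # LIMIT 1 query used to infer the schema of the result set
    return f"{query} LIMIT 1"

def range_query(query, partition_on):
    # MIN/MAX query used to determine the partition column's value range
    return f"SELECT MIN({partition_on}), MAX({partition_on}) FROM ({query})"

def count_query(query, predicate=None):
    # COUNT query used to pre-allocate the destination (per partition if a predicate is given)
    base = f"SELECT COUNT(*) FROM ({query})"
    return f"{base} WHERE {predicate}" if predicate else base

def partition_queries(query, partition_on, lo, hi, num):
    # Evenly split [lo, hi] into `num` ranges; each range is fetched by its own thread
    step = (hi - lo) // num + 1
    parts = []
    for i in range(num):
        start = lo + i * step
        end = min(start + step, hi + 1)
        parts.append(
            f"SELECT * FROM ({query}) "
            f"WHERE {partition_on} >= {start} AND {partition_on} < {end}"
        )
    return parts

# Example: split the lineitem scan on l_orderkey into 3 partitions
for q in partition_queries("SELECT * FROM lineitem", "l_orderkey", 1, 60000000, 3):
    print(q)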

This mechanism implies that having an index on the partition column is recommended to make full use of the parallel downloading power provided by ConnectorX.

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here. For more planned data sources, please check out our discussion.

Sources

  • Postgres
  • MySQL
  • SQLite
  • Redshift (through the Postgres protocol)
  • ClickHouse (through the MySQL protocol)
  • SQL Server (no encryption support yet)
  • Oracle
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin (through Pandas)
  • Dask (through Pandas)
  • Polars (through PyArrow)
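
For example, the destination is selected with the return_type argument of read_sql. A minimal sketch (the connection string is a placeholder, and the pyarrow/polars packages must be installed for these return types):

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

# Load into a PyArrow Table instead of the default Pandas DataFrame
table = cx.read_sql(postgres_url, query, return_type="arrow")

# Load directly into a Polars DataFrame (built on top of PyArrow)
df = cx.read_sql(postgres_url, query, return_type="polars")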

Detailed Usage and Examples

Rust docs: stable | nightly

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query and download the data from the database into a dataframe (Pandas by default).

Parameters

  • conn: str: Connection string URI. Supported URI scheme: (postgres|postgresql|mysql|mssql|sqlite)://username:password@addr:port/dbname.
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from the source; the default is binary. Check out here for more details.
  • partition_on: Optional[str]: The column to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column (see the last example below).
  • partition_num: Optional[int]: The number of partitions to generate.

Examples

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
    
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL query on the partition column (partition_range will be queried automatically if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
    
  • Read a DataFrame in parallel using 2 threads by manually providing two partition SQL queries (the schemas of all the query results should be the same)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
    
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = f"""
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
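
  • Read a DataFrame in parallel while also providing partition_range explicitly, so ConnectorX does not need to query the MIN/MAX of the partition column itself (a sketch; the range values below are illustrative)

    import connectorx as cx

    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"

    # the (min, max) values of l_orderkey are supplied up front
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4, partition_range=(1, 60000000))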
    

Next Plan

Check out our discussion to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Developer's Guide

Please see Developer's Guide for information about developing ConnectorX.

Support

You are always welcome to:

  1. Ask questions on Stack Overflow. Make sure to tag them with #connectorx.
  2. Ask questions and propose new ideas in our forum.
  3. Ask questions, join the discussion, and send us direct messages in our Discord (under the CONNECTOR category).
