
Load data from databases to dataframes, the fastest way.

Project description


ConnectorX enables you to load data from databases into Python in the fastest and most memory-efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function partitions the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX assigns one thread per partition to load and write data in parallel. Currently, we support partitioning on numerical columns (which cannot contain NULL) for SPJA queries.
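As a rough illustration of the even split, here is a minimal Python sketch of how a column's [min, max] range can be divided into partitions. `split_range` is an illustrative helper, not part of the connectorx API, and the exact boundary handling inside ConnectorX may differ.

```python
# Illustrative sketch (not connectorx internals): evenly split a numeric
# column's [min_val, max_val] range into partition_num contiguous chunks.

def split_range(min_val: int, max_val: int, partition_num: int):
    """Return inclusive (lower, upper) bounds covering [min_val, max_val]."""
    length = max_val - min_val + 1
    size = (length + partition_num - 1) // partition_num  # ceiling division
    bounds = []
    lo = min_val
    while lo <= max_val:
        hi = min(lo + size - 1, max_val)
        bounds.append((lo, hi))
        lo = hi + 1
    return bounds

# Split l_orderkey values 1..10000 into 4 partitions
print(split_range(1, 10_000, 4))
# → [(1, 2500), (2501, 5000), (5001, 7500), (7501, 10000)]
```

Each tuple would then become the WHERE-clause range of one partitioned query.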

Check out more detailed usage and examples here. A general introduction of the project can be found in this blog post.

Installation

pip install connectorx

Performance

We compared several Python solutions that provide a read_sql function by loading a 10x TPC-H lineitem table (8.6 GB) from Postgres into a DataFrame with 4-core parallelism.

Time chart, lower is better.

[time chart]

Memory consumption chart, lower is better.

[memory chart]

In conclusion, ConnectorX uses up to 3x less memory and 21x less time (3x less memory and 13x less time compared with Pandas). More details here.

How does ConnectorX achieve lightning speed while keeping the memory footprint low?

We observe that existing solutions copy the data multiple times, more or less, while downloading it. Additionally, implementing a data-intensive application in Python incurs extra overhead.

ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures the data is copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving the query, e.g. SELECT * FROM lineitem, ConnectorX first issues a LIMIT 1 query, SELECT * FROM lineitem LIMIT 1, to get the schema of the result set.

Then, if partition_on is specified, ConnectorX issues SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem) to learn the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000. ConnectorX then runs a count query to get the size of each partition (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000). If partition_on is not specified, the count query is simply SELECT COUNT(*) FROM (SELECT * FROM lineitem).

Finally, ConnectorX uses the schema and count information to allocate memory, and downloads the data by executing the queries normally.

Once downloading begins, one thread is created per partition so that the data is downloaded in parallel at the partition level. Each thread issues the query for its partition to the database and then writes the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
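The helper queries described above can be sketched as plain string templates. This is illustrative only: `helper_queries` is not a connectorx internal, and the exact inequalities at partition boundaries may differ from what ConnectorX generates.

```python
# Illustrative sketch of the schema, min/max, count, and partition queries
# ConnectorX is described as issuing for a given user query.

def helper_queries(query: str, partition_on: str, lower: int, upper: int):
    inner = f"({query}) AS t"
    where = f"WHERE {partition_on} >= {lower} AND {partition_on} < {upper}"
    return {
        "schema": f"SELECT * FROM {inner} LIMIT 1",
        "range": f"SELECT MIN({partition_on}), MAX({partition_on}) FROM {inner}",
        "count": f"SELECT COUNT(*) FROM {inner} {where}",
        "partition": f"SELECT * FROM {inner} {where}",
    }

plan = helper_queries("SELECT * FROM lineitem", "l_orderkey", 0, 10_000)
print(plan["partition"])
```

Each partition's "partition" query is what a worker thread would execute against the database.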

How to specify the partition number?

partition_num determines how many queries the original one is split into and issued to the database. Under the hood, we use rayon as our parallel executor, which maintains a pool of threads to handle the partitioned queries. The number of threads in the pool equals the number of logical cores on the machine. It is recommended to set partition_num to the number of available logical cores.
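A minimal sketch of this recommendation, assuming Python's standard `os.cpu_count()` reports the logical core count (the read_sql call is commented out since it needs a live database):

```python
import os

# Match partition_num to the machine's logical core count, as recommended.
partition_num = os.cpu_count() or 1

# import connectorx as cx
# cx.read_sql(postgres_url, query,
#             partition_on="l_orderkey", partition_num=partition_num)
```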

How to choose the partition column?

partition_on specifies the column used for partitioning, following the procedure above. To achieve the best performance, it is ideal for each partitioned query to return the same number of rows. Since we partition the column evenly, it is recommended that the numerical partition_on column be evenly distributed. Whether a column is indexed may also affect performance, depending on the source database, so give it a try if you have multiple candidates. You can also manually partition the query if our partition method does not match your needs; ConnectorX will still return a single dataframe combining the results of all the queries you pass in.
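For instance, a hand-rolled partition list can be built from custom boundaries and passed directly to read_sql. The `manual_partitions` helper here is hypothetical, just a convenience for generating range-based WHERE clauses:

```python
# Hypothetical helper: build one range query per pair of adjacent boundaries.
def manual_partitions(table: str, column: str, boundaries: list):
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} < {hi}"
        for lo, hi in zip(boundaries, boundaries[1:])
    ]

queries = manual_partitions("lineitem", "l_orderkey", [0, 30_000_000, 60_000_000])

# import connectorx as cx
# cx.read_sql(postgres_url, queries)  # list of queries runs in parallel
```

This gives you full control over where the partition boundaries fall, which helps when the column's values are skewed.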

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here. For more planned data sources, please check out our discussion.

Sources

  • Postgres
  • MySQL
  • MariaDB (through MySQL protocol)
  • SQLite
  • Redshift (through Postgres protocol)
  • ClickHouse (through MySQL protocol)
  • SQL Server
  • Azure SQL Database (through MSSQL protocol)
  • Oracle
  • BigQuery - In Progress
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin (through Pandas)
  • Dask (through Pandas)
  • Polars (through PyArrow)

Detailed Usage and Examples

Rust docs: stable nightly

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None, index_col: Optional[str] = None)

Run the SQL query and download the data from the database into a dataframe (Pandas by default).

Parameters

  • conn: str: Connection string URI.
    • General supported URI scheme: (postgres|postgresql|mysql|mssql)://username:password@addr:port/dbname.
    • For now, SQLite only supports absolute paths, e.g. sqlite:///home/user/path/test.db.
    • Please check out here for more connection URI parameters supported for each database (e.g. trusted_connection for MSSQL, sslmode for Postgres).
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
  • partition_on: Optional[str]: The column to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column.
  • partition_num: Optional[int]: The number of partitions to generate.
  • index_col: Optional[str]: The index column to set for the result dataframe. Only applicable when return_type is pandas, modin or dask.

Examples

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
    
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be queried automatically if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
    
  • Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
    
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = """
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
    

Next Plan

Check out our discussion to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Developer's Guide

Please see Developer's Guide for information about developing ConnectorX.

Supports

You are always welcome to:

  1. Ask questions on Stack Overflow. Make sure to tag them with #connectorx.
  2. Ask questions & propose new ideas in our forum.
  3. Ask questions, join the discussion & send us direct messages on our Discord (under the CONNECTOR category).

Organizations and Projects using ConnectorX

To add your project/organization here, reply to our post here.
