Skip to main content

Load data from databases to dataframes, the fastest way.

Project description

ConnectorX status docs

Load data from to , the fastest way.

For more data sources, please check out our discussion.

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function will partition the query by evenly splitting the specified column to the amount of partitions. ConnectorX will assign one thread for each partition to load and write data in parallel. Currently, we support partitioning on integer columns for SPJA queries.

Check out more detailed usage and examples here.

Installation

pip install connectorx

Performance

We compared different solutions in Python that provides the read_sql function, by loading a 10x TPC-H lineitem table (8.6GB) from Postgres into a DataFrame, with 4 cores parallelism.

Time chart, lower is better.

time chart

Memory consumption chart, lower is better.

memory chart

In conclusion, ConnectorX uses up to 3x less memory and 21x less time. More on here.

How does ConnectorX achieve a lightening speed while keeping the memory footprint low?

We observe that existing solutions more or less do data copy multiple times when downloading the data. Additionally, implementing a data intensive application in Python brings additional cost.

ConnectorX is written in Rust and follows "zero-copy" principle. This allows it to make full use of the CPU by becoming cache and branch predictor friendly. Moreover, the architecture of ConnectorX ensures the data will be copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving the query, e.g. SELECT * FROM lineitem, ConnectorX will first issue a LIMIT 1 query SELECT * FROM lineitem LIMIT 1 to get the schema of the result set. Then, if partition_on is specified, ConnectorX will issue SELECT MIN(\$partition_on), MAX(\$partition_on) FROM (SELECT * FROM lineitem) to know the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE \$partition_on > 0 AND \$partition_on < 10000. ConnectorX will then run a count query to get the partition size (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE \$partition_on > 0 AND \$partition_on < 10000). If the partition is not specified, the count query will be SELECT * FROM (SELECT * FROM lineitem). Finally, ConnectorX will use the schema info as well as the count info to allocate memory and download data by executing the queries normally.

Once the downloading begins, there will be one thread for each partition so that the data are downloaded in parallel at the partition level. The thread will issue the query of the corresponding partition to the database and then write the returned data to the destination row-wise or column-wise (depends on the database) in a streaming fashion.

This mechanism implies that having an index on the partition column is recommended to make full use of the parallel downloading power provided by ConnectorX.

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here.

Sources

  • Postgres
  • Mysql
  • Sqlite
  • Redshift (through postgres protocol)
  • Clickhouse (through mysql protocol)
  • SQL Server
  • Oracle
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin
  • Dask
  • Polars

Detailed Usage and Examples

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query, download the data from database into a Pandas dataframe.

Parameters

  • conn: str: Connection string uri. Currently only PostgreSQL is supported.
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
  • partition_on: Optional[str]: The column to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column.
  • partition_num: Optioinal[int]: The number of partitions to generate.

Examples

  • Read a DataFrame from a SQL using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
    
  • Read a DataFrame parallelly using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be automatically queried if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
    
  • Read a DataFrame parallelly using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be same)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
    
  • Read a DataFrame parallelly using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = f"""
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
    

Next Plan

Checkout our discussions to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

connectorx-0.2.0-cp39-cp39-win_amd64.whl (3.2 MB view details)

Uploaded CPython 3.9 Windows x86-64

connectorx-0.2.0-cp39-cp39-manylinux2014_x86_64.whl (3.4 MB view details)

Uploaded CPython 3.9

connectorx-0.2.0-cp39-cp39-macosx_10_15_intel.whl (4.2 MB view details)

Uploaded CPython 3.9 macOS 10.15+ intel

connectorx-0.2.0-cp38-cp38-win_amd64.whl (3.2 MB view details)

Uploaded CPython 3.8 Windows x86-64

connectorx-0.2.0-cp38-cp38-manylinux2014_x86_64.whl (3.4 MB view details)

Uploaded CPython 3.8

connectorx-0.2.0-cp38-cp38-macosx_10_15_intel.whl (4.2 MB view details)

Uploaded CPython 3.8 macOS 10.15+ intel

connectorx-0.2.0-cp37-cp37m-win_amd64.whl (3.2 MB view details)

Uploaded CPython 3.7m Windows x86-64

connectorx-0.2.0-cp37-cp37m-manylinux2014_x86_64.whl (3.4 MB view details)

Uploaded CPython 3.7m

connectorx-0.2.0-cp37-cp37m-macosx_10_15_intel.whl (4.2 MB view details)

Uploaded CPython 3.7m macOS 10.15+ intel

File details

Details for the file connectorx-0.2.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 3.2 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 c88cc404c6c0d86b50777c36d70eeee15c8089a01ddbf76ee58965a72f8608d9
MD5 70c1369efc944573bcad7f7f1eae39a0
BLAKE2b-256 2b76ebde78eec3cab69fe3c5b19d5b1ae623c8ae5e0f91fc82d3c23a25403443

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp39-cp39-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.9
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 95d60238a2a992e12dc7f09d2e5c8654b43df4c0ccc1fe9b085b400e3b348784
MD5 bb83a970d2e78ace06461c4230c0df10
BLAKE2b-256 0d3bc4ed425b7ab3d16bb383ddca2dfaf1157f242845257c4dbba516445505bb

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp39-cp39-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp39-cp39-macosx_10_15_intel.whl
  • Upload date:
  • Size: 4.2 MB
  • Tags: CPython 3.9, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp39-cp39-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 52d75ae9aefc079fb0f28b9df2072004b71c7424179af4e4750cec019ff22eb4
MD5 7eff9d42fee85a5c3cbabf30c3f7ab48
BLAKE2b-256 d8a7cbca8c46d7763f38ebc4d785b37ceb05787a7c5a0bfebf7b3cf2df726e94

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 3.2 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 426579388ca4714869839b710fa1c7c30d3607a97e7177231b3d2632cae7da5b
MD5 4a0248b9263a2d3df744030ff57f7766
BLAKE2b-256 5f9cf1812fc3350aadd8d13d712e22d11e319fd6fa449c996bd9bf75973135e3

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp38-cp38-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0411fc8d23c9d8589e54fab634427a8391447c025818ce64644425badf978779
MD5 489cf0ba259fc95f1508bde462e89cc4
BLAKE2b-256 f802bed50f6eee02bb0df40dfc69f66116ddcd964d643acbccfb32d5c2c8c00a

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp38-cp38-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp38-cp38-macosx_10_15_intel.whl
  • Upload date:
  • Size: 4.2 MB
  • Tags: CPython 3.8, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp38-cp38-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 1752cb9aa84e19b0d136577fe644291f523a8b58872790fa1f94c2eb747f9441
MD5 2457f0bbdee3ce7a189908ba64cb65d0
BLAKE2b-256 efc9faf55e02dacfd4059cd2bfeadea3d7dd61a32fe8dd29c78818cdaec445bb

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 3.2 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 1edb4467d1e88ef90a7f85bcd279cc3eef0f95f61c3f7f4b4c2ccd8e06bce0b1
MD5 7f188d236d580f91915687d618f8eb3c
BLAKE2b-256 069e80076b90d7853a700296d3deb212546e556c451c9a035e91ea0b54716b0d

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp37-cp37m-manylinux2014_x86_64.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp37-cp37m-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 81a835f3398ec76a0e4e7e80442566557312a7515681adebde6130c86026b567
MD5 0b87b8f3b792484fc48434382a4b0d22
BLAKE2b-256 680819283abeab18fc180366061b08e002d4eef30458bbe0bca7ad2db2e2d331

See more details on using hashes here.

Provenance

File details

Details for the file connectorx-0.2.0-cp37-cp37m-macosx_10_15_intel.whl.

File metadata

  • Download URL: connectorx-0.2.0-cp37-cp37m-macosx_10_15_intel.whl
  • Upload date:
  • Size: 4.2 MB
  • Tags: CPython 3.7m, macOS 10.15+ intel
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10

File hashes

Hashes for connectorx-0.2.0-cp37-cp37m-macosx_10_15_intel.whl
Algorithm Hash digest
SHA256 41d81a6ef63fedd6b9f6ea081537a5c53ad6654278427d822c6f5e1108547934
MD5 9d8a084ef724960204711537b5ad808e
BLAKE2b-256 aa67ae0aa711d826ee2c120b40413cb70d23fb2fa18ddd85d32282e527265563

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page