Load data from databases to dataframes, the fastest way.
Project description
ConnectorX
For more data sources, please check out our discussion.
ConnectorX enables you to load data from databases into Python in the fastest and most memory-efficient way.
What you need is one line of code:
```python
import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")
```
Optionally, you can accelerate the data loading using parallelism by specifying a partition column.
```python
import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)
```
The function will partition the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX assigns one thread to each partition to load and write data in parallel. Currently, partitioning is supported on integer columns for SPJA queries.
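The even split described above can be illustrated with a small Python sketch. This is an illustration only, not ConnectorX's internal (Rust) code; the function name and the inclusive-range convention are assumptions:

```python
def split_range(lo: int, hi: int, n: int) -> list[tuple[int, int]]:
    """Evenly split the inclusive value range [lo, hi] into n contiguous buckets,
    mirroring how a partition column's min/max range is divided across threads."""
    size, rem = divmod(hi - lo + 1, n)
    buckets = []
    start = lo
    for i in range(n):
        # The first `rem` buckets absorb one extra value each.
        end = start + size + (1 if i < rem else 0) - 1
        buckets.append((start, end))
        start = end + 1
    return buckets

# e.g. l_orderkey values 1..100 split into 4 partitions:
print(split_range(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each resulting bucket becomes one partition query, and one thread handles each bucket.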
Check out more detailed usage and examples here.
Installation
pip install connectorx
Performance
We compared several Python solutions that provide a read_sql function by loading a 10x TPC-H lineitem table (8.6 GB) from Postgres into a DataFrame with 4 cores of parallelism.
Time chart, lower is better.
Memory consumption chart, lower is better.
In conclusion, ConnectorX uses up to 3x less memory and 21x less time. More details here.
How does ConnectorX achieve lightning speed while keeping the memory footprint low?
We observe that existing solutions copy the data multiple times when downloading it. Additionally, implementing a data-intensive application in Python incurs extra overhead.
ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures that the data is copied exactly once, directly from the source to the destination.
How does ConnectorX download the data?
Upon receiving a query, e.g. `SELECT * FROM lineitem`, ConnectorX first issues a `LIMIT 1` query, `SELECT * FROM lineitem LIMIT 1`, to get the schema of the result set.

Then, if `partition_on` is specified, ConnectorX issues `SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem)` to learn the value range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. `SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000`. ConnectorX then runs a count query to get the size of each partition (e.g. `SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000`). If `partition_on` is not specified, the count query is `SELECT COUNT(*) FROM (SELECT * FROM lineitem)`.
Finally, ConnectorX will use the schema info as well as the count info to allocate memory and download data by executing the queries normally.
Once downloading begins, one thread is assigned to each partition so that the data is downloaded in parallel at the partition level. Each thread issues the query for its partition to the database and then writes the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
This mechanism implies that having an index on the partition column is recommended to make full use of the parallel downloading power provided by ConnectorX.
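The sequence of helper queries described above can be sketched in Python. This is a simplified illustration of the described behavior, not ConnectorX's actual Rust implementation; the function name, the `AS t` subquery alias, and the even-split strategy are assumptions (the per-partition count queries, which reuse the same WHERE clauses, are omitted for brevity):

```python
def plan_queries(query: str, partition_on: str, lo: int, hi: int, num: int) -> dict:
    """Sketch the metadata and per-partition queries for a partitioned read."""
    # Schema probe: fetch one row to learn the result set's column types.
    schema_q = f"SELECT * FROM ({query}) AS t LIMIT 1"
    # Range probe: learn the min/max of the partition column.
    range_q = f"SELECT MIN({partition_on}), MAX({partition_on}) FROM ({query}) AS t"
    size = -(-(hi - lo + 1) // num)  # ceiling division: values per partition
    partitions = []
    for i in range(num):
        start = lo + i * size
        end = min(start + size - 1, hi)
        partitions.append(
            f"SELECT * FROM ({query}) AS t "
            f"WHERE {partition_on} >= {start} AND {partition_on} <= {end}"
        )
    return {"schema": schema_q, "range": range_q, "partitions": partitions}

plan = plan_queries("SELECT * FROM lineitem", "l_orderkey", 1, 100, 2)
print(plan["range"])
# SELECT MIN(l_orderkey), MAX(l_orderkey) FROM (SELECT * FROM lineitem) AS t
```

Each string in `partitions` is then executed by its own thread, which is why an index on the partition column pays off.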
Supported Sources & Destinations
Supported protocols, data types and type mappings can be found here.
Sources
- Postgres
- MySQL
- SQLite
- Redshift (through the Postgres protocol)
- ClickHouse (through the MySQL protocol)
- SQL Server
- Oracle
- ...
Destinations
- Pandas
- PyArrow
- Modin
- Dask
- Polars
Detailed Usage and Examples
API
connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)
Run the SQL query and download the data from the database into a dataframe.
Parameters
- `conn: str`: Connection string URI.
- `query: Union[str, List[str]]`: SQL query or list of SQL queries for fetching data.
- `return_type: str = "pandas"`: The return type of this function. It can be `arrow`, `pandas`, `modin`, `dask` or `polars`.
- `protocol: str = "binary"`: The protocol used to fetch data from the source; the default is `binary`. Check out here to see more details.
- `partition_on: Optional[str]`: The column on which to partition the result.
- `partition_range: Optional[Tuple[int, int]]`: The value range of the partition column.
- `partition_num: Optional[int]`: The number of partitions to generate.
Examples
- Read a DataFrame from a SQL query using a single thread:

  ```python
  import connectorx as cx

  postgres_url = "postgresql://username:password@server:port/database"
  query = "SELECT * FROM lineitem"

  cx.read_sql(postgres_url, query)
  ```
- Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (`partition_range` will be automatically queried if not given):

  ```python
  import connectorx as cx

  postgres_url = "postgresql://username:password@server:port/database"
  query = "SELECT * FROM lineitem"

  cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
  ```
- Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same):

  ```python
  import connectorx as cx

  postgres_url = "postgresql://username:password@server:port/database"
  queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]

  cx.read_sql(postgres_url, queries)
  ```
- Read a DataFrame in parallel using 4 threads from a more complex query:

  ```python
  import connectorx as cx

  postgres_url = "postgresql://username:password@server:port/database"
  query = f"""
  SELECT l_orderkey,
         SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
         o_orderdate,
         o_shippriority
  FROM customer, orders, lineitem
  WHERE c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate > DATE '1995-03-15'
  GROUP BY l_orderkey, o_orderdate, o_shippriority
  """

  cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
  ```
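When the partition cutoffs are known up front, a query list like the one in the third example can also be generated programmatically. This helper is a convenience sketch, not part of the connectorx API; its name and cutoff convention are assumptions:

```python
def partition_queries(table: str, column: str, cutoffs: list[int]) -> list[str]:
    """Build non-overlapping partition SQLs from a sorted list of cutoff values.
    Each cutoff closes one partition (inclusive) and opens the next (exclusive)."""
    bounds = [None, *cutoffs, None]
    queries = []
    for lo, hi in zip(bounds, bounds[1:]):
        conds = []
        if lo is not None:
            conds.append(f"{column} > {lo}")
        if hi is not None:
            conds.append(f"{column} <= {hi}")
        queries.append(f"SELECT * FROM {table} WHERE {' AND '.join(conds)}")
    return queries

queries = partition_queries("lineitem", "l_orderkey", [30000000])
# The resulting list can be passed to read_sql, e.g.:
# cx.read_sql(postgres_url, queries)
```

As with `partition_on`, the schemas of all generated queries are identical, so the results can be merged into one dataframe.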
Next Plan
Check out our discussions to participate in deciding our next plan!
Historical Benchmark Results
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
File details
Details for the file connectorx-0.2.0-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: connectorx-0.2.0-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c88cc404c6c0d86b50777c36d70eeee15c8089a01ddbf76ee58965a72f8608d9 |
| MD5 | 70c1369efc944573bcad7f7f1eae39a0 |
| BLAKE2b-256 | 2b76ebde78eec3cab69fe3c5b19d5b1ae623c8ae5e0f91fc82d3c23a25403443 |
Provenance
File details
Details for the file connectorx-0.2.0-cp39-cp39-manylinux2014_x86_64.whl.
File metadata
- Download URL: connectorx-0.2.0-cp39-cp39-manylinux2014_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.9
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 95d60238a2a992e12dc7f09d2e5c8654b43df4c0ccc1fe9b085b400e3b348784 |
| MD5 | bb83a970d2e78ace06461c4230c0df10 |
| BLAKE2b-256 | 0d3bc4ed425b7ab3d16bb383ddca2dfaf1157f242845257c4dbba516445505bb |
Provenance
File details
Details for the file connectorx-0.2.0-cp39-cp39-macosx_10_15_intel.whl.
File metadata
- Download URL: connectorx-0.2.0-cp39-cp39-macosx_10_15_intel.whl
- Upload date:
- Size: 4.2 MB
- Tags: CPython 3.9, macOS 10.15+ intel
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52d75ae9aefc079fb0f28b9df2072004b71c7424179af4e4750cec019ff22eb4 |
| MD5 | 7eff9d42fee85a5c3cbabf30c3f7ab48 |
| BLAKE2b-256 | d8a7cbca8c46d7763f38ebc4d785b37ceb05787a7c5a0bfebf7b3cf2df726e94 |
Provenance
File details
Details for the file connectorx-0.2.0-cp38-cp38-win_amd64.whl.
File metadata
- Download URL: connectorx-0.2.0-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 426579388ca4714869839b710fa1c7c30d3607a97e7177231b3d2632cae7da5b |
| MD5 | 4a0248b9263a2d3df744030ff57f7766 |
| BLAKE2b-256 | 5f9cf1812fc3350aadd8d13d712e22d11e319fd6fa449c996bd9bf75973135e3 |
Provenance
File details
Details for the file connectorx-0.2.0-cp38-cp38-manylinux2014_x86_64.whl.
File metadata
- Download URL: connectorx-0.2.0-cp38-cp38-manylinux2014_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.8
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0411fc8d23c9d8589e54fab634427a8391447c025818ce64644425badf978779 |
| MD5 | 489cf0ba259fc95f1508bde462e89cc4 |
| BLAKE2b-256 | f802bed50f6eee02bb0df40dfc69f66116ddcd964d643acbccfb32d5c2c8c00a |
Provenance
File details
Details for the file connectorx-0.2.0-cp38-cp38-macosx_10_15_intel.whl.
File metadata
- Download URL: connectorx-0.2.0-cp38-cp38-macosx_10_15_intel.whl
- Upload date:
- Size: 4.2 MB
- Tags: CPython 3.8, macOS 10.15+ intel
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1752cb9aa84e19b0d136577fe644291f523a8b58872790fa1f94c2eb747f9441 |
| MD5 | 2457f0bbdee3ce7a189908ba64cb65d0 |
| BLAKE2b-256 | efc9faf55e02dacfd4059cd2bfeadea3d7dd61a32fe8dd29c78818cdaec445bb |
Provenance
File details
Details for the file connectorx-0.2.0-cp37-cp37m-win_amd64.whl.
File metadata
- Download URL: connectorx-0.2.0-cp37-cp37m-win_amd64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.7m, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1edb4467d1e88ef90a7f85bcd279cc3eef0f95f61c3f7f4b4c2ccd8e06bce0b1 |
| MD5 | 7f188d236d580f91915687d618f8eb3c |
| BLAKE2b-256 | 069e80076b90d7853a700296d3deb212546e556c451c9a035e91ea0b54716b0d |
Provenance
File details
Details for the file connectorx-0.2.0-cp37-cp37m-manylinux2014_x86_64.whl.
File metadata
- Download URL: connectorx-0.2.0-cp37-cp37m-manylinux2014_x86_64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.7m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 81a835f3398ec76a0e4e7e80442566557312a7515681adebde6130c86026b567 |
| MD5 | 0b87b8f3b792484fc48434382a4b0d22 |
| BLAKE2b-256 | 680819283abeab18fc180366061b08e002d4eef30458bbe0bca7ad2db2e2d331 |
Provenance
File details
Details for the file connectorx-0.2.0-cp37-cp37m-macosx_10_15_intel.whl.
File metadata
- Download URL: connectorx-0.2.0-cp37-cp37m-macosx_10_15_intel.whl
- Upload date:
- Size: 4.2 MB
- Tags: CPython 3.7m, macOS 10.15+ intel
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41d81a6ef63fedd6b9f6ea081537a5c53ad6654278427d822c6f5e1108547934 |
| MD5 | 9d8a084ef724960204711537b5ad808e |
| BLAKE2b-256 | aa67ae0aa711d826ee2c120b40413cb70d23fb2fa18ddd85d32282e527265563 |