Skip to main content

chDB is an in-process SQL OLAP Engine powered by ClickHouse

Project description

Build X86 PyPI Downloads Discord Twitter

chDB

chDB is an in-process SQL OLAP Engine powered by ClickHouse [^1] For more details: The birth of chDB

Features

  • In-process SQL OLAP Engine, powered by ClickHouse
  • No need to install ClickHouse
  • Minimized data copy from C++ to Python with python memoryview
  • Input&Output support Parquet, CSV, JSON, Arrow, ORC and 60+more formats, samples
  • Support Python DB API 2.0, example

Arch

Get Started

Get started with chdb using our Installation and Usage Examples


Installation

Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).

pip install chdb

Usage

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty

Data Input

The following methods are available to access on-disk and in-memory data formats:

🗂️ Query On File

(Parquet, CSV, JSON, Arrow, ORC and 60+)

You can execute SQL and return desired format data.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more data type format in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV');  print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')

🗂️ Query On Table

(Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame

import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                  tbl1=df1, tbl2=df2)
print(ret_tbl)
# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

🗂️ Query with Stateful Session

from chdb import session as chs

## Create DB, Table, View in temp session, auto cleanup when session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query(
    "CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;"
)
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))

see also: test_stateful.py.

🗂️ Query with Python DB-API 2.0

import chdb.dbapi as dbapi
print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()

🗂️ Query with UDF (User Defined Functions)

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)"))

Some notes on chDB Python UDF(User Defined Function) decorator.

  1. The function should be stateless. So, only UDFs are supported, not UDAFs(User Defined Aggregation Function).
  2. Default return type is String. If you want to change the return type, you can pass in the return type as an argument. The return type should be one of the following: https://clickhouse.com/docs/en/sql-reference/data-types
  3. The function should take in arguments of type String. As the input is TabSeparated, all arguments are strings.
  4. The function will be called for each line of input. Something like this:
    def sum_udf(lhs, rhs):
        return int(lhs) + int(rhs)
    
    for line in sys.stdin:
        args = line.strip().split('\t')
        lhs = args[0]
        rhs = args[1]
        print(sum_udf(lhs, rhs))
        sys.stdout.flush()
    
  5. The function should be pure python function. You SHOULD import all python modules used IN THE FUNCTION.
    def func_use_json(arg):
        import json
        ...
    
  6. Python interpertor used is the same as the one used to run the script. Get from sys.executable

see also: test_udf.py.

🗂️ Python Table Engine

Query on Pandas DataFrame

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

Query on Arrow Table

import chdb
import pyarrow as pa
arrow_table = pa.table(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query(
    "SELECT b, sum(a) FROM Python(arrow_table) GROUP BY b ORDER BY b", "debug"
).show()

Query on chdb.PyReader class instance

  1. You must inherit from chdb.PyReader class and implement the read method.
  2. The read method should:
    1. return a list of lists, the first demension is the column, the second dimension is the row, the columns order should be the same as the first arg col_names of read.
    2. return an empty list when there is no more data to read.
    3. be stateful, the cursor should be updated in the read method.
  3. An optional get_schema method can be implemented to return the schema of the table. The prototype is def get_schema(self) -> List[Tuple[str, str]]:, the return value is a list of tuples, each tuple contains the column name and the column type. The column type should be one of the following: https://clickhouse.com/docs/en/sql-reference/data-types
import chdb

class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
        print("Python func read", col_names, count, self.cursor)
        if self.cursor >= len(self.data["a"]):
            return []
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block

reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query(
    "SELECT b, sum(a) FROM Python(reader) GROUP BY b ORDER BY b"
).show()

see also: test_query_py.py.

Limitations

  1. Column types supported: pandas.Series, pyarrow.array, chdb.PyReader
  2. Data types supported: Int, UInt, Float, String, Date, DateTime, Decimal
  3. Python Object type will be converted to String
  4. Pandas DataFrame performance is all of the best, Arrow Table is better than PyReader

For more examples, see examples and tests.


Demos and Examples

Benchmark

Documentation

Events

Contributing

Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated. There are something you can help:

  • Help test and report bugs
  • Help improve documentation
  • Help improve code quality and performance

Bindings

We welcome bindings for other languages, please refer to bindings for more details.

License

Apache 2.0, see LICENSE for more information.

Acknowledgments

chDB is mainly based on ClickHouse [^1] for trade mark and other reasons, I named it chDB.

Contact


[^1]: ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

chdb-2.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (108.0 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ ARM64

chdb-2.0.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134.8 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

chdb-2.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (108.0 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ ARM64

chdb-2.0.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134.8 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

chdb-2.0.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (108.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

chdb-2.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134.8 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

chdb-2.0.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (108.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

chdb-2.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134.8 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

chdb-2.0.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (108.0 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

chdb-2.0.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134.8 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

File details

Details for the file chdb-2.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 a97256a31722b96415d0013871c1a3908ce227292773844237468e1039ded2f5
MD5 c64e4c9fb0d7326327a3f19e300dd6b6
BLAKE2b-256 d207b90b09bd3d34ae4cdf58cbad988d8f99800d8296e0809f595d4a9c670bbd

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c5e530092dfd2ea2591dd5abcfdf05c77c8a9f75420976cc8d319753f6c04673
MD5 873253b72d5a3854481baccf8a9833d8
BLAKE2b-256 7621b96676c93ded51fe0202ab32cf4616c34778a4d7285f60c6fc0260ac0f6a

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 1e80c715ca68a08b362ff42b40b73aebb95edbaf6267ed5f4a7422263fe378ad
MD5 8970d6255160caa0ee3a84fec1d45eb8
BLAKE2b-256 43622abcf6c35b778510c3cee81bbf65cb08a647cb2c58b9087678ff8f673439

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 cb397b3aafd239c2d5b13013b8a87cc8c637e89728cecb97ebd4b0ca1f7bc3d3
MD5 377019b14580e4722183b126bcdcd5a1
BLAKE2b-256 4467d3efd37396ac1b738cbc661f3b3a8f2b3e6d39a697522343e9aae396e24c

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 42a577f4cdbb42e451b541601422148cd72949fdd241a5a033e9bc5036eda2f1
MD5 58f9f5864a7d33ae822f701e43a0402b
BLAKE2b-256 59af10037e74767f6a2ed90d9f0609e1a0e6a66b67185874b9f28ad05fd7130c

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 16b69405929580da293d19d8c8715927679cecc8bf022719cf3fbcf4bcc78ad5
MD5 a17b4961ce7bd7841d1adc0de6e41b06
BLAKE2b-256 a9215e93577507a8bcbc34c5c4cba138050a524d500e1ecdfc41505cf347c4d1

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 311f6f83e621a1fa0afc004f5274badd3321acabf3f69ffd374f6493540fd021
MD5 bdf3d515910d898d489b7273cef82111
BLAKE2b-256 e5cad344f381b4adfdbd5ac70c8cac344275f3677577c0618c70e1d5b6252b40

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fad9a9ecd7078ef4d3ef475288c3daa1b780a774468ae8ea6d8679731aebbb45
MD5 fe0c9c2c0cda13ece4f5e464539603ef
BLAKE2b-256 a941bf78908f63291a0f43d3848496c1027bdd49ffedd9ba01b09307becefaa5

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 98ffae1afa82d9da851e1cc5e13efe4e201406c78210073243cbbcd0b3a171aa
MD5 6785e43e60bd78821f636367f5d6e9ac
BLAKE2b-256 d07eee535cacacdeaceedb45080200bd8a50a8bb27142538a75752a3e4508972

See more details on using hashes here.

File details

Details for the file chdb-2.0.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for chdb-2.0.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 081a28d8ef1778fc793fbca3ad9daa7c933e090c715441bb90caa4f72c3af5a8
MD5 e7e2a84c3af3748737ae8ab7714fc5b6
BLAKE2b-256 d62d1e7b513692a6e1ba4f6f078774f2fe7ca4b9ccecd5d318cec410fda620fd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page