
chDB is an in-process SQL OLAP Engine powered by ClickHouse

Project description


chDB

chDB is an in-process SQL OLAP Engine powered by ClickHouse. [^1] For more details, see The birth of chDB.

Features

  • In-process SQL OLAP Engine, powered by ClickHouse
  • No need to install ClickHouse
  • Minimized data copying from C++ to Python with Python memoryview
  • Input & output support for Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats; see samples
  • Supports Python DB-API 2.0; see example

Arch

Get Started

Get started with chDB using our Installation and Usage Examples.


Installation

Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).

pip install chdb
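
To confirm that the wheel imports and runs, a quick sanity check from the shell (a minimal sketch; it only uses chdb.query as shown in the examples below):

python3 -c "import chdb; print(chdb.query('SELECT version()', 'CSV'))"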

Usage

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty

Data Input

The following methods are available to access on-disk and in-memory data formats:

🗂️ Query On File

(Parquet, CSV, JSON, Arrow, ORC, and 60+ more)

You can execute SQL and return data in the desired format.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more output format examples in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV');  print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')

🗂️ Query On Table

(Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame

import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                  tbl1=df1, tbl2=df2)
print(ret_tbl)
# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

🗂️ Query with Stateful Session

from chdb import session as chs

## Create DB, Table, View in temp session, auto cleanup when session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query(
    "CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;"
)
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))

see also: test_stateful.py.
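
Sessions can also be pointed at a directory so data survives the process; a minimal sketch, assuming Session accepts a filesystem path (the temporary, auto-cleaned session above is the documented default):

from chdb import session as chs

# Assumption: passing a path stores the database files under that directory
# instead of a temporary location, so they persist across runs.
sess = chs.Session("/tmp/chdb_demo")
sess.query("CREATE DATABASE IF NOT EXISTS db_demo ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_demo.t (x String, y Int) ENGINE = Log")
sess.query("INSERT INTO db_demo.t VALUES ('a', 1), ('b', 2)")
print(sess.query("SELECT * FROM db_demo.t", "CSV"))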

🗂️ Query with Python DB-API 2.0

import chdb.dbapi as dbapi
print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()
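
Because the cursor follows DB-API 2.0, the usual fetchall()/iteration pattern should also work; a short sketch (system.numbers is a built-in ClickHouse table, and fetchall is assumed to behave as the DB-API specifies):

import chdb.dbapi as dbapi

conn = dbapi.connect()
cur = conn.cursor()
# Fetch a few rows and iterate over them, DB-API style
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()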

🗂️ Query with UDF (User Defined Functions)

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)"))

Some notes on the chDB Python UDF (User Defined Function) decorator:

  1. The function should be stateless, so only UDFs are supported, not UDAFs (User Defined Aggregation Functions).
  2. The default return type is String. If you want to change it, pass the return type as an argument to the decorator (see the sketch after these notes). The return type should be one of the ClickHouse data types: https://clickhouse.com/docs/en/sql-reference/data-types
  3. The function should take arguments of type String: the input is TabSeparated, so all arguments are strings.
  4. The function will be called for each line of input. Something like this:
    def sum_udf(lhs, rhs):
        return int(lhs) + int(rhs)
    
    for line in sys.stdin:
        args = line.strip().split('\t')
        lhs = args[0]
        rhs = args[1]
        print(sum_udf(lhs, rhs))
        sys.stdout.flush()
    
  5. The function should be a pure Python function. You SHOULD import all Python modules used INSIDE THE FUNCTION.
    def func_use_json(arg):
        import json
        ...
    
  6. The Python interpreter used is the same as the one running the script, as given by sys.executable.
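
Following note 2, a sketch of overriding the default String return type; the keyword name return_type is an assumption based on that note, and Int32 is just an example type:

from chdb.udf import chdb_udf
from chdb import query

# Assumption: the decorator takes the ClickHouse type name as return_type
@chdb_udf(return_type="Int32")
def mul_udf(lhs, rhs):
    return int(lhs) * int(rhs)

print(query("select mul_udf(6, 7)"))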

see also: test_udf.py.

🗂️ Python Table Engine

Query on Pandas DataFrame

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

Query on Arrow Table

import chdb
import pyarrow as pa
arrow_table = pa.table(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query(
    "SELECT b, sum(a) FROM Python(arrow_table) GROUP BY b ORDER BY b", "debug"
).show()

Query on chdb.PyReader class instance

  1. You must inherit from the chdb.PyReader class and implement the read method.
  2. The read method should:
    1. return a list of lists: the first dimension is the column and the second dimension is the row; the column order should be the same as the first argument col_names of read.
    2. return an empty list when there is no more data to read.
    3. be stateful; the cursor should be updated in the read method.
  3. An optional get_schema method can be implemented to return the schema of the table. The prototype is def get_schema(self) -> List[Tuple[str, str]]:; the return value is a list of tuples, each containing the column name and the column type. The column type should be one of the ClickHouse data types: https://clickhouse.com/docs/en/sql-reference/data-types

import chdb

class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
        print("Python func read", col_names, count, self.cursor)
        # All rows have been consumed: return an empty list to signal end of data
        if self.cursor >= len(self.data["a"]):
            return []
        # Return the requested columns in col_names order; this simple reader
        # ignores `count` and emits all rows in a single block
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block

reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query(
    "SELECT b, sum(a) FROM Python(reader) GROUP BY b ORDER BY b"
).show()

see also: test_query_py.py.

Limitations

  1. Column types supported: pandas.Series, pyarrow.array, chdb.PyReader
  2. Data types supported: Int, UInt, Float, String, Date, DateTime, Decimal
  3. Python Object type will be converted to String (see the sketch after this list)
  4. Pandas DataFrame gives the best performance; Arrow Table performs better than PyReader
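
A minimal sketch illustrating limitation 3: a pandas column with dtype object (the mixed values here are made up for illustration) is exposed to SQL as String, so the reported type should be a String variant:

import chdb
import pandas as pd

# "mixed" has dtype object, so chDB maps it to String (limitation 3)
df = pd.DataFrame({"mixed": [1, "two", 3.5]})
chdb.query("SELECT mixed, toTypeName(mixed) AS t FROM Python(df)").show()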

For more examples, see examples and tests.


Demos and Examples

Benchmark

Documentation

Events

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Here are some ways you can help:

  • Help test and report bugs
  • Help improve documentation
  • Help improve code quality and performance

Bindings

We welcome bindings for other languages; please refer to bindings for more details.

License

Apache 2.0, see LICENSE for more information.

Acknowledgments

chDB is mainly based on ClickHouse. [^1] For trademark and other reasons, I named it chDB.

Contact


[^1]: ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

chdb-2.1.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140.7 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64

chdb-2.1.0-cp312-cp312-macosx_11_0_arm64.whl (78.1 MB)

Uploaded: CPython 3.12, macOS 11.0+ ARM64

chdb-2.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140.7 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64

chdb-2.1.0-cp311-cp311-macosx_11_0_arm64.whl (78.1 MB)

Uploaded: CPython 3.11, macOS 11.0+ ARM64

chdb-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140.7 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64

chdb-2.1.0-cp310-cp310-macosx_11_0_arm64.whl (78.1 MB)

Uploaded: CPython 3.10, macOS 11.0+ ARM64

chdb-2.1.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140.7 MB)

Uploaded: CPython 3.9, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64

chdb-2.1.0-cp39-cp39-macosx_11_0_arm64.whl (78.1 MB)

Uploaded: CPython 3.9, macOS 11.0+ ARM64

chdb-2.1.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140.7 MB)

Uploaded: CPython 3.8, manylinux: glibc 2.17+ x86-64, manylinux: glibc 2.5+ x86-64

File hashes

chdb-2.1.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256      2ec72b2966fc03b30410a88f23ec8254994c4a7e64d62f9559d371e6a9f3888f
  MD5         d8546fca01350d2161e44a039b160f2f
  BLAKE2b-256 4dae73cb5dcf08295d0c4de20e2fa34d14d9b3b7132922fc13ab945944b75206

chdb-2.1.0-cp312-cp312-macosx_11_0_arm64.whl
  SHA256      eb2e0acce7236d7a0d3ae7caba0cf35227d452ea1cb7f39dd54c3a034d955d28
  MD5         258c1bd9a74d177fc3d21f25225c52c1
  BLAKE2b-256 ae58802f6f3ff2767d33e9705c4d92957d4e95775d9380bb422442c8d285a9cd

chdb-2.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256      090f4ce39b006a9e15cd23b96555d173efdcb7b0856909904c92fd516ad7121a
  MD5         b7eaf047be4b7778df23f6d0550558ea
  BLAKE2b-256 f998fcb3bd8a832b3531187d2d8f46471494f9247a3ae921916a938a36e2cf05

chdb-2.1.0-cp311-cp311-macosx_11_0_arm64.whl
  SHA256      a8e6d4c2ce5e8a951e141147ea037e9841882eaa6cca5245ab0478028db94a6c
  MD5         e4aca6db267eba5e424e14809e8dd5c1
  BLAKE2b-256 f4dcfbd86accfbdfaddd2119571c93b2c1527a223012d39944a96559454c2c23

chdb-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256      3bc19147b082a1535a5ee2551335a3d74952d7f993ead093866adfd86bdf9bd9
  MD5         5fafb31993521bbb78992e42e2fe89b8
  BLAKE2b-256 6d890db377aaac3440fb1028ccdb42768eba7e18875410efe7de48b55ebaac18

chdb-2.1.0-cp310-cp310-macosx_11_0_arm64.whl
  SHA256      3130c279b04da176abdcc85ce6f7898ecd3becabbdcf5509f79eee87aa3b7c81
  MD5         645449e2671ba68626250c2c7961f14b
  BLAKE2b-256 b5b57ac847b69cfb8442880da541a824f28723ebacce244965cb6a2375d6283d

chdb-2.1.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256      a2e10b081d32ec6540cd1adce96db4c036ea975cc3169d924d8dcbb5c88edfd6
  MD5         72af973f5d1ab5836a832b365b3d015f
  BLAKE2b-256 19ffca031434358d49d8f619c6ed5837c41efa637d46c8329d54274cc1802368

chdb-2.1.0-cp39-cp39-macosx_11_0_arm64.whl
  SHA256      bc5169cffb11a49a90d483026cf47d795945fc5c1fec3beeec79d7f466cf7355
  MD5         14bcb6f420ff334b833e5653a0061370
  BLAKE2b-256 901106a77e1b810a2147f90acd307fca78951f6f9add443dd7a835f6d890b0ad

chdb-2.1.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256      ff2a3a0dabbf1635623df403b9aac1a15a8f899b8c6e41d30ea960b00fd8fc93
  MD5         e016e18102a73daa4df74b9d67e3ae85
  BLAKE2b-256 c64857107ef794c5a7aa7f910f2399db8a73227d3fecd23b1780add05ae25bac
