Cassandra client-side join operation for Python
Project description
Cassandra-Join-Library
Cassandra Join Operation Extension for Python. Client-sided and runs on single machine.
Latest Version : 0.3.1
Requirements :
- Python >= 3.7
- Supporting Only Cassandra 4.0
Installation Guide
To install the package you will need pip
, then run pip install cassandra-joinlib
.
Pip install will automatically install all dependencies and you are ready to go. 🍻
Usage Guide
It is actually pretty simple to use this library, this is the general steps for you to follow:
-
This library works as an executor for join operation, so it focuses on
JoinExecutor
class which has 2 types that you can utilize based on your needs:- If you need an equi join, it is better to use
HashJoinExecutor
.HashJoinExecutor
is importable fromcassandra_joinlib.hash_join
- For non equi-join, you may use
NestedJoinExecutor
(Hash join doesn't support non equi-join). ImportNestedJoinExecutor
fromcassandra_joinlib.nested_join
.
- If you need an equi join, it is better to use
-
Both
HashJoinExecutor
andNestedJoinExecutor
requires two parameters to be initialized, your connection with cassandrasession
and the name of the keyspacekeyspace_name
.Example:
HashJoinExecutor(session, keyspace_name)
. -
Once you have initialized the executor, you may add join operation/operations using these functions:
join(TableInfo left_table_info, TableInfo, right_table_info, str operator)
for inner joinrightJoin(TableInfo left_table_info, TableInfo right_table_info, str operator)
for right outer joinleftJoin(TableInfo, left_table_info, TableInfo right_table_info, str operator)
for left outer joinfullOuterJoin(TableInfo left_table_info, TableInfo right_table_info, str operator)
for full outer join
-
We need
TableInfo
object for our join function. To initialize it, we need 2 params which are name of the table and name of the join column for that table.TableInfo
can be imported fromcassandra_joinlib.commands
.Example: Let's say we want to (inner) join table
user
and tablepayment_received
based on join columnemail
usingHashJoinExecutor
.left_table_info = TableInfo('user', 'email') right_table_info = TableInfo('payment_received', 'email') HashJoinExecutor(session, keyspace_name).join(left_table_info, right_table_info)
-
Each of the join function returns join executor itself, so you may add a chained join operation like this:
NestedJoinExecutor(session, keyspace_name) \ .join(left_table_info, right_table_info, "<") \ .fullOuterJoin(left_table_info_2, right_table_info2, "=")
-
After you add join operation/operations to the executor, it actually queue the join command(s). You may execute the join operation when you think you are ready by using
.execute()
from the join executor and do not forget to save the join result with.save_result(filename)
.Example:
HashJoinExecutor(session, keyspace_name) \ .join(left_tbl_info, right_tbl_info) \ .execute() \ .save_result("my_first_join")
Note that the result will be saved in
.txt
file in JSON Format -
Run the python script and wait until it finishes execution.
-
You may also print the join result by using
printJoinResult(filename, max_buffer_size)
which can be imported fromcassandra_joinlib.utils
. Parammax_buffer_size
is the number of rows that will be printed pertabulate
and has10000
as a default value. Print function should be called after.execute()
function. -
That's it, you're done 🥳 🥳 🥳
Hey... you can also get join execution time with .get_time_elapsed()
called on the join executor after .execute()
function.
Important things you need to know
This library executes any chained-join with left deep join method. It means that result of first join operation would be the left table of the second join operation, the result of second join operation would be the left table of the third join operation, and so on...
For NON-first join, make sure you use the result from the previous join as left table.
Example:
tableinfo1_L = TableInfo("user", "email") tableinfo1_R = TableInfo("payment_received", "email") tableinfo2_L = TableInfo("user", "userid") tableinfo2_R = TableInfo("user_item_like", "userid") HashJoinExecutor(session, keyspace_name) \ .fullOuterJoin(tableinfo1_L, tableinfo1_R) \ .join(tableinfo2_L, tableinfo2_R) \ .execute() \ .save_result("chained_hash_join")
Notice that on left table info for second join operation tableinfo2_L
, we can use user
table since user
and payment_received
are in the result of the previous join operation.
Cautions and Notes
cassandra-joinlib
will save some/all the incoming data from Cassandra due to the big size that cannot be fit into memory entirely.
Both HashJoinExecutor
and NestedJoinExecutor
have partitioning mechanism and save the partitioned data inside a temporary folder called tmpfolder
.
The result of the join operation will be saved inside the result
folder.
Note that both tmpfolder
and result
will be created inside your current working directory, please make sure you have enough space available for them.
Author : Rafi Adyatma
Initially developed for my final project at Bandung Institute of Technology
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file cassandra_joinlib-0.3.1.tar.gz
.
File metadata
- Download URL: cassandra_joinlib-0.3.1.tar.gz
- Upload date:
- Size: 26.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bc0ae017b7c2cd73f66bea6e75bd4524540fd82403b6bb373281ed02e76dced4 |
|
MD5 | 9ec17ba214b057907b942120c793e2b1 |
|
BLAKE2b-256 | ca37c1b01a4f5855cd219f3afac3debb1ad00e62b35dbf8ed16fc3132a84f6e4 |
File details
Details for the file cassandra_joinlib-0.3.1-py3-none-any.whl
.
File metadata
- Download URL: cassandra_joinlib-0.3.1-py3-none-any.whl
- Upload date:
- Size: 27.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 87d834e23bed24fa76d177f669d701606567d124b7e6487ff58810d7b26694e0 |
|
MD5 | aabedc04c8ea0e2b4a5086b70496c479 |
|
BLAKE2b-256 | fd10bff5c2c7515ade56328edaa007c563084e3dff4cc899fdca42cbbdc54c05 |