Skip to main content

Cassandra client-side join operation for Python

Project description

Cassandra-Join-Library

Cassandra Join Operation Extension for Python. Client-sided and runs on single machine.

Latest Version : 0.3.1

Requirements :

  1. Python >= 3.7
  2. Supporting Only Cassandra 4.0

Installation Guide

To install the package you will need pip, then run pip install cassandra-joinlib. Pip install will automatically install all dependencies and you are ready to go. 🍻

Usage Guide

It is actually pretty simple to use this library, this is the general steps for you to follow:

  1. This library works as an executor for join operation, so it focuses on JoinExecutor class which has 2 types that you can utilize based on your needs:

    • If you need an equi join, it is better to use HashJoinExecutor. HashJoinExecutor is importable from cassandra_joinlib.hash_join
    • For non equi-join, you may use NestedJoinExecutor (Hash join doesn't support non equi-join). Import NestedJoinExecutor from cassandra_joinlib.nested_join.
  2. Both HashJoinExecutor and NestedJoinExecutor requires two parameters to be initialized, your connection with cassandra session and the name of the keyspace keyspace_name.

    Example: HashJoinExecutor(session, keyspace_name).

  3. Once you have initialized the executor, you may add join operation/operations using these functions:

    • join(TableInfo left_table_info, TableInfo, right_table_info, str operator) for inner join
    • rightJoin(TableInfo left_table_info, TableInfo right_table_info, str operator) for right outer join
    • leftJoin(TableInfo, left_table_info, TableInfo right_table_info, str operator) for left outer join
    • fullOuterJoin(TableInfo left_table_info, TableInfo right_table_info, str operator) for full outer join
  4. We need TableInfo object for our join function. To initialize it, we need 2 params which are name of the table and name of the join column for that table. TableInfo can be imported from cassandra_joinlib.commands.

    Example: Let's say we want to (inner) join table user and table payment_received based on join column email using HashJoinExecutor.

    left_table_info = TableInfo('user', 'email')
    right_table_info = TableInfo('payment_received', 'email')
    HashJoinExecutor(session, keyspace_name).join(left_table_info, right_table_info)
    
  5. Each of the join function returns join executor itself, so you may add a chained join operation like this:

    NestedJoinExecutor(session, keyspace_name) \
    .join(left_table_info, right_table_info, "<") \
    .fullOuterJoin(left_table_info_2, right_table_info2, "=")
    
  6. After you add join operation/operations to the executor, it actually queue the join command(s). You may execute the join operation when you think you are ready by using .execute() from the join executor and do not forget to save the join result with .save_result(filename).

    Example:

    HashJoinExecutor(session, keyspace_name) \
    .join(left_tbl_info, right_tbl_info) \
    .execute() \
    .save_result("my_first_join")
    

    Note that the result will be saved in .txt file in JSON Format

  7. Run the python script and wait until it finishes execution.

  8. You may also print the join result by using printJoinResult(filename, max_buffer_size) which can be imported from cassandra_joinlib.utils. Param max_buffer_size is the number of rows that will be printed per tabulate and has 10000 as a default value. Print function should be called after .execute() function.

  9. That's it, you're done 🥳 🥳 🥳


Hey... you can also get join execution time with .get_time_elapsed() called on the join executor after .execute() function.

Important things you need to know

This library executes any chained-join with left deep join method. It means that result of first join operation would be the left table of the second join operation, the result of second join operation would be the left table of the third join operation, and so on... Left deep join tree

For NON-first join, make sure you use the result from the previous join as left table.

Example:

tableinfo1_L = TableInfo("user", "email")
tableinfo1_R = TableInfo("payment_received", "email")

tableinfo2_L = TableInfo("user", "userid")
tableinfo2_R = TableInfo("user_item_like", "userid")

HashJoinExecutor(session, keyspace_name) \
.fullOuterJoin(tableinfo1_L, tableinfo1_R) \
.join(tableinfo2_L, tableinfo2_R) \
.execute() \
.save_result("chained_hash_join")

Notice that on left table info for second join operation tableinfo2_L, we can use user table since user and payment_received are in the result of the previous join operation.

Cautions and Notes

cassandra-joinlib will save some/all the incoming data from Cassandra due to the big size that cannot be fit into memory entirely. Both HashJoinExecutor and NestedJoinExecutor have partitioning mechanism and save the partitioned data inside a temporary folder called tmpfolder. The result of the join operation will be saved inside the result folder.

Note that both tmpfolder and result will be created inside your current working directory, please make sure you have enough space available for them.

Author : Rafi Adyatma

Initially developed for my final project at Bandung Institute of Technology

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cassandra_joinlib-0.3.1.tar.gz (26.0 kB view hashes)

Uploaded Source

Built Distribution

cassandra_joinlib-0.3.1-py3-none-any.whl (27.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page