COCOA: COrrelation COefficient-Aware Data Augmentation

Prerequisites

Python version and packages

This package requires Python version >= 3.6. Additionally, the following packages must be installed:

pandas
numpy

Database schemas

To run the system, the following tables must be created in the database (e.g. Postgres):

  • main_tokenized: Inverted index that maps each token (tokenized) to the table, column, and row in which it occurs.
  • distinct_tokens: Like main_tokenized, but stores only the distinct (tokenized, table_col_id) pairs so that overlap can be computed faster.
  • order_index: Order index for each column. The index is created offline and stored in the database.
  • max_column: Maps each table to its number of columns for quicker access.
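To illustrate how the tables above relate to each other, here is a minimal sketch that derives inverted-index rows from a small in-memory table. The `tokenize` helper, the `inverted_index_rows` function, and the `tableid_colid` format used for `table_col_id` are assumptions made for illustration only; the actual tokenization and ID format used by COCOA are not described here.

```python
def tokenize(value):
    """Hypothetical normalisation: convert to string, strip, lowercase."""
    return str(value).strip().lower()


def inverted_index_rows(tableid, columns):
    """Return (tokenized, tableid, colid, rowid, table_col_id) tuples.

    `columns` is a list of columns, each given as a list of cell values.
    The 'tableid_colid' layout of table_col_id is an assumption.
    """
    rows = []
    for colid, column in enumerate(columns):
        table_col_id = f"{tableid}_{colid}"
        for rowid, value in enumerate(column):
            rows.append((tokenize(value), tableid, colid, rowid, table_col_id))
    return rows


# One table (id 7) with two columns and two rows.
rows = inverted_index_rows(7, [["Berlin", "Paris "], [3.6, 2.1]])

# distinct_tokens keeps only the unique (tokenized, table_col_id) pairs.
distinct = sorted({(tok, tcid) for tok, _, _, _, tcid in rows})

# max_column keeps the largest column id per table.
max_col = max(colid for _, _, colid, _, _ in rows)
```

Each tuple in `rows` corresponds to one row of main_tokenized; `distinct` and `max_col` mirror what distinct_tokens and max_column store for this table.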

Schemas:

CREATE TABLE main_tokenized (
    tokenized TEXT,
    tableid INT NOT NULL,
    colid INT NOT NULL,
    rowid INT NOT NULL,
    table_col_id TEXT NOT NULL
);

CREATE TABLE distinct_tokens (
    tokenized TEXT, 
    table_col_id TEXT NOT NULL
);

CREATE TABLE order_index (
    table_col_id TEXT NOT NULL,
    is_numeric BOOLEAN,
    min_index INT NOT NULL,
    order_list TEXT,
    binary_list TEXT
);

CREATE TABLE max_column (
    tableid INT NOT NULL,
    max_col_id INT NOT NULL,
    PRIMARY KEY (tableid)
);

To fill distinct_tokens and max_column, run the following queries:

INSERT INTO distinct_tokens
SELECT DISTINCT tokenized, table_col_id
FROM main_tokenized;

INSERT INTO max_column 
SELECT tableid, MAX(colid)
FROM main_tokenized
GROUP BY tableid;
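The effect of the two queries can be checked on a small example. The sketch below runs them against an in-memory SQLite database rather than Postgres, purely so that it is self-contained; the sample rows and the `tableid_colid` format of `table_col_id` are made up for illustration, and main_tokenized is assumed to have the `colid` column that the second query reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same table layout as above, with the colid column the MAX(colid) query needs.
cur.executescript("""
CREATE TABLE main_tokenized (
    tokenized TEXT,
    tableid INT NOT NULL,
    colid INT NOT NULL,
    rowid INT NOT NULL,
    table_col_id TEXT NOT NULL
);
CREATE TABLE distinct_tokens (
    tokenized TEXT,
    table_col_id TEXT NOT NULL
);
CREATE TABLE max_column (
    tableid INT NOT NULL,
    max_col_id INT NOT NULL,
    PRIMARY KEY (tableid)
);
""")

# Illustrative sample data: two tables, 'berlin' occurs twice in column 1_0.
cur.executemany(
    "INSERT INTO main_tokenized VALUES (?, ?, ?, ?, ?)",
    [
        ("berlin", 1, 0, 0, "1_0"),
        ("berlin", 1, 0, 1, "1_0"),
        ("paris", 1, 0, 2, "1_0"),
        ("42", 1, 1, 0, "1_1"),
        ("berlin", 2, 0, 0, "2_0"),
    ],
)

# The two fill queries from the text.
cur.execute(
    "INSERT INTO distinct_tokens "
    "SELECT DISTINCT tokenized, table_col_id FROM main_tokenized"
)
cur.execute(
    "INSERT INTO max_column "
    "SELECT tableid, MAX(colid) FROM main_tokenized GROUP BY tableid"
)
conn.commit()

distinct_count = cur.execute("SELECT COUNT(*) FROM distinct_tokens").fetchone()[0]
max_columns = cur.execute(
    "SELECT tableid, max_col_id FROM max_column ORDER BY tableid"
).fetchall()
```

Here the duplicate `berlin` entry in column `1_0` collapses to a single row in distinct_tokens, and max_column ends up with one row per table.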

The order index can be created by calling

COCOA.create_index(values)

where values is a column given as a list of values. Create the index for every column in the dataset and store it in the order_index table in the database.
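The format returned by `COCOA.create_index` is not described here. As an illustration only, the sketch below shows one plausible shape for an order index, inferred from the order_index column names (`is_numeric`, `min_index`, `order_list`, `binary_list`). The functions `build_order_index` and `sorted_values` are hypothetical and are not COCOA's actual implementation.

```python
def build_order_index(values):
    """Hypothetical order index for one column.

    min_index   : position of the smallest value
    order_list  : for each position, the position of the next value in
                  ascending order (-1 for the largest value)
    binary_list : 1 if that next value is strictly greater, 0 if it is equal
                  (or if there is no next value)
    Lists are serialised as comma-separated text because the order_index
    table declares these columns as TEXT.
    """
    if not values:
        raise ValueError("values must not be empty")
    order = sorted(range(len(values)), key=lambda i: values[i])
    order_list = [-1] * len(values)
    binary_list = [0] * len(values)
    for current, following in zip(order, order[1:]):
        order_list[current] = following
        binary_list[current] = int(values[following] > values[current])
    return {
        "is_numeric": all(isinstance(v, (int, float)) for v in values),
        "min_index": order[0],
        "order_list": ",".join(map(str, order_list)),
        "binary_list": ",".join(map(str, binary_list)),
    }


def sorted_values(values, index):
    """Walk the linked order starting at min_index to recover sorted values."""
    next_pos = [int(x) for x in index["order_list"].split(",")]
    result = []
    pos = index["min_index"]
    while pos != -1:
        result.append(values[pos])
        pos = next_pos[pos]
    return result


column = [30, 10, 20, 10]
index = build_order_index(column)
ordered = sorted_values(column, index)
```

A structure like this lets the sorted order of a column be traversed without re-sorting at query time, which is presumably why the index is built offline.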

Installation

Run the following command to install COCOA into your Python environment (the package is distributed as cocoa-system):

pip install cocoa-system

Usage

import pandas as pd
import psycopg2
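# A hyphen is not valid in a Python module name; this assumes the import
# name matches the wheel name, cocoa_system.
from cocoa_system import DataAugmentation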

CONN_INFO = {
    'host': '127.0.0.1',
    'dbname': 'db',
    'user': 'postgres',
    'password': 'password',
}

DB_TABLES = {
    'dt': 'distinct_tokens',
    'mt': 'main_tokenized',
    'mc': 'max_column',
    'oi': 'order_index',
}

DATASET_PATH = 'datasets/query.csv'
dataset = pd.read_csv(DATASET_PATH)

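# k_c and k_t must be defined before calling enrich; the values below are
# placeholders, so choose them for your use case.
k_c = 10
k_t = 10

conn = psycopg2.connect(**CONN_INFO)
cocoa = DataAugmentation.COCOAHandler(conn, DB_TABLES)
result = cocoa.enrich(dataset, k_c, k_t, 'my_query_column', 'my_target_column')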
