COCOA: COrrelation COefficient-Aware Data Augmentation

Prerequisites

Python version and packages

This package requires Python >= 3.6. The following packages must also be installed:

pandas
numpy

Database schemas

To run the system, the following tables must be created in the database (e.g. Postgres):

  • main_tokenized: Inverted index that maps tokenized -> table, col, row
  • distinct_tokens: Like main_tokenized, but maps only distinct tokenized -> table_col_id so that overlap can be computed faster
  • order_index: Index for each column. The index is created offline and stored in the db.
  • max_column: Maps table -> number of columns for quicker access

Schemas:

CREATE TABLE main_tokenized (
    tokenized TEXT,
    tableid INT NOT NULL,
    -- column position; read by MAX(colid) when filling max_column
    colid INT NOT NULL,
    rowid INT NOT NULL,
    table_col_id TEXT NOT NULL
);

CREATE TABLE distinct_tokens (
    tokenized TEXT, 
    table_col_id TEXT NOT NULL
);

CREATE TABLE order_index (
    table_col_id TEXT NOT NULL,
    is_numeric BOOLEAN,
    min_index INT NOT NULL,
    order_list TEXT,
    binary_list TEXT
);

CREATE TABLE max_column (
    tableid INT NOT NULL,
    max_col_id INT NOT NULL,
    PRIMARY KEY (tableid)
);
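As a rough illustration of how one source table could be flattened into main_tokenized rows, here is a minimal sketch. It is not COCOA's own loader: the helper name to_inverted_index_rows, the tokenization (lowercase and strip), the table_col_id format "<tableid>_<colid>", and the presence of a colid column are all assumptions made for this example.

```python
def to_inverted_index_rows(columns, tableid):
    """Flatten one table (a list of columns) into
    (tokenized, tableid, colid, rowid, table_col_id) tuples."""
    rows = []
    for colid, values in enumerate(columns):
        # Assumed key format; the real system may encode it differently.
        table_col_id = f"{tableid}_{colid}"
        for rowid, value in enumerate(values):
            if value is None:
                continue  # skip missing cells
            # Assumed tokenization: lowercase and strip surrounding whitespace.
            tokenized = str(value).strip().lower()
            rows.append((tokenized, tableid, colid, rowid, table_col_id))
    return rows


# Made-up sample table with two columns and one missing cell.
example_columns = [["Berlin", " Paris "], ["Germany", None]]
index_rows = to_inverted_index_rows(example_columns, tableid=7)
for row in index_rows:
    print(row)
```

Each resulting tuple matches the column order of main_tokenized, so the rows could be passed to a parameterized INSERT statement.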

To fill distinct_tokens and max_column, run the following queries:

INSERT INTO distinct_tokens
SELECT DISTINCT tokenized, table_col_id
FROM main_tokenized;

INSERT INTO max_column 
SELECT tableid, MAX(colid)
FROM main_tokenized
GROUP BY tableid;
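To see what the two queries produce, the following sketch runs them against a tiny in-memory database. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs without a server. The sample rows are made up, and the colid column in main_tokenized is an assumption taken from the MAX(colid) query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same table layout as in the schemas, with the assumed colid column.
cur.executescript("""
CREATE TABLE main_tokenized (
    tokenized TEXT,
    tableid INT NOT NULL,
    colid INT NOT NULL,
    rowid INT NOT NULL,
    table_col_id TEXT NOT NULL
);
CREATE TABLE distinct_tokens (
    tokenized TEXT,
    table_col_id TEXT NOT NULL
);
CREATE TABLE max_column (
    tableid INT NOT NULL,
    max_col_id INT NOT NULL,
    PRIMARY KEY (tableid)
);
""")

# Made-up sample data: table 1 has two columns, table 2 has one.
cur.executemany(
    "INSERT INTO main_tokenized VALUES (?, ?, ?, ?, ?)",
    [
        ("berlin", 1, 0, 0, "1_0"),
        ("berlin", 1, 0, 1, "1_0"),  # duplicate token in the same column
        ("paris", 1, 0, 2, "1_0"),
        ("germany", 1, 1, 0, "1_1"),
        ("berlin", 2, 0, 0, "2_0"),
    ],
)

# The two fill queries from the text, unchanged.
cur.execute("""
INSERT INTO distinct_tokens
SELECT DISTINCT tokenized, table_col_id
FROM main_tokenized
""")
cur.execute("""
INSERT INTO max_column
SELECT tableid, MAX(colid)
FROM main_tokenized
GROUP BY tableid
""")
conn.commit()

distinct = sorted(cur.execute("SELECT tokenized, table_col_id FROM distinct_tokens").fetchall())
max_cols = dict(cur.execute("SELECT tableid, max_col_id FROM max_column"))
print(distinct)
print(max_cols)
```

The duplicate "berlin" entry in column 1_0 collapses to a single row in distinct_tokens, and max_column stores the highest column id per table.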

The order index can be created by calling

COCOA.create_index(values)

where values is a single column, given as a list of values. Create the index for every column in the dataset and store each one in the order_index table in the db.

Installation

Run the following command to install COCOA into your Python environment:

pip install cocoa-system

Usage

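import pandas as pd
import psycopg2  # needed for the Postgres connection used here
from cocoa import DataAugmentation

# Connection settings for the Postgres database that holds the index tables.
CONN_INFO = {
    'host': '127.0.0.1',
    'dbname': 'db',
    'user': 'postgres',
    'password': 'password',
}

# Maps COCOA's short keys to the table names in the database.
# Use your own table names if they differ from the schemas in this README.
DB_TABLES = {
    'dt': 'distinct_tokens',
    'mt': 'main_tokenized',
    'mc': 'max_column',
    'oi': 'order_index',
}

DATASET_PATH = 'datasets/query.csv'
dataset = pd.read_csv(DATASET_PATH)

# k_c and k_t are left undefined in the original example;
# the values here are placeholders.
k_c = 10
k_t = 10

conn = psycopg2.connect(**CONN_INFO)
cocoa = DataAugmentation.COCOAHandler(conn, DB_TABLES)
result = cocoa.enrich(dataset, k_c, k_t, 'my_query_column', 'my_target_column')
conn.close()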

Download files

Source Distribution

cocoa-system-0.0.5.tar.gz (11.9 kB)

Built Distribution

cocoa_system-0.0.5-py3-none-any.whl (11.3 kB)
