COCOA: COrrelation COefficient-Aware Data Augmentation
Prerequisites
Python version and packages
This package requires Python version >= 3.6. Additionally, the following packages have to be installed:
pandas
numpy
Database schemas
To run the system, the following tables need to be created in the database (e.g. Postgres):
main_tokenized: Inverted index, tokenized -> table, col, row
distinct_tokens: Like main_tokenized, but maps only distinct tokenized -> table_col_id to compute overlap faster
order_index: Index for each column. The index is created offline and stored in the db.
max_column: Maps table -> number of columns for quicker access
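To make the inverted index concrete, the following sketch builds main_tokenized-style rows from a small in-memory table. The tokenize helper (strip and lowercase) and the "tableid_colid" format for table_col_id are assumptions made for illustration; the README does not specify either.

```python
def tokenize(value):
    """Hypothetical normalization; the README does not define the tokenizer."""
    return str(value).strip().lower()


def inverted_index_rows(tableid, rows):
    """Yield one (tokenized, tableid, colid, rowid, table_col_id) tuple per cell."""
    for rowid, row in enumerate(rows):
        for colid, cell in enumerate(row):
            # Assumed key format; adapt it to whatever your database uses.
            table_col_id = f"{tableid}_{colid}"
            yield (tokenize(cell), tableid, colid, rowid, table_col_id)


table = [["Berlin", 3.6], ["Paris ", 2.1]]
index_rows = list(inverted_index_rows(7, table))
for entry in index_rows:
    print(entry)
```

Each tuple maps a token back to the table, column, and row it came from, which is the lookup direction the inverted index is described as providing.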
Schemas:
CREATE TABLE main_tokenized (
tokenized TEXT,
tableid INT NOT NULL,
colid INT NOT NULL, -- read by the max_column fill query
rowid INT NOT NULL,
table_col_id TEXT NOT NULL
);
CREATE TABLE distinct_tokens (
tokenized TEXT,
table_col_id TEXT NOT NULL
);
CREATE TABLE order_index (
table_col_id TEXT NOT NULL,
is_numeric BOOLEAN,
min_index INT NOT NULL,
order_list TEXT,
binary_list TEXT
);
CREATE TABLE max_column (
tableid INT NOT NULL,
max_col_id INT NOT NULL,
PRIMARY KEY (tableid)
);
To fill distinct_tokens and max_column, run the following queries:
INSERT INTO distinct_tokens
SELECT DISTINCT tokenized, table_col_id
FROM main_tokenized;
INSERT INTO max_column
SELECT tableid, MAX(colid)
FROM main_tokenized
GROUP BY tableid;
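The effect of the two fill queries can be checked on toy data. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres, purely so that it runs without a database server. It assumes main_tokenized carries a colid column (the max_column query reads it), and the sample rows and "tableid_colid" keys are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
cur = conn.cursor()
cur.executescript("""
CREATE TABLE main_tokenized (
    tokenized TEXT, tableid INT NOT NULL, colid INT NOT NULL,
    rowid INT NOT NULL, table_col_id TEXT NOT NULL
);
CREATE TABLE distinct_tokens (tokenized TEXT, table_col_id TEXT NOT NULL);
CREATE TABLE max_column (
    tableid INT NOT NULL, max_col_id INT NOT NULL, PRIMARY KEY (tableid)
);
""")

# Made-up sample data: "berlin" appears twice in column 0 of table 1.
cur.executemany(
    "INSERT INTO main_tokenized VALUES (?, ?, ?, ?, ?)",
    [
        ("berlin", 1, 0, 0, "1_0"),
        ("berlin", 1, 0, 1, "1_0"),
        ("3.6", 1, 1, 0, "1_1"),
        ("paris", 2, 0, 0, "2_0"),
        ("x", 2, 2, 0, "2_2"),
    ],
)

# The two fill queries from the README, unchanged.
cur.execute("""INSERT INTO distinct_tokens
               SELECT DISTINCT tokenized, table_col_id FROM main_tokenized""")
cur.execute("""INSERT INTO max_column
               SELECT tableid, MAX(colid) FROM main_tokenized GROUP BY tableid""")

distinct = cur.execute(
    "SELECT tokenized, table_col_id FROM distinct_tokens "
    "ORDER BY tokenized, table_col_id"
).fetchall()
max_cols = cur.execute(
    "SELECT tableid, max_col_id FROM max_column ORDER BY tableid"
).fetchall()
print(distinct)
print(max_cols)
```

The duplicate "berlin" entry collapses to a single row in distinct_tokens, and max_column holds the highest column id seen per table.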
The order index can be created by calling COCOA.create_index(values), where values is a column (list of values). The index then has to be stored in the order_index table in the db for every column in the dataset.
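The README does not show what create_index returns. Judging only from the order_index columns (min_index, order_list, binary_list), one plausible reading is a linked list over the sorted order of a column. The sketch below illustrates that reading; build_order_index and walk_sorted are hypothetical helpers, not part of COCOA, and the real index layout may differ.

```python
def build_order_index(values):
    """Illustrative stand-in, NOT the library's create_index.

    Returns (min_index, order_list, binary_list) where
      - min_index is the position of the smallest value,
      - order_list[i] is the position that follows i in sorted order
        (-1 for the largest value),
      - binary_list[i] is 1 if that following value is strictly greater, else 0.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    order_list = [-1] * len(values)
    binary_list = [0] * len(values)
    for cur, nxt in zip(order, order[1:]):
        order_list[cur] = nxt
        binary_list[cur] = int(values[nxt] > values[cur])
    min_index = order[0] if order else -1
    return min_index, order_list, binary_list


def walk_sorted(min_index, order_list):
    """Yield row positions in ascending value order by following the links."""
    pos = min_index
    while pos != -1:
        yield pos
        pos = order_list[pos]


column = [30, 10, 20, 10]
min_index, order_list, binary_list = build_order_index(column)
sorted_positions = list(walk_sorted(min_index, order_list))
print(min_index, order_list, binary_list, sorted_positions)
```

Under this reading, a single pass along the links visits the rows in rank order, and the binary flags mark where the value changes, which is the information a rank-based correlation needs to handle ties.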
Installation
Run the following command to install COCOA into your Python environment:
pip install cocoa-system
Usage
import pandas as pd
import psycopg2
from cocoa_system import DataAugmentation  # module name assumed from the wheel name cocoa_system
CONN_INFO = {
'host': '127.0.0.1',
'dbname': 'db',
'user': 'postgres',
'password': 'password',
}
DB_TABLES = {
'dt': 'distinct_tokens',
'mt': 'main_tokenized',
'mc': 'max_column',
'oi': 'order_index',
}
DATASET_PATH = 'datasets/query.csv'
dataset = pd.read_csv(DATASET_PATH)
conn = psycopg2.connect(**CONN_INFO)
cocoa = DataAugmentation.COCOAHandler(conn, DB_TABLES)
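k_c = 10  # placeholder value; set it for your use case
k_t = 10  # placeholder value; set it for your use case
result = cocoa.enrich(dataset, k_c, k_t, 'my_query_column', 'my_target_column')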