COCOA: COrrelation COefficient-Aware Data Augmentation
Prerequisites
Python version and packages
This package requires Python version >= 3.6. Additionally, the following packages have to be installed:
pandas
numpy
Database schemas
To run the system, the following tables need to be created in the database (e.g. Postgres):
main_tokenized: Inverted index, tokenized -> table, col, row
distinct_tokens: Like main_tokenized, but maps only distinct tokenized -> table_col_id to compute overlap faster
order_index: Index for each column. The index is created offline and stored in the db.
max_column: Maps table -> number of columns for quicker access
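As a rough illustration of how these tables relate, the following sketch builds toy versions of main_tokenized and distinct_tokens in plain Python and uses the distinct tokens to count the overlap between a query column and every indexed column. The sample tables, the "<tableid>_<colid>" format of table_col_id, and all helper names are assumptions made for this example; they are not part of COCOA.

```python
from collections import Counter

# Hypothetical toy corpus: tableid -> list of rows (each row is a list of cell values).
TABLES = {
    0: [["berlin", "germany"], ["paris", "france"], ["berlin", "germany"]],
    1: [["paris", "2.1"], ["rome", "2.8"]],
}


def build_main_tokenized(tables):
    """Return inverted-index rows: (tokenized, tableid, colid, rowid, table_col_id)."""
    rows = []
    for tableid, table in tables.items():
        for rowid, row in enumerate(table):
            for colid, value in enumerate(row):
                # Assumption: table_col_id is "<tableid>_<colid>".
                rows.append((value, tableid, colid, rowid, f"{tableid}_{colid}"))
    return rows


def build_distinct_tokens(main_tokenized):
    """Keep one (tokenized, table_col_id) pair per distinct token and column."""
    return {(tok, tcid) for tok, _, _, _, tcid in main_tokenized}


def column_overlap(query_values, distinct_tokens):
    """Count how many distinct query tokens occur in each indexed column."""
    query = set(query_values)
    return Counter(tcid for tok, tcid in distinct_tokens if tok in query)


main_tokenized = build_main_tokenized(TABLES)
distinct_tokens = build_distinct_tokens(main_tokenized)
overlap = column_overlap(["berlin", "paris", "madrid"], distinct_tokens)
```

Because distinct_tokens holds each token at most once per column, the overlap count needs no deduplication at query time, which is the speed-up the table description refers to.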
Schemas:
CREATE TABLE main_tokenized (
    tokenized TEXT,
    tableid INT NOT NULL,
    colid INT NOT NULL,  -- column id; used by the MAX(colid) query that fills max_column
    rowid INT NOT NULL,
    table_col_id TEXT NOT NULL
);
CREATE TABLE distinct_tokens (
    tokenized TEXT,
    table_col_id TEXT NOT NULL
);
CREATE TABLE order_index (
    table_col_id TEXT NOT NULL,
    is_numeric BOOLEAN,
    min_index INT NOT NULL,
    order_list TEXT,
    binary_list TEXT
);
CREATE TABLE max_column (
    tableid INT NOT NULL,
    max_col_id INT NOT NULL,
    PRIMARY KEY (tableid)
);
To fill distinct_tokens and max_column, run the following queries:
INSERT INTO distinct_tokens
SELECT DISTINCT tokenized, table_col_id
FROM main_tokenized;
INSERT INTO max_column
SELECT tableid, MAX(colid)
FROM main_tokenized
GROUP BY tableid;
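To see what the two queries produce, here is a small sketch that runs them from Python against an in-memory SQLite database. SQLite stands in for Postgres only so that the example is self-contained. The sample rows are made up, and the colid column of main_tokenized is an assumption based on the MAX(colid) query relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified versions of the three tables involved in the fill queries.
cur.executescript("""
CREATE TABLE main_tokenized (
    tokenized TEXT,
    tableid INT NOT NULL,
    colid INT NOT NULL,
    rowid INT NOT NULL,
    table_col_id TEXT NOT NULL
);
CREATE TABLE distinct_tokens (
    tokenized TEXT,
    table_col_id TEXT NOT NULL
);
CREATE TABLE max_column (
    tableid INT NOT NULL,
    max_col_id INT NOT NULL,
    PRIMARY KEY (tableid)
);
""")

# Hypothetical sample rows: (tokenized, tableid, colid, rowid, table_col_id).
sample_rows = [
    ("berlin", 0, 0, 0, "0_0"),
    ("germany", 0, 1, 0, "0_1"),
    ("berlin", 0, 0, 1, "0_0"),   # same token again in the same column
    ("germany", 0, 1, 1, "0_1"),
    ("paris", 1, 0, 0, "1_0"),
    ("2.1", 1, 1, 0, "1_1"),
    ("fr", 1, 2, 0, "1_2"),
]
cur.executemany("INSERT INTO main_tokenized VALUES (?, ?, ?, ?, ?)", sample_rows)

# The two fill queries.
cur.execute("""
INSERT INTO distinct_tokens
SELECT DISTINCT tokenized, table_col_id
FROM main_tokenized
""")
cur.execute("""
INSERT INTO max_column
SELECT tableid, MAX(colid)
FROM main_tokenized
GROUP BY tableid
""")
conn.commit()

distinct = cur.execute(
    "SELECT tokenized, table_col_id FROM distinct_tokens "
    "ORDER BY table_col_id, tokenized"
).fetchall()
max_cols = cur.execute(
    "SELECT tableid, max_col_id FROM max_column ORDER BY tableid"
).fetchall()
```

In this toy data the repeated tokens in column 0_0 and 0_1 collapse to one row each in distinct_tokens, and max_column stores the highest column id seen per table.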
The order index can be created by calling COCOA.create_index(values), where values is a column (list of values). The index then has to be stored in the order_index table in the db for every column in the dataset.
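Before an index can be built for every column, the values of each column have to be gathered. The sketch below shows one possible way to do that from (tokenized, rowid, table_col_id) tuples as they might be read from main_tokenized. The helper name collect_columns and the sample rows are hypothetical, and the call to COCOA.create_index is deliberately left out because its return format is not described here.

```python
from collections import defaultdict


def collect_columns(rows):
    """Group (tokenized, rowid, table_col_id) tuples into per-column value lists.

    Values are ordered by rowid so that each list mirrors the original column.
    """
    by_column = defaultdict(list)
    for tokenized, rowid, table_col_id in rows:
        by_column[table_col_id].append((rowid, tokenized))
    return {
        table_col_id: [value for _, value in sorted(cells, key=lambda c: c[0])]
        for table_col_id, cells in by_column.items()
    }


# Hypothetical rows, deliberately out of row order.
rows = [
    ("paris", 1, "0_0"),
    ("berlin", 0, "0_0"),
    ("3.5", 0, "0_1"),
    ("2.1", 1, "0_1"),
]
columns = collect_columns(rows)
# Each list in `columns` could then be passed to COCOA.create_index(values).
```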
Installation
Run the following command to install COCOA into your Python environment:
pip install cocoa-system
Usage
import pandas as pd
import psycopg2
from cocoa import DataAugmentation

# Connection settings for the database that holds the index tables
CONN_INFO = {
    'host': '127.0.0.1',
    'dbname': 'db',
    'user': 'postgres',
    'password': 'password',
}

# Names of the index tables in the database
DB_TABLES = {
    'dt': 'distinct_tokens',
    'mt': 'main_tokenized',
    'mc': 'max_column',
    'oi': 'order_index',
}

DATASET_PATH = 'datasets/query.csv'

# Example values for the k_c and k_t arguments of enrich(); adjust to your use case
k_c = 10
k_t = 10

dataset = pd.read_csv(DATASET_PATH)
conn = psycopg2.connect(**CONN_INFO)
cocoa = DataAugmentation.COCOAHandler(conn, DB_TABLES)
result = cocoa.enrich(dataset, k_c, k_t, 'my_query_column', 'my_target_column')