Utilities for data manipulation including creation of DAGs and tables
Project description
pyg-sql
pyg-sql creates sql_table, a think wrapper on sql-alchemy (sa.Table), providing three different functionailities:
- simplified create/filter/sort/access of a table
- maintainance of a table where records are unique per specified primary keys. Auto-archive of old data
- creation of a full no-sql like document-store
Simplification is achieved since: - sql_table holds both the Table and the engine objects so we can merge these operations into one call. - In addition, we maintain the 'statement' so that a final selction statement can be built gradually rather than all at once
Primary Keys are applied if primary keys (pk) are specified, sql_table also ensure we always have at most one entry of that key. We create a parallel deleted_database.table to auto-archive
Document store: Simple choose doc = True, the table will then be viewed as a document-store.
:Example: table creation
------------------------
>>> from pyg_base import *
>>> from pyg_sql import *
>>> import datetime
>>> t = get_sql_table(db = 'test', table = 'students', non_null = ['name', 'surname'],
_id = dict(_id = int, created = datetime.datetime),
nullable = dict(doc = str, details = str, dob = datetime.date, age = int, grade = float))
>>> t = t.delete()
>>> assert len(t) == 0; assert t.count() == 0
:Example: table insertion
-------------------------
>>> t = t.insert(name = 'yoav', surname = 'git', age = 48)
>>> t = t.insert(name = 'anna', surname = 'git', age = 37)
>>> assert len(t) == 2
>>> t = t.insert(name = ['ayala', 'itamar', 'opher'], surname = 'gate', age = [17, 11, 16])
>>> assert len(t) == 5
:Example: simple access
-----------------------
>>> assert t.sort('age')[0].name == 'itamar' # youngest
>>> assert t.sort('age')[-1].name == 'yoav' # access of last record
>>> assert t.sort(dict(age=-1))[0].name == 'yoav' # sort in descending order
>>> assert t.sort('name')[::].name == ['anna', 'ayala', 'itamar', 'opher', 'yoav']
>>> assert t.sort('name')[['name', 'surname']][::].shape == (5, 2) ## access of specific column(s)
>>> assert t.distinct('surname') == ['gate', 'git']
>>> assert t['surname'] == ['gate', 'git']
>>> assert t[dict(name = 'yoav')] == t.inc(name = 'yoav')[0]
:Example: simple filtering
--------------------------
>>> assert len(t.inc(surname = 'gate')) == 3
>>> assert len(t.inc(surname = 'gate').inc(name = 'ayala')) == 1 # you can build filter in stages
>>> assert len(t.inc(surname = 'gate', name = 'ayala')) == 1 # or build in one step
>>> assert len(t.inc(surname = 'gate').exc(name = 'ayala')) == 2
>>> assert len(t > dict(age = 30)) == 2
>>> assert len(t <= dict(age = 37)) == 4
>>> assert len(t.inc(t.c.age > 30)) == 2 # can filter using the standard sql-alchemy .c.column objects
>>> assert len(t.where(t.c.age > 30)) == 2 # can filter using the standard sql-alchemy "where" statement
:Example: primary keys and deleted records
------------------------------------------
The table as set up can have multiple items so:
>>> t = t.delete()
>>> t = t.insert(name = 'yoav', surname = 'git', age = 46)
>>> t = t.insert(name = 'yoav', surname = 'git', age = 47)
>>> t = t.insert(name = 'yoav', surname = 'git', age = 48)
>>> assert len(t) == 3
>>> t = t.delete()
>>> t = get_sql_table(db = 'test', table = 'students', non_null = ['name', 'surname'],
_id = dict(_id = int, created = datetime.datetime),
nullable = dict(doc = str, details = str, dob = datetime.date, age = int, grade = float),
pk = ['name', 'surname']) ## <<<------- We set primary keys
>>> t = t.delete()
>>> t = t.insert(name = 'yoav', surname = 'git', age = 46)
>>> t = t.insert(name = 'yoav', surname = 'git', age = 47)
>>> t = t.insert(name = 'yoav', surname = 'git', age = 48)
>>> assert len(t) == 1
>>> assert t[0].age == 48
Where did the data go to? We automatically archive the deleted old records for dict(name = 'yoav', surname = 'git') here:
>>> t.deleted
t.deleted is a table by same name
- exists on deleted_test database,
- same table structure with added 'deleted' column
>>> assert len(t.deleted.inc(name = 'yoav', age = 46)) > 0
>>> t.deleted.delete()
:Example: doc management
------------------------
We now suppose that we are not sure what records we want to keep for each student
>>> from pyg import *
>>> import datetime
>>> t = get_sql_table(db = 'test', table = 'unstructured_students', non_null = ['name', 'surname'],
_id = dict(_id = int, created = datetime.datetime),
nullable = dict(doc = str, details = str, dob = datetime.date, age = int, grade = float),
pk = ['name', 'surname'],
doc = True) ##<---- The table will actually be a document store
We are now able to keep varied structure per each record. We are only able to filter against the columns specified above
>>> t = t.delete()
>>> doc = dict(name = 'yoav', surname = 'git', age = 30, profession = 'coder', children = ['ayala', 'opher', 'itamar'])
>>> inserted_doc = t.insert_one(doc)
>>> assert t.inc(name = 'yoav', surname = 'git')[0].children == ['ayala', 'opher', 'itamar']
>>> doc2 = dict(name = 'anna', surname = 'git', age = 28, employer = 'Cambridge University', hobbies = ['chess', 'music', 'swimming'])
>>> _ = t.insert_one(doc2)
>>> assert t[dict(age = 28)].hobbies == ['chess', 'music', 'swimming'] # Note that we can filter or search easily using the column 'age' that was specified in table. We cannot do this on 'employer'
:Example: document store containing pd.DataFrames.
----------
>>> from pyg import *
>>> doc = dict(name = 'yoav', surname = 'git', age = 35,
salary = pd.Series([100,200,300], drange(2)),
costs = pd.DataFrame(dict(transport = [0,1,2], food = [4,5,6], education = [10,20,30]), drange(2)))
>>> t = get_sql_table(db = 'test', table = 'unstructured_students', non_null = ['name', 'surname'],
_id = dict(_id = int, created = datetime.datetime),
nullable = dict(doc = str, details = str, dob = datetime.date, age = int, grade = float),
pk = ['name', 'surname'],
writer = 'c:/temp/%name/%surname.parquet', ##<---- The location where pd.DataFrame/Series are to be stored
doc = True)
>>> inserted = t.insert_one(doc)
>>> import os
>>> assert 'costs.parquet' in os.listdir('c:/temp/yoav/git') and ('salary.parquet' in os.listdir('c:/temp/yoav/git'))
We can now access the data seemlessly:
>>> read_from_db = t.inc(name = 'yoav')[0]
>>> read_from_file = pd_read_parquet('c:/temp/yoav/git/salary.parquet')
>>> assert list(read_from_db.salary.values) == [100, 200, 300]
>>> assert list(read_from_file.values) == [100, 200, 300]
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.