
Library focusing on row-major organization of tabular data and control over the Excel application

Project description

For example usage, see:

https://github.com/michael-ross-ven/vengeance_example/blob/main/vengeance_example/flux_example.py
https://github.com/michael-ross-ven/vengeance_example/blob/main/vengeance_example/excel_example.py


Managing data stored as rows and columns shouldn't be complicated.

When given a list of lists in Python, your first instinct is to loop over rows and modify column values in place. It's the most natural way to think about the data, because conceptually, each row is some entity and each column is a property of that row, much like a list of objects.

A headache when dealing with a list of lists, however, is having to keep track of columns by integer index; it would be nice to replace the indices on each row with named attributes, and to have these applied even when the columns aren't known ahead of time, such as when pulling data from a sql table or csv file.

for row in matrix:
    row[17]            # what's in that 18th column again?

for row in matrix:
    row.customer_id    # oh, duh

Doesn't the pandas DataFrame already solve this?

In a DataFrame, data is taken out of its native nested list format and is organized in column-major order, which comes with some advantages as well as drawbacks.

Row-major order:
    [['attribute_a', 'attribute_b', 'attribute_c'],
     ['a',           'b',           3.0],
     ['a',           'b',           3.0],
     ['a',           'b',           3.0]]
Column-major order:
    {'attribute_a': array(['a', 'a', 'a'], dtype='<U1'),
     'attribute_b': array(['b', 'b', 'b'], dtype='<U1'),
     'attribute_c': array([3., 3., 3.], dtype=float64)}
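
The two layouts hold the same values; one is the transpose of the other, and in plain Python, zip(*rows) converts between them. A minimal sketch (no pandas required):

matrix = [['attribute_a', 'attribute_b', 'attribute_c'],
          ['a',           'b',           3.0],
          ['a',           'b',           3.0],
          ['a',           'b',           3.0]]

# transpose the data rows beneath the header into per-column lists
header, *rows = matrix
columns = {name: list(col) for name, col in zip(header, zip(*rows))}
# {'attribute_a': ['a', 'a', 'a'],
#  'attribute_b': ['b', 'b', 'b'],
#  'attribute_c': [3.0, 3.0, 3.0]}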

In column-major order, values in a single column are usually all of the same datatype, so they can be packed into consecutive addresses in memory as an actual array. These contiguous elements along a single column can be iterated very quickly. But in a DataFrame, organizing data so that each row is some entity and each column is a property of that row is mind-numbingly slow: DataFrame.iterrows() incurs a huge performance penalty, and can be 1,000x slower to iterate than a built-in list.
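
A rough way to see the penalty for yourself; this is an illustrative sketch, and the absolute numbers will vary by machine and pandas version:

import timeit
import pandas as pd

matrix = [[i, float(i), 'abc'] for i in range(10_000)]
df     = pd.DataFrame(matrix, columns=['a', 'b', 'c'])

# iterate the same 10,000 rows both ways
t_list = timeit.timeit(lambda: [row[1] for row in matrix], number=5)
t_df   = timeit.timeit(lambda: [row.b for _, row in df.iterrows()], number=5)
print(f'list of lists: {t_list:.3f}s    DataFrame.iterrows(): {t_df:.3f}s')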

DataFrames also take advantage of vectorization, where operations can be applied to an entire set of values at once. But removing explicit loops requires specialized methods for almost every operation and modification, which makes the syntax more challenging. DataFrame transformations can be counter-intuitive to write and effortful to read, especially when method-chaining is overused.

# wait, what exactly does this do again?
df['column'] = np.sign(df['column'].diff().fillna(0)).shift(-1).fillna(0)
df_summary = df.groupby('column') \
               .apply(lambda x: (x['column'].head(1),
                                 x.shape[0],
                                 x['start'].iloc[-1] - x['start'].iloc[0]))
DataFrame Advantages:
  • vectorized operations on contiguous arrays are very fast
DataFrame Disadvantages:
  • syntax doesn't always drive intuition
  • iteration by rows is almost completely out of the question
    (& working with JSON files is notoriously difficult)
  • managing datatypes can sometimes be problematic (see the sketch after this list)
  • harder to debug / inspect when vectorized operations return an error
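
On the datatype point, for instance, a single missing value can silently change a whole column's dtype; a minimal sketch:

import pandas as pd

pd.Series([1, 2, 3]).dtype       # int64
pd.Series([1, 2, None]).dtype    # float64: one missing value coerces every int to float
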
But then, why are we working in Python to begin with? The Zen of Python has a few opinions:

"Explicit is better than implicit"
"Simple is better than complex"
"Sparse is better than dense"
"Readability counts"
"There should be one– and preferably only one –obvious way to do it"


vengeance.flux_cls

  • a similar idea to the pandas DataFrame, but more closely aligned with Python's design philosophy
  • for when you're willing to trade a little bit of speed for a lot of simplicity
  • a lightweight, pure-python wrapper class around list of lists
  • applies named attributes to rows; attribute values are mutable during iteration
  • provides convenience aggregate operations (sort, filter, groupby, etc)
  • excellent for prototyping and data-wrangling
row-major
import math
import vengeance

# organized like csv data; attribute names are provided in the first row
matrix = [['attribute_a', 'attribute_b', 'attribute_c'],
          ['a',           'b',           3.0],
          ['a',           'b',           3.0],
          ['a',           'b',           3.0]]
flux = vengeance.flux_cls(matrix)

# row attributes can be accessed by name or by index
for row in flux:
    a = row.attribute_a
    a = row['attribute_a']
    a = row[-1]
    a = row.values[:-2]

    row.attribute_a    = None
    row['attribute_a'] = None
    row[-1]            = None
    row.values[:2]     = [None, None]

# transformations are compositional and self-documenting
# (assuming the matrix also has side_a and side_b columns)
for row in flux:
    row.hypotenuse = math.sqrt(row.side_a**2 +
                               row.side_b**2)

matrix = list(flux.values())
columns
# entire columns can be referenced with __getitem__ / __setitem__ syntax
column = flux['attribute_a']

flux.rename_columns({'attribute_a': 'renamed_a',
                     'attribute_b': 'renamed_b'})
flux.insert_columns((0, 'inserted_a'),
                    (2, 'inserted_b'))
flux.delete_columns('inserted_a',
                    'inserted_b')
sort / filter / apply
flux.sort('attribute_c')
flux.filter(lambda row: row.attribute_b != 'c')
u = flux.unique('attribute_a', 'attribute_b')

# apply functions like you'd normally do in Python: with comprehensions
flux['attribute_new'] = [some_function(v) for v in flux['attribute_a']]
groupby
import random
from vengeance import flux_cls

matrix = [['year', 'month', 'random_float'],
          ['2000', '01',     random.uniform(0, 9)],
          ['2000', '02',     random.uniform(0, 9)],
          ['2001', '01',     random.uniform(0, 9)],
          ['2001', '01',     random.uniform(0, 9)],
          ['2001', '01',     random.uniform(0, 9)],
          ['2002', '01',     random.uniform(0, 9)]]
flux = flux_cls(matrix)

dict_1 = flux.map_rows_append('year', 'month')
countifs = {k: len(rows) for k, rows in dict_1.items()}
sumifs   = {k: sum(row.random_float for row in rows)
            for k, rows in dict_1.items()}

dict_2 = flux.map_rows_nested('year', 'month')
rows_1 = dict_1[('2001', '01')]
rows_2 = dict_2['2001']['01']
read / write files
flux.to_csv('file.csv')
flux = flux_cls.from_csv('file.csv')

flux.to_json('file.json')
flux = flux_cls.from_json('file.json')
