Wrapper around awk to use as a dataframe implementation in python
Project description
awk_dataframe
This library is intended to use for fast and low RAM memory consumption of very large .csv files. It works by accumulating a sequence of awk commands that will be executed through pipes in bash when the actual values of the dataframe are to be consumed.
First things first
This is an wrapper around AWK for its use as a dataframe implementation, therefore, it won't work unless you are using a Linux distribution that can run BASH and GAWK. It could work on a macOS but I haven't tested it. I am personally running Ubuntu 22.04. It also compiles two C++ files (to read and save) and needs g++ in the system in order to compile them.
It will also not work if your regional settings use the comma as the decimal separator. One way to change this is to set the regional settings to the UK as follows:
sudo update-locale LC_NUMERIC="en_GB.UTF-8"
and then logout of your system.
Disk usage
Not only this library runs directly from the hard drive trying to avoid RAM limitations which will degrade your hard drive.
It also stores temporary files into the ~/.tmp
folder. Altough they should be deleted automatically, please check from time to time, since there might be some bug that leaves files behind. Also, although it will be automatically created, make sure that it already exists in your disk, in order to avoid failures at creation time due to permissions.
Most commands do not take time, but be advised that df.shape()
, print(df)
, df.head()
, df.values()
, df.to_pandas()
, df.to_npd()
and df.to_csv(path)
will run the full set of commands and might take time, avoid using them as much as possible. The best implementation would be to generate all conditions for subsetting and then either going to a pandas/numpy_dataframe object if you want to further manipulate in RAM or using df.to_csv(path)
to save the results to the hard drive.
Authors and acknowledgment
Idea and implementation by Carlos Molinero.
License
MIT license.
Project status
Currently this is an early implementation, meaning that it is in a very unstable state, and the syntax might change and bugs may arise. I do not recommend installing it, I am publishing it for my personal use.
Usage:
In the following lines I paste examples of the main commands, as some sort of documentation until a more complete documentation is available.
INIT
import os
import sys
import numpy as np
import numpy_dataframe as npd
import pandas as pd
import awk_dataframe as ad
import importlib
import time
import random
import string
Functions
def get_random_string(length):
# thanks to https://pynative.com/python-generate-random-string/
# choose from all lowercase letter
letters = string.ascii_lowercase
result_str = ''.join(random.choice(letters) for i in range(length))
return result_str
create test csv
n = 20
names = np.array([get_random_string(5) for x in range(5)])
t = npd.DataFrame()
t.values = np.random.rand(n)
t.some_ints = np.floor(np.random.rand(n)*50).astype(int)
t.name = names[np.random.choice(range(5),n)]
path = os.path.expanduser("~") + "/.tmp/example.csv"
t.to_csv(path)
Gawk dataframe
contruct object / read csv
path = os.path.expanduser("~") + "/.tmp/example.csv"
df = ad.read_csv(path)
or
path = os.path.expanduser("~") + "/.tmp/example.csv"
df = ad.DataFrame()
df.read_csv(path)
select rows/columns
df[0:10,:]
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
0.06832888693954564,31,bgmqd
0.717887788740719,44,mlyas
0.6559538612897098,49,efcsm
0.08927654448880817,44,tcbtr
0.7338088959472877,36,tcbtr
df.some_ints
some_ints
21
32
24
44
2
31
44
49
44
36
9
11
12
31
7
40
26
38
35
26
df["some_ints"]
some_ints
21
32
24
44
2
31
44
49
44
36
9
11
12
31
7
40
26
38
35
26
df[:,"some_ints"]
some_ints
21
32
24
44
2
31
44
49
44
36
9
11
12
31
7
40
26
38
35
26
df[:,["some_ints","name"]]
some_ints,name
21,mlyas
32,efcsm
24,tcbtr
44,efcsm
2,bgmqd
31,bgmqd
44,mlyas
49,efcsm
44,tcbtr
36,tcbtr
9,mlyas
11,lmqgd
12,bgmqd
31,mlyas
7,efcsm
40,tcbtr
26,mlyas
38,efcsm
35,mlyas
26,mlyas
df[:,1:2]
some_ints
21
32
24
44
2
31
44
49
44
36
9
11
12
31
7
40
26
38
35
26
df[:,1:]
some_ints,name
21,mlyas
32,efcsm
24,tcbtr
44,efcsm
2,bgmqd
31,bgmqd
44,mlyas
49,efcsm
44,tcbtr
36,tcbtr
9,mlyas
11,lmqgd
12,bgmqd
31,mlyas
7,efcsm
40,tcbtr
26,mlyas
38,efcsm
35,mlyas
26,mlyas
df[:10,:]
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
0.06832888693954564,31,bgmqd
0.717887788740719,44,mlyas
0.6559538612897098,49,efcsm
0.08927654448880817,44,tcbtr
0.7338088959472877,36,tcbtr
df[3,2]
name
efcsm
df[3,:]
values,some_ints,name
0.7678670658611503,44,efcsm
df[:,1]
some_ints
21
32
24
44
2
31
44
49
44
36
9
11
12
31
7
40
26
38
35
26
df[[0,3],1]
some_ints
21
44
df[:,[0,1]]
values,some_ints
0.8633713864104114,21
0.7797461486214806,32
0.9415640935512127,24
0.7678670658611503,44
0.8223095493965592,2
0.06832888693954564,31
0.717887788740719,44
0.6559538612897098,49
0.08927654448880817,44
0.7338088959472877,36
0.701592457357365,9
0.04338298448626843,11
0.3297427254572507,12
0.1011755876260031,31
0.8235585667750851,7
0.5829664850991141,40
0.14016453615592928,26
0.43436632265162967,38
0.4097725083991255,35
0.6379834519420476,26
df[range(5),:]
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
df.get_rows(range(10))
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
0.06832888693954564,31,bgmqd
0.717887788740719,44,mlyas
0.6559538612897098,49,efcsm
0.08927654448880817,44,tcbtr
0.7338088959472877,36,tcbtr
df.get_cols(range(2))
values,some_ints
0.8633713864104114,21
0.7797461486214806,32
0.9415640935512127,24
0.7678670658611503,44
0.8223095493965592,2
0.06832888693954564,31
0.717887788740719,44
0.6559538612897098,49
0.08927654448880817,44
0.7338088959472877,36
0.701592457357365,9
0.04338298448626843,11
0.3297427254572507,12
0.1011755876260031,31
0.8235585667750851,7
0.5829664850991141,40
0.14016453615592928,26
0.43436632265162967,38
0.4097725083991255,35
0.6379834519420476,26
names, shape, head
df.names()
array(['values', 'some_ints', 'name'], dtype='<U9')
df.shape()
array([20, 3])
df.head()
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
0.06832888693954564,31,bgmqd
0.717887788740719,44,mlyas
0.6559538612897098,49,efcsm
0.08927654448880817,44,tcbtr
0.7338088959472877,36,tcbtr
0.701592457357365,9,mlyas
0.04338298448626843,11,lmqgd
df.head(5)
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
0.06832888693954564,31,bgmqd
0.717887788740719,44,mlyas
where
df.where("values < 0.8 && values > .6")
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name
0.7797461486,32,efcsm
0.7678670659,44,efcsm
0.7178877887,44,mlyas
0.6559538613,49,efcsm
0.7338088959,36,tcbtr
0.7015924574,9,mlyas
0.6379834519,26,mlyas
df.where(""" name == "mlyas" """)
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name
0.8633713864104114,21,mlyas
0.717887788740719,44,mlyas
0.701592457357365,9,mlyas
0.1011755876260031,31,mlyas
0.14016453615592928,26,mlyas
0.4097725083991255,35,mlyas
0.6379834519420476,26,mlyas
df.where(""" name in ["efcsm","tcbtr"] """)
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.6559538612897098,49,efcsm
0.08927654448880817,44,tcbtr
0.7338088959472877,36,tcbtr
0.8235585667750851,7,efcsm
0.5829664850991141,40,tcbtr
0.43436632265162967,38,efcsm
unique
df.unique("name")
name
lmqgd
efcsm
mlyas
bgmqd
tcbtr
add column
df_with_new_column = df.add_column("new_column")
df_with_new_column
values,some_ints,name,new_column
0.8633713864104114,21,mlyas,
0.7797461486214806,32,efcsm,
0.9415640935512127,24,tcbtr,
0.7678670658611503,44,efcsm,
0.8223095493965592,2,bgmqd,
0.06832888693954564,31,bgmqd,
0.717887788740719,44,mlyas,
0.6559538612897098,49,efcsm,
0.08927654448880817,44,tcbtr,
0.7338088959472877,36,tcbtr,
0.701592457357365,9,mlyas,
0.04338298448626843,11,lmqgd,
0.3297427254572507,12,bgmqd,
0.1011755876260031,31,mlyas,
0.8235585667750851,7,efcsm,
0.5829664850991141,40,tcbtr,
0.14016453615592928,26,mlyas,
0.43436632265162967,38,efcsm,
0.4097725083991255,35,mlyas,
0.6379834519420476,26,mlyas,
modify
df.modify("values = values + some_ints * 3")
gawk: warning: escape sequence `\/' treated as plain `/'
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name
63.86337139,21,mlyas
96.77974615,32,efcsm
72.94156409,24,tcbtr
132.7678671,44,efcsm
6.822309549,2,bgmqd
93.06832889,31,bgmqd
132.7178878,44,mlyas
147.6559539,49,efcsm
132.0892765,44,tcbtr
108.7338089,36,tcbtr
27.70159246,9,mlyas
33.04338298,11,lmqgd
36.32974273,12,bgmqd
93.10117559,31,mlyas
21.82355857,7,efcsm
120.5829665,40,tcbtr
78.14016454,26,mlyas
114.4343663,38,efcsm
105.4097725,35,mlyas
78.63798345,26,mlyas
The previous operation does not modify the values in df
df
values,some_ints,name
0.8633713864104114,21,mlyas
0.7797461486214806,32,efcsm
0.9415640935512127,24,tcbtr
0.7678670658611503,44,efcsm
0.8223095493965592,2,bgmqd
0.06832888693954564,31,bgmqd
0.717887788740719,44,mlyas
0.6559538612897098,49,efcsm
0.08927654448880817,44,tcbtr
0.7338088959472877,36,tcbtr
0.701592457357365,9,mlyas
0.04338298448626843,11,lmqgd
0.3297427254572507,12,bgmqd
0.1011755876260031,31,mlyas
0.8235585667750851,7,efcsm
0.5829664850991141,40,tcbtr
0.14016453615592928,26,mlyas
0.43436632265162967,38,efcsm
0.4097725083991255,35,mlyas
0.6379834519420476,26,mlyas
You need to assign it in order for the modifications to be "saved"
df_mod = df.modify("values = values + some_ints * 3")
df_mod
gawk: warning: escape sequence `\/' treated as plain `/'
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name
63.86337139,21,mlyas
96.77974615,32,efcsm
72.94156409,24,tcbtr
132.7678671,44,efcsm
6.822309549,2,bgmqd
93.06832889,31,bgmqd
132.7178878,44,mlyas
147.6559539,49,efcsm
132.0892765,44,tcbtr
108.7338089,36,tcbtr
27.70159246,9,mlyas
33.04338298,11,lmqgd
36.32974273,12,bgmqd
93.10117559,31,mlyas
21.82355857,7,efcsm
120.5829665,40,tcbtr
78.14016454,26,mlyas
114.4343663,38,efcsm
105.4097725,35,mlyas
78.63798345,26,mlyas
Modify accepts conditional equations, and its results are 1 for true and 0 for false, which allows to construct functions such as sign() or abs(), etc as an example:
df_mod = df_with_new_column.modify("new_column = values - .5")
df_mod
gawk: warning: escape sequence `\/' treated as plain `/'
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name,new_column
0.8633713864104114,21,mlyas,0.3633713864
0.7797461486214806,32,efcsm,0.2797461486
0.9415640935512127,24,tcbtr,0.4415640936
0.7678670658611503,44,efcsm,0.2678670659
0.8223095493965592,2,bgmqd,0.3223095494
0.06832888693954564,31,bgmqd,-0.4316711131
0.717887788740719,44,mlyas,0.2178877887
0.6559538612897098,49,efcsm,0.1559538613
0.08927654448880817,44,tcbtr,-0.4107234555
0.7338088959472877,36,tcbtr,0.2338088959
0.701592457357365,9,mlyas,0.2015924574
0.04338298448626843,11,lmqgd,-0.4566170155
0.3297427254572507,12,bgmqd,-0.1702572745
0.1011755876260031,31,mlyas,-0.3988244124
0.8235585667750851,7,efcsm,0.3235585668
0.5829664850991141,40,tcbtr,0.0829664851
0.14016453615592928,26,mlyas,-0.3598354638
0.43436632265162967,38,efcsm,-0.06563367735
0.4097725083991255,35,mlyas,-0.0902274916
0.6379834519420476,26,mlyas,0.1379834519
now, let's get the absolute value of new column and put it in a new column
df_mod_2 = df_mod.add_column("abs_new_column").modify("abs_new_column = new_column * ((new_column >= 0) * 2 - 1)")
df_mod_2
gawk: warning: escape sequence `\/' treated as plain `/'
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name,new_column,abs_new_column
0.8633713864104114,21,mlyas,0.3633713864,0.3633713864
0.7797461486214806,32,efcsm,0.2797461486,0.2797461486
0.9415640935512127,24,tcbtr,0.4415640936,0.4415640936
0.7678670658611503,44,efcsm,0.2678670659,0.2678670659
0.8223095493965592,2,bgmqd,0.3223095494,0.3223095494
0.06832888693954564,31,bgmqd,-0.4316711131,0.4316711131
0.717887788740719,44,mlyas,0.2178877887,0.2178877887
0.6559538612897098,49,efcsm,0.1559538613,0.1559538613
0.08927654448880817,44,tcbtr,-0.4107234555,0.4107234555
0.7338088959472877,36,tcbtr,0.2338088959,0.2338088959
0.701592457357365,9,mlyas,0.2015924574,0.2015924574
0.04338298448626843,11,lmqgd,-0.4566170155,0.4566170155
0.3297427254572507,12,bgmqd,-0.1702572745,0.1702572745
0.1011755876260031,31,mlyas,-0.3988244124,0.3988244124
0.8235585667750851,7,efcsm,0.3235585668,0.3235585668
0.5829664850991141,40,tcbtr,0.0829664851,0.0829664851
0.14016453615592928,26,mlyas,-0.3598354638,0.3598354638
0.43436632265162967,38,efcsm,-0.06563367735,0.06563367735
0.4097725083991255,35,mlyas,-0.0902274916,0.0902274916
0.6379834519420476,26,mlyas,0.1379834519,0.1379834519
Modify accepts conditions, there are two ways:
df_mod_2.modify("values = 0",condition = "new_column < 0")
gawk: warning: escape sequence `\/' treated as plain `/'
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name,new_column,abs_new_column
0.8633713864104114,21,mlyas,0.3633713864,0.3633713864
0.7797461486214806,32,efcsm,0.2797461486,0.2797461486
0.9415640935512127,24,tcbtr,0.4415640936,0.4415640936
0.7678670658611503,44,efcsm,0.2678670659,0.2678670659
0.8223095493965592,2,bgmqd,0.3223095494,0.3223095494
0,31,bgmqd,-0.4316711131,0.4316711131
0.717887788740719,44,mlyas,0.2178877887,0.2178877887
0.6559538612897098,49,efcsm,0.1559538613,0.1559538613
0,44,tcbtr,-0.4107234555,0.4107234555
0.7338088959472877,36,tcbtr,0.2338088959,0.2338088959
0.701592457357365,9,mlyas,0.2015924574,0.2015924574
0,11,lmqgd,-0.4566170155,0.4566170155
0,12,bgmqd,-0.1702572745,0.1702572745
0,31,mlyas,-0.3988244124,0.3988244124
0.8235585667750851,7,efcsm,0.3235585668,0.3235585668
0.5829664850991141,40,tcbtr,0.0829664851,0.0829664851
0,26,mlyas,-0.3598354638,0.3598354638
0,38,efcsm,-0.06563367735,0.06563367735
0,35,mlyas,-0.0902274916,0.0902274916
0.6379834519420476,26,mlyas,0.1379834519,0.1379834519
conditional_equation = ad.Conditional_equation()
conditional_equation.condition = "new_column < 0"
conditional_equation.equation = "values = 0"
df_mod_2.modify(conditional_equation)
gawk: warning: escape sequence `\/' treated as plain `/'
gawk: warning: escape sequence `\/' treated as plain `/'
values,some_ints,name,new_column,abs_new_column
0.8633713864104114,21,mlyas,0.3633713864,0.3633713864
0.7797461486214806,32,efcsm,0.2797461486,0.2797461486
0.9415640935512127,24,tcbtr,0.4415640936,0.4415640936
0.7678670658611503,44,efcsm,0.2678670659,0.2678670659
0.8223095493965592,2,bgmqd,0.3223095494,0.3223095494
0,31,bgmqd,-0.4316711131,0.4316711131
0.717887788740719,44,mlyas,0.2178877887,0.2178877887
0.6559538612897098,49,efcsm,0.1559538613,0.1559538613
0,44,tcbtr,-0.4107234555,0.4107234555
0.7338088959472877,36,tcbtr,0.2338088959,0.2338088959
0.701592457357365,9,mlyas,0.2015924574,0.2015924574
0,11,lmqgd,-0.4566170155,0.4566170155
0,12,bgmqd,-0.1702572745,0.1702572745
0,31,mlyas,-0.3988244124,0.3988244124
0.8235585667750851,7,efcsm,0.3235585668,0.3235585668
0.5829664850991141,40,tcbtr,0.0829664851,0.0829664851
0,26,mlyas,-0.3598354638,0.3598354638
0,38,efcsm,-0.06563367735,0.06563367735
0,35,mlyas,-0.0902274916,0.0902274916
0.6379834519420476,26,mlyas,0.1379834519,0.1379834519
A future update of the library will allow to include several conditional equations in the same modify statement
add_index
df_mod = df.add_index()
df_mod
index,values,some_ints,name
0,0.8633713864104114,21,mlyas
1,0.7797461486214806,32,efcsm
2,0.9415640935512127,24,tcbtr
3,0.7678670658611503,44,efcsm
4,0.8223095493965592,2,bgmqd
5,0.06832888693954564,31,bgmqd
6,0.717887788740719,44,mlyas
7,0.6559538612897098,49,efcsm
8,0.08927654448880817,44,tcbtr
9,0.7338088959472877,36,tcbtr
10,0.701592457357365,9,mlyas
11,0.04338298448626843,11,lmqgd
12,0.3297427254572507,12,bgmqd
13,0.1011755876260031,31,mlyas
14,0.8235585667750851,7,efcsm
15,0.5829664850991141,40,tcbtr
16,0.14016453615592928,26,mlyas
17,0.43436632265162967,38,efcsm
18,0.4097725083991255,35,mlyas
19,0.6379834519420476,26,mlyas
sort_by
df_mod.sort_by("values")
index,values,some_ints,name
11,0.04338298448626843,11,lmqgd
5,0.06832888693954564,31,bgmqd
8,0.08927654448880817,44,tcbtr
13,0.1011755876260031,31,mlyas
16,0.14016453615592928,26,mlyas
12,0.3297427254572507,12,bgmqd
18,0.4097725083991255,35,mlyas
17,0.43436632265162967,38,efcsm
15,0.5829664850991141,40,tcbtr
19,0.6379834519420476,26,mlyas
7,0.6559538612897098,49,efcsm
10,0.701592457357365,9,mlyas
6,0.717887788740719,44,mlyas
9,0.7338088959472877,36,tcbtr
3,0.7678670658611503,44,efcsm
1,0.7797461486214806,32,efcsm
4,0.8223095493965592,2,bgmqd
14,0.8235585667750851,7,efcsm
0,0.8633713864104114,21,mlyas
2,0.9415640935512127,24,tcbtr
to_pandas
pdf = df_mod.to_pandas()
pdf
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
index | values | some_ints | name | |
---|---|---|---|---|
0 | 0 | 0.863371 | 21 | mlyas |
1 | 1 | 0.779746 | 32 | efcsm |
2 | 2 | 0.941564 | 24 | tcbtr |
3 | 3 | 0.767867 | 44 | efcsm |
4 | 4 | 0.822310 | 2 | bgmqd |
5 | 5 | 0.068329 | 31 | bgmqd |
6 | 6 | 0.717888 | 44 | mlyas |
7 | 7 | 0.655954 | 49 | efcsm |
8 | 8 | 0.089277 | 44 | tcbtr |
9 | 9 | 0.733809 | 36 | tcbtr |
10 | 10 | 0.701592 | 9 | mlyas |
11 | 11 | 0.043383 | 11 | lmqgd |
12 | 12 | 0.329743 | 12 | bgmqd |
13 | 13 | 0.101176 | 31 | mlyas |
14 | 14 | 0.823559 | 7 | efcsm |
15 | 15 | 0.582966 | 40 | tcbtr |
16 | 16 | 0.140165 | 26 | mlyas |
17 | 17 | 0.434366 | 38 | efcsm |
18 | 18 | 0.409773 | 35 | mlyas |
19 | 19 | 0.637983 | 26 | mlyas |
to_numpy_dataframe
ndf = df_mod.to_npd()
print(ndf)
print(ndf.shape)
index values some_ints name
0 0 0.863371 21 mlyas
1 1 0.779746 32 efcsm
2 2 0.941564 24 tcbtr
3 3 0.767867 44 efcsm
4 4 0.822310 2 bgmqd
5 5 0.068329 31 bgmqd
6 6 0.717888 44 mlyas
7 7 0.655954 49 efcsm
8 8 0.089277 44 tcbtr
9 9 0.733809 36 tcbtr
10 10 0.701592 9 mlyas
11 11 0.043383 11 lmqgd
12 12 0.329743 12 bgmqd
13 13 0.101176 31 mlyas
14 14 0.823559 7 efcsm
15 15 0.582966 40 tcbtr
16 16 0.140165 26 mlyas
17 17 0.434366 38 efcsm
18 18 0.409773 35 mlyas
19 19 0.637983 26 mlyas
(20, 4)
to_csv
path_output = os.path.expanduser("~") + "/.tmp/example_mod.csv"
df_mod.to_csv(path_output)
For future development
add_rows(n)
keep_na = False
rename
Internal gawk command example (where command):
function is_digit_field(field){
if (field ~ /^[\\+-]?[0-9]+$|^[\\+-]*[0-9]*[\\.]?[0-9]+$|^[\\+-]*[0-9]*[\\.]?[0-9]+[eE]?[\\+-]?[0-9]+$|#N\/A|#N\/A|N\/A|#NA|-1.#IND|-1.#QNAN|-NaN|-nan|1.#IND|1.#QNAN|<NA>|<na>|NA|NULL|NaN|n\/a|nan|null||Nan|-Nan/){
return 1
}else{
return 0
}
}
function needs_quotes(field){
#postgres_standard: delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, t
regex = "(" FS ")+|(\\n)+|(" input_quote ")+|\r+"
if (field ~ regex){
return 1
}else{
return 0
}
}
function transform_field(field,column){
if (is_first_command && !is_last_command && original_level_simplicity_csv != 2){
gsub(/\n/,delimiter_internal_notation "n" delimiter_internal_notation,field)
gsub(FS,delimiter_internal_notation "d" delimiter_internal_notation,field)
gsub(input_quote, delimiter_internal_notation "q" delimiter_internal_notation, field)
gsub("\r", delimiter_internal_notation "r" delimiter_internal_notation, field)
}
if (is_last_command && !is_first_command && original_level_simplicity_csv != 2){
gsub(delimiter_internal_notation "n" delimiter_internal_notation,"\n",field)
gsub(delimiter_internal_notation "d" delimiter_internal_notation,FS,field)
gsub(delimiter_internal_notation "q" delimiter_internal_notation,input_quote, field)
gsub(delimiter_internal_notation "r" delimiter_internal_notation,"\r", field)
}
if (is_digit_field(field) || (level_simplicity_csv == 2 && (!is_last_command || original_level_simplicity_csv == 2))){
if (!is_digit_field()){
field = transform_string(field,column)
if (input_quote != output_quote){
gsub(input_quote,output_quote,field)
}
}else{
return transform_digit(field,column)
}
return field
}else{
field = transform_string(field,column)
if (quotes_have_been_removed){
if (full_string_quoting){
return output_quote field output_quote
}else{
if (needs_quotes(field) && minimal_string_quoting || field == delimiter_internal_notation "NULL" delimiter_internal_notation){
if (field == delimiter_internal_notation "NULL" delimiter_internal_notation){
gsub(delimiter_internal_notation "NULL" delimiter_internal_notation,"NULL",field)
}
return output_quote field output_quote
}else{
return field
}
}
}else{
return field
}
}
}
function print_field(field,column){
field = transform_field(field,column)
if (FNR == NR && find_unique){
unique[field] = 1
}else{
if (print_fields){
output_separator = OFS
if (i == max_col() && !add_new_column){
output_separator = ORS
}
if (i == min_col()){
if (add_index){
if (!save){
printf("%s" output_separator,print_line_number() + index_based_on)
}else{
printf("%s" output_separator,print_line_number() + index_based_on) > path_output
}
}
}
if (!save){
printf "%s" output_separator,field
}else{
printf "%s" output_separator,field > path_output
}
}
}
}
function print_line(line){
line = transform_line(line)
if (print_fields){
if (add_index){
if (!save){
printf("%s" OFS,print_line_number() + index_based_on)
}else{
printf("%s" OFS,print_line_number() + index_based_on) > path_output
}
}
if (input_quote != output_quote){
gsub(input_quote,output_quote,line)
}
if (!save){
if (sort){
sort_command = "sort -t'" OFS "' --parallel " num_cores " -nk" sort_column
printf "%s" ORS,line | sort_command
}else{
printf "%s" ORS,line
}
}else{
printf "%s" ORS,line > path_output
}
}
}
function print_line_number(){
if (output_header){
return FNR-records_skipped - 2
}else{
return FNR-records_skipped - 1
}
}
function line_number(){
if (header){
return FNR-records_skipped - 1
}else{
return FNR-records_skipped
}
}
function print_header(){
if (add_index){
field = transform_field("index",1)
if (!save){
printf("%s" OFS,field)
}else{
printf("%s" OFS,field) > path_output
}
}
if (fields_based_code){
for (i=min_col();i<=max_col();i++) {
if ((number_files_input == 1 || input_file_rows) || (input_file_cols && v[i])){
if (is_first_command && !is_last_command && level_simplicity_csv != 2){
gsub(/\n/,delimiter_internal_notation "n" delimiter_internal_notation,$i)
gsub(FS,delimiter_internal_notation "d" delimiter_internal_notation,$i)
gsub(input_quote, delimiter_internal_notation "q" delimiter_internal_notation, $i)
gsub("\r", delimiter_internal_notation "r" delimiter_internal_notation, $i)
}
if (is_last_command && !is_first_command && level_simplicity_csv != 2){
gsub(delimiter_internal_notation "n" delimiter_internal_notation,"\n",$i)
gsub(delimiter_internal_notation "d" delimiter_internal_notation,FS,$i)
gsub(delimiter_internal_notation "q" delimiter_internal_notation,input_quote, fldStr)
gsub(delimiter_internal_notation "r" delimiter_internal_notation,"\r", $i)
}
output_separator = OFS
if (i == max_col() && !add_new_column){
output_separator = ORS
}
if (!save){
printf "%s" output_separator,transform_header($i,i)
}else{
printf "%s" output_separator,transform_header($i,i) > path_output
}
}
}
add_column()
}else{
gsub(input_quote,output_quote,$0)
if (!save){
printf "%s" ORS,$0
}else{
printf "%s" ORS,$0 > path_output
}
add_column()
}
}
function min_col(){
if (1 > col_start){
return 1
}else{
return col_start
}
}
function max_col(){
if (col_end == INF){
return NF
}else{
if (NF < col_end){
return NF
}else{
return col_end
}
}
}
function add_column(){
if (add_new_column){
if (line_number() == 0){
printf "%s" ORS,name_new_column
}else{
printf "%s" ORS,""
}
}
}
function transform_header(field,column){
return field
}
#################################
#################################
function transform_column(field,column){
return field
}
function transform_digit(field,column){
field = transform_column(field,column)
return field
}
function transform_string(field,column){
field = transform_column(field,column)
return field
}
function transform_line(line){
return line
}
function condition(){
statement = 1
statement = ($3 ~ /^"?crlsb"?$/)
return statement
}
BEGIN {
PREC=100
OFMT="%.10g"
CONVFMT="%.10g"
is_first_command = 0
is_last_command = 0
delimiter_internal_notation = ":"
if (FS == delimiter_internal_notation){
delimiter_internal_notation = "#"
}
input_quote = "\""
output_quote = "\""
level_simplicity_csv = 2
FS =","
OFS=","
original_level_simplicity_csv = 0
if (level_simplicity_csv <= 1){
FPAT="([^" FS "]*)|(" input_quote "([^" input_quote "]|" input_quote input_quote ")+" input_quote ")"
}
records_skipped = 0
minimal_string_quoting = 1
full_string_quoting = !minimal_string_quoting
is_unix_input = 1
is_unix_output = 1
if (is_unix_input){
RS = "\n"
}else{
RS = "\r\n"
}
if (is_unix_output){
ORS = "\n"
}else{
ORS = "\r\n"
}
null_input_string = "NULL"
null_output_string = ""
header = 1
output_header = 1
if (!header){
output_header = 0
}
add_index = 0
index_based_on = 0
line_based_code = 0
fields_based_code = !line_based_code
col_start = 1
col_end = INF
row_start = 1
row_end = INF
number_files_input = 1
input_file_rows = 1
input_file_cols = !input_file_rows
find_unique = 0
print_fields = 1
save = 0
path_output = ""
sort = 0
sort_column = 1
num_cores = 4
sort_in_parallel = 1
calculate_shape = 0
check_complexity = 0
add_new_column = 0
name_new_column = ""
if (check_complexity){
has_maximum_complexity = 0
print_fields = 0
}
quotes_have_been_removed = 0
number_fields = 0
continue_next_line = 0
}
{
if (!calculate_shape){
if (check_complexity || number_files_input == 1){
if (check_complexity){
if (line_number() == 0){
number_fields = max_col()
has_maximum_complexity = 0
}else{
if (max_col() < number_fields){
has_maximum_complexity = 1
exit
}
}
}else{
if (line_number() == 0 && header){
number_fields = NF
if (output_header){
print_header()
}
}else{
if (fields_based_code){
if (line_number() >= row_start){
for (i=min_col();i<=max_col();i++) {
if (condition()){
print_field($i,i)
}
}
i=min_col()
if (min_col() == max_col() && $i == ""){
print(":n:")
}
add_column()
}
}else{
if (line_number() >= row_start){
if (condition()){
print_line($0)
}
}
}
if (row_end != INF && line_number() > row_end){
exit
}
}
}
}else{
if (FNR == NR){
v[$0] = 1
}else{
if (FNR!=NR && FNR == 1){
records_skipped = 0
}
if (line_number() == 0 && header){
number_fields = max_col()
if (output_header){
print_header()
}
}else{
if (fields_based_code){
if (line_number() >= row_start){
for (i=min_col();i<=max_col();i++) {
if (condition()){
field = $i
if (input_file_rows && v[line_number()]){
print_field(field,i)
}else{
if (input_file_cols && v[i]){
print_field(field,i)
}
}
}
}
i=min_col()
if (min_col() == max_col() && $i == ""){
print(":n:")
}
add_column()
}
}else{
if (line_number() >= row_start){
if (condition()){
if (input_file_rows && v[line_number()]){
print_line($0)
}
}
}
}
if (row_end != INF && line_number() > row_end){
exit
}
}
}
}
}
}
END{
if (find_unique){
for (key in unique){
print(key)
}
}
if (calculate_shape){
print print_line_number()
print NF
}
if (check_complexity){
print has_maximum_complexity
}
}
```
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file awk_dataframe-0.2.11.tar.gz
.
File metadata
- Download URL: awk_dataframe-0.2.11.tar.gz
- Upload date:
- Size: 37.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 72ab96a373031b09988f7d1fc5eeeed0f715869dd8b36a9bb717b0bbdf9f7afa |
|
MD5 | 351c473b61b0d6c3e9d2d1761ccb49b8 |
|
BLAKE2b-256 | 9f7a9781efde656f5a1aefa755b403a8a5c1a42e875bef8bd56d011452cec530 |
File details
Details for the file awk_dataframe-0.2.11-py3-none-any.whl
.
File metadata
- Download URL: awk_dataframe-0.2.11-py3-none-any.whl
- Upload date:
- Size: 26.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9ce030e336f94eaf6a44f3af13dce4b0dab3aa3116d85c9d9ffc0ec651114187 |
|
MD5 | fe688072fac2e45a022729961d5a65a0 |
|
BLAKE2b-256 | 17a94517d79c8138135aabbfd9757d63bcc56bccfe8f8aa1e4c14389e6c104f6 |