Skip to main content

A Pandas Enhancement

Project description

An enhancement to pandas module.
This is kungfu, with monkey-patched common methods to (Data)Frame and Series in pandas.
=============================================
jerryzhujian9_at_gmail.com
Tested under python 2.7
To see your python version
in terminal: python -V
or in python: import sys; print (sys.version)
=============================================
Install:
https://pypi.python.org/pypi/kungfu
pip install kungfu
The above command will auto take care of the following requirements
Requires pandas 0.12.0 (tested 0.12.0-2) which will also install python-dateutil(dateutil), numpy, pytz, six
Requires openpyxl for writing excel (tested with 1.5.8, version 1.6.1 or higher, but lower than 2.0.0 may also work.)
xlrd for reading excel, xlwt for writing .xls (old format) file
(pip install pandas==0.12.0; pip install openpyxl==1.5.8; pip install xlrd; pip install xlwt)


Usage:
Want pandas functions index?
http://pandas.pydata.org/pandas-docs/version/0.12.0/

Call methods
All the monkey-patched methods use UpperCase names; the original ones use lower_case names.
Generally all of the calling (monkey-patched or not) returns something and the original frame or series remains unchanged.
If user wants the original frame or series to be changed, assign the returns back.

Visualize a series as a column of a frame with the series name being the column name.
Visualize a single list as a series and therefore a column of a frame when converting a series or frame.
However, for a list of lists, Visualize each list of the list (i.e. sublist) as a row!
Memorization: list=series=column

from pandas import isnull as isna
from pandas import isnull as isnull
Frame.read = Frame.Read
Frame.save = Frame.Save
Frame.write = Frame.Save
Frame.pprint = Frame.Print
Frame.info = Frame.Print
Frame.play = Frame.Play
Frame.sel = Frame.Sel
Frame.selcol = Frame.SelCol
Frame.selrow = Frame.SelRow
Frame.delete = Frame.Del
Frame.groupv = Frame.GroupV
Frame.splith = Frame.SplitH
Frame.recols = Frame.ReorderCols
Frame.rerows = Frame.ReorderRows
Frame.rncols = Frame.RenameCols
Frame.newcol = Frame.NewCol
Frame.findval = Frame.FindVal
Frame.countval = Frame.CountVal
Frame.cnames = Frame.Columns
Frame.names = Frame.Columns
Frame.rnames = Frame.Indices
Frame.num = Frame.ToNum
Frame.maskout = Frame.Maskout
# Frame.fillna = Frame.FillNA

Series.play = Series.Play
Series.pprint = Series.Print
Series.info = Series.Print
Series.sel = Series.Sel
Series.countval = Series.CountVal
# Series.unique = Series.Uniques --existing method
Series.len = Series.Size
Series.names = Series.Indices
Series.rnames = Series.Indices
Series.cames = Series.Indices
Series.num = Series.ToNum
Series.str = Series.ToStr
Series.maskout = Series.Maskout
# Series.fillna = Series.FillNA

mergelr = MergeLR
concatvh = ConcatVH
Frame.tolist(), Frame.list() <--homebrew method
Series.tolist(), Series.list() <--exisiting method in pandas

# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Frame IO, Frame info
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
frame = Frame.Read(path, sep=",", header=0)
Read data into a frame. This is a class method.
Args:
path, a text or csv file
sep, character used to separate the file, e.g., ' '
header, the row number with header (0=first row, None=no header at all)
Returns:
a Frame object with the data.
Raises:
None

frame = Frame.Readx(path, sheetname='Sheet1', header=0)
Read xlsx, xls file into a frame
header, the row number with header (0=first row, None=no header at all)

frame.Print([column=None])
Print out common useful information of a frame.
Args:
optional column name or index; if passed, print out information of only that column instead of that of whole frame.
Returns:
None
Raises:
None

series.Print()
Print out useful information of a series.
Args:
None
Returns:
None
Raises:
None

frame.Save(outputFile[, columns=None])
Save the content of a frame to an excel file or csv file.
Args:
the path to the excel file (xlsx), or comma separated (csv); explicitly specify .xlsx or .csv
optional columns, the order and names of columns to save
1) can reorder or omit some of the frame's original columns
2) if skipped, use the frame's original order and names
3) example: columns=["sbj","Wordpair","UResp","recalled","stage"]
Returns:
None
Raises:
None

f = Frame.Play()
Generate a predefine frame for testing, debugging, playing and etc. This is a class method.
Args:
None
Returns:
a Frame object
Raises:
None

s = Series.Play()
Generate a predefine series for testing, debugging, playing and etc. This is a class method.
Args:
None
Returns:
a Frame object
Raises:
None


# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Selection, grouping
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
frame.Sel(*args)
Select certain columns and rows from a frame.
Args:
1) [] for columns FIRST, {} or [] for rows SECOND
2) if only pass [], treated as columns; if only {}, as rows
3) additionally, a single int(or str) can be parsed to a list of that int(or str)
4) int alone or in a list treated as (column/row) index; str alone or in a list as (column/row) name
5) does NOT support slice; however, one can use the function range(start,stop) to generate a list
6) better see some examples:
('Session') column "Session"
(1) column 1 (index based, starts from 0)
(['Session', 1])
columns "Session" and 1
the order of selected columns does not have to follow the order in the original frame! could be [2,4,0,1]
({'Session': 1}) all columns where "Session" == 1
('Session', '0') <---raise error because of '0'
('Session', 0) column "Session" and row 0 (index based, starts from 0)
('Session', [0, 1]) column "Session" and rows 0 and 1
('Session', {'Session': 1}) column "Session" where "Session" == 1
(0, '0') <---raise error because of '0'
(0, 0) column 0 and row 0
(0, [0, 1]) column 0 and rows 0 and 1
(0, {'Session': 1}) column 0 where "Session" == 1
(['Session', 1], '0') <---raise error because of '0'
(['Session', 1], 0) columns "Session" and 1, and row 0
(['Session', 1], [0, 1]) columns "Session" and 1, and, rows 0 and 1
(['Session', 1], {"Subject":5101,"Procedure[Trial]":["PresentPair","PresentPair1"]})
columns "Session" and 1 where "Subject" == 5101 and ("Procedure[Trial]" == "PresentPair" or "Procedure[Trial]" == "PresentPair1")
([],0) all columns and row 0
Returns:
a Frame object even for a frame with only one column or row
to select a column or row as a series, use SelCol() or SelRow()
However, if only a cell is selected, returns the value of that cell with being its own data type.
Raises:
None

frame.SelCol(column)
Select a single column from frame.
Args:
a int representing column index or a string representing column name
Returns:
a single column as a series
Raises:
None

frame.SelRow(row)
Select a single row from frame.
Args:
a int representing row index or a string representing row name
Returns:
a single row as a series
Raises:
None

frame.Del(*args):
Delete columns and/or rows from frame.
Args:
Same as those for Sel
if a to-be-deleted column or row does not exist in the frame, it will be ignored
Returns:
a Frame object without the passed columns and rows
when passing an empty columns/rows, this method deletes nothing and returns a Frame object of the same shape
Raises:
None

frame.GroupV(edgeMatchSeries, groupColumnName='AutoGroup')
Edge condition based grouping along the vertical direction

frame.SplitH(subFrameSize=1, resetIndex=True)
number based splitting along the horizontal direction

series.Sel(elements=[])
Select elements from a series.
Args:
choose one of the three options:
a int representing index
a string representing name
a list of int, a list of str, or a list of int and str
Returns:
a list of selected elements. If the list has only one value, return that value instead of a list.
Raises:
None



# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Reorganize
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
General notes on "join":
when joining along an axis, the index of each frame does not have to in the same order
e.g. ["a","b","c","f"] for left frame, ["b","c","a","e"] for right frame
join will match them and return the combined frame (in a certain order)

MergeLR(left, right, join='union', onKeys=[], sort=True)
Merge 2 frames in the horizontal direction.
Args:
left frame, right frame
join: "left", "right", "union","outer","inter", "inner"
when a frame column has duplicated values, it will be confusing (i.e. Cartesian product?)
onKeys:
1) a list of 2 elements, the first for the left frame, the second for the right
2) if only one element passed to the list, it is the shared column name in both frames. e.g. ["pair"], or ["@INDEX"]
3) a list is also considered as one element e.g. onKeys=[["sbj","pair"]]
4) the element could be a column name in each frame for the join to match
5) could be the same or different, e.g. ["subject","subject"], or ["name","word"]
6) a special element name "@INDEX" uses the index of the frame, e.g. ["subject","@INDEX"] or ["@INDEX","@INDEX"]
sort: whether to sort the final merged frame based on the join-key
Returns:
a merged frame
when merge on a column key rather than an index, in the merged frame, the index will be reset from 0 to n
Raises:
None

ConcatVH(frameList, axis=0, join="union", sort=False)
Concat in the vertical or horizontal direction.
When concat in the vertical direction, i.e. along index, the horizontal (i.e. along columns) is defined as join direction.
When concat in the horizontal columns direction, the join direction is the vertical index direction.
Args:
frameList should be a list of frame; if want to add a list such as [1,2,3] or a series, convert them first to a frame
always use a list when considering concat! can concat more than two frames at a time
axis: the concat direction, 0 or 1
join:
1) how the direction other than the concat direction should be handled.
2) Only handle/match index! (think of it as a specific case of merge method)
3) possible value: union,outer,inter,inner or a list
union/outer: match shared ones, preserve unmatched
inter/inner: mach shared ones, discard unmatched
or pass a list representing an index, e.g. ["a","b","c"] or frm.Indices().
With a list passed, it will perform only union with the predefined index; that is, it will ignore join being union or inter.
sort: True or False
1) The built-in concat function features:
When the concating frames have different sequences in the join direction, the join direction is sorted automatically.
When the concating frames have the same sequence in the join direction, it is not sorted.
That is, "by default" it will try to sort different, i.e.sort=True.
2) Hereby, I hacked a bit by providing this sort keyword which does not exist in the built-in concat.
Set sort=False to disable this feature. So the results are always not sorted, i.e. preserving the original order as much as possible.
Example: Frame1 is CBDA, Frame2 is CBEDA, concated is then CBDAE (E shows up later in Frame2, but the final order first adopts the order of the Frame1).
3) See github discussion https://github.com/pydata/pandas/issues/4588
Returns:
a Frame object.
Raises:
None

frame.ReorderCols(columns=[])
Reorder columns of a frame.
Args:
a list that has equal size to the original columns
Returns:
a Frame object
Raises:
None

frame.ReorderRows(indices=[])
Reorder the rows of a frame.
Args:
a list that has equal size to the original indices
Returns:
a Frame object
Raises:
None

frame.RenameCols(newColumns=[])
Rename the names of each column of a frame.
Args:
a list that has equal size to the original columns
a new column name could be the same as the old one (i.e. not rename)
Returns:
a Frame object
Raises:
None

frame.NewCol([newColumnName="NewColumn"[, newColumnValue=NA]])
Append a new column to the frame.
Args:
new column name in string
new column default value
e.g.frame = frame.NewCol("Wordpair",frame.SelCol("W1") + "-" + frame.SelCol("W2"))
Returns:
a Frame object
this time, the original frame is changed so don't have to assign back to a new frame. but it doesn't hurt
Raises:
The new column name should not exist already, otherwise it would overwrite the values of the existing column.



# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Stats, Processing
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
IsNA(object)
Checks whether a string, int, frame, series and etc is np.nan; returns 0 or 1
Attention, None is a builtin python datatype, IsNA(None) returns true, but series.isin([NA]) will return false

frame.FindVal(valToFind)
Print out all columns containing valToFind; only allows a single value to be passed each time

frame.CountVal(valToCount)
Count all occurrence of the valToCount within the frame, can count np.nan, only allows a single value to be passed each time

frame.Columns()
Returns a list of column names, takes no argument

frame.Indices()
Returns a list of indices, takes no argument

frame.ToNum()
Converts possible numbers to num type if they are not, takes no argument
1) if a column has any number or number-like string, the whole column will be converted to number. Anything that is not a number will be NA
>>> s=Series(["2323","a"])
>>> s
Out[10]:
0 2323
1 a
dtype: object
>>> s.ToNum()
Out[11]:
0 2323
1 NaN
dtype: float64
2) if a column is purely string, it will remain as string.
>>> t=Series(["adb","s3","sae"])
>>> t
Out[13]:
0 adb
1 s3
2 sae
dtype: object
>>> t.ToNum()
Out[14]:
0 adb
1 s3
2 sae
dtype: object

frame.Maskout(condition)
1) if a frame cell's value matches the condition, then masked as NA; if not, preserve the original value
2) returned as a copy; the original frame remains unchanged
3) condition could be:
a str, int, list, list of str and int, dict, condition array such as frame > 0
only one parameter should be passed to the function
Here are some examples:
1 --> parsed to frame.isin([1])
"pad" --> parsed to frame.isin(["pad"])
[1,"pad"] --> parsed to frame.isin([1,"pad"])
{"ColumnName":123} --> parsed to frame.isin({"ColumnName":[123]})
{"ColumnName":["pad","think"]} --> parsed to frame.isin({"ColumnName":["pad","think"]})
frame > 0 (attendition: if a cell is a string then a string is larger than a number)
don't pass NA, i.e. Maskout(NA). Why would you do this?

frame.FillNA( *args, **kwargs)
a re-wrapper of the same frame.fillna()

frame.mean(axis=0),frame.median(axis=0),frame.sum(axis=0)
axis : index (0), columns (1), by default NA is skipped when calculating, which is nice

frame.corr(method='')
compute a correlation matrix between all two possible columns; NA excluded
method could be 'pearson', 'kendall', 'spearman'

series.CountVal(valToCount)
count all occurrence, can count np.nan as well

series.Uniques()
returns a list of unique values in a series, takes no argument. Frame does not have a unique method

series.Size()
returns the number of values in a series (i.e. series length), takes no argument. Frame does not have a size method

series.Indices()
returns a list of indices of the series, takes no argument

series.ToNum()
convert possible numbers to num type if they are not
refer to frame.ToNum()

series.Maskout(condition)
internally use the Frame.Maskout(); so condition is of the same type

series.FillNA( *args, **kwargs)
a re-wrapper of the same series.fillna()

series.mean(axis=0),series.median(axis=0),series.sum(axis=0)
axis : index (0) only, by default NA is skipped which is nice

series.corr(other, method='')
computer the correlation of a series with another series; NA excluded
method could be 'pearson', 'kendall', 'spearman'

series = series.ToStr()
returns the string representation of each element in a series. Frame does not have theses methods.
then can apply stringmethods, Maskout method to further process
e.g. series.replace(pattern,replace)
note: some of these methods are being deprecated
cat()
center()
contains()
count()
decode()
encode()
endswith()
extract()
findall()
get()
join()
len()
lower()
lstrip()
match()
pad()
repeat()
replace()
rstrip()
slice()
slice_replace()
split()
startswith()
strip()
title()
upper()

Loop how to:
for columnName, columnSeries in Frame.iteritems():
columnIndex = Frame.Columns().index(colName)
columnUniques = columnSeries.Uniques()
for rowIndex, rowSeries in Frame.iterrows():
for index, value in Series.iteritems():

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kungfu-1.3.2.tar.gz (25.8 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page