Package for generating and evaluating patterns in quantitative reports
data-patterns
Package for generating and evaluating data-patterns in quantitative reports
Free software: MIT/X license
Documentation: https://data-patterns.readthedocs.io.
Features
Here is what the package does:
Generating and evaluating patterns in structured datasets and exporting to Excel and JSON
Transforming generated patterns into Pandas code
Quick overview
To install the package
pip install data_patterns
To introduce the features of this package, define the following Pandas DataFrame:
import pandas as pd

df = pd.DataFrame(
    columns=['Name', 'Type', 'Assets', 'TV-life', 'TV-nonlife', 'Own funds', 'Excess'],
    data=[
        ['Insurer 1', 'life insurer', 1000, 800, 0, 200, 200],
        ['Insurer 2', 'non-life insurer', 4000, 0, 3200, 800, 800],
        ['Insurer 3', 'non-life insurer', 800, 0, 700, 100, 100],
        ['Insurer 4', 'life insurer', 2500, 1800, 0, 700, 700],
        ['Insurer 5', 'non-life insurer', 2100, 0, 2200, 200, 200],
        ['Insurer 6', 'life insurer', 9000, 8800, 0, 200, 200],
        ['Insurer 7', 'life insurer', 9000, 0, 8800, 200, 200],
        ['Insurer 8', 'life insurer', 9000, 8800, 0, 200, 200],
        ['Insurer 9', 'non-life insurer', 9000, 0, 8800, 200, 200],
        ['Insurer 10', 'non-life insurer', 9000, 0, 8800, 200, 199.99],
    ],
)
# Use the insurer name as the row index
df.set_index('Name', inplace=True)
Start by defining a PatternMiner:
import data_patterns

miner = data_patterns.PatternMiner(df)
To generate patterns use the find-function of this object:
df_patterns = miner.find({
    'name': 'equal values',
    'pattern': '=',
    'parameters': {'min_confidence': 0.5, 'min_support': 2, 'decimal': 8},
})
The result is a DataFrame with the patterns that were found. The first part of the DataFrame now contains
id | pattern_id | pattern_def | support | exceptions | confidence |
---|---|---|---|---|---|
0 | equal values | {Own funds} = {Excess} | 9 | 1 | 0.9 |
The miner finds one pattern: the ‘Own funds’ column equals the ‘Excess’ column in 9 of the 10 cases, a confidence of 90 %. In one case the pattern does not hold.
To analyze data with the generated set of patterns, call the analyze function with the DataFrame that contains the data as input:
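The numbers in the table can be checked by hand with plain pandas. The sketch below does not use data_patterns. It assumes that confidence is support divided by support plus exceptions (which matches 9, 1 and 0.9 above) and that the 'decimal' parameter means rounding to that many decimals before comparing.

```python
import pandas as pd

# Only the two columns involved in the pattern, copied from the example data.
df = pd.DataFrame(
    {
        "Own funds": [200, 800, 100, 700, 200, 200, 200, 200, 200, 200],
        "Excess": [200, 800, 100, 700, 200, 200, 200, 200, 200, 199.99],
    },
    index=[f"Insurer {i}" for i in range(1, 11)],
)

decimal = 8  # assumed meaning: round to this many decimals before comparing
holds = df["Own funds"].round(decimal) == df["Excess"].round(decimal)

support = int(holds.sum())        # rows where the pattern holds
exceptions = int((~holds).sum())  # rows where it does not hold
confidence = support / (support + exceptions)  # assumed definition

print(support, exceptions, confidence)
```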
df_results = miner.analyze(df)
The result is a DataFrame with the results. If we select result_type = False then the first part of the output contains
index | result_type | pattern_id | pattern_def | support | exceptions | confidence | P values | Q values |
---|---|---|---|---|---|---|---|---|
Insurer 10 | False | equal values | {Own funds} = {Excess} | 9 | 1 | 0.9 | 200 | 199.99 |
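To see where the False row comes from, the same check can be done with plain pandas. This sketch does not call data_patterns. Judging from the table above, it assumes that 'P values' and 'Q values' hold the left-hand and right-hand side of the pattern.

```python
import pandas as pd

# The two columns of the pattern, copied from the example data.
df = pd.DataFrame(
    {
        "Own funds": [200, 800, 100, 700, 200, 200, 200, 200, 200, 200],
        "Excess": [200, 800, 100, 700, 200, 200, 200, 200, 200, 199.99],
    },
    index=[f"Insurer {i}" for i in range(1, 11)],
)

# Rows where {Own funds} = {Excess} does not hold
holds = df["Own funds"] == df["Excess"]
violations = df.loc[~holds].rename(
    columns={"Own funds": "P values", "Excess": "Q values"}  # assumed naming
)

print(violations)
```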
Other patterns you can use are ‘>’, ‘<’, ‘<=’, ‘>=’, ‘!=’, ‘sum’, and ‘-->’.
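As an illustration of a conditional pattern, the sketch below checks the rule "IF Type = life insurer THEN TV-nonlife = 0" on the example data with plain pandas. It does not call data_patterns, and it assumes that only rows satisfying the IF part count towards support and exceptions. The rule itself is chosen for illustration.

```python
import pandas as pd

# Two columns of the example data.
df = pd.DataFrame(
    {
        "Type": ["life insurer", "non-life insurer", "non-life insurer",
                 "life insurer", "non-life insurer", "life insurer",
                 "life insurer", "life insurer", "non-life insurer",
                 "non-life insurer"],
        "TV-nonlife": [0, 3200, 700, 0, 2200, 0, 8800, 0, 8800, 8800],
    },
    index=[f"Insurer {i}" for i in range(1, 11)],
)

condition = df["Type"] == "life insurer"   # IF part
consequence = df["TV-nonlife"] == 0        # THEN part

support = int((condition & consequence).sum())
exceptions = int((condition & ~consequence).sum())
confidence = support / (support + exceptions)  # assumed definition

print(support, exceptions, confidence)
print(list(df.index[condition & ~consequence]))
```

On this data the only exception is Insurer 7, a life insurer with non-life technical provisions.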
Read the documentation for more features.
Upload to PyPI (for developers)
Change the version in setup.py and setup.cfg
Go to github.com and navigate to the repository. Click the “Releases” tab and then “Create a new release”. Define a tag version (it is best to use the same number as in the version field of setup.py, for example v0.1.15). Then click “Publish release”.
Create a PyPI account here: https://pypi.org/manage/projects/
Download twine by typing in your command prompt:
pip install twine
Get admin rights from the owner of the data_patterns package.
Delete the old files in the dist folder
Open your command prompt and go to the folder of data_patterns. Then type
python setup.py sdist
twine upload dist/*
A good reference is here: https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56
History
0.1.0 (2019-10-27)
Development release.
0.1.11 (2019-11-6)
First release on PyPI.
< 0.1.17 (2020-10-6)
Expression
You can now use expressions to find patterns. An expression is a string such as ‘{.*}={.*}’ (this one finds columns that are equal to each other). See the example in the usage documentation for how to do this, also with unknown values.
Patterns of the form IF THEN are evaluated through a pandas expression; quantitative patterns are found using numpy (quicker). A quantitative expression is split up into parts.
Function
Added the function correct_data. It corrects data based on the most common value when grouped by another column, e.g. it changes the names in a column if there are multiple names per LEI code.
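The idea behind this correction can be sketched with plain pandas. This is not the implementation of correct_data; the column names and data are hypothetical, and ties between equally common names are resolved by taking the first mode.

```python
import pandas as pd

# Hypothetical data: LEI code "A1" appears with two spellings of the name.
df = pd.DataFrame({
    "LEI": ["A1", "A1", "A1", "B2", "B2"],
    "Name": ["Insurer One", "Insurer One", "Insurer 1",
             "Insurer Two", "Insurer Two"],
})

# Most common name per LEI code
most_common = df.groupby("LEI")["Name"].agg(lambda s: s.mode().iloc[0])

# Replace every name by the most common name of its LEI code
corrected = df.assign(Name=df["LEI"].map(most_common))

print(corrected)
```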
Other
Added P and Q values to analyze
Added the highest_conf option to find the pattern with the highest confidence based on the P value.
Possible to use with EVA2 rules
0.1.17 (2020-10-6)
Parameters
‘window’ (boolean): Only compares columns in a window of n, so [column-n, column+n].
‘disable’ (boolean): If you set this to True, it will disable all tqdm progress bars for finding and analyzing patterns.
‘expres’ (boolean): By default, an expression is used directly only if it is an IF THEN statement. Otherwise it is treated as a quantitative pattern: it is split up into parts and numpy is used to find the patterns (this is quicker). Sometimes, however, you want to work with an expression directly, for example when the difference between two columns must be lower than 5%. If you set expres to True, the expression is used directly.
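One way to read the ‘window’ parameter is that each column is only compared with the columns at most n positions to its left or right. The sketch below illustrates that reading in plain Python. The helper window_pairs and the column names are hypothetical and not part of data_patterns.

```python
def window_pairs(columns, n):
    """Pair each column with the columns at most n positions away from it."""
    pairs = []
    for i, left in enumerate(columns):
        lo = max(0, i - n)
        hi = min(len(columns), i + n + 1)
        for j in range(lo, hi):
            if j != i:  # a column is not compared with itself
                pairs.append((left, columns[j]))
    return pairs


columns = ["A", "B", "C", "D"]  # hypothetical column names
pairs = window_pairs(columns, n=1)
print(pairs)
```

With four columns and n = 1 this gives 6 ordered pairs instead of the 12 pairs of a full comparison.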
Expression
You can use ABS in expressions to calculate the absolute value, for example ‘ABS({"X"} - {"Y"}) = {"Z"}’.
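What such an expression checks per row can be illustrated with plain pandas. The sketch below does not call data_patterns; the columns X, Y and Z and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data for the columns X, Y and Z
df = pd.DataFrame({
    "X": [10, 5, 8, 3],
    "Y": [4, 9, 8, 1],
    "Z": [6, 4, 0, 5],
})

# Row-wise check of ABS(X - Y) = Z
holds = (df["X"] - df["Y"]).abs() == df["Z"]

support = int(holds.sum())
exceptions = int((~holds).sum())

print(holds.tolist(), support, exceptions)
```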
cluster
You can now add the column name on which you want to cluster
Function
Convert_to_time: merge periods together by adding the suffixes (t-1) and (t) to columns.
convert_columns_to_time: Make the periods into columns so that you have years as columns.
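The effect of merging two periods with (t-1) and (t) suffixes can be sketched with a pandas self-merge. This is not the implementation of Convert_to_time; the column names Name, Period and Assets and the values are hypothetical.

```python
import pandas as pd

# Hypothetical yearly data in long format
df = pd.DataFrame({
    "Name": ["Insurer 1", "Insurer 1", "Insurer 2", "Insurer 2"],
    "Period": [2019, 2020, 2019, 2020],
    "Assets": [1000, 1100, 4000, 3900],
})

# Shift the previous period forward by one year so that year t-1 lines up with year t
previous = df.copy()
previous["Period"] = previous["Period"] + 1

# Join previous and current period; overlapping columns get the suffixes
merged = previous.merge(df, on=["Name", "Period"], suffixes=(" (t-1)", " (t)"))

print(merged)
```

Each row of the result holds the value for the current period next to the value for the previous period, so patterns across periods can be expressed between columns.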
Other
Add tqdm progress bars
0.1.18 (16-11-2020)
variables to miner
You can now pass a boolean to the miner. If it is True, the miner removes all " and ' characters from the string data. This is needed for data where names contain those characters, which would otherwise cause errors later on.
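The kind of cleaning described here can be sketched with plain pandas. This is an illustration on hypothetical data, not the code the miner runs.

```python
import pandas as pd

# Hypothetical data with quote characters in the names
df = pd.DataFrame({
    "Name": ['Insurer "A"', "O'Brien Life", "Plain Insurer"],
    "Assets": [1000, 2000, 3000],
})

# Remove every double and single quote from the string column
cleaned = df.copy()
cleaned["Name"] = cleaned["Name"].str.replace("[\"']", "", regex=True)

print(cleaned["Name"].tolist())
```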
Function to read overzicht
Changed the IF THEN expression so that we can use decimals when numeric
Parameters
‘notNaN’ (boolean): Only takes not NaN columns
Function changes
Convert_to_time: added the boolean set_year. If True, only the years are used (this is for yearly data); otherwise the whole date is kept. Defaults to True.
update_statistics: Remove patterns that contain columns which are not in the data. This is necessary for some insurers so that they do not get errors
0.1.19 (10-2-2020)
Bug fixes with expressions including regex
0.1.20 (29-4-2021)
Suppress Pandas slice error in some cases
Deleted logging.basicConfig (to avoid that initial config is overwritten)