pysubgroup is a Python library for the data analysis task of subgroup discovery.

pysubgroup

pysubgroup is a Python package that enables subgroup discovery in the Python+pandas (scipy stack) data analysis environment. It provides a lightweight, easy-to-use, extensible and freely available implementation of state-of-the-art algorithms, interestingness measures and presentation options.

This library is still in a prototype phase. It has, however, already been employed successfully in active application projects.

Subgroup Discovery

Subgroup Discovery is a well-established data mining technique that allows you to identify patterns in your data. More precisely, the goal of subgroup discovery is to identify descriptions of data subsets that show an interesting distribution with respect to a pre-specified target concept. For example, given a dataset of patients in a hospital, we could be interested in subgroups of patients for whom a certain treatment X was successful. One example result could then be stated as:

"While in general the operation is successful in only 60% of the cases, for the subgroup of female patients under 50 who have also been treated with drug D, the success rate was 82%."

Here, the variable operation success is the target concept, and the identified subgroup has the interpretable description female=True AND age<50 AND drug_D = True. We call the single conditions (such as female=True) selection expressions or, for short, selectors. The interesting behavior of this subgroup is that the distribution of the target concept differs significantly from its distribution in the overall dataset. A discovered subgroup can also be seen as a rule:

female=True AND age<50 AND drug_D = True ==> Operation_outcome=SUCCESS
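
To make the idea concrete, the following minimal sketch evaluates that description on a small, made-up patient table using plain pandas. The data and the column names are hypothetical and chosen only to mirror the example; this is not how pysubgroup represents subgroups internally.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the example description above.
patients = pd.DataFrame({
    "female":  [True, True, True, False, False, True, False, True],
    "age":     [34, 45, 61, 29, 52, 48, 70, 39],
    "drug_D":  [True, True, False, True, False, True, True, True],
    "success": [True, True, False, False, True, True, False, False],
})

# Each selector is a boolean mask; the conjunction is their element-wise AND.
subgroup = patients["female"] & (patients["age"] < 50) & patients["drug_D"]

# Compare the target distribution inside the subgroup with the whole dataset.
overall_rate = patients["success"].mean()
subgroup_rate = patients.loc[subgroup, "success"].mean()
subgroup_size = int(subgroup.sum())

print(f"overall: {overall_rate:.2f}, subgroup: {subgroup_rate:.2f} (n={subgroup_size})")
```

A subgroup is interesting when its target rate deviates clearly from the overall rate while still covering a reasonable number of rows.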

Computationally, subgroup discovery is challenging because a large number of such conjunctive subgroup descriptions have to be considered. Finding computable criteria for which subgroups are likely to be interesting to a user is also a persistent challenge. Therefore, a lot of literature has been devoted to the topic of subgroup discovery (including some of my own work). Recent overviews on the topic are for example:

Prerequisites and Installation

pysubgroup is built to fit into the standard Python data analysis environment of the scipy stack. Thus, it only requires pandas together with numpy, scipy, and matplotlib. Visualizations are carried out with the matplotlib library.

pysubgroup consists of pure Python code. Thus, you can simply download the code from the repository and copy it into your site-packages directory. pysubgroup is also on PyPI and should be installable using: pip install pysubgroup

Note: Some users have reported that the pip installation did not work. If Python still cannot find the package after the installation, do the following steps:

  1. Find where the directory site-packages is.
  2. Copy the folder pysubgroup, which contains the source code, into the site-packages directory. (WARNING: This is not the main repository folder. The pysubgroup folder is inside the main repository folder, at the same level as doc)
  3. Now you can import the module with import pysubgroup.
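
For step 1, a small sketch using only the standard library shows where the current interpreter looks for pure-Python packages. The check for a pysubgroup folder at the end is an illustrative assumption about where the copied folder would end up.

```python
import sysconfig
from pathlib import Path

# Directory where pure-Python packages are installed for this interpreter.
site_packages = Path(sysconfig.get_paths()["purelib"])
print(site_packages)

# After copying, the package folder would be expected at this location.
expected = site_packages / "pysubgroup"
print(expected, "exists" if expected.is_dir() else "not found")
```

Run this with the same Python interpreter you use for your analysis, since each environment has its own site-packages directory.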

How to use:

A simple use case (here using the well-known Titanic data) can be created in just a few lines of code:

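import pysubgroup as ps

# Load the example dataset
from pysubgroup.datasets import get_titanic_data

data = get_titanic_data()

# Target concept: passengers with Survived == True
target = ps.BinaryTarget('Survived', True)
# Basic selectors for every column except the target itself
searchspace = ps.create_selectors(data, ignore=['Survived'])
# Bundle data, target, search space, and search parameters
task = ps.SubgroupDiscoveryTask(
    data,
    target,
    searchspace,
    result_set_size=5,
    depth=2,
    qf=ps.WRAccQF())
# Run a depth-first search over the task
result = ps.DFS().execute(task)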

The first line imports the pysubgroup package. The following lines load an example dataset (the popular Titanic dataset).

Thereafter, we define a target, i.e., the property we are mainly interested in ('Survived'). Then, we define the searchspace as a list of basic selectors. Descriptions are built from this searchspace. We can create this list manually, or use a utility function. Next, we create a SubgroupDiscoveryTask object that encapsulates what we want to find in our search. In particular, that comprises the target, the search space, the depth of the search (the maximum number of selectors combined in a subgroup description), and the interestingness measure for candidate scoring (here, the Weighted Relative Accuracy measure).
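
Weighted Relative Accuracy is commonly defined as the coverage of a subgroup multiplied by the difference between the target share in the subgroup and the target share in the whole dataset. The sketch below implements that textbook definition from raw counts. The function name and the counts are illustrative assumptions, and the code is not taken from pysubgroup's WRAccQF.

```python
def wracc(n_subgroup, pos_subgroup, n_total, pos_total):
    """Weighted Relative Accuracy: coverage * (subgroup rate - overall rate)."""
    if n_subgroup == 0:
        return 0.0  # an empty subgroup carries no information
    coverage = n_subgroup / n_total
    return coverage * (pos_subgroup / n_subgroup - pos_total / n_total)


# Hypothetical counts: 1000 patients with 600 successes overall (60%),
# and a subgroup of 100 patients with 82 successes (82%).
quality = wracc(100, 82, 1000, 600)
print(round(quality, 3))  # 0.022
```

The coverage factor keeps very small subgroups from dominating the ranking, even when their target share is extreme.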

The last line executes the defined task by performing a search with an algorithm, in this case depth-first search. The result of this algorithm execution is stored in a SubgroupDiscoveryResult object.
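
To illustrate what such a search does conceptually, here is a minimal pure-Python sketch of an exhaustive depth-first enumeration of conjunctions up to a maximum depth, keeping the k best by Weighted Relative Accuracy. The function dfs_search, the toy rows, and the selector representation are all assumptions made for illustration; pysubgroup's DFS implementation may be organised quite differently.

```python
import heapq


def dfs_search(rows, selectors, target, depth=2, k=3):
    """Enumerate conjunctions of up to `depth` selectors; return the k best."""
    n_total = len(rows)
    p_total = sum(target(r) for r in rows) / n_total
    best = []  # min-heap of (quality, description)

    def quality(covered):
        if not covered:
            return 0.0
        p_sub = sum(target(r) for r in covered) / len(covered)
        return len(covered) / n_total * (p_sub - p_total)

    def recurse(start, names, covered):
        for i in range(start, len(selectors)):
            name, predicate = selectors[i]
            refined = [r for r in covered if predicate(r)]
            description = names + [name]
            heapq.heappush(best, (quality(refined), " AND ".join(description)))
            if len(best) > k:
                heapq.heappop(best)  # drop the currently worst candidate
            if len(description) < depth:
                recurse(i + 1, description, refined)

    recurse(0, [], rows)
    return sorted(best, reverse=True)


# Hypothetical toy data, loosely modelled on the Titanic columns.
rows = [
    {"sex": "female", "pclass": 1, "survived": True},
    {"sex": "female", "pclass": 3, "survived": True},
    {"sex": "female", "pclass": 3, "survived": False},
    {"sex": "male", "pclass": 1, "survived": True},
    {"sex": "male", "pclass": 3, "survived": False},
    {"sex": "male", "pclass": 3, "survived": False},
]
selectors = [
    ("sex==female", lambda r: r["sex"] == "female"),
    ("sex==male", lambda r: r["sex"] == "male"),
    ("pclass==1", lambda r: r["pclass"] == 1),
    ("pclass==3", lambda r: r["pclass"] == 3),
]

top = dfs_search(rows, selectors, target=lambda r: r["survived"])
for q, description in top:
    print(f"{q:.4f}  {description}")
```

This version visits every candidate without pruning, which is only feasible for tiny search spaces; it is meant to show the shape of the problem rather than an efficient algorithm.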

To just print the result, we could for example do:

print(result.to_dataframe())

to get:

   quality   description
0  0.132150  Sex==female
1  0.101331  Parch==0 AND Sex==female
2  0.079142  Sex==female AND SibSp: [0:1[
3  0.077663  Cabin.isnull() AND Sex==female
4  0.071746  Embarked==S AND Sex==female

Key classes

Here is an outline of the most important classes:

  • Selector: A Selector represents an atomic condition over the data, e.g., age < 50. There are several subtypes of Selectors, e.g., NominalSelector (color==BLUE), NumericSelector (age < 50) and NegatedSelector (a wrapper such as not selector1).
  • SubgroupDiscoveryTask: As mentioned before, it encapsulates the specification of how an algorithm should search for interesting subgroups.
  • SubgroupDiscoveryResult: This is the main outcome of a subgroup discovery run. You can obtain the result as a list of subgroups using to_subgroups() or as a dataframe using to_dataframe().
  • Conjunction: A conjunction is the most widely used SubgroupDescription, and indicates which data instances are covered by the subgroup. It can be seen as the left hand side of a rule.
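
The relationship between selectors, negation, and conjunctions can be sketched with a few toy classes. The names ToySelector, ToyNegation, and ToyConjunction are hypothetical stand-ins written for this illustration; they are not the library's classes and do not reflect its actual interfaces.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class ToySelector:
    """Atomic condition on one attribute, e.g. age < 50."""
    attribute: str
    op: Callable[[Any, Any], bool]
    value: Any
    symbol: str

    def covers(self, row):
        return self.op(row[self.attribute], self.value)

    def __str__(self):
        return f"{self.attribute}{self.symbol}{self.value}"


@dataclass(frozen=True)
class ToyNegation:
    """Wrapper that inverts another selector."""
    inner: Any

    def covers(self, row):
        return not self.inner.covers(row)

    def __str__(self):
        return f"NOT {self.inner}"


@dataclass(frozen=True)
class ToyConjunction:
    """A row is covered only if every contained selector covers it."""
    selectors: Tuple[Any, ...]

    def covers(self, row):
        return all(s.covers(row) for s in self.selectors)

    def __str__(self):
        return " AND ".join(str(s) for s in self.selectors)


blue = ToySelector("color", operator.eq, "BLUE", "==")
young = ToySelector("age", operator.lt, 50, "<")
description = ToyConjunction((blue, ToyNegation(young)))

print(description)
print(description.covers({"color": "BLUE", "age": 61}))
print(description.covers({"color": "BLUE", "age": 30}))
```

Keeping the description separate from the data makes it possible to apply the same description to any dataset that has the referenced attributes.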

License

We are happy about anyone using this software. Thus, this work is put under an Apache license. However, if this constitutes any hindrance to your application, please feel free to contact us; we are sure that we can work something out.

Copyright 2016-2019 Florian Lemmerich

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Warning

  • GP-growth is in an experimental stage.

Cite

If you are using pysubgroup for your research, please consider citing our demo paper:

Lemmerich, F., & Becker, M. (2018, September). pysubgroup: Easy-to-use subgroup discovery in python. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECMLPKDD). pp. 658-662.

BibTeX:

@inproceedings{lemmerich2018pysubgroup,
  title={pysubgroup: Easy-to-use subgroup discovery in python},
  author={Lemmerich, Florian and Becker, Martin},
  booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
  pages={658--662},
  year={2018}
}

Note

This project has been set up using PyScaffold 4.5. For details and usage information on PyScaffold see https://pyscaffold.org/.
