Skip to main content

Minimalistic python package for mining many concise data representations. Part of SmartFCA project

Project description

PyPi GitHub Workflow Licence LORIA SmartFCA

caspailleur

Minimalistic python package for mining many concise data representations. Part of SmartFCA project.

Get started

The package can be installed from PyPI with:

pip install caspailleur

Run example

# Load binary dataframe
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/EgorDudyrev/FCApy/main/data/animal_movement.csv', index_col=0)

# Explore the dataset
import caspailleur as csp
data_dict = csp.explore_data(df.values)
print(data_dict.keys())

['intents', 'keys', 'passkeys', 'pseudo_intents', 'proper_premises', 'intents_ordering', 'linearity', 'distributivity']

If you need only some of the fields from data_dict, please take a look inside explore_data function.

Elaborate on results

Visualize the output

By default, caspailleur outputs the data stored in frozensets. So let us drop all mentions of frozensets from the output to make it more concise.

# Prettifying the output
import re
to_print = '\n'.join([f"{k}: {v}" for k, v in data_dict.items()])
to_print = to_print.replace('frozenset()', 'set()')
for _ in re.findall(r"frozenset\(.+?\)", to_print):
    to_print = re.sub(r"frozenset\((.+?)\)", r"\g<1>", to_print)
print(to_print)

intents: [set(), {0}, {1}, {2}, {0, 1}, {0, 3}, {1, 2}, {0, 1, 2, 3}]
keys: {set(): 0, {0}: 1, {1}: 2, {2}: 3, {3}: 5, {0, 1}: 4, {0, 2}: 7, {1, 2}: 6, {1, 3}: 7, {2, 3}: 7}
passkeys: {set(): 0, {0}: 1, {1}: 2, {2}: 3, {3}: 5, {0, 1}: 4, {0, 2}: 7, {1, 2}: 6, {1, 3}: 7, {2, 3}: 7}
pseudo_intents: {{3}: 5, {0, 2}: 7, {0, 1, 3}: 7}
proper_premises: {{3}: 5, {0, 2}: 7, {1, 3}: 7, {2, 3}: 7}
intents_ordering: [set(), {0}, {0}, {0}, {1, 2}, {1}, {2, 3}, {4, 5, 6}]
linearity: 0.6428571428571429
distributivity: 0.75

Dataset

The example dataset contains 16 rows (a.k.a. objects) and 4 columns (a.k.a. attributes). The rows represent animals, and the columns show the actions the animals can perform. For example, "dove" can "fly", but cannot "hunt".

print(df.to_markdown().replace('1','X').replace('0',' '))
fly hunt run swim
dove X
hen
duck X X
goose X X
owl X X
hawk X X
eagle X X
fox X X
dog X
wolf X X
cat X X
tiger X X
lion X X
horse X
zebra X
cow

Verbose functions

First, let us define functions to match column indices with column names.

def verbose(indices, names, empty_symbol='∅'):
    return ', '.join([names[i] for i in sorted(indices)]) if indices else empty_symbol

def unpack_gens_dict(gens_dict, intents, show_difference: bool = True):
    dct = {k: intents[intent_i] for k, intent_i in gens_dict.items()}
    if show_difference:
        return {k: v-k for k, v in dct.items()}
    return dct

Intents

Intents are maximal attribute sets that describe specific subsets of objects. Intents are also known as "closed descriptions" and "closed itemsets".

print('\n'.join([verbose(intent, df.columns) for intent in data_dict['intents']]))


fly
hunt
run
fly, hunt
fly, swim
hunt, run
fly, hunt, run, swim

For example, attributes "fly, swim" are all the attributes that describe "duck, goose".

Intents ordering

Intents can be ordered by set inclusion operation. Their order can be represented with line diagram:

graph TD; A[fa:fa-empty-set];
B[fly];
C[hunt];
D[run];
E[fly, hunt];
F[fly, swim];
G[hunt, run];
H[fly, hunt, run, swim];A --> B;
A --> C;
A --> D;
B --> E;
C --> E;
B --> F;
C --> G;
D --> G;
E --> H;
F --> H;
G --> H;

In case the diagram is not compiling, visit the GitHub version of README: https://github.com/EgorDudyrev/caspailleur

The diagram was constructed with the following code:

def construct_mermaid_diagram(ordering, intents):
    node_names = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    defining_nodes = '\n'.join([
        f'{node_name}[{verbose(intent, df.columns, empty_symbol="fa:fa-empty-set")}];'
        for node_name, intent in zip(node_names, intents)]
    )

    defining_edges = '\n'.join([
        f'{node_names[parent_i]} --> {node_names[intent_i]};'
        for intent_i, parents in enumerate(ordering) for parent_i in parents]
    )
    
    diagram = f"graph TD; "+defining_nodes + defining_edges
    return diagram

print(construct_mermaid_diagram(data_dict['intents_ordering'], data_dict['intents']))

Keys and passkeys

Keys are minimal subsets of attributes that describe specific subsets of objects. And passkeys are keys of minimal cardinality.

So keys and passkeys are equivalent to intents (w.r.t. what objects they describe), but are smaller in size.

print('\n'.join([
    verbose(k, df.columns)+' ~ '+verbose(v, df.columns)
    for k, v in unpack_gens_dict(data_dict['keys'], data_dict['intents'], show_difference=False).items()
    if k != v
]))

swim ~ fly, swim
fly, run ~ fly, hunt, run, swim
hunt, swim ~ fly, hunt, run, swim
run, swim ~ fly, hunt, run, swim

Here are examples of keys in the dataset that differ from their corresponding intents. For example, both "swim" and "fly, swim" describe the same objects "duck, goose" (so they are equivalent). But the former is a minimal subset (therefore a key), and the latter is a maximal subset (therefore an intent).

In this example, the sets of keys and passkeys are the same. But they can differ on bigger datasets.

Proper premises

The set of proper premises form a direct (or iteration-free) base of implications. Thus, al implications in the dataset can be obtained with a single application of Armstrong rules to the proper premise implications.

print('\n'.join([
    verbose(k, df.columns)+' -> '+verbose(v, df.columns)
    for k, v in unpack_gens_dict(data_dict['proper_premises'], data_dict['intents'], show_difference=True).items()
]))

swim -> fly
fly, run -> hunt, swim
hunt, swim -> fly, run
run, swim -> fly, hunt

Example shows that, according to the dataset, every animal who can swim can fly. And every animal who can fly and run can also hunt and swim.

Pseudo-intents

Pseudo-intents are subsets of attributes. The set of pseudo-intents forms an implication basis of minimum cardinality.

print('\n'.join([
    verbose(k, df.columns)+' -> '+verbose(v, df.columns)
    for k, v in unpack_gens_dict(data_dict['pseudo_intents'], data_dict['intents'], show_difference=True).items()
]))

swim -> fly
fly, run -> hunt, swim
fly, hunt, swim -> run

Note that there are 4 proper premises in the dataset, and only 3 pseudo-intents. So the set of pseudo-intents gives smaller amount of implication.

Complexity indices

Complexity indices are FCA-based tools to measure the complexity of the dataset.

Linearity index shows the percentage of comparable pairs of intents in a lattice. And distributivity index shows the percentage of pairs of intents, such that their union is also an intent.

for k in ['linearity', 'distributivity']:
    print(k, data_dict[k])

linearity 0.6428571428571429
distributivity 0.75

Outline

Caspaiileur is a python package designed to mine many characteristic attribute sets from data at once with high speed.

These sets are:

  • intents (a.k.a. closed itemsets, closed descriptions),
  • keys (a.k.a. minimal generators),
  • passkeys (a.k.a. mimimum generators),
  • pseudo-intents, and
  • proper premises.

Also, caspailleur contains functions to compute linearity and distributivity indices to measure the complexity of the data.

Mathematical definitions of intents, keys and others are presented in the paper: Buzmakov, A., Dudyrev, E., Kuznetsov, S. O., Makhalova, T., & Napoli, A. Data complexity: An FCA-based approach https://hal.science/hal-03970678v1. Definitions in a form of Python code are given in "definitions" module: caspailleur/definitions.py

Approach for faster computation

Caspailleur does three things to fasten up the computations:

  1. It exploits the connections between characterisic attribute sets.
    E.g. a function to compute proper premises takes intents and keys as inputs, and not the original binary data.
  2. The set of intents is computed by LCM algorithm
    well-implemented in scikit-mine package: https://pypi.org/project/scikit-mine/;
  3. All intrinsic computations are performed with bitwise operations
    provided by bitarray package: https://pypi.org/project/bitarray/

The diagram below presents dependencies between the characteristic attribute sets. For example, the arrow "intents -> keys" means that the set of intents is required to compute the set of keys.

  graph TD;
      S["<b>itemsets</b><br><small><tt>csp.np2bas(...)</tt></small>"];
      A["<b>intents</b><br><small><tt>csp.list_intents_via_LCM(...)</tt></small>"];
      B["<b>keys</b><br><small><tt>csp.list_keys(...)</tt></small>"];
      C["<b>passkeys</b><br><small><tt>csp.list_passkeys(...)</tt></small>"];
      D["<b>intents ordering</b><br><small><tt>csp.sort_intents_inclusion(...)</tt></small>"]; 
      E["<b>pseudo-intents</b><br><small><tt>csp.list_pseudo_intents_via_keys(...)</tt></small>"];
      F["<b>proper premises</b><br><small><tt>csp.iter_proper_premises_via_keys(...)</tt></small>"];
      G["<b>linearity index</b><br><small><tt>csp.linearity_index(...)</tt></small>"];
      H["<b>distributivity index</b><br><small><tt>csp.distributivity_index(...)</tt></small>"];
      
      S --> A
      A --> B;
      A --> C;
      A --> D;
      A --> E; 
      B --> E;  
      B --> F; A --> F; 
      A --> G; D --> G;
      D --> H; A --> H; 

In case the diagram is not compiling, visit the GitHub version of README: https://github.com/EgorDudyrev/caspailleur

How to cite

There are no papers written about caspailleur (yet). So you can cite the package itself.

@misc{caspailleur,
  title={caspailleur},
  author={Dudyrev, Egor},
  year={2023},
  howpublished={\url{https://www.smartfca.org/software}},
}

Funding

The package development is supported by ANR project SmartFCA (ANR-21-CE23-0023).

SmartFCA (https://www.smartfca.org/) is a big platform that will contain many extensions of Formal Concept Analysis including pattern structures, Relational Concept Analysis, Graph-FCA and others. While caspailleur is a small python package that covers only the basic notions of FCA.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

caspailleur-0.1.3.tar.gz (60.3 kB view hashes)

Uploaded Source

Built Distribution

caspailleur-0.1.3-py3-none-any.whl (43.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page