caspailleur·PyPI

Minimalistic Python package for mining many concise data representations. Part of the SmartFCA project.

Caspailleur is a Python package for mining concepts and implications in binary data with the Formal Concept Analysis (FCA) framework. It is part of the SmartFCA ANR project.

Get started

The stable version of the package can be installed from PyPI with:

pip install caspailleur

The latest version can be installed from the GitHub repository:

pip install caspailleur@git+https://github.com/EgorDudyrev/caspailleur

Analysis example

Glossary

The field of Formal Concept Analysis has many mathematical terms and some conflicting notation traditions. Here is the glossary used throughout the caspailleur package: Glossary.md.

Data description

Let us study the "Famous Animals" dataset from the FCA repository:

import caspailleur as csp
df, meta = csp.io.from_fca_repo('famous_animals_en')

print(meta)
print(df.replace({True: 'X', False: ''}))

{ 'title': 'Famous Animals',
'source': 'Priss, U. (2006), Formal concept analysis in information science. Ann. Rev. Info. Sci. Tech., 40: 521-543. p.525',
'size': {'objects': 5, 'attributes': 6},
'language': 'English',
'description': 'famous animals and their characteristics' }

	cartoon	real	tortoise	dog	cat	mammal
Garfield	X				X	X
Snoopy	X			X		X
Socks		X			X	X
Greyfriar's Bobby		X		X		X
Harriet		X	X

The returned df object is a pandas DataFrame. Other supported context data types are listed in the Supported data formats section. Caspailleur works fast enough with datasets of hundreds of objects and dozens of attributes. Here, we choose a tiny dataset for illustrative purposes.

[!TIP] Caspailleur works only with binary data (and is optimised for it). See the Paspailleur package, which extends Caspailleur's functionality to complex non-binary data.

Mining concepts

Now we can find all concepts in the data:

concepts_df = csp.mine_concepts(df)

print(concepts_df[['extent', 'intent']].map(', '.join))

Concepts table (13 rows)

concept_id	extent	intent
0	Greyfriar's Bobby, Snoopy, Harriet, Socks, Garfield
1	Greyfriar's Bobby, Socks, Garfield, Snoopy	mammal
2	Greyfriar's Bobby, Socks, Harriet	real
3	Garfield, Snoopy	cartoon, mammal
4	Greyfriar's Bobby, Socks	mammal, real
5	Greyfriar's Bobby, Snoopy	dog, mammal
6	Socks, Garfield	cat, mammal
7	Harriet	tortoise, real
8	Snoopy	cartoon, dog, mammal
9	Garfield	cartoon, cat, mammal
10	Greyfriar's Bobby	mammal, dog, real
11	Socks	cat, mammal, real
12		mammal, cartoon, cat, dog, real, tortoise
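To make the table above concrete, here is a minimal brute-force sketch in plain Python. It is not how caspailleur computes concepts, and the helper names `extent_of` and `intent_of` are hypothetical. It closes every subset of attributes and collects the distinct (extent, intent) pairs, which is only feasible for tiny contexts such as this one.

```python
from itertools import combinations

# The "Famous Animals" context written by hand as {object: set of attributes}.
context = {
    "Garfield": {"cartoon", "cat", "mammal"},
    "Snoopy": {"cartoon", "dog", "mammal"},
    "Socks": {"real", "cat", "mammal"},
    "Greyfriar's Bobby": {"real", "dog", "mammal"},
    "Harriet": {"real", "tortoise"},
}
attributes = sorted(set().union(*context.values()))


def extent_of(description):
    """Objects that have every attribute of the description."""
    return frozenset(obj for obj, attrs in context.items() if description <= attrs)


def intent_of(objects):
    """Attributes shared by all given objects (all attributes if there are none)."""
    shared = set(attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


# Close every possible description and keep the distinct (extent, intent) pairs.
concepts = set()
for size in range(len(attributes) + 1):
    for description in combinations(attributes, size):
        extent = extent_of(set(description))
        concepts.add((extent, intent_of(extent)))

print(len(concepts))  # 13, matching the table above
```

The loop visits all 2^6 = 64 attribute subsets, so this approach stops being practical as soon as the number of attributes grows.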

The number of concepts can grow exponentially with the number of objects and attributes in the data. To keep only the most interesting concepts, specify the min_support, min_delta_stability and/or n_stable_concepts parameters:

concepts_df = csp.mine_concepts(
  df, min_support=3, min_delta_stability=1,
  to_compute=['intent', 'keys', 'support', 'delta_stability', 'sub_concepts']
)

print(concepts_df)

concept_id	intent	keys	support	delta_stability	sub_concepts
0	set()	[set()]	5	1	{1, 2}
1	{'mammal'}	[{'mammal'}]	4	2	set()
2	{'real'}	[{'real'}]	3	1	set()

Mining implications

For many datasets, the number of concepts is too large to be read by hand. Luckily, relationships between attributes can be described via implication bases, which usually contain far fewer elements.

implications_df = csp.mine_implications(df)

print(implications_df[['premise', 'conclusion', 'support']])

Implications table (4 rows)

implication_id	premise	conclusion	support
0	{'cartoon'}	{'mammal'}	2
1	{'tortoise'}	{'real'}	1
2	{'dog'}	{'mammal'}	2
3	{'cat'}	{'mammal'}	2

We can read the implications in the table and find out dependencies in the data. For example:

every famous cartoon animal is a mammal
(from impl. 0: cartoon -> mammal);
one can find famous tortoises only in real life
(from impl. 1: tortoise -> real);

Set min_support=0 to see implications on contradicting subsets of attributes, e.g.:

nobody is a dog and a cat at the same time
(from: dog, cat -> ... with support 0).
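The readings above can be checked directly against the data. The following is a minimal sketch in plain Python with hypothetical helper names: an implication holds when every object matching the premise also has the conclusion, and its support is the number of objects matching the premise.

```python
# The "Famous Animals" context written by hand as {object: set of attributes}.
context = {
    "Garfield": {"cartoon", "cat", "mammal"},
    "Snoopy": {"cartoon", "dog", "mammal"},
    "Socks": {"real", "cat", "mammal"},
    "Greyfriar's Bobby": {"real", "dog", "mammal"},
    "Harriet": {"real", "tortoise"},
}


def extent_of(description):
    """Objects that have every attribute of the description."""
    return {obj for obj, attrs in context.items() if description <= attrs}


def holds(premise, conclusion):
    """True if every object matching the premise also has the conclusion."""
    return all(conclusion <= context[obj] for obj in extent_of(premise))


def support(premise):
    """Number of objects matching the premise."""
    return len(extent_of(premise))


print(holds({"cartoon"}, {"mammal"}), support({"cartoon"}))  # True 2
print(holds({"mammal"}, {"cartoon"}))  # False: Socks is a real mammal
print(holds({"dog", "cat"}, {"tortoise"}), support({"dog", "cat"}))  # True 0
```

The last line shows why zero-support implications are numerous: a premise that no object satisfies implies every attribute.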

Note, however, that there can be many implications with zero support, so computing them may take a long time.

If finding the full implication basis takes too much time, one can compute only some of the columns and implications:

implications_df = csp.mine_implications(
  df, basis_name='Canonical', unit_base=True,
  to_compute=['premise', 'conclusion', 'extent'],
  min_support=2,
)

print(implications_df)

implication_id	premise	conclusion	extent
0	{'cat'}	mammal	{'Socks', 'Garfield'}
1	{'dog'}	mammal	{"Greyfriar's Bobby", 'Snoopy'}
2	{'cartoon'}	mammal	{'Garfield', 'Snoopy'}

The supported bases are the Canonical basis (a.k.a. Pseudo-Intent or Duquenne-Guigues basis) and the Canonical Direct basis (a.k.a. Proper Premise or Karell basis). Every basis can also be transformed into a unit base, where every conclusion consists of a single attribute.
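The unit-base transformation itself is simple to sketch: each implication is split into one implication per concluded attribute. The function `to_unit_base` below is a hypothetical illustration in plain Python, not part of caspailleur, and the second input implication is an illustrative zero-support one chosen by hand.

```python
def to_unit_base(implications):
    """Split every (premise, conclusion) pair into one pair per concluded attribute."""
    return [
        (premise, attribute)
        for premise, conclusion in implications
        for attribute in sorted(conclusion)
    ]


implications = [
    (frozenset({"cartoon"}), {"mammal"}),
    # Illustrative implication: valid because no animal is both a dog and a cat.
    (frozenset({"dog", "cat"}), {"real", "tortoise"}),
]

for premise, attribute in to_unit_base(implications):
    print(sorted(premise), "->", attribute)
```

Two input implications become three unit implications here, since the second conclusion has two attributes.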

Mining descriptions

Finally, Caspailleur can output all descriptions in the data and their characteristics. Note, however, that the number of descriptions equals 2^(number of attributes).

descriptions_df = csp.mine_descriptions(df)

print('__n. attributes:__', df.shape[1])
print('__n. descriptions:__', len(descriptions_df))
print('__columns:__', ', '.join(descriptions_df.columns))
print(descriptions_df[['description', 'support', 'is_key']].head(3))

n. attributes: 6
n. descriptions: 64

columns: description, extent, intent, support, delta_stability, is_closed, is_key, is_passkey, is_proper_premise, is_pseudo_intent

description_id	description	support	is_key
0	set()	5	True
1	{cartoon}	2	True
2	{real}	3	True

Visualising concept lattice

Caspailleur does not support concept lattice visualisation (this task deserves its own package). For a basic visualisation, one can generate Mermaid diagram code. Mermaid diagrams can be rendered via the https://mermaid.live/ service or embedded in GitHub Flavored Markdown.

concepts_df = csp.mine_concepts(df, min_support=2)

# manually define what to show in the nodes of the diagram
new_intent_labels = ('<b>' + concepts_df['new_intent'].map(sorted).map(', '.join) + '</b>').replace('<b></b>', '')
old_intent_labels = (concepts_df['intent'] - concepts_df['new_intent']).map(sorted).map(', '.join)
intent_labels = (new_intent_labels + ';' + old_intent_labels).map(lambda l: ', '.join(l.strip(';').split(';')))
extent_labels = concepts_df['extent'].map(sorted).map(', '.join)

node_labels = intent_labels + '<br><br>' + extent_labels
node_labels = [l.replace(' ', '&nbsp') for l in node_labels] # replace space with non-breakable space for better Mermaid visualisation

diagram_code = csp.io.to_mermaid_diagram(node_labels, concepts_df['previous_concepts'])
print(diagram_code)

flowchart TD
A["<br><br>Garfield,&nbspGreyfriar's&nbspBobby,&nbspHarriet,&nbspSnoopy,&nbspSocks"];
B["<b>mammal</b><br><br>Garfield,&nbspGreyfriar's&nbspBobby,&nbspSnoopy,&nbspSocks"];
C["<b>real</b><br><br>Greyfriar's&nbspBobby,&nbspHarriet,&nbspSocks"];
D["<b>cartoon</b>,&nbspmammal<br><br>Garfield,&nbspSnoopy"];
E["mammal,&nbspreal<br><br>Greyfriar's&nbspBobby,&nbspSocks"];
F["<b>dog</b>,&nbspmammal<br><br>Greyfriar's&nbspBobby,&nbspSnoopy"];
G["<b>cat</b>,&nbspmammal<br><br>Garfield,&nbspSocks"];
A --- B;
A --- C;
B --- D;
B --- E;
B --- F;
B --- G;
C --- E;

If, above, you see the source of the diagram, visit the GitHub version of this ReadMe for the diagram itself. If, above, you see the diagram, go to the source code of the ReadMe for the diagram code.
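For readers curious about the shape of the diagram code, here is a minimal sketch that produces the same kind of Mermaid flowchart text from node labels and parent-child pairs. The function `to_mermaid` is a hypothetical stand-in written for illustration; it is not `csp.io.to_mermaid_diagram` and takes its edges in a different form.

```python
from string import ascii_uppercase


def to_mermaid(node_labels, edges):
    """Build Mermaid flowchart code from node labels and (parent, child) index pairs."""
    ids = ascii_uppercase  # single-letter node ids, enough for up to 26 nodes
    lines = ["flowchart TD"]
    lines += [f'{ids[i]}["{label}"];' for i, label in enumerate(node_labels)]
    lines += [f"{ids[parent]} --- {ids[child]};" for parent, child in edges]
    return "\n".join(lines)


labels = ["all&nbspanimals", "<b>mammal</b>", "<b>real</b>", "mammal,&nbspreal"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

diagram = to_mermaid(labels, edges)
print(diagram)
```

The printed text can be pasted into https://mermaid.live/ to check how it renders.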

Supported data formats

A formal context can be defined using many data types.

Below is the list of context types accepted by the high-level caspailleur functions, with an example of each:

Supported data types

PandasContextType

A binary Pandas dataframe. Can be obtained via csp.io.to_pandas function.

Example:

print(csp.io.to_pandas(df))

	cartoon	real	tortoise	dog	cat	mammal
Garfield	True	False	False	False	True	True
Snoopy	True	False	False	True	False	True
Socks	False	True	False	False	True	True
Greyfriar's Bobby	False	True	False	True	False	True
Harriet	False	True	True	False	False	False

ItemsetContextType

A list of sets of indices of True columns in the data. Can be obtained via csp.io.to_itemsets function.

Example:

print(*csp.io.to_itemsets(df), sep='\n')

{0, 4, 5}
{0, 3, 5}
{1, 4, 5}
{1, 3, 5}
{1, 2}

NamedItemsetContextType

A triplet: (ItemsetContextType, object names, attribute names). Can be obtained via csp.io.to_named_itemsets function.

Example:

print(*csp.io.to_named_itemsets(df), sep='\n')

[{0, 4, 5}, {0, 3, 5}, {1, 4, 5}, {1, 3, 5}, {1, 2}]
['Garfield', 'Snoopy', 'Socks', "Greyfriar's Bobby", 'Harriet']
['cartoon', 'real', 'tortoise', 'dog', 'cat', 'mammal']

BitarrayContextType

A list of bitarrays where every bitarray represents the "active" attributes in an object's description. Can be obtained via csp.io.to_bitarrays function.

Example:

print(*csp.io.to_bitarrays(df), sep='\n')

bitarray('100011')
bitarray('100101')
bitarray('010011')
bitarray('010101')
bitarray('011000')

NamedBitarrayContextType

A triplet: (BitarrayContextType, object names, attribute names). Can be obtained via csp.io.to_named_bitarrays function.

Example:

print(*csp.io.to_named_bitarrays(df), sep='\n')

[bitarray('100011'), bitarray('100101'), bitarray('010011'), bitarray('010101'), bitarray('011000')]
['Garfield', 'Snoopy', 'Socks', "Greyfriar's Bobby", 'Harriet']
['cartoon', 'real', 'tortoise', 'dog', 'cat', 'mammal']

BoolContextType

A list of objects' descriptions where every description is a list of bool values. Can be obtained via csp.io.to_bools function.

Example:

print(*csp.io.to_bools(df), sep='\n')

[True, False, False, False, True, True]
[True, False, False, True, False, True]
[False, True, False, False, True, True]
[False, True, False, True, False, True]
[False, True, True, False, False, False]

NamedBoolContextType

A triplet: (BoolContextType, object names, attribute names). Can be obtained via csp.io.to_named_bools function.

Example:

print(*csp.io.to_named_bools(df), sep='\n')

[[True, False, False, False, True, True], [True, False, False, True, False, True], [False, True, False, False, True, True], [False, True, False, True, False, True], [False, True, True, False, False, False]]
['Garfield', 'Snoopy', 'Socks', "Greyfriar's Bobby", 'Harriet']
['cartoon', 'real', 'tortoise', 'dog', 'cat', 'mammal']

DictContextType

A dictionary where every key is an object's name and every value is the object's description, represented as a set of attribute names. Can be obtained via csp.io.to_dictionary function.

Example:

print(csp.io.to_dictionary(df))

{'Garfield': {'cartoon', 'mammal', 'cat'},
'Snoopy': {'dog', 'cartoon', 'mammal'},
'Socks': {'real', 'mammal', 'cat'},
"Greyfriar's Bobby": {'real', 'mammal', 'dog'},
'Harriet': {'real', 'tortoise'}
}
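The formats above all carry the same information. As a rough illustration in plain Python (without the `csp.io` converters), the sketch below starts from the bool representation and derives the itemset, bit-string and dictionary forms shown in the examples.

```python
object_names = ["Garfield", "Snoopy", "Socks", "Greyfriar's Bobby", "Harriet"]
attribute_names = ["cartoon", "real", "tortoise", "dog", "cat", "mammal"]
bools = [
    [True, False, False, False, True, True],
    [True, False, False, True, False, True],
    [False, True, False, False, True, True],
    [False, True, False, True, False, True],
    [False, True, True, False, False, False],
]

# Itemsets: indices of the True columns of every row.
itemsets = [{i for i, flag in enumerate(row) if flag} for row in bools]

# Bit strings: the same rows written as '0'/'1' characters.
bit_strings = ["".join("1" if flag else "0" for flag in row) for row in bools]

# Dictionary: object name -> set of attribute names.
dictionary = {
    name: {attribute_names[i] for i in itemset}
    for name, itemset in zip(object_names, itemsets)
}

print(itemsets[0], bit_strings[0], sorted(dictionary["Garfield"]))
```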

Save and load Formal Context

A formal context can also be saved to and loaded from a .cxt formatted file or a string:

with open('context.cxt', 'w') as file:
    csp.io.write_cxt(df, file)

with open('context.cxt', 'r') as file:
    df_loaded = csp.io.read_cxt(file)

assert (df == df_loaded).all(None)
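For orientation, the sketch below writes and reads a context as text, assuming the commonly used Burmeister layout for .cxt files: a `B` line, a name line that may be empty, the object and attribute counts, a blank line, the object names, the attribute names, then one row of `X`/`.` marks per object. The functions `dumps_cxt` and `loads_cxt` are hypothetical and are not the caspailleur readers and writers, which may handle more variations of the format.

```python
def dumps_cxt(object_names, attribute_names, rows):
    """Serialise a binary context, assuming the Burmeister layout described above."""
    lines = ["B", "", str(len(object_names)), str(len(attribute_names)), ""]
    lines += list(object_names) + list(attribute_names)
    lines += ["".join("X" if flag else "." for flag in row) for row in rows]
    return "\n".join(lines) + "\n"


def loads_cxt(text):
    """Parse the text produced by dumps_cxt back into names and boolean rows."""
    lines = text.splitlines()
    n_objects, n_attributes = int(lines[2]), int(lines[3])
    names_start = 5  # skip 'B', the name line, the two counts and the blank line
    attrs_start = names_start + n_objects
    rows_start = attrs_start + n_attributes
    object_names = lines[names_start:attrs_start]
    attribute_names = lines[attrs_start:rows_start]
    rows = [
        [char == "X" for char in line]
        for line in lines[rows_start:rows_start + n_objects]
    ]
    return object_names, attribute_names, rows


objects = ["Garfield", "Harriet"]
attrs = ["cartoon", "real", "tortoise"]
table = [[True, False, False], [False, True, True]]

text = dumps_cxt(objects, attrs, table)
print(text)
```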

Approach for faster computation

Caspailleur does three things to speed up the computations:

It exploits the connections between characteristic attribute sets. For example, the function that computes proper premises takes intents and keys as inputs rather than the original binary data.
The set of intents is computed by the LCM algorithm, which is well implemented in the scikit-mine package: https://pypi.org/project/scikit-mine/.
All intrinsic computations are performed with bitwise operations provided by the bitarray package: https://pypi.org/project/bitarray/.
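To illustrate the bitwise idea from the last point, the sketch below uses plain Python integers as bit sets instead of the bitarray package. It is not caspailleur's internal code and the helper names are hypothetical. Each object is one integer, containment is tested with a single AND, and the closure of a description is the AND of all matching objects.

```python
attribute_names = ["cartoon", "real", "tortoise", "dog", "cat", "mammal"]

# One bit per attribute, leftmost bit = first attribute
# (the same strings as in the BitarrayContextType example).
object_masks = {
    "Garfield": int("100011", 2),
    "Snoopy": int("100101", 2),
    "Socks": int("010011", 2),
    "Greyfriar's Bobby": int("010101", 2),
    "Harriet": int("011000", 2),
}
n_attributes = len(attribute_names)
ALL_ATTRIBUTES = (1 << n_attributes) - 1


def to_mask(names):
    """Encode attribute names as an integer bit set."""
    mask = 0
    for name in names:
        mask |= 1 << (n_attributes - 1 - attribute_names.index(name))
    return mask


def to_names(mask):
    """Decode an integer bit set back into attribute names."""
    return [
        name
        for i, name in enumerate(attribute_names)
        if (mask >> (n_attributes - 1 - i)) & 1
    ]


def closure(mask):
    """Intersect (bitwise AND) all objects whose bits include the description."""
    closed = ALL_ATTRIBUTES
    for object_mask in object_masks.values():
        if (object_mask & mask) == mask:
            closed &= object_mask
    return closed


print(to_names(closure(to_mask(["cartoon"]))))  # ['cartoon', 'mammal']
```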

The diagram below presents dependencies between the characteristic attribute sets. For example, the arrow "intents -> keys" means that the set of intents is required to compute the set of keys.

  graph TD;
      S["<b>itemsets</b><br><small><tt>csp.np2bas(...)</tt></small>"];
      A["<b>intents</b><br><small><tt>csp.list_intents_via_LCM(...)</tt></small>"];
      B["<b>keys</b><br><small><tt>csp.list_keys(...)</tt></small>"];
      C["<b>passkeys</b><br><small><tt>csp.list_passkeys(...)</tt></small>"];
      D["<b>intents ordering</b><br><small><tt>csp.sort_intents_inclusion(...)</tt></small>"]; 
      E["<b>pseudo-intents</b><br><small><tt>csp.list_pseudo_intents_via_keys(...)</tt></small>"];
      F["<b>proper premises</b><br><small><tt>csp.iter_proper_premises_via_keys(...)</tt></small>"];
      G["<b>linearity index</b><br><small><tt>csp.linearity_index(...)</tt></small>"];
      H["<b>distributivity index</b><br><small><tt>csp.distributivity_index(...)</tt></small>"];
      
      S --> A
      A --> B;
      A --> C;
      A --> D;
      A --> E; 
      B --> E;  
      B --> F; A --> F; 
      A --> G; D --> G;
      D --> H; A --> H;

If the diagram does not render, visit the GitHub version of the README: https://github.com/EgorDudyrev/caspailleur
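The arrows of the diagram also give a valid order of computation. As a small sketch, the dependencies can be copied into a dictionary and sorted with the standard-library `graphlib` module (Python 3.9+). The dictionary below is transcribed by hand from the diagram and is not an object exposed by caspailleur.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on, copied from the arrows of the diagram
dependencies = {
    "intents": {"itemsets"},
    "keys": {"intents"},
    "passkeys": {"intents"},
    "intents ordering": {"intents"},
    "pseudo-intents": {"intents", "keys"},
    "proper premises": {"intents", "keys"},
    "linearity index": {"intents", "intents ordering"},
    "distributivity index": {"intents", "intents ordering"},
}

# Any order returned here computes each set only after its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Several orders are valid, so the exact output may vary; what is guaranteed is that every set appears after the sets it depends on.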

[!NOTE] Although caspailleur implements many optimisations to speed up the computations, we do not claim that it is the fastest FCA package in existence. For example, our algorithm for computing the pseudo-intent basis is far from the state of the art. Even so, we find caspailleur fast enough for comfortable everyday use.

How to cite

There are no papers written about caspailleur (yet). So you can cite the package itself.

@misc{caspailleur,
  title={caspailleur},
  author={Dudyrev, Egor},
  year={2023},
  howpublished={\url{https://www.smartfca.org/software}},
}

Funding

The package development is supported by the ANR project SmartFCA (ANR-21-CE23-0023).

SmartFCA (https://www.smartfca.org/) is a large platform that will contain many extensions of Formal Concept Analysis, including pattern structures, Relational Concept Analysis, Graph-FCA and others. Caspailleur, by contrast, is a small Python package that covers only the basic notions of FCA.

Release history

Version	Release date
0.2.2	May 5, 2025
0.2.1	Nov 25, 2024
0.2.0	Sep 17, 2024
0.1.3	Nov 17, 2023
0.1.2	Jun 16, 2023
0.1.1	May 26, 2023
0.1.0	May 26, 2023
