CBRkit


Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI.

CBRkit Presentation
ICCBR 2024 Best Student Paper


CBRkit is a customizable and modular toolkit for Case-Based Reasoning (CBR) in Python. It provides a set of tools for loading cases and queries, defining similarity measures, and retrieving cases based on a query. The toolkit is designed to be flexible and extensible, allowing you to define custom similarity measures or use built-in ones. Retrieval pipelines are declared by composing these metrics, and the toolkit provides utility functions for applying them on a casebase. Additionally, it offers ready-to-use API and CLI interfaces for easy integration into your projects. The library is fully typed, enabling autocompletion and type checking in modern IDEs like VSCode and PyCharm.

To get started, we provide a demo project which contains a casebase and a predefined retriever. Further examples can be found in our tests and documentation. The following modules are part of CBRkit:

  • cbrkit.loaders: Functions for loading cases and queries.
  • cbrkit.sim: Similarity generator functions for common data types like strings and numbers.
  • cbrkit.retrieval: Functions for defining and applying retrieval pipelines.
  • cbrkit.adapt: Adaptation generator functions for adapting cases based on a query.
  • cbrkit.reuse: Functions for defining and applying reuse pipelines.
  • cbrkit.typing: Generic type definitions for defining custom functions.

Installation

The library is available on PyPI, so you can install it with pip:

pip install cbrkit

It comes with several optional dependencies for certain tasks like NLP which can be installed with:

pip install cbrkit[EXTRA_NAME,...]

where EXTRA_NAME is one of the following:

  • all: All optional dependencies
  • api: REST API Server
  • cli: Command Line Interface (CLI)
  • eval: Evaluation tools for common metrics like precision and recall
  • llm: Large Language Models (LLM) APIs like Ollama and OpenAI
  • nlp: Standalone NLP tools like levenshtein, nltk, openai, and spacy
  • timeseries: Time series similarity measures like dtw and smith_waterman
  • transformers: Advanced NLP tools based on pytorch and transformers

Loading Cases

The first step is to load cases and queries. We provide predefined functions for the most common formats like CSV, JSON, and XML. CBRkit also integrates with polars and pandas for loading data frames. The following example shows how to load cases from a CSV file using polars:

import polars as pl
import cbrkit

df = pl.read_csv("path/to/cases.csv")
casebase = cbrkit.loaders.polars(df)

When dealing with formats like JSON, the files can be loaded directly:

casebase = cbrkit.loaders.json("path/to/cases.json")

Defining Queries

CBRkit expects the type of the queries to match the type of the cases. You may define a single query directly in Python as follows:

query = {"name": "John", "age": 25}

If you have a collection of queries, you can load them using the same loader functions as for the cases.

# for polars
queries = cbrkit.loaders.polars(pl.read_csv("path/to/queries.csv"))
# for json
queries = cbrkit.loaders.json("path/to/queries.json")

In case your query collection only contains a single entry, you can use the singleton function to extract it.

query = cbrkit.helpers.singleton(queries)
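To make the idea concrete, here is a minimal sketch of what such a helper might do, assuming the query collection behaves like a mapping from keys to queries. The name singleton_sketch and the error behavior are assumptions for illustration and are not taken from CBRkit's source.

```python
from collections.abc import Mapping
from typing import Any


def singleton_sketch(collection: Mapping[Any, Any]) -> Any:
    # Hypothetical stand-in for a singleton helper (not CBRkit's code):
    # refuse anything that does not contain exactly one entry.
    if len(collection) != 1:
        raise ValueError(f"Expected exactly one entry, got {len(collection)}")
    return next(iter(collection.values()))


queries = {"q1": {"name": "John", "age": 25}}
query = singleton_sketch(queries)
print(query)  # {'name': 'John', 'age': 25}
```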

Similarity Measures and Aggregation

The next step is to define similarity measures for the cases and queries. It is possible to define custom measures, use built-in ones, or combine both.

Custom Similarity Measures

In CBRkit, a similarity measure is defined as a function that takes two arguments (a case and a query) and returns a similarity score: sim = f(x, y). It also supports batch-oriented similarity measures, which are popular in NLP pipelines, where a list of tuples is passed to the measure: sims = f([(x1, y1), (x2, y2), ...]). This generic approach allows you to define custom similarity measures for your specific use case. For instance, the following function not only checks for strict equality, but also for partial matches (e.g., x = "blue" and y = "light blue"):

def color_similarity(x: str, y: str) -> float:
    if x == y:
        return 1.0
    elif x in y or y in x:
        return 0.5

    return 0.0

Please note: CBRkit inspects the signature of custom similarity functions to perform some checks. You need to make sure that the two parameters are named x and y, otherwise CBRkit will throw an error.
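As a quick check, the function can be called directly on a few pairs. The batch helper color_similarity_batch below is a hypothetical name, introduced only to illustrate the list-of-tuples form described above; it is not part of CBRkit.

```python
from collections.abc import Sequence


def color_similarity(x: str, y: str) -> float:
    # Same logic as the custom measure defined above.
    if x == y:
        return 1.0
    elif x in y or y in x:
        return 0.5

    return 0.0


def color_similarity_batch(pairs: Sequence[tuple[str, str]]) -> list[float]:
    # Hypothetical batch variant: one score per (x, y) pair, in input order.
    return [color_similarity(x, y) for x, y in pairs]


pairs = [("blue", "blue"), ("blue", "light blue"), ("blue", "red")]
scores = color_similarity_batch(pairs)
print(scores)  # [1.0, 0.5, 0.0]
```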

Built-in Similarity Measures

CBRkit also contains a selection of built-in similarity measures for the most common data types in the module cbrkit.sim. They are provided through generator functions that allow you to customize the behavior of the built-in measures. For example, a spaCy-based embedding similarity measure can be obtained as follows:

semantic_similarity = cbrkit.sim.strings.spacy(model="en_core_web_lg")

Please note: Calling the function cbrkit.sim.strings.spacy returns a similarity function itself that has the same signature as the color_similarity function defined above.
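The generator-function pattern can be sketched in plain Python: configuration is passed to an outer function, which returns the actual two-argument measure. The name make_linear_similarity and its max_distance parameter are assumptions made for this illustration and do not mirror any specific CBRkit function.

```python
from collections.abc import Callable


def make_linear_similarity(max_distance: float) -> Callable[[float, float], float]:
    # Hypothetical generator function: configuration goes in here ...
    def similarity(x: float, y: float) -> float:
        # ... and the returned function only takes the two values to compare.
        distance = abs(x - y)
        return max(0.0, 1.0 - distance / max_distance)

    return similarity


price_similarity = make_linear_similarity(max_distance=100.0)
print(price_similarity(50.0, 75.0))  # 0.75
```

Because the returned function takes exactly x and y, it can be used anywhere a hand-written measure like color_similarity would be accepted.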

An overview of all available similarity measures can be found in the module documentation.

Global Similarity and Aggregation

When dealing with cases that are not represented through elementary data types like strings, we need to aggregate individual measures to obtain a global similarity score. We provide a predefined aggregator that transforms a list of similarities into a single score. It can be used with custom and/or built-in measures.

similarities = [0.8, 0.6, 0.9]
aggregator = cbrkit.sim.aggregator(pooling="mean")
global_similarity = aggregator(similarities)
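For intuition, mean pooling over these three scores amounts to a plain average. The snippet below reproduces that arithmetic with the standard library only; it is a sketch of the computation, not CBRkit's implementation. The minimum and maximum are shown purely for comparison as general pooling strategies.

```python
from statistics import fmean

similarities = [0.8, 0.6, 0.9]

# Averaging the local scores yields one global score.
mean_score = fmean(similarities)
print(round(mean_score, 4))  # 0.7667

# Stricter and more lenient alternatives, shown for comparison only.
print(min(similarities))  # 0.6
print(max(similarities))  # 0.9
```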

For the common use case of attribute-value based data, CBRkit provides a predefined global similarity measure that can be used as follows:

cbrkit.sim.attribute_value(
    attributes={
        "price": cbrkit.sim.numbers.linear(),
        "color": color_similarity,  # custom measure
        # ...
    },
    aggregator=cbrkit.sim.aggregator(pooling="mean"),
)

The attribute_value function lets you define measures for each attribute of the cases/queries as well as the aggregation function. It also allows you to use custom measures like the color_similarity function defined above.

Please note: The custom measure is not executed (i.e., there are no parentheses at the end), but instead passed as a reference to the attribute_value function.
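The following plain-Python sketch shows the idea behind an attribute-value measure: apply one local measure per attribute, then pool the local scores. The name attribute_value_sketch and the helper price_similarity are hypothetical, and the sketch assumes cases and queries are dictionaries that contain every configured attribute; CBRkit's own function may behave differently.

```python
from collections.abc import Callable, Mapping
from statistics import fmean
from typing import Any

SimFunc = Callable[[Any, Any], float]


def attribute_value_sketch(attributes: Mapping[str, SimFunc]) -> SimFunc:
    # Hypothetical stand-in: builds a global measure from local ones.
    def similarity(x: Mapping[str, Any], y: Mapping[str, Any]) -> float:
        # The local measures are called here, not when they are passed in.
        local_scores = [sim(x[name], y[name]) for name, sim in attributes.items()]
        return fmean(local_scores)

    return similarity


def color_similarity(x: str, y: str) -> float:
    if x == y:
        return 1.0
    elif x in y or y in x:
        return 0.5
    return 0.0


def price_similarity(x: float, y: float) -> float:
    # Illustrative local measure: linear falloff over a range of 100.
    return max(0.0, 1.0 - abs(x - y) / 100.0)


car_similarity = attribute_value_sketch(
    {"price": price_similarity, "color": color_similarity}
)

case = {"price": 50.0, "color": "light blue"}
query = {"price": 75.0, "color": "blue"}
print(car_similarity(case, query))  # 0.625
```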

You may even nest similarity functions to create measures for object-oriented cases:

cbrkit.sim.attribute_value(
    attributes={
        "manufacturer": cbrkit.sim.attribute_value(
            attributes={
                "name": cbrkit.sim.strings.spacy(model="en_core_web_lg"),
                "country": cbrkit.sim.strings.levenshtein(),
            },
            aggregator=cbrkit.sim.aggregator(pooling="mean"),
        ),
        "color": color_similarity,  # custom measure
        # ...
    },
    aggregator=cbrkit.sim.aggregator(pooling="mean"),
)

Retrieval

The final step is to retrieve cases based on the loaded queries. The cbrkit.retrieval module provides utility functions for this purpose. You first build a retrieval pipeline by specifying a global similarity function and optionally a limit for the number of retrieved cases.

retriever = cbrkit.retrieval.build(
    cbrkit.sim.attribute_value(...),
    limit=10
)

This retriever can then be applied to a casebase to retrieve cases for a given query.

result = cbrkit.retrieval.apply(casebase, query, retriever)

Our result has the following attributes:

  • similarities: A dictionary containing the similarity scores for each case.
  • ranking: A list of case indices sorted by their similarity score.
  • casebase: The casebase containing only the retrieved cases (useful for downstream tasks).
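How these three attributes relate to each other can be sketched without CBRkit. The function retrieve_sketch below is hypothetical and returns a plain dictionary instead of CBRkit's result object; restricting similarities to the retrieved cases is an assumption of this sketch.

```python
from collections.abc import Callable, Mapping
from typing import Any, Optional


def retrieve_sketch(
    casebase: Mapping[Any, Any],
    query: Any,
    sim: Callable[[Any, Any], float],
    limit: Optional[int] = None,
) -> dict[str, Any]:
    # Score every case against the query.
    scores = {key: sim(case, query) for key, case in casebase.items()}
    # Sort case keys by descending similarity and keep the top `limit`.
    ranking = sorted(scores, key=lambda key: scores[key], reverse=True)[:limit]
    return {
        "similarities": {key: scores[key] for key in ranking},
        "ranking": ranking,
        "casebase": {key: casebase[key] for key in ranking},
    }


def number_similarity(x: float, y: float) -> float:
    # Illustrative measure: linear falloff over a range of 100.
    return max(0.0, 1.0 - abs(x - y) / 100.0)


casebase = {0: 10.0, 1: 40.0, 2: 25.0}
result = retrieve_sketch(casebase, query=30.0, sim=number_similarity, limit=2)
print(result["ranking"])   # [2, 1]
print(result["casebase"])  # {2: 25.0, 1: 40.0}
```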

In some cases, it is useful to combine multiple retrieval pipelines, for example when applying the MAC/FAC pattern where a cheap pre-filter is applied to the whole casebase before a more expensive similarity measure is applied on the remaining cases. To use this pattern, first create the corresponding retrievers using the builder:

retriever1 = cbrkit.retrieval.build(..., min_similarity=0.5, limit=20)
retriever2 = cbrkit.retrieval.build(..., limit=10)

Then apply all of them sequentially by passing them as a list or tuple to the apply function:

result = cbrkit.retrieval.apply(casebase, query, (retriever1, retriever2))

The result has the following two attributes:

  • final: Result of the last retriever in the list.
  • steps: A list of results for each retriever in the list.

Both final and each entry in steps have the same attributes as discussed previously. The returned result also exposes these attributes directly as aliases for the corresponding entries in final (i.e., result.ranking == result.final.ranking).
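The MAC/FAC idea itself can be illustrated in plain Python: a cheap measure prunes the casebase, and a more expensive one ranks the survivors. Everything below (apply_stage, the two measures, and the use of difflib as the "expensive" measure) is an assumption for illustration and does not reflect how CBRkit chains retrievers internally.

```python
from difflib import SequenceMatcher


def substring_similarity(x: str, y: str) -> float:
    # Cheap pre-filter (MAC): exact or substring match only.
    if x == y:
        return 1.0
    return 0.5 if x in y or y in x else 0.0


def edit_similarity(x: str, y: str) -> float:
    # More expensive measure (FAC), here difflib's ratio as a stand-in.
    return SequenceMatcher(None, x, y).ratio()


def apply_stage(casebase, query, sim, limit=None, min_similarity=0.0):
    # Hypothetical helper: score, filter by threshold, keep the best `limit`.
    scores = {key: sim(case, query) for key, case in casebase.items()}
    kept = [key for key, score in scores.items() if score >= min_similarity]
    ranking = sorted(kept, key=lambda key: scores[key], reverse=True)[:limit]
    return {key: casebase[key] for key in ranking}


casebase = {"a": "light blue", "b": "dark blue", "c": "red", "d": "blue"}
query = "blue"

steps = []
current = casebase
for sim, options in [
    (substring_similarity, {"min_similarity": 0.5, "limit": 20}),
    (edit_similarity, {"limit": 2}),
]:
    # Each stage only sees the cases that survived the previous one.
    current = apply_stage(current, query, sim, **options)
    steps.append(current)

final = steps[-1]
print(sorted(steps[0]))  # ['a', 'b', 'd']
print(list(final))       # ['d', 'b']
```

The expensive measure is evaluated on three cases instead of four here; on a large casebase the pre-filter is what keeps the second stage affordable.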

Adaptation Functions

Coming soon...

Reuse

Coming soon...

Evaluation

Coming soon...

REST API and CLI

In order to use the built-in API and CLI, you need to define a retriever/reuser in a Python module using the function cbrkit.retrieval.build() and/or cbrkit.reuse.build(). For example, the file ./retriever_module.py could contain the following code:

import cbrkit

custom_retriever = cbrkit.retrieval.build(
    cbrkit.sim.attribute_value(...),
    limit=10
)

Our custom retriever can then be specified for the API/CLI using standard Python module syntax: retriever_module:custom_retriever.
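A module:attribute string of this kind is commonly resolved with importlib. The sketch below shows that general technique using a standard-library module so that it runs anywhere; the function name resolve is hypothetical, and whether CBRkit resolves the string in exactly this way is an assumption.

```python
import importlib
from typing import Any


def resolve(spec: str) -> Any:
    # Split "package.module:attribute" into its two parts.
    module_name, _, attribute = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# "math:sqrt" stands in for something like "retriever_module:custom_retriever".
sqrt = resolve("math:sqrt")
print(sqrt(16.0))  # 4.0
```

For the lookup to succeed, the module must be importable from the directory in which the command is run or from the Python path.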

CLI

When installing with the cli extra, CBRkit provides a command line interface:

cbrkit --help

Please visit the documentation for more information on how to use the CLI.

API

When installing with the api extra, CBRkit provides a REST API server:

cbrkit serve --help

After starting the server, you can access the API documentation at http://localhost:8000/docs.
