Clusterfun - a plotting library to inspect data

Clusterfun

Clusterfun is a Python plotting library for exploring image data. Try the live demo at https://clusterfun.app.

Getting started

Clusterfun can be installed with pip:

pip install clusterfun

Clusterfun requires Python 3.8 or higher.

Plots accept data in the form of a pandas DataFrame; pandas is installed automatically if it is not already present. No account, payment, or internet connection is required to use clusterfun. Clusterfun is open source and free to use.

A simple example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.scatter(df, x="x", y="y", media="img_path", color="painter")

Data can be hosted locally or on AWS S3.

As you can see, a clusterfun plot takes a pandas dataframe as input, together with column names indicating which columns to use for the visualisation. In this respect it is similar to seaborn or plotly. But in clusterfun, you can:

  • Click and drag to select data to visualise it in a grid
  • Hover over data points to see them on the right side of the page
  • Click on data points to view zoomed in versions of the image related to the data point

This makes clusterfun ideal for quickly visualising image data, which can be useful in the context of building datasets, exploring edge cases and debugging model performance.

Main features

Default parameters

The following parameters are shared by the plot types:

  • df: pd.DataFrame (required)

    The dataframe used for the data to plot. Most other parameters are column names in this dataframe (e.g. media, color, etc.).

  • media: str (required)

    The column name of the media to display in the plot. See data loading for more information about the type of media that can be displayed.

  • show: bool = True

    Whether to show the plot or not. If show is set to True, clusterfun starts a local server to display the plot in a web browser. More specifically, it starts a FastAPI server on which the webpage is mounted as a static file. The application itself does not require an internet connection. All data is loaded locally and does not leave your machine/browser. If show is set to False, clusterfun only saves the data required to serve the plot later on and returns the path to where the data is stored. If you want to serve the plot yourself later on, you can run clusterfun {path-to-data}|{uuid} on the command line, passing either the path to the data or the uuid, to start a local server for the plot you are interested in.

  • color: Optional[str] = None

    If given, points will be colored based on the values in the given column. Powerful for visualising clusters or classes of data.

  • title: Optional[str] = None

    The title to use for the plot.

  • bounding_box: Optional[str] = None

    You can visualise bounding boxes on top of your images with the bounding_box parameter. For this to work, the dataframe used to plot the data needs a bounding box column, and bounding_box must be set to the name of that column. Each cell in the column needs to contain a dictionary or a list of dictionaries with bounding box values. The keys of the expected dictionary are:

    • xmin: float | int
    • ymin: float | int
    • xmax: float | int
    • ymax: float | int
    • label: Optional[str] = None
    • color: Optional[str] = None

    If no color is provided, a default color scheme will be used. The color value can be a color name or hex value. The label will be displayed in the top left of the bounding box. Example:

    single_bounding_box = {
      "xmin": 12,
      "ymin": 10,
      "xmax": 100,
      "ymax": 110,
      "color": "green",
      "label": "ground truth"
    }
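
    As a minimal sketch of how such cells could be assembled, the helper below builds one dictionary per box and collects several boxes for a single image. The function name make_bounding_box and the coordinate check are assumptions made for illustration; they are not part of clusterfun.

```python
from typing import Optional, Union

Number = Union[int, float]


def make_bounding_box(
    xmin: Number,
    ymin: Number,
    xmax: Number,
    ymax: Number,
    label: Optional[str] = None,
    color: Optional[str] = None,
) -> dict:
    """Build one bounding box dictionary with the keys listed above.

    Hypothetical helper for illustration; it is not part of clusterfun.
    """
    # Sanity check chosen for this sketch (an assumption, not a documented rule).
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("expected xmin < xmax and ymin < ymax")
    box = {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}
    # label and color are optional, so they are only added when given.
    if label is not None:
        box["label"] = label
    if color is not None:
        box["color"] = color
    return box


# One cell may hold a single dictionary or a list of dictionaries.
boxes_for_one_image = [
    make_bounding_box(12, 10, 100, 110, label="ground truth", color="green"),
    make_bounding_box(15, 14, 96, 104, label="prediction", color="#ff0000"),
]
```

    A list like boxes_for_one_image would go into one cell of the bounding box column, and the name of that column would be passed as bounding_box to the plot function.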
    

Plot types

The following plot types are available:

  • Bar chart
  • Confusion matrix
  • Grid
  • Histogram
  • Pie chart
  • Scatterplot
  • Violin plot

Bar chart

def bar_chart(
    df: pd.DataFrame,
    x: str,
    media: str,
    color: Optional[str] = None,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame The dataframe with the data to plot
  • x: str The column name of the data for the bar chart. One bar per unique value will be plotted.
  • media: str The column name of the media to display
  • color: Optional[str] = None If given, the values in this column are used to create a stacked bar chart.

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.bar_chart(df, x="painter", media="img_path", color="style")

Confusion matrix

def confusion_matrix(
    df: pd.DataFrame,
    y_true: str,
    y_pred: str,
    media: str,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame

    The dataframe with the data to plot

  • y_true: str

    The column name of the ground truth label. Values can be integers or strings.

  • y_pred: str

    The column name of the predicted label. Values can be integers or strings.

  • media: str

    The column name of the media to display

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/cifar10.csv")
clt.confusion_matrix(df, y_true="label", y_pred="pred", media="img_path")
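
To make the roles of y_true, y_pred and media concrete, here is a small sketch in plain Python that groups media paths by (ground truth, prediction) pair, which is the grouping a confusion matrix cell represents. The rows, the file paths and the function group_by_cell are made up for illustration, and this is not how clusterfun computes the plot internally.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rows standing in for a dataframe with the columns
# "label", "pred" and "img_path".
rows = [
    {"label": "cat", "pred": "cat", "img_path": "images/0.png"},
    {"label": "cat", "pred": "dog", "img_path": "images/1.png"},
    {"label": "dog", "pred": "dog", "img_path": "images/2.png"},
    {"label": "dog", "pred": "dog", "img_path": "images/3.png"},
]


def group_by_cell(
    rows: List[dict], y_true: str, y_pred: str, media: str
) -> Dict[Tuple[str, str], List[str]]:
    """Map each (ground truth, prediction) pair to the media in that cell."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row[y_true], row[y_pred])].append(row[media])
    return dict(cells)


cells = group_by_cell(rows, y_true="label", y_pred="pred", media="img_path")
# The count per cell is the number usually shown in a confusion matrix.
counts = {cell: len(paths) for cell, paths in cells.items()}
```

Here the off-diagonal cell ("cat", "dog") holds the one misclassified image, which is the kind of group worth inspecting when debugging model performance.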

Grid

def grid(
    df: pd.DataFrame,
    media: str,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame

    The dataframe with the data to plot

  • media: str

    The column name of the media to display

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.grid(df, media="img_path")

Histogram

def histogram(
    df: pd.DataFrame,
    x: str,
    media: str,
    bins: int = 20,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame

    The dataframe with the data to plot

  • x: str

    The column name of the data for the histogram

  • media: str

    The column name of the media to display

  • bins: int = 20

    The number of bins to use for the histogram

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.histogram(df, x="brightness", media="img_path")

Pie chart

def pie_chart(
    df: pd.DataFrame,
    color: str,
    media: str,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame

    The dataframe with the data to plot

  • color: str

    The column name whose values define the slices of the pie chart

  • media: str

    The column name of the media to display

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.pie_chart(df, color="painter", media="img_path")

Scatterplot

def scatter(
    df: pd.DataFrame,
    x: str,
    y: str,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame

    The dataframe with the data to plot

  • x: str

    The column name of the data for the x-axis

  • y: str

    The column name of the data for the y-axis

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.scatter(df, x="x", y="y", media="img_path")

Violin plot

def violin(
    df: pd.DataFrame,
    y: str,
    ...
) -> Path:

Parameters

  • df: pd.DataFrame

    The dataframe with the data to plot

  • y: str

    The column name of the data for the y-axis

Example

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
df = df[df.painter.isin(["Pablo Picasso", "Juan Gris", "Georges Braque", "Fernand Leger"])]
clt.violin(df, y="brightness", media="img_path")

Data loading

Clusterfun supports AWS S3 and local data storage and loading. The dataframe column corresponding to the media value in the plot will be used to determine where to load the media from.

import pandas as pd
import clusterfun as clt

df = pd.read_csv("https://raw.githubusercontent.com/gietema/clusterfun-data/main/wiki-art.csv")
clt.grid(df, media="img_path")

AWS S3 media paths should start with s3://. Make sure to set an AWS_REGION environment variable to the region where your data is stored.
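
Before plotting, it can help to check which entries of a media column point at S3 and whether AWS_REGION is set. The sketch below does this with the standard library only; the helper names media_source and needs_aws_region, the bucket name and the file paths are assumptions for illustration, and clusterfun's own loading logic may differ.

```python
import os
from typing import Iterable, Mapping, Optional


def media_source(path: str) -> str:
    """Return "s3" for paths that start with s3://, otherwise "local"."""
    return "s3" if path.startswith("s3://") else "local"


def needs_aws_region(
    paths: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> bool:
    """True if any path is on S3 while AWS_REGION is unset or empty."""
    env = os.environ if env is None else env
    uses_s3 = any(media_source(p) == "s3" for p in paths)
    return uses_s3 and not env.get("AWS_REGION")


# Hypothetical media values, as they might appear in the media column.
paths = ["s3://my-bucket/paintings/1.jpg", "/data/paintings/2.jpg"]

if needs_aws_region(paths):
    print("Set AWS_REGION to the region where the S3 data is stored.")
```

With a dataframe, the same check could be run on the values of the media column, for example df["img_path"].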

Support for Google Cloud Storage is coming soon.
