Meerkat is building new data abstractions to make machine learning easier.
Create interactive views of any dataset.
Website | Quickstart | Docs | Contributing | Discord | Blogpost
⚡️ Quickstart
pip install meerkat-ml
Next Steps. Check out our Getting Started page and our documentation to start building with Meerkat.
Why Meerkat?
Meerkat is an open-source Python library that helps users visualize, explore, and annotate any dataset. It is especially useful when processing unstructured data types (e.g. free text, PDFs, images, video) with machine learning models.
✏️ Features and Design Principles
Here are four principles that inform Meerkat's design.
(1) Low overhead. With four lines of Python, start interacting with any dataset.
- Zero-copy integrations with your preferred data abstractions: Pandas, Arrow, HF Datasets, Ibis, SQL.
- Limited data movement. With Meerkat, you interact with your data where it already lives: no uploads to external databases and no reformatting.
import meerkat as mk
df = mk.from_csv("paintings.csv")
df["image"] = mk.files("image_url")
df
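The "limited data movement" idea above can be illustrated with a small, library-independent sketch: a column that stores only file paths and reads a file when a row is accessed. The class name `LazyFileColumn`, its `loader` argument, and the `demo` helper are hypothetical names chosen for this illustration. This is not Meerkat's implementation, only a picture of the general pattern.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


class LazyFileColumn:
    """Hypothetical column that keeps paths and loads files only on access."""

    def __init__(self, paths: List[str], loader: Callable[[Path], object]):
        self.paths = [Path(p) for p in paths]
        self.loader = loader
        self.load_count = 0  # number of files actually read so far

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int):
        # The file is read here, at access time, not at construction time.
        self.load_count += 1
        return self.loader(self.paths[index])


def demo() -> tuple:
    """Build a column over three temporary text files and read one of them."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = Path(tmp) / f"doc_{i}.txt"
            path.write_text(f"contents of document {i}")
            paths.append(str(path))

        column = LazyFileColumn(paths, loader=lambda p: p.read_text())
        loads_before = column.load_count  # nothing has been read yet
        value = column[1]                 # triggers exactly one read
        return loads_before, column.load_count, value
```

Constructing the column touches no file contents; only the indexed row is loaded, so the data stays where it already lives until it is needed.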
(2) Diverse data types. Visualize and annotate almost any data type in Meerkat interfaces: text, images, audio, video, MRI scans, PDFs, HTML, JSON.
(3) "Intelligent" user interfaces. Meerkat makes it easy to embed machine learning models (e.g. LLMs) within user interfaces to enable intelligent functionality such as searching, grouping and autocomplete.
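# Embed the "img" column with the CLIP engine and store the result.
df["embedding"] = mk.embed(df["img"], engine="clip")
# Match component: scores each row against a query using the embeddings.
match = mk.gui.Match(
    df,
    against="embedding",
    engine="clip",
)
# Sort rows by the match score, best matches first.
sorted_df = mk.sort(
    df,
    by=match.criterion.name,
    ascending=False,
)
gallery = mk.gui.Gallery(sorted_df)
mk.gui.html.div([match, gallery])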
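To make the "score, then sort" step above concrete, here is a minimal sketch in plain Python that ranks rows by the cosine similarity between a query vector and each row's embedding. Cosine similarity is one common choice for this kind of matching; whether Meerkat's `Match` component uses exactly this score is an assumption. The toy two-dimensional vectors stand in for real CLIP embeddings, and `rank_by_match` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_match(
    rows: List[Dict], query: Sequence[float], column: str = "embedding"
) -> List[Dict]:
    """Attach a similarity score to every row and sort in descending order."""
    scored = [dict(row, score=cosine_similarity(row[column], query)) for row in rows]
    return sorted(scored, key=lambda row: row["score"], reverse=True)


# Toy data: titles with made-up 2-D embeddings.
rows = [
    {"title": "a", "embedding": [1.0, 0.0]},
    {"title": "b", "embedding": [0.0, 1.0]},
    {"title": "c", "embedding": [1.0, 1.0]},
]
ranked = rank_by_match(rows, query=[1.0, 0.1])
print([row["title"] for row in ranked])  # ['a', 'c', 'b']
```

The query points almost along the first axis, so row "a" ranks first, the diagonal row "c" second, and the orthogonal row "b" last.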
(4) Declarative (think: Seaborn), but also infinitely customizable and composable. Meerkat visualization components can be composed and customized to create new interfaces.
plot = mk.gui.plotly.ScatterPlot(df=plot_df, x="umap_1", y="umap_2")

# Reactive function: re-runs whenever the plot selection changes.
@mk.gui.reactive
def filter(selected: list, df: mk.DataFrame):
    # Keep only the rows whose primary key is in the current selection.
    return df[df.primary_key.isin(selected)]

filtered_df = filter(plot.selected, plot_df)
table = mk.gui.Table(filtered_df, classes="h-full")
mk.gui.html.flex([plot, table], classes="h-[600px]")
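The reactive pattern in the snippet above can be sketched without any library: a value that notifies subscribers when it changes, and a decorator that recomputes a function's result whenever one of its observable inputs is updated. The names `Store`, `reactive`, and `filter_rows` are hypothetical and only illustrate the general idea; Meerkat's own reactive machinery may work differently.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class Store(Generic[T]):
    """Hypothetical observable value that notifies subscribers on change."""

    def __init__(self, value: T):
        self._value = value
        self._subscribers: List[Callable[[], None]] = []

    def get(self) -> T:
        return self._value

    def set(self, value: T) -> None:
        self._value = value
        for callback in list(self._subscribers):
            callback()

    def subscribe(self, callback: Callable[[], None]) -> None:
        self._subscribers.append(callback)


def reactive(fn: Callable) -> Callable:
    """Wrap fn so its result is a Store recomputed when any Store input changes."""

    def wrapper(*args):
        def compute():
            # Unwrap Store arguments to their current values before calling fn.
            return fn(*[a.get() if isinstance(a, Store) else a for a in args])

        result = Store(compute())
        for arg in args:
            if isinstance(arg, Store):
                arg.subscribe(lambda: result.set(compute()))
        return result

    return wrapper


@reactive
def filter_rows(selected: list, rows: list) -> list:
    return [row for row in rows if row["id"] in selected]


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
selected = Store([1])
filtered = filter_rows(selected, rows)
first = filtered.get()    # [{'id': 1}]
selected.set([2, 3])      # changing the selection triggers a recompute
second = filtered.get()   # [{'id': 2}, {'id': 3}]
```

The caller never re-invokes `filter_rows` by hand; updating `selected` is enough for `filtered` to reflect the new selection, which is the behavior a linked plot and table rely on.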
✨ Use cases where Meerkat shines
- Exploratory analysis over unstructured data types. Demo
- Spot-checking the behavior of large language models (e.g. GPT-3). Demo
- Identifying systematic errors made by machine learning models. Demo
- Rapid labeling of validation data.
🤔 Use cases where Meerkat may not be the right fit
- Are you only working with structured data (e.g. numerical and categorical variables)? Popular data visualization libraries (e.g. Seaborn, Matplotlib) are often sufficient. If you're looking for interactivity, Plotly and Streamlit work well with structured data. Meerkat is differentiated in how it visualizes unstructured data types: long-form text, PDFs, HTML, images, video, audio...
- Are you trying to make a straightforward demo of a machine learning model (single input/output, chatbot) and share it with the world? Gradio is likely a better fit! That said, if your demo involves visualizing lots of data, you may find Meerkat useful.
- Are you trying to manually label tens of thousands of data points? If you are looking for a data labeling tool to use with a labeling team, there are great open-source labeling solutions designed for this (e.g. LabelStudio). In contrast, Meerkat is a great fit for teams and individuals who lack access to a large labeling workforce, are using pretrained models (e.g. GPT-3), and need to label validation data or in-context examples.
✉️ About
Meerkat is being built by Machine Learning PhD students in the Hazy Research lab at Stanford. We're excited to build for a future where models make it easier for teams to sift and reason through large volumes of unstructured data.
Please reach out to kgoel [at] cs [dot] stanford [dot] edu, eyuboglu [at] stanford [dot] edu, and arjundd [at] stanford [dot] edu if you would like to use Meerkat for a project or at your company, or if you have any questions.
Hashes for meerkat_ml-0.4.11-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d0e7e2c5ad1d386f86d098c6bebb854c794b24940310e29fe343c07e957db91f
MD5 | d7b2cea0a29449dd2c7f54290cdf06f5
BLAKE2b-256 | c7a67dc47dc2655da31d33adfa3662f66768f431158a12a265d24758bb778354
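A downloaded wheel can be checked against the SHA256 digest in the table above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local path of the downloaded file is up to you.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for meerkat_ml-0.4.11-py2.py3-none-any.whl
EXPECTED_SHA256 = "d0e7e2c5ad1d386f86d098c6bebb854c794b24940310e29fe343c07e957db91f"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected(Path("meerkat_ml-0.4.11-py2.py3-none-any.whl"))` on the downloaded file should return `True` if the download is intact.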