An end-to-end GPU Python library that encodes categorical variables into machine-learnable numerics

cu-cat

cu-cat is an end-to-end GPU Python library that encodes categorical variables into machine-learnable numerics. It is a CUDA-accelerated port of dirty_cat (since rebranded as skrub) and enables more ambitious interactive analysis and real-time pipelines!

Loom video walkthrough

What can cu-cat do?

The latest PyGraphistry[AI] release adds GPU acceleration to its automatic feature encoding pipeline. To do so, we are delighted to introduce the newest member of the open source GPU dataframe ecosystem: cu_cat! The Graphistry team has been growing the library out of need. The straw that broke the camel's back came in December 2022, when we were hacking on our winning entry to the US Cyber Command AI competition for automatically correlating and triaging gigabytes of alerts. We realized that CPU-based feature engineering was slowing down our team's iteration cycles, basically pouring sand into our otherwise humming end-to-end GPU AI pipeline. Two months later, cu_cat was born. Fast forward to now, and we are getting ready to make it default-on for all our work.

As its name hints, cu_cat is our GPU-accelerated open source fork of the popular CPU Python library dirty_cat. Like dirty_cat, cu_cat makes it easy to convert messy dataframes filled with numbers, strings, and timestamps into numeric feature columns optimized for AI models. It adds interoperability for GPU dataframes and replaces key kernels and algorithms with faster and more scalable GPU variants. Even on low-end GPUs, we can now tackle much larger datasets in the same amount of time (or for the first time!) with end-to-end pipelines. We typically see 3-5X speedups and sometimes 10X+, and the more data you encode, the more time you save!
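
To make "messy strings into numeric feature columns" concrete, here is a minimal CPU-only sketch of character n-gram similarity encoding, one of the ideas that dirty_cat-style encoders build on. It is not cu_cat's implementation or API: the function names, the sample column, and the choice of prototypes are all hypothetical, and the real library runs GPU kernels rather than pure-Python loops.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.strip().lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_encode(values, prototypes, n=3):
    """Encode each string as one numeric column per prototype category."""
    proto_grams = [char_ngrams(p, n) for p in prototypes]
    return [[cosine(char_ngrams(v, n), p) for p in proto_grams] for v in values]


# Hypothetical messy column: the same job title typed three different ways.
dirty = ["Police Officer", "police officer ", "Polce Oficer", "Librarian"]
features = similarity_encode(dirty, prototypes=["police officer", "librarian"])
for raw, row in zip(dirty, features):
    print(repr(raw), [round(x, 2) for x in row])
```

Misspelled variants land close to the category they resemble instead of becoming unrelated one-hot columns, which is why this family of encoders suits dirty data.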

What can cu-cat NOT do?

Because cu_cat targets cuDF/cuML dataframes, it is not a drop-in replacement for the CPU-based dirty_cat, and we are not planning to make it one. We developed this library to accelerate our own Graphistry end-to-end pipelines.

Similarly, it requires pandas or cuDF input, as well as a GPU. NumPy arrays will not suffice: they can be featurized, but the result cannot be UMAP-ed because arrays lack an index.
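
One way to catch the wrong input type early is a small duck-typed guard before featurizing. The sketch below is hypothetical and not part of cu_cat or PyGraphistry; it only assumes that dataframe objects expose both `.index` and `.columns`, which plain arrays and lists do not.

```python
def looks_like_dataframe(obj) -> bool:
    """Duck-typed check: dataframes expose both `.index` and `.columns`."""
    return hasattr(obj, "index") and hasattr(obj, "columns")


def require_dataframe(obj):
    """Hypothetical pre-flight guard; returns obj unchanged when it passes."""
    if not looks_like_dataframe(obj):
        raise TypeError(
            f"expected a pandas or cuDF DataFrame, got {type(obj).__name__}; "
            "wrap arrays in a DataFrame first so rows keep an index"
        )
    return obj
```

For a NumPy array, wrapping it with `pandas.DataFrame(array)` attaches a default integer index.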

Startup Code:

# !pip install graphistry[ai] ## future releases will have this by default
!pip install git+https://github.com/graphistry/pygraphistry.git@dev/depman_gpufeat

import cudf
import graphistry
df = cudf.read_csv(...)
g = graphistry.nodes(df).featurize(feature_engine='cu_cat')
print(g._node_features.describe()) # friendly dataframe interfaces
g.umap().plot() # ML/AI embedding model using the features
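
If the same notebook must also run on machines without a GPU stack, one option is to choose the engine name at runtime. The helper below is a hypothetical sketch, not a PyGraphistry function. It assumes `'dirty_cat'` is an accepted CPU value for `feature_engine` in your PyGraphistry version, and it only checks that the `cudf` and `cu_cat` packages are installed, not that a usable GPU is present.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def pick_feature_engine(installed=is_installed) -> str:
    """Prefer the GPU engine when both cudf and cu_cat are installed."""
    if installed("cudf") and installed("cu_cat"):
        return "cu_cat"
    # Assumed CPU fallback name; verify it against your PyGraphistry version.
    return "dirty_cat"


engine = pick_feature_engine()
print(engine)
# Then, as in the startup code:
# g = graphistry.nodes(df).featurize(feature_engine=engine)
```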

Example notebooks

The Hello cu-cat notebook goes in depth on how to identify and deal with messy data using the cu-cat library.

CPU v GPU Biological Demos:

Dependencies

The major dependencies are the cuML and cuDF libraries, along with standard Python libraries.

Related projects

dirty_cat is now rebranded as skrub, part of the sklearn family

Download files

Source Distribution

cu-cat-0.9.6.tar.gz (59.0 kB view details)

Uploaded Source

Built Distribution

cu_cat-0.9.6-py3-none-any.whl (43.3 kB view details)

Uploaded Python 3

File details

Details for the file cu-cat-0.9.6.tar.gz.

File metadata

  • Download URL: cu-cat-0.9.6.tar.gz
  • Upload date:
  • Size: 59.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for cu-cat-0.9.6.tar.gz
Algorithm     Hash digest
SHA256        da1ec150e12a16cd70237f0cc2e738ba1d55774a7e310ceff5d67cc20294c080
MD5           d714ba103898f489bcf59902e95394bd
BLAKE2b-256   8961dbae2d31ee2c5bfc3c1bcad606b2d82f1ddbe077bd089a1486cfd0b1a421
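
A downloaded file can be checked against the listed SHA256 digest with Python's standard library. This is a minimal sketch; the function names are illustrative, and the path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one, ignoring case and padding."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Usage (assumes the file is in the current directory):
# matches_sha256(
#     "cu-cat-0.9.6.tar.gz",
#     "da1ec150e12a16cd70237f0cc2e738ba1d55774a7e310ceff5d67cc20294c080",
# )
```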

File details

Details for the file cu_cat-0.9.6-py3-none-any.whl.

File metadata

  • Download URL: cu_cat-0.9.6-py3-none-any.whl
  • Upload date:
  • Size: 43.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for cu_cat-0.9.6-py3-none-any.whl
Algorithm     Hash digest
SHA256        f5765623604d42257b3dab8d81bcf2db6b4c6240cfc0b3f188b2c91b0c0ff9fb
MD5           b9e800a1a86f2af24a5c253d462465a0
BLAKE2b-256   bcab84b7890f1cdbfb94197a8d247800396152bb88c647178f7172827b2e8d3b
