Join the community: chat with us at https://gitter.im/rikaidev/community

Warning: This repository is still experimental. No API compatibility is guaranteed.

Rikai

Rikai is a framework designed for AI workflows focused on large-scale unstructured datasets (images and videos today, with sensor data, text, and more planned). Through every stage of the AI modeling workflow, Rikai strives to offer a great developer experience when working with real-world AI datasets.

The quality of an AI dataset can make or break an AI project, but tooling for AI data is sorely lacking in ergonomics. As a result, practitioners must spend most of their time and effort wrestling with their data instead of innovating on the models and use cases. Rikai alleviates the pain that AI practitioners experience on a daily basis dealing with the myriad of tedious data tasks, so they can focus again on model-building and problem solving.

To start trying Rikai right away, check out the Quickstart Guide.

Main Features

Data format

The core of Rikai is a data format ("rikai format") based on Apache Parquet. Rikai augments Parquet with a rich collection of semantic types designed specifically for unstructured data and annotations.
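To make the idea of a semantic type concrete, here is a minimal sketch in plain Python. The `Box2D` class, `to_row`, and `from_row` are hypothetical names used only for illustration; they are not Rikai's `rikai.types.Box2d` or its actual serialization code. The sketch only shows the general pattern: a typed object is flattened to a plain struct that a columnar file can store, and rebuilt on read.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Box2D:
    """Hypothetical stand-in for a 2-D bounding-box semantic type."""

    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        # Width times height; degenerate boxes are clamped to zero.
        return max(self.xmax - self.xmin, 0.0) * max(self.ymax - self.ymin, 0.0)

    def to_row(self) -> dict:
        # Flatten to plain floats, the kind of struct a columnar file can store.
        return asdict(self)

    @classmethod
    def from_row(cls, row: dict) -> "Box2D":
        # Rebuild the typed object when the struct is read back.
        return cls(**row)


box = Box2D(xmin=1.0, ymin=2.0, xmax=3.0, ymax=4.0)
row = box.to_row()
restored = Box2D.from_row(row)
```

The point of the pattern is that user code works with `box.area()` and similar methods, while the storage layer only ever sees plain numeric fields.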

Integrations

Rikai comes with an extensive set of I/O connectors. For ETL, Rikai can consume popular formats like ROS bags and COCO. For analysis, it's easy to read Rikai data into pandas or Spark DataFrames (Rikai handles serialization and deserialization of the semantic types). For training, Rikai allows direct creation of PyTorch/TensorFlow datasets without manual conversion.

SQL-ML Engine

Rikai extends Spark SQL with ML capabilities, which lets users analyze Rikai datasets with their own models using SQL ("Bring your own model").

Visualization

Rikai's semantic types come with carefully crafted visualizations, especially in Jupyter notebooks, to help you visualize and inspect your AI data without having to remember complicated raw image manipulations.

Roadmap

  1. Improved video support
  2. Text / sensors / geospatial support
  3. Versioning support built into the dataset
  4. Better Rikai UDT-support
  5. Declarative annotation API (think vega-lite for annotating images/videos)
  6. Integrations into dbt and BI tools

Example

from pyspark.sql import Row
from pyspark.ml.linalg import DenseMatrix
from rikai.types import Image, Box2d
from rikai.numpy import wrap
import numpy as np

df = spark.createDataFrame(
    [
        {
            "id": 1,
            "mat": DenseMatrix(2, 2, range(4)),
            "image": Image("s3://foo/bar/1.png"),
            "annotations": [
                Row(
                    label="cat",
                    mask=wrap(np.random.rand(256, 256)),
                    bbox=Box2d(xmin=1.0, ymin=2.0, xmax=3.0, ymax=4.0),
                )
            ],
        }
    ]
)

df.write.format("rikai").save("s3://path/to/features")

Training on a Rikai dataset in PyTorch

from torch.utils.data import DataLoader
from torchvision import transforms as T
from rikai.pytorch.vision import Dataset

transform = T.Compose([
    T.Resize(640),
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

dataset = Dataset(
    "s3://path/to/features",
    image_column="image",
    transform=transform
)
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,
)
# `model` is assumed to be a torch.nn.Module that already lives on the GPU
for batch in loader:
    predicts = model(batch.to("cuda"))
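As a worked example of the `T.Normalize` step in the transform above, the sketch below applies the same per-channel formula, `(value - mean) / std`, to a single RGB pixel in plain Python. The helper `normalize_pixel` is a hypothetical name introduced here for illustration; it assumes the pixel has already been scaled to the [0, 1] range, as `T.ToTensor()` does for 8-bit images.

```python
# Channel means and standard deviations, copied from the transform above.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(MEAN)

# A pure white pixel lands between 2 and 3 in every channel.
white = normalize_pixel((1.0, 1.0, 1.0))
```

After normalization the inputs are roughly zero-centered per channel, which is the input distribution that models pretrained with these statistics expect.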

Using an ML model in Spark SQL (experimental)

CREATE MODEL yolo5
OPTIONS (min_confidence=0.3, device="gpu", batch_size=32)
USING "s3://bucket/to/yolo5_spec.yaml";

SELECT id, ML_PREDICT(yolo5, image) FROM my_dataset
WHERE split = "train" LIMIT 100;

Rikai can use MLflow as its model registry, which lets you automatically pick up the latest model version registered there. The supported model flavors are:

  • PyTorch (pytorch)
  • TensorFlow (tensorflow)
  • Scikit-learn (sklearn)

With MLflow, the model is referenced by an mlflow:/// URI instead of a spec file:

CREATE MODEL yolo5
OPTIONS (min_confidence=0.3, device="gpu", batch_size=32)
USING "mlflow:///yolo5_model/";

SELECT id, ML_PREDICT(yolo5, image) FROM my_dataset
WHERE split = "train" LIMIT 100;
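The two `CREATE MODEL` statements above differ only in the URI passed to `USING`: one points at a YAML spec file, the other at an MLflow registry entry. The sketch below shows how the two forms break down when parsed with Python's standard `urllib.parse`. The helper `describe_model_uri` and the dictionary keys it returns are hypothetical; this is not Rikai's actual resolution logic, only an illustration of the URI shapes.

```python
from urllib.parse import urlparse


def describe_model_uri(uri: str) -> dict:
    """Split a model URI into the parts a resolver would need (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme == "mlflow":
        # mlflow:///<name>/ carries no host, so the model name lives in the path.
        return {"source": "mlflow", "model": parsed.path.strip("/")}
    # Anything else is treated here as the location of a YAML model spec.
    return {"source": parsed.scheme, "spec": uri}


from_spec = describe_model_uri("s3://bucket/to/yolo5_spec.yaml")
from_registry = describe_model_uri("mlflow:///yolo5_model/")
```

Note the triple slash in `mlflow:///yolo5_model/`: the host part is empty, so the model name is the first path segment.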

For more details on the model spec, see the SQL-ML documentation.

Getting Started

Rikai is currently maintained for Scala 2.12 and Python 3.7, 3.8, and 3.9.

There are multiple ways to install Rikai:

  1. Try it using the included Dockerfile.
  2. Install via pip (pip install rikai), with extras for GCP, PyTorch/TensorFlow, and others.
  3. Install from source.

Note: if you want to use Rikai with your own PySpark installation, please consult the Rikai documentation for tips.

Docker

The included Dockerfile builds a standalone demo image with Jupyter, PyTorch, Spark, and Rikai preinstalled, along with notebooks that let you explore the capabilities of the Rikai feature store.

To build and run the Docker image from the current directory:

# Clone the repo
git clone git@github.com:eto-ai/rikai rikai
# Build the docker image
docker build --tag rikai --network host .
# Run the image
docker run -p 0.0.0.0:8888:8888/tcp rikai:latest jupyter lab --ip 0.0.0.0 --port 8888

If successful, the console should then print out a clickable link to JupyterLab. You can also open a browser tab and go to localhost:8888.

Install from pypi

The base Rikai library can be installed with just pip install rikai. Dependencies for PyTorch support (pytorch and torchvision) and Jupyter support (matplotlib and jupyterlab) are optional extras. Because many open-source datasets use YouTube videos, pafy and youtube-dl are available as optional extras too.

For example, if you want to use PyTorch in Jupyter to train models on Rikai datasets in S3 that contain YouTube videos, you would run:

pip install rikai[pytorch,jupyter,youtube]

If you're not sure what you need and don't mind installing some extra dependencies, you can simply install everything:

pip install rikai[all]
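The relationship between the individual extras and `all` can be sketched as a small table. The mapping below is an assumption built from the package names mentioned in this section; it is not copied from Rikai's `setup.py`, which may list different or additional packages (for example a GCP extra). The PyPI name `torch` is assumed for PyTorch, and `requirements_for` is a hypothetical helper.

```python
# Hypothetical extras table; the real package's setup.py may differ.
extras = {
    "pytorch": ["torch", "torchvision"],
    "jupyter": ["matplotlib", "jupyterlab"],
    "youtube": ["pafy", "youtube-dl"],
}

# "all" is the de-duplicated union of every other extra.
extras["all"] = sorted({pkg for pkgs in extras.values() for pkg in pkgs})


def requirements_for(selected):
    """Return the extra packages pulled in by e.g. rikai[pytorch,jupyter]."""
    return sorted({pkg for name in selected for pkg in extras[name]})


wanted = requirements_for(["pytorch", "jupyter", "youtube"])
```

In this sketch, selecting every named extra yields the same package set as `all`, which is why `rikai[all]` is the convenient choice when you are unsure what you need.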

Install from source

To build from source, you'll need Python as well as Scala with sbt installed:

# Clone the repo and enter it
git clone git@github.com:eto-ai/rikai rikai
cd rikai
# Build the jar
sbt publishLocal
# Install python package
cd python
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")

Utilities

pre-commit helps keep your code formatted consistently with the repository. It runs reformatting and other checks on your local machine before CI forces you to fix them.

If you want to use it, install and enable pre-commit:

pip install pre-commit
pre-commit install  # run in your local development directory
# prints: pre-commit installed at .git/hooks/pre-commit

Uninstalling it is just as easy:

pre-commit uninstall
