Generative Retrieval ID Semantic Transforms on top of Google Grain.

Project description

GRIST 🌾

Generative Retrieval ID Semantic Transforms for reproducible data pipelines.

GRIST is a focused Python library for bridging raw research datasets and generative retrieval models. It enriches datasets with Semantic Identifiers, guarantees deterministic preprocessing, and provides helpers for publishing results to public hubs like HuggingFace and Kaggle. It is designed to work smoothly with existing data pipeline tooling, including Grain.

Why GRIST

In Generative Retrieval (GR) research, results are only comparable when preprocessing is reproducible. GRIST treats a dataset not as a static file, but as a deterministic factory. Every transformation, from text cleaning to model-based ID generation, is designed to produce the same output for the same input and configuration.

Features

  • Pipeline-native: Fits into existing data pipeline tooling without new paradigms to learn.
  • Semantic ID injection: Built-in MapTransform classes for UUIDs, hashes, or model-generated codes.
  • Inference-ready: Wrap any pre-trained model (HuggingFace, JAX, PyTorch) as an ID generator.
  • Publishing helpers: Tools to facilitate uploads to HuggingFace or Kaggle.
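
As a rough illustration of the semantic ID injection idea above, the sketch below attaches a content-derived ID to each sample. It is not GRIST's API, which this page does not yet document. `HashIdTransform`, `uuid_for`, and the `text` and `semantic_id` keys are hypothetical names, and the class only mimics the `map(element)` shape of a Grain MapTransform using the Python standard library.

```python
import hashlib
import uuid


class HashIdTransform:
    """Hypothetical transform: adds a deterministic, content-derived ID to a sample."""

    def __init__(self, source_key="text", id_key="semantic_id", digest_chars=16):
        self.source_key = source_key
        self.id_key = id_key
        self.digest_chars = digest_chars

    def map(self, element):
        # Hash the UTF-8 bytes of the source field; the same text always gives the same ID.
        digest = hashlib.sha256(element[self.source_key].encode("utf-8")).hexdigest()
        # Return a new dict so the input sample is left unmodified.
        return {**element, self.id_key: digest[: self.digest_chars]}


def uuid_for(text):
    # uuid5 is name-based, so it is reproducible across runs (unlike the random uuid4).
    return str(uuid.uuid5(uuid.NAMESPACE_URL, text))


sample = {"text": "generative retrieval"}
transform = HashIdTransform()
first = transform.map(sample)
second = transform.map(sample)
```

Deriving the ID from the sample's content rather than from a counter or random source means that `first` and `second` carry the same ID, which is the property a reproducible pipeline needs.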

Installation

uv add grist

Quick Start

TODO: Quick start example for the planned public API.

Concepts

  • Semantic Identifiers: Stable, model-aware IDs that augment dataset samples for generative retrieval.
  • Deterministic pipelines: Transforms are designed so that the same input and configuration always yield the same preprocessed output.
  • Dataset configs: Optional, reusable configuration files for well-known datasets.
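
One way to picture the first two concepts together is sketched below, under stated assumptions. `toy_code_model` is a hash-based stand-in for a real encoder and quantizer, not a model GRIST ships, and `run_pipeline` and `fingerprint` are hypothetical helpers. The point is only that a pipeline whose steps are pure functions of their input can be checked for repeatability by comparing output fingerprints.

```python
import hashlib
import json


def toy_code_model(text, levels=3, codebook_size=256):
    """Stand-in for a real encoder/quantizer: derives `levels` codes from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] % codebook_size for i in range(levels)]


def run_pipeline(samples, id_model):
    # Normalise case and whitespace first, then attach model-generated codes as the ID.
    out = []
    for sample in samples:
        text = " ".join(sample["text"].lower().split())
        out.append({"text": text, "semantic_id": id_model(text)})
    return out


def fingerprint(records):
    # sort_keys gives canonical JSON, so equal records always hash to the same value.
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


raw = [{"text": "  Hello   World "}, {"text": "hello world"}]
run_a = run_pipeline(raw, toy_code_model)
run_b = run_pipeline(raw, toy_code_model)
```

Comparing `fingerprint(run_a)` with `fingerprint(run_b)` is a cheap regression check: if any step picks up hidden state or randomness, the two digests stop matching.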

Why the Name

In milling, grist is the grain separated from its chaff and ready to be ground. This library turns your "raw grain" (datasets) into a refined format ready for the "mill" of generative retrieval models.

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

grist-0.1.0.dev0-py3-none-any.whl (2.4 kB view details)

Uploaded Python 3

File details

Details for the file grist-0.1.0.dev0-py3-none-any.whl.

File metadata

  • Download URL: grist-0.1.0.dev0-py3-none-any.whl
  • Upload date:
  • Size: 2.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26

File hashes

Hashes for grist-0.1.0.dev0-py3-none-any.whl
  • SHA256: f5cd832d7c0908b0b2885d3ad41b241af137310c955a4c8abcfeb6ba16a32b44
  • MD5: 09d83c300b6e2beb31c65f5c624b05e4
  • BLAKE2b-256: 0e90f7355867df0045e7b5e491f96bc8709c38db7e7594418508259d108b6f17
