Storywrangler SDK

Dataset registration client, entity validation, and project scaffolding for Storywrangler.

Implements the Storywrangler Specification v0.0.3.

Installation

pip install storywrangler

Quick Start

1. Scaffold a new dataset project

# Flat parquet
uvx storywrangler new my-dataset --format parquet

# Hive-partitioned parquet
uvx storywrangler new my-dataset --format parquet_hive

# With snakemake instead of make
uvx storywrangler new my-dataset --format parquet_hive --orchestrator snakemake

This generates the project structure:

my-dataset/
  config/entities.yaml      # entity mappings (local_id → canonical ID)
  extract/src/scrape.py     # download raw data
  transform/src/process.py  # process into parquet
  adapter/submit.py         # register with the platform
  tests/                    # entity coverage tests
  Makefile                  # or Snakefile

2. Configure and register

cd my-dataset
cp .env.example .env        # fill in DATASET_ID, DOMAIN, DATA_PATH, API_KEY
uv sync
# Edit adapter/submit.py, config/entities.yaml
make submit
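
Before running make submit, it can help to confirm that the .env values are actually set. The following is a minimal sketch, assuming the four variables named in the comment above are all required and are exported as environment variables. The helper name missing_settings is hypothetical and is not part of the SDK.

```python
import os

# Variable names taken from the .env.example comment above.
REQUIRED_VARS = ("DATASET_ID", "DOMAIN", "DATA_PATH", "API_KEY")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Passing a plain dict instead of os.environ keeps the check easy to test without touching the real environment.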

3. Or register programmatically

from storywrangler import Storywrangler

client = Storywrangler()  # reads API_KEY from env

client.registry.register({
    "catalog": "vcsi",
    "domain": "babynames",
    "dataset_id": "names",
    "data_location": "/data/babynames",
    "data_format": "parquet_hive",
    "description": "US baby name frequencies by state, sex, and year.",
    "endpoint_schema": {"type": "types-counts"},
    "transform": {"time_dimension": "year"},
    "entity_mapping": {"local_id_column": "state", "entity_namespace": "wikidata"},
    "entities": [
        {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
        # ...
    ],
    "ownership": {"owner_group": "vcsi", "contact": "compstorylab@uvm.edu"},
    "lineage": {"repo": "https://github.com/org/babynames"},
})

Registration Schema

The registration payload (DatasetCreate) is defined in Specification §3.7.

Required fields

Field          Description
catalog        Producer identity (organisation or group)
domain         Owning service or router (e.g. wikimedia, babynames)
dataset_id     Short identifier, unique within domain
data_location  Path to data on disk (string or list of strings)
data_format    parquet or parquet_hive
description    Human-readable description
ownership      {owner_group, contact}
lineage        {repo} at minimum
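
A local pre-check against the required fields can catch omissions before a payload reaches the server. This is a rough sketch based only on the table above; check_required is a hypothetical helper, not an SDK function, and the server's own validation (Specification §3.7) remains authoritative.

```python
REQUIRED_FIELDS = (
    "catalog", "domain", "dataset_id", "data_location",
    "data_format", "description", "ownership", "lineage",
)
DATA_FORMATS = {"parquet", "parquet_hive"}


def check_required(payload):
    """Return a list of problems; an empty list means the pre-check passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    if "data_format" in payload and payload["data_format"] not in DATA_FORMATS:
        problems.append(f"unsupported data_format: {payload['data_format']!r}")
    if "ownership" in payload:
        for key in ("owner_group", "contact"):
            if key not in payload["ownership"]:
                problems.append(f"ownership is missing {key}")
    if "lineage" in payload and "repo" not in payload["lineage"]:
        problems.append("lineage is missing repo")
    return problems
```

Returning a list rather than raising on the first problem lets a submit script report every gap in one pass.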

Storage formats

  • parquet — single file, flat directory, or explicit file list. All filtering via WHERE clauses.
  • parquet_hive — directory tree with col=val/ at every level. Partition levels are auto-discovered at registration time — you only declare time_dimension and optionally hash_bucket.
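
To make the col=val/ layout concrete, the sketch below builds the path a transform step might write to. It only constructs path strings; the partition columns (state, year), the data.parquet file name, and the helper name hive_partition_path are illustrative assumptions rather than SDK conventions.

```python
from pathlib import PurePosixPath


def hive_partition_path(root, partitions, filename="data.parquet"):
    """Build root/col1=val1/col2=val2/.../filename from ordered (col, val) pairs."""
    path = PurePosixPath(root)
    for column, value in partitions:
        path = path / f"{column}={value}"
    return path / filename


print(hive_partition_path("/data/babynames", [("state", "VT"), ("year", 1990)]))
# → /data/babynames/state=VT/year=1990/data.parquet
```

The order of the pairs determines the nesting order, so every file in a dataset should be written with the same column sequence.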

Key optional fields

Field            Purpose
endpoint_schema  Output shape: types-counts (rank distributions) or time-series (tabular GROUP BY)
transform        Query axes: time_dimension, filter_dimensions (non-hive columns), hash_bucket
entity_mapping   Maps a local column to canonical entity IDs (see below)
entities         Entity rows: {local_id, entity_id, entity_name}
manifest         Coverage metadata (auto-derived — don't compute manually)
version          "latest" (default, mutable) or semver like "1.0.0" (immutable)
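
The version rule in the last row can be sketched as a small classifier. This assumes "semver" means a plain MAJOR.MINOR.PATCH string; whether the specification also accepts pre-release or build suffixes is not stated here, so the sketch rejects them. classify_version is a hypothetical helper.

```python
import re

# MAJOR.MINOR.PATCH only; pre-release and build suffixes are not handled.
_SEMVER_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def classify_version(version="latest"):
    """Return 'mutable' for "latest" and 'immutable' for a MAJOR.MINOR.PATCH string."""
    if version == "latest":
        return "mutable"
    if _SEMVER_CORE.match(version):
        return "immutable"
    raise ValueError(f"not 'latest' or MAJOR.MINOR.PATCH: {version!r}")
```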

Auto-derived at registration

The server computes these from the data — submitters should not set them:

  • data_schema — column names and types
  • level_order — hive nesting order with type tags and defaults
  • manifest.availability — time/entity coverage ranges
  • filter_values — distinct values per filter dimension
  • hash_bucket config — bucket counts per entity
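
One way to honour this rule in a submit script is to drop server-derived keys before sending the payload. The sketch below assumes data_schema, level_order, filter_values, and manifest appear as top-level payload keys, which this README does not confirm; strip_server_derived is a hypothetical helper.

```python
import copy

# Assumption: these names appear as top-level keys in the payload.
SERVER_DERIVED_KEYS = ("data_schema", "level_order", "filter_values", "manifest")


def strip_server_derived(payload):
    """Return (cleaned_copy, removed_keys) without mutating the input."""
    cleaned = copy.deepcopy(payload)
    removed = [key for key in SERVER_DERIVED_KEYS if key in cleaned]
    for key in removed:
        del cleaned[key]
    return cleaned, removed
```

Reporting the removed keys makes it easy to log a warning when a pipeline has been computing values the server will overwrite anyway.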

Entity Mapping

entity_namespace — declaring identifier type

entity_mapping.entity_namespace tells the platform what kind of entity the local-ID column holds. This enables cross-dataset joins and automatic entity resolution.

Pattern 1 — opaque local keys (entity rows required):

# Column holds state abbreviations — a lookup table maps them to Wikidata
{
    "entity_mapping": {"local_id_column": "state", "entity_namespace": "wikidata"},
    "entities": [
        {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
    ],
}

Pattern 2 — global-identifier column (no entity rows needed):

# Column already holds OpenAlex author URLs
{
    "entity_mapping": {"local_id_column": "ego_author_id", "entity_namespace": "openalex"},
    # no "entities" list required — the platform derives canonical IDs from the namespace
}
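
For Pattern 1, every value in the local-ID column needs a matching entity row. The scaffold's tests/ directory is described as holding entity coverage tests; the sketch below shows one plausible form such a check could take. The function name and the sample values are illustrative, not taken from the generated project.

```python
def uncovered_local_ids(column_values, entities):
    """Return local IDs that occur in the data but have no entity row."""
    mapped = {row["local_id"] for row in entities}
    return sorted(set(column_values) - mapped)


entities = [
    {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
]
print(uncovered_local_ids(["VT", "NH", "VT"], entities))
# → ['NH']
```

An empty result means the mapping covers the column; anything else lists the IDs that still need rows in config/entities.yaml.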

Hash Bucket Assignment

For datasets with content-sharded partitions (transform.hash_bucket), use assign_bucket() to partition files consistently with the query layer:

from storywrangler.hashing import assign_bucket

# In your transform step — assign each row to a bucket directory
bucket = assign_bucket(term="hello world", num_buckets=16)
# → writes to ngram_bucket={bucket}/data.parquet

This uses murmur3_32 (seed 0) with a sign-bit mask, matching DuckDB's built-in murmur3_32() default. Both the backend query router and pipeline code import from the same source — storywrangler_schemas.hashing — ensuring bucket assignments are always consistent.
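
The paragraph above can be illustrated with a pure-Python sketch. It implements standard 32-bit MurmurHash3 with seed 0 and assumes that "sign-bit mask" means clearing the top bit (& 0x7FFFFFFF) before taking the value modulo num_buckets, and that terms are UTF-8 encoded. sketch_assign_bucket is a hypothetical stand-in; the real assign_bucket() may differ in these details, so pipeline code should import the SDK function rather than copy this.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    rounded = n - (n % 4)
    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def sketch_assign_bucket(term: str, num_buckets: int) -> int:
    """Assumed bucket rule: clear the sign bit, then reduce modulo num_buckets."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (murmur3_32(term.encode("utf-8")) & 0x7FFFFFFF) % num_buckets
```

Clearing the sign bit keeps the value non-negative in engines that treat the 32-bit hash as signed, so the modulo gives the same bucket on both sides.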

Entity Validation

from storywrangler.validation import EntityValidator

validator = EntityValidator()

validator.validate_wikidata("wikidata:Q937")          # True
validator.validate_orcid("orcid:0000-0002-1825-0097")  # True
validator.validate_openalex("openalex:A5002034958")     # True
validator.validate("ror:05qghxh33")                     # True (any namespace)

Supported Namespaces

Namespace  Format example             Entity types
wikidata   wikidata:Q937              People, places, concepts, …
orcid      orcid:0000-0002-1825-0097  Researchers
openalex   openalex:A5002034958       Authors (A), Works (W), Institutions (I), Concepts (C), Sources (S), Funders (F), Publishers (P)
ror        ror:05qghxh33              Research organisations
ipeds      ipeds:231174               US higher-ed institutions
doi        doi:10.1038/nature12373    Published works
isbn       isbn:978-3-16-148410-0     Books
local      local:<any-string>         Dataset-local identifiers
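
The namespace:identifier convention in the table can be sketched as a small parser, paired with one well-known format rule: the ORCID check character (ISO 7064 MOD 11-2). Both helpers are hypothetical illustrations. The validation algorithms the SDK actually applies are defined in the specification and are not reproduced here, so EntityValidator may accept or reject different inputs.

```python
SUPPORTED_NAMESPACES = {
    "wikidata", "orcid", "openalex", "ror", "ipeds", "doi", "isbn", "local",
}


def split_entity_id(entity_id):
    """Split 'namespace:identifier' on the first colon and check the namespace."""
    namespace, sep, identifier = entity_id.partition(":")
    if not sep or not identifier or namespace not in SUPPORTED_NAMESPACES:
        raise ValueError(f"not a supported namespaced ID: {entity_id!r}")
    return namespace, identifier


def orcid_checksum_ok(orcid):
    """Verify the final ORCID character using ISO 7064 MOD 11-2."""
    chars = orcid.replace("-", "")
    if len(chars) != 16 or not chars[:15].isdigit():
        return False
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    expected = (12 - total % 11) % 11
    return chars[15].upper() == ("X" if expected == 10 else str(expected))


namespace, identifier = split_entity_id("orcid:0000-0002-1825-0097")
print(namespace, orcid_checksum_ok(identifier))
# → orcid True
```

Splitting on the first colon only means identifiers that contain colons or slashes themselves, such as DOIs, stay intact.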

Entity Graph (Beta)

The backend maintains an entity graph — a directed adjacency list of edges between canonical entity IDs. This enables multi-hop traversal across namespaces.

openalex:A5002034958
  --affiliated_with--> openalex:I26873012   (UVM)
  --same_as----------> wikidata:Q1068        (UVM on Wikidata)
  --country----------> wikidata:Q30          (United States)

Supported predicates: affiliated_with, same_as, country, broader
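
Multi-hop traversal over an adjacency list can be sketched with a breadth-first search. The example reads the diagram above as a chain (author to institution, institution to its Wikidata entry, then to the country), which is one interpretation of the layout. The EDGES structure and the path_to_namespace function are illustrative assumptions and do not reflect the backend's implementation.

```python
from collections import deque

# Adjacency list: source ID -> list of (predicate, target ID).
EDGES = {
    "openalex:A5002034958": [("affiliated_with", "openalex:I26873012")],
    "openalex:I26873012": [("same_as", "wikidata:Q1068")],
    "wikidata:Q1068": [("country", "wikidata:Q30")],
}


def path_to_namespace(edges, start, namespace):
    """Return the shortest path from start to any ID in the namespace, or None."""
    prefix = namespace + ":"
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node.startswith(prefix):
            return path
        for _predicate, target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(path + [target])
    return None


print(path_to_namespace(EDGES, "openalex:A5002034958", "wikidata"))
# → ['openalex:A5002034958', 'openalex:I26873012', 'wikidata:Q1068']
```

Breadth-first order guarantees the fewest hops, and the seen set keeps the search from looping if the graph contains cycles.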

API endpoints:

GET  /registry/entity-graph/path?from_id=openalex:A5002034958&to_namespace=wikidata
GET  /registry/entity-graph/neighbors?entity_id=openalex:I26873012
POST /admin/registry/entity-graph          # upsert edges (admin)
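
When calling the path endpoint from code, the query string is best built with a URL encoder rather than by hand. The sketch below uses only the standard library; the base URL is a placeholder, since this README does not state where the backend is hosted, and entity_graph_path_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder host: the real backend address is not given in this README.
BASE_URL = "https://storywrangler.example"


def entity_graph_path_url(from_id, to_namespace, base_url=BASE_URL):
    """Build the GET URL for /registry/entity-graph/path with encoded parameters."""
    query = urlencode({"from_id": from_id, "to_namespace": to_namespace})
    return f"{base_url}/registry/entity-graph/path?{query}"


print(entity_graph_path_url("openalex:A5002034958", "wikidata"))
# → https://storywrangler.example/registry/entity-graph/path?from_id=openalex%3AA5002034958&to_namespace=wikidata
```

urlencode percent-encodes the colon as %3A, which a standards-compliant server decodes back to the same identifier.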

Standards Compliance

This SDK implements Storywrangler Specification v0.0.3.

All validators follow the format requirements and validation algorithms defined in the specification.
