Storywrangler SDK

Dataset registration client, entity validation, and project scaffolding for Storywrangler.

Implements the Storywrangler Specification v0.0.3.

Installation

pip install storywrangler

Quick Start

1. Scaffold a new dataset project

# Flat parquet
uvx storywrangler new my-dataset --format parquet

# Hive-partitioned parquet
uvx storywrangler new my-dataset --format parquet_hive

# With snakemake instead of make
uvx storywrangler new my-dataset --format parquet_hive --orchestrator snakemake

This generates the project structure:

my-dataset/
  config/entities.yaml      # entity mappings (local_id → canonical ID)
  extract/src/scrape.py     # download raw data
  transform/src/process.py  # process into parquet
  load/submit.py            # register with the platform
  tests/                    # entity coverage tests
  Makefile                  # or Snakefile

2. Configure and register

cd my-dataset
cp .env.example .env        # fill in DATASET_ID, DOMAIN, DATA_PATH, API_KEY
uv sync
# Edit load/submit.py, config/entities.yaml
make submit
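
Before running make submit, it can help to confirm that the four variables named in the .env.example comment are actually set. This is a minimal sketch, assuming the values are read from the process environment; the helper name missing_settings is hypothetical and not part of the SDK:

```python
import os

# Variable names taken from the .env.example comment above.
REQUIRED_SETTINGS = ("DATASET_ID", "DOMAIN", "DATA_PATH", "API_KEY")


def missing_settings(environ=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_settings({"DATASET_ID": "names", "DOMAIN": "babynames"}))
# ['DATA_PATH', 'API_KEY']
```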

3. Or register programmatically

from storywrangler import Storywrangler

client = Storywrangler()  # reads API_KEY from env

client.registry.register({
    "catalog": "vcsi",
    "domain": "babynames",
    "dataset_id": "names",
    "data_location": "/data/babynames",
    "data_format": "parquet_hive",
    "description": "US baby name frequencies by state, sex, and year.",
    "endpoint_schema": {"type": "types-counts"},
    "transform": {"time_dimension": "year"},
    "entity_mapping": {"local_id_column": "state", "entity_namespace": "wikidata"},
    "entities": [
        {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
        # ...
    ],
    "ownership": {"owner_group": "vcsi", "contact": "compstorylab@uvm.edu"},
    "lineage": {"repo": "https://github.com/org/babynames"},
})

Client API Map

The SDK mirrors API routes one-to-one (Label Studio style) — method names follow the URL, so you can guess them without docs:

Route	SDK call
`POST /registry/register`	`client.registry.register(payload)`
`GET /registry/`	`client.registry.list()` — displaying `client.registry` renders this list; `.list().df()` for a table
`GET /registry/domains`	`client.registry.domains()`
`GET /registry/{domain}/{id}`	`client.registry.get(domain, id, full=, version=)`
`GET /registry/{domain}/{id}/adapter`	`client.registry.adapter(domain, id)`
`GET /registry/{domain}/{id}/versions`	`client.registry.versions(domain, id)`
`GET /registry/{domain}/{id}/validate-sources`	`client.registry.validate_sources(domain, id)`
`POST /admin/registry/{domain}/{id}/entities`	`client.registry.upsert_entities(domain, id, rows)`
`DELETE /admin/registry/{domain}/{id}`	`client.registry.delete(domain, id)`
`POST /auth/login`	`Storywrangler.login(username, password)`
`GET /auth/me`	`client.users.whoami()`
`GET /admin/auth/users`	`client.users.list()`
`POST /admin/auth/users`	`client.users.create(username, email, password, role=)`
`PUT /admin/auth/users/{id}/role`	`client.users.set_role(user_id, role)`
`GET /storywrangler/top-ngrams`	`client.instrument.top_ngrams(...)` or `client.dataset(domain, id).top_ngrams(...)`
`GET /storywrangler/allotax`	`client.instrument.allotax(...)`
`GET /storywrangler/rtd`	`client.instrument.rtd(...)`
`GET /storywrangler/wordshift`	`client.instrument.wordshift(...)`
`GET /{domain}`	`client.dataset(domain)` — repr lists endpoints; `.endpoints` / `.datasets`
`GET /{domain}/term-series`	`client.dataset(domain, id).term_series(type, ...)`
`GET /{domain}/term-series/batch`	`client.dataset(domain, id).term_series_batch(types, ...)`
`GET /health/status`	`client.health.status()`
`GET /health/status/history`	`client.health.history()`
`GET /health/status/{domain}/{id}`	`client.health.dataset(domain, id)`
`GET /version`	`client.version()`
anything else	`client.get(path, **params)` — raw escape hatch

Reading data requires no API key — Storywrangler() works without one. A key is only needed for registration and admin routes.

tests/test_api_drift.py enforces the one-to-one mapping: a new API route fails CI until it has an SDK method (bespoke routes may map to the client.get() escape hatch).
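
A drift check of this kind amounts to a set difference between the routes the API declares and the routes the SDK covers. The sketch below is not the contents of tests/test_api_drift.py; the route lists, the made-up /brand-new-route path, and the function name unmapped_routes are illustrative assumptions:

```python
def unmapped_routes(api_routes, sdk_routes, escape_hatch=()):
    """Return API routes with neither a dedicated SDK method nor an escape-hatch entry."""
    covered = set(sdk_routes) | set(escape_hatch)
    return sorted(set(api_routes) - covered)


api_routes = [
    ("GET", "/registry/"),
    ("GET", "/registry/domains"),
    ("GET", "/version"),
    ("GET", "/brand-new-route"),  # hypothetical route with no SDK method yet
]
sdk_routes = {
    ("GET", "/registry/"): "client.registry.list",
    ("GET", "/registry/domains"): "client.registry.domains",
    ("GET", "/version"): "client.version",
}

drift = unmapped_routes(api_routes, sdk_routes)
print(drift)  # [('GET', '/brand-new-route')]
```

A CI test would then assert that the returned list is empty.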

Domain- and dataset-scoped clients

client.dataset(domain) returns a domain-scoped client. Displaying it lists the domain's endpoints (mirroring GET /{domain}), data endpoints are callable as methods, and filter kwargs are validated against the dataset each route declares it serves (its x-dataset annotation).

Dataset-specific metadata (.meta, .filters, .availability, .adapter) resolves automatically when the domain has exactly one registered dataset. Otherwise it raises an error that lists the choices; pass the id explicitly, as in client.dataset(domain, id), to bind one.

Any route without a dedicated method is callable by its guessable name: underscores map to dashes and kwargs become query params, so with wiki bound as below, wiki.revisions(limit=10) issues GET /wikimedia/revisions. .endpoints shows what exists:

wiki = client.dataset("wikimedia", "ngrams")
wiki.filters       # filter dimensions with defaults and valid values
wiki.availability  # date ranges per entity
wiki.adapter       # entity mapping rows: local_id ↔ entity_id ↔ entity_name
wiki.endpoints     # routes served under this domain, from the live OpenAPI spec
wiki.versions()    # version history

wiki.top_ngrams(dates="2026-05-01", granularity="daily", ngram_size=1)
wiki.term_series("hello", entity="wikidata:Q30", window=30)
wiki.term_series_batch(["hello", "world"], entity="wikidata:Q30")
wiki.allotax(entity="wikidata:Q30", dates="2026-05-01", ngram_size=1)
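
The naming rule for guessable routes (underscores map to dashes, kwargs become query params) can be sketched in isolation. This illustrates the rule only and is not the SDK's internals; guess_route is a hypothetical helper:

```python
from urllib.parse import urlencode


def guess_route(domain, method_name, **params):
    """Build the path and query string a guessable method name would map to."""
    path = f"/{domain}/{method_name.replace('_', '-')}"
    query = urlencode(params)
    return f"{path}?{query}" if query else path


print(guess_route("wikimedia", "revisions", limit=10))  # /wikimedia/revisions?limit=10
print(guess_route("wikimedia", "term_series"))          # /wikimedia/term-series
```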

Use .adapter to translate between global entity IDs and the values stored on disk:

wiki.adapter[0]
# {'local_id': 'United States', 'entity_id': 'wikidata:Q30',
#  'entity_name': 'United States', 'entity_ids': ['iso:US', ...]}
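
Given rows shaped like the one above, a lookup from any known global ID to the on-disk value takes a few lines. This is a sketch over plain dicts, assuming entity_ids lists alternate identifiers for the same entity; the sample row is copied from the output above with the elided alternates left out, and build_local_id_index is a hypothetical helper:

```python
def build_local_id_index(rows):
    """Map every canonical or alternate entity ID to its on-disk local_id."""
    index = {}
    for row in rows:
        index[row["entity_id"]] = row["local_id"]
        for alias in row.get("entity_ids", []):
            index[alias] = row["local_id"]
    return index


rows = [
    {"local_id": "United States", "entity_id": "wikidata:Q30",
     "entity_name": "United States", "entity_ids": ["iso:US"]},
]
index = build_local_id_index(rows)
print(index["wikidata:Q30"])  # United States
```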

DataFrames

Every response behaves like a plain dict or list and adds a .df() accessor that converts the tabular payload to a pandas DataFrame (this needs the pandas extra: pip install 'storywrangler[pandas]'):

wiki.term_series("hello", entity="wikidata:Q30", window=30).df()
#          date  counts   rank          freq
# 0  2026-06-13  109678  64247  4.700000e-07
# ...

wiki.adapter.df()                      # entity mapping as a table
wiki.term_series_batch(["a", "b"]).df()  # long format with a `type` column

Registration Schema

The registration payload (DatasetCreate) is defined in Specification §3.7.

Required fields

Field	Description
`catalog`	Producer identity (organisation or group)
`domain`	Owning service or router (e.g. `wikimedia`, `babynames`)
`dataset_id`	Short identifier, unique within domain
`data_location`	Path to data on disk (string or list of strings)
`data_format`	`parquet` or `parquet_hive`
`description`	Human-readable description
`ownership`	`{owner_group, contact}`
`lineage`	`{repo}` at minimum
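
A local pre-check against this table can catch an incomplete payload before it is sent. The server's validation against the specification remains authoritative; this is only a sketch, and check_payload is a hypothetical helper rather than an SDK function:

```python
# Field names copied from the "Required fields" table.
REQUIRED_FIELDS = (
    "catalog", "domain", "dataset_id", "data_location",
    "data_format", "description", "ownership", "lineage",
)
ALLOWED_FORMATS = ("parquet", "parquet_hive")


def check_payload(payload):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    fmt = payload.get("data_format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        problems.append(f"unsupported data_format: {fmt}")
    # A missing lineage is already reported above, so only inspect it when present.
    if "lineage" in payload and "repo" not in payload["lineage"]:
        problems.append("lineage needs at least a repo")
    return problems


payload = {
    "catalog": "vcsi",
    "domain": "babynames",
    "dataset_id": "names",
    "data_location": "/data/babynames",
    "data_format": "parquet_hive",
    "description": "US baby name frequencies by state, sex, and year.",
    "ownership": {"owner_group": "vcsi", "contact": "compstorylab@uvm.edu"},
    "lineage": {"repo": "https://github.com/org/babynames"},
}
print(check_payload(payload))  # []
```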

Storage formats

parquet — single file, flat directory, or explicit file list. All filtering via WHERE clauses.
parquet_hive — directory tree with col=val/ at every level. Partition levels are auto-discovered at registration time — you only declare time_dimension and optionally hash_bucket.
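
To make the col=val/ layout concrete, the sketch below pulls the partition levels out of a relative file path. It illustrates the directory convention only and is not how the server discovers partitions; the example path (year and state levels) and the helper name hive_partitions are assumptions:

```python
from pathlib import PurePosixPath


def hive_partitions(relative_path):
    """Extract col=val partition levels, in nesting order, from a relative file path."""
    levels = []
    for part in PurePosixPath(relative_path).parent.parts:
        column, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"not a col=val directory: {part!r}")
        levels.append((column, value))
    return levels


print(hive_partitions("year=2024/state=VT/data.parquet"))
# [('year', '2024'), ('state', 'VT')]
```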

Key optional fields

Field	Purpose
`endpoint_schema`	Output shape: `types-counts` (rank distributions) or `time-series` (tabular GROUP BY)
`transform`	Query axes: `time_dimension`, `filter_dimensions` (non-hive columns), `hash_bucket`
`entity_mapping`	Maps a local column to canonical entity IDs (see below)
`entities`	Entity rows: `{local_id, entity_id, entity_name}`
`manifest`	Coverage metadata (auto-derived — don't compute manually)
`version`	`"latest"` (default, mutable) or semver like `"1.0.0"` (immutable)
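
The version field therefore has two cases: the mutable "latest" tag and an immutable semver string. A small sketch of telling them apart, assuming plain MAJOR.MINOR.PATCH strings (the specification may allow more); is_mutable_version is a hypothetical helper:

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release and build suffixes are left out of this sketch.
SEMVER = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_mutable_version(version="latest"):
    """True for the mutable "latest" tag, False for an immutable semver string."""
    if version == "latest":
        return True
    if SEMVER.fullmatch(version):
        return False
    raise ValueError(f"expected 'latest' or a semver string, got {version!r}")


print(is_mutable_version())         # True
print(is_mutable_version("1.0.0"))  # False
```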

Auto-derived at registration

The server computes these from the data — submitters should not set them:

data_schema — column names and types
level_order — hive nesting order with type tags and defaults
manifest.availability — time/entity coverage ranges
filter_values — distinct values per filter dimension
hash_bucket config — bucket counts per entity

Entity Mapping

`entity_namespace` — declaring identifier type

entity_mapping.entity_namespace tells the platform what kind of entity the local-ID column holds. This enables cross-dataset joins and automatic entity resolution.

Pattern 1 — opaque local keys (entity rows required):

# Column holds state abbreviations — a lookup table maps them to Wikidata
{
    "entity_mapping": {"local_id_column": "state", "entity_namespace": "wikidata"},
    "entities": [
        {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
    ],
}

Pattern 2 — global-identifier column (no entity rows needed):

# Column already holds OpenAlex author URLs
{
    "entity_mapping": {"local_id_column": "ego_author_id", "entity_namespace": "openalex"},
    # no "entities" list required — the platform derives canonical IDs from the namespace
}
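
For Pattern 1, every value in the local-ID column needs a matching entity row. The scaffold's tests/ directory is described as holding entity coverage tests; the sketch below shows one way such a check could look. It is not the generated test, and uncovered_local_ids is a hypothetical name:

```python
def uncovered_local_ids(column_values, entities):
    """Return local IDs that appear in the data but have no entity row."""
    mapped = {row["local_id"] for row in entities}
    return sorted(set(column_values) - mapped)


entities = [
    {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
]
print(uncovered_local_ids(["VT", "VT", "NH"], entities))  # ['NH']
```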

Instruments

The SDK wraps the platform's analytical endpoints so you can call them from Python without building HTTP requests.

Allotaxonometer

Compare two type-frequency systems using rank-turbulence divergence:

result = client.instrument.allotax(
    domain="wikimedia", dataset="ngrams",
    entity="wikidata:Q30", entity2="wikidata:Q145",
    dates="2024-10-01,2024-10-31",
    alpha=1.0,
)
# result keys: normalization, delta_sum, diamond_counts, wordshift, balance, meta

Filter dimensions are passed as keyword arguments; as with entity2 and dates2, a 2 suffix sets the value for the second system:

result = client.instrument.allotax(
    domain="babynames", dataset="ngrams",
    dates="1925", dates2="2025",
    sex="M", sex2="F",
)

RTD (lightweight)

Fast date-vs-date wordshift (no diamond plot):

result = client.instrument.rtd(
    entity="wikidata:Q30",
    dates="2026-02-17", dates2="2026-02-10",
)
# result keys: wordshift, alpha, meta

For the underlying computation without the platform, use the allotax package directly.

Word shift

Weighted-average sentiment word shift between two systems, scored with a bundled labMT lexicon. The sentiment analogue of RTD (no alpha):

result = client.instrument.wordshift(
    entity="wikidata:Q30",
    dates="2026-02-10", dates2="2026-02-17",
    lexicon="labMT_English",
    ngram_size=1,   # labMT scores single words; bigrams score as neutral
)
# result keys: entries, component_sums, s_avg_1, s_avg_2, reference_value, meta

System 1 is the baseline; system 2 is read as a shift away from it. Omit entity2 for a date-vs-date shift, or set it to compare two entities. Keep ngram_size=1 (the default): labMT is a unigram lexicon, so higher n-gram sizes match nothing and return an empty shift. For the underlying computation without the platform, use the wordshift package directly.
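
The weighted-average shift can be written out in a few lines. Each word contributes (p2 - p1) * (score - reference), where p1 and p2 are the word's relative frequencies in the two systems and the reference is system 1's average score, so the contributions sum to s_avg_2 - s_avg_1. This is a sketch of the standard formula rather than the platform's implementation. The counts and word scores are made-up toy values, not labMT entries, and dropping words that are missing from the lexicon before normalising is an assumption:

```python
def weighted_average_shift(counts_1, counts_2, scores):
    """Per-word contributions to the change in average score from system 1 to system 2."""
    words = [w for w in set(counts_1) | set(counts_2) if w in scores]
    total_1 = sum(counts_1.get(w, 0) for w in words)
    total_2 = sum(counts_2.get(w, 0) for w in words)
    p1 = {w: counts_1.get(w, 0) / total_1 for w in words}
    p2 = {w: counts_2.get(w, 0) / total_2 for w in words}
    s_avg_1 = sum(p1[w] * scores[w] for w in words)
    s_avg_2 = sum(p2[w] * scores[w] for w in words)
    # System 1 is the baseline, so its average is the reference value.
    entries = {w: (p2[w] - p1[w]) * (scores[w] - s_avg_1) for w in words}
    return entries, s_avg_1, s_avg_2


toy_scores = {"happy": 8.0, "sad": 2.0, "the": 5.0}  # illustrative values only
entries, s1, s2 = weighted_average_shift(
    {"happy": 1, "sad": 1, "the": 2},
    {"happy": 3, "the": 1},
    toy_scores,
)
print(s1, s2)  # 5.0 7.25
```

In this toy case "happy" contributes 1.5 and the disappearance of "sad" contributes 0.75, which together account for the 2.25 rise in the average.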

Hash Bucket Assignment

For datasets with content-sharded partitions (transform.hash_bucket), use assign_bucket() to partition files consistently with the query layer:

from storywrangler.hashing import assign_bucket

# In your transform step — assign each row to a bucket directory
bucket = assign_bucket(term="hello world", num_buckets=16)
# → writes to ngram_bucket={bucket}/data.parquet

This uses murmur3_32 (seed 0) with a sign-bit mask, matching DuckDB's built-in murmur3_32() default. Both the backend query router and pipeline code import from the same source — storywrangler_schemas.hashing — ensuring bucket assignments are always consistent.
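
For readers who want to see what such a bucket computation involves, here is a pure-Python sketch. It implements the public MurmurHash3 x86 32-bit algorithm with seed 0. Reading "sign-bit mask" as clearing the top bit (& 0x7FFFFFFF) and encoding the term as UTF-8 are assumptions, and sketch_bucket is a hypothetical name. In a pipeline, import assign_bucket from the SDK rather than copying this, since only the shared implementation guarantees matching buckets:

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """MurmurHash3 x86 32-bit of a bytes object, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    full = len(data) - len(data) % 4
    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    # Finalisation: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def sketch_bucket(term, num_buckets):
    """Bucket index in [0, num_buckets); the mask and the encoding are assumptions."""
    h = murmur3_32(term.encode("utf-8"), seed=0)
    return (h & 0x7FFFFFFF) % num_buckets


bucket = sketch_bucket("hello world", num_buckets=16)
```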

Entity Validation

from storywrangler.validation import EntityValidator

validator = EntityValidator()

validator.validate_wikidata("wikidata:Q937")            # True
validator.validate_orcid("orcid:0000-0002-1825-0097")   # True
validator.validate_openalex("openalex:A5002034958")     # True
validator.validate("ror:05qghxh33")                     # True (any namespace)

Supported Namespaces

Namespace	Format example	Entity types
`wikidata`	`wikidata:Q937`	People, places, concepts, …
`orcid`	`orcid:0000-0002-1825-0097`	Researchers
`openalex`	`openalex:A5002034958`	Authors (A), Works (W), Institutions (I), Concepts (C), Sources (S), Funders (F), Publishers (P)
`ror`	`ror:05qghxh33`	Research organisations
`ipeds`	`ipeds:231174`	US higher-ed institutions
`doi`	`doi:10.1038/nature12373`	Published works
`isbn`	`isbn:978-3-16-148410-0`	Books
`local`	`local:<any-string>`	Dataset-local identifiers
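
Some of the formats in the table can be approximated with regular expressions, and ORCID iDs additionally carry an ISO 7064 MOD 11-2 check character. The sketch below is not EntityValidator and may differ from the rules in the specification: the patterns are inferred from the examples in the table, only three namespaces are covered, and looks_valid is a hypothetical name:

```python
import re

# Patterns inferred from the format examples above; the specification is authoritative.
PATTERNS = {
    "wikidata": re.compile(r"Q[1-9]\d*"),
    "orcid": re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]"),
    "openalex": re.compile(r"[AWICSFP]\d+"),
}


def orcid_check_character(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def looks_valid(entity_id):
    """Format check for a namespaced ID; namespaces not in PATTERNS return False here."""
    namespace, sep, local = entity_id.partition(":")
    pattern = PATTERNS.get(namespace)
    if not sep or pattern is None or not pattern.fullmatch(local):
        return False
    if namespace == "orcid":
        digits = local.replace("-", "")
        return orcid_check_character(digits[:15]) == digits[15]
    return True


print(looks_valid("orcid:0000-0002-1825-0097"))  # True
print(looks_valid("orcid:0000-0002-1825-0098"))  # False (wrong check character)
```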

Entity Graph (Beta)

The backend maintains an entity graph — a directed adjacency list of edges between canonical entity IDs. This enables multi-hop traversal across namespaces.

openalex:A5002034958
  --affiliated_with--> openalex:I26873012   (UVM)
  --same_as----------> wikidata:Q1068        (UVM on Wikidata)
  --country----------> wikidata:Q30          (United States)

Supported predicates: affiliated_with, same_as, country, broader

API endpoints:

GET  /registry/entity-graph/path?from_id=openalex:A5002034958&to_namespace=wikidata
GET  /registry/entity-graph/neighbors?entity_id=openalex:I26873012
POST /admin/registry/entity-graph          # upsert edges (admin)
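
Multi-hop traversal over a directed adjacency list is a breadth-first search. The sketch below runs locally on the example edges and is not the backend's implementation. It reads the example as a chain (author to institution to Wikidata item to country), which is an interpretation of the diagram, and path_to_namespace is a hypothetical function that loosely mirrors the from_id and to_namespace parameters of the path endpoint:

```python
from collections import deque

# Edges from the example above, read as a chain; each entry is (predicate, target).
EDGES = {
    "openalex:A5002034958": [("affiliated_with", "openalex:I26873012")],
    "openalex:I26873012": [("same_as", "wikidata:Q1068")],
    "wikidata:Q1068": [("country", "wikidata:Q30")],
}


def path_to_namespace(edges, from_id, to_namespace):
    """Breadth-first search for the shortest path to any node in the target namespace."""
    queue = deque([[from_id]])
    seen = {from_id}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != from_id and node.startswith(to_namespace + ":"):
            return path
        for _predicate, target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(path + [target])
    return None  # no reachable node in that namespace


print(path_to_namespace(EDGES, "openalex:A5002034958", "wikidata"))
# ['openalex:A5002034958', 'openalex:I26873012', 'wikidata:Q1068']
```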

Standards Compliance

This SDK implements Storywrangler Specification v0.0.3.

All validators follow the format requirements and validation algorithms defined in the specification.
