Storywrangler SDK
Dataset registration client, entity validation, and project scaffolding for Storywrangler.
Implements the Storywrangler Specification v0.0.3.
Installation
pip install storywrangler
Quick Start
1. Scaffold a new dataset project
# Flat parquet
uvx storywrangler new my-dataset --format parquet
# Hive-partitioned parquet
uvx storywrangler new my-dataset --format parquet_hive
# With snakemake instead of make
uvx storywrangler new my-dataset --format parquet_hive --orchestrator snakemake
This generates the project structure:
my-dataset/
  config/entities.yaml        # entity mappings (local_id → canonical ID)
  extract/src/scrape.py       # download raw data
  transform/src/process.py    # process into parquet
  adapter/submit.py           # register with the platform
  tests/                      # entity coverage tests
  Makefile                    # or Snakefile
2. Configure and register
cd my-dataset
cp .env.example .env # fill in DATASET_ID, DOMAIN, DATA_PATH, API_KEY
uv sync
# Edit adapter/submit.py, config/entities.yaml
make submit
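Before running make submit, it can help to confirm that the four variables named in the .env comment are actually set. The following is a minimal sketch, assuming the values end up in the process environment; the helper name missing_env_vars is hypothetical and not part of the SDK.

```python
import os
from typing import Mapping

# Variable names taken from the .env.example comment above.
REQUIRED_ENV_VARS = ("DATASET_ID", "DOMAIN", "DATA_PATH", "API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank (hypothetical helper)."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing in environment:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing the mapping as a parameter keeps the check testable without touching the real environment.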
3. Or register programmatically
from storywrangler import Storywrangler

client = Storywrangler()  # reads API_KEY from env
client.registry.register({
    "catalog": "vcsi",
    "domain": "babynames",
    "dataset_id": "names",
    "data_location": "/data/babynames",
    "data_format": "parquet_hive",
    "description": "US baby name frequencies by state, sex, and year.",
    "endpoint_schema": {"type": "types-counts"},
    "transform": {"time_dimension": "year"},
    "entity_mapping": {"local_id_column": "state", "entity_namespace": "wikidata"},
    "entities": [
        {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
        # ...
    ],
    "ownership": {"owner_group": "vcsi", "contact": "compstorylab@uvm.edu"},
    "lineage": {"repo": "https://github.com/org/babynames"},
})
Registration Schema
The registration payload (DatasetCreate) is defined in Specification §3.7.
Required fields
| Field | Description |
|---|---|
| catalog | Producer identity (organisation or group) |
| domain | Owning service or router (e.g. wikimedia, babynames) |
| dataset_id | Short identifier, unique within domain |
| data_location | Path to data on disk (string or list of strings) |
| data_format | parquet or parquet_hive |
| description | Human-readable description |
| ownership | {owner_group, contact} |
| lineage | {repo} at minimum |
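A payload can be checked locally against the required fields before it is sent. This is only a sketch: check_payload is a hypothetical helper, the field names are copied from the list above, and the server-side validation defined in the specification may be stricter.

```python
# Field names copied from the required-fields list; nested keys follow the
# {owner_group, contact} and {repo} shapes given there.
REQUIRED_FIELDS = (
    "catalog", "domain", "dataset_id", "data_location",
    "data_format", "description", "ownership", "lineage",
)
REQUIRED_NESTED = {"ownership": ("owner_group", "contact"), "lineage": ("repo",)}
ALLOWED_FORMATS = {"parquet", "parquet_hive"}


def check_payload(payload: dict) -> list[str]:
    """Return locally detectable problems; an empty list means none were found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    for parent, keys in REQUIRED_NESTED.items():
        section = payload.get(parent)
        if isinstance(section, dict):
            problems += [f"missing field: {parent}.{key}" for key in keys if key not in section]
    if "data_format" in payload and payload["data_format"] not in ALLOWED_FORMATS:
        problems.append(f"unsupported data_format: {payload['data_format']!r}")
    return problems


incomplete = {
    "catalog": "vcsi",
    "domain": "babynames",
    "dataset_id": "names",
    "data_location": "/data/babynames",
    "data_format": "parquet_hive",
    "ownership": {"owner_group": "vcsi", "contact": "compstorylab@uvm.edu"},
    "lineage": {},
}
print(check_payload(incomplete))
# ['missing field: description', 'missing field: lineage.repo']
```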
Storage formats
- parquet: single file, flat directory, or explicit file list. All filtering via WHERE clauses.
- parquet_hive: directory tree with col=val/ at every level. Partition levels are auto-discovered at registration time; you only declare time_dimension and, optionally, hash_bucket.
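To make the col=val/ convention concrete, the sketch below reads partition columns back out of a relative file path. The function name and the year/state layout are illustrative assumptions; the platform's own discovery logic is not shown in this README.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(relative_path: str) -> dict[str, str]:
    """Extract col=val pairs from the directory components of a relative path."""
    partitions = {}
    for part in PurePosixPath(relative_path).parent.parts:
        column, sep, value = part.partition("=")
        if sep and column:
            partitions[column] = value
    return partitions


print(parse_hive_partitions("year=2020/state=VT/data.parquet"))
# {'year': '2020', 'state': 'VT'}
print(parse_hive_partitions("data.parquet"))
# {}
```

A flat parquet file yields no partition columns, which is why filtering there relies on WHERE clauses alone.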
Key optional fields
| Field | Purpose |
|---|---|
| endpoint_schema | Output shape: types-counts (rank distributions) or time-series (tabular GROUP BY) |
| transform | Query axes: time_dimension, filter_dimensions (non-hive columns), hash_bucket |
| entity_mapping | Maps a local column to canonical entity IDs (see below) |
| entities | Entity rows: {local_id, entity_id, entity_name} |
| manifest | Coverage metadata (auto-derived; don't compute manually) |
| version | "latest" (default, mutable) or semver like "1.0.0" (immutable) |
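The version rule can be expressed as a small classifier. This is a sketch under a simplifying assumption: only plain MAJOR.MINOR.PATCH strings are treated as semver here, and classify_version is a hypothetical name rather than an SDK function.

```python
import re

# Simplified MAJOR.MINOR.PATCH pattern. Full semver also allows pre-release
# and build suffixes, which this sketch deliberately ignores.
_SEMVER = re.compile(r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)")


def classify_version(version: str = "latest") -> str:
    """Return 'mutable' for "latest" and 'immutable' for a semver string."""
    if version == "latest":
        return "mutable"
    if _SEMVER.fullmatch(version):
        return "immutable"
    raise ValueError(f"not 'latest' or MAJOR.MINOR.PATCH: {version!r}")


print(classify_version())         # mutable
print(classify_version("1.0.0"))  # immutable
```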
Auto-derived at registration
The server computes these from the data — submitters should not set them:
- data_schema: column names and types
- level_order: hive nesting order with type tags and defaults
- manifest.availability: time/entity coverage ranges
- filter_values: distinct values per filter dimension
- hash_bucket config: bucket counts per entity
Entity Mapping
entity_namespace — declaring identifier type
entity_mapping.entity_namespace tells the platform what kind of entity the
local-ID column holds. This enables cross-dataset joins and automatic entity
resolution.
Pattern 1 — opaque local keys (entity rows required):
# Column holds state abbreviations — a lookup table maps them to Wikidata
{
    "entity_mapping": {"local_id_column": "state", "entity_namespace": "wikidata"},
    "entities": [
        {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
    ],
}
Pattern 2 — global-identifier column (no entity rows needed):
# Column already holds OpenAlex author URLs
{
    "entity_mapping": {"local_id_column": "ego_author_id", "entity_namespace": "openalex"},
    # no "entities" list required: the platform derives canonical IDs from the namespace
}
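With Pattern 1, every value in the local-ID column needs a matching entity row. A coverage check along these lines could live in the scaffolded tests/ directory; it is a sketch only, and both the helper name and the sample column values are hypothetical.

```python
def unmapped_local_ids(column_values, entities):
    """Return local IDs that appear in the data but have no entity row."""
    mapped = {row["local_id"] for row in entities}
    return sorted(set(column_values) - mapped)


entities = [
    {"local_id": "VT", "entity_id": "wikidata:Q16551", "entity_name": "Vermont"},
]
state_column = ["VT", "VT", "NH"]  # hypothetical values from the "state" column

print(unmapped_local_ids(state_column, entities))
# ['NH']
```

An empty result means the entities list covers every local ID present in the data.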
Hash Bucket Assignment
For datasets with content-sharded partitions (transform.hash_bucket), use assign_bucket() to partition files consistently with the query layer:
from storywrangler.hashing import assign_bucket
# In your transform step — assign each row to a bucket directory
bucket = assign_bucket(term="hello world", num_buckets=16)
# → writes to ngram_bucket={bucket}/data.parquet
This uses murmur3_32 (seed 0) with a sign-bit mask, matching DuckDB's built-in murmur3_32() default. Both the backend query router and pipeline code import from the same source — storywrangler_schemas.hashing — ensuring bucket assignments are always consistent.
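For readers who want to see the mechanics, here is a pure-Python sketch of the 32-bit MurmurHash3 (x86 variant) followed by a bucket reduction. Several details are assumptions, not taken from the SDK source: the term is encoded as UTF-8, the "sign-bit mask" is read as `& 0x7FFFFFFF`, and the bucket is the masked hash modulo num_buckets. In real pipelines, use assign_bucket() itself so results stay consistent with the query layer.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n_blocks = len(data) // 4

    def mix(k: int) -> int:
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left by 15
        return (k * c2) & mask

    # Body: process 4-byte little-endian blocks.
    for i in range(n_blocks):
        h ^= mix(int.from_bytes(data[4 * i:4 * i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))

    # Finalization (avalanche).
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def sketch_assign_bucket(term: str, num_buckets: int) -> int:
    """Hypothetical stand-in for assign_bucket(); see the assumptions above."""
    hashed = murmur3_32(term.encode("utf-8"), seed=0)
    return (hashed & 0x7FFFFFFF) % num_buckets  # clear the sign bit, then reduce


bucket = sketch_assign_bucket("hello world", 16)
print(f"ngram_bucket={bucket}/data.parquet")
```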
Entity Validation
from storywrangler.validation import EntityValidator
validator = EntityValidator()
validator.validate_wikidata("wikidata:Q937") # True
validator.validate_orcid("orcid:0000-0002-1825-0097") # True
validator.validate_openalex("openalex:A5002034958") # True
validator.validate("ror:05qghxh33") # True (any namespace)
Supported Namespaces
| Namespace | Format example | Entity types |
|---|---|---|
| wikidata | wikidata:Q937 | People, places, concepts, … |
| orcid | orcid:0000-0002-1825-0097 | Researchers |
| openalex | openalex:A5002034958 | Authors (A), Works (W), Institutions (I), Concepts (C), Sources (S), Funders (F), Publishers (P) |
| ror | ror:05qghxh33 | Research organisations |
| ipeds | ipeds:231174 | US higher-ed institutions |
| doi | doi:10.1038/nature12373 | Published works |
| isbn | isbn:978-3-16-148410-0 | Books |
| local | local:&lt;any-string&gt; | Dataset-local identifiers |
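To illustrate what a format check for these identifiers can look like, the sketch below splits a namespace:value string and validates two namespaces. It is not the SDK's EntityValidator: the function names are hypothetical, the Wikidata pattern is an assumption, and the ORCID check uses the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers. The specification remains the authority on the actual rules.

```python
import re

WIKIDATA_RE = re.compile(r"Q[1-9][0-9]*")  # assumed pattern for item IDs
ORCID_RE = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def split_entity_id(entity_id: str) -> tuple[str, str]:
    """Split 'namespace:value' on the first colon."""
    namespace, sep, local = entity_id.partition(":")
    if not sep or not namespace or not local:
        raise ValueError(f"expected 'namespace:value', got {entity_id!r}")
    return namespace, local


def orcid_checksum_ok(orcid: str) -> bool:
    """Verify the final character with ISO 7064 MOD 11-2."""
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return digits[-1] == ("X" if check == 10 else str(check))


def looks_valid(entity_id: str) -> bool:
    """Format check for the two namespaces covered by this sketch."""
    try:
        namespace, local = split_entity_id(entity_id)
    except ValueError:
        return False
    if namespace == "wikidata":
        return WIKIDATA_RE.fullmatch(local) is not None
    if namespace == "orcid":
        return ORCID_RE.fullmatch(local) is not None and orcid_checksum_ok(local)
    return False  # other namespaces are out of scope here


print(looks_valid("wikidata:Q937"))              # True
print(looks_valid("orcid:0000-0002-1825-0097"))  # True
print(looks_valid("orcid:0000-0002-1825-0098"))  # False
```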
Entity Graph (Beta)
The backend maintains an entity graph — a directed adjacency list of edges between canonical entity IDs. This enables multi-hop traversal across namespaces.
openalex:A5002034958
--affiliated_with--> openalex:I26873012 (UVM)
--same_as----------> wikidata:Q1068 (UVM on Wikidata)
--country----------> wikidata:Q30 (United States)
Supported predicates: affiliated_with, same_as, country, broader
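Multi-hop traversal over such a graph can be sketched as a breadth-first search on an adjacency list. The edges below are copied from the example and read as a single chain, which is an assumption about the diagram; find_path and the in-memory EDGES dict are illustrative and do not reflect the backend's storage or API.

```python
from collections import deque
from typing import Optional

# Adjacency list: source ID -> list of (predicate, target ID).
EDGES = {
    "openalex:A5002034958": [("affiliated_with", "openalex:I26873012")],
    "openalex:I26873012": [("same_as", "wikidata:Q1068")],
    "wikidata:Q1068": [("country", "wikidata:Q30")],
}


def find_path(start: str, to_namespace: str) -> Optional[list[str]]:
    """Breadth-first search for the nearest entity in the target namespace."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node.split(":", 1)[0] == to_namespace:
            return path
        for _predicate, neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(find_path("openalex:A5002034958", "wikidata"))
# ['openalex:A5002034958', 'openalex:I26873012', 'wikidata:Q1068']
```

Breadth-first order returns the closest match first, so the search stops at wikidata:Q1068 rather than continuing to wikidata:Q30.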
API endpoints:
GET /registry/entity-graph/path?from_id=openalex:A5002034958&to_namespace=wikidata
GET /registry/entity-graph/neighbors?entity_id=openalex:I26873012
POST /admin/registry/entity-graph # upsert edges (admin)
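When calling the path endpoint from code, the query string is best built with a URL encoder rather than by hand. A small sketch follows; BASE_URL is a placeholder, since this README does not state where the registry is hosted, and the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org"  # placeholder host, not a real deployment


def entity_graph_path_url(from_id: str, to_namespace: str) -> str:
    """Build the GET URL for the entity-graph path endpoint."""
    query = urlencode({"from_id": from_id, "to_namespace": to_namespace})
    return f"{BASE_URL}/registry/entity-graph/path?{query}"


print(entity_graph_path_url("openalex:A5002034958", "wikidata"))
# https://example.org/registry/entity-graph/path?from_id=openalex%3AA5002034958&to_namespace=wikidata
```

The colon in the entity ID is percent-encoded as %3A, which a server decodes back to the original value.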
Standards Compliance
This SDK implements Storywrangler Specification v0.0.3.
All validators follow the format requirements and validation algorithms defined in the specification.