Heterogeneous Information Network tool for GitHub research datasets

HINGE — Heterogeneous Information Network for Generalized Extraction

hinge ingests GitHub activity datasets (JSONL today; CSV/Parquet planned), persists them as a typed multi-relational graph in DuckDB, and exposes projections — dbt SQL models that derive task-specific sub-graphs and export them to formats consumed by Gephi, NetworkX, igraph, etc.

This is a research artefact accompanying an ICSME 2026 Tools and Data Showcase submission.


Quick start

# 1. Install (uv is the only supported package manager)
uv sync --all-extras

# 2. Ingest a dataset
uv run hinge ingest path/to/events.jsonl --reader numfocus
# → ingested 48 elements → 29 nodes, 19 edges (0 violations)
# → Dataset ID: 78fc87c370944dc2b4a4e2d4bdd97ce1

# 3. Use the printed Dataset ID to run a projection and export
uv run hinge export \
  --dataset 78fc87c370944dc2b4a4e2d4bdd97ce1 \
  --projection user-user-repo-collaboration \
  --format gml \
  -o out/graph.gml

# 4. Inspect all ingested datasets
uv run hinge list datasets

Or via Docker:

docker compose build
cp path/to/events.jsonl ./data/
docker compose run --rm hinge hinge ingest /data/events.jsonl --reader numfocus
docker compose run --rm hinge hinge export \
  --dataset <id> --projection user-user-repo-collaboration --format gml -o /output/graph.gml

The DuckDB store lives at $HINGE_STORE_PATH (default ./network.duckdb locally, /store/network.duckdb inside the container). Multiple datasets can coexist in the same file — each ingest run gets a unique Dataset ID.
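
Scripts that open the store directly can reproduce the same lookup in a few lines. This is a minimal sketch that assumes only what is stated above (the HINGE_STORE_PATH variable and the local default ./network.duckdb). The helper name resolve_store_path is hypothetical and not part of hinge.

```python
import os
from pathlib import Path

# Local default quoted above; inside the container it is /store/network.duckdb.
DEFAULT_STORE_PATH = "./network.duckdb"


def resolve_store_path(env=None):
    """Return the DuckDB store path, honouring HINGE_STORE_PATH when it is set.

    Hypothetical helper for illustration; hinge's own lookup may differ.
    """
    env = os.environ if env is None else env
    return Path(env.get("HINGE_STORE_PATH") or DEFAULT_STORE_PATH)


if __name__ == "__main__":
    print(resolve_store_path())
```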


HIN contract + adapters

The source-agnostic core starts at the adapter contract tables:

source-specific adapter
  -> contract_accounts / contract_repositories / contract_artifacts / contract_relations
  -> hin_nodes / hin_edges
  -> dbt projections
  -> exporters

The numfocus reader uses a DuckDB adapter for this repo's NumFocus Actions JSONL scrape. It is intentionally source-specific: it knows paths like $.actor.login and $.details.pull_request.id. Other data sources should implement their own adapter that writes the same contract_* tables; then the HIN views and dbt recipes can run unchanged.
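
To make the adapter idea concrete, the sketch below walks the two JSON paths mentioned above in plain Python and sorts the results into account-like and artifact-like rows. It is an illustration only: the real numfocus adapter runs inside DuckDB, and the row keys used here (login, artifact_id, artifact_type) are placeholders, not the actual contract_* column names.

```python
import json


def dig(obj, *keys):
    """Follow nested dict keys, returning None when any level is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def extract_contract_rows(jsonl_text):
    """Split JSONL events into account rows and artifact rows.

    The row keys are illustrative; the real contract_* columns are defined
    by hinge, not by this sketch.
    """
    accounts, artifacts = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between events
        event = json.loads(line)
        login = dig(event, "actor", "login")                 # $.actor.login
        pr_id = dig(event, "details", "pull_request", "id")  # $.details.pull_request.id
        if login is not None:
            accounts.append({"login": login})
        if pr_id is not None:
            artifacts.append({"artifact_id": pr_id, "artifact_type": "pull_request"})
    return accounts, artifacts


# Two made-up events: one with a pull request, one without.
sample = "\n".join([
    json.dumps({"actor": {"login": "alice"}, "details": {"pull_request": {"id": 7}}}),
    json.dumps({"actor": {"login": "bob"}, "details": {}}),
])
accounts, artifacts = extract_contract_rows(sample)
print(accounts)
print(artifacts)
```

An adapter for another data source would change only the paths it reads; everything downstream of the contract rows stays the same.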

The generic Python ReaderStage path is still available for simple custom readers, but large JSONL ingests should use a DuckDB/SQL contract adapter where possible.


Adding a new projection

A projection is a dbt SQL model that derives a task-specific sub-graph from the typed HIN stored in DuckDB. Adding one takes four steps (two new files, one pyproject.toml edit, and a re-install) and no changes to the kernel.

Step 1 — Write the SQL model

Create hinge/dbt/models/networks/<name>.sql. The model should read from the canonical HIN dbt models and produce a fixed set of columns:

-- Inputs (prefer these canonical HIN models, never the raw tables)
{{ ref('hin_nodes') }}
{{ ref('hin_edges') }}

-- Output (prefer the network_edges macro; it emits this standard schema)
network_edges(
    recipe_name, recipe_version,
    source_node_id, source_node_type,
    target_node_id, target_node_type,
    directed, edge_type,
    weight, weight_kind,
    n_contexts, n_events,
    first_seen_at, last_seen_at,
    time_bin, bot_policy,
    properties
)

DbtProjection converts this richer schema into TypedEdge objects for existing exporters, merging standard fields and properties into edge attrs.
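
The merge described above can be pictured roughly as follows. This is a sketch under stated assumptions, not DbtProjection's actual code: SketchEdge stands in for TypedEdge (whose real fields are not shown here), the sample row is made up, and letting standard columns win over properties on a name clash is an assumption.

```python
from dataclasses import dataclass, field

# Standard network_edges columns copied into the edge attributes.
STANDARD_ATTR_COLUMNS = (
    "recipe_name", "recipe_version", "directed",
    "weight", "weight_kind", "n_contexts", "n_events",
    "first_seen_at", "last_seen_at", "time_bin", "bot_policy",
)


@dataclass
class SketchEdge:
    """Stand-in for hinge's TypedEdge; the real class may have different fields."""
    source_id: str
    target_id: str
    edge_type: str
    attrs: dict = field(default_factory=dict)


def row_to_edge(row):
    """Merge the standard columns and the free-form properties into one attrs dict."""
    attrs = {col: row[col] for col in STANDARD_ATTR_COLUMNS if row.get(col) is not None}
    # Assumption: on a name clash the standard column wins over properties.
    for key, value in (row.get("properties") or {}).items():
        attrs.setdefault(key, value)
    return SketchEdge(row["source_node_id"], row["target_node_id"], row["edge_type"], attrs)


# Made-up result row in the network_edges shape.
row = {
    "recipe_name": "demo", "recipe_version": "1",
    "source_node_id": "user:alice", "source_node_type": "user",
    "target_node_id": "user:bob", "target_node_type": "user",
    "directed": False, "edge_type": "collaborates",
    "weight": 3.0, "weight_kind": "count",
    "n_contexts": 2, "n_events": 3,
    "first_seen_at": None, "last_seen_at": None,
    "time_bin": None, "bot_policy": None,
    "properties": {"repos": ["numpy"], "weight": 99},
}
edge = row_to_edge(row)
print(edge.attrs)
```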

The upstream HIN models are built from active_* views created by the store immediately before dbt runs — they are already filtered to the requested dataset_id, so network SQL never needs to reference dataset_id at all.

Nodes-only projections: the pipeline derives output nodes from the union of source_node_id and target_node_id in the result table. A projection that emits no edges will therefore produce no nodes either. The workaround is to use self-loop edges (source_node_id = target_node_id): they make the nodes visible to the exporter, carry metadata in properties, and can be filtered out in downstream tools with G.remove_edges_from(nx.selfloop_edges(G)).
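
The node-derivation rule and the self-loop workaround can be illustrated without any graph library. The sketch below uses plain (source, target) tuples in place of result rows; the function names are hypothetical and only mirror the behaviour described above.

```python
def derive_nodes(edges):
    """Collect node IDs from the union of edge sources and targets."""
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
    return nodes


def drop_self_loops(edges):
    """Remove edges whose source and target are the same node."""
    return [(s, t) for s, t in edges if s != t]


# carol has no edges, so she never reaches the exporter.
edges = [("alice", "bob")]
print(sorted(derive_nodes(edges)))            # ['alice', 'bob']

# A self-loop makes carol visible.
edges_with_loop = edges + [("carol", "carol")]
nodes = derive_nodes(edges_with_loop)
print(sorted(nodes))                          # ['alice', 'bob', 'carol']

# Downstream, the loop can be dropped while the node set is kept.
clean_edges = drop_self_loops(edges_with_loop)
print(clean_edges)                            # [('alice', 'bob')]
```

In NetworkX the last step corresponds to the G.remove_edges_from(nx.selfloop_edges(G)) call quoted above, which removes the loops but leaves the nodes in the graph.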

See dev_interaction.sql for a full working example with a documented input/output contract.

Step 2 — Create the spec module

Create hinge/stages/projection/specs/<name>.py and expose a SPEC constant:

from hinge.kernel.projection.projection_spec import ProjectionSpec

SPEC = ProjectionSpec(
    name="my-projection",          # CLI key: --projection my-projection
    description="...",
    model_name="my_projection",    # must match the .sql file stem
    output_node_types=["user"],
    output_edge_types=["my_edge_label"],
)

SPEC is a plain value object: you instantiate ProjectionSpec directly, with no subclassing or inheritance. The registry loads the module and returns the SPEC attribute as-is.

Step 3 — Register the entry-point

Add one line to pyproject.toml:

[project.entry-points."hinge.projection_specs"]
my-projection = "hinge.stages.projection.specs.my_projection:SPEC"

Third-party packages can register projections the same way — no fork required.

Step 4 — Re-install so the entry-point is picked up

uv sync --all-extras
uv run hinge list projections    # → my-projection should appear

Entry-points are baked into .dist-info/entry_points.txt at install time. Without this step the registry will not find the new spec.

Run it

uv run hinge export \
  --dataset <id> \
  --projection my-projection \
  --format gml \
  -o output/result.gml

Local SQL custom projection

For research-specific variants, write a local dbt model and run it without packaging or entry-points:

uv run hinge export-sql custom_star_user_repo.sql \
  --dataset <id> \
  --format gml \
  -o output/custom.gml

The SQL file is temporarily added to the built-in dbt project, so it can use ref('hin_edges'), ref('int_user_artifact_incidence'), and all macros under hinge/dbt/macros/. It must still emit the standard network_edges schema. Use --name valid_model_name if the filename is not a valid dbt identifier.
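
A quick way to tell in advance whether --name will be needed is to test the file stem against an identifier pattern. The rule below (letters, digits and underscores, not starting with a digit) is an assumption about what counts as a valid model name; hinge's own check may be stricter or looser, and the function name is hypothetical.

```python
import re
from pathlib import Path

# Assumed rule: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def needs_name_override(sql_path):
    """Return True when the file stem would not pass the assumed identifier rule."""
    return _IDENTIFIER.fullmatch(Path(sql_path).stem) is None


for candidate in ("custom_star_user_repo.sql", "my-projection.sql", "2024_run.sql"):
    print(candidate, needs_name_override(candidate))
```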

See docs/custom-projections.md.


Logging

Logs go to stderr by default. Use env vars to control verbosity and persistence:

Variable         Purpose                                             Default
HINGE_LOG_LEVEL  DEBUG / INFO / WARNING / ERROR                      INFO
HINGE_LOG_FILE   Also write logs to this file at full DEBUG detail   (none)

# See all pipeline milestones (default)
uv run hinge ingest events.jsonl --reader numfocus

# See every batch, dbt SQL, store open/close
HINGE_LOG_LEVEL=DEBUG uv run hinge ingest events.jsonl --reader numfocus

# Persist a full debug log to disk (useful for long ingest runs)
HINGE_LOG_FILE=hinge.log uv run hinge ingest events.jsonl --reader numfocus
tail -f hinge.log
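
The two variables map naturally onto two handlers in Python's standard logging module. The sketch below shows one way such a setup could look; it is not hinge's actual logging code, and the logger name hinge_sketch and the function configure_logging are hypothetical.

```python
import logging
import os
import sys

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(env=None):
    """Build a logger: stderr at HINGE_LOG_LEVEL, plus an optional DEBUG file log."""
    env = os.environ if env is None else env
    level_name = (env.get("HINGE_LOG_LEVEL") or "INFO").upper()
    level = LEVELS.get(level_name, logging.INFO)  # unknown names fall back to INFO

    logger = logging.getLogger("hinge_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)  # let each handler apply its own threshold
    logger.handlers.clear()
    logger.propagate = False

    stderr_handler = logging.StreamHandler(sys.stderr)
    stderr_handler.setLevel(level)
    logger.addHandler(stderr_handler)

    log_file = env.get("HINGE_LOG_FILE")
    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(logging.DEBUG)  # the file always gets full detail
        logger.addHandler(file_handler)
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.info("pipeline milestone")
    log.debug("per-batch detail")
```

Keeping the logger itself at DEBUG and filtering per handler is what lets the file capture full detail while stderr stays at the chosen level.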

Common commands

# Ingest NumFocus Actions JSONL via the DuckDB -> HIN contract adapter
uv run hinge ingest events.jsonl --reader numfocus

# Inspect stored datasets
uv run hinge list datasets
uv run hinge list projections
uv run hinge list readers
uv run hinge list exporters

# Export
uv run hinge export --dataset <id> --projection user-user-repo-collaboration --format gml -o out.gml

# Dev / CI
uv run pytest                                # tests
uv run ruff check .                          # lint
uv run ruff format .                         # format
uv run mypy hinge/kernel hinge/frontends     # strict type-check
uv run lint-imports                          # enforce kernel/stages/frontends boundary

Makefile (development)

A Makefile is provided for quick local iteration. It uses hardcoded defaults (fixture file, numfocus reader, user-user-repo-collaboration projection, GML format) so you don't have to remember arguments during development — not intended for production use.

make install     # uv sync --all-extras
make run         # ingest fixture + export graph in one shot
make ingest      # ingest tests/fixtures/events_10.jsonl --reader numfocus
make export      # export the latest dataset (auto-detects ID from the store)
make test        # pytest
make lint        # ruff check
make fmt         # ruff format
make typecheck   # mypy on kernel + frontends
make clean       # delete output/, dbt artefacts, caches
make reset       # clean + delete network.duckdb (full fresh start)

Any default can be overridden on the command line:

make run   READER=numfocus LOG_LEVEL=DEBUG
make export FORMAT=graphml OUTPUT=output/graph.graphml
