
embedding-disruptiveness

Measure how disruptive a paper or patent is — using graph embeddings on citation networks.

embedding-disruptiveness learns node2vec-style embeddings from citation graphs and computes an Embedding Disruptiveness Measure (EDM) that captures whether a work disrupts or consolidates its field. It also provides the classic Disruption Index (DI) as a built-in utility.

Python 3.8+ | MIT License | PyPI

Paper | Blog Post | PyPI


Features

  • End-to-end pipeline — load a citation network, train embeddings, compute disruptiveness in a few lines of code
  • Node2Vec with directional skip-gram — biased random walks with (p, q) parameters for flexible neighborhood exploration (see the sketch after this list)
  • Model parallelism — splits in-vectors and out-vectors across two GPUs for large-scale networks
  • Multiple negative samplers — Configuration Model, Stochastic Block Model, Erdos-Renyi, and conditional context samplers
  • Modularity-aware training — optional modularity regularization for community-sensitive embeddings
  • Mixed-precision training — AMP support for faster training with lower memory usage
  • Built-in disruption index — classic DI₁ / DI₅ / CDI computation alongside embedding-based measures
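
To make the (p, q) walk bias concrete, here is the standard node2vec transition rule as a minimal sketch (it illustrates the published algorithm, not this package's internal walker; has_edge stands in for whatever adjacency lookup you have):

def step_weight(prev, nxt, has_edge, p, q):
    # Unnormalized node2vec weight for stepping from the current node
    # to nxt, given the walk arrived from prev.
    if nxt == prev:            # return to the previous node
        return 1.0 / p
    elif has_edge(prev, nxt):  # stay near prev (BFS-like exploration)
        return 1.0
    else:                      # move outward (DFS-like exploration)
        return 1.0 / q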

Installation

Using pip

pip install embedding-disruptiveness

Using uv

uv pip install embedding-disruptiveness

Install from source

git clone https://github.com/MunjungKim/embedding-disruptiveness.git
cd embedding-disruptiveness
pip install -e .

Or with uv:

git clone https://github.com/MunjungKim/embedding-disruptiveness.git
cd embedding-disruptiveness
uv pip install -e .

Requirements

  • Python >= 3.8
  • PyTorch (with CUDA for GPU training)
  • NumPy, SciPy, scikit-learn, numba, tqdm

Quick Start

import embedding_disruptiveness as edm

# Train embeddings on a citation network (scipy sparse matrix in .npz format)
trainer = edm.EmbeddingTrainer(
    net_input="citation_network.npz",
    dim=128,
    window_size=5,
    device_in="0",       # GPU for in-vectors
    device_out="1",      # GPU for out-vectors
    q_value=1,
    epochs=5,
    batch_size=1024,
    save_dir="./output",
)

trainer.train()
# Embeddings and cosine distances are saved to save_dir

Usage Guide

Input Format

Your citation network should be a scipy sparse matrix saved as .npz. Rows/columns represent nodes (papers/patents), and non-zero entries represent citation edges.

import numpy as np
import scipy.sparse as sp

# Example: a tiny 4-node citation network (row i cites column j)
adjacency_data = np.array([[0, 1, 1, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1],
                           [0, 0, 0, 0]])
net = sp.csr_matrix(adjacency_data)
sp.save_npz("citation_network.npz", net)

You can also convert an edge list (numpy array) to a sparse adjacency matrix using to_adjacency_matrix:

import numpy as np
from embedding_disruptiveness.utils import to_adjacency_matrix

# Edge list of shape (num_edges, 2): [src, dst]
edges = np.array([[0, 1], [1, 2], [2, 3]])
net = to_adjacency_matrix(edges, edgelist=True)

# Weighted edge list of shape (num_edges, 3): [src, dst, weight]
weighted_edges = np.array([[0, 1, 0.5], [1, 2, 1.0], [2, 3, 0.8]])
net = to_adjacency_matrix(weighted_edges, edgelist=True)

Training Embeddings

EmbeddingTrainer handles the full pipeline:

  1. Loads the sparse network
  2. Generates biased random walks (node2vec)
  3. Trains a Word2Vec-style model with triplet loss
  4. Saves in-vectors, out-vectors, and cosine distance matrices

trainer = edm.EmbeddingTrainer(
    net_input="network.npz",
    dim=128,               # Embedding dimension
    window_size=5,         # Context window for skip-gram
    device_in="0",         # CUDA device for in-vectors
    device_out="1",        # CUDA device for out-vectors
    q_value=1,             # Node2Vec in-out parameter (q)
    epochs=5,              # Training epochs
    batch_size=1024,       # Batch size
    save_dir="./results",  # Output directory
    num_walks=10,          # Random walks per node (default: 10)
    walk_length=80,        # Walk length (default: 80)
)

trainer.train()
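
After training, the saved arrays can be loaded for downstream analysis. A minimal sketch, assuming the trainer writes NumPy arrays into save_dir; the file names below are hypothetical, so list the contents of save_dir for the actual names:

import numpy as np

# Hypothetical file names -- check save_dir for the real output names
in_vecs = np.load("./results/in_vectors.npy")    # (num_nodes, dim) in-vectors
out_vecs = np.load("./results/out_vectors.npy")  # (num_nodes, dim) out-vectors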

Computing the Disruption Index

import embedding_disruptiveness as edm

# net: sparse adjacency matrix (citing → cited)
# 1-step disruption index
di = edm.calc_disruption_index(net)

# 2-step (multistep) disruption index
di_2step = edm.calc_multistep_disruption_index(net)
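
For intuition, the classic one-step disruption index combines three citation counts around a focal paper. A minimal pure-Python sketch of that standard definition (illustrative only; the package's optimized matrix/iterative routines are described below):

def disruption_index(n_f, n_b, n_r):
    # n_f: papers citing the focal paper but none of its references
    # n_b: papers citing both the focal paper and at least one reference
    # n_r: papers citing the references but not the focal paper
    return (n_f - n_b) / (n_f + n_b + n_r)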

The method parameter chooses between two computation strategies, or picks one automatically:

  • "matrix" — sparse matrix multiplication. Fast for small networks (< 1M nodes), but uses O(N²) memory.
  • "iterative" — Numba-JIT row-wise loop. Memory-efficient, scales to 100M+ nodes.
  • "auto" (default) — automatically picks "matrix" if N < 1M, otherwise "iterative".
# Force iterative method for a large network
di = edm.calc_disruption_index(large_net, method="iterative")

# Force matrix method with batching for medium networks
di = edm.calc_disruption_index(net, method="matrix", batch_size=2**15)

Choosing a Negative Sampler

Different null models yield different notions of "expected" connections:

from embedding_disruptiveness.utils import (
    ConfigModelNodeSampler,
    SBMNodeSampler,
    ErdosRenyiNodeSampler,
)

# Configuration Model — preserves degree distribution
sampler = ConfigModelNodeSampler(adj_matrix)

# Stochastic Block Model — preserves community structure
sampler = SBMNodeSampler(adj_matrix, group_membership)

# Erdos-Renyi — uniform random baseline
sampler = ErdosRenyiNodeSampler(adj_matrix)
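
For intuition, a configuration-model sampler draws negative examples with probability proportional to node degree, so "expected" connections preserve the degree distribution. A minimal sketch of that idea (illustrative, not the package's implementation):

import numpy as np

def config_model_negatives(adj, n_samples, rng=None):
    # Sample node indices proportional to total degree, matching
    # the configuration-model null.
    rng = rng or np.random.default_rng()
    deg = np.asarray(adj.sum(axis=0)).ravel() + np.asarray(adj.sum(axis=1)).ravel()
    return rng.choice(len(deg), size=n_samples, p=deg / deg.sum())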

Custom Training Loop

For fine-grained control, use the components directly:

from embedding_disruptiveness.models import Word2Vec
from embedding_disruptiveness.loss import Node2VecTripletLoss
from embedding_disruptiveness.datasets import TripletDataset
from embedding_disruptiveness.torch import train

model = Word2Vec(
    vocab_size=num_nodes,
    embedding_size=128,
    padding_idx=num_nodes,
    device_in="cuda:0",
    device_out="cuda:1",
)
dataset = TripletDataset(center, context, negative_sampler, epochs=5)
loss_fn = Node2VecTripletLoss()

train(model=model, dataset=dataset, loss_func=loss_fn, batch_size=1024)

API Reference

  • embedding: EmbeddingTrainer (high-level training orchestrator)
  • models: Word2Vec (dual-device embedding model)
  • loss: Node2VecTripletLoss, ModularityTripletLoss (loss functions)
  • datasets: TripletDataset, ModularityDataset (PyTorch datasets for triplet sampling)
  • torch: train() (training loop with AMP and SparseAdam)
  • utils: RandomWalkSampler, *NodeSampler, calc_disruption_index, calc_multistep_disruption_index (samplers and metrics)

How It Works

  1. Random Walks: Node2Vec-style biased walks explore the citation graph, capturing both local and global structure via (p, q) parameters.

  2. Directional Skip-Gram: A Word2Vec-style model learns separate in-vectors (used when a node is the prediction target) and out-vectors (used when it serves as context) for each node, preserving the directionality of citations.

  3. Embedding Disruptiveness: Cosine distances between a focal paper's in-vector and its references'/citations' out-vectors quantify how much the paper departs from — or reinforces — existing knowledge.
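
A minimal sketch of the cosine computation in step 3, assuming the in-vectors and out-vectors are available as NumPy arrays (this illustrates the idea; how the distances are aggregated into the final measure follows the paper):

import numpy as np

def cosine_distances(focal_in_vec, neighbor_out_vecs):
    # Rows of neighbor_out_vecs are out-vectors of the focal paper's
    # references or citers; returns one cosine distance per row.
    sims = neighbor_out_vecs @ focal_in_vec / (
        np.linalg.norm(neighbor_out_vecs, axis=1) * np.linalg.norm(focal_in_vec)
    )
    return 1.0 - sims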

Citation

If you use this package in your research, please cite:

@article{kim2024embedding,
  title={Embedding Disruptiveness Measure},
  author={Kim, Munjung and others},
  year={2024}
}

License

MIT License. See LICENSE for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.
