A Python package to calculate the disruption index and embedding disruptiveness measure using a citation network.
embedding-disruptiveness
Measure how disruptive a paper or patent is — using graph embeddings on citation networks.
embedding-disruptiveness learns node2vec-style embeddings from citation graphs and computes an Embedding Disruptiveness Measure (EDM) that captures whether a work disrupts or consolidates its field. It also provides the classic Disruption Index (DI) as a built-in utility.
Features
- End-to-end pipeline — load a citation network, train embeddings, compute disruptiveness in a few lines of code
- Node2Vec with directional skip-gram — biased random walks with (p, q) parameters for flexible neighborhood exploration
- Model parallelism — splits in-vectors and out-vectors across two GPUs for large-scale networks
- Multiple negative samplers — Configuration Model, Stochastic Block Model, Erdos-Renyi, and conditional context samplers
- Modularity-aware training — optional modularity regularization for community-sensitive embeddings
- Mixed-precision training — AMP support for faster training with lower memory usage
- Built-in disruption index — classic DI₁ / DI₅ / CDI computation alongside embedding-based measures
Installation
Using pip
pip install embedding-disruptiveness
Using uv
uv pip install embedding-disruptiveness
Install from source
git clone https://github.com/MunjungKim/embedding-disruptiveness.git
cd embedding-disruptiveness
pip install -e .
Or with uv:
git clone https://github.com/MunjungKim/embedding-disruptiveness.git
cd embedding-disruptiveness
uv pip install -e .
Requirements
- Python >= 3.8
- PyTorch (with CUDA for GPU training)
- NumPy, SciPy, scikit-learn, numba, tqdm
Quick Start
import embedding_disruptiveness as edm
# Train embeddings on a citation network (scipy sparse matrix in .npz format)
trainer = edm.EmbeddingTrainer(
    net_input="citation_network.npz",
    dim=128,
    window_size=5,
    device_in="0",   # GPU for in-vectors
    device_out="1",  # GPU for out-vectors
    q_value=1,
    epochs=5,
    batch_size=1024,
    save_dir="./output",
)
trainer.train()
# Embeddings and cosine distances are saved to save_dir
Usage Guide
Input Format
Your citation network should be a scipy sparse matrix saved as .npz. Rows/columns represent nodes (papers/patents), and non-zero entries represent citation edges.
import scipy.sparse as sp
# Example: build a sparse adjacency matrix and save it
net = sp.csr_matrix(adjacency_data)
sp.save_npz("citation_network.npz", net)
You can also convert an edge list (numpy array) to a sparse adjacency matrix using to_adjacency_matrix:
import numpy as np
from embedding_disruptiveness.utils import to_adjacency_matrix
# (N, 2) edge list: [src, dst]
edges = np.array([[0, 1], [1, 2], [2, 3]])
net = to_adjacency_matrix(edges, edgelist=True)
# (N, 3) weighted edge list: [src, dst, weight]
weighted_edges = np.array([[0, 1, 0.5], [1, 2, 1.0], [2, 3, 0.8]])
net = to_adjacency_matrix(weighted_edges, edgelist=True)
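If you prefer to stay in plain scipy, an equivalent directed adjacency matrix can be built directly from the edge list (a minimal sketch; the package's `to_adjacency_matrix` helper may apply additional normalization):

```python
import numpy as np
import scipy.sparse as sp

# Build a directed CSR adjacency matrix from an (N, 2) edge list [src, dst].
# net[i, j] != 0 means node i cites node j.
edges = np.array([[0, 1], [1, 2], [2, 3]])
n_nodes = int(edges.max()) + 1
weights = np.ones(len(edges))
net = sp.csr_matrix(
    (weights, (edges[:, 0], edges[:, 1])),  # (data, (row, col))
    shape=(n_nodes, n_nodes),
)
sp.save_npz("citation_network.npz", net)  # ready for EmbeddingTrainer
```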
Training Embeddings
EmbeddingTrainer handles the full pipeline:
- Loads the sparse network
- Generates biased random walks (node2vec)
- Trains a Word2Vec-style model with triplet loss
- Saves in-vectors, out-vectors, and cosine distance matrices
trainer = edm.EmbeddingTrainer(
    net_input="network.npz",
    dim=128,           # Embedding dimension
    window_size=5,     # Context window for skip-gram
    device_in="0",     # CUDA device for in-vectors
    device_out="1",    # CUDA device for out-vectors
    q_value=1,         # Node2Vec in-out parameter (q)
    epochs=5,          # Training epochs
    batch_size=1024,   # Batch size
    save_dir="./results",  # Output directory
    num_walks=10,      # Random walks per node (default: 10)
    walk_length=80,    # Walk length (default: 80)
)
trainer.train()
Computing the Disruption Index
import embedding_disruptiveness as edm
# net: sparse adjacency matrix (citing → cited)
# 1-step disruption index
di = edm.calc_disruption_index(net)
# 2-step (multistep) disruption index
di_2step = edm.calc_multistep_disruption_index(net)
Two computation methods are available via the method parameter:
- "matrix" — sparse matrix multiplication. Fast for small networks (< 1M nodes), but uses O(N²) memory.
- "iterative" — Numba-JIT row-wise loop. Memory-efficient, scales to 100M+ nodes.
- "auto" (default) — automatically picks "matrix" if N < 1M, otherwise "iterative".
# Force iterative method for a large network
di = edm.calc_disruption_index(large_net, method="iterative")
# Force matrix method with batching for medium networks
di = edm.calc_disruption_index(net, method="matrix", batch_size=2**15)
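For intuition: the classic index for a focal work f counts, among later works citing f or f's references, those citing only f (n_i), those citing both f and its references (n_j), and those citing only the references (n_k), giving DI = (n_i − n_j) / (n_i + n_j + n_k). The toy reimplementation below illustrates this definition only; in practice use `calc_disruption_index`:

```python
import numpy as np
import scipy.sparse as sp

def toy_disruption_index(net, focal):
    """Illustrative DI for one focal node; net[i, j] = 1 means i cites j."""
    refs = set(net[focal].indices)                    # works the focal cites
    citers_focal = set(net[:, focal].nonzero()[0])    # works citing the focal
    citers_refs = set()
    for r in refs:
        citers_refs |= set(net[:, r].nonzero()[0])    # works citing the refs
    citers_refs.discard(focal)                        # the focal doesn't count
    n_i = len(citers_focal - citers_refs)  # cite the focal only
    n_j = len(citers_focal & citers_refs)  # cite the focal and its refs
    n_k = len(citers_refs - citers_focal)  # cite the refs only
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else 0.0

edges = np.array([[2, 0], [2, 1],   # focal node 2 cites 0 and 1
                  [3, 2],           # 3 cites only the focal  -> n_i
                  [4, 2], [4, 0],   # 4 cites focal and a ref -> n_j
                  [5, 0]])          # 5 cites only a ref      -> n_k
net = sp.csr_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])),
                    shape=(6, 6))
print(toy_disruption_index(net, 2))  # (1 - 1) / 3 = 0.0
```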
Choosing a Negative Sampler
Different null models yield different notions of "expected" connections:
from embedding_disruptiveness.utils import (
    ConfigModelNodeSampler,
    SBMNodeSampler,
    ErdosRenyiNodeSampler,
)
# Configuration Model — preserves degree distribution
sampler = ConfigModelNodeSampler(adj_matrix)
# Stochastic Block Model — preserves community structure
sampler = SBMNodeSampler(adj_matrix, group_membership)
# Erdos-Renyi — uniform random baseline
sampler = ErdosRenyiNodeSampler(adj_matrix)
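Conceptually, the configuration-model null draws negative nodes in proportion to their degree, so high-degree hubs appear as negatives more often. A self-contained sketch of that sampling idea (not the package's implementation):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy network: node 0 has degree 3; nodes 1-3 have degree 1 each.
edges = np.array([[1, 0], [2, 0], [3, 0]])
net = sp.csr_matrix((np.ones(3), (edges[:, 0], edges[:, 1])), shape=(4, 4))

# Configuration model: sample negatives with probability proportional
# to total (in + out) degree, preserving the degree distribution.
deg = np.asarray(net.sum(axis=0)).ravel() + np.asarray(net.sum(axis=1)).ravel()
prob = deg / deg.sum()
negatives = rng.choice(len(prob), size=10_000, p=prob)

# Node 0 (degree 3 of 6 total) should be drawn about half the time.
counts = np.bincount(negatives, minlength=4)
print(counts / counts.sum())  # roughly [0.5, 0.167, 0.167, 0.167]
```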
Custom Training Loop
For fine-grained control, use the components directly:
from embedding_disruptiveness.models import Word2Vec
from embedding_disruptiveness.loss import Node2VecTripletLoss
from embedding_disruptiveness.datasets import TripletDataset
from embedding_disruptiveness.torch import train
model = Word2Vec(
    vocab_size=num_nodes,
    embedding_size=128,
    padding_idx=num_nodes,
    device_in="cuda:0",
    device_out="cuda:1",
)
dataset = TripletDataset(center, context, negative_sampler, epochs=5)
loss_fn = Node2VecTripletLoss()
train(model=model, dataset=dataset, loss_func=loss_fn, batch_size=1024)
API Reference
| Module | Key Exports | Description |
|---|---|---|
| embedding | EmbeddingTrainer | High-level training orchestrator |
| models | Word2Vec | Dual-device embedding model |
| loss | Node2VecTripletLoss, ModularityTripletLoss | Loss functions |
| datasets | TripletDataset, ModularityDataset | PyTorch datasets for triplet sampling |
| torch | train() | Training loop with AMP and SparseAdam |
| utils | RandomWalkSampler, *NodeSampler, calc_disruption_index, calc_multistep_disruption_index | Samplers and metrics |
How It Works
1. Random Walks: Node2Vec-style biased walks explore the citation graph, capturing both local and global structure via the (p, q) parameters.
2. Directional Skip-Gram: A Word2Vec model learns separate in-vectors (as targets) and out-vectors (as contexts) for each node, preserving the directionality of citations.
3. Embedding Disruptiveness: Cosine distances between a focal paper's in-vector and its references'/citations' out-vectors quantify how much the paper departs from, or reinforces, existing knowledge.
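Schematically, that comparison reduces to cosine distances between one in-vector and a set of out-vectors. The sketch below shows the geometry with random vectors; the package's exact EDM aggregation may differ:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 128
in_vec = rng.normal(size=dim)        # focal paper's in-vector
ref_out = rng.normal(size=(5, dim))  # its references' out-vectors

def cosine_distance(u, V):
    """1 - cosine similarity between vector u and each row of V."""
    sims = V @ u / (np.linalg.norm(V, axis=1) * np.linalg.norm(u))
    return 1.0 - sims

# A disruptive paper sits far (in cosine distance) from the knowledge
# its references point toward; a consolidating paper sits close to it.
dist_to_refs = cosine_distance(in_vec, ref_out).mean()
print(dist_to_refs)
```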
Citation
If you use this package in your research, please cite:
@article{kim2024embedding,
  title={Embedding Disruptiveness Measure},
  author={Kim, Munjung and others},
  year={2024}
}
License
MIT License. See LICENSE for details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.