Cluster Affiliation Network Embedding: platform-agnostic user networks from shared narrative participation

cane-networks

CANE (Cluster Affiliation Network Embedding) builds user-user networks from social media content without relying on follower graphs, reposts, or any platform-specific metadata. Instead of connecting users through behavioral traces, it connects them through shared participation in latent narrative clusters — modeling what people talk about rather than how they interact.

This makes it useful whenever your data spans multiple platforms or API access is limited, since the same method applies to X, Telegram, TikTok, Truth Social, Reddit, or any combination of them without modification.

The method was introduced in Gerard et al., ICWSM 2025 and applied to cross-platform narrative prediction in Gerard et al., WWW 2025, where discourse-based networks substantially outperformed behavioral baselines across information operation detection, ideological stance prediction, and cross-platform emergence forecasting.


Installation

pip install cane-networks

FAISS is optional but strongly recommended for large datasets. Install whichever variant matches your hardware:

pip install cane-networks faiss-cpu   # CPU
pip install cane-networks faiss-gpu   # GPU

Without FAISS the package falls back to scikit-learn's NearestNeighbors, which works fine at smaller scales.


Usage

The input is a DataFrame with at least two columns: one for user identifiers and one for narrative cluster labels. How you get the cluster labels is up to you — DP-Means over sentence embeddings works well, but any clustering is fine.

Static graph (CANE)

import pandas as pd
from cane import CANE

# df needs: disc_node_id (user), cluster (narrative label)
model = CANE(similarity_threshold=0.2)
G = model.fit(df)
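
To make the idea behind the static graph concrete, here is a minimal pure-Python sketch of the general technique: count each user's posts per cluster, weight the counts by TF-IDF, and link users whose cosine similarity clears the threshold. It is an illustration under assumptions, not the package's implementation. The helper names (tfidf_vectors, cosine, similarity_edges), the smoothed IDF formula, and the toy records are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy (user, cluster) records; hypothetical data for illustration only.
records = [
    ("u1", "c1"), ("u1", "c1"), ("u1", "c2"),
    ("u2", "c1"), ("u2", "c2"),
    ("u3", "c3"), ("u3", "c3"),
]

def tfidf_vectors(records):
    """Return {user: {cluster: tf-idf weight}} from (user, cluster) pairs."""
    counts = defaultdict(Counter)
    for user, cluster in records:
        counts[user][cluster] += 1
    n_users = len(counts)
    # Document frequency: number of users who posted in each cluster.
    doc_freq = Counter(c for user_counts in counts.values() for c in user_counts)
    # Smoothed IDF (an assumed variant; the package may weight differently).
    idf = {c: math.log((1 + n_users) / (1 + n)) + 1 for c, n in doc_freq.items()}
    return {
        user: {c: tf * idf[c] for c, tf in user_counts.items()}
        for user, user_counts in counts.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_edges(records, threshold=0.2):
    """Return {(user_a, user_b): similarity} for pairs at or above threshold."""
    vectors = tfidf_vectors(records)
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges

# u1 and u2 share clusters c1 and c2; u3 shares nothing with either.
edges = similarity_edges(records, threshold=0.2)
```

The all-pairs loop is quadratic in the number of users, which is why a nearest-neighbor index such as FAISS matters at scale.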

Temporal graph (t-CANE)

t-CANE computes similarities at each time bin and aggregates them across time. Repeated co-engagement across bins strengthens edges; lapsed connections decay.

from cane import tCANE

# df additionally needs a time_bin column, e.g. biweekly periods
df["time_bin"] = df["created_at"].dt.to_period("2W").astype(str)

model = tCANE(method="decay", lambda_=0.2)
G = model.fit(df)

Available aggregation methods: decay, sum, average, max, stability.
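
As a rough intuition for how aggregation across bins can behave, the sketch below combines one user pair's per-bin similarities into a single edge weight. The formulas are assumptions chosen for illustration (in particular, decay is modeled as an exponentially discounted running sum with rate lambda_), and the function name aggregate is hypothetical. The package's actual definitions may differ, and stability is left out because its definition is not given here.

```python
import math

def aggregate(similarities, method="decay", lambda_=0.2):
    """Combine one user pair's per-bin similarities (oldest first) into a weight.

    Assumed definitions, for illustration only:
      decay   - exponentially discounted running sum
      sum     - plain total
      average - mean over all bins, including bins with zero similarity
      max     - strongest single bin
    """
    if not similarities:
        return 0.0
    if method == "decay":
        weight = 0.0
        for s in similarities:
            # Shrink what was accumulated so far, then add the current bin.
            weight = math.exp(-lambda_) * weight + s
        return weight
    if method == "sum":
        return float(sum(similarities))
    if method == "average":
        return sum(similarities) / len(similarities)
    if method == "max":
        return float(max(similarities))
    raise ValueError(f"unknown method: {method!r}")

# Two hypothetical pairs with the same total similarity across four bins.
recent = [0.0, 0.0, 0.5, 0.5]   # co-engaged in the latest bins
lapsed = [0.5, 0.5, 0.0, 0.0]   # co-engaged early, then went quiet

recent_weight = aggregate(recent, method="decay")
lapsed_weight = aggregate(lapsed, method="decay")
```

Under these assumed formulas, sum and average cannot tell the two pairs apart, while decay gives the recently active pair the heavier edge.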


Handling large narrative vocabularies

When your corpus has many narrative clusters (say, more than ten thousand), the TF-IDF user vectors become very high-dimensional and sparse. Cosine similarity degrades in this regime because most users share almost no clusters — the network ends up empty or dominated by a handful of high-volume accounts.

The fix is dimensionality reduction via TruncatedSVD before the similarity search. The key question is how many dimensions to use, and suggest_svd_dims answers it in terms of variance retained:

from cane import suggest_svd_dims

recommended_dims, curve = suggest_svd_dims(df, target_variance=0.90)
# → 147 dimensions retain 90.0% of variance

This fits a single SVD on the full matrix and reads off the cumulative explained variance, so you get the entire curve cheaply and can pick whatever retention level makes sense for your use case. Once you have chosen a retention level, pass it via target_variance:

model = CANE(similarity_threshold=0.2, target_variance=0.90)
G = model.fit(df)
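
Reading a dimension count off the cumulative explained variance amounts to finding the first point where the running total of per-component variance ratios reaches the target. A minimal sketch, assuming you already have the ratios (for instance from the explained_variance_ratio_ attribute of a fitted scikit-learn TruncatedSVD); the function name dims_for_variance and the ratios below are hypothetical:

```python
from itertools import accumulate

def dims_for_variance(ratios, target_variance=0.90):
    """Return (k, curve), where k is the smallest number of leading
    components whose ratios sum to at least target_variance and curve
    is the full cumulative curve. k is None if the target is never reached.
    """
    curve = list(accumulate(ratios))
    for k, cumulative in enumerate(curve, start=1):
        if cumulative >= target_variance:
            return k, curve
    return None, curve

# Hypothetical per-component explained-variance ratios, largest first.
ratios = [0.42, 0.23, 0.14, 0.09, 0.06, 0.04, 0.02]

dims_90, curve = dims_for_variance(ratios, target_variance=0.90)
dims_75, _ = dims_for_variance(ratios, target_variance=0.75)
```

Because the curve is computed once, trying a different retention level is just another scan over the same list.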

If you're not sure whether you need dimensionality reduction, check the sparsity diagnostic that suggest_svd_dims prints. Sparsity above ~0.97 is the signal that reduction will help.
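
If you want to estimate sparsity yourself before running anything, one common definition is the fraction of empty cells in the user-by-cluster matrix. The sketch below assumes that definition, which may not match exactly what suggest_svd_dims reports; the function name matrix_sparsity and the synthetic records are hypothetical.

```python
def matrix_sparsity(records):
    """Fraction of empty cells in the user-by-cluster matrix.

    `records` is an iterable of (user, cluster) pairs; repeated pairs
    fill the same cell and are counted once.
    """
    cells = set(records)
    users = {user for user, _ in cells}
    clusters = {cluster for _, cluster in cells}
    total = len(users) * len(clusters)
    return 1.0 - len(cells) / total if total else 0.0

# Hypothetical corpus: 100 users, each active in 2 of 100 clusters.
records = [
    (f"u{i}", f"c{(i + offset) % 100}")
    for i in range(100)
    for offset in (0, 1)
]

sparsity = matrix_sparsity(records)
needs_reduction = sparsity > 0.97  # rule of thumb from the text above
```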


Picking a similarity threshold

A threshold that's too high gives you a sparse, disconnected graph; too low and you're connecting users who don't really share much. suggest_threshold shows you the connectivity rate at several candidate values so you can make an informed choice:

model = CANE(target_variance=0.90)
results = model.suggest_threshold(df)

# Threshold → Connectivity rate
# 0.10  →  94.3% nodes connected
# 0.20  →  81.2% nodes connected
# 0.30  →  63.7% nodes connected
# 0.40  →  41.0% nodes connected
# 0.50  →  22.1% nodes connected

For most downstream tasks (IO detection, stance prediction) something in the 0.15–0.30 range tends to work well. Narrative emergence prediction can tolerate lower thresholds since the signal comes from neighbor activity counts rather than community structure.
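
The trade-off can be reproduced on toy data. The sketch below assumes that the connectivity rate means the share of nodes with at least one edge at or above the threshold, which is consistent with the output above but not confirmed as the package's exact definition. The function name connectivity_rate and the similarity values are hypothetical.

```python
def connectivity_rate(nodes, pair_similarities, threshold):
    """Share of nodes with at least one edge at or above the threshold."""
    connected = set()
    for (a, b), sim in pair_similarities.items():
        if sim >= threshold:
            connected.update((a, b))
    return len(connected) / len(nodes) if nodes else 0.0

# Hypothetical pairwise similarities for five users.
nodes = ["u1", "u2", "u3", "u4", "u5"]
pair_similarities = {
    ("u1", "u2"): 0.62,
    ("u2", "u3"): 0.35,
    ("u3", "u4"): 0.18,
    ("u4", "u5"): 0.07,
}

# Sweep the same candidate thresholds shown above.
rates = {
    t: connectivity_rate(nodes, pair_similarities, t)
    for t in (0.10, 0.20, 0.30, 0.40, 0.50)
}
```

Raising the threshold can only remove edges, so the rate never increases as the threshold goes up.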


Graph diagnostics

from cane import graph_diagnostics

graph_diagnostics(G, name="My corpus")
# My corpus
# ========================================
# Nodes:              45231
# Edges:              312847
# Connected nodes:    41983 (92.8%)
# ...

Citation

If you use this in your work, please cite:

@article{gerard2025bridging,
  title={Bridging the narrative divide: Cross-platform discourse networks in fragmented ecosystems},
  author={Gerard, Patrick and Hanley, Hans WA and Luceri, Luca and Ferrara, Emilio},
  journal={arXiv preprint arXiv:2505.21729},
  year={2025}
}

License

MIT

