Discourse Networks: platform-agnostic user networks from shared narrative participation

discourse-networks

CANE (Cluster Affiliation Network Embedding) builds user-user networks from social media content without relying on follower graphs, reposts, or any platform-specific metadata. Instead of connecting users through behavioral traces, it connects them through shared participation in latent narrative clusters — modeling what people talk about rather than how they interact.

This makes it useful whenever your data spans multiple platforms or API access is limited, since the same method applies to X, Telegram, TikTok, Truth Social, Reddit, or any combination of them without modification.

The method was introduced in Gerard et al., ICWSM 2025, and applied to cross-platform narrative prediction in Gerard et al., WWW 2025, where discourse-based networks substantially outperformed behavioral baselines across information operation detection, ideological stance prediction, and cross-platform emergence forecasting.


Installation

pip install discourse-networks

FAISS is optional but strongly recommended for large datasets. Install whichever variant matches your hardware:

pip install discourse-networks faiss-cpu   # CPU
pip install discourse-networks faiss-gpu   # GPU

Without FAISS the package falls back to scikit-learn's NearestNeighbors, which works fine at smaller scales.


Usage

The input is a DataFrame with at least two columns: one for user identifiers and one for narrative cluster labels. How you get the cluster labels is up to you — DP-Means over sentence embeddings works well, but any clustering is fine.
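
If you do not have cluster labels yet, the following is a minimal sketch of DP-Means, assuming your embeddings are rows of a NumPy array. The function name dp_means, the penalty parameter lam, and the toy 2-D points are illustrative and not part of this package; real sentence embeddings would come from whatever encoder you use.

```python
import numpy as np

def dp_means(points, lam, max_iter=50):
    """Minimal DP-Means: like k-means, but a point whose squared distance
    to every centroid exceeds `lam` starts a new cluster."""
    points = np.asarray(points, dtype=float)
    centroids = [points.mean(axis=0)]  # start with one global cluster
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        previous = labels.copy()
        for i, x in enumerate(points):
            dists = [float(np.sum((x - c) ** 2)) for c in centroids]
            nearest = int(np.argmin(dists))
            if dists[nearest] > lam:
                # Too far from every centroid: open a new cluster here.
                centroids.append(x.copy())
                nearest = len(centroids) - 1
            labels[i] = nearest
        # Recompute centroids, dropping clusters that lost all their points.
        used = sorted(set(labels.tolist()))
        centroids = [points[labels == k].mean(axis=0) for k in used]
        remap = {old: new for new, old in enumerate(used)}
        labels = np.array([remap[k] for k in labels.tolist()])
        if np.array_equal(labels, previous):
            break
    return labels

# Toy "embeddings": two well-separated groups of three points each.
toy_points = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
toy_labels = dp_means(toy_points, lam=1.0)
```

The returned labels can be stored in the cluster column, one row per post. Larger values of lam yield fewer, broader clusters.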

Static graph (CANE)

import pandas as pd
from discourse_networks import CANE

# df needs: disc_node_id (user), cluster (narrative label)
model = CANE(similarity_threshold=0.2)
G = model.fit(df)
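
To make the idea concrete, here is a rough, standard-library sketch of how users can be linked through shared clusters: count each user's posts per cluster, weight the counts with TF-IDF, and keep pairs whose cosine similarity clears a threshold. The exact weighting and search used inside CANE may differ; the helper names and toy records below are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy (user, cluster) records; names are illustrative only.
records = [
    ("alice", "c1"), ("alice", "c1"), ("alice", "c2"),
    ("bob", "c1"), ("bob", "c2"),
    ("carol", "c3"), ("carol", "c3"),
    ("dave", "c2"), ("dave", "c3"),
]

def tfidf_vectors(records):
    """Return {user: {cluster: tf-idf weight}} from (user, cluster) pairs."""
    counts = defaultdict(Counter)
    for user, cluster in records:
        counts[user][cluster] += 1
    n_users = len(counts)
    # Document frequency: number of users who posted in each cluster.
    doc_freq = Counter(c for user_counts in counts.values() for c in user_counts)
    vectors = {}
    for user, user_counts in counts.items():
        total = sum(user_counts.values())
        vectors[user] = {
            c: (n / total) * (math.log((1 + n_users) / (1 + doc_freq[c])) + 1.0)
            for c, n in user_counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_edges(records, similarity_threshold=0.2):
    """Return {(user_a, user_b): similarity} for pairs at or above the threshold."""
    vectors = tfidf_vectors(records)
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= similarity_threshold:
            edges[(a, b)] = sim
    return edges

edges = build_edges(records, similarity_threshold=0.2)
```

In this toy data alice and bob share two clusters and end up connected, while alice and carol share none and do not. The all-pairs loop is quadratic in the number of users, which is why a nearest-neighbor index such as FAISS matters at scale.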

Temporal graph (t-CANE)

t-CANE computes similarities at each time bin and aggregates them across time. Repeated co-engagement across bins strengthens edges; lapsed connections decay.

from discourse_networks import tCANE

# df additionally needs a time_bin column, e.g. biweekly periods
df["time_bin"] = df["created_at"].dt.to_period("2W").astype(str)

model = tCANE(method="decay", lambda_=0.2)
G = model.fit(df)
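
If you want bins that are disjoint and aligned to a fixed start date, one option is to compute them from day offsets. This is a sketch using only the standard library; fixed_width_bin and the origin argument are hypothetical names, not part of the package.

```python
from datetime import datetime, timedelta

def fixed_width_bin(timestamp, origin, width_days=14):
    """Label for the fixed-width bin containing `timestamp`.

    Bins are [origin + k*width, origin + (k+1)*width); the label is the
    ISO date on which the bin starts.
    """
    index = (timestamp - origin).days // width_days
    start = origin + timedelta(days=index * width_days)
    return start.date().isoformat()

origin = datetime(2025, 1, 6)
first = fixed_width_bin(datetime(2025, 1, 6, 9, 30), origin)
same_bin = fixed_width_bin(datetime(2025, 1, 19, 23, 59), origin)
next_bin = fixed_width_bin(datetime(2025, 1, 20), origin)
```

With a pandas column this could be applied as df["created_at"].map(lambda ts: fixed_width_bin(ts, origin)), assuming the column holds timezone-naive timestamps.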

Available aggregation methods: decay, sum, average, max, stability.
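
As an illustration of the decay idea, the sketch below weights each bin's similarity by an exponential factor based on how many bins old it is, then sums. This is one common form of decay aggregation and is an assumption here; the formula the package applies for method="decay" may differ, and aggregate_decay is a hypothetical helper.

```python
import math

def aggregate_decay(sims_by_bin, lambda_=0.2):
    """Combine per-bin edge similarities with exponential decay.

    sims_by_bin: list of {(user_a, user_b): similarity}, oldest bin first.
    A bin that is `age` steps older than the latest one is weighted by
    exp(-lambda_ * age), so recent co-engagement counts most.
    """
    last = len(sims_by_bin) - 1
    weights = {}
    for t, sims in enumerate(sims_by_bin):
        decay = math.exp(-lambda_ * (last - t))
        for edge, sim in sims.items():
            weights[edge] = weights.get(edge, 0.0) + decay * sim
    return weights

# ("a", "b") co-engage in every bin; ("a", "c") only in the oldest one.
sims_by_bin = [
    {("a", "b"): 0.5, ("a", "c"): 0.5},
    {("a", "b"): 0.5},
    {("a", "b"): 0.5},
]
weights = aggregate_decay(sims_by_bin, lambda_=0.2)
```

The repeated pair accumulates weight across bins, while the lapsed pair keeps only its decayed contribution from the first bin.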


Handling large narrative vocabularies

When your corpus has many narrative clusters (say, more than ten thousand), the TF-IDF user vectors become very high-dimensional and sparse. Cosine similarity degrades in this regime because most users share almost no clusters — the network ends up empty or dominated by a handful of high-volume accounts.

The fix is dimensionality reduction via TruncatedSVD before the similarity search. The key question is how many dimensions to use, and suggest_svd_dims answers it in terms of variance retained:

from discourse_networks import suggest_svd_dims

recommended_dims, curve = suggest_svd_dims(df, target_variance=0.90)
# → 147 dimensions retain 90.0% of variance

This fits a single SVD on the full matrix and reads off the cumulative explained variance, so you get the entire curve cheaply and can pick whatever retention level makes sense for your use case. Once you have settled on a retention level, pass it via target_variance:

model = CANE(similarity_threshold=0.2, target_variance=0.90)
G = model.fit(df)

If you're not sure whether you need it, check the sparsity diagnostic that suggest_svd_dims prints. Sparsity above ~0.97 is the signal that reduction will help.
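
For intuition, both quantities can be approximated by hand with NumPy. The sketch below computes sparsity as the fraction of zero entries and uses the cumulative share of squared singular values as a stand-in for explained variance. This is an assumption made for simplicity: scikit-learn's TruncatedSVD reports explained variance somewhat differently, and the helper names are hypothetical rather than the package's own.

```python
import numpy as np

def sparsity(matrix):
    """Fraction of entries in the user-by-cluster matrix that are zero."""
    return 1.0 - np.count_nonzero(matrix) / matrix.size

def dims_for_variance(matrix, target_variance=0.90):
    """Smallest number of SVD components whose cumulative share of the
    squared singular values reaches target_variance, plus the full curve."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values ** 2
    curve = np.cumsum(energy) / energy.sum()
    # First index where the curve reaches the target, converted to a count.
    n_dims = int(np.searchsorted(curve, target_variance) + 1)
    return min(n_dims, len(curve)), curve

rng = np.random.default_rng(0)
# Toy matrix: 200 users, 50 clusters, about 5% of entries non-zero.
toy = rng.random((200, 50)) * (rng.random((200, 50)) < 0.05)
toy_sparsity = sparsity(toy)
n_dims, curve = dims_for_variance(toy, target_variance=0.90)
```

A dense SVD like this is only practical for small matrices; for a real corpus a truncated, sparse-aware solver is the better fit.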


Picking a similarity threshold

suggest_threshold shows you the connectivity rate at several candidate values so you can make an informed choice:

model = CANE(target_variance=0.90)
results = model.suggest_threshold(df)

# Threshold → Connectivity rate
# 0.10  →  94.3% nodes connected
# 0.20  →  81.2% nodes connected
# 0.30  →  63.7% nodes connected
# 0.40  →  41.0% nodes connected
# 0.50  →  22.1% nodes connected

For most downstream tasks (IO detection, stance prediction) something in the 0.15–0.30 range tends to work well. Narrative emergence prediction can tolerate lower thresholds since the signal comes from neighbor activity counts rather than community structure.
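
The trade-off can be reproduced on any weighted edge list. The sketch below assumes that "connectivity rate" means the share of nodes with at least one edge at or above the threshold, which matches the wording of the output above but is not confirmed by the package; connectivity_rates is a hypothetical helper.

```python
def connectivity_rates(n_nodes, weighted_edges, thresholds):
    """Share of nodes with at least one edge at or above each threshold.

    weighted_edges: iterable of (node_a, node_b, similarity).
    """
    edges = list(weighted_edges)
    rates = {}
    for threshold in thresholds:
        connected = set()
        for a, b, sim in edges:
            if sim >= threshold:
                connected.update((a, b))
        rates[threshold] = len(connected) / n_nodes
    return rates

toy_edges = [("u1", "u2", 0.45), ("u2", "u3", 0.25), ("u4", "u5", 0.12)]
rates = connectivity_rates(5, toy_edges, thresholds=[0.1, 0.2, 0.3, 0.4, 0.5])
for threshold, rate in rates.items():
    print(f"{threshold:.2f} -> {rate:.1%} nodes connected")
```

Raising the threshold can only remove edges, so the rate never increases as the threshold goes up.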


Graph diagnostics

from discourse_networks import graph_diagnostics

graph_diagnostics(G, name="My corpus")
# My corpus
# ========================================
# Nodes:              45231
# Edges:              312847
# Connected nodes:    41983 (92.8%)
# ...

Citation

If you use this in your work, please cite:

@article{gerard2025bridging,
  title={Bridging the narrative divide: Cross-platform discourse networks in fragmented ecosystems},
  author={Gerard, Patrick and Hanley, Hans WA and Luceri, Luca and Ferrara, Emilio},
  journal={arXiv preprint arXiv:2505.21729},
  year={2025}
}

License

MIT
