Discourse Networks: platform-agnostic user networks from shared narrative participation
discourse-networks
This package builds user-user networks from social media content without relying on follower graphs, reposts, or any platform-specific metadata. Users are connected through shared participation in latent narrative clusters — the underlying idea being that people who consistently engage with the same narratives are structurally aligned, regardless of whether they ever interact directly or even use the same platform.
The method (CANE / t-CANE) was introduced in Gerard et al., ICWSM 2025 and applied to cross-platform narrative prediction in Gerard et al., WWW 2025, where discourse-based networks substantially outperformed behavioral baselines across information operation detection, ideological stance prediction, and cross-platform emergence forecasting.
Installation
pip install discourse-networks
FAISS is optional but strongly recommended for anything beyond a few thousand users. Install whichever variant matches your hardware:
pip install discourse-networks faiss-cpu # CPU
pip install discourse-networks faiss-gpu # GPU
Without FAISS the package falls back to scikit-learn's NearestNeighbors, which is exact but slow at scale.
The pipeline
Before you can build a discourse network, you need narrative cluster labels for your posts. The package handles everything from that point forward — the clustering step is yours. Here's what the full pipeline looks like end to end:
Step 1 — Cluster your posts into narratives.
Embed your posts (we use MPNet or Qwen3-Embedding) and cluster them using DP-Means or any other algorithm. Each post gets a cluster label. That's all this package needs. One row per post, one column for the user, one column for the cluster label.
import pandas as pd
df = pd.read_parquet("your_clustered_posts.parquet")
print(df[["user_id", "narrative_cluster"]].head())
# user_id narrative_cluster
# user_001 14
# user_002 3
# user_001 3
# user_003 97
Your columns don't need to be named disc_node_id and cluster — you tell the model what they're called when you call .fit().
Step 2 — Build the network.
from discourse_networks import CANE
model = CANE(similarity_threshold=0.2)
G = model.fit(
    df,
    user_col="user_id",               # whatever your user column is called
    cluster_col="narrative_cluster",  # whatever your cluster column is called
)
# G is a networkx.Graph with weighted edges
Step 3 — Use the network downstream.
The resulting networkx.Graph has weighted edges representing narrative alignment between users. From here you can run community detection (Louvain, Leiden), embed with node2vec, train a GCN, or analyze bridge users — whatever your downstream task requires.
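As one illustration of the downstream step, the sketch below runs Louvain community detection with networkx on a small hand-built graph that stands in for the output of model.fit(). The toy edges and user names are made up, and the edge attribute name "weight" is an assumption about how the package stores edge weights; check your own graph before relying on it.

```python
import networkx as nx

# Toy stand-in for the graph returned by model.fit(): two tight groups
# joined by one weak edge, plus one user with no edges at all.
G = nx.Graph()
G.add_weighted_edges_from([
    ("u1", "u2", 0.90), ("u1", "u3", 0.80), ("u2", "u3", 0.85),
    ("u4", "u5", 0.90), ("u4", "u6", 0.80), ("u5", "u6", 0.85),
    ("u3", "u4", 0.05),  # weak bridge between the two groups
])
G.add_node("u7")  # isolated user

# Users with degree zero carry no structural signal, so set them aside first.
isolated = sorted(nx.isolates(G))
core = G.subgraph(n for n in G if G.degree(n) > 0)

# Louvain uses the edge weights, so strongly aligned users end up together.
communities = nx.community.louvain_communities(core, weight="weight", seed=0)

print("isolated:", isolated)
for members in communities:
    print(sorted(members))
```

Fixing the seed keeps the partition reproducible between runs, which helps when comparing graphs built at different thresholds.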
CANE — static graph
CANE models each user as a TF-IDF-weighted vector over their narrative cluster participations, then connects users by cosine similarity using approximate nearest neighbor search (FAISS HNSW). The TF-IDF weighting means users who engage with rare, distinctive narratives contribute more to similarity than users who only engage with large, ubiquitous ones.
from discourse_networks import CANE
model = CANE(
    similarity_threshold=0.2,  # minimum cosine similarity to keep an edge
    n_neighbors=None,          # how many candidates FAISS retrieves per user; auto-set if None
    target_variance=None,      # reduce dimensions via SVD before similarity search (see below)
    min_cluster_size=2,        # drop clusters smaller than this before building vectors
    verbose=True,
)
G = model.fit(
    df,
    user_col="user_id",               # name of your user identifier column
    cluster_col="narrative_cluster",  # name of your cluster label column
)
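To make the idea behind CANE concrete, here is a minimal pure-Python sketch of TF-IDF user vectors compared by cosine similarity. It is not the package's implementation: it compares every pair of users directly instead of using FAISS, the IDF formula is the smoothed variant that scikit-learn uses by default (an assumption about what the package does), and the toy posts are invented.

```python
import math
from collections import Counter, defaultdict

# Toy input: one (user_id, narrative_cluster) pair per post.
posts = [
    ("u1", 14), ("u1", 3), ("u1", 14),
    ("u2", 14), ("u2", 3),
    ("u3", 97), ("u3", 3),
    ("u4", 97), ("u4", 97),
]

# Term frequency: how often each user posts in each cluster.
tf = defaultdict(Counter)
for user, cluster in posts:
    tf[user][cluster] += 1

# Inverse document frequency: clusters used by fewer users get larger weights.
n_users = len(tf)
users_per_cluster = Counter(c for counts in tf.values() for c in counts)
idf = {c: math.log((1 + n_users) / (1 + n)) + 1 for c, n in users_per_cluster.items()}

# Sparse TF-IDF vector per user, stored as {cluster: weight}.
vectors = {u: {c: k * idf[c] for c, k in counts.items()} for u, counts in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Keep an edge only when similarity reaches the threshold.
similarity_threshold = 0.2
users = sorted(vectors)
edges = {}
for i, u in enumerate(users):
    for v in users[i + 1:]:
        score = cosine(vectors[u], vectors[v])
        if score >= similarity_threshold:
            edges[(u, v)] = score

for pair, score in edges.items():
    print(pair, round(score, 3))
```

Users u1 and u4 share no cluster, so their similarity is zero and no edge is created. Comparing all pairs is quadratic in the number of users, which is the cost the nearest neighbor search is there to avoid.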
Parameters
similarity_threshold: the minimum cosine similarity two users need before an edge is created between them. For non-negative TF-IDF vectors, cosine similarity ranges from 0 (no shared narrative engagement) to 1 (identical engagement patterns). It measures the angle between two narrative profiles rather than a percentage of shared posts, so a threshold of 0.2 keeps only pairs whose profiles have a cosine of at least 0.2. Higher values give you a sparser graph with higher-confidence edges; lower values give you denser coverage but include weaker relationships. See Picking a threshold for how to choose. Default 0.2.
n_neighbors — how many candidate neighbors FAISS retrieves per user before applying the similarity threshold. FAISS doesn't compute all pairwise similarities — it retrieves the top n_neighbors most similar users for each user, and then the threshold filters those down further. If None, this is set automatically as max(10, min(300, n_users // 100)), which scales reasonably with corpus size. You'd increase it if you're getting a sparser graph than expected and want to give the threshold more candidates to work with, or decrease it to reduce memory use on very large corpora. Default None.
target_variance: if your corpus has many narrative clusters (a few hundred or more), the TF-IDF vectors become high-dimensional and sparse, and cosine similarity degrades. Setting target_variance=0.90 tells the model to first compress the user-narrative matrix down to however many dimensions are needed to retain 90% of its variance, then run FAISS on those dense vectors. See Handling large narrative vocabularies for the full explanation and a diagnostic tool. Default None (no reduction).
min_cluster_size — any narrative cluster with fewer than this many posts is dropped before building the user vectors. Very small clusters (one or two posts) add a dimension to every user vector without contributing meaningful signal, and they inflate sparsity. Default 2.
user_col / cluster_col — the column names in your DataFrame for user identifiers and cluster labels respectively. Passed to .fit() rather than the constructor, so you can reuse the same model on DataFrames with different schemas. Defaults "disc_node_id" and "cluster" if not specified.
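The automatic n_neighbors rule quoted above is easy to evaluate by hand. The helper name auto_n_neighbors below is hypothetical and only reproduces the stated formula, so you can see what value a given corpus size would get.

```python
def auto_n_neighbors(n_users: int) -> int:
    """Default candidate count per user: max(10, min(300, n_users // 100))."""
    return max(10, min(300, n_users // 100))

for n_users in (500, 12_000, 45_231):
    print(n_users, "->", auto_n_neighbors(n_users))
# 500 -> 10
# 12000 -> 120
# 45231 -> 300
```

The floor of 10 protects small corpora from getting almost no candidates, and the cap of 300 applies from 30,000 users upward.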
t-CANE — temporal graph
t-CANE computes user similarities at each time bin independently, then aggregates those similarities across time into a single weighted graph. The intuition is that users who repeatedly co-engage with the same narratives across multiple time windows are more deeply aligned than users who happen to overlap in a single window. Rather than treating the corpus as a single snapshot, t-CANE lets the structure of relationships build up — and decay — over time.
from discourse_networks import tCANE
# Add a time bin column to your DataFrame before fitting.
# Any periodic grouping works — the only requirement is that it sorts correctly.
df["time_bin"] = df["created_at"].dt.to_period("2W").astype(str)
# e.g. "2024-01", "2024-W03", "2024-Q1" all work fine
model = tCANE(
    method="decay",            # aggregation strategy across time bins (see below)
    lambda_=0.2,               # decay rate; higher = older bins matter less
    k=10,                      # nearest neighbors per user per time bin
    similarity_threshold=0.0,  # minimum aggregated score to keep an edge
    target_variance=None,      # SVD variance retention, same as CANE
    min_cluster_size=2,
    verbose=True,
)
G = model.fit(
    df,
    user_col="user_id",
    cluster_col="narrative_cluster",
    time_col="time_bin",  # name of your time bin column
)
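If you want explicit, non-overlapping two-week windows whose boundaries you control, one option is to compute the bin label yourself. The sketch below uses only the standard library; the helper name biweekly_bin and the anchor date are illustrative choices, not part of the package. ISO date strings sort chronologically, which is the only property the time bin column needs.

```python
from datetime import date, datetime, timedelta

ANCHOR = date(1970, 1, 5)  # a Monday; any fixed anchor date works

def biweekly_bin(ts):
    """Return the ISO start date of the 14-day window containing ts."""
    day = ts.date() if isinstance(ts, datetime) else ts
    offset = (day - ANCHOR).days
    start = ANCHOR + timedelta(days=offset - offset % 14)
    return start.isoformat()

labels = [biweekly_bin(d) for d in (date(2024, 1, 1), date(2024, 1, 7), date(2024, 1, 8))]
print(labels)
# ['2023-12-25', '2023-12-25', '2024-01-08']
```

With a datetime column in pandas this could be applied as df["time_bin"] = df["created_at"].map(biweekly_bin).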
Parameters
time_col — the column containing time bin labels. Passed to .fit(). These can be any string or value that sorts correctly — biweekly periods, monthly strings, week numbers, whatever granularity makes sense for your data. Finer bins (e.g. weekly) capture faster dynamics but give each bin less data to work with; coarser bins (monthly, biweekly) are more stable but less sensitive to rapid shifts. Default "time_bin".
method — how similarities computed at each time bin are combined into a single edge weight. The five options capture meaningfully different notions of user alignment:
- "decay": each bin's similarity is multiplied by an exponentially decreasing weight based on how far back it is, then summed. The most recent bin gets weight 1.0; the bin before it gets exp(-lambda_); the one before that exp(-2 * lambda_), and so on. This means the network reflects current alignment more than historical alignment, while still letting persistent relationships accumulate weight over time. Good default for most use cases.
- "sum": similarities are simply summed across all bins without any time weighting. At equal per-bin similarity, a pair of users who co-appear in 10 bins gets a score 10x that of a pair who co-appear in 1 bin, regardless of when. Use this when you want the network to reward sustained co-engagement equally over the full period.
- "average": the mean similarity across all bins in which both users appear. Unlike "sum", this doesn't reward users just for being active in many bins; it measures the average quality of their alignment when they do co-appear. Less sensitive to differences in activity volume.
- "max": keeps only the single highest similarity observed across any bin. This captures peak alignment rather than sustained alignment, which can be useful if your data has a key event window you want the network to reflect (e.g. a breaking news cycle where alignment spikes and then disperses).
- "stability": the fraction of time bins in which the pair co-appears, regardless of similarity magnitude. A score of 0.8 means the two users were neighbors in 80% of all bins. This is a purely temporal measure of relational persistence, useful when you care more about whether a relationship is durable than how strong it is at any given moment.

Default "decay".
lambda_ — decay rate for method="decay". Controls how quickly older bins lose influence. At lambda_=0.2, a bin from 5 steps ago retains exp(-1.0) ≈ 37% of its weight; at lambda_=0.5, it retains only exp(-2.5) ≈ 8%. Higher values make the network more reactive to recent activity; lower values make it more of a long-run average. Only used when method="decay". Default 0.2.
k — nearest neighbors retrieved per user per time bin. This is applied within each individual time slice rather than across the full corpus, so lower values are appropriate — there's simply less data per bin. Default 10.
similarity_threshold — minimum aggregated score to include an edge in the final graph. Because scores accumulate across bins (particularly for "sum" and "decay"), the scale of "meaningful" is higher than in static CANE. It's often worth leaving this at 0.0 and inspecting the score distribution with graph_diagnostics before deciding where to cut. Default 0.0.
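The five aggregation strategies can be compared on a single pair of users with a few lines of Python. This is a sketch of the definitions as described above, not the package's code: the function name aggregate is hypothetical, and representing "the pair was not linked in this bin" as None is an assumption made for the example.

```python
import math

def aggregate(sims, method="decay", lambda_=0.2):
    """Combine per-bin similarities for one user pair into one edge weight.

    sims is ordered oldest bin first; None marks a bin where the pair
    was not linked.
    """
    n_bins = len(sims)
    present = [s for s in sims if s is not None]
    if not present:
        return 0.0
    if method == "decay":
        # The newest bin has age 0 and weight 1.0; each step back multiplies
        # the weight by exp(-lambda_).
        return sum(
            s * math.exp(-lambda_ * (n_bins - 1 - i))
            for i, s in enumerate(sims)
            if s is not None
        )
    if method == "sum":
        return sum(present)
    if method == "average":
        return sum(present) / len(present)
    if method == "max":
        return max(present)
    if method == "stability":
        return len(present) / n_bins
    raise ValueError(f"unknown method: {method}")

pair_sims = [0.5, None, 0.4, 0.6, None]  # five bins, linked in three of them
scores = {m: aggregate(pair_sims, m) for m in ("decay", "sum", "average", "max", "stability")}
for method, score in scores.items():
    print(f"{method:<10} {score:.3f}")
```

Because "decay" and "sum" add up across bins, their scores can exceed 1, which is why the scale of the threshold differs from static CANE.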
Handling large narrative vocabularies
When your corpus has many narrative clusters — a few hundred or more — the TF-IDF user vectors become high-dimensional and sparse. A solution is to reduce the dimensionality of the user-narrative matrix before running the similarity search. suggest_svd_dims tells you how many dimensions you need to retain a given fraction of the variance in one pass:
from discourse_networks import suggest_svd_dims
recommended_dims, curve = suggest_svd_dims(
    df,
    target_variance=0.90,  # how much variance to retain, between 0 and 1
    min_cluster_size=2,
    user_col="user_id",
    cluster_col="narrative_cluster",
)
# Matrix: 45231 users x 1847 clusters
# Sparsity: 0.981
# ⚠ High sparsity — SVD reduction is strongly recommended
#
# 10 dims → 31.4%
# 25 dims → 52.1%
# 50 dims → 68.3%
# 100 dims → 81.2%
# 147 dims → 90.0% ◄ recommended
# 200 dims → 94.7%
The "sparsity" line is the key signal: above ~0.97 you almost certainly want reduction. The curve shows exactly what you're trading off — 147 dimensions captures 90% of the variance in a matrix that originally had 1847 columns. Once you've picked a target, pass it via target_variance:
model = CANE(similarity_threshold=0.2, target_variance=0.90)
G = model.fit(df, user_col="user_id", cluster_col="narrative_cluster")
The SVD is fit internally — you don't need to precompute anything.
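For intuition about what a variance target means, the following NumPy sketch picks the smallest number of SVD components whose squared singular values reach a target share of the total. Treating that share as "variance retained" is a common simplification and an assumption here; suggest_svd_dims may compute explained variance differently, and the function name dims_for_variance is hypothetical.

```python
import numpy as np

def dims_for_variance(matrix, target_variance=0.90):
    """Smallest number of SVD components reaching the target energy share."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # Index of the first component at which the cumulative share meets the target.
    n_dims = int(np.searchsorted(cumulative, target_variance)) + 1
    return min(n_dims, len(cumulative)), cumulative

# A matrix with known singular values 3, 2, 1 (squared: 9, 4, 1).
toy = np.diag([3.0, 2.0, 1.0])
n_dims, curve = dims_for_variance(toy, target_variance=0.90)
print(n_dims, np.round(curve, 3))
```

On this toy matrix the first component holds 9/14 of the total and the first two hold 13/14, so two dimensions are enough for a 0.90 target.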
Picking a threshold
suggest_threshold runs the FAISS search and shows you the connectivity rate (fraction of nodes with at least one edge) at several candidate thresholds, so you can see the sparsity tradeoff before committing:
model = CANE(target_variance=0.90)
results = model.suggest_threshold(df, user_col="user_id", cluster_col="narrative_cluster")
# Threshold → Connectivity rate
# 0.10 → 94.3% nodes connected
# 0.20 → 81.2% nodes connected
# 0.30 → 63.7% nodes connected
# 0.40 → 41.0% nodes connected
# 0.50 → 22.1% nodes connected
For tasks that rely on community structure (IO detection, stance prediction) you generally want high connectivity — something in the 0.10–0.25 range. For tasks where edge quality matters more than coverage (bridge user analysis, narrative emergence prediction) a higher threshold around 0.25–0.40 is usually better. Neither is a hard rule; it depends on your corpus size and narrative vocabulary.
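The connectivity rate itself is simple to reproduce once you have candidate edges with similarity scores. The sketch below is not how suggest_threshold is implemented; the function name connectivity_rate and the toy candidates are made up to show the calculation.

```python
def connectivity_rate(candidate_edges, n_nodes, threshold):
    """Fraction of nodes with at least one edge at or above the threshold."""
    connected = set()
    for u, v, similarity in candidate_edges:
        if similarity >= threshold:
            connected.update((u, v))
    return len(connected) / n_nodes

# Toy candidates: (user, user, cosine similarity). A sixth user, u6,
# has no candidate neighbors at all.
candidates = [
    ("u1", "u2", 0.55),
    ("u2", "u3", 0.32),
    ("u3", "u4", 0.21),
    ("u4", "u5", 0.12),
]
n_nodes = 6

rates = {t: connectivity_rate(candidates, n_nodes, t) for t in (0.1, 0.2, 0.3, 0.4, 0.5)}
for threshold, rate in rates.items():
    print(f"{threshold:.2f} -> {rate:.1%} nodes connected")
```

The rate can only stay flat or fall as the threshold rises, so the table is best read as a curve: look for the point where connectivity starts dropping faster than you can afford.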
Graph diagnostics
After building a network, graph_diagnostics gives you a quick summary of its structure:
from discourse_networks import graph_diagnostics
graph_diagnostics(G, name="My corpus")
# My corpus
# ========================================
# Nodes: 45231
# Edges: 312847
# Connected nodes: 41983 (92.8%)
# Isolated nodes: 3248
#
# Degree distribution:
# median 9.0 mean 14.9 p95 48.0 max 312.0
#
# Edge weight distribution:
# median 0.2341 mean 0.2819 p10 0.2012 p90 0.3847
Isolated nodes (degree zero) are users who didn't get connected to anyone above your threshold, because their narrative participation was too sparse or too generic, or because n_neighbors wasn't high enough to surface their nearest matches. A high isolated node count is a signal to lower your threshold, increase n_neighbors, or reduce dimensionality first.
Citation
If you use this in your work, please cite:
@article{gerard2025bridging,
title={Bridging the narrative divide: Cross-platform discourse networks in fragmented ecosystems},
author={Gerard, Patrick and Hanley, Hans WA and Luceri, Luca and Ferrara, Emilio},
journal={arXiv preprint arXiv:2505.21729},
year={2025}
}
License
MIT