Python interface to CommonCrawl's 93M domain webgraph for network analysis and domain discovery
pyccwebgraph: Python Interface to CommonCrawl Webgraph
Discover related domains using link topology from CommonCrawl's webgraph.
Installation
Prerequisites:
- Python 3.8+
- Java 17+ (install instructions)
- ~30GB disk space for webgraph data
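The prerequisites above can be checked before installing. The following is a minimal sketch using only the standard library; the helper names `parse_java_major` and `check_prerequisites` are hypothetical and not part of pyccwebgraph, and the 30 GB threshold simply mirrors the approximate figure listed above.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", ...
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_prerequisites(data_dir=".", min_free_gb=30):
    """Return a dict mapping each prerequisite to True or False."""
    java_major = None
    if shutil.which("java") is not None:
        # `java -version` usually prints to stderr, so read both streams.
        proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
        java_major = parse_java_major(proc.stderr + proc.stdout)
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "java>=17": java_major is not None and java_major >= 17,
        "disk_space": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Pass the directory you intend to use as `webgraph_dir` for `data_dir` so the free-space check runs against the right disk.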
pip install pyccwebgraph
First use downloads graph data:
from pyccwebgraph import CCWebgraph, get_available_versions
# List available versions
versions = get_available_versions()
print(versions[:3]) # ['cc-main-2024-nov-dec-jan', 'cc-main-2024-feb-apr-may', ...]
webgraph = CCWebgraph.setup(
    webgraph_dir="/data/my-webgraph",
    version="cc-main-2024-feb-apr-may"
)
# Find domains that link TO seeds (backlinks)
results = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com", "nytimes.com"],
    min_connections=3  # Must link to all seeds
)
print(f"Found {len(results['nodes'])} domains")
print(f"Top result: {results['nodes'][0]}")
# Example shape (values are illustrative):
# {'domain': 'news-aggregator.com', 'connections': 15, 'percentage': 50.0}
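To make the `min_connections` threshold concrete, here is a toy, pure-Python sketch of the idea on a tiny in-memory link map. It is not the library's implementation. The function `discover_backlinks_toy`, the `.example` domains, and the exact meaning given to `connections` (number of distinct seeds linked to) and `percentage` (share of seeds linked to) are assumptions made for illustration; the real fields may be defined differently.

```python
def discover_backlinks_toy(outlinks, seeds, min_connections=1):
    """Rank non-seed domains by how many distinct seeds they link to."""
    seed_set = set(seeds)
    if not seed_set:
        raise ValueError("at least one seed is required")
    nodes = []
    for domain, targets in outlinks.items():
        if domain in seed_set:
            continue  # seeds themselves are not reported as discoveries
        connections = len(seed_set & set(targets))
        if connections >= min_connections:
            nodes.append({
                "domain": domain,
                "connections": connections,
                "percentage": 100.0 * connections / len(seed_set),
            })
    # Most-connected first; ties broken alphabetically for stable output.
    nodes.sort(key=lambda n: (-n["connections"], n["domain"]))
    return {"nodes": nodes}


# Hypothetical toy data: domain -> set of domains it links to.
toy_outlinks = {
    "aggregator.example": {"cnn.com", "bbc.com", "nytimes.com"},
    "blog.example": {"cnn.com", "bbc.com"},
    "shop.example": {"other.example"},
    "cnn.com": {"bbc.com"},
}
seeds = ["cnn.com", "bbc.com", "nytimes.com"]

strict = discover_backlinks_toy(toy_outlinks, seeds, min_connections=3)
loose = discover_backlinks_toy(toy_outlinks, seeds, min_connections=2)
print([n["domain"] for n in strict["nodes"]])  # ['aggregator.example']
print([n["domain"] for n in loose["nodes"]])   # ['aggregator.example', 'blog.example']
```

Lowering `min_connections` widens the result set: with three seeds, a threshold of 3 keeps only domains linking to every seed, while 2 also admits domains linking to any two of them.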
Working with NetworkX
# Get results as NetworkX graph
G = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com"],
    min_connections=2,
    format='networkx'  # Returns nx.DiGraph
)
# Run standard NetworkX algorithms
import networkx as nx
# Centrality analysis
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)
# Community detection
from cdlib import algorithms
# Louvain expects an undirected graph, so convert the DiGraph first
communities = algorithms.louvain(G.to_undirected())
# Visualization
from pyvis.network import Network
net = Network(notebook=True)
net.from_nx(G)
net.show("network.html")
Performance: Large Graphs with NetworKit
For large discovered subgraphs (>100K nodes), use NetworKit instead of NetworkX:
# Discover a large subgraph (seed_list is a list of domain strings defined elsewhere)
G_nk, name_map = webgraph.discover_backlinks(
    seeds=seed_list,
    min_connections=2,
    format='networkit'  # Returns NetworKit graph
)
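NetworKit algorithms generally return per-node scores indexed by integer node ID, so results need to be translated back to domain names. The sketch below assumes `name_map` is a dict from node ID to domain name; that direction is an assumption, so check the returned object before relying on it. The helper `top_domains` is hypothetical, and the scores are stand-in values rather than real NetworKit output.

```python
def top_domains(scores, name_map, k=10):
    """Return the k highest-scoring (domain, score) pairs.

    scores   -- sequence of floats indexed by node ID
    name_map -- dict mapping node ID -> domain name (assumed direction)
    """
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [
        (name_map.get(node_id, f"<unknown:{node_id}>"), score)
        for node_id, score in ranked[:k]
    ]


# Stand-in data for illustration only.
name_map = {0: "cnn.com", 1: "bbc.com", 2: "aggregator.example"}
scores = [0.45, 0.35, 0.20]

print(top_domains(scores, name_map, k=2))
# [('cnn.com', 0.45), ('bbc.com', 0.35)]
```

With a real graph, the `scores` list could come from a centrality run such as `networkit.centrality.PageRank(G_nk).run().scores()`.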
CC-Webgraph mapping
# Check if domain exists in graph
vid = webgraph.domain_to_id("example.com")
if vid is not None:
    print(f"Found at vertex ID {vid}")
# Get all domains this domain links to
outlinks = webgraph.get_successors("cnn.com")
print(f"CNN links to {len(outlinks)} domains")
# Get all domains linking to this domain
backlinks = webgraph.get_predecessors("cnn.com")
print(f"{len(backlinks)} domains link to CNN")
# Validate seeds before discovery
found, missing = webgraph.validate_seeds(["cnn.com", "fake-site.xyz"])
print(f"Found: {found}")
print(f"Missing: {missing}")
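Seed validation combines naturally with discovery: seeds absent from the graph can never be linked to, so a `min_connections` value equal to the original seed count would return nothing. The sketch below drops missing seeds and caps the threshold. `prepare_seeds` and `StubGraph` are hypothetical names; the stub only mimics the `(found, missing)` return shape shown above so the example runs without the graph data.

```python
def prepare_seeds(graph, seeds, min_connections):
    """Drop seeds missing from the graph and cap min_connections to match."""
    found, missing = graph.validate_seeds(seeds)
    if not found:
        raise ValueError(f"None of the seeds are in the graph: {missing}")
    return list(found), min(min_connections, len(found)), list(missing)


class StubGraph:
    """Hypothetical stand-in for a CCWebgraph instance (demonstration only)."""

    def __init__(self, known):
        self.known = set(known)

    def validate_seeds(self, seeds):
        found = [s for s in seeds if s in self.known]
        missing = [s for s in seeds if s not in self.known]
        return found, missing


graph = StubGraph(["cnn.com", "bbc.com"])
usable, threshold, skipped = prepare_seeds(
    graph, ["cnn.com", "bbc.com", "fake-site.xyz"], min_connections=3
)
print(usable, threshold, skipped)
# ['cnn.com', 'bbc.com'] 2 ['fake-site.xyz']
```

The returned `usable` and `threshold` values can then be passed as `seeds` and `min_connections` to `discover_backlinks`.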
Links
- Interactive demo: https://github.com/PeterCarragher/NetNeighbors
- PyPI: https://pypi.org/project/pyccwebgraph/
- Documentation: https://pyccwebgraph.readthedocs.io/
- Research Papers for webgraph-based discovery:
- CommonCrawl Webgraphs: https://commoncrawl.org/web-graphs
- cc-webgraph: https://github.com/commoncrawl/cc-webgraph
File details
Details for the file pyccwebgraph-0.3.1.tar.gz.
File metadata
- Download URL: pyccwebgraph-0.3.1.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6a576c436de8c41d30e60763f10a5b5133fc9dbf0cf8f6dcd1cba2ab7be20e5 |
| MD5 | a96d6d7289b735f55bc3258e713d44c0 |
| BLAKE2b-256 | b554690f20af0208fcebd54abbdbe8ee8ce33c71dca83ea9b0f87227f5d47841 |
File details
Details for the file pyccwebgraph-0.3.1-py3-none-any.whl.
File metadata
- Download URL: pyccwebgraph-0.3.1-py3-none-any.whl
- Upload date:
- Size: 16.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7507f4f80a59f4a02c84f29651a4fee31c8968c786357c57f8392a1fd9d6dec1 |
| MD5 | d1c14b920a9a15f0e8a253e8cbb5916f |
| BLAKE2b-256 | fc5361a6b61b5b2a5b5982d0f6246032d4ad2b95e74a8489e9e182fed7f309d1 |