Python interface to CommonCrawl's 93M-domain webgraph for network analysis and domain discovery
pyccwebgraph: Python Interface to CommonCrawl Webgraph
Discover related domains using link topology from CommonCrawl's webgraph.
Installation
Prerequisites:
- Python 3.8+
- Java 17+
- ~30GB disk space for webgraph data
pip install pyccwebgraph
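The prerequisites above can be checked before installing. The following is a minimal sketch, not part of pyccwebgraph: the helper names `parse_java_major` and `check_prerequisites` are hypothetical, and the 30 GB threshold simply mirrors the approximate figure in the list.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_281").
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def check_prerequisites(data_dir=".", min_free_gb=30):
    """Return a dict mapping each prerequisite name to True/False."""
    java_ok = False
    java_path = shutil.which("java")
    if java_path:
        try:
            # `java -version` writes its banner to stderr on most JVMs.
            proc = subprocess.run(
                [java_path, "-version"], capture_output=True, text=True
            )
            major = parse_java_major(proc.stderr + proc.stdout)
            java_ok = major is not None and major >= 17
        except OSError:
            java_ok = False
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    return {
        "python": sys.version_info >= (3, 8),
        "java": java_ok,
        "disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Pass the directory you intend to use as `webgraph_dir` as `data_dir` so the free-space check runs against the right volume.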
First use downloads graph data:
from pyccwebgraph import CCWebgraph, get_available_versions
# List available versions
versions = get_available_versions()
print(versions[:3]) # ['cc-main-2024-nov-dec-jan', 'cc-main-2024-feb-apr-may', ...]
webgraph = CCWebgraph.setup(
    webgraph_dir="/data/my-webgraph",
    version="cc-main-2024-feb-apr-may"
)
# Find domains that link TO seeds (backlinks)
results = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com", "nytimes.com"],
    min_connections=3  # Must link to all seeds
)
print(f"Found {len(results['nodes'])} domains")
print(f"Top result: {results['nodes'][0]}")
# {'domain': 'news-aggregator.com', 'connections': 15, 'percentage': 50.0}
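To make the role of `min_connections` concrete, here is a toy, pure-Python sketch of backlink discovery on a small edge list. It is not the library's implementation: the function name `discover_backlinks_toy` is hypothetical, and two behaviours are assumptions made for illustration, namely that seeds are excluded from the results and that `percentage` is the share of seeds a domain links to.

```python
from collections import defaultdict


def discover_backlinks_toy(edges, seeds, min_connections=1):
    """Rank non-seed domains by how many distinct seeds they link to."""
    seed_set = set(seeds)
    linked_seeds = defaultdict(set)
    for source, target in edges:
        # Only count links that point at a seed; skip seed-to-seed links (assumption).
        if target in seed_set and source not in seed_set:
            linked_seeds[source].add(target)

    nodes = [
        {
            "domain": domain,
            "connections": len(targets),
            # Assumed meaning of "percentage": share of seeds linked to.
            "percentage": round(100.0 * len(targets) / len(seed_set), 1),
        }
        for domain, targets in linked_seeds.items()
        if len(targets) >= min_connections
    ]
    # Most-connected first; break ties alphabetically for stable output.
    nodes.sort(key=lambda n: (-n["connections"], n["domain"]))
    return {"nodes": nodes}


# Made-up (source, target) pairs meaning "source links to target".
edges = [
    ("agg.example", "cnn.com"),
    ("agg.example", "bbc.com"),
    ("agg.example", "nytimes.com"),
    ("blog.example", "cnn.com"),
    ("blog.example", "bbc.com"),
    ("shop.example", "cnn.com"),
    ("cnn.com", "bbc.com"),
]
toy = discover_backlinks_toy(
    edges, ["cnn.com", "bbc.com", "nytimes.com"], min_connections=2
)
for node in toy["nodes"]:
    print(node)
```

With `min_connections=2`, `shop.example` is dropped because it links to only one seed; raising the threshold to 3 would keep only `agg.example`.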
Working with NetworkX
# Get results as NetworkX graph
G = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com"],
    min_connections=2,
    format='networkx'  # Returns nx.DiGraph
)
# Run standard NetworkX algorithms
import networkx as nx
# Centrality analysis
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)
# Community detection
from cdlib import algorithms
communities = algorithms.louvain(G.to_undirected())  # Louvain expects an undirected graph
# Visualization
from pyvis.network import Network
net = Network(notebook=True)
net.from_nx(G)
net.show("network.html")
Performance: Large Graphs with NetworKit
For large discovered subgraphs (>100K nodes), use NetworKit instead of NetworkX:
# Discover large subgraph (seed_list is your own list of seed domains)
G_nk, name_map = webgraph.discover_backlinks(
    seeds=seed_list,
    min_connections=2,
    format='networkit'  # Returns NetworKit graph
)
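NetworKit identifies nodes by integer ID, so scores usually need to be mapped back to domain names. The sketch below assumes `name_map` is a dict from integer node ID to domain string; the source does not state its shape, so check the actual return value first. The helper `top_domains` is hypothetical, and the scores are made-up stand-ins for a per-node list such as the one returned by a NetworKit centrality's `scores()` method.

```python
def top_domains(scores, name_map, k=5):
    """Return the k highest-scoring (domain, score) pairs.

    scores:   list of floats indexed by integer node ID
    name_map: dict of node ID -> domain name (assumed shape)
    """
    ranked = sorted(
        range(len(scores)), key=lambda node_id: scores[node_id], reverse=True
    )
    return [(name_map[node_id], scores[node_id]) for node_id in ranked[:k]]


# Stand-in data for illustration only.
example_scores = [0.10, 0.45, 0.05, 0.40]
example_name_map = {
    0: "blog.example",
    1: "agg.example",
    2: "shop.example",
    3: "news.example",
}

for domain, score in top_domains(example_scores, example_name_map, k=2):
    print(f"{domain}: {score:.2f}")
```

If `name_map` turns out to map the other way (domain to ID), invert it once with a dict comprehension before calling the helper.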
CC-Webgraph mapping
# Check if domain exists in graph
vid = webgraph.domain_to_id("example.com")
if vid is not None:
    print(f"Found at vertex ID {vid}")
# Get all domains this domain links to
outlinks = webgraph.get_successors("cnn.com")
print(f"CNN links to {len(outlinks)} domains")
# Get all domains linking to this domain
backlinks = webgraph.get_predecessors("cnn.com")
print(f"{len(backlinks)} domains link to CNN")
# Validate seeds before discovery
found, missing = webgraph.validate_seeds(["cnn.com", "fake-site.xyz"])
print(f"Found: {found}")
print(f"Missing: {missing}")
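Seed lists collected by hand often contain stray whitespace, mixed case, or duplicates. The following sketch cleans a list and splits it into found and missing domains using any lookup that returns `None` for unknown names, as `domain_to_id` does above. The helper `partition_seeds` is hypothetical, the dict stands in for a real webgraph, and whether `validate_seeds` performs similar normalization internally is not stated in the source.

```python
def partition_seeds(seeds, lookup):
    """Split seeds into (found, missing) after light normalization.

    lookup: callable returning a vertex ID, or None if the domain is unknown.
    """
    found, missing = [], []
    seen = set()
    for raw in seeds:
        domain = raw.strip().lower()
        if not domain or domain in seen:
            continue  # skip blanks and duplicates
        seen.add(domain)
        if lookup(domain) is not None:
            found.append(domain)
        else:
            missing.append(domain)
    return found, missing


# Stand-in for a real lookup such as webgraph.domain_to_id.
fake_index = {"cnn.com": 101, "bbc.com": 202}

found, missing = partition_seeds(
    [" CNN.com", "bbc.com", "cnn.com", "fake-site.xyz", ""],
    fake_index.get,
)
print(f"Found: {found}")
print(f"Missing: {missing}")
```

With a real graph, passing `webgraph.domain_to_id` as `lookup` should work as long as it keeps returning `None` for unknown domains.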
Links
- Interactive demo: https://github.com/PeterCarragher/NetNeighbors
- PyPI: https://pypi.org/project/pyccwebgraph/
- Documentation: https://pyccwebgraph.readthedocs.io/
- Research Papers for webgraph-based discovery:
- CommonCrawl Webgraphs: https://commoncrawl.org/web-graphs
- cc-webgraph: https://github.com/commoncrawl/cc-webgraph
Download files
Source Distribution
pyccwebgraph-0.3.1.tar.gz (17.9 kB)
Built Distribution
pyccwebgraph-0.3.1-py3-none-any.whl (16.0 kB)
File details
Details for the file pyccwebgraph-0.3.1.tar.gz.
File metadata
- Download URL: pyccwebgraph-0.3.1.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6a576c436de8c41d30e60763f10a5b5133fc9dbf0cf8f6dcd1cba2ab7be20e5 |
| MD5 | a96d6d7289b735f55bc3258e713d44c0 |
| BLAKE2b-256 | b554690f20af0208fcebd54abbdbe8ee8ce33c71dca83ea9b0f87227f5d47841 |
File details
Details for the file pyccwebgraph-0.3.1-py3-none-any.whl.
File metadata
- Download URL: pyccwebgraph-0.3.1-py3-none-any.whl
- Upload date:
- Size: 16.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7507f4f80a59f4a02c84f29651a4fee31c8968c786357c57f8392a1fd9d6dec1 |
| MD5 | d1c14b920a9a15f0e8a253e8cbb5916f |
| BLAKE2b-256 | fc5361a6b61b5b2a5b5982d0f6246032d4ad2b95e74a8489e9e182fed7f309d1 |