
Python interface to CommonCrawl's 93M-domain webgraph for network analysis and domain discovery


pyccwebgraph: Python Interface to CommonCrawl Webgraph

Python 3.8+ | License: MIT

Discover related domains using link topology from CommonCrawl's webgraph.

Installation

Prerequisites:

pip install pyccwebgraph

First use downloads graph data:

from pyccwebgraph import CCWebgraph, get_available_versions

# List available versions
versions = get_available_versions()
print(versions[:3])  # ['cc-main-2024-nov-dec-jan', 'cc-main-2024-feb-apr-may', ...]

webgraph = CCWebgraph.setup(
    webgraph_dir="/data/my-webgraph",
    version="cc-main-2024-feb-apr-may"
)

# Find domains that link TO seeds (backlinks)
results = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com", "nytimes.com"],
    min_connections=3  # Must link to all seeds
)

print(f"Found {len(results['nodes'])} domains")
print(f"Top result: {results['nodes'][0]}")
# {'domain': 'news-aggregator.com', 'connections': 15, 'percentage': 50.0}
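To make the `min_connections` threshold concrete, here is a minimal pure-Python sketch of the idea on a toy edge list. It is not pyccwebgraph's implementation: the edge data and the `backlink_candidates` helper are hypothetical, and excluding the seeds themselves from the results is an assumption made for this illustration.

```python
from collections import defaultdict


def backlink_candidates(edges, seeds, min_connections):
    """Map each non-seed domain to the number of distinct seeds it links to,
    keeping only domains that reach at least `min_connections` seeds."""
    seed_set = set(seeds)
    linked_seeds = defaultdict(set)
    for source, target in edges:
        # Assumption for this sketch: seeds are not reported as candidates.
        if target in seed_set and source not in seed_set:
            linked_seeds[source].add(target)
    return {
        domain: len(targets)
        for domain, targets in linked_seeds.items()
        if len(targets) >= min_connections
    }


# Hypothetical toy data: (source domain, target domain) pairs.
toy_edges = [
    ("aggregator.example", "cnn.com"),
    ("aggregator.example", "bbc.com"),
    ("aggregator.example", "nytimes.com"),
    ("blog.example", "cnn.com"),
    ("blog.example", "bbc.com"),
    ("cnn.com", "bbc.com"),
]
seeds = ["cnn.com", "bbc.com", "nytimes.com"]

all_three = backlink_candidates(toy_edges, seeds, min_connections=3)
at_least_two = backlink_candidates(toy_edges, seeds, min_connections=2)
print(all_three)
print(at_least_two)
```

With three seeds, `min_connections=3` keeps only domains that link to every seed, while lowering it to 2 also admits domains that link to any two of them.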

Working with NetworkX

# Get results as NetworkX graph
G = webgraph.discover_backlinks(
    seeds=["cnn.com", "bbc.com"],
    min_connections=2,
    format='networkx'  # Returns nx.DiGraph
)

# Run standard NetworkX algorithms
import networkx as nx

# Centrality analysis
pr = nx.pagerank(G)
bc = nx.betweenness_centrality(G)

# Community detection (Louvain expects an undirected graph)
from cdlib import algorithms
communities = algorithms.louvain(G.to_undirected())

# Visualization
from pyvis.network import Network
net = Network(notebook=True)
net.from_nx(G)
net.show("network.html")
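`nx.pagerank` and `nx.betweenness_centrality` return plain dicts that map each node to a score, so ranking the results needs only the standard library. The sketch below uses a hypothetical `top_n` helper and made-up scores in place of a real `pr` dict, and assumes the graph's nodes are keyed by domain name.

```python
import heapq


def top_n(scores, n=3):
    """Return the n highest-scoring (node, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical scores standing in for `pr = nx.pagerank(G)`.
sample_scores = {
    "cnn.com": 0.31,
    "bbc.com": 0.27,
    "aggregator.example": 0.12,
    "blog.example": 0.08,
}

for domain, score in top_n(sample_scores, n=2):
    print(f"{domain}: {score:.2f}")
```

`heapq.nlargest` avoids sorting the whole dict, which helps when only a handful of top domains are needed from a large graph.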

Performance: Large Graphs with NetworKit

For large discovered subgraphs (>100K nodes), use NetworKit instead of NetworkX:

# Discover large subgraph (seed_list is your own list of seed domains)
G_nk, name_map = webgraph.discover_backlinks(
    seeds=seed_list,
    min_connections=2,
    format='networkit'  # Returns a NetworKit graph plus a name map
)

CC-Webgraph mapping

# Check if domain exists in graph
vid = webgraph.domain_to_id("example.com")
if vid is not None:
    print(f"Found at vertex ID {vid}")

# Get all domains this domain links to
outlinks = webgraph.get_successors("cnn.com")
print(f"CNN links to {len(outlinks)} domains")

# Get all domains linking to this domain
backlinks = webgraph.get_predecessors("cnn.com")
print(f"{len(backlinks)} domains link to CNN")

# Validate seeds before discovery
found, missing = webgraph.validate_seeds(["cnn.com", "fake-site.xyz"])
print(f"Found: {found}")
print(f"Missing: {missing}")
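Seeds copied from URLs or spreadsheets often carry a scheme, a `www.` prefix, or stray capitalization, which may cause them to be reported as missing. The sketch below cleans such input before validation. The `normalize_seed` helper is hypothetical and not part of pyccwebgraph, and it assumes the graph stores bare lowercase domains like the `cnn.com` examples above.

```python
from urllib.parse import urlsplit


def normalize_seed(raw):
    """Reduce a URL or hostname to a bare lowercase host without 'www.'."""
    text = raw.strip().lower()
    # urlsplit only fills in .hostname when a '//' separator is present.
    if "//" not in text:
        text = "//" + text
    host = urlsplit(text).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


raw_seeds = ["https://www.CNN.com/world", "bbc.com", " NYTimes.com "]
clean_seeds = [normalize_seed(s) for s in raw_seeds]
print(clean_seeds)
```

The cleaned list can then be passed to `validate_seeds`. This sketch does not collapse other subdomains to a registered domain; that would need a public-suffix list.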



