Common Crawl import support for Meshagent datasets

Project description

Meshagent Common Crawl

Import Common Crawl captures into a Meshagent room dataset.

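from meshagent.commoncrawl import import_domain_from_commoncrawl


# import_docs is an illustrative wrapper name; the await must run inside async code.
async def import_docs(room):
    # room is assumed to be an already-connected Meshagent room.
    result = await import_domain_from_commoncrawl(
        room,
        index="CC-MAIN-2025-08",  # Common Crawl crawl ID to read from
        domain="example.com",  # domain whose captures are imported
        table="pages",  # destination table in the room dataset
        url_filter=r"https?://(www\.)?example\.com/docs/.*",  # regex restricting which URLs are imported
    )
    return result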
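
To see which URLs the url_filter pattern above would keep, it can help to run the regular expression on its own before starting an import. The sketch below uses only Python's re module. The candidate URLs are made up for illustration, and anchoring the pattern at the start of the URL (re.match) is an assumption, since the matching mode used by the importer is not stated here.

```python
import re

# The pattern from the example call above.
URL_FILTER = re.compile(r"https?://(www\.)?example\.com/docs/.*")

# Hypothetical capture URLs, used only to exercise the pattern.
candidates = [
    "https://example.com/docs/intro",
    "http://www.example.com/docs/api/v1",
    "https://example.com/blog/post",
    "https://docs.example.com/docs/intro",
]

# Assumption: the filter is anchored at the start of the URL (re.match).
matching = [url for url in candidates if URL_FILTER.match(url)]

for url in matching:
    print(url)
```

Only the first two URLs pass: the third is outside /docs/, and the fourth is on a subdomain other than www.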

To test it through meshagent room connect:

meshagent room connect --room=my-room --identity=commoncrawl -- \
  python meshagent-sdk/meshagent-commoncrawl/examples/crawl.py \
  http://www.meshagent.com --table=sample --namespace=crawls --limit=10

The sample command writes progress to stderr while it imports. TTY output uses a single updating line; redirected output uses plain log lines. Pass --silent to suppress progress output.
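
The TTY versus redirected behavior described above is a common pattern and can be sketched in a few lines. This is not the sample command's actual code: write_progress is a hypothetical helper, and the use of a carriage return plus the ANSI erase-line sequence is one assumed way to get a single updating line.

```python
import io
import sys
from typing import TextIO


def write_progress(stream: TextIO, message: str) -> None:
    """Write one progress update, adapting to whether stream is a terminal."""
    isatty = getattr(stream, "isatty", None)
    if callable(isatty) and isatty():
        # Terminal: return to column 0, erase the line, and redraw in place.
        stream.write("\r\x1b[2K" + message)
    else:
        # Redirected output (file, pipe): emit one plain log line per update.
        stream.write(message + "\n")
    stream.flush()


if __name__ == "__main__":
    for count in (10, 20, 30):
        write_progress(sys.stderr, f"imported {count} records")
    # End the in-place line so the shell prompt starts on a fresh line.
    if sys.stderr.isatty():
        sys.stderr.write("\n")
```

Running the script with stderr redirected (for example 2> progress.log) yields three separate lines, while running it in a terminal leaves only the last update visible.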

By default, records are merged on the url column and written with the columns url, date, content_type, and text. To derive custom columns, pass an async extract= callback; it receives the WARC record and the decoded content bytes, and returning None from it skips the record.

To observe import progress from library code, pass an async progress= callback.
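
As a rough illustration of an extract= callback, the sketch below pulls the HTML title out of the content bytes and skips records that have none. The argument order (record, content), the name extract_title, and the shape of the returned dict (including the title and size_bytes columns) are all assumptions made for this example; the package's documented signature may differ.

```python
import asyncio
import re
from typing import Any, Optional

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


async def extract_title(record: Any, content: bytes) -> Optional[dict]:
    """Hypothetical extract= callback: returns custom columns or None to skip."""
    # Assumed signature: (WARC record, decoded content bytes).
    # The record is not inspected here, so no attribute names are assumed.
    text = content.decode("utf-8", errors="replace")
    match = TITLE_RE.search(text)
    if match is None:
        # No <title> element: returning None skips the record.
        return None
    return {"title": match.group(1).strip(), "size_bytes": len(content)}


if __name__ == "__main__":
    html = b"<html><head><title> Docs home </title></head></html>"
    print(asyncio.run(extract_title(None, html)))
    print(asyncio.run(extract_title(None, b"plain text, no markup")))
```

Under these assumptions the callback would be passed as extract=extract_title in the import_domain_from_commoncrawl call.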

Download files

Source Distribution

meshagent_commoncrawl-0.39.4.tar.gz (13.1 kB)

Uploaded Source

Built Distribution

meshagent_commoncrawl-0.39.4-py3-none-any.whl (11.2 kB)

Uploaded Python 3

File details

Details for the file meshagent_commoncrawl-0.39.4.tar.gz.

File metadata

  • Download URL: meshagent_commoncrawl-0.39.4.tar.gz
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for meshagent_commoncrawl-0.39.4.tar.gz
Algorithm    Hash digest
SHA256       500d20491fa5f3d51fb64742b1fa84c9b228fe06535be4d9575e5bde80b41e08
MD5          2973e1afec8673829984b26da2154981
BLAKE2b-256  9e3355e1150c708423cdf6aa5215489686265caef09498d606133e22edcf047c

File details

Details for the file meshagent_commoncrawl-0.39.4-py3-none-any.whl.

File hashes

Hashes for meshagent_commoncrawl-0.39.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       e7126225ef359343d3504a72f49b05b7e2c29b6282f38d49571c6e97155c1c7f
MD5          3ffaf54a67fccefbde63ca45d198975c
BLAKE2b-256  16d385b9c027d85722f5ffb145f1c03951452e38ae5aab0326acd9142e742d65
