Common Crawl import support for Meshagent datasets

Project description

Meshagent Common Crawl

Import Common Crawl captures into a Meshagent room dataset.

from meshagent.commoncrawl import import_domain_from_commoncrawl

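# Call this from inside an async function. `room` is assumed to be an
# already-connected Meshagent room client.
result = await import_domain_from_commoncrawl(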
    room,
    index="CC-MAIN-2025-08",
    domain="example.com",
    table="pages",
    url_filter=r"https?://(www\.)?example\.com/docs/.*",
)
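To check which URLs a url_filter pattern would keep before running an import, the pattern can be tried against a few sample URLs with Python's re module. This is only a sketch: the sample URLs are made up, and full-match semantics are an assumption, since the text above does not say whether the importer applies the pattern with match, search, or fullmatch.

```python
import re

# Same pattern as the url_filter argument in the call above.
URL_FILTER = re.compile(r"https?://(www\.)?example\.com/docs/.*")

# Made-up sample URLs, for illustration only.
candidates = [
    "https://example.com/docs/intro",
    "http://www.example.com/docs/api/v1",
    "https://example.com/blog/post",
    "https://docs.example.com/docs/intro",
]

# Assumption: the filter must match the whole URL. The importer itself
# may use re.match or re.search instead.
kept = [url for url in candidates if URL_FILTER.fullmatch(url)]

for url in kept:
    print(url)
```

Only the first two sample URLs pass: the blog URL is outside /docs/, and the docs.example.com host is not covered by the optional www. prefix.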

To test it through meshagent room connect:

meshagent room connect --room=my-room --identity=commoncrawl -- \
  python meshagent-sdk/meshagent-commoncrawl/examples/crawl.py \
  http://www.meshagent.com --table=sample --namespace=crawls --limit=10

The sample command writes progress to stderr while it imports. TTY output uses a single updating line; redirected output uses plain log lines. Pass --silent to suppress progress output.
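The TTY versus redirected behavior described above is a common pattern, and one way to get it is to check isatty() on the output stream. The sketch below is not the sample command's actual implementation; report_progress and its message format are hypothetical names chosen for illustration.

```python
import io
import sys


def report_progress(done: int, total: int, stream=sys.stderr) -> None:
    """Write one progress update (hypothetical helper, not part of the package)."""
    message = f"imported {done}/{total} records"
    if stream.isatty():
        # On a terminal, a carriage return moves the cursor back to the
        # start of the line so the next update overwrites this one.
        stream.write("\r" + message)
        if done == total:
            stream.write("\n")  # finish the updating line
    else:
        # When output is redirected, emit one plain log line per update.
        stream.write(message + "\n")
    stream.flush()


# io.StringIO is not a TTY, so it takes the plain log line branch.
buffer = io.StringIO()
report_progress(1, 2, stream=buffer)
report_progress(2, 2, stream=buffer)
print(buffer.getvalue(), end="")
```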

By default, records are merged on the url column, and each row has the columns url, date, content_type, and text. Pass an async extract= callback to derive custom columns from the WARC record and the decoded content bytes. Return None from the callback to skip the record. Pass an async progress= callback to observe import progress from library code.
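As a rough illustration of the extract= callback described above, the sketch below derives a title column from HTML content and returns None to skip pages without a title. The exact callback signature is not given in this description, so the parameter order (record, content), the record.url attribute, and returning a dict of column values are all assumptions. A stand-in object replaces the real WARC record so the example runs without the package.

```python
import asyncio
import re
from types import SimpleNamespace

TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


async def extract_title(record, content: bytes):
    """Hypothetical extract callback: (record, content) -> dict of columns or None."""
    match = TITLE_RE.search(content)
    if match is None:
        # Returning None skips the record.
        return None
    title = match.group(1).decode("utf-8", errors="replace").strip()
    # Assumption: the record exposes its URL as `record.url`, and the url
    # column is included because records are merged on it.
    return {"url": record.url, "title": title}


# Stand-in for a WARC record; the real record type comes from the library.
fake_record = SimpleNamespace(url="https://example.com/docs/intro")

row = asyncio.run(
    extract_title(fake_record, b"<html><title> Intro </title></html>")
)
skipped = asyncio.run(
    extract_title(fake_record, b"<html><body>no title</body></html>")
)
print(row)
print(skipped)
```

Under these assumptions, the callback would be passed as extract=extract_title to import_domain_from_commoncrawl.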

Download files

Source Distribution

meshagent_commoncrawl-0.39.3.tar.gz (13.1 kB)

Uploaded Source

Built Distribution

meshagent_commoncrawl-0.39.3-py3-none-any.whl (11.2 kB)

Uploaded Python 3

File details

Details for the file meshagent_commoncrawl-0.39.3.tar.gz.

File metadata

  • Download URL: meshagent_commoncrawl-0.39.3.tar.gz
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for meshagent_commoncrawl-0.39.3.tar.gz
Algorithm Hash digest
SHA256 27fb94f511eb8bc53427e983d49d536eb4c7ace779d38045bb80530fdc3f4a50
MD5 044fa82c67d501e1f0e4faa1292f98af
BLAKE2b-256 1f2d19bce7d919a1c4c6e6d255680745479ed0881cf1c389dbbdeb2adb5d3b2e

File details

Details for the file meshagent_commoncrawl-0.39.3-py3-none-any.whl.

File hashes

Hashes for meshagent_commoncrawl-0.39.3-py3-none-any.whl
Algorithm Hash digest
SHA256 5f353a939c94f63a6e877aa66e871231f3453c6181534bac68ed88d7151169db
MD5 eed91a8be5fa5eb1ecd51380b00ee8e3
BLAKE2b-256 4383ebacf2d308e8edcd652967061de5f4fe4e9828c27fd0623fe7783cb79707
