# Meshagent Common Crawl

Common Crawl import support for Meshagent datasets.

Import Common Crawl captures into a Meshagent room dataset.
```python
from meshagent.commoncrawl import import_domain_from_commoncrawl

result = await import_domain_from_commoncrawl(
    room,
    index="CC-MAIN-2025-08",
    domain="example.com",
    table="pages",
    url_filter=r"https?://(www\.)?example\.com/docs/.*",
)
```
To test it through meshagent room connect:

```bash
meshagent room connect --room=my-room --identity=commoncrawl -- \
  python meshagent-sdk/meshagent-commoncrawl/examples/crawl.py \
  http://www.meshagent.com --table=sample --namespace=crawls --limit=10
```
The example defaults to --scope=host, so https://www.example.com imports
only captures from www.example.com. Use --scope=domain when you explicitly
want sibling subdomains too, for example when a large site stores useful content
outside www.
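For instance, reusing the connect command above with the domain scope (flag values are illustrative):

```bash
# Import captures from example.com and its sibling subdomains
# (docs.example.com, blog.example.com, ...), not just www.
meshagent room connect --room=my-room --identity=commoncrawl -- \
  python meshagent-sdk/meshagent-commoncrawl/examples/crawl.py \
  https://www.example.com --table=sample --scope=domain
```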
The sample command writes progress to stderr while it imports. TTY output uses a
single updating line; redirected output uses plain log lines. Pass --silent to
suppress progress output. Columnar scans emit periodic heartbeat updates while
waiting for DataFusion batches.

WARC reads run concurrently by default and report queued records, downloaded
bytes, and request counts. Use --scan-partitions to tune DataFusion scan
parallelism, and --concurrency, --warc-retries, and --warc-retry-delay to tune
object reads.
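A sketch combining those knobs; every value below is an illustrative assumption to adapt, not a recommendation:

```bash
# Hypothetical tuning: 4 scan partitions, 8 concurrent WARC reads,
# 3 retries per object read, and an assumed retry delay unit of seconds.
meshagent room connect --room=my-room --identity=commoncrawl -- \
  python meshagent-sdk/meshagent-commoncrawl/examples/crawl.py \
  http://www.meshagent.com --table=sample --limit=100 \
  --scan-partitions=4 --concurrency=8 --warc-retries=3 --warc-retry-delay=2
```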
The importer uses Common Crawl's columnar index by default, queried through
DataFusion. Basic imports generate a SQL query that selects the latest HTML
capture per URL from the requested host or domain, excluding robots.txt.
Advanced callers can pass columnar_sql= to control URL selection directly; the
query must return url plus the WARC pointer columns (filename/offset/length, or
the Common Crawl names warc_filename/warc_record_offset/warc_record_length).
The example CLI exposes this as --sql.
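For example, a caller could narrow the selection to PDF captures. The pointer columns and the filter columns (url_host_registered_domain, content_mime_detected) are real Common Crawl columnar-index columns, but the ccindex table name is an assumption about how the importer registers the index in DataFusion; check the SDK for the name it actually exposes:

```python
# Sketch only: "ccindex" as the registered table name is an assumption.
result = await import_domain_from_commoncrawl(
    room,
    index="CC-MAIN-2025-08",
    domain="example.com",
    table="pdfs",
    columnar_sql="""
        SELECT url, warc_filename, warc_record_offset, warc_record_length
        FROM ccindex
        WHERE url_host_registered_domain = 'example.com'
          AND content_mime_detected = 'application/pdf'
    """,
)
```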
Common Crawl's CDX API is rate-limited and a poor fit for broad filtering. For
compatibility, the SDK still includes a polite CDX reader that uses
https://index.commoncrawl.org, sends a Meshagent User-Agent, serializes and
paces its requests, and gives clearer guidance on HTTP 503 responses.
By default, records are merged on url with the columns url, date,
content_type, and text. Pass an async extract= callback to derive custom
columns from the WARC record and decoded content bytes. Return None from the
callback to skip the record. Pass an async progress= callback to observe import
progress from library code.
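A minimal sketch of both callbacks; the exact signatures and the shape of the progress payload are assumptions here, so check the SDK's type hints for the real ones:

```python
import sys

from meshagent.commoncrawl import import_domain_from_commoncrawl

async def extract(record, content):
    # Assumed signature: the WARC record plus decoded content bytes.
    text = content.decode("utf-8", errors="replace")
    if not text.strip():
        return None  # returning None skips the record
    # Assumed return shape: a mapping of column name to value.
    return {"text": text, "char_count": len(text)}

async def progress(update):
    # Assumed payload: an object describing import progress.
    print(update, file=sys.stderr)

result = await import_domain_from_commoncrawl(
    room,
    index="CC-MAIN-2025-08",
    domain="example.com",
    table="pages",
    extract=extract,
    progress=progress,
)
```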
## Download files

Source Distribution: meshagent_commoncrawl-0.39.8.tar.gz

Built Distribution: meshagent_commoncrawl-0.39.8-py3-none-any.whl
## File details

Details for the file meshagent_commoncrawl-0.39.8.tar.gz.

### File metadata

- Download URL: meshagent_commoncrawl-0.39.8.tar.gz
- Upload date:
- Size: 22.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | b0ad0a6b005369ec117b30af09963611c1b5b5e6eab8ac75d23cbc1ceedaced0 |
| MD5 | b57465c5b4d2e59e4965c739d82da4c4 |
| BLAKE2b-256 | a775115bacf93ae3b7424d0275e65639c9d5f2ba8d60a13928993370a6fdd407 |
## File details

Details for the file meshagent_commoncrawl-0.39.8-py3-none-any.whl.

### File metadata

- Download URL: meshagent_commoncrawl-0.39.8-py3-none-any.whl
- Upload date:
- Size: 17.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 2711daac7982b4d46c05826edf89a25a3f22b10f5f0888da6072d28f7fdbb339 |
| MD5 | 33f9cc53492baaed84a5823de8e2f04e |
| BLAKE2b-256 | adcc64b1fc76e9816181378d3251927a0c58e9800fe87d743aea2eba2df016bb |