
Scrapy spider imports for Meshagent datasets


Meshagent Scrapy

Spider a website with Scrapy and import page content into a Meshagent room dataset.

from meshagent.scrapy import import_domain_with_scrapy

# "room" is an already-connected Meshagent room client; call this from
# an async context (e.g. inside a coroutine run with asyncio.run).
result = await import_domain_with_scrapy(
    room,
    url="https://example.com",
    table="pages",
    namespace=["crawls"],
    limit=100,
    concurrency=5,
)

To test it through meshagent room connect:

meshagent room connect --room=my-room --identity=scrapy -- \
  python meshagent-sdk/meshagent-scrapy/examples/crawl.py \
  https://www.meshagent.com --table=sample --namespace=crawls --limit=100 --concurrency=5

The sample command writes progress to stderr while it imports. TTY output uses a single updating line; redirected output uses plain log lines. Pass --silent to suppress progress output.

Pass --concurrency or concurrency= to tune Scrapy's maximum concurrent requests.

Pass --batch-size or batch_size= to cap how many page records are merged into the content table at once. The crawler also flushes content batches by estimated payload size with --max-batch-bytes or max_batch_bytes=, which defaults to 16 MiB, and by elapsed time with --max-batch-delay or max_batch_delay=, which defaults to 60 seconds. The row-count cap defaults to 100. Raw HTML rows can be large, so prefer lowering the byte limit before lowering the row count if the room server reports Lance/DataFusion merge memory exhaustion while importing full HTML pages.
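The three flush conditions above can be sketched as a small batcher that flushes when any cap is hit. This is an illustrative sketch, not the crawler's code; the class and argument names are assumptions that mirror the documented flags and defaults.

```python
import time

class RecordBatcher:
    """Buffer records and flush when any cap is hit: row count, estimated
    payload bytes, or seconds elapsed since the first buffered record."""

    def __init__(self, flush, batch_size=100, max_batch_bytes=16 * 1024 * 1024,
                 max_batch_delay=60.0):
        self.flush_fn = flush
        self.batch_size = batch_size
        self.max_batch_bytes = max_batch_bytes
        self.max_batch_delay = max_batch_delay
        self.records, self.bytes_pending, self.first_at = [], 0, None

    def add(self, record: dict, estimated_bytes: int) -> None:
        if self.first_at is None:
            self.first_at = time.monotonic()
        self.records.append(record)
        self.bytes_pending += estimated_bytes
        if (len(self.records) >= self.batch_size
                or self.bytes_pending >= self.max_batch_bytes
                or time.monotonic() - self.first_at >= self.max_batch_delay):
            self.flush()

    def flush(self) -> None:
        if self.records:
            self.flush_fn(self.records)
        self.records, self.bytes_pending, self.first_at = [], 0, None
```

With raw HTML rows, the byte cap typically triggers well before the 100-row cap, which is why lowering it is the first lever for memory pressure.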

The crawler sends a browser-like User-Agent by default. Pass --user-agent or user_agent= to override it for a specific crawl.

The default extractor writes page content as markdown in the text column. Use --format=html to keep HTML, --format=text to strip markup to plain text, or pass content_format= from library code.

By default, the crawler runs Trafilatura cleanup before converting markdown/text content, which strips common navigation, footer, sidebar, and ad boilerplate. For --format=html, the default is to strip scripts and inline image data URLs while preserving the rest of the HTML. Use --strip= with comma-separated values like scripts, css, whitespace, image-data-urls, or clean to choose the HTML stripping steps, or --strip=none to process the raw response body.
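The HTML stripping steps can be illustrated with a regex-based sketch. This is not the crawler's implementation (which may parse the DOM properly); it only shows what the named steps do to the markup.

```python
import re

def strip_html(html: str, steps=("scripts", "image-data-urls")) -> str:
    """Apply a subset of the documented stripping steps to raw HTML.
    Regex-based illustration only."""
    if "scripts" in steps:
        html = re.sub(r"(?is)<script\b.*?</script>", "", html)
    if "css" in steps:
        html = re.sub(r"(?is)<style\b.*?</style>", "", html)
    if "image-data-urls" in steps:
        # Drop inline base64 payloads from <img src="data:..."> attributes.
        html = re.sub(r'src="data:[^"]*"', 'src=""', html)
    if "whitespace" in steps:
        html = re.sub(r"\s+", " ", html).strip()
    return html
```

The defaults shown (scripts plus image data URLs) match the documented --format=html behavior; passing an empty tuple corresponds to --strip=none.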

The CLI persists crawl frontier state by default in <table>__frontier, so a limited run can be resumed by running the same command again:

meshagent room connect --room=my-room --identity=scrapy -- \
  python meshagent-sdk/meshagent-scrapy/examples/crawl.py \
  https://www.meshagent.com --table=sample --namespace=crawls --limit=100

Pass --frontier-table to choose a different state table, or --no-resume to run without frontier persistence. Library callers can opt in with resume=True. Frontier updates are buffered before they are written; tune that with --frontier-batch-size or the library frontier_batch_size= argument. Failed URLs are not retried on resume unless you pass --retry-failed or retry_failed=True.
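The resume semantics can be sketched against a simple frontier mapping. The status names here are illustrative, not the frontier table's actual schema; the sketch only shows which URLs get re-queued.

```python
def urls_to_resume(frontier: dict, retry_failed: bool = False) -> list:
    """Pick URLs to re-queue from a persisted frontier.
    `frontier` maps url -> status. Completed URLs are never re-fetched;
    failed ones are included only when retry_failed is set."""
    statuses = {"pending"} | ({"failed"} if retry_failed else set())
    return [url for url, status in frontier.items() if status in statuses]
```

Running the same limited command again then continues from the pending URLs rather than restarting the crawl from the seed.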

The crawler creates indexes by default: a BTREE index on the page table primary key, plus BTREE url and BITMAP status indexes on the frontier table. Pass --index=text or index_columns=("text",) to also create an INVERTED index on text. Pass --no-indexes or create_indexes=False to skip all automatic index creation. It also runs dataset optimization periodically while importing and shows optimizing/optimized in progress output. Tune that with --optimize-every or optimize_every=, and use 0 on the CLI or None in library code to disable automatic optimization.
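The periodic-optimization cadence and its disable values can be sketched as a simple predicate. The function name is made up; only the 0/None semantics follow the documentation.

```python
from typing import Optional

def should_optimize(batches_flushed: int, optimize_every: Optional[int]) -> bool:
    """Run dataset optimization after every N flushed batches.
    None (library code) or 0 (CLI) disables automatic optimization."""
    if not optimize_every:
        return False
    return batches_flushed % optimize_every == 0
```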

By default, the crawler imports textual responses only, based on Content-Type values containing text/, html, xml, or json. Pass --response-filter or response_filter= to replace that default with a JMESPath expression over url, status, headers, content_type, and content_type_lower. Header names are lower-cased, so an HTML-only crawl can use:

--response-filter "contains(headers.\"content-type\", 'text/html')"
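For reference, the documented default (textual responses only) behaves like the plain-Python check below. This is an illustration of the default's logic, not the library's code; the real filter is replaced wholesale by the JMESPath expression you pass.

```python
from typing import Optional

def default_response_filter(content_type: Optional[str]) -> bool:
    """Accept only textual responses, matching the documented default:
    Content-Type values containing text/, html, xml, or json."""
    if not content_type:
        return False
    ct = content_type.lower()
    return any(marker in ct for marker in ("text/", "html", "xml", "json"))
```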

By default, records are merged on url with the columns url, date, content_type, text, and images. text is markdown unless another content format is selected. images is a struct array with src and alt only, and inline image data URLs are excluded.
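The shape of the images column (src and alt only, data URLs excluded) can be illustrated with a small extractor. This regex sketch is not the crawler's implementation; it only demonstrates the documented output shape.

```python
import re

def extract_images(html: str) -> list:
    """Collect src/alt pairs from <img> tags, excluding inline data URLs,
    matching the documented shape of the images column."""
    images = []
    for tag in re.findall(r"(?is)<img\b[^>]*>", html):
        src = re.search(r'src="([^"]*)"', tag)
        alt = re.search(r'alt="([^"]*)"', tag)
        src_val = src.group(1) if src else ""
        if src_val.startswith("data:"):
            continue  # inline image data URLs are excluded
        images.append({"src": src_val, "alt": alt.group(1) if alt else ""})
    return images
```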

Pass an async extract= callback to derive custom columns from the Scrapy response and content bytes. Return None from the callback to skip the record. Pass an async progress= callback to observe import progress from library code.

Download files

Download the file for your platform.

Source Distribution

meshagent_scrapy-0.39.8.tar.gz (29.7 kB)

Uploaded Source

Built Distribution


meshagent_scrapy-0.39.8-py3-none-any.whl (21.3 kB)

Uploaded Python 3

File details

Details for the file meshagent_scrapy-0.39.8.tar.gz.

File metadata

  • Download URL: meshagent_scrapy-0.39.8.tar.gz
  • Size: 29.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for meshagent_scrapy-0.39.8.tar.gz:

  • SHA256: 5701651770d95e2147f8b08b4a6c693833984be240d3545996acc3b07119a886
  • MD5: e6a3a76992805824d8ed602a48174ebe
  • BLAKE2b-256: 09bdb6e9441fc67757516da8139473731a4c8fcc34cf2fe5157a195593fe0b83


File details

Details for the file meshagent_scrapy-0.39.8-py3-none-any.whl.


File hashes

Hashes for meshagent_scrapy-0.39.8-py3-none-any.whl:

  • SHA256: e0a8c01b56e1605d7eb4640a0100105225ce8c1835a4393d75cac848dbc111c0
  • MD5: c20da0f28037d38f215295beeab521af
  • BLAKE2b-256: 3ebe4cf60730fcc7d4020c4cec44bea608d344af0327940e2dbe2cf46d5af4fd

