Scrapy spider imports for Meshagent datasets

Project description

Meshagent Scrapy

Spider a website with Scrapy and import page content into a Meshagent room dataset.

from meshagent.scrapy import import_domain_with_scrapy

# `room` is an already-connected Meshagent room (for example, the connection
# provided by a `meshagent room connect` session).
result = await import_domain_with_scrapy(
    room,
    url="https://example.com",   # start URL; the crawl stays on this domain
    table="pages",               # destination dataset table in the room
    namespace=["crawls"],        # dataset namespace for the table
    limit=100,                   # stop after this many imported pages
)

To test it through meshagent room connect:

meshagent room connect --room=my-room --identity=scrapy -- \
  python meshagent-sdk/meshagent-scrapy/examples/crawl.py \
  https://www.meshagent.com --table=sample --namespace=crawls --limit=100

The sample command writes progress to stderr while it imports. TTY output uses a single updating line; redirected output uses plain log lines. Pass --silent to suppress progress output.

The default extractor writes page content as markdown in the text column. Use --format=html to keep HTML, --format=text to strip markup to plain text, or pass content_format= from library code.
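
From library code, the same choice is made with the content_format keyword. A minimal sketch, assuming the accepted values mirror the CLI's markdown/html/text options:

result = await import_domain_with_scrapy(
    room,
    url="https://example.com",
    table="pages",
    content_format="text",  # assumed values: "markdown" (default), "html", "text"
)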

By default, the crawler runs Trafilatura cleanup before converting content and extracting links/images, which strips common navigation, footer, sidebar, and ad boilerplate. Use --clean=after-links to keep links/images from the original page while still cleaning the text content, or --clean=none to process the raw response body.
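
For example, to clean the stored text while keeping links and images from the original page:

meshagent room connect --room=my-room --identity=scrapy -- \
  python meshagent-sdk/meshagent-scrapy/examples/crawl.py \
  https://www.meshagent.com --table=sample --namespace=crawls --clean=after-links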

The CLI persists crawl frontier state by default in <table>__frontier, so a limited run can be resumed by running the same command again:

meshagent room connect --room=my-room --identity=scrapy -- \
  python meshagent-sdk/meshagent-scrapy/examples/crawl.py \
  https://www.meshagent.com --table=sample --namespace=crawls --limit=100

Pass --frontier-table to choose a different state table, or --no-resume to run without frontier persistence. Library callers can opt in with resume=True. Frontier updates are buffered before they are written; tune that with --frontier-batch-size or the library frontier_batch_size= argument. Failed URLs are not retried on resume unless you pass --retry-failed or retry_failed=True.
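
From library code, a resumable run looks roughly like this, using the keywords documented above (the batch size value is only illustrative):

result = await import_domain_with_scrapy(
    room,
    url="https://example.com",
    table="pages",
    limit=100,
    resume=True,             # persist frontier state and pick up where a prior run stopped
    frontier_batch_size=50,  # illustrative value; buffers frontier writes before flushing
    retry_failed=True,       # also re-attempt URLs that failed in earlier runs
)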

The crawler creates indexes by default: a BTREE index on the page table primary key, an INVERTED index on text, LABEL_LIST indexes on link_urls and image_urls, plus BTREE url and BITMAP status indexes on the frontier table. Pass --no-indexes or create_indexes=False to skip that. It also runs dataset optimization periodically while importing and shows optimizing/optimized in progress output. Tune that with --optimize-every or optimize_every=, and use 0 on the CLI or None in library code to disable automatic optimization.
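
The library-side equivalents, as a sketch:

result = await import_domain_with_scrapy(
    room,
    url="https://example.com",
    table="pages",
    create_indexes=False,  # skip the default index creation
    optimize_every=None,   # disable periodic dataset optimization
)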

Pass --response-filter or response_filter= to filter responses with a JMESPath expression over url, status, headers, and content_type; responses for which the expression evaluates falsy are skipped. Header names are lower-cased, so an HTML-only crawl can use:

--response-filter "contains(headers.\"content-type\", 'text/html')"

By default, records are merged on url with the columns url, date, content_type, text, links, link_urls, images, and image_urls. text is markdown unless another content format is selected. links and images are struct arrays that keep the source attributes and link text or image metadata; link_urls and image_urls are flattened URL arrays for fast lookup.
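
For illustration only, a stored record might look like this; the field names inside the links and images structs are placeholders, since only their general contents are described above:

{
    "url": "https://example.com/about",
    "date": "2025-01-01T00:00:00Z",
    "content_type": "text/html",
    "text": "# About\n\n...",
    "links": [{"url": "https://example.com/", "text": "Home"}],          # placeholder fields
    "link_urls": ["https://example.com/"],
    "images": [{"url": "https://example.com/logo.png", "alt": "Logo"}],  # placeholder fields
    "image_urls": ["https://example.com/logo.png"],
}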

Pass an async extract= callback to derive custom columns from the Scrapy response and content bytes. Return None from the callback to skip the record. Pass an async progress= callback to observe import progress from library code.
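
A sketch of both callbacks; the exact parameter shapes are not documented above, so the signatures here are assumptions:

import sys

async def extract(response, content: bytes):
    # response is the Scrapy response; content is the page content bytes.
    if b"<title>Login" in content:
        return None  # returning None skips the record
    # assumption: a returned mapping becomes extra columns on the record
    return {"title": response.css("title::text").get()}

async def progress(event):
    # assumption: event describes import progress (counts, current URL, etc.)
    print(event, file=sys.stderr)

result = await import_domain_with_scrapy(
    room,
    url="https://example.com",
    table="pages",
    extract=extract,
    progress=progress,
)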

Download files

Download the file for your platform.

Source Distribution

meshagent_scrapy-0.39.3.tar.gz (22.7 kB)

Built Distribution


meshagent_scrapy-0.39.3-py3-none-any.whl (17.6 kB)

File details

Details for the file meshagent_scrapy-0.39.3.tar.gz.

File metadata

  • Download URL: meshagent_scrapy-0.39.3.tar.gz
  • Upload date:
  • Size: 22.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for meshagent_scrapy-0.39.3.tar.gz:

  • SHA256: e1ada7cb38f78f612f290156e25200f1046f850ddbdf8512284aab31faf23754
  • MD5: e70c66a0feec324e5d52983f685490e8
  • BLAKE2b-256: 6b37f6d8581db4561345bb20fc4045897df6078af9a7a89c4d94eff50cc645d1

File details

Details for the file meshagent_scrapy-0.39.3-py3-none-any.whl.

File hashes

Hashes for meshagent_scrapy-0.39.3-py3-none-any.whl:

  • SHA256: 4e98b05ef9cbb30f682291aab2b73a74c5ab9e65d77bb6c72cc4281dd456484e
  • MD5: 2e9c43e0009f47ac644db2dc8f4940d2
  • BLAKE2b-256: 57d65fe0038720fb792ecc934a25100749d7dc2dac80865c7e768fefe371c4e5

