Skip to main content

Scrapy utils for Modis crawlers projects.

Project description

crawler-utils

Scrapy utils for Modis crawlers projects.

MongoDB

Some utils connected with mongodb.

MongoDBPipeline - pipeline for saving items in mongodb.

Params:

  • MONGODB_SERVER - address of mongodb database.
  • MONGODB_PORT - port of mongodb database.
  • MONGODB_DB - database where to save data.
  • MONGODB_USERNAME - username for authentication in MONGODB_DB database.
  • MONGODB_PWD - password for authentication.
  • DEFAULT_MONGODB_COLLECTION - default collection where to save data (default value is test).
  • MONGODB_COLLECTION_KEY - key of item which identifies items collection name (MONGO_COLLECTION) where to save item (default value is collection).
  • MONGODB_UNIQUE_KEY - key of item which identifies item

Kafka

Some utils connected with kafka.

KafkaPipeline - pipeline for pushing items into kafka.

Pipeline outputs data into stream with name {RESOURCE_TAG}.{DATA_TYPE}. Where RESOURCE_TAG is tag of resource from which data is crawled and DATA_TYPE is type of data crawled: data, post, comment, like, user, friend, share, member, news, community.

Params:

  • KAFKA_ADDRESS - address of kafka broker.
  • KAFKA_KEY - key of item which is put into kafka record key.
  • KAFKA_RESOURCE_TAG_KEY - key of item which identifies item RESOURCE_TAG (default value is platform)
  • KAFKA_DEFAULT_RESOURCE_TAG - default RESOURCE_TAG for crawled items without KAFKA_RESOURCE_TAG_KEY (default value is crawler)
  • KAFKA_DATA_TYPE_KEY - key of item from which identifies item DATA_TYPE (default value is type).
  • KAFKA_DEFAULT_DATA_TYPE - default DATA_TYPE for crawled items without KAFKA_DATA_TYPE_KEY (default value is data).
  • KAFKA_COMPRESSION_TYPE - type of data compression in kafka for example gzip.

OpenSearch

OpenSearchRequestsDownloaderMiddleware transforms request-response pair into an item, and then sends it to the OpenSearch.

Settings:

`OPENSEARCH_REQUESTS_SETTINGS` - dict specifying OpenSearch client connections:
    "hosts": Optional[str | list[str]] = "localhost:9200" - hosts with opensearch endpoint,
    "timeout": Optional[int] = 60 - timeout of connections,
    "http_auth": Optional[tuple[str, str]] = None - HTTP authentication if needed,
    "port": Optional[int] = 443 - access port if not specified in hosts,
    "use_ssl": Optional[bool] = True - usage of SSL,
    "verify_certs": Optional[bool] = False - verifying certificates,
    "ssl_show_warn": Optional[bool] = False - show SSL warnings,
    "ca_certs": Optional[str] = None - CA certificate path,
    "client_key": Optional[str] = None - client key path,
    "client_cert": Optional[str] = None - client certificate path,
    "buffer_length": Optional[int] = 500 - number of items in OpenSearchStorage's buffer.

`OPENSEARCH_REQUESTS_INDEX`: Optional[str] = "scrapy-job-requests" - index in OpenSearch.

See an example in examples/opensearch.

CaptchaDetection

Captcha detection middleware for scrapy crawlers. It gets the HTML code from the response (if present), sends it to the captcha detection web-server and logs the result.

If you don't want to check exact response if it has captcha, provide meta-key dont_check_captcha with True value.

The middleware must be set up with higher precedence (lower number) than RetryMiddleware:

DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}

Middleware settings:

  • ENABLE_CAPTCHA_DETECTOR: bool = True. Whether to enable captcha detection.
  • CAPTCHA_SERVICE_URL: str. For an example: http://127.0.0.1:8000

Sentry logging

You may want to log exceptions during crawling to your Sentry. Use the crawler_utils.sentry_logging.SentryLoggingExtension for this. Note that sentry_sdk wants to be loaded as earlier as possible. To satisfy this condition make the extension with negative order:

EXTENSIONS = {
    # Load SentryLogging extension before other extensions.
    "crawler_utils.sentry_logging.SentryLoggingExtension": -1,
}

Settings:

SENTRY_DSN: str - Sentry's DSN, where to send events.
SENTRY_SAMPLE_RATE: float = 1.0 - sample rate for error events. Must be in range from 0.0 to 1.0.
SENTRY_TRACES_SAMPLE_RATE: float = 1.0 - the percentage chance a given transaction will be sent to Sentry.
SENTRY_ATTACH_STACKTRACE: bool = False - whether to attach stacktrace for error events.
SENTRY_MAX_BREADCRUMBS: int = 10 - max breadcrumbs to capture with Sentry.

For an example, check examples/sentry_logging.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

modis_crawler_utils-0.3.55.tar.gz (72.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

modis_crawler_utils-0.3.55-py3-none-any.whl (106.3 kB view details)

Uploaded Python 3

File details

Details for the file modis_crawler_utils-0.3.55.tar.gz.

File metadata

  • Download URL: modis_crawler_utils-0.3.55.tar.gz
  • Upload date:
  • Size: 72.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for modis_crawler_utils-0.3.55.tar.gz
Algorithm Hash digest
SHA256 003b3443d7965b14a128c1585129fb9d36cac3c7189e36d137ca26ab6a3798f3
MD5 861acb45f28d15486f8288f186f1e88f
BLAKE2b-256 b5a321bb49c9e79fa8deaee33070ac28d13595bc9dd851b4ae2b258c58bee9d9

See more details on using hashes here.

File details

Details for the file modis_crawler_utils-0.3.55-py3-none-any.whl.

File metadata

File hashes

Hashes for modis_crawler_utils-0.3.55-py3-none-any.whl
Algorithm Hash digest
SHA256 06b700b1a3f25dbe1cc83dfa457326d8e41c9b7381b999d6221624405f2ab522
MD5 ecd2f37d8980facf41f6459eb40e7e99
BLAKE2b-256 21fcff980252687409664dad4bd0161b2ba830ec846cb3da3a8196fd0d5a8104

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page