crawler-utils

Scrapy utilities for Modis crawler projects.

MongoDB

Utilities for working with MongoDB.

MongoDBPipeline - a pipeline for saving items to MongoDB.

Params:

  • MONGODB_SERVER - address of the MongoDB server.
  • MONGODB_PORT - port of the MongoDB server.
  • MONGODB_DB - database in which to save data.
  • MONGODB_USERNAME - username for authentication against the MONGODB_DB database.
  • MONGODB_PWD - password for authentication.
  • DEFAULT_MONGODB_COLLECTION - default collection in which to save data (default value is test).
  • MONGODB_COLLECTION_KEY - item key whose value names the collection (MONGO_COLLECTION) in which to save the item (default value is collection).
  • MONGODB_UNIQUE_KEY - item key that identifies the item.
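
The parameters above could be wired into a project's settings roughly as follows. This is a minimal sketch, not the package's documented setup: the pipeline path crawler_utils.MongoDBPipeline, the priority 300, and every connection value are assumptions, and resolve_collection is a hypothetical helper that only mirrors the fallback from MONGODB_COLLECTION_KEY to DEFAULT_MONGODB_COLLECTION described in the list.

```python
# settings.py (sketch). The pipeline path and priority are assumptions.
ITEM_PIPELINES = {
    "crawler_utils.MongoDBPipeline": 300,
}

MONGODB_SERVER = "localhost"         # placeholder address
MONGODB_PORT = 27017                 # MongoDB's default port
MONGODB_DB = "crawler"               # placeholder database name
MONGODB_USERNAME = "crawler_user"    # placeholder credentials
MONGODB_PWD = "change-me"
DEFAULT_MONGODB_COLLECTION = "test"  # documented default
MONGODB_COLLECTION_KEY = "collection"  # documented default
MONGODB_UNIQUE_KEY = "id"            # placeholder item field


def resolve_collection(item):
    """Hypothetical helper: name the collection an item would be saved to."""
    # Use the item's own collection name if present, else the default.
    return item.get(MONGODB_COLLECTION_KEY, DEFAULT_MONGODB_COLLECTION)
```

With these values, an item such as {"id": 1, "collection": "posts"} would be routed to posts, while an item without a collection field would fall back to test.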

Kafka

Utilities for working with Kafka.

KafkaPipeline - a pipeline for pushing items into Kafka.

The pipeline writes data to a stream named {RESOURCE_TAG}.{DATA_TYPE}, where RESOURCE_TAG is the tag of the resource from which the data is crawled and DATA_TYPE is the type of data crawled: data, post, comment, like, user, friend, share, member, news, or community.

Params:

  • KAFKA_ADDRESS - address of the Kafka broker.
  • KAFKA_KEY - item key whose value is used as the Kafka record key.
  • KAFKA_RESOURCE_TAG_KEY - item key whose value identifies the item's RESOURCE_TAG (default value is platform).
  • KAFKA_DEFAULT_RESOURCE_TAG - default RESOURCE_TAG for crawled items without KAFKA_RESOURCE_TAG_KEY (default value is crawler).
  • KAFKA_DATA_TYPE_KEY - item key whose value identifies the item's DATA_TYPE (default value is type).
  • KAFKA_DEFAULT_DATA_TYPE - default DATA_TYPE for crawled items without KAFKA_DATA_TYPE_KEY (default value is data).
  • KAFKA_COMPRESSION_TYPE - compression type for data sent to Kafka, for example gzip.
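
To make the {RESOURCE_TAG}.{DATA_TYPE} naming concrete, here is a minimal sketch of how the settings above might combine. The pipeline path crawler_utils.KafkaPipeline, the priority 310, the broker address, and the KAFKA_KEY value are assumptions; stream_name is a hypothetical helper that illustrates the documented defaults and is not the pipeline's actual code.

```python
# settings.py (sketch). The pipeline path and priority are assumptions.
ITEM_PIPELINES = {
    "crawler_utils.KafkaPipeline": 310,
}

KAFKA_ADDRESS = "localhost:9092"        # placeholder broker address
KAFKA_KEY = "id"                        # placeholder item field for the record key
KAFKA_RESOURCE_TAG_KEY = "platform"     # documented default
KAFKA_DEFAULT_RESOURCE_TAG = "crawler"  # documented default
KAFKA_DATA_TYPE_KEY = "type"            # documented default
KAFKA_DEFAULT_DATA_TYPE = "data"        # documented default
KAFKA_COMPRESSION_TYPE = "gzip"


def stream_name(item):
    """Hypothetical helper: build {RESOURCE_TAG}.{DATA_TYPE} for an item."""
    # Each part falls back to its default when the item lacks the key.
    resource_tag = item.get(KAFKA_RESOURCE_TAG_KEY, KAFKA_DEFAULT_RESOURCE_TAG)
    data_type = item.get(KAFKA_DATA_TYPE_KEY, KAFKA_DEFAULT_DATA_TYPE)
    return f"{resource_tag}.{data_type}"
```

Under these assumptions, an item with platform set to example and type set to post would go to example.post, and an item with neither field would go to crawler.data.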

CaptchaDetection

Captcha detection middleware for Scrapy crawlers. It gets the HTML code from the response (if present), sends it to the captcha detection web server, and logs the result.

To skip the captcha check for a particular response, set the meta key dont_check_captcha to True on its request.
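
As a small illustration of the opt-out, the sketch below builds a request meta dictionary with the key set. The helper name skip_captcha_check is hypothetical and not part of the package; only the dont_check_captcha key comes from the text above.

```python
def skip_captcha_check(meta=None):
    """Hypothetical helper: return a copy of request meta with the captcha check disabled."""
    updated = dict(meta or {})  # copy so the caller's dict is not mutated
    updated["dont_check_captcha"] = True
    return updated


# In a spider callback (Scrapy assumed to be installed), the result is passed as meta:
#     yield scrapy.Request(url, callback=self.parse, meta=skip_captcha_check({"page": 2}))
```

Passing meta={"dont_check_captcha": True} directly to the request is equivalent; the helper only keeps other meta keys intact.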

The middleware must be set up with higher precedence (lower number) than RetryMiddleware:

DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}

Middleware settings:

  • ENABLE_CAPTCHA_DETECTOR: bool = True. Whether to enable captcha detection.
  • CAPTCHA_SERVICE_URL: str. URL of the captcha detection web server, for example http://127.0.0.1:8000
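
The two settings might sit in settings.py alongside the middleware entry, as in this minimal sketch. Reading the URL from an environment variable is an assumption added for illustration; the fallback is the example address given above.

```python
import os

# settings.py (sketch): captcha middleware settings.
ENABLE_CAPTCHA_DETECTOR = True

# Assumption: allow overriding the service URL per environment;
# otherwise use the example address from the settings list.
CAPTCHA_SERVICE_URL = os.environ.get("CAPTCHA_SERVICE_URL", "http://127.0.0.1:8000")
```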
