Skip to main content

Scrapy utils for Modis crawlers projects.

Project description

crawler-utils

Scrapy utils for Modis crawlers projects.

MongoDB

Some utils connected with mongodb.

MongoDBPipeline - pipeline for saving items in mongodb.

Params:

  • MONGODB_SERVER - address of mongodb database.
  • MONGODB_PORT - port of mongodb database.
  • MONGODB_DB - database where to save data.
  • MONGODB_USERNAME - username for authentication in mongodb in MONGODB_DB database.
  • MONGODB_PWD - password for authentication.
  • DEFAULT_MONGODB_COLLECTION - default collection where to save data (default value is test).
  • MONGODB_COLLECTION_KEY - key of item which identifies items collection name (MONGO_COLLECTION) where to save item (default value is collection).
  • MONGODB_UNIQUE_KEY - key of item which identifies item

Kafka

Some utils connected with kafka.

KafkaPipeline - pipeline for pushing items into kafka.

Pipeline outputs data into stream with name {RESOURCE_TAG}.{DATA_TYPE}. Where RESOURCE_TAG is tag of resource from which data is crawled and DATA_TYPE is type of data crawled: data, post, comment, like, user, friend, share, member, news, community.

Params:

  • KAFKA_ADDRESS - address of kafka broker.
  • KAFKA_KEY - key of item which is put into kafka record key.
  • KAFKA_RESOURCE_TAG_KEY - key of item which identifies item RESOURCE_TAG (default value is platform)
  • KAFKA_DEFAULT_RESOURCE_TAG - default RESOURCE_TAG for crawled items without KAFKA_RESOURCE_TAG_KEY (default value is crawler)
  • KAFKA_DATA_TYPE_KEY - key of item from which identifies item DATA_TYPE (default value is type).
  • KAFKA_DEFAULT_DATA_TYPE - default DATA_TYPE for crawled items without KAFKA_DATA_TYPE_KEY (default value is data).
  • KAFKA_COMPRESSION_TYPE - type of data compression in kafka for example gzip.

CaptchaDetection

Captcha detection middleware for scrapy crawlers. It gets the HTML code from the response (if present), sends it to the captcha detection web-server and logs the result.

If you don't want to check exact response if it has captcha, provide meta-key dont_check_captcha with True value.

The middleware must be set up with higher precedence (lower number) than RetryMiddleware:

DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}

Middleware settings:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

modis_crawler_utils-0.3.21.tar.gz (49.8 kB view details)

Uploaded Source

Built Distribution

modis_crawler_utils-0.3.21-py3-none-any.whl (68.0 kB view details)

Uploaded Python 3

File details

Details for the file modis_crawler_utils-0.3.21.tar.gz.

File metadata

  • Download URL: modis_crawler_utils-0.3.21.tar.gz
  • Upload date:
  • Size: 49.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.7

File hashes

Hashes for modis_crawler_utils-0.3.21.tar.gz
Algorithm Hash digest
SHA256 0b77ff590d496fd648cd6abcaf3e5af15e8aab6b8f62d246e911c4a05e601226
MD5 7c797c76fb6837e74546ee52bf5248f7
BLAKE2b-256 aacc7a78570e89cbe4a417a3d36d208a0003709437cc0ecd9f0df567912bc71a

See more details on using hashes here.

File details

Details for the file modis_crawler_utils-0.3.21-py3-none-any.whl.

File metadata

File hashes

Hashes for modis_crawler_utils-0.3.21-py3-none-any.whl
Algorithm Hash digest
SHA256 5bfd25f2a7e2d1bad833c2f8f8f86c9e9765b612f61e729d7900e7979ab4a686
MD5 9a7c8cd94e65f7dc00af97f1a9a3b73b
BLAKE2b-256 aabefc25e867b4c56ebafc02df619492351618e49f6de5be22d3086659e45afb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page