crawler-utils
Scrapy utils for Modis crawler projects.
MongoDB
Some utils connected with mongodb.
MongoDBPipeline - pipeline for saving items in mongodb.
Params:
- MONGODB_SERVER - address of mongodb database.
- MONGODB_PORT - port of mongodb database.
- MONGODB_DB - database where to save data.
- MONGODB_USERNAME - username for authentication in mongodb in MONGODB_DB database.
- MONGODB_PWD - password for authentication.
- DEFAULT_MONGODB_COLLECTION - default collection where to save data (default value is `test`).
- MONGODB_COLLECTION_KEY - key of item which identifies items collection name (`MONGO_COLLECTION`) where to save item (default value is `collection`).
- MONGODB_UNIQUE_KEY - key of item which identifies item
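The collection settings above can be illustrated with a minimal sketch. It assumes the pipeline reads the collection name from the item key named by MONGODB_COLLECTION_KEY and falls back to DEFAULT_MONGODB_COLLECTION when the item has no such key. The function `resolve_collection` and the plain-dict settings are hypothetical and are not part of crawler-utils.

```python
# Hypothetical illustration of collection routing; not the actual MongoDBPipeline code.

def resolve_collection(item, settings):
    """Return the name of the collection an item would be saved to."""
    # Defaults mirror the documented ones: "collection" and "test".
    collection_key = settings.get("MONGODB_COLLECTION_KEY", "collection")
    default_collection = settings.get("DEFAULT_MONGODB_COLLECTION", "test")
    # Assumption: items without the key go to the default collection.
    return item.get(collection_key, default_collection)


settings = {"DEFAULT_MONGODB_COLLECTION": "items"}
print(resolve_collection({"collection": "posts", "id": 1}, settings))  # posts
print(resolve_collection({"id": 2}, settings))  # items
```

Under these assumptions, an item only needs to carry the collection key when it should not land in the default collection.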
Kafka
Some utils connected with kafka.
KafkaPipeline - pipeline for pushing items into kafka.
Pipeline outputs data into stream with name `{RESOURCE_TAG}.{DATA_TYPE}`.
Where `RESOURCE_TAG` is the tag of the resource from which data is crawled and `DATA_TYPE` is the type of data crawled: `data`, `post`, `comment`, `like`, `user`, `friend`, `share`, `member`, `news`, `community`.
Params:
- KAFKA_ADDRESS - address of kafka broker.
- KAFKA_KEY - key of item which is put into kafka record key.
- KAFKA_RESOURCE_TAG_KEY - key of item which identifies item `RESOURCE_TAG` (default value is `platform`).
- KAFKA_DEFAULT_RESOURCE_TAG - default `RESOURCE_TAG` for crawled items without `KAFKA_RESOURCE_TAG_KEY` (default value is `crawler`).
- KAFKA_DATA_TYPE_KEY - key of item which identifies item `DATA_TYPE` (default value is `type`).
- KAFKA_DEFAULT_DATA_TYPE - default `DATA_TYPE` for crawled items without `KAFKA_DATA_TYPE_KEY` (default value is `data`).
- KAFKA_COMPRESSION_TYPE - type of data compression in kafka, for example `gzip`.
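How the stream name could be derived from an item and the parameters above is sketched below. This is a minimal illustration under the assumption that missing item keys fall back to the documented defaults. The function `stream_name` and the plain-dict settings are hypothetical and are not part of crawler-utils.

```python
# Hypothetical illustration of stream naming; not the actual KafkaPipeline code.

def stream_name(item, settings):
    """Build the '{RESOURCE_TAG}.{DATA_TYPE}' stream name for an item."""
    # Item keys that carry the resource tag and data type (documented defaults).
    tag_key = settings.get("KAFKA_RESOURCE_TAG_KEY", "platform")
    type_key = settings.get("KAFKA_DATA_TYPE_KEY", "type")
    # Fallback values for items that lack those keys (documented defaults).
    default_tag = settings.get("KAFKA_DEFAULT_RESOURCE_TAG", "crawler")
    default_type = settings.get("KAFKA_DEFAULT_DATA_TYPE", "data")
    resource_tag = item.get(tag_key, default_tag)
    data_type = item.get(type_key, default_type)
    return f"{resource_tag}.{data_type}"


print(stream_name({"platform": "forum", "type": "post"}, {}))  # forum.post
print(stream_name({"text": "hello"}, {}))  # crawler.data
```

The value `forum` is only an example tag; any resource tag carried by the item would be used the same way.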
CaptchaDetection
Captcha detection middleware for scrapy crawlers. It gets the HTML code from the response (if present), sends it to the captcha detection web-server and logs the result.
If you don't want a particular response to be checked for captcha, set the meta key `dont_check_captcha` to `True` on its request.
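Setting the flag can be sketched in plain Python as follows. The helper `without_captcha_check` is hypothetical and is not part of crawler-utils; it only builds the meta dictionary that would then be passed to a Scrapy request.

```python
# Hypothetical helper; only the "dont_check_captcha" key comes from the documentation.

def without_captcha_check(meta=None):
    """Return a copy of `meta` with the dont_check_captcha flag set to True."""
    merged = dict(meta or {})  # copy so the caller's dict is not modified
    merged["dont_check_captcha"] = True
    return merged


# In a spider, the result would be passed as the `meta` argument of a request,
# for example: scrapy.Request(url, meta=without_captcha_check({"page": 1}))
print(without_captcha_check({"page": 1}))  # {'page': 1, 'dont_check_captcha': True}
```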
The middleware must be set up with higher precedence (lower number) than RetryMiddleware:
DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}
Middleware settings:
- CAPTCHA_SERVICE_URL: str. For example: http://127.0.0.1:8000