crawler-utils
Scrapy utilities for Modis crawler projects.
MongoDB
Utilities for working with MongoDB.
MongoDBPipeline - pipeline for saving items to MongoDB.
Params:
- MONGODB_SERVER - address of the MongoDB server.
- MONGODB_PORT - port of the MongoDB server.
- MONGODB_DB - database in which to save data.
- MONGODB_USERNAME - username for authentication against the MONGODB_DB database.
- MONGODB_PWD - password for authentication.
- DEFAULT_MONGODB_COLLECTION - default collection where to save data (default value is test).
- MONGODB_COLLECTION_KEY - key of item which identifies items collection name (MONGO_COLLECTION) where to save item (default value is collection).
- MONGODB_UNIQUE_KEY - key of item which identifies item
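As a rough sketch of how these parameters might be set in a Scrapy settings.py: the import path crawler_utils.MongoDBPipeline is an assumption (modelled on the middleware path shown in this README), the priority 300 is arbitrary, and every connection value is a placeholder. The target_collection helper is hypothetical and not part of crawler-utils; it only illustrates one plausible reading of how MONGODB_COLLECTION_KEY and DEFAULT_MONGODB_COLLECTION interact.

```python
# settings.py fragment: every value here is a placeholder for illustration.
ITEM_PIPELINES = {
    # Assumed import path; check the package for the real one.
    "crawler_utils.MongoDBPipeline": 300,
}
MONGODB_SERVER = "127.0.0.1"
MONGODB_PORT = 27017
MONGODB_DB = "crawl_results"
MONGODB_USERNAME = "crawler"
MONGODB_PWD = "change-me"
DEFAULT_MONGODB_COLLECTION = "test"    # documented default
MONGODB_COLLECTION_KEY = "collection"  # documented default
MONGODB_UNIQUE_KEY = "url"             # hypothetical item field


def target_collection(item):
    """Hypothetical helper: choose the collection an item would be saved to.

    Assumes the pipeline falls back to DEFAULT_MONGODB_COLLECTION when the
    item has no value under MONGODB_COLLECTION_KEY.
    """
    return item.get(MONGODB_COLLECTION_KEY, DEFAULT_MONGODB_COLLECTION)
```

Under this reading, an item such as {"url": "...", "collection": "posts"} would go to the posts collection, while an item without a collection field would go to test.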
Kafka
Utilities for working with Kafka.
KafkaPipeline - pipeline for pushing items into Kafka.
The pipeline outputs data into a stream named {RESOURCE_TAG}.{DATA_TYPE}, where RESOURCE_TAG is the tag of the resource from which the data is crawled and DATA_TYPE is the type of data crawled: data, post, comment, like, user, friend, share, member, news, community.
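The naming scheme can be illustrated with a small standalone helper. This is a sketch only: stream_name is a hypothetical function, not part of crawler-utils, and rejecting unknown data types is an assumption of the sketch rather than documented pipeline behaviour.

```python
# Data types listed in this README.
DATA_TYPES = (
    "data", "post", "comment", "like", "user",
    "friend", "share", "member", "news", "community",
)


def stream_name(resource_tag: str, data_type: str) -> str:
    """Hypothetical helper: build a '{RESOURCE_TAG}.{DATA_TYPE}' stream name."""
    if data_type not in DATA_TYPES:
        # Validation is an assumption of this sketch.
        raise ValueError(f"unknown data type: {data_type!r}")
    return f"{resource_tag}.{data_type}"
```

For a made-up resource tag example_site, posts would land in the stream example_site.post.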
Params:
- KAFKA_ADDRESS - address of the Kafka broker.
- KAFKA_KEY - key of the item which is put into the Kafka record key.
- KAFKA_RESOURCE_TAG_KEY - key of the item which identifies its RESOURCE_TAG (default value is platform).
- KAFKA_DEFAULT_RESOURCE_TAG - default RESOURCE_TAG for crawled items without KAFKA_RESOURCE_TAG_KEY (default value is crawler).
- KAFKA_DATA_TYPE_KEY - key of the item which identifies its DATA_TYPE (default value is type).
- KAFKA_DEFAULT_DATA_TYPE - default DATA_TYPE for crawled items without KAFKA_DATA_TYPE_KEY (default value is data).
- KAFKA_COMPRESSION_TYPE - type of data compression in Kafka, for example gzip.
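A possible settings.py fragment for these parameters is sketched below. The import path crawler_utils.KafkaPipeline is an assumption (modelled on the middleware path shown in this README), the priority 300 is arbitrary, and the broker address and KAFKA_KEY value are placeholders. The resolve_stream helper is hypothetical; it shows how the key and default settings would combine if the pipeline simply falls back to the defaults for missing item fields.

```python
# settings.py fragment: values are placeholders unless marked as documented defaults.
ITEM_PIPELINES = {
    # Assumed import path; check the package for the real one.
    "crawler_utils.KafkaPipeline": 300,
}
KAFKA_ADDRESS = "127.0.0.1:9092"        # placeholder broker address
KAFKA_KEY = "id"                        # hypothetical item field
KAFKA_RESOURCE_TAG_KEY = "platform"     # documented default
KAFKA_DEFAULT_RESOURCE_TAG = "crawler"  # documented default
KAFKA_DATA_TYPE_KEY = "type"            # documented default
KAFKA_DEFAULT_DATA_TYPE = "data"        # documented default
KAFKA_COMPRESSION_TYPE = "gzip"


def resolve_stream(item):
    """Hypothetical helper: derive the stream name for an item.

    Assumes missing item fields fall back to the KAFKA_DEFAULT_* settings.
    """
    tag = item.get(KAFKA_RESOURCE_TAG_KEY, KAFKA_DEFAULT_RESOURCE_TAG)
    data_type = item.get(KAFKA_DATA_TYPE_KEY, KAFKA_DEFAULT_DATA_TYPE)
    return f"{tag}.{data_type}"
```

With these values, an item carrying neither platform nor type would be routed to crawler.data.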
CaptchaDetection
Captcha detection middleware for Scrapy crawlers. It takes the HTML from the response (if present), sends it to the captcha detection web server, and logs the result.
To skip the captcha check for a particular response, provide the meta key dont_check_captcha with the value True.
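One way to set that meta key is sketched below. skip_captcha_meta is a hypothetical helper, not part of crawler-utils; it only builds the meta dictionary, so it runs without Scrapy installed. The commented line shows how the result would typically be passed to a request inside a spider.

```python
def skip_captcha_meta(meta=None):
    """Hypothetical helper: copy a meta dict and disable the captcha check."""
    merged = dict(meta or {})
    merged["dont_check_captcha"] = True
    return merged


# Inside a spider callback, the dict would be passed as the request meta:
#   yield scrapy.Request(url, callback=self.parse, meta=skip_captcha_meta())
```

Copying the incoming dictionary keeps any meta keys that are already set and avoids mutating the caller's object.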
The middleware must be set up with higher precedence (lower number) than RetryMiddleware:
DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}
Middleware settings:
- CAPTCHA_SERVICE_URL: str. For example: http://127.0.0.1:8000