MongoDB plugins for Scrapy
Installation
pip install scrapy-mongo
Pipeline
This pipeline stores scraped items into a MongoDB collection.
Each item must have a unique id field to avoid duplicates.
This field is automatically mapped to MongoDB’s _id field.
Each item must include a collection field that specifies the name of the target MongoDB collection.
Items are upserted in batches of 100 by default.
The batch size can be adjusted using the PIPELINE_MONGO_BATCH_SIZE setting.
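The item requirements above can be illustrated with a minimal sketch. It is not the plugin's actual code: `to_upsert` and `batches` are hypothetical helpers, and dropping the `collection` field from the stored document is an assumption made for the example.

```python
from itertools import islice

BATCH_SIZE = 100  # mirrors the documented default of PIPELINE_MONGO_BATCH_SIZE


def to_upsert(item):
    """Split an item into (collection name, filter, document).

    Hypothetical helper: it only illustrates the documented id -> _id mapping.
    """
    doc = dict(item)                    # copy so the original item is untouched
    collection = doc.pop("collection")  # assumption: not stored in the document
    doc["_id"] = doc.pop("id")          # the unique id becomes MongoDB's _id
    return collection, {"_id": doc["_id"]}, doc


def batches(items, size=BATCH_SIZE):
    """Yield lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Example items as a spider might yield them (field values are made up).
items = [
    {"id": f"book-{n}", "collection": "books", "title": f"Title {n}"}
    for n in range(250)
]
batch_sizes = [len(batch) for batch in batches(items)]
first = to_upsert(items[0])
```

With 250 items and the default batch size, this grouping yields three batches; the filter on `_id` is what would make each write an upsert rather than a duplicate insert.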
To enable the pipeline, include the following lines in settings.py:
ITEM_PIPELINES = {
    'scrapy_mongo.MongoPipeline': 300,
}
PIPELINE_MONGO_URL = "mongodb://localhost:27017"
PIPELINE_MONGO_DATABASE = "mycollection"
Note: Update PIPELINE_MONGO_URL and PIPELINE_MONGO_DATABASE with the appropriate values for the specific environment.
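One possible way to adapt these values per environment is to read them from environment variables in settings.py. This is only a sketch: the variable names `SCRAPY_MONGO_URL` and `SCRAPY_MONGO_DATABASE`, the database name `scraping`, and the batch size of 500 are illustrative assumptions, not names or values defined by the plugin.

```python
import os

ITEM_PIPELINES = {
    "scrapy_mongo.MongoPipeline": 300,
}

# Hypothetical environment variable names; fall back to local defaults.
PIPELINE_MONGO_URL = os.environ.get("SCRAPY_MONGO_URL", "mongodb://localhost:27017")
PIPELINE_MONGO_DATABASE = os.environ.get("SCRAPY_MONGO_DATABASE", "scraping")

# Optional: override the default batch size of 100 (500 is an arbitrary example).
PIPELINE_MONGO_BATCH_SIZE = 500
```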
Cache
The cache component stores scraped responses in a MongoDB collection to avoid downloading the same pages multiple times. It leverages Scrapy’s fingerprinting mechanism to identify responses.
To enable caching, include the following lines in settings.py:
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_mongo.MongoCacheStorage'
HTTPCACHE_MONGO_URL = "mongodb://localhost:27017"
HTTPCACHE_MONGO_DATABASE = "scraping"
HTTPCACHE_EXPIRATION_SECS = 604800 # Default is 1 week
Note: Update HTTPCACHE_MONGO_URL and HTTPCACHE_MONGO_DATABASE with the appropriate values for the specific environment.
The default expiration time is set to 1 week (604800 seconds).
This value can be modified via HTTPCACHE_EXPIRATION_SECS.
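The idea of a fingerprint-keyed cache with expiration can be sketched as follows. This is an illustration only: a plain dict stands in for the MongoDB collection, `simple_fingerprint` is a simplified stand-in rather than Scrapy's real fingerprint algorithm, and the helper names are hypothetical.

```python
import hashlib
import time

EXPIRATION_SECS = 604800  # one week, the documented default


def simple_fingerprint(method, url, body=b""):
    """Simplified stand-in for a request fingerprint (not Scrapy's algorithm)."""
    digest = hashlib.sha1()
    for part in (method.upper().encode(), url.encode(), body):
        digest.update(part)
        digest.update(b"\n")
    return digest.hexdigest()


def store(cache, method, url, response_body, now=None):
    """Save a response body under the request's fingerprint."""
    now = time.time() if now is None else now
    cache[simple_fingerprint(method, url)] = {"body": response_body, "stored_at": now}


def lookup(cache, method, url, now=None, expiration_secs=EXPIRATION_SECS):
    """Return the cached body, or None if it is missing or older than the limit."""
    now = time.time() if now is None else now
    entry = cache.get(simple_fingerprint(method, url))
    if entry is None or now - entry["stored_at"] > expiration_secs:
        return None
    return entry["body"]


cache = {}
store(cache, "GET", "https://example.com/", b"<html></html>", now=0)
fresh = lookup(cache, "get", "https://example.com/", now=3600)
stale = lookup(cache, "GET", "https://example.com/", now=EXPIRATION_SECS + 1)
```

Here `fresh` holds the stored body because only an hour has passed, while `stale` is `None` because the entry is older than one week and the page would be downloaded again.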
Tip: The pipeline and the cache can share one MongoDB connection: replace PIPELINE_MONGO_URL and HTTPCACHE_MONGO_URL with a single MONGO_URL setting.
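A combined settings.py using the shared connection might look like the sketch below. It assumes the database settings stay separate, as only the URL settings are described as replaceable; the database name `scraping` is a placeholder.

```python
# settings.py: pipeline and cache sharing one MongoDB connection
ITEM_PIPELINES = {
    "scrapy_mongo.MongoPipeline": 300,
}
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrapy_mongo.MongoCacheStorage"

# Single URL in place of PIPELINE_MONGO_URL and HTTPCACHE_MONGO_URL.
MONGO_URL = "mongodb://localhost:27017"

# Database names are still set per component (placeholder values).
PIPELINE_MONGO_DATABASE = "scraping"
HTTPCACHE_MONGO_DATABASE = "scraping"
```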
Cache policy
A cache policy with whitelist support is also available. It lets you define which HTTP response codes are cached, using an explicit list, a regular expression, or both.
To enable the cache policy, add the following lines to settings.py:
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy_mongo.CacheOnlyPolicy'
HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'
This configuration will accept all 2XX HTTP codes and 302 redirects.
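How an explicit list and a regular expression can combine into one decision is sketched below. This is not the plugin's implementation: `is_cacheable` is a hypothetical helper, and matching the whole status code with `re.fullmatch` is an assumption.

```python
import re

ACCEPT_HTTP_CODES = [302]
ACCEPT_HTTP_CODES_REGEX = r"2\d\d"


def is_cacheable(status):
    """Return True if the status is listed explicitly or matches the regex."""
    if status in ACCEPT_HTTP_CODES:
        return True
    # Assumption: the pattern must match the entire status code.
    return re.fullmatch(ACCEPT_HTTP_CODES_REGEX, str(status)) is not None


accepted = [code for code in (200, 204, 301, 302, 404, 500) if is_cacheable(code)]
```

With the values shown, only 200, 204, and 302 pass the check; 301, 404, and 500 are rejected.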
Build for publishing
Install dependencies:
pip install build twine
Build the package:
python -m build --outdir dist
Publish to PyPI:
python -m twine upload dist/*