MongoDB plugins for Scrapy

Installation

pip install scrapy-mongo

Pipeline

This pipeline stores scraped items into a MongoDB collection.

Each item must have a unique id field to avoid duplicates. This field is automatically mapped to MongoDB’s _id field.

Each item must include a collection field that specifies the name of the target MongoDB collection.
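As an illustration, an item that satisfies both requirements might look like the sketch below. The field values, the `products` collection name, and the `make_item` and `to_document` helpers are hypothetical. `to_document` only illustrates the id-to-_id mapping described above and is not the plugin's actual code.

```python
def make_item(product_id, name, price):
    """Build a plain-dict item carrying the two required fields."""
    return {
        "id": product_id,          # unique per item; mapped to MongoDB's _id
        "collection": "products",  # target collection (hypothetical name)
        "name": name,
        "price": price,
    }


def to_document(item):
    """Sketch of the id -> _id mapping; returns (collection, document).

    Assumption: the collection field is dropped from the stored document.
    The plugin's real behaviour may differ.
    """
    doc = dict(item)               # copy so the original item is untouched
    collection = doc.pop("collection")
    doc["_id"] = doc.pop("id")
    return collection, doc


item = make_item("sku-123", "Widget", 9.99)
collection, doc = to_document(item)
print(collection, doc)
```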

Items are upserted in batches of 100 by default. The batch size can be adjusted using the PIPELINE_MONGO_BATCH_SIZE setting.
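The batching behaviour can be pictured with the minimal sketch below. It is an assumption about how such a buffer could work, not the plugin's implementation: the `BatchBuffer` class and its `flush` callback are hypothetical names, and the callback stands in for the actual MongoDB write.

```python
from collections import defaultdict

BATCH_SIZE = 100  # mirrors the documented default of PIPELINE_MONGO_BATCH_SIZE


class BatchBuffer:
    """Collect documents per collection and flush them in fixed-size batches."""

    def __init__(self, flush, batch_size=BATCH_SIZE):
        self.flush = flush              # callable(collection_name, list_of_docs)
        self.batch_size = batch_size
        self.pending = defaultdict(list)

    def add(self, collection, doc):
        batch = self.pending[collection]
        batch.append(doc)
        if len(batch) >= self.batch_size:
            self.flush_collection(collection)

    def flush_collection(self, collection):
        batch = self.pending.pop(collection, [])
        if batch:
            self.flush(collection, batch)

    def close(self):
        # Flush whatever is left, for example when the spider closes.
        for collection in list(self.pending):
            self.flush_collection(collection)


flushed = []
buffer = BatchBuffer(lambda name, docs: flushed.append((name, len(docs))))
for i in range(250):
    buffer.add("products", {"_id": i})
buffer.close()
print(flushed)  # [('products', 100), ('products', 100), ('products', 50)]
```

In a real pipeline the flush callback could, for example, issue a pymongo `bulk_write` of `UpdateOne(..., upsert=True)` operations; whether the plugin does exactly this is not stated in the source.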

To enable the pipeline, include the following lines in settings.py:

ITEM_PIPELINES = {
    'scrapy_mongo.MongoPipeline': 300,
}
PIPELINE_MONGO_URL = "mongodb://localhost:27017"
PIPELINE_MONGO_DATABASE = "mycollection"

Note: Update PIPELINE_MONGO_URL and PIPELINE_MONGO_DATABASE with the appropriate values for the specific environment.

Cache

The cache component stores scraped responses in a MongoDB collection to avoid downloading the same pages multiple times. It uses Scrapy's request fingerprinting mechanism to identify cached responses.
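To make the idea of fingerprint-keyed caching concrete, here is a minimal sketch. The `simple_fingerprint` function is a simplified stand-in and not Scrapy's exact algorithm (Scrapy also canonicalizes the URL, among other things). The in-memory `cache` dictionary stands in for the MongoDB collection, and `fetch` and `fake_download` are hypothetical helpers.

```python
import hashlib


def simple_fingerprint(method, url, body=b""):
    """Simplified stand-in for a request fingerprint (not Scrapy's algorithm)."""
    h = hashlib.sha1()
    h.update(method.upper().encode())
    h.update(b"\n")
    h.update(url.encode())
    h.update(b"\n")
    h.update(body)
    return h.hexdigest()


cache = {}  # stands in for the MongoDB collection, keyed by fingerprint


def fetch(method, url, download):
    """Return the cached response if present; otherwise download and store it."""
    key = simple_fingerprint(method, url)
    if key not in cache:
        cache[key] = download(url)
    return cache[key]


calls = []


def fake_download(url):
    calls.append(url)
    return "<html>page</html>"


fetch("GET", "https://example.com/a", fake_download)
fetch("get", "https://example.com/a", fake_download)  # same fingerprint, cache hit
print(len(calls))  # 1
```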

To enable caching, include the following lines in settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_mongo.MongoCacheStorage'
HTTPCACHE_MONGO_URL = "mongodb://localhost:27017"
HTTPCACHE_MONGO_DATABASE = "scraping"
HTTPCACHE_EXPIRATION_SECS = 604800  # Default is 1 week

Note: Update HTTPCACHE_MONGO_URL and HTTPCACHE_MONGO_DATABASE with the appropriate values for the specific environment.

The default expiration time is set to 1 week (604800 seconds). This value can be modified via HTTPCACHE_EXPIRATION_SECS.
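The expiration check can be sketched as follows. The `is_expired` helper and its `stored_at` parameter are hypothetical. Treating 0 as "never expire" follows Scrapy's documented meaning of HTTPCACHE_EXPIRATION_SECS for its built-in storages and is assumed, not confirmed, to apply to this storage as well.

```python
import time

HTTPCACHE_EXPIRATION_SECS = 604800  # 7 * 24 * 60 * 60 seconds = 1 week


def is_expired(stored_at, now=None, expiration_secs=HTTPCACHE_EXPIRATION_SECS):
    """Return True when a cached entry is older than the expiration window.

    Assumption: a value of 0 disables expiration entirely.
    """
    if expiration_secs <= 0:
        return False
    if now is None:
        now = time.time()
    return now - stored_at > expiration_secs


print(is_expired(stored_at=0, now=604800))   # False: exactly one week old
print(is_expired(stored_at=0, now=604801))   # True: older than one week
print(is_expired(stored_at=0, now=10**9, expiration_secs=0))  # False: never expires
```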

Cache policy

A cache policy with whitelist support is also available. It lets you define which HTTP response codes are cached, using an explicit list of codes, a regular expression, or both.

To enable the cache policy, add the following lines to settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy_mongo.CacheOnlyPolicy'
HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'

This configuration will accept all 2XX HTTP codes and 302 redirects.
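The acceptance rule for the configuration above can be sketched like this. The `should_cache` helper is hypothetical, and the use of a full match against the status code is an assumption; the plugin may apply the regular expression differently.

```python
import re

HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'


def should_cache(status, codes=HTTPCACHE_ACCEPT_HTTP_CODES,
                 pattern=HTTPCACHE_ACCEPT_HTTP_CODES_REGEX):
    """Accept a status if it is listed explicitly or fully matches the regex."""
    if status in codes:
        return True
    # Assumption: the whole status code must match the pattern.
    return re.fullmatch(pattern, str(status)) is not None


for status in (200, 204, 302, 301, 404, 500):
    print(status, should_cache(status))
```

Under these assumptions, 200, 204, and 302 are cached, while 301, 404, and 500 are not.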

Error

The error component stores error logs in a MongoDB collection. It catches errors raised in both the downloader and the spider middleware chains.

To enable error logging, include the following lines in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorDownloaderMiddleware': 1000,
}

SPIDER_MIDDLEWARES = {
    'scrapy_mongo.TraceErrorSpiderMiddleware': 1000,
}

ERROR_MONGO_URL = "mongodb://localhost:27017"
ERROR_MONGO_DATABASE = 'scraping'
ERROR_MONGO_COLLECTION = 'errors'

Note: Update ERROR_MONGO_URL, ERROR_MONGO_DATABASE and ERROR_MONGO_COLLECTION with the appropriate values for the specific environment.

It is possible to use the same MongoDB connection for the pipeline, the cache, and the error component by replacing PIPELINE_MONGO_URL, HTTPCACHE_MONGO_URL and ERROR_MONGO_URL with a single MONGO_URL setting.
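A settings.py fragment using the shared connection string might look like the sketch below. It assumes the database and collection settings remain component-specific, which the source does not state explicitly; the values are the example values used earlier in this document.

```python
# settings.py: one connection string shared by the pipeline, cache and error components
MONGO_URL = "mongodb://localhost:27017"

# Assumption: database and collection names are still configured per component.
PIPELINE_MONGO_DATABASE = "mycollection"
HTTPCACHE_MONGO_DATABASE = "scraping"
ERROR_MONGO_DATABASE = "scraping"
ERROR_MONGO_COLLECTION = "errors"
```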

Build for publish

Install dependencies:

pip install build twine

Build the package:

python -m build --outdir dist

Publish to PyPI:

python -m twine upload dist/*
