
MongoDB plugins for Scrapy

Installation

pip install scrapy-mongo

Pipeline

This pipeline stores scraped items into a MongoDB collection.

Each item must have a unique id field to avoid duplicates. This field is automatically mapped to MongoDB’s _id field.

Each item must include a collection field that specifies the name of the target MongoDB collection.
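
As an illustration, an item that satisfies both requirements could be built as follows. This is a minimal sketch: the `make_item` helper, the `url` and `title` fields, and the `pages` collection name are hypothetical, and deriving the id from a hash of the URL is only one possible choice.

```python
import hashlib


def make_item(url: str, title: str) -> dict:
    """Build a dict item carrying the two fields the pipeline requires."""
    return {
        # Unique identifier; the pipeline maps it to MongoDB's _id.
        "id": hashlib.sha1(url.encode("utf-8")).hexdigest(),
        # Name of the target MongoDB collection (hypothetical value).
        "collection": "pages",
        # Remaining fields are arbitrary scraped data.
        "url": url,
        "title": title,
    }


item = make_item("https://example.com/a", "Page A")
```

A spider callback would typically yield such a dict (or an equivalent `scrapy.Item`) for each scraped page.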

Items are upserted in batches of 100 by default. The batch size can be adjusted using the PIPELINE_MONGO_BATCH_SIZE setting.
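
The batching described above can be pictured with the following pure-Python sketch. It is not the plugin's actual implementation: the helper names `to_upsert` and `batches` are hypothetical, and it assumes that `id` becomes `_id` and that the `collection` field is not stored in the document.

```python
from collections import defaultdict


def to_upsert(item: dict) -> tuple[str, dict, dict]:
    """Return (collection, filter, update) describing one upsert."""
    doc = dict(item)  # copy so the caller's item is left untouched
    collection = doc.pop("collection")
    doc_id = doc.pop("id")
    return collection, {"_id": doc_id}, {"$set": doc}


def batches(items: list[dict], batch_size: int = 100):
    """Yield (collection, operations) groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        grouped = defaultdict(list)
        for item in items[start:start + batch_size]:
            collection, flt, update = to_upsert(item)
            grouped[collection].append((flt, update))
        yield from grouped.items()
```

With pymongo, each group could then be sent in one round trip, for example with `bulk_write` and `UpdateOne(flt, update, upsert=True)` operations.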

To enable the pipeline, include the following lines in settings.py:

ITEM_PIPELINES = {
    'scrapy_mongo.MongoPipeline': 300,
}
PIPELINE_MONGO_URL = "mongodb://localhost:27017"
PIPELINE_MONGO_DATABASE = "mycollection"

Note: Update PIPELINE_MONGO_URL and PIPELINE_MONGO_DATABASE with the appropriate values for the specific environment.
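
One way to keep environment-specific values out of the source tree is to read them from environment variables in settings.py. In this sketch the variable names `SCRAPY_MONGO_URL` and `SCRAPY_MONGO_DATABASE`, the fallback database name, and the batch size of 500 are illustrative assumptions.

```python
import os

ITEM_PIPELINES = {
    "scrapy_mongo.MongoPipeline": 300,
}

# Hypothetical environment variable names, with local fallbacks.
PIPELINE_MONGO_URL = os.environ.get("SCRAPY_MONGO_URL", "mongodb://localhost:27017")
PIPELINE_MONGO_DATABASE = os.environ.get("SCRAPY_MONGO_DATABASE", "scraping")

# Upsert in batches of 500 instead of the default 100.
PIPELINE_MONGO_BATCH_SIZE = 500
```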

Cache

The cache component stores scraped responses in a MongoDB collection to avoid downloading the same pages multiple times. It leverages Scrapy’s fingerprinting mechanism to identify responses.
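
To give an idea of how a fingerprint can serve as a cache key, here is a simplified stand-in written with the standard library. It is not Scrapy's actual algorithm (which, among other things, canonicalizes the URL), and the function name `simple_fingerprint` is hypothetical.

```python
import hashlib


def simple_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that determine its response."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(b"\n")
    h.update(url.encode("utf-8"))
    h.update(b"\n")
    h.update(body)
    return h.hexdigest()


# Identical requests produce the same key, so a stored response
# can be looked up instead of downloading the page again.
key = simple_fingerprint("get", "https://example.com/page")
```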

To enable caching, include the following lines in settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_mongo.MongoCacheStorage'
HTTPCACHE_MONGO_URL = "mongodb://localhost:27017"
HTTPCACHE_MONGO_DATABASE = "scraping"
HTTPCACHE_EXPIRATION_SECS = 604800  # Default is 1 week

Note: Update HTTPCACHE_MONGO_URL and HTTPCACHE_MONGO_DATABASE with the appropriate values for the specific environment.

The default expiration time is set to 1 week (604800 seconds). This value can be modified via HTTPCACHE_EXPIRATION_SECS.
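
The expiration check can be sketched as follows. The helper `is_expired` is hypothetical, and the sketch assumes this storage follows the convention of Scrapy's built-in cache storages, where a value of 0 means entries never expire.

```python
HTTPCACHE_EXPIRATION_SECS = 24 * 60 * 60  # 1 day instead of 1 week


def is_expired(stored_at: float, now: float, expiration_secs: int) -> bool:
    """Return True when a cached entry is older than expiration_secs."""
    age = now - stored_at
    # Assumption: as in Scrapy's built-in storages, 0 disables expiration.
    return 0 < expiration_secs < age
```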

Tip: To share one MongoDB connection between the pipeline and the cache, replace PIPELINE_MONGO_URL and HTTPCACHE_MONGO_URL with a single MONGO_URL setting.
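
A combined settings.py using the shared setting might look like this. The sketch assumes the database settings stay separate, since only the URL is described as shareable; the database names are placeholders.

```python
# settings.py
ITEM_PIPELINES = {
    "scrapy_mongo.MongoPipeline": 300,
}
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrapy_mongo.MongoCacheStorage"

# One connection string shared by the pipeline and the cache.
MONGO_URL = "mongodb://localhost:27017"

# Assumption: databases are still configured per component.
PIPELINE_MONGO_DATABASE = "scraping"
HTTPCACHE_MONGO_DATABASE = "scraping"
```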

Cache policy

A cache policy with allowlist support is also available. It restricts caching to specific HTTP response codes, which can be given as an explicit list, as a regular expression, or both.

To enable the cache policy, add the following lines to settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy_mongo.CacheOnlyPolicy'
HTTPCACHE_ACCEPT_HTTP_CODES = [302]
HTTPCACHE_ACCEPT_HTTP_CODES_REGEX = r'2\d\d'

This configuration caches all 2xx responses as well as 302 redirects.
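
The decision this policy makes can be approximated by the sketch below. It is an assumption about the behaviour, not the plugin's code: the `should_cache` helper is hypothetical, and whether the plugin matches the regular expression against the whole status code (as done here with `re.fullmatch`) is not confirmed.

```python
import re

ACCEPT_HTTP_CODES = [302]
ACCEPT_HTTP_CODES_REGEX = r"2\d\d"


def should_cache(status: int) -> bool:
    """Return True if the status is in the list or matches the regex."""
    if status in ACCEPT_HTTP_CODES:
        return True
    return re.fullmatch(ACCEPT_HTTP_CODES_REGEX, str(status)) is not None
```

Under these assumptions, a 200 or 302 response would be stored, while a 301 or 404 would be fetched again on the next run.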

Build for publish

Install dependencies:

pip install build twine

Build the package:

python -m build --outdir dist

Publish to PyPI:

python -m twine upload dist/*
