

Project description


Overview

Scrapy is a great framework for web crawling. This package provides two pipelines for saving items into MongoDB, one async and one sync, and also offers a highly customizable way to interact with MongoDB in either mode.

  • Save an item and get its ObjectId from this pipeline

  • Update an item and get its ObjectId from this pipeline
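
For context, a minimal spider whose items would flow through the pipeline could look like the sketch below (the item fields and target site are illustrative, not part of this package):

import scrapy


class QuoteItem(scrapy.Item):
    # Hypothetical item; any Scrapy item works.
    text = scrapy.Field()
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Every yielded item passes through the enabled item
            # pipelines, so the pipeline saves it into MongoDB.
            yield QuoteItem(
                text=quote.css('span.text::text').extract_first(),
                author=quote.css('small.author::text').extract_first(),
            )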

Requirements

  • txmongo, an async MongoDB driver built on Twisted

  • Python 2.7 is not supported

  • Tested on Python 3.5, but it should work on other versions higher than Python 3.3

  • Tested on Linux, but since it is a pure Python module, it should work on other platforms with official Python and Twisted support, e.g. Windows, macOS, BSD

Installation

The quick way:

pip install scrapy-pipeline-mongodb

Or place this package alongside your Scrapy project.

Documentation

Enable the pipeline in settings.py, for example:

from txmongo.filter import ASCENDING
from txmongo.filter import DESCENDING

# -----------------------------------------------------------------------------
# PIPELINE MONGODB ASYNC
# -----------------------------------------------------------------------------

ITEM_PIPELINES = {
    'scrapy_pipeline_mongodb.pipelines.mongodb_async.PipelineMongoDBAsync': 500,
}

MONGODB_USERNAME = 'user'
MONGODB_PASSWORD = 'password'
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DATABASE = 'test_mongodb_async_db'
MONGODB_COLLECTION = 'test_mongodb_async_coll'

# Optional: any setting prefixed with MONGODB_OPTIONS_ is passed to the
# connection as a MongoDB option, e.g. MONGODB_OPTIONS_authMechanism.

MONGODB_INDEXES = [('field_0', ASCENDING),
                   (('field_0', 'field_1'), ASCENDING),
                   (('field_0', ASCENDING), ('field_1', DESCENDING))]

MONGODB_PROCESS_ITEM = 'scrapy_pipeline_mongodb.utils.process_item.process_item'

Settings Reference

MONGODB_USERNAME

A string of the username of the database.

MONGODB_PASSWORD

A string of the password of the database.

MONGODB_HOST

A string of the ip address or the domain of the database.

MONGODB_PORT

An int of the port of the database.

MONGODB_DATABASE

A string of the name of the database.

MONGODB_COLLECTION

A string of the name of the collection.

MONGODB_OPTIONS_*

Options can be attached when the pipeline starts to connect to MongoDB.

If an option is needed, name it with the prefix MONGODB_OPTIONS_ and the pipeline will parse it.

For example:

option name      in settings.py
-------------    -----------------------------
authMechanism    MONGODB_OPTIONS_authMechanism
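
For instance, connection options could be declared in settings.py like this (the values are only illustrative; authSource and authMechanism are standard MongoDB connection options):

# Each MONGODB_OPTIONS_<name> setting is parsed by the pipeline and
# attached to the MongoDB connection as the option <name>.
MONGODB_OPTIONS_authSource = 'admin'
MONGODB_OPTIONS_authMechanism = 'SCRAM-SHA-1'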

For more options, please refer to the page:

Connection String URI Format — MongoDB Manual 3.4

MONGODB_INDEXES

The indexes defined in this list are created when the spider opens.

If an index already exists, no warning or error is raised.
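
Creating comparable indexes by hand with txmongo would look roughly like the sketch below; this is only an illustration of the underlying driver call, not the pipeline's actual source:

from twisted.internet import defer
from txmongo import filter as qf


@defer.inlineCallbacks
def create_indexes(collection):
    # A single-field ascending index.
    yield collection.create_index(qf.sort(qf.ASCENDING('field_0')))
    # A compound index: ASCENDING/DESCENDING return lists of
    # (field, direction) pairs, so they can be concatenated.
    yield collection.create_index(
        qf.sort(qf.ASCENDING('field_0') + qf.DESCENDING('field_1')))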

MONGODB_PROCESS_ITEM

To fully customize how the pipeline interacts with MongoDB, it provides a setting for defining the process_item function. The package ships with one default function: it calls the collection's insert_one method to save the item into MongoDB, then returns the item.

If a custom function is provided to replace the default one, note that its behavior must follow the requirements written in the Scrapy documentation:

Item Pipeline — Scrapy 1.4.0 documentation
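
As an illustration, a custom replacement that upserts instead of inserting might look like the sketch below; the (collection, item, spider) argument order is an assumption made for the example, so check the pipeline source for the exact signature it expects:

from twisted.internet import defer


@defer.inlineCallbacks
def process_item_upsert(collection, item, spider):
    # Hypothetical custom function: update the document matching the
    # item's 'url' field, or insert it if it does not exist yet.
    yield collection.update_one(
        {'url': item['url']},
        {'$set': dict(item)},
        upsert=True,
    )
    # Item pipelines must hand the item back for later pipelines.
    defer.returnValue(item)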

Built-in Functions For Processing Items

scrapy_pipeline_mongodb.utils.process_item.process_item

This is a built-in function that calls the collection's insert_one method and returns the item.

To use this function, in settings.py:

MONGODB_PROCESS_ITEM = 'scrapy_pipeline_mongodb.utils.process_item.process_item'

NOTE

Drivers may expose different APIs for the same operation. This pipeline adopts txmongo as its async MongoDB driver; please read the relevant documentation to make sure any customized function runs correctly in this pipeline.
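
For example, pymongo's insert_one blocks and returns its result directly, while txmongo's insert_one returns a Twisted Deferred that fires with the result later. A small sketch of the difference (connection details assumed; adapt to your setup):

from twisted.internet import defer, task
from txmongo.connection import ConnectionPool


@defer.inlineCallbacks
def insert_async(reactor):
    # txmongo: insert_one returns a Deferred; yield it to get the
    # pymongo-style InsertOneResult once the write completes.
    conn = yield ConnectionPool('mongodb://localhost:27017')
    result = yield conn.db.coll.insert_one({'field_0': 'value'})
    print(result.inserted_id)
    yield conn.disconnect()

# With pymongo, the equivalent call would block and return the
# InsertOneResult directly:
#     result = collection.insert_one({'field_0': 'value'})

task.react(insert_async)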

TODO

  • Add a unit test for the index creation function

  • Add a sync pipeline



Download files


Source Distribution

scrapy-pipeline-mongodb-0.0.1.tar.gz (23.9 kB)

Built Distribution

scrapy_pipeline_mongodb-0.0.1-py3-none-any.whl (11.1 kB)

File details

Hashes for scrapy-pipeline-mongodb-0.0.1.tar.gz:

Algorithm    Hash digest
SHA256       0e97fe3a23628eb914581df17e778580d50b4c5e9de0ab123deae6e868bb25ba
MD5          b57e267b30a484ca3622c77e25750d41
BLAKE2b-256  0707c5a44c7b4f54ad014abb7a3025d0271fd257c6883b90063f9a03532d63c5

Hashes for scrapy_pipeline_mongodb-0.0.1-py3-none-any.whl:

Algorithm    Hash digest
SHA256       ebf5618507fb3f317280f1cd8432b49e4da1b5bc29138ddd4619ff09f4170f90
MD5          813752bb8c67cf0bd5ad8a29de027b5c
BLAKE2b-256  085adc90c44c87202b7ff1132da2e5a2f1b76ba3708b3f1b808f43560d71db49
