
Overview

Scrapy is a great framework for web crawling. This package provides two pipelines for Scrapy that save items into MongoDB, one async and one sync. It also provides a highly customizable way to interact with MongoDB in both modes:

  • Save an item and get its Object ID with this pipeline

  • Update an item and get its Object ID with this pipeline

Requirements

  • txmongo, an asynchronous MongoDB driver built on Twisted

  • Tested on Python 3.5

  • Tested on Linux, but as a pure Python module it should work on other platforms where official Python and Twisted are supported

Installation

The quick way:

pip install scrapy-pipeline-mongodb

Or place the package alongside your Scrapy project.

Documentation

Add the pipeline to ITEM_PIPELINES in settings.py, for example:

from txmongo.filter import ASCENDING
from txmongo.filter import DESCENDING

# -----------------------------------------------------------------------------
# PIPELINE MONGODB ASYNC
# -----------------------------------------------------------------------------

ITEM_PIPELINES.update({
    'scrapy_pipeline_mongodb.pipelines.mongodb_async.PipelineMongoDBAsync': 500,
})

MONGODB_USERNAME = 'user'
MONGODB_PASSWORD = 'password'
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DATABASE = 'test_mongodb_async_db'
MONGODB_COLLECTION = 'test_mongodb_async_coll'

# Optional connection options use the MONGODB_OPTIONS_ prefix; see
# MONGODB_OPTIONS_* below.

MONGODB_INDEXES = [('field_0', ASCENDING, {'unique': True}),
                   (('field_0', 'field_1'), ASCENDING),
                   (('field_0', ASCENDING), ('field_0', DESCENDING))]

MONGODB_PROCESS_ITEM = 'scrapy_pipeline_mongodb.utils.process_item.process_item'

Settings Reference

MONGODB_USERNAME

A string of the username of the database.

MONGODB_PASSWORD

A string of the password of the database.

MONGODB_HOST

A string of the IP address or domain of the database.

MONGODB_PORT

An int of the port of the database.

MONGODB_DATABASE

A string of the name of the database.

MONGODB_COLLECTION

A string of the name of the collection.

MONGODB_OPTIONS_*

Extra options can be attached when the pipeline starts to connect to MongoDB.

If any options are needed, set them with the prefix MONGODB_OPTIONS_ and the pipeline will parse them.

For example:

option name      in settings.py
authMechanism    MONGODB_OPTIONS_authMechanism
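
In settings.py that might look like the following (the mechanism value shown is illustrative, not a requirement):

# The pipeline parses any setting prefixed with MONGODB_OPTIONS_ and
# treats the rest of the name (here: authMechanism) as the option name.
MONGODB_OPTIONS_authMechanism = 'SCRAM-SHA-1'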

For more options, please refer to the page:

Connection String URI Format — MongoDB Manual 3.4

MONGODB_INDEXES

The indexes defined in this setting will be created when the spider opens.

If an index already exists, the resulting warning or error will be suppressed.
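
Although the pipeline creates these indexes itself when the spider opens, the following standalone sketch of the roughly equivalent txmongo call (reusing the host, port, database, and collection names from the example above) shows what an index tuple maps to:

import txmongo
from twisted.internet import defer
from txmongo import filter as qf

@defer.inlineCallbacks
def create_indexes():
    # Connect with the same host and port as the settings above.
    conn = yield txmongo.MongoConnection('localhost', 27017)
    coll = conn['test_mongodb_async_db']['test_mongodb_async_coll']
    # Roughly equivalent to ('field_0', ASCENDING, {'unique': True}):
    yield coll.create_index(qf.sort(qf.ASCENDING('field_0')), unique=True)
    yield conn.disconnect()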

MONGODB_PROCESS_ITEM

This pipeline provides a setting to define the function process_item, which customizes how the pipeline interacts with MongoDB. One default function is provided with this package: it calls the collection's insert_one method to save the item into MongoDB, then returns the item.

If a customized function is provided to replace the default one, please note that its behavior should follow the requirements clearly written in the Scrapy documentation:

Item Pipeline — Scrapy 1.4.0 documentation
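
For illustration only, a customized function that updates instead of inserts might look like the sketch below; the (item, spider, collection) signature and the field_0 query key are assumptions, not this package's documented API:

from twisted.internet import defer

@defer.inlineCallbacks
def process_item(item, spider, collection):
    # With txmongo, update_one returns a Deferred firing an UpdateResult.
    result = yield collection.update_one(
        {'field_0': item['field_0']},  # hypothetical unique key
        {'$set': dict(item)},
        upsert=True,
    )
    if result.upserted_id is not None:
        spider.logger.debug('Upserted Object ID: %s', result.upserted_id)
    # Per the Scrapy docs, return the item so later pipelines can process it.
    defer.returnValue(item)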

Built-in Functions For Processing Item

scrapy_pipeline_mongodb.utils.process_item.process_item

This is a built-in function that calls the collection's insert_one method and returns the item.

To use this function, in settings.py:

MONGODB_PROCESS_ITEM = 'scrapy_pipeline_mongodb.utils.process_item.process_item'
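
For reference, the built-in function behaves roughly like this sketch; the exact signature is an assumption, and the package source is authoritative:

from twisted.internet import defer

@defer.inlineCallbacks
def process_item(item, spider, collection):
    # With txmongo, insert_one returns a Deferred firing an
    # InsertOneResult whose inserted_id is the new document's Object ID.
    result = yield collection.insert_one(dict(item))
    spider.logger.debug('Saved Object ID: %s', result.inserted_id)
    defer.returnValue(item)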

NOTE

Database drivers may have different APIs for the same operation; this pipeline adopts txmongo as its async MongoDB driver. Please read the relevant documentation to make sure a customized function runs smoothly in this pipeline.

TODO

  • Add a unit test for the index creation function

  • Add a sync pipeline
