Overview

Frontera is a framework for broad crawling that works especially well with Scrapy. This package provides a synchronous seed loader that reads seeds from MongoDB for Frontera:

  • Querying seeds can be customized

Requirements

  • pymongo

  • Tested on Python 3.5

  • Tested on Linux; since it is a pure Python module, it should also work on other platforms where official Python and Twisted are supported

Installation

The quick way:

pip install frontera-seedloader-mongodb

Alternatively, place this middleware alongside your Scrapy project.

Documentation

Add the seed loader to SPIDER_MIDDLEWARES in settings.py, for example:

# -----------------------------------------------------------------------------
# FRONTERA SEEDLOADER MONGODB ASYNC
# -----------------------------------------------------------------------------

SPIDER_MIDDLEWARES.update({
    'frontera_seedloader_mongodb.contrib.scrapy.middlewares.seeds.mongodb.MongoDBAsyncSeedLoader': 500,
})

SEEDS_MONGODB_USERNAME = 'user'
SEEDS_MONGODB_PASSWORD = 'password'
SEEDS_MONGODB_HOST = 'localhost'
SEEDS_MONGODB_PORT = 27017
SEEDS_MONGODB_DATABASE = 'test_mongodb_async_db'
SEEDS_MONGODB_COLLECTION = 'test_mongodb_async_coll'

# SEEDS_MONGODB_OPTIONS_ = 'SEEDS_MONGODB_OPTIONS_'

SEEDS_MONGODB_SEEDS_QUERY = {
    'filter': {'websites': {'$exists': 1}}
}
SEEDS_MONGODB_SEEDS_BATCH_SIZE = 1000

SEEDS_MONGODB_SEEDS_PREPARE = 'scrapy_project.utils.seeds_prepara'

Settings Reference

SEEDS_MONGODB_USERNAME

A string of the username of the database.

SEEDS_MONGODB_PASSWORD

A string of the password of the database.

SEEDS_MONGODB_HOST

A string of the IP address or the domain name of the database.

SEEDS_MONGODB_PORT

An integer of the port of the database.

SEEDS_MONGODB_DATABASE

A string of the name of the database.

SEEDS_MONGODB_COLLECTION

A string of the name of the collection that holds the seeds.
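
The connection settings above could be combined into a MongoDB connection URI roughly as follows. This is only a sketch: the helper name build_mongodb_uri and the fallback values for host and port are assumptions made for illustration, and the seed loader itself may build its connection differently.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(settings: dict) -> str:
    """Assemble a mongodb:// URI from the SEEDS_MONGODB_* connection settings.

    Hypothetical helper, not part of this package.
    """
    # Fallback values are assumptions for this sketch.
    host = settings.get('SEEDS_MONGODB_HOST', 'localhost')
    port = int(settings.get('SEEDS_MONGODB_PORT', 27017))
    username = settings.get('SEEDS_MONGODB_USERNAME')
    password = settings.get('SEEDS_MONGODB_PASSWORD')

    credentials = ''
    if username and password:
        # Percent-encode credentials so characters such as '@' or ':'
        # do not break the URI.
        credentials = '{}:{}@'.format(quote_plus(username), quote_plus(password))
    return 'mongodb://{}{}:{}'.format(credentials, host, port)


example_settings = {
    'SEEDS_MONGODB_USERNAME': 'user',
    'SEEDS_MONGODB_PASSWORD': 'p@ss',
    'SEEDS_MONGODB_HOST': 'localhost',
    'SEEDS_MONGODB_PORT': 27017,
}
uri = build_mongodb_uri(example_settings)
```

The database and collection names are kept out of the URI here; they would be selected on the client after connecting.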

SEEDS_MONGODB_OPTIONS_*

Options can be passed when the seed loader connects to MongoDB.

If any options are needed, set each one with the prefix SEEDS_MONGODB_OPTIONS_ and the seed loader will parse it.

For example:

option name      in settings.py
authMechanism    SEEDS_MONGODB_OPTIONS_authMechanism
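
One way such prefixed settings could be turned into connection options is sketched below. The function name collect_mongodb_options is hypothetical, and the actual parsing inside the seed loader may differ; the sketch only shows the prefix-stripping idea.

```python
OPTIONS_PREFIX = 'SEEDS_MONGODB_OPTIONS_'


def collect_mongodb_options(settings: dict) -> dict:
    """Return {option name: value} for every setting carrying the prefix.

    Hypothetical helper, not part of this package.
    """
    return {
        name[len(OPTIONS_PREFIX):]: value
        for name, value in settings.items()
        # Skip the bare prefix itself, which names no option.
        if name.startswith(OPTIONS_PREFIX) and len(name) > len(OPTIONS_PREFIX)
    }


example_settings = {
    'SEEDS_MONGODB_HOST': 'localhost',
    'SEEDS_MONGODB_OPTIONS_authMechanism': 'SCRAM-SHA-1',
}
options = collect_mongodb_options(example_settings)
```

The resulting dictionary could then be handed to the MongoDB client as keyword arguments.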

For more options, please refer to the page:

SEEDS_MONGODB_SEEDS_QUERY

A dictionary of keyword arguments passed to the query. The default value is:

SEEDS_MONGODB_SEEDS_QUERY = {
    'filter': None,
    'projection': None,
    'skip': 0,
    'limit': 0,
    'no_cursor_timeout': False,
    'cursor_type': CursorType.NON_TAILABLE,
    'sort': None,
    'allow_partial_results': False,
    'oplog_replay': False,
    'modifiers': None,
    'manipulate': True
}

The keys and values in this setting follow the keyword arguments of the find method of a pymongo collection. Please refer to the pymongo documentation for more details.
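
To illustrate how a partial SEEDS_MONGODB_SEEDS_QUERY relates to the defaults, here is a minimal sketch that overlays the user's dictionary on a default one. The helper build_find_kwargs is an assumption made for this example, and only a subset of the defaults is reproduced so that the snippet runs without pymongo installed.

```python
# Subset of the defaults listed above; cursor_type is left out so the sketch
# runs without importing pymongo.
DEFAULT_SEEDS_QUERY = {
    'filter': None,
    'projection': None,
    'skip': 0,
    'limit': 0,
    'no_cursor_timeout': False,
    'sort': None,
}


def build_find_kwargs(user_query=None):
    """Overlay the user's SEEDS_MONGODB_SEEDS_QUERY on the defaults.

    Hypothetical helper, not part of this package.
    """
    kwargs = dict(DEFAULT_SEEDS_QUERY)  # copy, so the defaults stay untouched
    kwargs.update(user_query or {})
    return kwargs


find_kwargs = build_find_kwargs({'filter': {'websites': {'$exists': 1}}})
# With a real pymongo collection the query would then be issued as:
#     cursor = collection.find(**find_kwargs)
```

Only the keys given in settings.py change; every other argument keeps its default.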

SEEDS_MONGODB_SEEDS_BATCH_SIZE

An integer of the batch size that each query will return. The default value is 100.

SEEDS_MONGODB_SEEDS_PREPARE

A string of the import path of the function that processes each task (seed) from MongoDB.

The input is a document queried from the collection set in SEEDS_MONGODB_COLLECTION, and the output should be an iterator that yields tuples with two elements: (url, doc). The url is used as the url argument of Request, and the doc is passed to request.meta. As an example, the default function in this middleware:

class MongoDBSeedLoader(SeedLoader):

    ...

    def open_spider(self, spider: Spider):
        try:
            if self.settings.get(SEEDS_MONGODB_SEEDS_PREPARE):
                self.prepare = load_object(
                    self.settings.get(SEEDS_MONGODB_SEEDS_PREPARE))
            else:
                self.prepare = lambda x: map(
                    lambda y: (y, {'seed': x}),
                    x['websites'])
        except:
            raise NotConfigured

        ...

    ...
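
A custom function for SEEDS_MONGODB_SEEDS_PREPARE might look like the sketch below. The function name seeds_prepare, the 'websites' field layout, and the URL filtering rule are assumptions for illustration; the only contract taken from the description above is that the function receives one document and yields (url, doc) tuples.

```python
def seeds_prepare(doc):
    """Yield (url, meta) tuples for one seed document.

    Hypothetical example of a custom prepare function.
    """
    for url in doc.get('websites', []):
        # Skip entries that are not usable HTTP(S) URLs (an assumed rule).
        if not isinstance(url, str) or not url.startswith(('http://', 'https://')):
            continue
        # Same meta layout as the default function: the whole document
        # is stored under the 'seed' key.
        yield url, {'seed': doc}


example_doc = {
    '_id': 1,
    'websites': ['https://example.com', 'ftp://example.org', None],
}
prepared = list(seeds_prepare(example_doc))
```

Such a function would be referenced from settings.py by its import path, as in the SEEDS_MONGODB_SEEDS_PREPARE example earlier.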

NOTE

Database drivers may have different APIs for the same operation. This seed loader adopts pymongo as the synchronous driver for MongoDB. If you want to customize this seed loader, please read the pymongo documentation for more details.

TODO

  • add an async way to load the seeds from MongoDB

  • split the MongoDB logic out into a backend to make this seed loader more flexible
