Based on scrapy.extensions.feedexport.FeedExporter to live stream data

Project description

scrapy-feedstreaming

Live streaming of data from Scrapy. A fork of scrapy.extensions.feedexport.FeedExporter that exports items during scraping. See https://medium.com/@alex_ber/scrapy-streaming-data-cdf97434dc15

See CHANGELOG.md for a detailed description.

Getting Help

QuickStart

python3 -m pip install -U scrapy-feedstreaming
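
Since the package replaces Scrapy's built-in feed exporter, you will typically wire it in via the EXTENSIONS setting. A minimal sketch, assuming the fork exposes its exporter as scrapy_feedstreaming.feedexport.FeedExporter (the dotted path is an assumption; check the project source or the article linked above):

# settings.py -- disable the built-in exporter and enable the fork
EXTENSIONS = {
    'scrapy.extensions.feedexport.FeedExporter': None,   # disable Scrapy's built-in
    'scrapy_feedstreaming.feedexport.FeedExporter': 0,   # assumed path to the fork
}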

Installing from Github

python3 -m pip install -U https://github.com/alex-ber/scrapy-feedstreaming/archive/master.zip

Optionally, install the test requirements:

python3 -m pip install -U https://github.com/alex-ber/scrapy-feedstreaming/archive/master.zip#egg=scrapy-feedstreaming[tests]

Or explicitly:

wget https://github.com/alex-ber/scrapy-feedstreaming/archive/master.zip -O master.zip; unzip master.zip; rm master.zip

And then install from source (see below).

Installing from source

python3 -m pip install -r req.txt # only installs "required" (relaxed)
python3 -m pip install . # only installs "required"
python3 -m pip install .[tests] # installs dependencies for tests

Alternatively, you can install from the requirements files:

python3 -m pip install -r requirements.txt # only installs "required"
python3 -m pip install -r requirements-tests.txt # installs dependencies for tests

From the directory with setup.py

python3 setup.py test # run all tests

or

pytest

Uploading a new version

See https://docs.python.org/3.1/distutils/uploading.html

python3 setup.py sdist upload

Requirements

scrapy-feedstreaming requires the following:

  • Python 3.6+

Changelog

All notable changes to this project will be documented in this file.

[Unreleased]

[0.0.1] - 12/07/2020

Added

  • Buffering was added to item_scraped().
  • S3FeedStorage: you can specify the ACL as a query part of the URI.
  • S3FeedStorage: support for specifying the region was added.
  • FEEDS: slot_key_param: new (not available in Scrapy itself). Specifies a (global) function that takes the item and the spider as parameters and returns the slot_key, i.e. it decides to which URI an item passing through the pipeline should be sent (see the sketch after this list). Falls back to a noop method, a method that does nothing.
  • FEEDS: buff_capacity: new (not available in Scrapy itself). The number of items to buffer before they are exported. The fallback value is 1.
  • _FeedSlot instances are created from your settings, one per supplied URI. Some extra information (compared to Scrapy) is stored, namely:
  • uri_template: available through the public get_slots() method, see below.
  • spider_name: used in the public get_slots() method to restrict the returned slots to the requested spider.
  • buff_capacity: the buffer's capacity; if the number of items exceeds this number, the buffer is flushed.
  • buff: the buffer where all items pending export are stored.
  • FeedExporter has one extra public method:
  • get_slots(): this method is used to get the feed slots' information (see the implementation note above). It is populated from the settings. For example, you can retrieve the URIs to which the items will be exported. Note:
  1. slot_key is the slot identifier described above. If you have only one URI, you can supply None for this value.
  2. You can retrieve feed slot information only from your own spider.
  3. It has an optional force_create=True parameter. If you call this method early in the Scrapy life-cycle, the feed slot information may not have been created yet. In this case, the default behavior is to create this information and return it to you. If force_create=False is supplied, you will receive an empty collection of feed slot information.
  • S3FeedStorage has a couple of public methods:
  • botocore_session
  • botocore_client
  • botocore_base_kwargs: a dict of the minimal parameters for the botocore_client.put_object() method, as supplied in the settings.
  • botocore_kwargs: a dict of all supplied parameters for the botocore_client.put_object() method. For example, if supplied, it will contain the ACL parameter, while botocore_base_kwargs will not.
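
To make the FEEDS options above concrete, here is a hedged sketch of a settings fragment and a slot_key function. The option names (slot_key_param, buff_capacity) and the ACL query parameter come from this changelog; the dotted-path string format for slot_key_param, the assumption that the slot key is the feed URI itself, and the names myproject.feeds.slot_key, books.json, authors.json are illustrative only.

# settings.py -- two export URIs, each buffering 10 items and setting the S3 ACL
FEEDS = {
    's3://my-bucket/books.json?ACL=public-read': {
        'format': 'json',
        'slot_key_param': 'myproject.feeds.slot_key',  # assumed dotted-path format
        'buff_capacity': 10,  # export after every 10 buffered items
    },
    's3://my-bucket/authors.json?ACL=public-read': {
        'format': 'json',
        'slot_key_param': 'myproject.feeds.slot_key',
        'buff_capacity': 10,
    },
}

# myproject/feeds.py -- the (global) slot_key function: it receives the item
# and the spider and returns the slot key of the URI the item is routed to
def slot_key(item, spider):
    if item.get('author') is not None:
        return 's3://my-bucket/authors.json?ACL=public-read'
    return 's3://my-bucket/books.json?ACL=public-read'

If you have only one URI, the notes above say you can pass None as the slot_key to get_slots().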

Changed

  • You can have multiple URIs for exports.
  • The logic of sending the item was moved from close_spider() to item_scraped().
  • Back-ported fix for missing storage.store() calls in FeedExporter.close_spider() [https://github.com/scrapy/scrapy/pull/4626]
  • Back-ported fix for duplicated feed logs [https://github.com/scrapy/scrapy/pull/4629]

Removed

  • Removed deprecated fallback to the boto library if botocore is not found.
  • Removed deprecated implicit retrieval of settings from the project; settings are now passed explicitly.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-feedstreaming-0.0.1.tar.gz (11.9 kB)

Built Distribution

scrapy_feedstreaming-0.0.1-py3-none-any.whl (9.4 kB)

File details

Details for the file scrapy-feedstreaming-0.0.1.tar.gz.

File metadata

  • Download URL: scrapy-feedstreaming-0.0.1.tar.gz
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for scrapy-feedstreaming-0.0.1.tar.gz

  • SHA256: 18fcd13db5c441f640f06ff36e2eaae1b1e74d24ac436f645ea9a1dd04a1576c
  • MD5: 52bb8100fe7379b77677c40858cb4cdd
  • BLAKE2b-256: ba13d59a17821a748ff273e478478020a86253e1775b38609643ea3dde32afa9

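To check a downloaded file against the digests above, here is a minimal sketch using Python's standard hashlib module (it assumes the sdist sits in the current directory):

# verify_sdist.py -- compare the sdist's SHA256 against the published digest
import hashlib

EXPECTED = '18fcd13db5c441f640f06ff36e2eaae1b1e74d24ac436f645ea9a1dd04a1576c'

with open('scrapy-feedstreaming-0.0.1.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print('OK' if digest == EXPECTED else 'HASH MISMATCH')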

File details

Details for the file scrapy_feedstreaming-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: scrapy_feedstreaming-0.0.1-py3-none-any.whl
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for scrapy_feedstreaming-0.0.1-py3-none-any.whl

  • SHA256: 89ee2ea3ac994843fdeb381b1fda60712aa5d16f763d478c94cc0035763252c3
  • MD5: eb319c3bac8111ceb83241d4f9d4db4b
  • BLAKE2b-256: ea92a8227b524fdf61ae3698de79c4e2a17ffb7dd93a5426680a76252239a2bc

