Based on scrapy.extensions.feedexport.FeedExporter to live stream data
Project description
scrapy-feedstreaming
Scrapy live-streaming data: a `scrapy.extensions.feedexport.FeedExporter` fork that exports items during scraping. See
https://medium.com/@alex_ber/scrapy-streaming-data-cdf97434dc15
See CHANGELOG.md for a detailed description.
Getting Help
QuickStart
python3 -m pip install -U scrapy-feedstreaming
Installing from GitHub
python3 -m pip install -U https://github.com/alex-ber/scrapy-feedstreaming/archive/master.zip
Optionally, install test requirements:
python3 -m pip install -U https://github.com/alex-ber/scrapy-feedstreaming/archive/master.zip#egg=scrapy-feedstreaming[tests]
Or explicitly:
wget https://github.com/alex-ber/scrapy-feedstreaming/archive/master.zip -O master.zip; unzip master.zip; rm master.zip
And then install from source (see below).
Installing from source
python3 -m pip install -r req.txt # only installs "required" (relaxed)
python3 -m pip install . # only installs "required"
python3 -m pip install .[tests] # installs dependencies for tests
Alternatively, you can install from the requirements files:
python3 -m pip install -r requirements.txt # only installs "required"
python3 -m pip install -r requirements-tests.txt # installs dependencies for tests
From the directory with setup.py:
python3 setup.py test  # run all tests
or
pytest
Installing a new version
See https://docs.python.org/3.1/distutils/uploading.html
python3 setup.py sdist upload
Requirements
scrapy-feedstreaming requires the following:
- Python 3.6+
Changelog
All notable changes to this project will be documented in this file.
[Unreleased]
[0.0.1] - 12/07/2020
Added
- Buffering was added to `item_scraped()`.
- `S3FeedStorage`: you can specify `ACL` as the query part of the URI.
- `S3FeedStorage`: support for `region` was added.
- FEEDS: `slot_key_param`: new (not available in Scrapy itself). Specifies a (global) function that takes the item and the spider as parameters and returns a `slot_key`, that is, given an item passing through the pipeline, it determines the URI you want to send it to. Falls back to a no-op method that does nothing. See the settings sketch after this list.
- FEEDS: `buff_capacity`: new (not available in Scrapy itself). The number of items to accumulate before they are exported. The fallback value is 1.
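A minimal `settings.py` sketch of how these two keys can be combined. Everything besides the `slot_key_param` and `buff_capacity` keys themselves is an assumption for illustration: the bucket URIs, the hypothetical item field `category`, the routing function name, and the choice to pass the function as a dotted path.

```python
# settings.py -- illustrative sketch, not the package's documented example.

def slot_key_for(item, spider):
    # Hypothetical routing: pick the target feed from an item field.
    # Here we assume the returned slot_key is the feed URI itself.
    if item.get('category') == 'small':
        return 's3://my-bucket/small-items.json?ACL=public-read'
    return 's3://my-bucket/big-items.json?ACL=public-read'

FEEDS = {
    's3://my-bucket/small-items.json?ACL=public-read': {
        'format': 'json',
        'slot_key_param': 'myproject.settings.slot_key_for',  # assumed dotted-path form
        'buff_capacity': 100,  # flush the buffer after every 100 items
    },
    's3://my-bucket/big-items.json?ACL=public-read': {
        'format': 'json',
        'slot_key_param': 'myproject.settings.slot_key_for',
        'buff_capacity': 100,
    },
}
```

Note the `ACL=public-read` query parameter, which uses the `S3FeedStorage` `ACL` support mentioned above.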
- `_FeedSlot` instances are created from your settings, one per supplied URI. Some extra information (compared to Scrapy) is stored on them, namely:
  - `uri_template` — available through the public `get_slots()` method, see below.
  - `spider_name` — used in the public `get_slots()` method to restrict the returned slots to the requested spider.
  - `buff_capacity` — the buffer's capacity; if the number of items exceeds this number, the buffer is flushed.
  - `buff` — the buffer where all items pending export are stored.
- On `FeedExporter` there is one extra public method, `get_slots()`. It is used to get the feed slots' information (see the implementation note above), which is populated from the settings. For example, you can retrieve the URIs to which the items will be exported. A usage sketch follows this list. Notes:
  - `slot_key` is the slot identifier as described above. If you have only one URI, you can supply `None` for this value.
  - You can retrieve the feed slots' information only from your spider.
  - It has an optional `force_create=True` parameter. If you call this method early in the Scrapy life-cycle, the feed slots' information may not have been created yet; in that case, the default behavior is to create this information and return it. If `force_create=False` is supplied, you will receive an empty collection of feed slots' information.
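A sketch of calling `get_slots()` from inside a spider. The changelog above only names the method and its `force_create` parameter; how the exporter instance is located (here, scanning the crawler's enabled extensions) and the exact argument list shown are assumptions for illustration.

```python
import scrapy


class ItemsSpider(scrapy.Spider):
    """Hypothetical spider that inspects its feed slots."""
    name = 'items'

    def parse(self, response):
        # Assumption: the forked FeedExporter is reachable through the
        # crawler's extension manager (Scrapy's MiddlewareManager keeps
        # the enabled instances in its .middlewares list).
        exporter = next(
            ext for ext in self.crawler.extensions.middlewares
            if type(ext).__name__ == 'FeedExporter'
        )
        # With a single configured URI, None can be supplied as the slot
        # key (per the note above); the signature here is assumed.
        for slot in exporter.get_slots(self, slot_key=None):
            self.logger.info('exporting to %s', slot.uri_template)
```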
- On `S3FeedStorage` there are a couple of public methods (a small sketch follows this list):
  - `botocore_session`
  - `botocore_client`
  - `botocore_base_kwargs` — dict of the minimal parameters for the `botocore_client.put_object()` method, as supplied in the settings.
  - `botocore_kwargs` — dict of all supplied parameters for the `botocore_client.put_object()` method. For example, if supplied, it will contain the `ACL` parameter, while `botocore_base_kwargs` will not contain it.
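A small sketch of the relationship between the two kwargs dicts. Only the four accessor names come from the changelog; the storage instance is normally created by the feed exporter, so this helper just assumes one is already at hand:

```python
def describe_storage(storage):
    """Print the put_object() parameters an S3FeedStorage will use.

    `storage` is assumed to be an S3FeedStorage instance obtained from a
    feed slot; it is normally constructed by the feed exporter itself.
    """
    base = storage.botocore_base_kwargs  # minimal put_object() parameters
    full = storage.botocore_kwargs       # all supplied put_object() parameters
    extras = set(full) - set(base)       # e.g. {'ACL'} when ACL was supplied
    print('botocore client:', storage.botocore_client)
    print('extra put_object() parameters:', extras)
```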
Changed
- You can have multiple URIs for exports.
- The logic of sending the items was moved from `close_spider()` to `item_scraped()`.
- Back-port: fix missing `storage.store()` calls in `FeedExporter.close_spider()` [https://github.com/scrapy/scrapy/pull/4626].
- Back-port: fix duplicated feed logs [https://github.com/scrapy/scrapy/pull/4629].
Removed
- Removed deprecated fallback to the `boto` library if `botocore` is not found.
- Removed deprecated implicit retrieval of settings from the project; settings are now passed explicitly.
Project details
File details
Details for the file scrapy-feedstreaming-0.0.1.tar.gz.
File metadata
- Download URL: scrapy-feedstreaming-0.0.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 18fcd13db5c441f640f06ff36e2eaae1b1e74d24ac436f645ea9a1dd04a1576c
MD5 | 52bb8100fe7379b77677c40858cb4cdd
BLAKE2b-256 | ba13d59a17821a748ff273e478478020a86253e1775b38609643ea3dde32afa9
File details
Details for the file scrapy_feedstreaming-0.0.1-py3-none-any.whl.
File metadata
- Download URL: scrapy_feedstreaming-0.0.1-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 89ee2ea3ac994843fdeb381b1fda60712aa5d16f763d478c94cc0035763252c3
MD5 | eb319c3bac8111ceb83241d4f9d4db4b
BLAKE2b-256 | ea92a8227b524fdf61ae3698de79c4e2a17ffb7dd93a5426680a76252239a2bc