Scrapy is a popular open-source, collaborative Python framework for extracting data from websites. scrapy-arweave provides Scrapy pipelines and feed exports for storing items on Arweave.

Project description

Welcome to Scrapy-Arweave

🏠 Homepage

Install

pip install scrapy-arweave

Examples

Usage

  1. Install scrapy-arweave and some additional requirements.
pip install scrapy-arweave

It depends on libmagic, which must be installed separately:

Debian/Ubuntu

sudo apt-get install libmagic1

Windows

pip install python-magic-bin

macOS

  • When using Homebrew: brew install libmagic
  • When using MacPorts: port install file
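To confirm the dependency is visible before running a crawl, a quick check along these lines may help. It is only a sketch: it assumes the Python binding is importable as `magic` (the module name used by python-magic and python-magic-bin), and `check_magic` is a hypothetical helper, not part of scrapy-arweave. On Windows the library ships inside python-magic-bin, so the shared-library lookup may report False even when everything works.

```python
import ctypes.util
import importlib.util


def check_magic():
    """Report whether the `magic` module and the libmagic library can be located.

    Hypothetical helper for illustration; not part of scrapy-arweave.
    """
    return {
        # True if a Python module named `magic` is installed.
        "python_magic_installed": importlib.util.find_spec("magic") is not None,
        # True if the system linker can locate a shared library named `magic`.
        "libmagic_found": ctypes.util.find_library("magic") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_magic().items():
        print(f"{name}: {ok}")
```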
  2. Add 'scrapy_arweave.pipelines.ImagesPipeline' and/or 'scrapy_arweave.pipelines.FilesPipeline' to the ITEM_PIPELINES setting in your Scrapy project if you need to store images or other files on Arweave. For the Images Pipeline, use:
ITEM_PIPELINES = {'scrapy_arweave.pipelines.ImagesPipeline': 1}

For Files Pipeline, use:

ITEM_PIPELINES = {'scrapy_arweave.pipelines.FilesPipeline': 1}

The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like generating thumbnails and filtering the images based on their size.
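In Scrapy's own ImagesPipeline, thumbnails and size filtering are controlled by the IMAGES_THUMBS, IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings. The sketch below assumes that scrapy-arweave's ImagesPipeline honors the same settings, which is not confirmed here; the sizes are arbitrary example values.

```python
# settings.py (sketch; assumes the standard Scrapy image settings carry over)

# Generate two thumbnails per image, keyed by an arbitrary name.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Drop images smaller than 110x110 pixels.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```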

You can also use both the Files and Images Pipelines at the same time:

ITEM_PIPELINES = {
 'scrapy_arweave.pipelines.ImagesPipeline': 0,
 'scrapy_arweave.pipelines.FilesPipeline': 1
}
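The integer values set the order in which Scrapy runs the pipelines: lower values run first, and values are customarily chosen in the 0-1000 range. The short sketch below only illustrates that ordering rule; `pipeline_order` is a hypothetical helper, not a Scrapy or scrapy-arweave function.

```python
ITEM_PIPELINES = {
    "scrapy_arweave.pipelines.ImagesPipeline": 0,
    "scrapy_arweave.pipelines.FilesPipeline": 1,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    # ImagesPipeline (0) is listed before FilesPipeline (1).
    for path in pipeline_order(ITEM_PIPELINES):
        print(path)
```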

If you are using the ImagesPipeline, make sure to install the Pillow package. The Images Pipeline requires Pillow 7.1.0 or greater, which it uses for generating thumbnails and normalizing images to JPEG/RGB format.

pip install pillow

Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

Add the store path for files or images on Arweave as required.

# For ImagesPipeline
IMAGES_STORE = 'ar://images'

# For FilesPipeline
FILES_STORE = 'ar://files'

For more information about ImagesPipeline and FilesPipeline, see here.
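Scrapy's standard media pipelines read the URLs to download from the `file_urls` and `image_urls` item fields. Assuming scrapy-arweave keeps those default field names (not confirmed here), an item could be assembled roughly as follows. `build_item` and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def build_item(page_url, image_srcs, file_hrefs):
    """Build a plain-dict item whose media URLs are absolute.

    The pipelines need absolute URLs, so relative links found on the
    page are resolved against the page URL.
    """
    return {
        "page": page_url,
        "image_urls": [urljoin(page_url, src) for src in image_srcs],
        "file_urls": [urljoin(page_url, href) for href in file_hrefs],
    }


if __name__ == "__main__":
    item = build_item(
        "https://example.com/listing/42",
        ["/img/front.jpg"],
        ["docs/plan.pdf"],
    )
    print(item)
```

In a spider, a dict like this would be yielded from the parse callback so that the enabled pipelines can pick it up.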

  3. For feed storage, to store the scraping output as json, jsonlines, jsonl, jl, csv, xml, marshal, pickle, etc., set FEED_STORAGES as follows:
from scrapy_arweave.feedexport import get_feed_storages
FEED_STORAGES = get_feed_storages()

Then set WALLET_JWK and GATEWAY_URL, and set FEEDS as follows to store the scraped data:

WALLET_JWK = "<WALLET_JWK>"  # Either a path to the wallet JWK file or the JWK data itself
GATEWAY_URL = "https://arweave.net"

FEEDS = {
    'ar://house.json': {
        "format": "json"
    },
}
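Because the wallet JWK is a private key, it may be preferable to keep it out of version control. The sketch below reads it from an environment variable and writes the same items to two feeds. The variable name `ARWEAVE_WALLET_JWK` and the second feed URI are illustrative assumptions, not names defined by scrapy-arweave.

```python
# settings.py (sketch)
import os

# Hypothetical environment variable; falls back to the placeholder if unset.
WALLET_JWK = os.environ.get("ARWEAVE_WALLET_JWK", "<WALLET_JWK>")
GATEWAY_URL = "https://arweave.net"

# One entry per output; the key is the target URI, the value its feed options.
FEEDS = {
    "ar://house.json": {"format": "json"},
    "ar://house.csv": {"format": "csv"},
}
```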

See more on FEEDS here

  4. Now run the scraping as you normally would.

Author

👤 Pawan Paudel

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check the issues page.

Show your support

Give a ⭐️ if this project helped you!

Copyright © 2023 Pawan Paudel.
