Scrapy exporter for Big Data formats

Overview

scrapy-contrib-bigexporters provides additional exporters for the web crawling and scraping framework Scrapy (https://scrapy.org).

The following big data formats are supported:

  • Avro

  • Iceberg

  • ORC

  • Parquet

The library is published using PyPI trusted publishers.

Requirements

  • Python 3.12+

  • Scrapy 2.13+

  • Works on Linux, Windows, macOS, BSD

  • Parquet export requires pyarrow 22.00+ and pandas

  • Avro export requires fastavro 1.12+

  • ORC export requires pyarrow 22.00+ and pandas

  • Iceberg export requires pyiceberg 0.10+, pyarrow 22.00+ and pandas

Install

The quick way (pip):

pip install scrapy-contrib-bigexporters

Alternatively, you can install it from conda-forge:

conda install -c conda-forge scrapy-contrib-bigexporters

Depending on which format you want to use, you need to install one or more of the following libraries.

Avro:

pip install fastavro

Avro is a file format.

Iceberg:

pip install pyiceberg pyarrow pandas

Iceberg is an open table format.

Note: You will most likely need to add further dependencies for Iceberg to work in your setup. See the pyiceberg installation documentation.

ORC:

pip install pyarrow pandas

ORC is a file format.

Parquet:

pip install pyarrow pandas

Parquet is a file format.

Additional libraries may be needed for specific compression algorithms. The open table format (Iceberg) may also require additional libraries to use different filesystems, catalogs and compression formats. See “Use”.
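To check which optional libraries are already present in an environment, a small script along the following lines can help. It is only a sketch: the mapping mirrors the pip commands listed above, while the names REQUIRED_MODULES and missing_modules are hypothetical helpers and not part of scrapy-contrib-bigexporters.

```python
from importlib.util import find_spec

# Optional libraries per export format, mirroring the install commands above.
# REQUIRED_MODULES and missing_modules are illustrative names, not library API.
REQUIRED_MODULES = {
    "avro": ["fastavro"],
    "iceberg": ["pyiceberg", "pyarrow", "pandas"],
    "orc": ["pyarrow", "pandas"],
    "parquet": ["pyarrow", "pandas"],
}


def missing_modules(export_format: str) -> list[str]:
    """Return the required top-level modules that cannot be found."""
    required = REQUIRED_MODULES[export_format.lower()]
    return [name for name in required if find_spec(name) is None]


if __name__ == "__main__":
    for fmt in REQUIRED_MODULES:
        missing = missing_modules(fmt)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{fmt}: {status}")
```

The script only reports whether the modules can be found; it does not check the minimum versions listed under Requirements.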

Use

Using the library is simple. Install it alongside your Scrapy project as described above. You only need to configure the exporter in the Scrapy settings and run your scraper; the data is then exported into your desired format. No additional development is needed.
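As a rough illustration of what such a configuration can look like, the sketch below registers a Parquet exporter in a project's settings.py and points one feed at it. FEED_EXPORTERS and FEEDS are standard Scrapy settings; the dotted exporter class path and the output file name are assumptions made for illustration and should be checked against the project's format-specific documentation.

```python
# settings.py (sketch; values marked "assumed" are not taken from this page)

# Map a feed format name to the exporter class that handles it.
# Assumed class path: verify it against the project's documentation.
FEED_EXPORTERS = {
    "parquet": "zuinnote.scrapy.contrib.bigexporters.ParquetItemExporter",
}

# One output feed. Scrapy fills in %(name)s (spider name) and %(time)s.
FEEDS = {
    "data-%(name)s-%(time)s.parquet": {  # assumed file name pattern
        "format": "parquet",  # must match a key in FEED_EXPORTERS
        "store_empty": False,  # do not write a file when nothing was scraped
    },
}
```

With settings like these in place, running the spider as usual should write the scraped items to the configured file. Format-specific options, such as compression, would typically be passed through the feed's item_export_kwargs entry.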

See here for configuring the exporter in settings:

Source

The source is available at:

Download files

Source Distribution

scrapy_contrib_bigexporters-1.1.0.tar.gz (27.1 kB, source)

Built Distribution

scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl (11.3 kB, Python 3)

File details

Details for the file scrapy_contrib_bigexporters-1.1.0.tar.gz.

File hashes

Hashes for scrapy_contrib_bigexporters-1.1.0.tar.gz

Algorithm    Hash digest
SHA256       bed46e48182979b8e505f0bd7be9b9f1235b4f72df5a959b57e608dfe25525b4
MD5          3c7165b8b2836d621d98e09b4ce69ebc
BLAKE2b-256  7ac0c9b18675a53c5cb508ff05437fa14bb391648810d9bd755f199db5190bff
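
A downloaded file can be compared against the published SHA256 digest with Python's standard hashlib module. The following is a minimal sketch: the helper names sha256_of and matches_published_sha256 are hypothetical, and the script assumes the archive was saved under its original file name in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, expected_hex):
    """Compare a file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest copied from the table above; the local path is an assumption.
    expected = "bed46e48182979b8e505f0bd7be9b9f1235b4f72df5a959b57e608dfe25525b4"
    archive = Path("scrapy_contrib_bigexporters-1.1.0.tar.gz")
    if archive.exists():
        print("match" if matches_published_sha256(archive, expected) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform the same comparison during installation when a requirements file pins hashes and is installed with the --require-hashes option.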

Provenance

The following attestation bundles were made for scrapy_contrib_bigexporters-1.1.0.tar.gz:

Publisher: publish_pypi.yml on ZuInnoTe/scrapy-contrib-bigexporters

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl.

File hashes

Hashes for scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       304145f6bcca82630acfc7c1697fd8c5894b7668eaa004d0daf45f7ee33e2b37
MD5          d03f38f0c3a17cdf02f6a1e011ab23b9
BLAKE2b-256  e353cf279126fa20b372d5f4280f12af97622b7866b70edf33d15999b5556e8a

Provenance

The following attestation bundles were made for scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl:

Publisher: publish_pypi.yml on ZuInnoTe/scrapy-contrib-bigexporters
