Scrapy exporter for Big Data formats
Overview
scrapy-contrib-bigexporters provides additional exporters for the web crawling and scraping framework Scrapy (https://scrapy.org).
The following big data formats are supported:
Avro: https://avro.apache.org/
Iceberg: https://iceberg.apache.org/
ORC: https://orc.apache.org/
Parquet: https://parquet.apache.org/
The library is published on PyPI using trusted publishers.
Requirements
Python 3.12+
Scrapy 2.13+
Works on Linux, Windows, macOS, BSD
Parquet export requires pyarrow 22.0+ and pandas
Avro export requires fastavro 1.12+
ORC export requires pyarrow 22.0+ and pandas
Iceberg export requires pyiceberg 0.10+, pyarrow 22.0+ and pandas
Install
The quick way (pip):
pip install scrapy-contrib-bigexporters
Alternatively, you can install it from conda-forge:
conda install -c conda-forge scrapy-contrib-bigexporters
Depending on which format you want to use, you need to install one or more of the following libraries.
Avro:
pip install fastavro
Avro is a file format.
Iceberg:
pip install pyiceberg pyarrow pandas
Iceberg is an open table format.
Note: most likely you will need to add extra dependencies (e.g. for your catalog or filesystem) so that Iceberg works in your environment. See the pyiceberg installation documentation.
ORC:
pip install pyarrow pandas
ORC is a file format.
Parquet:
pip install pyarrow pandas
Parquet is a file format.
Additional libraries may be needed for specific compression algorithms. The open table format (Iceberg) may also require additional libraries to use different filesystems, catalogs and compression formats. See “Use”.
Use
Use of the library is simple: install it in your Scrapy project as described above, configure the exporter in the Scrapy settings, and run your scraper; the data will be exported into your desired format. No development is needed.
See the project documentation for configuring the exporter in the settings.
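As a minimal sketch, a Parquet export could be configured roughly like this in the project's settings.py. The exporter class path and the output file name here are assumptions; verify them against the documentation of the version you installed:

```python
# settings.py -- minimal sketch; verify the class path and options against
# the scrapy-contrib-bigexporters documentation for your installed version.
FEED_EXPORTERS = {
    "parquet": "zuinnote.scrapy.contrib.bigexporters.ParquetItemExporter",
}

FEEDS = {
    "quotes.parquet": {  # hypothetical output file name
        "format": "parquet",
        "encoding": "utf8",
    },
}
```

Running `scrapy crawl <yourspider>` then writes the scraped items to the configured file; the other supported formats (Avro, ORC, Iceberg) are configured the same way with their respective exporter classes.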
Source
The source is available at:
Codeberg (a non-commercial, European-hosted Git platform for open source): https://codeberg.org/ZuInnoTe/scrapy-contrib-bigexporters
GitHub (a US-hosted, commercial Git platform): https://github.com/ZuInnoTe/scrapy-contrib-bigexporters
File details
Details for the file scrapy_contrib_bigexporters-1.1.0.tar.gz.
File metadata
- Download URL: scrapy_contrib_bigexporters-1.1.0.tar.gz
- Size: 27.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | bed46e48182979b8e505f0bd7be9b9f1235b4f72df5a959b57e608dfe25525b4 |
| MD5 | 3c7165b8b2836d621d98e09b4ce69ebc |
| BLAKE2b-256 | 7ac0c9b18675a53c5cb508ff05437fa14bb391648810d9bd755f199db5190bff |
Provenance
The following attestation bundles were made for scrapy_contrib_bigexporters-1.1.0.tar.gz:
Publisher: publish_pypi.yml on ZuInnoTe/scrapy-contrib-bigexporters
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapy_contrib_bigexporters-1.1.0.tar.gz
- Subject digest: bed46e48182979b8e505f0bd7be9b9f1235b4f72df5a959b57e608dfe25525b4
- Sigstore transparency entry: 1077060162
- Permalink: ZuInnoTe/scrapy-contrib-bigexporters@df8bddfac5aa0e4ca611a2c84e99d32b132c4ab4
- Branch / Tag: refs/tags/scb-1.1.0
- Owner: https://github.com/ZuInnoTe
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_pypi.yml@df8bddfac5aa0e4ca611a2c84e99d32b132c4ab4
- Trigger Event: push
File details
Details for the file scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl.
File metadata
- Download URL: scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl
- Size: 11.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 304145f6bcca82630acfc7c1697fd8c5894b7668eaa004d0daf45f7ee33e2b37 |
| MD5 | d03f38f0c3a17cdf02f6a1e011ab23b9 |
| BLAKE2b-256 | e353cf279126fa20b372d5f4280f12af97622b7866b70edf33d15999b5556e8a |
Provenance
The following attestation bundles were made for scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl:
Publisher: publish_pypi.yml on ZuInnoTe/scrapy-contrib-bigexporters
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapy_contrib_bigexporters-1.1.0-py3-none-any.whl
- Subject digest: 304145f6bcca82630acfc7c1697fd8c5894b7668eaa004d0daf45f7ee33e2b37
- Sigstore transparency entry: 1077060222
- Permalink: ZuInnoTe/scrapy-contrib-bigexporters@df8bddfac5aa0e4ca611a2c84e99d32b132c4ab4
- Branch / Tag: refs/tags/scb-1.1.0
- Owner: https://github.com/ZuInnoTe
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_pypi.yml@df8bddfac5aa0e4ca611a2c84e99d32b132c4ab4
- Trigger Event: push