Scrapy pipeline to store items into BigQuery

scrapy-bigquery

A Scrapy item pipeline that stores scraped items in Google BigQuery.

Dependencies :globe_with_meridians:

Installation :inbox_tray:

This is a Python package hosted on PyPI. To install it, run the following command:

pip install scrapy-bigquery

Settings

BIGQUERY_DATASET (Required)

The name of the BigQuery dataset to post to.

BIGQUERY_TABLE (Required)

The name of the BigQuery table, within the dataset, to post to.

BIGQUERY_SERVICE_ACCOUNT (Required)

The Base64-encoded JSON key of the Google service account used to authenticate with Google BigQuery. You can generate it from a service account key file like so:

cat service-account.json | jq . -c | base64
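If `jq` is not available, an equivalent value can be produced in Python. The following is a minimal sketch and not part of the package; the function name `encode_service_account` is hypothetical. It compacts the JSON (roughly what `jq . -c` does) and Base64-encodes it as a single line.

```python
import base64
import json


def encode_service_account(json_text: str) -> str:
    """Return the compact, Base64-encoded form of a service account JSON document."""
    # Parse first so that invalid JSON fails early, then re-serialize
    # without insignificant whitespace.
    compact = json.dumps(json.loads(json_text), separators=(",", ":"))
    # b64encode never inserts line breaks, so the result is a single line.
    return base64.b64encode(compact.encode("utf-8")).decode("ascii")
```

To use it, read the key file (assumed here to be named `service-account.json`, as in the command above), pass its contents to the function, and paste the result into `BIGQUERY_SERVICE_ACCOUNT`.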

BIGQUERY_ADD_SCRAPED_TIME (Optional)

Whether to add the time the item was scraped when posting it to BigQuery. When enabled, the current datetime is written to the scraped_time column of the BigQuery table.

BIGQUERY_ADD_SCRAPER_NAME (Optional)

Whether to add the name of the scraper when posting the item to BigQuery. When enabled, the scraper's name is written to the scraper column of the BigQuery table.
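Both optional settings are described as on/off switches, so a settings file would presumably enable them as in the sketch below. Treating them as booleans is an assumption; the descriptions above do not state the accepted values or the defaults.

```python
# settings.py (sketch; boolean values are an assumption)
BIGQUERY_ADD_SCRAPED_TIME = True  # write the current datetime to the scraped_time column
BIGQUERY_ADD_SCRAPER_NAME = True  # write the scraper's name to the scraper column
```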

Usage example :eyes:

To use this pipeline, add the following settings to your Scrapy project, substituting your own values:

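BIGQUERY_DATASET = "my_dataset"  # dataset IDs allow only letters, digits, and underscores
BIGQUERY_TABLE = "my_table"
BIGQUERY_SERVICE_ACCOUNT = "eyJ0eX=="  # placeholder; use your own Base64-encoded key
ITEM_PIPELINES = {
    "bigquerypipeline.pipelines.BigQueryPipeline": 301,
}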

If the dataset or table does not exist, the pipeline attempts to create it, inferring column types from the dictionaries it processes. This inference can be unreliable, especially when items contain null values, so it is recommended that you create the table before running the scraper.
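To illustrate why inferring types from item values is fragile, here is a minimal sketch of the general technique. It is not the package's actual implementation; `infer_type`, `infer_schema`, and the type mapping are hypothetical. A `None` value carries no type information, so a column whose value is null cannot be typed from that item alone.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Hypothetical mapping from Python types to BigQuery column types.
# bool is listed before int because bool is a subclass of int in Python.
_TYPE_MAP = [
    (bool, "BOOLEAN"),
    (int, "INTEGER"),
    (float, "FLOAT"),
    (datetime, "TIMESTAMP"),
    (str, "STRING"),
]


def infer_type(value: Any) -> Optional[str]:
    """Return a BigQuery type name for value, or None if it cannot be inferred."""
    for python_type, bigquery_type in _TYPE_MAP:
        if isinstance(value, python_type):
            return bigquery_type
    return None  # covers None and any unsupported type


def infer_schema(item: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Infer a column-name-to-type mapping from a single item."""
    return {key: infer_type(value) for key, value in item.items()}


item = {"title": "Widget", "price": 9.99, "in_stock": True, "rating": None}
schema = infer_schema(item)
# "rating" maps to None: the null value gives no hint whether the column
# should be INTEGER, FLOAT, or something else.
```

Creating the table with an explicit schema ahead of time avoids this ambiguity.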

To send a specific item to a different table, add the keys "BIGQUERY_DATASET" and "BIGQUERY_TABLE" to the item you return to the pipeline. These override the destination for that item, which lets a single scraper handle more than one item type. The two keys and their values are not written to the table.
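As a sketch of this per-item routing, the helper below attaches the two override keys to a plain dictionary item. The helper name `with_destination` and the dataset and table names are hypothetical; only the key names `BIGQUERY_DATASET` and `BIGQUERY_TABLE` come from the description above.

```python
from typing import Any, Dict


def with_destination(item: Dict[str, Any], dataset: str, table: str) -> Dict[str, Any]:
    """Return a copy of item carrying the per-item BigQuery override keys."""
    routed = dict(item)  # copy so the caller's dictionary is not mutated
    routed["BIGQUERY_DATASET"] = dataset
    routed["BIGQUERY_TABLE"] = table
    return routed


# Two item types from the same scraper, sent to different (hypothetical) tables.
product = with_destination({"name": "Widget", "price": 9.99}, "shop", "products")
review = with_destination({"product": "Widget", "stars": 4}, "shop", "reviews")
```

In a spider callback, each of these dictionaries would be yielded like any other item.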

License :memo:

The project is available under the MIT License.
