
Local version of ScraperWiki libraries

Project description

This is a Python library for scraping web pages and saving data.

Warning: This library is now in maintenance mode.

The library has been updated to work with Python 3.14, but there are no guarantees of future maintenance.

Installing

pip install scraperwiki

Scraping

scraperwiki.scrape(url[, params][, user_agent])

Returns the downloaded string from the given url.

If params is provided, the request is sent as a POST with those parameters.

If user_agent is provided, it sets the User-Agent header for the request.
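
For example (a minimal sketch: the URL, form parameters and user-agent are placeholders, and the optional arguments are assumed to be accepted by the keyword names shown in the signature):

import scraperwiki

# Plain GET request; returns the page body as a string.
html = scraperwiki.scrape("https://example.com/page")

# With params the request is sent as a POST; user_agent overrides the User-Agent header.
result = scraperwiki.scrape(
    "https://example.com/search",
    params={"q": "census"},
    user_agent="my-scraper/0.1",
)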

Saving data

Helper functions for saving data to and querying an SQL database. The schema is updated automatically to match the data you save.

Currently only SQLite is supported: a local SQLite database file is created. The implementation is based on SQLAlchemy.

scraperwiki.sql.save(unique_keys, data[, table_name="swdata"])

Saves a data record into the datastore, in the table given by table_name.

data is a dict object with field names as keys; unique_keys is a subset of data.keys() that determines when an existing record is overwritten. For large numbers of records, data can be a list of dicts.

scraperwiki.sql.save is entitled to buffer an arbitrary number of rows until the next read via the ScraperWiki API, until an exception is raised, or until process exit; an effort is made to flush periodically. Records can be lost if the process experiences a hard crash, a power outage, or a SIGKILL (for example, from the kernel's out-of-memory killer). The buffer can be flushed manually with scraperwiki.sql.flush().
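
A short sketch of typical saves (the table, key and field names are illustrative):

import scraperwiki

# "id" is the unique key: saving another record with the same id overwrites that row.
scraperwiki.sql.save(["id"], {"id": 1, "name": "Alice", "score": 4.2})

# Many records can be saved at once, optionally into a named table.
scraperwiki.sql.save(["id"], [
    {"id": 2, "name": "Bob", "score": 3.7},
    {"id": 3, "name": "Carol", "score": 4.9},
], table_name="people")

# Force any buffered rows to be written now.
scraperwiki.sql.flush()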

scraperwiki.sql.execute(sql[, vars])

Executes an arbitrary SQL command, for example CREATE, DELETE, INSERT or DROP.

vars is an optional list of parameters, inserted when the SQL command contains ‘?’s. For example:

scraperwiki.sql.execute("INSERT INTO swdata VALUES (?,?,?)", [a,b,c])

The ‘?’ convention is like “paramstyle qmark” from Python’s DB API 2.0 (but note that the API to the datastore is nothing like Python’s DB API). In particular, the ‘?’ does not itself need quoting and can in general only be used where a literal would appear. (Note that you cannot substitute in, for example, table or column names.)

scraperwiki.sql.select(sqlfrag[, vars])

Executes a select command on the datastore. For example:

scraperwiki.sql.select("* FROM swdata LIMIT 10")

Returns a list of dicts, one for each selected row.

vars is an optional list of parameters, substituted where the select command contains ‘?’s, just as in the .execute command above.
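
For example, selecting with ‘?’ parameters (the table and column names are illustrative and assume data saved as in the sketch above):

rows = scraperwiki.sql.select("* FROM people WHERE score > ? LIMIT ?", [4.0, 10])
for row in rows:
    print(row["name"], row["score"])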

scraperwiki.sql.commit()

Functionality now removed: the function is retained for compatibility but does nothing (sql.save commits automatically after every action).

scraperwiki.sql.show_tables([dbname])

Returns an array of tables and their schemas in the current database.
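
A quick way to inspect the datastore while developing a scraper:

print(scraperwiki.sql.show_tables())   # e.g. the swdata and swvariables tables and their schemas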

scraperwiki.sql.save_var(key, value)

Saves an arbitrary single value into a table called swvariables. Intended for storing scraper state so that a scraper can continue after an interruption.

scraperwiki.sql.get_var(key[, default])

Retrieves a single value that was saved by save_var. Only works for string, float, and int types. For anything else, use the pickle library to turn the value into a string first.
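
A sketch of persisting scraper state between runs (the variable names are illustrative; pickling to a hex string is just one way to turn a non-primitive value into a string):

import pickle
import scraperwiki

# Resume from the last processed page, defaulting to 0 on the first run.
last_page = scraperwiki.sql.get_var("last_page", 0)

# ... scrape page last_page + 1 ...

scraperwiki.sql.save_var("last_page", last_page + 1)

# Non-primitive values must be serialised to a string first.
seen_ids = {101, 102, 103}
scraperwiki.sql.save_var("seen_ids", pickle.dumps(seen_ids).hex())
restored = pickle.loads(bytes.fromhex(scraperwiki.sql.get_var("seen_ids")))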

Miscellaneous

scraperwiki.status(type, message=None)

Functionality now removed, since it was only for the now-defunct ScraperWiki platform. The function is retained but always returns None.

scraperwiki.pdftoxml(pdfdata)

Converts a byte string containing a PDF file into XML giving the coordinates and font of each text string (see the pdftohtml documentation for details). This requires pdftohtml, which is part of poppler-utils.
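
For example (the file name is a placeholder; pdftohtml must be installed):

import scraperwiki

with open("report.pdf", "rb") as f:
    pdfdata = f.read()

# XML describing the position and font of each text string in the PDF.
xml = scraperwiki.pdftoxml(pdfdata)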

Environment Variables

SCRAPERWIKI_DATABASE_NAME

default: scraperwiki.sqlite - the name of the database file

SCRAPERWIKI_DATABASE_TIMEOUT

default: 300 - the number of seconds the database will wait for a lock
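
A sketch of overriding these from Python, assuming the variables are read when the library first opens the database (the file name and timeout are illustrative):

import os

os.environ["SCRAPERWIKI_DATABASE_NAME"] = "mydata.sqlite"
os.environ["SCRAPERWIKI_DATABASE_TIMEOUT"] = "60"

import scraperwiki   # imported after setting the variables (assumed to be read on first use)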

Download files

Download the file for your platform.

Source Distribution

scraperwiki-1.0.0.tar.gz (12.0 kB)


Built Distribution


scraperwiki-1.0.0-py3-none-any.whl (9.1 kB)


File details

Details for the file scraperwiki-1.0.0.tar.gz.

File metadata

  • Download URL: scraperwiki-1.0.0.tar.gz
  • Size: 12.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraperwiki-1.0.0.tar.gz

  • SHA256: 32c7fbf8ccb5d039132eb19d79f9f896da185585c184b1e958ff448aa8b076d8
  • MD5: 7815878ff28eeae74ed41042f811161d
  • BLAKE2b-256: a861fa5a8335d771cc35ac0e20368ee20dcf2255d0a165984ba932ed670b7461


Provenance

The following attestation bundles were made for scraperwiki-1.0.0.tar.gz:

Publisher: ci-build.yml on cantabular/scraperwiki-python


File details

Details for the file scraperwiki-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: scraperwiki-1.0.0-py3-none-any.whl
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraperwiki-1.0.0-py3-none-any.whl

  • SHA256: 1d6839d29e1b931b1eeb8ec21fc35d00234b3830805a8116ede49158ab3f2052
  • MD5: 83dc6285718bcfc0b15b8d35a835e0c4
  • BLAKE2b-256: e4d8da2e63891263f82c547f77b38de5d42e26778cf44b8b8fcb7ad55f0b0418


Provenance

The following attestation bundles were made for scraperwiki-1.0.0-py3-none-any.whl:

Publisher: ci-build.yml on cantabular/scraperwiki-python

