Provides a SQLAlchemy-based cache storage backend, a Selenium middleware, and a few other utilities for working with Scrapy.

Scrachy

Scrachy was primarily developed to provide a flexible cache storage backend for Scrapy that stores its data in a relational database using SQLAlchemy. It now also includes other features, such as middleware that uses Selenium to download requests and a downloader middleware that can optionally ignore requests that are already in the cache.

Install

You can install the latest version from git:

>pip install git+https://bitbucket.org/reidswanson/scrachy.git

or from PyPI:

>pip install scrachy

NOTE: libnss3 must be installed for chromedriver to work. For example: sudo apt install libnss3

NOTE: At least on Ubuntu, you must install chromium-chromedriver. The version downloaded by the webdriver manager fails with an error about the user data directory.

Documentation

A brief guide to minimal use of the cache storage engine and the Selenium middleware is given below. For other configuration options and features, please see the full documentation on Read the Docs.

Storage Backend

To (minimally) use the storage backend you simply need to enable caching by adding the following to your settings.py file:

# Enable caching
HTTPCACHE_ENABLED = True

# Set the storage backend to the one provided by Scrachy.
HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'

# One of the supported SQLAlchemy dialects
SCRACHY_DB_DIALECT = '<database-dialect>'

# The name of the driver to use (it must be installed as an extra).
SCRACHY_DB_DRIVER = '<database-driver>'

# Options for connecting to the database
SCRACHY_DB_HOST = '<database-hostname>'
SCRACHY_DB_PORT = '<database-port>'
SCRACHY_DB_SCHEMA = '<database-schema>'
SCRACHY_DB_DATABASE = '<database-name>'
SCRACHY_DB_USERNAME = '<username>'

# Note, do not store this value in the settings file. Use an environment
# variable or python-dotenv.
SCRACHY_DB_PASSWORD = '<password>'

# A dictionary of other connection arguments
SCRACHY_DB_CONNECT_ARGS = {}

# There may be a conflict with the compression middleware. If you encounter
# errors, either disable it or move it after the caching middleware.
DOWNLOADER_MIDDLEWARES = {
   ...
   'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': None,
}
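
The comment on SCRACHY_DB_PASSWORD advises keeping the password out of settings.py. Below is a minimal sketch of one way to do that with only the standard library. It assumes the password has been exported in an environment variable that is also named SCRACHY_DB_PASSWORD; that variable name and the read_db_password helper are illustrative choices, not part of Scrachy.

```python
import os


def read_db_password(env=os.environ, name="SCRACHY_DB_PASSWORD"):
    """Return the database password from the environment, or None if unset.

    Both the helper and the variable name are hypothetical; adapt them to
    however your deployment provides secrets.
    """
    return env.get(name)


# In settings.py: the secret is read at import time and never written to the file.
SCRACHY_DB_PASSWORD = read_db_password()
```

If you prefer python-dotenv, loading the .env file before this line runs should populate os.environ in the same way, so the rest of the sketch stays unchanged.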

Selenium

There are two Selenium middleware classes provided by Scrachy. To use them, first add one of them to the DOWNLOADER_MIDDLEWARES setting:

DOWNLOADER_MIDDLEWARES = {
    ...
    'scrachy.middleware.selenium.SeleniumMiddleware': 800,  # or AsyncSeleniumMiddleware
    ...
}

Then, in your spider's parsing code, use a SeleniumRequest instead of a scrapy.http.Request.

License

Scrachy is released under the GNU Lesser General Public License. See the LICENSE file for more details. Files that are adapted from, or use code from, other sources are indicated either at the top of the file or at the location of the code snippet. Some of these files were adapted from code released under a 3-clause BSD license. Those files should indicate the original copyright in a comment at the top of the file. See the BSD_LICENSE file for details of this license.
