Scraping library to retrieve data from useful pages, such as Amazon wishlists

API

The library exposes the following functions for scraping data and listing spiders:

  • scrape(SPIDER_NAME, URL): scrapes the given URL using the spider referenced by SPIDER_NAME.

  • spiders(): lists all spiders found by the library.
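The two calls can be combined to check that a spider exists before scraping with it. The helper below is a minimal sketch, not part of the library: the name scrape_with_known_spider is hypothetical, and it assumes that spiders() returns an iterable of spider names (the exact return type is not documented here).

```python
def scrape_with_known_spider(sf, spider_name, url):
    """Scrape `url` only if `spider_name` is reported by `sf.spiders()`.

    `sf` is the imported scraper_factory module, or any object exposing
    `spiders()` and `scrape(name, url)`.
    """
    # Assumption: spiders() yields spider names such as 'amazon-wishlist'.
    available = list(sf.spiders())
    if spider_name not in available:
        raise ValueError(
            f"Unknown spider {spider_name!r}; available: {available}"
        )
    return sf.scrape(spider_name, url)


# Intended usage (requires the package to be installed):
#   import scraper_factory as SF
#   items = scrape_with_known_spider(SF, 'amazon-wishlist', WISHLIST_URL)
```

Passing the module in as an argument keeps the helper independent of how the library is imported.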

Custom Spiders

Custom spiders are supported, as long as they meet the following requirements:

  • The spider is implemented as a class that inherits from BaseSpider.

  • The spider file is located either in scraper_factory/spiders or in a custom directory, in which case the environment variable $SPIDER_PATH must be set to that directory.
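For a custom location, the environment variable can also be set from Python. The following is a minimal sketch: the helper name use_custom_spider_dir is hypothetical, and since this page does not say when the library reads $SPIDER_PATH, the sketch assumes the safest order is to set it before importing scraper_factory.

```python
import os
from pathlib import Path


def use_custom_spider_dir(directory):
    """Point $SPIDER_PATH at `directory` and return the resolved path."""
    path = Path(directory).expanduser().resolve()
    if not path.is_dir():
        raise NotADirectoryError(f"{path} is not a directory")
    # Assumption: the library looks up SPIDER_PATH in the process environment.
    os.environ["SPIDER_PATH"] = str(path)
    return path


# Intended usage (the directory name is only an example):
#   use_custom_spider_dir('~/my_spiders')
#   import scraper_factory as SF
#   SF.spiders()
```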

Usage example

>>> import scraper_factory as SF
>>> SF.scrape('amazon-wishlist', 'https://www.amazon.com/hz/wishlist/ls/24XY9873RPAYN')
[{
    'id': 'I1MZVK8RDPYK8P',
    'title': 'AmazonBasics Heavy Weight Ruled Lined Index Cards, White, 3x5 Inch Card, 100-Count - AMZ63500',
    'byline': None,
    'price': None,
    'link': 'https://www.amazon.com/dp/B06XSRLP51/',
    'img': 'https://images-na.ssl-images-amazon.com/images/I/71i7LVTzpsL._SS135_.jpg'
}, {
    'id': 'I14TUJ6TADACU5',
    'title': "Women's Walking Shoes Sock Sneakers - Mesh Slip On Air Cushion Lady Girls Modern Jazz Dance Easy Shoes Platform Loafers",
    'byline': None,
    'price': None,
    'link': 'https://www.amazon.com/dp/B07MWCDJ9X/',
    'img': 'https://images-na.ssl-images-amazon.com/images/I/61sHA7o-bxL._SS135_.jpg'
}, {
    'id': 'I3C97JA2JR06PN',
    'title': 'Tenergy Redigrill\xa0Smoke-Less Infrared Grill, Indoor Grill, Heating\xa0Electric Tabletop Grill, Non-Stick Easy to Clean\xa0BBQ Grill, for Party/Home, ETL Certified',
    'byline': None,
    'price': '$179.99',
    'link': 'https://www.amazon.com/dp/B07BZ412HT/',
    'img': 'https://images-na.ssl-images-amazon.com/images/I/41uGvSPg-ML._SS135_.jpg'
}, {
    'id': 'I1C7RJI2H0VWZ7',
    'title': 'Shelf Liners for Wire Shelf Liner Set of 4 - Graphite (14-Inch-by-36-Inch)',
    'byline': None,
    'price': '$29.99',
    'link': 'https://www.amazon.com/dp/B01N9V4A9A/',
    'img': 'https://images-na.ssl-images-amazon.com/images/I/71Lg6J7sGHL._SS135_.jpg'
},
...]
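As the output shows, the result is a list of dictionaries, and 'price' may be None. The sketch below is one way to post-process such a list. The helper priced_items is hypothetical, and it assumes prices are strings formatted like '$179.99', as in the sample above. The sample list reuses the ids and prices from that output.

```python
from decimal import Decimal


def priced_items(items):
    """Return (id, price) pairs for the items whose 'price' is not None."""
    result = []
    for item in items:
        price = item.get("price")
        if price is None:
            continue
        # Assumption: prices look like '$179.99'; strip the symbol and commas.
        amount = Decimal(price.replace("$", "").replace(",", ""))
        result.append((item["id"], amount))
    return result


# Subset of the fields returned in the example above.
sample = [
    {"id": "I1MZVK8RDPYK8P", "price": None},
    {"id": "I14TUJ6TADACU5", "price": None},
    {"id": "I3C97JA2JR06PN", "price": "$179.99"},
    {"id": "I1C7RJI2H0VWZ7", "price": "$29.99"},
]

priced = priced_items(sample)
total = sum(amount for _, amount in priced)
print(total)  # prints 209.98
```

Decimal is used instead of float so that summing currency amounts does not introduce rounding errors.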

Installation

Latest release through PyPI:

$ pip install scraper_factory

Development version:

$ git clone git@github.com:machinia/scraper-factory.git
$ cd scraper-factory
$ pip install -e .
