Scrapy spider for parsing XLS files.

XLS Scrapy Spider

Scrapy spider for parsing XLS files. It receives an XLS file in a response, iterates through each of its rows, and calls parse_row with the sheet name and a dict containing each field's data. Inspired by the built-in CSV feed Scrapy spider (CSVFeedSpider).

It is built on top of the openpyxl and pandas libraries. You can set some options for the XLS file, such as the names of the sheets to read and the number of rows to skip. If no sheets are provided, it returns all available sheets.
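The row handling described above can be illustrated with a small standard-library sketch. This is not the package's implementation: the function name iter_sheet_rows, the in-memory workbook mapping, and the rule that the first row after the skipped ones is the header are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Row = List[Any]


def iter_sheet_rows(
    workbook: Dict[str, List[Row]],
    sheets: Optional[List[str]] = None,
    skip_rows: int = 0,
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (sheet_name, row_dict) pairs for the selected sheets.

    Hypothetical helper: `workbook` maps a sheet name to its raw rows.
    """
    # With no explicit selection, fall back to every sheet in the workbook.
    selected = sheets if sheets is not None else list(workbook)
    for sheet_name in selected:
        # Drop the leading rows that come before the header row.
        rows = workbook[sheet_name][skip_rows:]
        if not rows:
            continue
        header, *data = rows
        for values in data:
            # Pair each column header with the cell value in the same position.
            yield sheet_name, dict(zip(header, values))


if __name__ == "__main__":
    demo = {
        "FRUITS": [["Catalog"], ["NAME", "PRICE"], ["Apple", 1.5]],
        "BEVERAGES": [["Catalog"], ["NAME", "PRICE"], ["Tea", 3.0]],
    }
    for sheet, row in iter_sheet_rows(demo, sheets=["FRUITS"], skip_rows=1):
        print(sheet, row)
```

Each yielded pair corresponds to one parse_row call, with the sheet name and the row dict as its arguments.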

Installation

pip install scrapy_xls

How to use

from scrapy_xls import XLSSpider


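class SuperMarketCatalogSpider(XLSSpider):

    # Scrapy requires every spider to declare a unique name
    name = 'supermarket_catalog'

    # Each URL is expected to return an XLS file in its response
    start_urls = [
        'https://www.supermarket.com/files/catalog-oct-22.xlsx',
        'https://www.supermarket.com/files/catalog-set-22.xlsx',
    ]

    skip_rows = 3  # Assuming that the sheet headers start at the 4th row
    sheets = ['FRUITS', 'BEVERAGES']  # Only these sheets are parsed

    def parse_row(self, response, sheet, row):
        # Called once per row: `sheet` is the sheet name,
        # `row` is a dict mapping each column header to its value
        item = {}
        item['department'] = sheet
        item['product'] = row['NAME']
        item['price'] = row['PRICE']
        item['brand'] = row['BRAND']
        yield item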
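Spreadsheet rows often contain empty cells, which pandas represents as NaN. Whether XLSSpider passes such values through to parse_row unchanged is an assumption here, so the following is only a sketch of one way to tidy a row before building an item. The helpers is_blank and clean_row are hypothetical and not part of scrapy_xls.

```python
import math
from typing import Any, Dict


def is_blank(value: Any) -> bool:
    """Return True for None, NaN, or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def clean_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Drop blank cells and strip surrounding whitespace from strings."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in row.items()
        if not is_blank(value)
    }
```

Inside parse_row, one could call row = clean_row(row) first and then read optional columns with row.get('BRAND') rather than row['BRAND'], so that a missing brand does not raise a KeyError.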

Contributing

Feel free to contribute to the project through its Git repository.
