A web crawler to scrape documents from websites

Project description

Overview

The docscraper package is a scrapy spider that crawls a given set of websites and downloads all available documents matching a given set of file extensions. The package is intended to be called from a Python script.

Getting Started

You can get started by installing the package with pip:

$ pip install docscraper

Once the package is installed, you can call it directly from a Python script (or the interactive interpreter) to download files from websites as follows:

>>> import docscraper
>>> allowed_domains = ["books.toscrape.com"]
>>> start_urls = ["https://books.toscrape.com"]
>>> extensions = [".html", ".pdf", ".docx", ".doc", ".svg"]
>>> docscraper.crawl(allowed_domains, start_urls, extensions=extensions)
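Conceptually, the spider follows links within the allowed domains and keeps any response whose URL ends in one of the target extensions. A minimal, hypothetical sketch of that filtering step is shown below; the function name `wants` and its logic are illustrative assumptions, not docscraper's actual internals:

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def wants(url: str, extensions: list[str]) -> bool:
    """Return True if the URL's path ends with one of the target extensions.

    Note: illustrative helper only, not part of the docscraper API.
    """
    # Compare against the URL path, ignoring query strings and fragments.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in {e.lower() for e in extensions}

wants("https://books.toscrape.com/catalogue/page-2.html", [".html", ".pdf"])  # True
wants("https://books.toscrape.com/media/cover.jpg", [".html", ".pdf"])        # False
```

Extensions are matched case-insensitively here so that, for example, `.PDF` and `.pdf` are treated the same; whether docscraper itself does this is not stated on this page.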

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docscraper-2.0.7.tar.gz (10.1 kB)

Built Distribution

docscraper-2.0.7-py3-none-any.whl (12.5 kB)
