A web crawler to scrape documents from websites

Project description

Overview

The docscraper package is a scrapy spider that crawls a given set of websites and downloads every available document whose file extension is in a given set. The package is intended to be called from a Python script.
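
To make the idea of selecting documents by file extension concrete, here is a minimal sketch of one way such a check could work. It is not taken from docscraper's source: the helper name `has_wanted_extension` is hypothetical, and the package's own matching rules may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def has_wanted_extension(url, extensions):
    """Return True if the URL's path ends in one of the given extensions.

    Hypothetical helper for illustration only; not part of docscraper.
    """
    # Look only at the path, so query strings such as "?download=1" are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    # Compare case-insensitively, so ".PDF" matches ".pdf".
    return suffix in {ext.lower() for ext in extensions}
```

A crawler built this way would download a link only when the check returns True and would otherwise follow the link in search of more documents.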

Getting Started

You can get started by installing the package with pip:

$ pip install docscraper

Once the package is installed, you can call it directly from a Python script to download files from websites, as follows:

>>> import docscraper
>>> allowed_domains = ["books.toscrape.com"]
>>> start_urls = ["https://books.toscrape.com"]
>>> extensions = [".html", ".pdf", ".docx", ".doc", ".svg"]
>>> docscraper.crawl(allowed_domains, start_urls, extensions=extensions)
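
In the example above, `allowed_domains` repeats the host name that already appears in `start_urls`. If you keep a longer list of start URLs, a small helper can derive the domain list for you. The sketch below uses only the standard library. The name `domains_from_urls` is hypothetical and not part of docscraper, and it assumes each start URL includes a scheme such as `https://`.

```python
from urllib.parse import urlparse


def domains_from_urls(start_urls):
    """Return the unique host names found in start_urls, in first-seen order.

    Hypothetical helper for illustration only; not part of docscraper.
    """
    domains = []
    for url in start_urls:
        # hostname is None when the URL has no scheme, so skip those entries.
        host = urlparse(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains
```

The result can then be passed as the first argument to `docscraper.crawl` in place of a hand-written `allowed_domains` list.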

Download files

Source Distribution

docscraper-2.0.7.tar.gz (10.1 kB)

Uploaded Source

Built Distribution

docscraper-2.0.7-py3-none-any.whl (12.5 kB)

Uploaded Python 3

File details

Details for the file docscraper-2.0.7.tar.gz.

File metadata

  • Download URL: docscraper-2.0.7.tar.gz
  • Upload date:
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.8

File hashes

Hashes for docscraper-2.0.7.tar.gz

Algorithm    Hash digest
SHA256       29a5362ffcc939366e2dd0380795ebd78f887be378421eabd5447d19f7d9cdf3
MD5          f8e96169b740230a6f6fd55d7390a5ca
BLAKE2b-256  bae4d0536d5a703316207221e815cc7382c91f5dee7c585d6c8236b1855ce566

File details

Details for the file docscraper-2.0.7-py3-none-any.whl.

File metadata

  • Download URL: docscraper-2.0.7-py3-none-any.whl
  • Upload date:
  • Size: 12.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.8.8

File hashes

Hashes for docscraper-2.0.7-py3-none-any.whl

Algorithm    Hash digest
SHA256       25cbe46842e75cc6576cbc305b418688a60f5bb5c79a221055c2b8f6d1293d1d
MD5          eff9fad3300deef827b6acf19f8deeb1
BLAKE2b-256  8af418140f3ec189a01bbd5337d71223a4976039104a3af070f8cde1073a3ee0
