A web crawler to scrape documents from websites
Project description
Overview
The docscraper package is a scrapy spider for crawling a given set of websites and downloading all available documents with a given set of file extensions. The package is intended to be called from a Python script.
Getting Started
You can get started by installing the package with pip:
$ pip install docscraper
Once the package is installed, you can call it directly from your Python script to download files from websites as follows:
>>> import docscraper
>>> allowed_domains = ["books.toscrape.com"]
>>> start_urls = ["https://books.toscrape.com"]
>>> extensions = [".html", ".pdf", ".docx", ".doc", ".svg"]
>>> docscraper.crawl(allowed_domains, start_urls, extensions=extensions)
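The crawl() call wraps the package's scrapy spider. For a sense of what it automates, here is a minimal standalone sketch of a comparable spider: a scrapy CrawlSpider that follows in-domain links and saves responses whose URLs end in the requested extensions. The class name, file-naming logic, and settings below are illustrative assumptions, not docscraper's actual internals.
import os
from urllib.parse import urlparse

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DocSpider(CrawlSpider):
    name = "docspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]
    extensions = (".html", ".pdf", ".docx", ".doc", ".svg")

    # LinkExtractor normally skips links to document files (pdf, doc, svg, ...),
    # so deny_extensions is cleared to let those links through.
    rules = (
        Rule(LinkExtractor(deny_extensions=[]), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Save the response body when the URL path ends with a wanted extension.
        path = urlparse(response.url).path
        if path.endswith(self.extensions):
            filename = os.path.basename(path) or "index.html"
            with open(filename, "wb") as f:
                f.write(response.body)

if __name__ == "__main__":
    process = CrawlerProcess(settings={"ROBOTSTXT_OBEY": True})
    process.crawl(DocSpider)
    process.start()  # blocks until the crawl finishes
One general scrapy caveat applies here: scrapy runs on a Twisted reactor that can only be started once per process, so it is safest to run each crawl in a fresh Python process.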