
Project description

a command-line web scraping tool

scrape is a rule-based web crawler and information extraction tool capable of manipulating and merging new and existing documents. XML Path Language (XPath) and regular expressions are used to define rules for filtering content and web traversal. Output may be converted to text, CSV, PDF, and/or HTML formats.
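
For example, a single invocation can save a page as plain text or as a PDF (a minimal sketch; the URL is a placeholder, and the installed entry point is assumed to be scrape):

scrape https://example.com --text
scrape https://example.com --pdf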

Installation

pip install scrape

or

pip install git+https://github.com/huntrar/scrape.git#egg=scrape

or

git clone https://github.com/huntrar/scrape
cd scrape
python setup.py install

You must install wkhtmltopdf to save files as PDF.
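
wkhtmltopdf is typically available from your system package manager or from the project's website; for example, on Debian or Ubuntu (an assumption about your platform; package names may differ elsewhere):

sudo apt-get install wkhtmltopdf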

Usage

usage: scrape.py [-h] [-a [ATTRIBUTES [ATTRIBUTES ...]]] [-all]
                 [-c [CRAWL [CRAWL ...]]] [-C] [--csv] [-cs [CACHE_SIZE]]
                 [-f [FILTER [FILTER ...]]] [--html] [-i] [-m]
                 [-max MAX_CRAWLS] [-n] [-ni] [-no] [-o [OUT [OUT ...]]] [-ow]
                 [-p] [-pt] [-q] [-s] [-t] [-v] [-x [XPATH]]
                 [QUERY [QUERY ...]]

a command-line web scraping tool

positional arguments:
  QUERY                 URLs/files to scrape

optional arguments:
  -h, --help            show this help message and exit
  -a [ATTRIBUTES [ATTRIBUTES ...]], --attributes [ATTRIBUTES [ATTRIBUTES ...]]
                        extract text using tag attributes
  -all, --crawl-all     crawl all pages
  -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                        regexp rules for following new pages
  -C, --clear-cache     clear requests cache
  --csv                 write files as csv
  -cs [CACHE_SIZE], --cache-size [CACHE_SIZE]
                        size of page cache (default: 1000)
  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                        regexp rules for filtering text
  --html                write files as HTML
  -i, --images          save page images
  -m, --multiple        save to multiple files
  -max MAX_CRAWLS, --max-crawls MAX_CRAWLS
                        max number of pages to crawl
  -n, --nonstrict       allow crawler to visit any domain
  -ni, --no-images      do not save page images
  -no, --no-overwrite   do not overwrite files if they exist
  -o [OUT [OUT ...]], --out [OUT [OUT ...]]
                        specify outfile names
  -ow, --overwrite      overwrite a file if it exists
  -p, --pdf             write files as pdf
  -pt, --print          print text output
  -q, --quiet           suppress program output
  -s, --single          save to a single file
  -t, --text            write files as text
  -v, --version         display current version
  -x [XPATH], --xpath [XPATH]
                        filter HTML using XPath
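
A few illustrative invocations (a sketch; the URLs and filenames are placeholders, and the installed command is assumed to be scrape rather than scrape.py):

scrape https://example.com --text
scrape https://example.com/a https://example.com/b notes.html --pdf --single

The second command merges two pages and a local HTML file into a single PDF.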

Author

huntrar (https://github.com/huntrar)

Notes

  • Input to scrape can be links, files, or a combination of the two, allowing you to create new files constructed from both existing and newly scraped content.

  • Multiple input files/URLs are saved to multiple output files/directories by default. To consolidate them, use the --single flag.

  • Images are automatically included when saving as PDF or HTML; this requires additional HTTP requests and can add significant processing time. If you wish to forgo this feature, use the --no-images flag or set the environment variable SCRAPE_DISABLE_IMGS.

  • A requests cache is enabled by default to cache webpages; it can be disabled by setting the environment variable SCRAPE_DISABLE_CACHE.

  • Pages are saved temporarily as PART.html files during processing. Unless saving pages as HTML, these files are removed automatically upon conversion or exit.

  • To crawl pages with no restrictions, use the --crawl-all flag, or filter which pages to crawl by URL keywords by passing one or more regexps to --crawl (see the example commands after these notes).

  • If you want the crawler to follow links outside of the given URL's domain, use --nonstrict.

  • Crawling can be stopped with Ctrl-C, or limited by setting the maximum number of pages to crawl with --max-crawls. A page may contain zero or many links to further pages.

  • The text output of scraped files can be printed to stdout rather than saved by passing --print.

  • Filtering HTML can be done using --xpath, while filtering text is done by passing one or more regexps to --filter.

  • If you only want to extract specific tag attributes rather than filter with a full XPath, use --attributes. The default is to extract only text, but you can specify one or more attributes (such as href, src, title, or any other available attribute).
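
The commands below sketch several of the notes above (the URLs, regexps, and XPath expression are placeholders; the installed command is assumed to be scrape, and the environment variable is assumed to accept any non-empty value):

scrape https://example.com --crawl docs tutorial --max-crawls 20 --html
scrape https://example.com --crawl-all --nonstrict --max-crawls 50 --text
scrape https://example.com --xpath "//div[@id='content']" --print
scrape https://example.com --filter python scraping --text
scrape https://example.com --attributes href --print
SCRAPE_DISABLE_IMGS=1 scrape https://example.com --pdf

The first two commands restrict crawling with regexp rules or a page cap, the next three filter output with an XPath expression, text regexps, or tag attributes, and the last saves a PDF without fetching page images.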


Download files

Download the file for your platform.

Source Distribution

scrape-0.11.3.tar.gz (22.1 kB, Source)

Built Distribution

scrape-0.11.3-py3-none-any.whl (17.1 kB, Python 3)

File details

Details for the file scrape-0.11.3.tar.gz.

File metadata

  • Download URL: scrape-0.11.3.tar.gz
  • Upload date:
  • Size: 22.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.2

File hashes

Hashes for scrape-0.11.3.tar.gz

Algorithm     Hash digest
SHA256        d5b6f31fd677689d967d76c2582356cc827c9eb33c78507d8a84343b8fd31dc9
MD5           8d592337bc5bcf0e0ef6de497943ac7d
BLAKE2b-256   04342e68e3c8f4deba1ad2a855849b3d123c2c43e895a73fdbb420978beeca11
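
To check a downloaded archive against the published digest (a sketch; sha256sum ships with GNU coreutils, and shasum -a 256 is a common macOS alternative):

sha256sum scrape-0.11.3.tar.gz

The printed value should match the SHA256 digest listed above.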


File details

Details for the file scrape-0.11.3-py3-none-any.whl.

File metadata

  • Download URL: scrape-0.11.3-py3-none-any.whl
  • Upload date:
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.2

File hashes

Hashes for scrape-0.11.3-py3-none-any.whl

Algorithm     Hash digest
SHA256        cd69542758f1e72f77b9b0251fd90afa79f99381f1b564f378ee5346865c1f38
MD5           1a90b94f3cc3298bb9d883fd09489561
BLAKE2b-256   f927d4dbf2ef6afbbb6d84d85be1f1e2149018606475f724fa728a809c973fa6

