# scrape

## a web scraping tool

## Installation
* `pip install scrape`

## Usage
    usage: scrape.py [-h] [-c [CRAWL [CRAWL ...]]] [-ca]
                     [-f [FILTER [FILTER ...]]] [-l LIMIT] [-p] [-r] [-v] [-vb]
                     [urls [urls ...]]

    a web scraping tool

    positional arguments:
      urls                  urls to scrape

    optional arguments:
      -h, --help            show this help message and exit
      -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                            keywords to crawl links by
      -ca, --crawl-all      crawl all links
      -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                            filter lines of text by keywords
      -l LIMIT, --limit LIMIT
                            set crawl page limit
      -p, --pdf             write to pdf instead of text
      -r, --restrict        restrict domain to that of the seed url
      -v, --version         display current version
      -vb, --verbose        print pdfkit log messages
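
To make the argument shapes concrete, here is a minimal sketch of an `argparse` parser that accepts the same options as the help text above. It is a reconstruction from that help text, not the tool's actual source: the function name `build_parser`, the `int` type for `--limit`, and the use of `store_true` for the flags are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the help text (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="scrape.py", description="a web scraping tool"
    )
    # Zero or more urls, so a call with only --version is valid.
    parser.add_argument("urls", nargs="*", help="urls to scrape")
    # --crawl and --filter each take zero or more regexps.
    parser.add_argument("-c", "--crawl", nargs="*",
                        help="keywords to crawl links by")
    parser.add_argument("-ca", "--crawl-all", action="store_true",
                        help="crawl all links")
    parser.add_argument("-f", "--filter", nargs="*",
                        help="filter lines of text by keywords")
    # Assumption: the limit is an integer page count.
    parser.add_argument("-l", "--limit", type=int,
                        help="set crawl page limit")
    parser.add_argument("-p", "--pdf", action="store_true",
                        help="write to pdf instead of text")
    parser.add_argument("-r", "--restrict", action="store_true",
                        help="restrict domain to that of the seed url")
    parser.add_argument("-v", "--version", action="store_true",
                        help="display current version")
    parser.add_argument("-vb", "--verbose", action="store_true",
                        help="print pdfkit log messages")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Because `--crawl` and `--filter` accept any number of values, urls are safest placed before them on the command line; otherwise a trailing url may be read as another keyword.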

## Author
* Hunter Hammond (huntrar@gmail.com)

## Notes
* Use --pdf to save web pages as PDFs; by default they are saved as text.

* Text can be filtered by passing one or more regexps to --filter.

* To crawl subsequent pages, pass --crawl followed by one or more regexps, or pass --crawl-all to follow every link.

* To restrict crawling to the seed url's domain, use --restrict; otherwise links to any domain may be followed.

* There is no limit to the number of pages crawled unless one is set with --limit. To cancel crawling and begin processing, press Ctrl-C.
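
The notes above can be illustrated with a short sketch of how regexp filtering, keyword-based link selection, domain restriction, and a limit might fit together. This is not the tool's implementation: the function names `filter_lines` and `select_links` and their parameters are hypothetical, and the cap here applies to selected links rather than to pages actually fetched.

```python
import re
from urllib.parse import urlparse


def filter_lines(lines, patterns):
    """Keep lines that match at least one regexp (the idea behind --filter)."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines if any(rx.search(line) for rx in compiled)]


def select_links(seed_url, links, crawl_patterns=None, restrict=False, limit=None):
    """Choose which links to follow.

    crawl_patterns: regexps a link must match (as with --crawl);
                    None or empty follows every link (as with --crawl-all).
    restrict:       keep only links on the seed url's domain (as with --restrict).
    limit:          stop after this many links (a stand-in for --limit).
    """
    seed_domain = urlparse(seed_url).netloc
    compiled = [re.compile(p) for p in (crawl_patterns or [])]
    selected = []
    for link in links:
        if limit is not None and len(selected) >= limit:
            break
        if restrict and urlparse(link).netloc != seed_domain:
            continue
        if compiled and not any(rx.search(link) for rx in compiled):
            continue
        selected.append(link)
    return selected


if __name__ == "__main__":
    # example.com and other.org are placeholder domains.
    links = [
        "http://example.com/blog/1",
        "http://other.org/blog/2",
        "http://example.com/about",
    ]
    print(select_links("http://example.com", links, [r"/blog/"], restrict=True))
```

Comparing `netloc` values exactly treats `www.example.com` and `example.com` as different domains; a real crawler may want to normalize them first.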



News
====

0.1.1
------
- uncommented import __version__

0.1.0
------

- reformatting to conform with PEP 8
- added regexp support for matching crawl keywords and filter text keywords
- improved url resolution by correcting domains and schemes
- added --restrict option to restrict crawler links to only those with seed domain
- made text the default write option rather than pdf; use --pdf to write pdf instead
- removed page number being written to text, separator is now just a single blank line
- improved construction of output file name

0.0.11
------

- fixed missing comma in install_requires in setup.py
- also labeled now as beta as there are still some kinks with crawling

0.0.10
------

- now ignoring pdfkit load errors only if there is more than one link, to prevent an empty pdf being created in case of error

0.0.9
------

- pdfkit now ignores load errors and writes as many pages as possible

0.0.8
------

- better implementation of crawler, can now scrape entire websites
- added OrderedSet class to utils.py

0.0.7
------

- changed --keywords to --filter and positional arg url to urls

0.0.6
------

- use --keywords flag for filtering text
- can pass multiple links now
- will not write empty files anymore

0.0.5
------

- added --verbose argument for use with pdfkit
- improved output file name processing

0.0.4
------

- accepts 0 or 1 urls, allowing a call with just --version

0.0.3
------

- Moved utils.py to scrape/

0.0.2
------

- First entry


