# scrape
## a command-line web scraping and crawling tool
[![Build Status](https://travis-ci.org/huntrar/scrape.svg?branch=master)](https://travis-ci.org/huntrar/scrape)
scrape is a command-line tool for quickly extracting and filtering web pages in a grep-like manner. Output can be saved as text, PDF, or HTML, and users may also supply their own HTML files to convert or filter. A built-in crawler lets scrape traverse websites, either freely or by following only the links that match given regex keywords. scrape can extract data by DOM tag attribute, for example 'href' for all links or 'text' for all plain text.
## Installation
```
pip install scrape
```

or

```
pip install git+https://github.com/huntrar/scrape.git#egg=scrape
```

or

```
git clone https://github.com/huntrar/scrape
cd scrape
python setup.py install
```
You must [install wkhtmltopdf](https://github.com/pdfkit/pdfkit/wiki/Installing-WKHTMLTOPDF) to save files to pdf.
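As a quick check that the install succeeded (assuming the console script is installed as scrape; scrape.py also works from a source checkout), print the version:

```
scrape --version
```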
## Usage
```
usage: scrape.py [-h] [-r [READ [READ ...]]]
                 [-a [ATTRIBUTES [ATTRIBUTES ...]]] [-c [CRAWL [CRAWL ...]]]
                 [-ca] [-f [FILTER [FILTER ...]]] [-ht] [-l LIMIT] [-n] [-p]
                 [-q] [-t] [-v]
                 [urls [urls ...]]

a command-line web scraping, crawling, and conversion tool

positional arguments:
  urls                  url(s) to scrape

optional arguments:
  -h, --help            show this help message and exit
  -r [READ [READ ...]], --read [READ [READ ...]]
                        read in local html file(s)
  -a [ATTRIBUTES [ATTRIBUTES ...]], --attributes [ATTRIBUTES [ATTRIBUTES ...]]
                        tag attribute(s) for extracting lines of text, default
                        is text
  -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                        regexp(s) to match links to crawl
  -ca, --crawl-all      crawl all links
  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                        regexp(s) to filter lines of text
  -ht, --html           save output as html
  -l LIMIT, --limit LIMIT
                        set page crawling limit
  -n, --nonstrict       set crawler to visit other websites
  -p, --pdf             save output as pdf
  -q, --quiet           suppress output
  -t, --text            save output as text, default
  -v, --version         display current version
```
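A couple of basic invocations, using only the flags listed above (example.com is a placeholder URL, and the command is assumed to be installed as scrape):

```
# Save a page as text (the default format)
scrape http://example.com

# Save the same page as a PDF instead (requires wkhtmltopdf)
scrape http://example.com --pdf

# Extract all links by pulling the href attribute
scrape http://example.com --attributes href
```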
## Author
* Hunter Hammond (huntrar@gmail.com)
## Notes
* Supports both Python 2.x and Python 3.x.
* Pages are converted to text by default; specify --html or --pdf to save them in a different format.
* Use the --read flag to read in local HTML files and extract, filter, or convert their contents.
* Filter lines of text by passing one or more regexps to --filter.
* Use --attributes to choose which tag attributes to extract from a page. Text is extracted by default, but you can specify one or more attributes (such as href, src, title, or any other available attribute).
* Pages are saved temporarily as PART(%d).html files during processing and are removed automatically upon format conversion or unexpected exit.
* Entire websites can be downloaded by using the --crawl-all flag or by passing one or more regexps to --crawl, which filter the list of links to follow (see the example after this list).
* If you want the crawler to follow links outside of the given URL's domain, use --nonstrict.
* Crawling can be stopped by Ctrl-C or by setting the number of pages to be crawled using --limit.
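As a sketch of the crawling and filtering options above, the following invocations use placeholder URLs, regexps, and filenames chosen purely for illustration:

```
# Crawl only links matching the regexp 'blog', stop after 20 pages,
# keep only lines containing 'python', and save the result as HTML
scrape http://example.com --crawl blog --limit 20 --filter python --html

# Convert local HTML files to PDF (requires wkhtmltopdf)
scrape --read page1.html page2.html --pdf
```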
News
====
0.3.3
------
- added file conversion to program description
0.3.2
------
- added travis-ci build status to readme
0.3.1
------
- updated program description and added extra installation instructions
- added .travis.yml and requirements.txt
0.3.0
------
- added read option for user-provided html files; currently writes files individually rather than grouped, with a grouping option to be added next
- added html/ directory containing test html files
- made relative imports explicit using absolute_import
- added proxies to utils.py
0.2.10
------
- moved OrderedSet class to orderedset.py rather than utils.py
0.2.9
------
- updated program description and keywords in setup.py
0.2.8
------
- restricts crawling to seed domain by default, changed --strict to --nonstrict for crawling outside given website
0.2.5
------
- added requests to install_requires in setup.py
0.2.4
------
- added attributes flag which specifies which tag attributes to extract from a given page, such as text, href, etc.
0.2.3
------
- updated flags and flag help messages
- verbose now by default and reduced number of messages, use --quiet to silence messages
- changed name of --files flag to --html for saving output as html
- added --text flag, default is still text
0.2.2
------
- fixed character encoding issue, all unicode now
0.2.1
------
- improvements to exception handling for proper PART file removal
0.2.0
------
- pages are now saved as they are crawled to PART.html files and processed/removed as necessary, this greatly saves on program memory
- added a page cache with a limit of 10 for greater duplicate protection
- added --files option for keeping webpages as PART.html instead of saving as text or pdf, this also organizes them into a subdirectory named after the seed url's domain
- changed --restrict flag to --strict for restricting the domain to the seed domain while crawling
- more --verbose messages being printed
0.1.10
------
- now compares urls scheme-less before updating links to prevent http:// and https:// duplicates and replaced set_scheme with remove_scheme in utils.py
- renamed write_pages to write_links
0.1.9
------
- added behavior for --crawl keywords in crawl method
- added a domain check before outputting crawled message or adding to crawled links
- domain key in args is now set to base domain for proper --restrict behavior
- clean_url now rstrips / character for proper link crawling
- resolve_url now rstrips / character for proper out_file writing
- updated description of --crawl flag
0.1.8
------
- removed url fragments
- replaced set_base with urlparse method urljoin
- out_file name construction now uses urlparse 'path' member
- raw_links is now an OrderedSet to try to eliminate as much processing as possible
- added clear method to OrderedSet in utils.py
0.1.7
------
- removed validate_domain and replaced it with a lambda instead
- replaced domain with base_url in set_base as should have been done before
- crawled message no longer prints if url was a duplicate
0.1.6
------
- uncommented import __version__
0.1.5
------
- set_domain was replaced by set_base, proper solution for links that are relative
- fixed verbose behavior
- updated description in README
0.1.4
------
- fixed output file generation, was using domain instead of base_url
- minor code cleanup
0.1.3
------
- blank lines are no longer written to text unless as a page separator
- style tags now ignored alongside script tags when getting text
0.1.2
------
- added shebang
0.1.1
------
- uncommented import __version__
0.1.0
------
- reformatting to conform with PEP 8
- added regexp support for matching crawl keywords and filter text keywords
- improved url resolution by correcting domains and schemes
- added --restrict option to restrict crawler links to only those with seed domain
- made text the default write option rather than pdf, can now use --pdf to change that
- removed page number being written to text, separator is now just a single blank line
- improved construction of output file name
0.0.11
------
- fixed missing comma in install_requires in setup.py
- also labeled now as beta as there are still some kinks with crawling
0.0.10
------
- now ignores pdfkit load errors only when there is more than one link, to prevent an empty pdf being created in case of error
0.0.9
------
- pdfkit now ignores load errors and writes as many pages as possible
0.0.8
------
- better implementation of crawler, can now scrape entire websites
- added OrderedSet class to utils.py
0.0.7
------
- changed --keywords to --filter and positional arg url to urls
0.0.6
------
- use --keywords flag for filtering text
- can pass multiple links now
- will not write empty files anymore
0.0.5
------
- added --verbose argument for use with pdfkit
- improved output file name processing
0.0.4
------
- accepts 0 or 1 urls, allowing a call with just --version
0.0.3
------
- Moved utils.py to scrape/
0.0.2
------
- First entry