scrape
a command-line web scraping tool
scrape is a rule-based web crawler and information extraction tool capable of manipulating and merging new and existing documents. XML Path Language (XPath) and regular expressions are used to define rules for filtering content and web traversal. Output may be converted into text, CSV, PDF, and/or HTML formats.
Installation
pip install scrape
or
pip install git+https://github.com/huntrar/scrape.git#egg=scrape
or
git clone https://github.com/huntrar/scrape
cd scrape
python setup.py install
You must install wkhtmltopdf to save files to PDF.
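For example, on Debian or Ubuntu systems wkhtmltopdf is typically available from the system package manager (package name assumed):

sudo apt-get install wkhtmltopdf

Prebuilt packages for other platforms are available from the wkhtmltopdf website.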
Usage
usage: scrape.py [-h] [-a [ATTRIBUTES [ATTRIBUTES ...]]] [-all]
                 [-c [CRAWL [CRAWL ...]]] [-C] [--csv] [-cs [CACHE_SIZE]]
                 [-f [FILTER [FILTER ...]]] [--html] [-i] [-m]
                 [-max MAX_CRAWLS] [-n] [-ni] [-no] [-o [OUT [OUT ...]]]
                 [-ow] [-p] [-pt] [-q] [-s] [-t] [-v] [-x [XPATH]]
                 [QUERY [QUERY ...]]

a command-line web scraping tool

positional arguments:
  QUERY                 URLs/files to scrape

optional arguments:
  -h, --help            show this help message and exit
  -a [ATTRIBUTES [ATTRIBUTES ...]], --attributes [ATTRIBUTES [ATTRIBUTES ...]]
                        extract text using tag attributes
  -all, --crawl-all     crawl all pages
  -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                        regexp rules for following new pages
  -C, --clear-cache     clear requests cache
  --csv                 write files as csv
  -cs [CACHE_SIZE], --cache-size [CACHE_SIZE]
                        size of page cache (default: 1000)
  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                        regexp rules for filtering text
  --html                write files as HTML
  -i, --images          save page images
  -m, --multiple        save to multiple files
  -max MAX_CRAWLS, --max-crawls MAX_CRAWLS
                        max number of pages to crawl
  -n, --nonstrict       allow crawler to visit any domain
  -ni, --no-images      do not save page images
  -no, --no-overwrite   do not overwrite files if they exist
  -o [OUT [OUT ...]], --out [OUT [OUT ...]]
                        specify outfile names
  -ow, --overwrite      overwrite a file if it exists
  -p, --pdf             write files as pdf
  -pt, --print          print text output
  -q, --quiet           suppress program output
  -s, --single          save to a single file
  -t, --text            write files as text
  -v, --version         display current version
  -x [XPATH], --xpath [XPATH]
                        filter HTML using XPath
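For example, assuming pip installs a scrape entry point on your PATH (and using example.com purely as a placeholder), a couple of minimal invocations might look like:

scrape https://example.com --print
scrape https://example.com --pdf

The first prints the page's extracted text to stdout; the second converts the page to PDF (which requires wkhtmltopdf, as noted above).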
Notes
Supports both Python 2.x and Python 3.x.
Input to scrape can be links, files, or a combination of the two, allowing you to create new files constructed from both existing and newly scraped content.
Multiple input files/URLs are saved to multiple output files/directories by default. To consolidate them, use the --single flag.
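For instance, a hypothetical run merging an existing local file with a freshly scraped page into one text file (the file name, URL, and outfile name below are placeholders) might be:

scrape notes.html https://example.com --single --text --out combined

Here --single consolidates the inputs and --out names the resulting file.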
Images are automatically included when saving as PDF or HTML; this involves making additional HTTP requests, which adds a significant amount of processing time. If you wish to forgo this feature, use the --no-images flag or set the environment variable SCRAPE_DISABLE_IMGS.
A requests cache is enabled by default to cache webpages; it can be disabled by setting the environment variable SCRAPE_DISABLE_CACHE.
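A minimal sketch of disabling both features, assuming a POSIX shell and that any non-empty value for these variables is honored:

export SCRAPE_DISABLE_IMGS=1
export SCRAPE_DISABLE_CACHE=1
scrape https://example.com --html

The --clear-cache flag can also be used to empty an existing cache.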
Pages are saved temporarily as PART.html files during processing. Unless saving pages as HTML, these files are removed automatically upon conversion or exit.
To crawl pages with no restrictions use the --crawl-all flag, or filter which pages to crawl by URL keywords by passing one or more regexps to --crawl.
If you want the crawler to follow links outside of the given URL's domain, use --nonstrict.
Crawling can be stopped with Ctrl-C, or by setting the maximum number of pages to crawl using --max-crawls. A page may contain zero or many links to more pages.
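For example, a crawl that only follows URLs matching a keyword and stops after a fixed number of pages (the regexp and limit are arbitrary choices for illustration) might be invoked as:

scrape https://example.com --crawl docs --max-crawls 20 --text --single

Adding --nonstrict would let the crawler leave the seed domain, while --crawl-all would follow every link it finds.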
The text output of scraped files can be printed to stdout rather than saved by entering --print.
Filtering HTML can be done using --xpath, while filtering text is done by entering one or more regexps to --filter.
If you only want to extract specific tag attributes rather than filter by an entire XPath, use --attributes. The default is to extract only text attributes, but you can specify one or more other attributes (such as href, src, title, or any other attribute available).
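As an illustration of the three filtering options (the XPath expression, regexps, and attribute name below are placeholders, not recommendations):

scrape https://example.com --xpath '//div[@class="content"]' --text
scrape https://example.com --filter python crawler --print
scrape https://example.com --attributes href --print

The first keeps only HTML matching the XPath before conversion, the second filters the extracted text with the given regexps, and the third extracts href values rather than text.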
News
0.9.14
added versions 3.6 and 3.7 to travis CI build, removed 2.6 and 3.3
2.6 and 3.3 deprecated by lxml
0.9.13
3.7 added as supported version in setup
Updated LICENSE and requirements.txt
0.9.12
3.6 added as supported version in setup
Updated LICENSE
0.9.11
Bugfix: MissingSchema during requests get
Bugfix: Check for Python 2 should have been for Python 3
0.9.10
More refactoring
0.9.9
Converted markdown README to rst
0.9.8
Changed Utility classifier to Utilities
0.9.7
Replaced compat.py with six module
Made imports relative rather than from PATH
More refactoring
0.9.6
Bugfix: Remove non-links through filtering by protocol
Refactorings
0.9.5
Bugfix: Properly join internal and base URLs for crawling
0.9.4
Retired support for 3.2 as tldextract doesn’t support it
0.9.3
Moved crawling functions into a Crawler class
General refactorings to docstrings, function names, etc.
Consolidated max_pages and max_links arguments as max_crawls
Added tldextract module for getting URL domain, suffixes
0.9.2
Added compat.py file
Moved compatible builtin definitions to __init__
Added requests cache
0.9.1
Updated version in requirements and setup keywords
Removed --use-mirrors for 3.5 support
0.9.0
Bugfix: Fixed comparison of duplicate URLs when crawling
0.8.11
Bugfix: Improper check of domain when being restrictive
0.8.10
Strip ‘/’ from end of urls when crawling
0.8.9
Added argument for cache link size & fixed up others
0.8.8
Updated README and setup
0.8.7
added CSV as a format
0.8.6
added environ variable SCRAPE_DISABLE_IMGS to not save images
0.8.5
warn user that saving images during crawling is slow
0.8.4
moved print_text() from crawl.py back to scrape.py
0.8.3
fixed bad formatting in readme usage
0.8.2
ignore-load-errors removed from wkhtmltopdf executable
0.8.1
removed extra schema adding
0.8.0
fixed bug where added url schema not reflected in query
0.7.9
moved file crawling to new file
avoid overwrite prompt in tests
0.7.8
updated program description
removed overwriting test due to issues with it
0.7.7
no longer defaults to overwriting files; added program flags and a prompt
added a renaming mechanism when choosing not to overwrite a file
some function reorganizing
0.7.6
added print text to stdout option
removed extra newline appended in re_filter
wrapped pdfkit import in try/except as it isn't essential
0.7.5
removed extra urlparse import
0.7.4
added option to not save images
images are now only saved if saving to HTML or PDF
checks if outfilename has extension before adding new one
fixed domains being sometimes mismatched to urls
fixed extension being unnecessarily appended to urls (for the most part)
0.7.3
development status reverted to beta
0.7.2
now saves images with PART.html files (but not css yet)
added module level docstrings
0.7.1
added EOFError handling
0.7.0
fixed crawl not returning filenames to add to infilenames
fixed re_filter adding duplicate matches
fixed domain UnboundLocalError
0.6.9
fixed bug where query not found in urls due to trailing /
0.6.8
updated program usage
0.6.7
fixed bounds check on out file names
0.6.6
added out file names as a program argument
fixed bug where re-writing multiple files
fixed bug where writing only the first file when writing single file
0.6.5
major improvement to remove_whitespace()
0.6.4
more docstring improvements
0.6.3
began process of making docstrings conform to pep257
increased size of link cache from 10 to 100
remove the newline at start of text files
add newlines between lines filtered by regex
remove_whitespace now removes newlines that are 3 in a row or more
0.6.2
stylistic changes
files are now read in 1K chunks
0.6.1
remove consecutive whitespace before writing text files
empty text files no longer written
0.6.0
fixed bug where single out file name wasn’t properly constructed
out file names are all returned as lowercase now
0.5.9
fixed bug where text wouldn’t write unless xpath specified
0.5.8
can now parse HTML using XPath and save to all formats
remove carriage returns in scraped text files
0.5.7
added maximum out file name length of 24 characters
0.5.6
fixed urls not being properly added under file_types
0.5.5
fixed UnboundLocalError in write_single_file
0.5.4
fixed redefinition of out_file_name in write_to_text
0.5.3
fixed IndexError in write_to_text
0.5.2
small fix for finding single out file name
0.5.1
remade method to find single out file name
0.5.0
can now save to single or multiple output files/directories
added tests for writing to single or multiple files
preserves original lines/newlines when parsing/writing files
0.4.11
changed generator.next() to next(generator) for python 3 compatibility
0.4.10
forgot to remove all occurrences of xrange
0.4.9
changed unicode decode to ascii decode when writing html to disk
0.4.8
added missing python 3 compatibilities
0.4.7
fixed urlparse importerror in utils.py for python 3 users
0.4.6
fixed html => text
all conversions fixed, test_scrape.py added to keep it this way
added pdfkit to requirements.txt
0.4.5
added docstrings to all functions
fixed IOError when trying to convert local html to html
fixed IOError when trying to convert local html to pdf
fixed saving scraped files to text, was saving PART filenames instead
0.4.4
prompts for filetype from user if none entered
modularized a couple functions
0.4.3
fixed out_file naming
pep8 and pylint reformatting
0.4.2
removed read_part_files in place of get_part_files as pdfkit reads filenames
0.4.1
fixed bug preventing writing scraped urls to pdf
0.4.0
can now read in text and filter it
recognizes local files, no need for user to enter special flag
moved html/ files to testing/ and added a text file to it
added better distinction between input and output files
changed instances of file to f_name in utils
pep8 reformatting
0.3.9
add scheme to urls if none present
fixed bug where raw_html was calling get_html rather than get_raw_html
0.3.8
made distinction between links and pages with multiple links on them
use --maxpages to set the maximum number of pages to get links from
use --maxlinks to set the maximum number of links to parse
improved the argument help messages
improved notes/description in README
0.3.7
fixes to page caching and writing PART files
use --local to read in local html files
use --max to indicate max number of pages to crawl
changed program description and keywords
0.3.6
cleanup using pylint as reference
0.3.5
updated long program description in readme
added pypi monthly downloads image in readme
0.3.4
updated description header in readme
0.3.3
added file conversion to program description
0.3.2
added travis-ci build status to readme
0.3.1
updated program description and added extra installation instructions
added .travis.yml and requirements.txt
0.3.0
added read option for user-inputted html files; currently writes files individually and not grouped, grouping option to be added next
added html/ directory containing test html files
made relative imports explicit using absolute_import
added proxies to utils.py
0.2.10
moved OrderedSet class to orderedset.py rather than utils.py
0.2.9
updated program description and keywords in setup.py
0.2.8
restricts crawling to seed domain by default, changed --strict to --nonstrict for crawling outside given website
0.2.5
added requests to install_requires in setup.py
0.2.4
added attributes flag which specifies which tag attributes to extract from a given page, such as text, href, etc.
0.2.3
updated flags and flag help messages
verbose now by default and reduced number of messages, use --quiet to silence messages
changed name of --files flag to --html for saving output as html
added --text flag, default is still text
0.2.2
fixed character encoding issue, all unicode now
0.2.1
improvements to exception handling for proper PART file removal
0.2.0
pages are now saved as they are crawled to PART.html files and processed/removed as necessary; this greatly saves on program memory
added a page cache with a limit of 10 for greater duplicate protection
added --files option for keeping webpages as PART.html instead of saving as text or pdf, this also organizes them into a subdirectory named after the seed URL's domain
changed --restrict flag to --strict for restricting the domain to the seed domain while crawling
more --verbose messages being printed
0.1.10
0.1.9
added behavior for --crawl keywords in crawl method
added a domain check before outputting crawled message or adding to crawled links
domain key in args is now set to base domain for proper --restrict behavior
clean_url now rstrips / character for proper link crawling
resolve_url now rstrips / character for proper out_file writing
updated description of --crawl flag
0.1.8
removed url fragments
replaced set_base with urlparse method urljoin
out_file name construction now uses urlparse ‘path’ member
raw_links is now an OrderedSet to try to eliminate as much processing as possible
added clear method to OrderedSet in utils.py
0.1.7
removed validate_domain and replaced it with a lambda instead
replaced domain with base_url in set_base as should have been done before
crawled message no longer prints if url was a duplicate
0.1.6
uncommented import __version__
0.1.5
set_domain was replaced by set_base, proper solution for links that are relative
fixed verbose behavior
updated description in README
0.1.4
fixed output file generation, was using domain instead of base_url
minor code cleanup
0.1.3
blank lines are no longer written to text unless as a page separator
style tags now ignored alongside script tags when getting text
0.1.2
added shebang
0.1.1
uncommented import __version__
0.1.0
reformatting to conform with PEP 8
added regexp support for matching crawl keywords and filter text keywords
improved url resolution by correcting domains and schemes
added --restrict option to restrict crawler links to only those with seed domain
made text the default write option rather than pdf, can now use --pdf to change that
removed page number being written to text, separator is now just a single blank line
improved construction of output file name
0.0.11
fixed missing comma in install_requires in setup.py
also labeled now as beta as there are still some kinks with crawling
0.0.10
now ignoring pdfkit load errors only if more than one link to try to prevent an empty pdf being created in case of error
0.0.9
pdfkit now ignores load errors and writes as many pages as possible
0.0.8
better implementation of crawler, can now scrape entire websites
added OrderedSet class to utils.py
0.0.7
changed --keywords to --filter and positional arg url to urls
0.0.6
use --keywords flag for filtering text
can pass multiple links now
will not write empty files anymore
0.0.5
added --verbose argument for use with pdfkit
improved output file name processing
0.0.4
accepts 0 or 1 URLs, allowing a call with just --version
0.0.3
Moved utils.py to scrape/
0.0.2
First entry