SCR

Command-line Utility for Web Scraping

Core Features

  • Extracts web content using XPath, regex, (JavaScript), and Python format expressions
  • Crawls through complex graphs of web pages using expressive match chains and forwarding rules
  • Selenium support, including headless mode and the Tor Browser
  • REPL mode for quick and dirty jobs / debugging larger commands
  • dd-style command-line interface (key=value arguments)
  • Multithreaded downloads with optional progress output on the console
  • Interactive modes for rejecting false matches, adjusting filenames etc.

Setup

SCR can be installed from pypi.org using

pip install scr

Selenium drivers for Firefox/Tor (geckodriver) and Chrome/Chromium (chromedriver) can be installed, for example, using

scr selinstall=firefox

(and later updated e.g. using scr selupdate=chrome). You still need to have the browser installed, though.

Examples

Download and enumerate all images from a website into the current working directory:

scr url=google.com cx='//img/@src' cl csf="img_{ci}{fe}" 
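
The csf template above combines the content index {ci} with the file extension {fe} taken from each image link. The following is a minimal Python sketch of that expansion, not scr's actual implementation: the helper names filename_parts and save_path are hypothetical, the example links are made up, and starting the index at 0 is an assumption.

```python
import posixpath
from urllib.parse import urlsplit


def filename_parts(url):
    """Split the last path segment of a URL into (fn, fb, fe)."""
    # Only the path is used, so query strings like "?x=1" are ignored.
    fn = posixpath.basename(urlsplit(url).path)
    fb, fe = posixpath.splitext(fn)  # fe keeps the leading dot, or is ""
    return fn, fb, fe


def save_path(template, url, ci):
    """Expand a csf-style template with the {ci}, {fn}, {fb} and {fe} args."""
    fn, fb, fe = filename_parts(url)
    return template.format(ci=ci, fn=fn, fb=fb, fe=fe)


# Hypothetical links, as they might come out of cx='//img/@src'.
links = ["https://example.com/a/logo.png", "https://example.com/banner.jpg?x=1"]
paths = [save_path("img_{ci}{fe}", link, ci) for ci, link in enumerate(links)]
print(paths)  # ['img_0.png', 'img_1.jpg']
```

Because {fe} includes the dot, the template needs no separator between {ci} and {fe}, and a link without an extension simply yields a name like img_2.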

Open a REPL that remote-controls a Firefox browser through Selenium:

scr repl sel=firefox url="https://en.wikipedia.org/wiki/web_scraping"
scr> 'cx=//span[text()="Edit"]/parent::*/parent::*/@id' cjs="document.getElementById(cx).children[0].click()"
scr> exit

Interactively scroll through top reddit posts, following the 'next page' buttons:

scr url=old.reddit.com dx='//span[@class="next-button"]/a/@href' cx='//div[contains(@class,"entry")]//a[contains(@class,"title")]/text()' din mt=0
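
The command above uses dx to discover the next page from each document and cx to pull the titles out of it. A rough Python sketch of such a follow-the-link loop is shown here, only as an illustration of the idea and not of how scr implements it: the PAGES dictionary stands in for real HTTP fetches and XPath matching, the crawl function and its max_docs parameter are hypothetical, and the interactive prompt enabled by din is left out.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": url -> (titles on the page, relative next link or None)
PAGES = {
    "https://example.com/": (["post 1", "post 2"], "/?page=2"),
    "https://example.com/?page=2": (["post 3"], "/?page=3"),
    "https://example.com/?page=3": (["post 4"], None),
}


def crawl(start, max_docs=None):
    """Visit start, then keep following each page's next link."""
    queue = deque([start])
    seen = set()
    titles = []
    while queue and (max_docs is None or len(seen) < max_docs):
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue  # skip repeats and pages this toy site does not have
        seen.add(url)
        content, next_link = PAGES[url]
        titles.extend(content)
        if next_link is not None:
            # Relative links are resolved against the current document's URL.
            queue.append(urljoin(url, next_link))
    return titles


print(crawl("https://example.com/"))
# ['post 1', 'post 2', 'post 3', 'post 4']
```

The max_docs parameter plays a role loosely comparable to a document limit such as dimax: crawl("https://example.com/", max_docs=2) stops after the second page.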

Download the first 3 PDFs and the first 5 GIFs from a site, using a headless Selenium Tor Browser for the fetch:

scr url=https://dtc.ucsf.edu/learning-library/resource-materials/ cx=//@href cr0='.*\.pdf$' cr1='.*\.gif$' cl csf='{fn}' cin=1 cimax0=3 cimax1=5 sel=tor selh
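
In the command above, every link matched by cx is offered to two chains, each with its own regex (cr0, cr1) and its own cap (cimax0, cimax1). The Python sketch below illustrates that idea under stated assumptions and is not scr's code: the select function is hypothetical, the cap is treated as a plain count of accepted matches, the regex is applied with re.search, and the link list is invented.

```python
import re


def select(links, chains):
    """chains is a list of (pattern, limit); returns the links each chain keeps."""
    picked = [[] for _ in chains]
    for link in links:
        for idx, (pattern, limit) in enumerate(chains):
            # A chain stops accepting links once it has reached its cap.
            if len(picked[idx]) < limit and re.search(pattern, link):
                picked[idx].append(link)
    return picked


# Hypothetical hrefs, as they might come out of cx=//@href.
links = ["a.pdf", "b.gif", "index.html", "c.pdf", "d.pdf", "e.pdf"]
pdfs, gifs = select(links, [(r".*\.pdf$", 3), (r".*\.gif$", 5)])
print(pdfs)  # ['a.pdf', 'c.pdf', 'd.pdf']
print(gifs)  # ['b.gif']
```

Links that match neither regex, such as index.html, are dropped by both chains, and e.pdf is skipped because the first chain has already collected three matches.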

Options List

scr [OPTIONS]

    Matching chains are evaluated in the following order, skipping unspecified steps:
    xpath -> regex -> (javascript) -> python format string

    Content to Write out:
        cx=<xpath>            xpath for content matching
        cr=<regex>            regex for content matching
        cjs=<js string>       javascript to execute on the page, format args are available as js variables (selenium only)
        cf=<format string>    content format string (args: <cr capture groups>, xmatch, rmatch, di, ci)
        cmm=<bool>            allow multiple content matches in one document instead of picking the first (defaults to true)
        cimin=<number>        initial content index, each successful match gets one index
        cimax=<number>        max content index, matching stops here
        cicont=<bool>         don't reset the content index for each document
        csf=<format string>   save content to file at the path resulting from the format string, empty to enable
        cwf=<format string>   format to write to file. defaults to "{c}"
        cpf=<format string>   print the result of this format string for each content, empty to disable
                              defaults to "{c}\n" if cpf, csf and cfc are unspecified
        cfc=<chain spec>      forward content match as a virtual document
        cff=<format string>   format of the virtual document forwarded to the cfc chains. defaults to "{c}"
        csin=<bool>           give a prompt to edit the save path for a file
        cin=<bool>            give a prompt to ignore a potential content match
        cl=<bool>             treat content match as a link to the actual content
        cesc=<string>         escape sequence to terminate content in cin mode, defaults to "<END>"
        cenc=<encoding>       default encoding to assume that content is in
        cfenc=<encoding>      encoding to always assume that content is in, even if http(s) says differently

    Labels to give each matched content (mostly useful for the filename in csf):
        lx=<xpath>           xpath for label matching
        lr=<regex>           regex for label matching
        ljs=<js string>      javascript to execute on the page, format args are available as js variables (selenium only)
        lf=<format string>   label format string
        lic=<bool>           match for the label within the content match instead of the whole document
        las=<bool>           allow slashes in labels
        lmm=<bool>           allow multiple label matches in one document instead of picking the first (for all content matches)
        lam=<bool>           allow missing label (default is to skip content if no label is found)
        lfd=<format string>  default label format string to use if there's no match
        lin=<bool>           give a prompt to edit the generated label

    Further documents to scan referenced in already found ones:
        dx=<xpath>           xpath for document matching
        dr=<regex>           regex for document matching
        djs=<js string>      javascript to execute on the page, format args are available as js variables (selenium only)
        df=<format string>   document format string
        dimin=<number>       initial document index, each successful match gets one index
        dimax=<number>       max document index, matching stops here
        dmm=<bool>           allow multiple document matches in one document instead of picking the first
        din=<bool>           give a prompt to ignore a potential document match
        denc=<encoding>      default document encoding to use for following documents, default is utf-8
        dfenc=<encoding>     force document encoding for following documents, even if http(s) says differently
        dsch=<scheme>        default scheme for urls derived from following documents, defaults to "https"
        dpsch=<bool>         use the parent document's scheme if available, defaults to true unless dsch is specified
        dfsch=<scheme>       force this scheme for urls derived from following documents
        doc=<chain spec>     chains that matched documents should apply to, default is the same chain

    Initial Documents:
        url=<url>            fetch a document from a url, derived document matches are (relative) urls
        file=<path>          fetch a document from a file, derived document matches are (relative) file paths
        rfile=<path>         fetch a document from a file, derived document matches are urls

    Other:
        selstrat=<strategy>  matching strategy for selenium (default: plain, values: anymatch, plain, interactive, deduplicate)
        seldl=<dl strategy>  download strategy for selenium (default: external, values: external, internal, fetch)
        owf=<bool>           allow to overwrite existing files, defaults to true

    Format Args:
        Named arguments for <format string> arguments.
        Some only become available later in the pipeline (e.g. {cm} is not available inside cf).

        {cx}                 content xpath match
        {cr}                 content regex match, equal to {cx} if cr is unspecified
        <cr capture groups>  the named regex capture groups (?P<name>...) from cr are available as {name},
                             the unnamed ones (...) as {cg<unnamed capture group number>}
        {cf}                 content after applying cf
        {cjs}                output of cjs
        {cm}                 final content match after link normalization (cl) and user interaction (cin)
        {c}                  content, downloaded from cm in case of cl, otherwise equal to cm

        {lx}                 label xpath match
        {lr}                 label regex match, equal to {lx} if lr is unspecified
        <lr capture groups>  the named regex capture groups (?P<name>...) from lr are available as {name},
                             the unnamed ones (...) as {lg<unnamed capture group number>}
        {lf}                 label after applying lf
        {ljs}                output of ljs
        {l}                  final label after user interaction (lin)

        {dx}                 document link xpath match
        {dr}                 document link regex match, equal to {dx} if dr is unspecified
        <dr capture groups>  the named regex capture groups (?P<name>...) from dr are available as {name},
                             the unnamed ones (...) as {dg<unnamed capture group number>}
        {df}                 document link after applying df
        {djs}                output of djs
        {d}                  final document link after user interaction (din)

        {di}                 document index
        {ci}                 content index
        {dl}                 document link (inside df, this refers to the parent document)
        {cenc}               content encoding, deduced while respecting cenc and cfenc
        {cesc}               escape sequence for separating content, can be overwritten using cesc
        {chain}              id of the match chain that generated this content

        {fn}                 filename from the url of a cm with cl
        {fb}                 basename component of {fn} (extension stripped away)
        {fe}                 extension component of {fn}, including the dot (empty string if there is no extension)


    Chain Syntax:
        Any option above can restrict the matching chains it should apply to using opt<chainspec>=<value>.
        Use "-" for ranges, "," for multiple specifications, and "^" to exclude the chains that follow it.
        Examples:
            lf1,3-5=foo        sets "lf" to "foo" for chains 1, 3, 4 and 5.
            lf2-^4=bar         sets "lf" to "bar" for all chains larger than or equal to 2, except chain 4

    Miscellaneous:
        help                   prints this help
        selinstall=<browser>   installs selenium driver for the specified browser in the directory of this script
        seluninstall=<browser> uninstalls selenium driver for the specified browser in the directory of this script
        selupdate=<browser>    updates (or installs) the local selenium driver for the specified browser
        version                print version information

    Global Options:
        timeout=<seconds>      seconds before a web request times out (default 30)
        bfs=<bool>             traverse the matched documents in breadth first order instead of depth first
        v=<verbosity>          output verbosity levels (default: warn, values: info, warn, error)
        ua=<string>            user agent to pass in the HTTP header for url GETs
        uar=<bool>             use a random user agent
        selkeep=<bool>         keep selenium instance alive after the command finished
        cookiefile=<path>      path to a netscape cookie file. cookies are passed along for url GETs
        sel=<browser|bool>     use selenium (default is firefox) to load urls into an interactive browser session
                               (default: disabled, values: tor, chrome, firefox, disabled)
        selh=<bool>            use selenium in headless mode, implies sel
        tbdir=<path>           root directory of the tor browser installation, implies sel=tor
                               (default: environment variable TOR_BROWSER_DIR)
        mt=<int>               maximum threads for background downloads, 0 to disable. defaults to cpu core count
        repl=<bool>            accept commands in a read eval print loop
        exit=<bool>            exit the repl (with the result of the current command)
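
The chain syntax described in the options list ("-" for ranges, "," for multiple specifications, "^" for exclusions) can be illustrated with a small parser. This is a sketch of one possible reading of the two documented examples, not scr's own parser: the function names, the chain_count parameter, chain ids starting at 0, and the treatment of an empty include part as "all chains" are assumptions.

```python
def parse_items(spec, chain_count):
    """Expand "1,3-5" style items into a set of chain ids below chain_count."""
    chains = set()
    for item in spec.split(","):
        if "-" in item:
            lo, hi = item.split("-", 1)
            start = int(lo) if lo else 0               # "-3" starts at chain 0
            stop = int(hi) if hi else chain_count - 1  # "2-" runs to the last chain
        else:
            start = stop = int(item)
        chains.update(range(start, stop + 1))
    return {c for c in chains if c < chain_count}


def parse_chain_spec(spec, chain_count):
    """Chains selected by spec; everything after "^" is excluded again."""
    include, _, exclude = spec.partition("^")
    selected = parse_items(include, chain_count) if include else set(range(chain_count))
    if exclude:
        selected -= parse_items(exclude, chain_count)
    return sorted(selected)


print(parse_chain_spec("1,3-5", 8))  # [1, 3, 4, 5]
print(parse_chain_spec("2-^4", 6))   # [2, 3, 5]
```

The two calls mirror the lf1,3-5=foo and lf2-^4=bar examples from the help text, with the total number of chains supplied explicitly because an open range such as "2-" needs an upper bound.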
