Command-line utility for easy scraping of HTML documents

=================================
screp, easy command-line scraping
=================================


What is screp?
==============

**screp** is a command-line utility that provides easy and flexible scraping of HTML documents. It
works by finding a set of *anchors* (specified using a CSS selector) and then extracting information
relative to those anchors, optionally post-processing it using a set of standard operations. For each
anchor it outputs a record in one of the supported formats (CSV, JSON or general).


Invoking screp
==============

**screp** is invoked using the following syntax::

    $ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]

where:

* FORMAT_SPEC is a format specification, one of:

  - *-c CSV_FORMAT_SPEC*, formats each record as a comma-separated-values row
  - *-j JSON_FORMAT_SPEC*, formats each record as a JSON object and the whole output as a list of
    JSON objects
  - *-f GENERAL_FORMAT_SPEC*, formats each record according to a general format in which term
    specifications are replaced by their computed values (similar to bash parameter substitution)

* PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as detailed below
* each FILE in FILES can be either a local file or an absolute URL; if no FILES are specified, the
  standard input is read


How does screp work?
====================

**screp** automates many of the steps you would take when writing your own scraper, such as:

* fetching the HTML documents, if necessary
* parsing HTML
* locating areas of interest in the DOM of the document
* locating interesting information around those areas
* simple processing of these pieces of information
* formatting of the information
* outputting the information

To use screp, you need to take a series of steps:

* tell screp where to get the HTML documents; it works with multiple documents, from sources such
  as the web, the local file-system or STDIN
* define the *primary anchor* using a CSS selector: the matched elements are the ones through which
  you access records of interest in the HTML documents
* specify the output format; this implies specifying:

  - *terms*, which are strings computed relative to the anchors
  - how these terms are combined to produce a record; currently screp supports three methods of
    specifying formats:

    - CSV
    - JSON
    - general format

* optionally, you can also define *secondary anchors*, which are elements computed relative to the
  *primary anchor* that can be used to define *terms* in a more succinct way
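
The steps above can be sketched by hand in Python to show the kind of work screp automates. This is
only an illustration under stated assumptions, not screp's implementation: it uses the
standard-library 'xml.etree.ElementTree' parser (which needs well-formed markup) and ElementTree
path expressions instead of CSS selectors, and the sample document and the 'scrape' function are
made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
SAMPLE = """
<html><body>
  <div class="box"><a class="item" title="Tea">tea page</a><span class="price"> 3.50 </span></div>
  <div class="box"><a class="item" title="Coffee">coffee page</a><span class="price"> 4.20 </span></div>
</body></html>
"""


def scrape(document):
    """Return one record (a dict of strings) per anchor element."""
    root = ET.fromstring(document)  # parse the document
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    records = []
    for anchor in root.findall(".//a[@class='item']"):  # locate the anchors
        box = parent_of[anchor]  # move relative to the anchor
        records.append({
            "name": anchor.get("title"),
            # Simple processing: strip the whitespace around the price.
            "price": box.find("span[@class='price']").text.strip(),
        })
    return records


for record in scrape(SAMPLE):
    print(record)
```

With screp, the parsing, the navigation relative to each anchor and the formatting of records are
expressed on the command line instead of being written out like this.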

Defining terms
==============

A *term* has the following format::

    anchor.accessor.accessor.accessor|filter|filter|filter

In other words, a term is an anchor (primary or secondary) followed by zero or more accessors
followed by zero or more filters.
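
To make the shape of a term concrete, here is a minimal sketch that splits a term string into its
anchor, accessors and filters. It is not screp's parser: the helper names 'split_top_level' and
'split_term' are hypothetical, and the sketch assumes that separators inside quotes or parentheses
belong to a parameter and that quotes are never escaped.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators inside quotes or parentheses."""
    parts, current, depth, quote = [], [], 0, None
    for ch in text:
        if quote:  # inside a quoted string: only look for the closing quote
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
            continue  # do not keep the separator itself
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


def split_term(term):
    """Return (anchor, accessors, filters) for a term string."""
    head, *filters = split_top_level(term, "|")
    anchor, *accessors = split_top_level(head, ".")
    return anchor, accessors, filters


print(split_term('$.parent.desc(".price").text | trim'))
```

The dot inside '".price"' is kept as part of the 'desc' parameter because it appears inside quotes
and parentheses.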

*Accessors* and *filters* (also collectively called *actions*) are functions that take the output
value of the last function (or the anchor, if this is the first action) and output another value. In
other words, they form a pipeline. Accessors act on DOM elements and sets (actually ordered lists)
of elements, whereas filters act on strings. Each action has an in_type and an out_type. For a term
to be correctly defined the out_type of an action needs to match the in_type of the following
action.

The supported types are: 'string', 'element', 'element_set'.
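
The type-matching rule can be illustrated with a short sketch. It is not screp's code: the
'ACTION_TYPES' table and the 'term_out_type' function are hypothetical names, the table lists only
a few of the actions described in this document, and the anchor is assumed to have type 'element'.

```python
# (in_type, out_type) for a few of the actions described in this document.
ACTION_TYPES = {
    "parent": ("element", "element"),
    "desc": ("element", "element_set"),
    "first": ("element_set", "element"),
    "attr": ("element", "string"),
    "text": ("element", "string"),
    "upper": ("string", "string"),
    "trim": ("string", "string"),
}


def term_out_type(action_names, anchor_type="element"):
    """Walk the pipeline and return its final out_type, or raise on a mismatch."""
    current = anchor_type
    for name in action_names:
        in_type, out_type = ACTION_TYPES[name]
        if in_type != current:
            raise TypeError(f"{name} expects {in_type}, got {current}")
        current = out_type
    return current


print(term_out_type(["parent", "parent", "attr", "upper"]))  # string
print(term_out_type(["desc", "first"]))                      # element
```

A pipeline such as 'text' followed by 'parent' is rejected by this check, because 'text' outputs a
'string' while 'parent' expects an 'element'.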

Actions can have zero or more parameters. An action that takes parameters is written as a
function call::

    action(parameter1, parameter2, parameter3)

Otherwise, only the action name is specified (no parentheses).

Finally, terms have restrictions on the out_type of their last action (also called the out_type of
the term):

* if a term is used inside a format specification, its out_type must be 'string'
* if a term is used to define a secondary anchor, its out_type must be 'element'

Examples of terms
-----------------

These are correct term definitions::

    '$.parent.parent.attr(title)|upper' outputs 'string'
    '@.desc(".record").first' outputs 'element'
    'anchor.ancestors(".box").children(".price")' outputs 'element_set'

Predefined anchors and actions
==============================

The following anchors are predefined:

* **$** is the primary anchor defined by the primary anchor selector
* **@** is the anchor representing the root of the current document

The following accessors are predefined:

* **first** [in_type='element_set', out_type='element']: returns the first element in an element_set
* **last** [in_type='element_set', out_type='element']: returns the last element in an element_set
* **nth(n)** [in_type='element_set', out_type='element']: returns the n-th element in an
  element_set; it also supports negative indexes, where -1 represents the last element, -2 the
  second-to-last element, and so on
* **class** [in_type='element', out_type='string']: returns the value of the 'class' attribute
* **id** [in_type='element', out_type='string']: returns the value of the 'id' attribute
* **parent** [in_type='element', out_type='element']: returns the parent of the current element
* **text** [in_type='element', out_type='string']: returns the text enclosed by the current element
* **tag** [in_type='element', out_type='string']: returns the tag of the current element
* **attr(attr_name)** [in_type='element', out_type='string']: returns the value of the current element's
  attribute with name 'attr_name'
* **desc(css_sel)** [in_type='element', out_type='element_set']: returns the ordered list of
  descendants of the current element selected by the CSS selector specified by 'css_sel'
* **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to
  .desc(css_sel).first
* **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns the list of ancestors
  of the current element that satisfy the CSS selector specified by 'css_sel'
* **children(css_sel)** [in_type='element', out_type='element_set']: returns the list of children
  of the current element that satisfy the CSS selector specified by 'css_sel'
* **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of preceding
  siblings of the current element that satisfy the CSS selector specified by 'css_sel'
* **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of following
  siblings of the current element that satisfy the CSS selector specified by 'css_sel'
* **siblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of siblings of
  the current element that satisfy the CSS selector specified by 'css_sel'
* **matching(css_sel)** [in_type='element_set', out_type='element_set']: filters an element_set and
  returns all elements that match the CSS selector specified by 'css_sel'

The following filters are predefined:

* **upper** [in_type='string', out_type='string']: converts the string to uppercase
* **lower** [in_type='string', out_type='string']: converts the string to lowercase
* **trim** [in_type='string', out_type='string']: removes spaces at the beginning and end of the
  string
* **strip(chars)** [in_type='string', out_type='string']: removes characters specified by 'chars'
  at the beginning and end of the string
* **replace(old, new)** [in_type='string', out_type='string']: replaces all occurrences of 'old' with
  'new'
* **resub(pattern, repl)** [in_type='string', out_type='string']: performs a regular expression
  substitution; *pattern* and *repl* have the formats accepted by the **re.sub** function from the
  Python standard library
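
The filters above map naturally onto Python string operations. The following sketch shows that
correspondence and how a chain of filters is applied left to right. It is an assumption about
behaviour, not screp's source: the 'FILTERS' table and 'apply_filters' function are hypothetical,
and 'trim' is modelled with 'str.strip()', which removes all leading and trailing whitespace rather
than only spaces.

```python
import re

# Hypothetical mapping from filter names to plain Python string operations.
FILTERS = {
    "upper": lambda s: s.upper(),
    "lower": lambda s: s.lower(),
    "trim": lambda s: s.strip(),  # assumption: strips all surrounding whitespace
    "strip": lambda s, chars: s.strip(chars),
    "replace": lambda s, old, new: s.replace(old, new),
    "resub": lambda s, pattern, repl: re.sub(pattern, repl, s),
}


def apply_filters(value, steps):
    """Apply (name, args) steps left to right, like term|f1|f2."""
    for name, args in steps:
        value = FILTERS[name](value, *args)
    return value


# Roughly what "... | trim | resub("\D+", "")" would compute for this input.
print(apply_filters("  Price: 12 USD ", [("trim", ()), ("resub", (r"\D+", ""))]))
```

Note that 're.sub' takes the pattern and the replacement before the string, which matches the
parameter order of **resub(pattern, repl)**.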

Specifying output formats
=========================

CSV format
----------

The CSV output format is specified using the -c option. Optionally, use the -H option to specify a
CSV header that is output before the records.

Example::

    -c '$.attr(title), $.parent.desc(".price").text | trim' -H 'name, price'
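
As a rough picture of what a header followed by one row per record looks like, here is a small
sketch using Python's 'csv' module. The record values are made up, the 'to_csv' function is a
hypothetical helper, and whether screp quotes fields the same way as the 'csv' module is an
assumption.

```python
import csv
import io


def to_csv(records, header=None):
    """Format records (sequences of strings) as CSV text, with an optional header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if header:
        writer.writerow(header)  # plays the role of the -H header
    writer.writerows(records)    # one row per record
    return out.getvalue()


# Hypothetical computed term values for two anchors.
rows = [("Tea, green", "3.50"), ("Coffee", "4.20")]
print(to_csv(rows, header=("name", "price")), end="")
```

The first value contains a comma, so the 'csv' module wraps it in double quotes to keep the row at
two columns.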


JSON format
-----------

The JSON output format is specified using the -j option. It formats the output as a JSON list of
objects, one for each record. The *--indent-json* flag tells screp to indent each object. The format
is specified as a comma-separated list of *key=value* pairs, where the *key* is the JSON key
in the record object and the *value* is a term specification.

Example::

    -j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'
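
The shape of the resulting output can be sketched with Python's 'json' module. The values below
are invented, 'to_json' is a hypothetical helper, and using 'indent=2' to stand in for the
indenting flag is an assumption about the layout, not a statement of what screp prints.

```python
import json

# Hypothetical computed values for two anchors, keyed as in the example above.
records = [
    {"text": "tea", "ptext": "TEA BOX", "gptext": "shelf 1"},
    {"text": "coffee", "ptext": "COFFEE BOX", "gptext": "shelf 2"},
]


def to_json(records, indent=None):
    """Serialize all records as a single JSON list of objects."""
    return json.dumps(records, indent=indent)


print(to_json(records))            # compact form
print(to_json(records, indent=2))  # indented form
```

Each key on the left of a *key=value* pair becomes a key of the record object, and the whole
output is one list rather than one JSON document per record.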


General format
--------------

The general format is specified using the -f option as a free-form string containing term
specifications. To distinguish them from the surrounding text, each term specification is surrounded
by braces. When formatting a record, each term specification is replaced with the computed value for
that term.

Example::

    -f 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
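
The substitution step can be sketched as a regular-expression replacement over the format string.
This is an illustration, not screp's implementation: the 'render' function is hypothetical, the
term values are looked up in a made-up dictionary instead of being computed from a document, and
the sketch assumes that terms never contain braces themselves.

```python
import re


def render(template, evaluate):
    """Replace each {term} in template with evaluate(term)."""
    return re.sub(
        r"\{([^{}]*)\}",
        lambda match: evaluate(match.group(1).strip()),
        template,
    )


# Stand-in for evaluating terms against one anchor: a lookup of invented values.
values = {'$.parent.text | replace("X", "Y")': "Y-ray", "$.tag": "a"}
template = 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
print(render(template, values.__getitem__))
```

Text outside the braces is copied unchanged, so the format string controls all separators and
surrounding words in the record.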


Specifying secondary anchors
============================

Secondary anchors are specified using the -a option. There can be any number of secondary anchor
definitions. Each definition has the format **<name>=<term>** where <name> is an identifier and
<term> is a term definition that is relative to any of the previously defined anchors (primary or
secondary) and outputs an element. Secondary anchors can be redefined in later -a options, but
only the last definition is retained.

Secondary anchor examples
-------------------------

These are examples of secondary anchor definitions::

    -a 'p=$.parent' -a 'gp=p.parent'

    -a 'interesting=$.fdesc(".interesting-class")' -a 'interesting=interesting.parent'
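
One plausible reading of these definitions is that they are processed in order, each one resolved
against the anchors defined so far, with a later definition replacing an earlier one of the same
name. The sketch below expands each definition into a term relative to a predefined anchor under
that assumption; 'expand_anchors' is a hypothetical helper and screp may resolve anchors
differently.

```python
def expand_anchors(definitions, predefined=("$", "@")):
    """Expand name=term definitions, in order, into terms on predefined anchors."""
    expanded = {name: name for name in predefined}
    for definition in definitions:
        name, term = (part.strip() for part in definition.split("=", 1))
        base, dot, rest = term.partition(".")
        if base not in expanded:
            raise ValueError(f"unknown anchor {base!r} in {definition!r}")
        # A later definition with the same name replaces the earlier one.
        expanded[name] = expanded[base] + dot + rest
    return expanded


print(expand_anchors(["p=$.parent", "gp=p.parent"])["gp"])
print(expand_anchors([
    'interesting=$.fdesc(".interesting-class")',
    "interesting=interesting.parent",
])["interesting"])
```

Under this reading, the second 'interesting' definition refers to the first one while it is being
resolved, and only the combined result remains available to later terms.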
