This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description
======================================
**screp**, easy command-line scrapping
======================================


What is screp?
==============

**screp** is a command line utility that provides easy and flexible scrapping
of HTML documents. It works by finding a set of *anchors* (specified using a
CSS selector) and then extracting information relative to those anchors,
optionally post processing it using a set of standard operations. For each
anchor it outputs a record formatted according to one of the supported formats
(CSV, JSON or general).


Invoking screp
==============

**screp** is invoked using the following syntax::

$ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]

where:
* FORMAT_SPEC is a format specification, one of:
- *-c CSV_FORMAT_SPEC*, formats each record as a comma-separated-values row
- *-j JSON_FORMAT_SPEC*, formats each record as a JSON object and the whole
output as a list of JSON objects
- *-f GENERAL_FORMAT_SPEC*, formats each record according to a general format
where computed values are substituted to their specifications (similar to
bash parameter substitution)
* PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as
detailed below
* FILE can be either a local file or an absolute URL; if no FILEs are specified
the standard input is read


How does screp work?
====================

**screp** tries to automate many of the steps taken when writing your own
scrapper, steps like:

* fetching the HTML documents, if necessary
* parsing HTML
* locating areas of interest in the DOM of the document
* locating interesting information around those areas
* simple processing of these pieces of information
* formatting of the information
* outputting the information

To use screp, you need to take a series of steps:
* tell screp where to take the HTML documents; it works with multiple
documents, from sources such as the web, the local file-system or STDIN
* define the *primary anchor* using a CSS selector: these are elements through
which you access records of interest in the HTML documents
* specify the output format; this implies specifying:
- *terms*, which are string computed relative to the anchors
- how these terms are combined to produce a record; currently screp supports
three methods of specifying formats:
- CSV
- JSON
- general format
* optionally, you can also define *secondary anchors*, which are elements
computed relative to the *primary anchor* that can be used to define *terms*
in a more succinct way

Defining terms
==============

A *term* has the following format::

anchor.accessor.accessor.accessor|filter|filter|filter

In other words, a term is an anchor(primary or secondary) followed by zero or
more accessors followed by zero or more filters.

*Accessors* and *filters* (also collectively called *actions*) are functions
that take the output value of the last function (or the anchor, if this is the
first action) and output another value. In other words, they form a pipeline.
Accessors act on DOM elements and sets (actually ordered lists) of elements,
whereas filters act on strings. Each action has an in_type and an out_type. For
a term to be correctly defined the out_type of an action needs to match the
in_type of the following action.

The supported types are: 'string', 'element', 'element_set'.

Actions can have zero or more parameters. When the action takes parameters it
is specified as a function::

action(parameter1, parameter2, parameter3)

When not, only the action name is specified (no parentheses).

Finally, terms have restrictions of the out_type of their last action (also
called the out_type of the term):
* if a term is used inside a format specification, its out_type must be
'string'
* if a term is used to define a secondary anchor, its out_type must be
'element'

Examples of terms
-----------------

These are correct term definitions::

'$.parent.parent.attr(title)|upper' outputs 'string'
'@.desc(".record").first' outputs 'element
'anchor.ancestors(".box").children(".price")' outputs 'element_set'

Predefined anchors and actions
==============================

The following anchors are predefined:
* **$** is the primary anchor defined by the primary anchor selector
* **@** is the primary anchor representing the root of the current document

The following accessors are predefined:
* **first** [in_type='element_set', out_type='element']: returns the first
element in an element_set
* **last** [in_type='element_set', out_type='element']: returns the last
element in an element_set
* **nth(n)** [in_type='element_set', out_type='element']: returns the n-th
element in an element_set; it also supports negative indexes, where -1
represents the last element, -2 the second-to-last element, and so on
* **class** [in_type='element', out_type='string']: returns the value of the
'class' attribute * **id** [in_type='element', out_type='string']: returns
the value of the 'id' attribute * **parent** [in_type='element',
out_type='element']: returns the parent of the current element
* **text** [in_type='element', out_type='string']: returns the text enclosed by
the current element
* **tag** [in_type='element', out_type='string']: returns the tag of the
current element
* **attr(attr_name)** [in_type='element', out_type='string']: returns the value
of the current element's attribute with name 'attr_name'
* **desc(css_sel)** [in_type='element', out_type='element_set']: returns the
ordered list of descendants of the current element selected by the CSS
selector specified by 'css_sel'
* **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to
.desc(css_sel).first
* **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns
the list of ancestors of the current element that satisfy the CSS selector
specified by 'css_sel'
* **children(css_sel)** [in_type='element', out_type='element_set']: returns
the list of children of the current element that satisfy the CSS selector
specified by 'css_sel'
* **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns
the list of preceding siblings of the current element that satisfy the CSS
selector specified by 'css_sel'
* **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns
the list of following siblings of the current element that satisfy the CSS
selector specified by 'css_sel'
* **siblings(css_sel)** [in_type='element', out_type='element_set']: returns
the list of siblings of the current element that satisfy the CSS selector
specified by 'css_sel'
* **matching(css_sel)** [in_type='element_set', out_type='element_set']:
filters an element_set and returns all elements that match the CSS selector
specified by 'css_sel'

The following filters are predefined:
* **upper** [in_type='string', out_type='string']: converts string to uppercase
* **lower** [in_type='string', out_type='string']: converts string to lowercase
* **trim** [in_type='string', out_type='string']: removes spaces at the
beginning and end of the string
* **strip(chars)** [in_type='string', out_type='string']: removes characters
specified by 'chars' at the beginning and end of the string
* **replace(old, new)** [in_type='string', out_type='string']: replaces all
occurrences of 'old' with 'new'
* **resub(pattern, repl)** [in_type='string', out_type='string']: performs a
regular expression substitution; *pattern* and *repl* are have the formats
taken by the **re.sub** Python function from the standard Python library;

Specifying output formats
=========================

CSV format
----------

The CSV output format is specified using the -c option. Optionally, using the
-H option you can specify a CSV header to output before outputting records.

Example::

-c '$.attr(title), $.parent.desc(".price").text | trim' -H 'name, price'


JSON format
-----------

The JSON output format is defined using the -j option. It formats the output as
a JSON list of objects, one for each record. The *--indent-json* flat tells
screp to indent each object. The format is specified as a comma-separated list
of *key=value* pairs, where the *key* represents the JSON key in the record
object while *value* is a term specification.

Example::

- j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'


General format
--------------

Then general format is specified by a general string containing term
specifications. To distinguish it from the general format, each term
specification is surrounded by braces. When formatting a record each term
specification is substituted with the computed value for that term.

Example::

-f 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some
tail'


Specifying secondary anchors
============================

Secondary anchors are specified using the -a option. There can be any number of
secondary anchors definitions. The definitions have the format
**<name>=<term>** where <name> is an identifier and <term> is a term definition
relative to any of the previously defined anchors (primary or secondary) that
has outputs an element. Secondary anchors can be redefined in later -a options
but only the last definition is retained.

Secondary anchors examples
--------------------------

These are examples of secondary anchors definitions::

-a 'p=$.parent' -a 'gp=p.parent'

-a 'interesting=$.fdesc(".interesting-class")' -a
'interesting=interesting.parent'
Release History

Release History

0.3.2

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.3.1

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.3

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
screp-0.3.2.tar.gz (19.5 kB) Copy SHA256 Checksum SHA256 Source Feb 17, 2013

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting