This is a pre-production deployment of Warehouse. Changes made here affect the production instance of PyPI (pypi.python.org).
Help us improve Python packaging - Donate today!

XML/HTML scraper using XPath queries.

Project Description

Piculet is a package for extracting data from XML documents using XPath queries. It can also scrape web pages by first converting the HTML source into XHTML. Piculet consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications.

PyPI:https://pypi.python.org/pypi/piculet/
Repository:https://bitbucket.org/uyar/piculet
Documentation:https://piculet.readthedocs.io/

Piculet has been tested with Python 2.7, Python 3.4, and PyPy3 5.7. You can install the latest version from PyPI:

pip install piculet

Note

If you want to use the development version, which could have more features but also be less stable, you can install it from the repository:

pip install hg+https://bitbucket.org/uyar/piculet

The Command-Line Interface

Installing Piculet creates a script named piculet which can be used to invoke the command-line interface:

$ piculet -h
usage: piculet [-h] [--debug] command ...

The scrape command extracts data out of a document as described by a specification file:

$ piculet scrape -h
usage: piculet scrape [-h] -s SPEC [--html] document

The location of the document can be given as a file path or a URL. The specification file is in JSON format and contains the rules that define how to extract the data. For example, say you want to extract some data from the file shining.html. An example specification file is given in movie.json. Download both these files and run the command:

piculet scrape -s movie.json shining.html

This should print the following output:

{
  "cast": [
    {
      "character": "Jack Torrance",
      "link": "/people/2",
      "name": "Jack Nicholson"
    },
    {
      "character": "Wendy Torrance",
      "link": "/people/3",
      "name": "Shelley Duvall"
    }
  ],
  "director": {
    "link": "/people/1",
    "name": "Stanley Kubrick"
  },
  "genres": [
    "Horror",
    "Drama"
  ],
  "language": "English",
  "review": "Fantastic movie. Definitely recommended.",
  "runtime": "144 minutes",
  "title": "The Shining",
  "year": "1980"
}

If the document is in HTML format but it is not well-formed XML, the --html option has to be used. If the document address starts with http:// or https://, the given URL is downloaded and the rules are applied to the content. For example, to extract some data from the Wikipedia page for David Bowie, download the wikipedia.json file and run the command:

piculet scrape -s wikipedia.json --html "https://en.wikipedia.org/wiki/David_Bowie"

This should print the following output:

{
  "birthplace": "Brixton, London, England",
  "born": "1947-01-08",
  "died": "2016-01-10",
  "name": "David Bowie",
  "occupation": [
    "Singer",
    "songwriter",
    "actor"
  ]
}

In the same command, change the name part of the URL to Merlene_Ottey and you will get similar data for Merlene Ottey. Note that since the markup used in Wikipedia pages for persons varies, the kinds of data you get with this specification file will also vary.

Piculet can be used as an HTML to XHTML convertor by invoking it with the h2x command. This command takes the file name as input and prints the converted content, as in piculet h2x foo.html. If the input file name is given as - it will read the content from the standard input and therefore can be used as part of a pipe: cat foo.html | piculet h2x -

History

1.0b1 (unreleased)

  • Added optional value transformations.
  • Added support for custom reducer callables.
  • Added command-line option for scraping documents from local files.

1.0a2 (2017-04-04)

  • Added support for Python 2.7.
  • Fixed lxml support.

1.0a1 (2016-08-24)

  • First release on PyPI.
Release History

Release History

This version
History Node

1.0b1

History Node

1.0a2

History Node

1.0a1

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
piculet-1.0b1.tar.gz (28.4 kB) Copy SHA256 Checksum SHA256 Source Apr 26, 2017

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting