Scrapple

A framework for creating web content extractors

Scrapple is a framework for creating web scrapers and web crawlers according to a key-value based configuration file. It provides a command line interface to run the script on a given JSON-based configuration input, as well as a web interface to provide the necessary input.

The primary goal of Scrapple is to abstract away the process of designing web content extractors. The focus is on what to extract rather than how to do it. The user-specified configuration file contains selector expressions (XPath expressions or CSS selectors) and the attribute to be selected. Scrapple handles running this extractor, so the user does not have to write a program. Scrapple can also generate a Python script that implements the desired extractor.

Installation

You can install Scrapple and the system libraries it depends on by running (the apt-get step applies to Debian-based systems):

$ sudo apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev
$ pip install scrapple

Alternatively, you can clone this repository and install the package from source.

$ git clone http://github.com/scrappleapp/scrapple scrapple
$ cd scrapple
$ pip install -r requirements.txt
$ python setup.py install

How to use Scrapple

Scrapple provides four commands to create and implement extractors: genconfig, generate, run, and web.

Scrapple implements the desired extractor on the basis of the user-specified configuration file. There are guidelines regarding how to write these configuration files.

The configuration file is the basic specification of the extractor required. It contains the URL for the web page to be loaded, the selector expressions for the data to be extracted and in the case of crawlers, the selector expression for the links to be crawled through.

The keys used in the configuration file are:

  • project_name : Specifies the name of the project with which the configuration file is associated.

  • selector_type : Specifies the type of selector expressions used. This could be “xpath” or “css”.

  • scraping : Specifies parameters for the extractor to be created.

    • url : Specifies the URL of the base web page to be loaded.

    • data : Specifies a list of selectors for the data to be extracted.

      • selector : Specifies the selector expression.

      • attr : Specifies the attribute to be extracted from the result of the selector expression.

      • field : Specifies the field name under which this data is to be stored.

      • default : Specifies the default value to be used if the selector expression fails.

    • table : Specifies a description for scraping tabular data.

      • table_type : Specifies the type of table (“rows” or “columns”). This determines how the table is extracted. A row extraction takes a single row and maps it to the set of headers. A column extraction takes entire columns of data, giving a list of header-value mappings, one for each row (a sketch of a column extraction follows this list).

      • header : Specifies the headers to be used for the table. This can be a list of headers, or a selector that gives the list of headers.

      • prefix : Specifies a prefix to be added to each header.

      • suffix : Specifies a suffix to be added to each header.

      • selector : Specifies the selector for the data. For row extraction, this is a selector that gives the row to be extracted. For column extraction, this is a list of selectors for each column.

      • attr : Specifies the attribute to be extracted from the selected tag.

      • default : Specifies the default value to be used if the selector does not return any data.

    • next : Specifies the crawler implementation.

      • follow_link : Specifies the selector expression for the <a> tags to be crawled through.
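
As a sketch of how the table keys fit together, here is a hypothetical configuration for a column extraction. The URL, headers and selectors are illustrative, not taken from a real page, and the data section is omitted for brevity.

{
    "project_name": "standings",
    "selector_type": "xpath",
    "scraping": {
        "url": "http://example.com/standings",
        "table": [
            {
                "table_type": "columns",
                "header": ["team", "wins", "losses"],
                "prefix": "",
                "suffix": "",
                "selector": [
                    "//table//tr/td[1]",
                    "//table//tr/td[2]",
                    "//table//tr/td[3]"
                ],
                "attr": "text",
                "default": ""
            }
        ]
    }
}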

The main objective of the configuration file is to specify extraction rules in terms of selector expressions and the attribute to be extracted. There are certain fixed forms of selector/attribute value pairs that perform particular types of content extraction.

Selector expressions:

  • CSS selector or XPath expressions that specify the tag to be selected.

  • “url” to take the URL of the current page on which extraction is being performed.

Attribute selectors:

  • “text” to extract the textual content from that tag.

  • “href”, “src” etc., to extract any of the other attributes of the selected tag.
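
For instance, a hypothetical data section combining these forms might look like the following. The field names and selectors are illustrative, and it is assumed here that the attr value is ignored when the selector is “url”.

"data": [
    {
        "field": "page_url",
        "selector": "url",
        "attr": "",
        "default": ""
    },
    {
        "field": "title",
        "selector": "//h1",
        "attr": "text",
        "default": "<no_title>"
    },
    {
        "field": "photo",
        "selector": "//img[@class='photo']",
        "attr": "src",
        "default": "<no_image>"
    }
]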

Tutorials

For a more detailed walkthrough, check out the tutorial in the documentation.

In this simple example for using Scrapple, we’ll extract NBA player information from the ESPN website.

To create the skeleton configuration file, we first use the genconfig command.

$ scrapple genconfig nba http://espn.go.com/nba/teams --type=crawler --levels=2

This creates nba.json, a skeleton Scrapple configuration file for a crawler that uses XPath expressions as selectors. This can then be edited to specify the required follow-link selector, data selectors, and attributes.

{
    "project_name": "nba",
    "selector_type": "xpath",
    "scraping": {
        "url": "http://espn.go.com/nba/teams",
        "data": [
            {
                "field": "",
                "selector": "",
                "attr": "",
                "default": ""
            }
        ],
        "next": [
            {
                "follow_link": "//*[@class='mod-content']//a[3]",
                "scraping": {
                    "data": [
                        {
                            "field": "team",
                            "selector": "//h2",
                            "attr": "text",
                            "default": "<no_team>"
                        }
                    ],
                    "next": [
                        {
                            "follow_link": "//*[@class='mod-content']/table[1]//tr[@class!='colhead']//a",
                            "scraping": {
                                "data": [
                                    {
                                        "field": "name",
                                        "selector": "//h1",
                                        "attr": "text",
                                        "default": "<no_name>"
                                    },
                                    {
                                        "field": "headshot_link",
                                        "selector": "//*[@class='main-headshot']/img",
                                        "attr": "src",
                                        "default": "<no_image>"
                                    },
                                    {
                                        "field": "number & position",
                                        "selector": "//ul[@class='general-info']/li[1]",
                                        "attr": "text",
                                        "default": "<00> #<GFC>"
                                    }
                                ],
                                "table": [
                                    {
                                        "table_type": "rows",
                                        "header": "//div[@class='player-stats']//table//th",
                                        "prefix": "season_",
                                        "suffix": "",
                                        "selector": "//div[@class='player-stats']//table//tr[1]/td",
                                        "attr": "text",
                                        "default": ""
                                    },
                                    {
                                        "table_type": "rows",
                                        "header": "//div[@class='player-stats']//table//th",
                                        "prefix": "career_",
                                        "suffix": "",
                                        "selector": "//div[@class='player-stats']//table//tr[@class='career']/td",
                                        "attr": "text",
                                        "default": ""
                                    }
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    }
}

The extractor can be run using the run command:

$ scrapple run nba nba_players -o json

This creates nba_players.json, which contains the extracted data. An example snippet of this data:

{

    "project": "nba",
    "data": [

        # nba_players.json continues

        {
            "career_APG" : "9.9",
            "career_PER" : "",
            "career_PPG" : "18.6",
            "career_RPG" : "4.4",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/2779.png&w=350&h=254",
            "name" : "Chris Paul",
            "number & position" : "#3 PG",
            "season_APG" : "9.2",
            "season_PER" : "23.49",
            "season_PPG" : "17.6",
            "season_RPG" : "3.5",
            "team" : "Los Angeles Clippers"
        },
        {
            "career_APG" : "3.6",
            "career_PER" : "",
            "career_PPG" : "20.3",
            "career_RPG" : "5.8",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/662.png&w=350&h=254",
            "name" : "Paul Pierce",
            "number & position" : "#34 SF",
            "season_APG" : "0.9",
            "season_PER" : "7.55",
            "season_PPG" : "5.0",
            "season_RPG" : "2.6",
            "team" : "Los Angeles Clippers"
        },
        {
            "career_APG" : "2.9",
            "career_PER" : "",
            "career_PPG" : "3.7",
            "career_RPG" : "1.8",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/4182.png&w=350&h=254",
            "name" : "Pablo Prigioni",
            "number & position" : "#9 PG",
            "season_APG" : "1.9",
            "season_PER" : "8.72",
            "season_PPG" : "2.3",
            "season_RPG" : "1.5",
            "team" : "Los Angeles Clippers"
        },
        {
            "career_APG" : "2.0",
            "career_PER" : "",
            "career_PPG" : "11.1",
            "career_RPG" : "1.9",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/3024.png&w=350&h=254",
            "name" : "J.J. Redick",
            "number & position" : "#4 SG",
            "season_APG" : "1.6",
            "season_PER" : "18.10",
            "season_PPG" : "15.9",
            "season_RPG" : "1.5",
            "team" : "Los Angeles Clippers"
        },

        # nba_players.json continues
    ]

}

The run command can also be used to create a CSV file with the extracted data, using the --output_type=csv argument.
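
For example, an illustrative variant of the run command above, writing the extracted data to a CSV file instead:

$ scrapple run nba nba_players --output_type=csv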

The generate command can be used to produce a Python script that implements this extractor. In essence, it replicates the execution of the run command.

$ scrapple generate nba nba_script -o json

This creates nba_script.py, which extracts the required data and stores the data in a JSON document.
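
The generated script is a standalone program. As a rough sketch (not the verbatim generated code), the core logic of such a script looks something like the following, assuming the requests and lxml libraries that Scrapple builds on; the extract helper and its arguments are illustrative.

import json

import requests
from lxml import html


def extract(url, fields):
    # Load the page and parse it into an element tree.
    tree = html.fromstring(requests.get(url).content)
    record = {}
    for field, spec in fields.items():
        # Apply the XPath expression; fall back to the default on no match.
        results = tree.xpath(spec["selector"])
        if not results:
            record[field] = spec["default"]
        elif spec["attr"] == "text":
            record[field] = results[0].text_content().strip()
        else:
            record[field] = results[0].get(spec["attr"], spec["default"])
    return record


if __name__ == "__main__":
    record = extract(
        "http://espn.go.com/nba/teams",
        {"team": {"selector": "//h2", "attr": "text", "default": "<no_team>"}},
    )
    with open("nba_script.json", "w") as f:
        json.dump({"project": "nba", "data": [record]}, f, indent=4)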

Documentation

You can read the complete documentation for extensive coverage of the background behind Scrapple, a thorough explanation of the Scrapple package implementation, and tutorials on how to use Scrapple to run your scraper/crawler jobs.

Authors

Scrapple is maintained by Alex Mathew and Harish Balakrishnan.

History

0.3.0 - 2016-09-23

  • Set up table scraping parameters and execution

  • Fix json configuration generation

0.2.6 - 2015-11-27

  • Edit requirements

0.2.5 - 2015-05-28

  • Add levels argument for genconfig command, to create crawler config files for a specific depth

0.2.4 - 2015-04-13

  • Update documentation

  • Minor fixes

0.2.3 - 2015-03-11

  • Include implementation to use csv as the output format

0.2.2 - 2015-02-22

  • Fix bug in generate script template

0.2.1 - 2015-02-21

  • Update tests

0.2.0 - 2015-02-20

  • Include implementation for scrapple run and scrapple generate for crawlers

  • Modify web interface for editing scraper config files

  • Revise skeleton configuration files

0.1.1 - 2015-02-10

  • Release on PyPI with revisions

  • Include web interface for editing scraper config files

  • Modify implementations of certain functions

0.1.0 - 2015-02-04

  • First release on PyPI
