ftw.crawler

Installation

The easiest way to install ftw.crawler is to create a buildout that contains the configuration, pulls in the egg using zc.recipe.egg, and creates a script in the bin/ directory that launches the crawler with that configuration as an argument:

  • First, create a configuration file for the crawler. You can base your configuration on ftw/crawler/tests/assets/basic_config.py by copying it to your buildout and adapting it as needed.

    Make sure to configure at least the tika and solr URLs to point to the correct locations of the respective services, and to adapt the sites list to your needs.

  • Create a buildout config that installs ftw.crawler using zc.recipe.egg:

    crawler.cfg

    [buildout]
    parts +=
        crawler
        crawl-foo-org
    
    [crawler]
    recipe = zc.recipe.egg
    eggs = ftw.crawler
    
  • Next, define a buildout section that creates a bin/crawl-foo-org script, which calls bin/crawl foo_org_config.py using absolute paths (for easier use from cron jobs):

    [crawl-foo-org]
    recipe = collective.recipe.scriptgen
    cmd = ${buildout:bin-directory}/crawl
    arguments =
        ${buildout:directory}/foo_org_config.py
        --tika http://localhost:9998/
        --solr http://localhost:8983/solr
    

    (The --tika and --solr command line arguments are optional; both can also be set in the configuration file. If given, the command line arguments take precedence over the corresponding parameters in the config file.)

  • Add a buildout config that downloads and configures a Tika JAXRS server:

    tika-server.cfg

    [buildout]
    parts +=
        supervisor
        tika-server-download
        tika-server
    
    [supervisor]
    recipe = collective.recipe.supervisor
    plugins =
          superlance
    port = 8091
    user = supervisor
    password = admin
    programs =
        10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user
    
    [tika-server-download]
    recipe = hexagonit.recipe.download
    url = https://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
    md5sum = 0f70548f233ead7c299bf7bc73bfec26
    download-only = true
    filename = tika-server.jar
    
    [tika-server]
    port = 9998
    recipe = collective.recipe.scriptgen
    cmd = java
    arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}
    

    Modify your_os_user and the supervisor and Tika ports as needed.

  • Finally, add a bootstrap.py and create the buildout.cfg that pulls all of the above together:

    buildout.cfg

    [buildout]
    extensions = mr.developer
    
    extends =
        tika-server.cfg
        crawler.cfg
    
  • Bootstrap and run buildout:

    python bootstrap.py
    bin/buildout
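
The tika-server-download part above pins an md5sum for the Tika jar. If you want to check a downloaded jar by hand (for example after changing the Tika version), the following is a minimal sketch using only the Python standard library. The helper names md5_of_file and matches_expected are hypothetical and not part of ftw.crawler or the download recipe.

```python
import hashlib

# md5sum value copied from the [tika-server-download] part above.
EXPECTED_MD5 = "0f70548f233ead7c299bf7bc73bfec26"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large jars are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

A call such as matches_expected("parts/tika-server-download/tika-server.jar") would then report whether the file matches. That path is an assumption about where the download part stores the jar; use the value of ${tika-server-download:destination} from your own buildout.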
    

Running the crawler

If you created the bin/crawl-foo-org script with the buildout described above, that’s all you need to run the crawler:

  • Make sure Tika and Solr are running
  • Run bin/crawl-foo-org. The path may be relative or absolute and the working directory does not matter, so the script can easily be called from a cron job.
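
One way to check the first point before a cron run is a plain TCP connection test against the Tika and Solr URLs. The sketch below is an assumption-laden helper, not something ftw.crawler provides: is_reachable is a hypothetical name, and a successful connection only shows that something is listening on the port, not that the service is healthy.

```python
import socket

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2


def is_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        conn = socket.create_connection((parts.hostname, port), timeout)
    except socket.error:
        # Covers refused connections, timeouts and name resolution failures.
        return False
    conn.close()
    return True
```

With the example ports from the buildout above, is_reachable("http://localhost:9998/") would cover Tika and is_reachable("http://localhost:8983/solr") would cover Solr; adjust both URLs to your setup.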

Running bin/crawl directly

bin/crawl-foo-org is just a thin wrapper around the bin/crawl script, which is generated by ftw.crawler’s setuptools console_scripts entry point. The wrapper calls bin/crawl with the absolute path to the configuration file as the first argument, followed by any other arguments configured in the buildout. Any additional arguments passed to bin/crawl-foo-org are forwarded to bin/crawl.

Therefore running bin/crawl-foo-org [args] is equivalent to bin/crawl foo_org_config.py [args].
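
Among the arguments that can be passed this way are --tika and --solr, which take precedence over the values in the configuration file. The following sketch illustrates that precedence rule with argparse. It is not ftw.crawler's actual argument handling: build_parser, effective_settings and the settings dictionary are hypothetical stand-ins chosen for the illustration.

```python
import argparse


def build_parser():
    """Parser for a hypothetical crawl-like command line (not ftw.crawler's own)."""
    parser = argparse.ArgumentParser(prog="crawl-sketch")
    parser.add_argument("config")
    # None means "not given", so the value from the config file is kept.
    parser.add_argument("--tika", default=None)
    parser.add_argument("--solr", default=None)
    return parser


def effective_settings(file_settings, argv):
    """Merge settings: a command line value, if given, overrides the file value."""
    args = build_parser().parse_args(argv)
    merged = dict(file_settings)
    for key in ("tika", "solr"):
        value = getattr(args, key)
        if value is not None:
            merged[key] = value
    return merged
```

Using None as the default makes "option not given" distinguishable from any real URL, which is what allows the config file value to survive when the option is omitted.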

Indexing only a particular URL

If you only want to index a particular URL, pass that URL as the first argument to bin/crawl-foo-org. The crawler will then only fetch and index that specific URL.
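
To make the resulting call concrete, here is a small sketch that assembles the command line bin/crawl ends up being run with when a URL is passed to the wrapper. The function crawl_command, the buildout directory and the example URL are hypothetical; only the file name foo_org_config.py is taken from the example buildout.

```python
import os


def crawl_command(buildout_dir, extra_args=()):
    """Build the argv list that the wrapper is described as running.

    Mirrors the documented equivalence: bin/crawl, then the absolute path
    to the configuration file, then any forwarded arguments.
    """
    crawl = os.path.join(buildout_dir, "bin", "crawl")
    config = os.path.join(buildout_dir, "foo_org_config.py")
    return [crawl, config] + list(extra_args)
```

For instance, crawl_command("/opt/buildout", ["http://foo.org/some/page"]) yields a list that could be handed to subprocess.call to index just that one URL.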

Development

To start hacking on ftw.crawler, use the development.cfg buildout:

ln -s development.cfg buildout.cfg
python bootstrap.py
bin/buildout

This will build a Tika JAXRS server and a Solr instance for you. The Solr configuration is set up to be compatible with the testing / example configuration at ftw/crawler/tests/assets/basic_config.py.

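To run the crawler against the example configuration, start Tika and Solr (each in its own terminal, since both run in the foreground), then run the crawler: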

bin/tika-server
bin/solr-instance fg
bin/crawl ftw/crawler/tests/assets/basic_config.py

Changelog

1.1.0 (2016-10-04)

  • Support configuration of absolute sitemap URLs. [jone]
  • Slow down on too many requests. [jone]

1.0 (2015-11-09)

  • Initial implementation. [lgraf]