
Crawling - html to import

transmogrify.webcrawler crawls html to extract pages and files as a source for your transmogrifier pipeline. transmogrify.webcrawler.typerecognitor aids in setting '_type' based on the crawled mimetype. transmogrify.webcrawler.cache helps speed up crawling and reduce memory usage by storing items locally.

These blueprints are designed to work with the funnelweb pipeline but can be used independently.

transmogrify.webcrawler

A source blueprint for crawling content from a site or local html files.

Webcrawler imports HTML either from a live website, from a folder on disk, or from a folder on disk containing html that was saved from a live website and may still have absolute links referring to that website.

To crawl a live website, supply the crawler with a base http url to start crawling from. This url must be a prefix of every other url you want to crawl from the site.

For example:

[crawler]
blueprint = transmogrify.webcrawler
url  = http://www.whitehouse.gov
max = 50

will restrict the crawler to the first 50 pages.
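The base-url restriction above amounts to a prefix check on every discovered link. A minimal sketch of that logic (function name hypothetical, not the blueprint's actual code):

```python
def in_scope(url, base="http://www.whitehouse.gov"):
    # Only urls that start with the base url are candidates for crawling
    return url.startswith(base)

print(in_scope("http://www.whitehouse.gov/about"))  # True
print(in_scope("http://www.senate.gov/"))           # False
```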

You can also crawl a local directory of html with relative links by using a file: style url:

[crawler]
blueprint = transmogrify.webcrawler
url = file:///mydirectory

or, if the local directory contains html saved from a website and so might have absolute urls in it, you can set the directory as the cache. The crawler will always look in the cache first:

[crawler]
blueprint = transmogrify.webcrawler
url = http://therealsite.com --crawler:cache=mydirectory

The following will not crawl anything larger than 400,000 bytes:

[crawler]
blueprint = transmogrify.webcrawler
url  = http://www.whitehouse.gov
maxsize=400000

To skip crawling links that match certain regular expressions:

[crawler]
blueprint = transmogrify.webcrawler
url=http://www.whitehouse.gov
ignore = \.mp3
                 \.mp4
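The ignore option is a newline-separated list of regular expressions; a url is skipped if any of them matches it. A minimal sketch of that matching, assuming Python's re semantics (function name hypothetical):

```python
import re

ignore_patterns = [r"\.mp3", r"\.mp4"]

def should_ignore(url, patterns=ignore_patterns):
    # A url is skipped if any ignore regex matches anywhere in it
    return any(re.search(p, url) for p in patterns)

print(should_ignore("http://www.whitehouse.gov/podcast.mp3"))  # True
print(should_ignore("http://www.whitehouse.gov/index.html"))   # False
```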

If webcrawler is having trouble parsing the html of some pages, you can preprocess the html before it is parsed, e.g.

[crawler]
blueprint = transmogrify.webcrawler
patterns = (<script>)[^<]*(</script>)
subs = \1\2
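Each line of patterns is paired with the corresponding line of subs and applied to the raw html as a regex substitution. A sketch of that preprocessing under the stated assumptions (the <EMPTYSTRING> convention is described under Options below; function name hypothetical):

```python
import re

# patterns/subs pairs, as they would be read from the config above
patterns = [r"(<script>)[^<]*(</script>)"]
subs = [r"\1\2"]

def preprocess(html):
    for pat, sub in zip(patterns, subs):
        # <EMPTYSTRING> stands in for an empty replacement, since
        # buildout cannot express a blank line in a config value
        replacement = "" if sub == "<EMPTYSTRING>" else sub
        html = re.sub(pat, replacement, html)
    return html

print(preprocess("<script>var x=1;</script><p>hi</p>"))
# -> <script></script><p>hi</p>
```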

If you'd like to skip processing links with certain mimetypes, you can use the drop section's condition option. This TALES expression determines what will be processed further; see http://pypi.python.org/pypi/collective.transmogrifier/#condition-section

[drop]
blueprint = collective.transmogrifier.sections.condition
condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']
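The TALES condition above is ordinary Python applied to each item; expressed as a plain function (name hypothetical), it reads:

```python
def keep(item):
    # Equivalent of the TALES condition: drop javascript, css, plain text,
    # java bytecode mimetypes, and anything whose path ends in .class
    skip_mimetypes = ['application/x-javascript', 'text/css', 'text/plain',
                      'application/x-java-byte-code']
    return (item.get('_mimetype') not in skip_mimetypes
            and item.get('_path', '').split('.')[-1] not in ['class'])

print(keep({'_mimetype': 'text/html', '_path': 'index.html'}))  # True
print(keep({'_mimetype': 'text/css', '_path': 'style.css'}))    # False
```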

Options:

url:
  • the top url to crawl (called site_url before 1.0b1)
ignore:
  • list of regex for urls to not crawl
cache:
  • local directory to read crawled items from instead of accessing the site directly
patterns:
  • Regular expressions to substitute before html is parsed. Newline separated
subs:
  • Text to replace each item in patterns. Must be the same number of lines as patterns. Due to the way buildout handles empty lines, to replace a pattern with nothing (e.g. to remove the pattern), use <EMPTYSTRING> as the substitution.
maxsize:
  • don’t crawl anything larger than this
max:
  • Limit crawling to this number of pages
start-urls:
  • a list of urls to initially crawl
ignore-robots:
  • if set, will ignore the robots.txt directives and crawl everything

WebCrawler will emit items like:

item = dict(_site_url = "Original site_url used",
            _path = "The url crawled without _site_url",
            _content = "The raw content returned by the url",
            _content_info = "Headers returned with content",
            _backlinks = names,
            _sortorder = "An integer representing the order the url was found within the page/site",
            )
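A downstream pipeline section can read these keys directly. A hedged sketch with a made-up sample item (the item contents here are purely illustrative):

```python
# Sample item shaped like the ones webcrawler emits; values are made up
items = [dict(_site_url="http://www.whitehouse.gov",
              _path="/about",
              _content=b"<html>...</html>",
              _content_info={"content-type": "text/html"},
              _backlinks=[],
              _sortorder=0)]

for item in items:
    # Reassemble the full url and report the served mimetype
    full_url = item["_site_url"] + item["_path"]
    mimetype = item["_content_info"].get("content-type", "unknown")
    print(full_url, mimetype)  # http://www.whitehouse.gov/about text/html
```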

transmogrify.webcrawler.cache

A blueprint that saves crawled content into a directory structure.

Options:

path-key:
  • Allows you to override the field the path is stored in. Defaults to '_path'
output:
  • Directory to store cached content in
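The cache's job is to mirror each item's content onto disk under the output directory, keyed by its path. A hypothetical sketch of that write step, not the blueprint's actual implementation (names are assumptions):

```python
import os
import tempfile

def store(output, item, path_key="_path"):
    # Write the item's raw content under `output`, mirroring the url path
    rel = item.get(path_key, "").lstrip("/") or "index.html"
    dest = os.path.join(output, rel)
    os.makedirs(os.path.dirname(dest) or output, exist_ok=True)
    with open(dest, "wb") as f:
        f.write(item["_content"])
    return dest

outdir = tempfile.mkdtemp()
saved = store(outdir, {"_path": "/about/index.html", "_content": b"<html></html>"})
print(saved)
```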

transmogrify.webcrawler.typerecognitor

A blueprint for assigning a content type based on the mime-type given by the webcrawler.
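Conceptually the typerecognitor is a lookup from the crawled '_mimetype' to a '_type' value. A hypothetical sketch with a toy mapping (the real blueprint's table is larger; names are assumptions):

```python
def recognize_type(item):
    # Toy mimetype -> '_type' table for illustration only
    mapping = {"text/html": "Document",
               "image/jpeg": "Image",
               "image/png": "Image"}
    item["_type"] = mapping.get(item.get("_mimetype"), "File")
    return item

print(recognize_type({"_mimetype": "text/html"})["_type"])        # Document
print(recognize_type({"_mimetype": "application/pdf"})["_type"])  # File
```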

Changelog

1.2.1 (2013-01-10)

  • setuptools-git wasn’t installed so release was missing files [djay]

1.2 (2012-12-28)

  • fix cache check to prevent overwriting cache [djay]
  • turn redirects into Link objects [djay]
  • summary stats of which mimetypes were crawled [djay]
  • fixed bug where redirected pages weren’t getting uploaded [djay]
  • fixed bugs with storing default pages in cache [djay]
  • fixed bug with space chars in urls [ivanteoh]
  • better handling of charset detection [djay]

1.1 (2012-04-17)

  • add start-urls option [djay]
  • add ignore_robots option [djay]
  • fixed bug in http-equiv refresh handling [djay]
  • fixes to disk caching [djay]
  • better logging [djay]
  • default maxsize is unlimited [djay]
  • Provide ability for the reformat function to substitute patterns with empty strings (nothing). Buildout does not support empty lines within configuration, so if a substitution is <EMPTYSTRING> this becomes an empty string. [davidjb]
  • Provide a logger in the LXMLPage class so the reformat function can succeed [davidjb]
  • Reformat spacing in webcrawler reformat function [davidjb]

1.0 (2011-06-29)

  • many fixes for importing from local directory w/ many languages [simahawk]
  • fix UnicodeEncodeError when file name/language is not english [simahawk]
  • fix iterating over non-sequence [simahawk]
  • fix missing import for MyStringIO [simahawk]

1.0b7 (2011-02-17)

  • fix bug in cache check [djay]

1.0b6 (2011-02-12)

  • only open cache files when needed so don’t run out of handles [djay]
  • follow http-equiv refresh links [djay]

1.0b5 (2011-02-06)

  • files use file pointers to reduce memory usage [djay]
  • cache saves .metadata files to record and playback headers [djay]

1.0b4 (2010-12-13)

  • improve logging [djay]
  • fix encoding bug caused by cache [djay]

1.0b3 (2010-11-10)

  • Fixed bug in cache that caused many links to be ignored in some cases [djay]
  • Fixed up documentation [djay]

1.0b2 (2010-11-09)

  • Stopped localhost output when no output set [djay]

1.0b1 (2010-11-08)

  • change site_url to just url. [djay]
  • rename maxpage to maxsize [djay]
  • fix file: style urls [djay]
  • Added cache option to replace base_alias [djay]
  • fix _origin key set by webcrawler; instead of url it is now path, as expected by further blueprints [Vitaliy Podoba]
  • add _orig_path to pipeline item to keep the original path for any further purposes [Vitaliy Podoba]
  • make all url absolute taking into account base tags inside webcrawler blueprint
    [Vitaliy Podoba]

0.1 (2008-09-25)

  • renamed package from pretaweb.blueprints to transmogrify.webcrawler.
    [djay]
  • enhanced import view [djay]

Download Files

transmogrify.webcrawler-1.2.1.tar.gz (512.4 kB, Source, uploaded Jan 9, 2013)
