
Crawling and feeding html content into a transmogrifier pipeline

Introduction

transmogrify.webcrawler

A source blueprint for crawling content from a site or local html files.

transmogrify.webcrawler.typerecognitor

A blueprint for assigning a content type based on the mime type reported by the webcrawler

transmogrify.webcrawler.cache

A blueprint that saves crawled content into a directory structure

transmogrify.webcrawler

A transmogrifier source blueprint that crawls a URL, reading in pages until every page has been crawled.

Options

site_url

URL at which to start crawling. This URL is treated as the base, and any links outside it are ignored

ignore

Regular expressions for urls not to follow

alias_bases

Substitutions for URL bases. This is useful when the URL used to access the site differs from the absolute URLs of links in its pages

patterns

Regular expressions to substitute before the HTML is parsed. Newline separated

subs

Replacement text for the corresponding expression in patterns

checkext

checkext

verbose

verbose

maxpage

maxpage

nonames

nonames

cache

cache
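
The way site_url, ignore and alias_bases interact can be illustrated with a small sketch. This is not the package's implementation: the function names `apply_alias` and `should_follow`, and the example URLs, are assumptions made purely for illustration of the behaviour the option descriptions suggest.

```python
import re


def apply_alias(url, alias_bases, site_url):
    """Rewrite a link that starts with an alias base so it starts with site_url."""
    for alias in alias_bases:
        if url.startswith(alias):
            rest = url[len(alias):].lstrip("/")
            return site_url.rstrip("/") + "/" + rest
    return url


def should_follow(url, site_url, ignore_patterns=(), alias_bases=()):
    """Return True if a crawler configured this way would plausibly follow url."""
    url = apply_alias(url, alias_bases, site_url)
    base = site_url.rstrip("/")
    # Links outside the base URL are ignored.
    if not (url == base or url.startswith(base + "/")):
        return False
    # Links matching any of the ignore expressions are not followed.
    return not any(re.search(pattern, url) for pattern in ignore_patterns)


# Hypothetical site used only for this example.
site = "http://example.com/docs/"
inside = should_follow("http://example.com/docs/a.html", site)
outside = should_follow("http://other.org/x", site)
ignored = should_follow("http://example.com/docs/print/a.html", site,
                        ignore_patterns=[r"/print/"])
aliased = should_follow("http://www.example.com/docs/b.html", site,
                        alias_bases=["http://www.example.com/docs"])
```

Here `inside` and `aliased` are true, while `outside` and `ignored` are false.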

Keys inserted

The following options set the keys of the items added to the pipeline

pathkey

default: _path. The path of the url not including the base

siteurlkey

default: _site_url. The base of the url

originkey

default: _origin. The original path, in case retrieving the URL caused a redirection

contentkey

default: _content. The main content of the url

contentinfokey

default: _content_info. Headers returned by urlopen

sortorderkey

default: _sortorder. A count of when a link to this item was first encountered while crawling

backlinkskey

default: _backlinks. A list of (url, path) tuples for the pages that linked to this item
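
To show how a later pipeline step might consume these keys, here is a minimal sketch of a generator that keeps only HTML items. The function name `html_items` and the sample items are hypothetical; the items are hand-written to resemble the crawler output, and the key names assume the defaults listed above.

```python
def html_items(previous):
    """Yield only items whose crawled headers report an HTML content type."""
    for item in previous:
        headers = item.get("_content_info") or {}
        if headers.get("content-type", "").startswith("text/html"):
            yield item


# Hand-written sample items (not real crawler output).
items = [
    {"_path": "",
     "_site_url": "file:///tmp/site/",
     "_origin": "file:///tmp/site",
     "_content_info": {"content-type": "text/html"},
     "_sortorder": 0,
     "_backlinks": []},
    {"_path": "logo.gif",
     "_site_url": "file:///tmp/site/",
     "_origin": "file:///tmp/site/logo.gif",
     "_content_info": {"content-type": "image/gif"},
     "_sortorder": 1,
     "_backlinks": [("file:///tmp/site/", "")]},
]

paths = [item["_path"] for item in html_items(items)]
```

With the sample data, only the first item (the one with an empty `_path`) is kept.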

Tests

>>> testtransmogrifier(dontprint=['_content'], source="""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url  = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{'_backlinks': [],
 '_content_info': {'content-type': 'text/html'},
 '_origin': 'file://.../test_staticsite',
 '_path': '',
 '_site_url': 'file://.../test_staticsite/',
 '_sortorder': 0}
...
>>> testtransmogrifier(source=webcrawler, strip=['_content'])
{...
 '_path': '',
 ...}
{...
 '_path': 'file2.htm',
 ...}
{...
 '_path': 'subfolder',
 ...}
{...
 '_path': 'egenius-plone.gif',
 ...}
{...
 '_path': 'plone_schema.png',
 ...}
...
>>> source = """
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url  = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... patterns =
...     (?s)<SCRIPT.*Abbreviation"\)
...     (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
...     (?s)State=.*<body[^>]*>
... subs =
...     </head><body>
...     <a href="\g<u>">\g<a></a>
...     <br>
... """
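
The effect of one patterns/subs pair from the configuration above can be reproduced with Python's `re` module. This is a sketch of the idea only, assuming each pattern is paired with the replacement at the same position; the helper name `preprocess` and the sample HTML are made up for the example.

```python
import re

# The second pattern and replacement from the configuration above.
patterns = [r"(?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)"]
subs = [r'<a href="\g<u>">\g<a></a>']


def preprocess(html, patterns, subs):
    """Apply each pattern with its same-position replacement, in order."""
    for pattern, sub in zip(patterns, subs):
        html = re.sub(pattern, sub, html)
    return html


# Hypothetical page fragment that builds a link with a script call.
raw = "<p>MakeLink('page2.htm','Next page')</p>"
cleaned = preprocess(raw, patterns, subs)
print(cleaned)  # <p><a href="page2.htm">Next page</a></p>
```

Rewriting script-generated links into plain anchors like this lets an HTML parser discover them as ordinary links.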

External scripts used

http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py

http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py

TypeRecognitor

TypeRecognitor is a transmogrifier blueprint that determines the Plone type of an item from its mime type. It reads the mime type from the headers in _content_info, which are set by transmogrify.webcrawler.

>>> from os.path import dirname
>>> from os.path import abspath
>>> config = """
...
... [transmogrifier]
... pipeline =
...     webcrawler
...     typerecognitor
...     clean
...     printer
...
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url  = file://%s/test_staticsite
...
... [typerecognitor]
... blueprint = transmogrify.webcrawler.typerecognitor
...
... [clean]
... blueprint = collective.transmogrifier.sections.manipulator
... delete =
...   file
...   text
...   image
...
... [printer]
... blueprint = collective.transmogrifier.sections.tests.pprinter
...
... """ % abspath(dirname(__file__)).replace('\\','/')
>>> from collective.transmogrifier.tests import registerConfig
>>> registerConfig(u'transmogrify.webcrawler.typerecognitor.test', config)
>>> from collective.transmogrifier.transmogrifier import Transmogrifier
>>> transmogrifier = Transmogrifier(plone)
>>> transmogrifier(u'transmogrify.webcrawler.typerecognitor.test')
{...
 '_mimetype': 'image/jpeg',
 ...
 '_path': 'cia-plone-view-source.jpg',
 ...
 '_type': 'Image',
 ...}
 ...
{'_mimetype': 'image/gif',
 '_path': '/egenius-plone.gif',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Image'}
{'_mimetype': 'application/msword',
 '_path': '/file.doc',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': 'doc_to_html',
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file1.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file2.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file3.html',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file4.HTML',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'image/png',
 '_path': '/plone_schema.png',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Image'}
{'_mimetype': 'text/html',
 '_path': '/subfolder',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/subfolder/subfile1.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
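
The mapping visible in the test output above (images become Image, HTML and Word files become Document, with a doc_to_html transform for Word) can be sketched as follows. This is an illustration rather than the blueprint's actual code: the function names `recognise` and `annotate` are hypothetical, and the "File" fallback for other mime types is an assumption that the output above does not confirm.

```python
def recognise(mimetype):
    """Return (type, transform) for a mime type, mirroring the test output."""
    if mimetype.startswith("image/"):
        return "Image", None
    if mimetype == "text/html":
        return "Document", None
    if mimetype == "application/msword":
        return "Document", "doc_to_html"
    # Assumption: anything else is treated as a plain file.
    return "File", None


def annotate(item):
    """Set _mimetype, _type and _transform from the item's _content_info."""
    content_type = (item.get("_content_info") or {}).get("content-type", "")
    # Drop parameters such as "; charset=utf-8" before matching.
    mimetype = content_type.split(";")[0].strip().lower()
    item["_mimetype"] = mimetype
    item["_type"], item["_transform"] = recognise(mimetype)
    return item


# Hand-written item, shaped like the crawler output.
doc = annotate({"_path": "/file.doc",
                "_content_info": {"content-type": "application/msword"}})
```

After the call, `doc` carries `_mimetype` 'application/msword', `_type` 'Document' and `_transform` 'doc_to_html', matching the entry for /file.doc above.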

Changelog

1.0 - Unreleased

  • Initial release

transmogrify.webcrawler 0.1 - October 25, 2008

  • renamed package from pretaweb.blueprints to transmogrify.webcrawler. [djay]

  • enhanced import view (djay)

0.2

16-7-09 djay Added caching of crawled sites

10-7-09 djay Added UI using z3cform
