Crawling and feeding html content into a transmogrifier pipeline
Introduction
- transmogrify.webcrawler
A source blueprint for crawling content from a site or local html files.
# WebCrawler will emit items like
# item = dict(_site_url = "Original site_url used",
#             _path = "The url crawled without _site_url",
#             _content = "The raw content returned by the url",
#             _content_info = "Headers returned with content",
#             _backlinks = names,
#             _sortorder = "An integer representing the order the url was found within the page/site",
#             )
- transmogrify.webcrawler.typerecognitor
A blueprint for assigning a content type based on the mime-type reported by the webcrawler
- transmogrify.webcrawler.cache
A blueprint that saves crawled content into a directory structure
transmogrify.webcrawler
A transmogrifier source blueprint which crawls a site starting from a url, following links and reading in pages until all of them have been crawled.
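The crawl described above can be sketched as a breadth-first walk. This is only an illustration, not the blueprint's actual code: the in-memory `SITE` mapping, the `crawl` function and the `http://example.com/` base are hypothetical stand-ins for real fetching and link extraction, and whether the real crawler visits pages breadth-first is an assumption.

```python
from collections import deque

# Hypothetical in-memory "site": path -> paths linked from that page.
# A real crawler would fetch each url and parse its links instead.
SITE = {
    "": ["file1.htm", "subfolder"],
    "file1.htm": ["", "subfolder/subfile1.htm"],
    "subfolder": ["subfolder/subfile1.htm"],
    "subfolder/subfile1.htm": [],
}


def crawl(site_url, links):
    """Walk every page reachable from the root and yield one item per page."""
    order = {"": 0}       # path -> position at which it was first discovered
    backlinks = {"": []}  # path -> (url, path) tuples of pages linking to it
    queue = deque([""])
    while queue:
        path = queue.popleft()
        for target in links.get(path, []):
            if target not in order:
                # First time this link is seen: record its sort order.
                order[target] = len(order)
                backlinks[target] = []
                queue.append(target)
            backlinks[target].append((site_url + path, path))
    # Emit items only after the walk so that backlinks are complete
    # (an assumption made to keep the sketch simple).
    for path in sorted(order, key=order.get):
        yield {
            "_site_url": site_url,
            "_path": path,
            "_sortorder": order[path],
            "_backlinks": backlinks[path],
        }


items = list(crawl("http://example.com/", SITE))
for item in items:
    print(item["_sortorder"], repr(item["_path"]), item["_backlinks"])
```

Each page is queued once, when a link to it is first seen, so the loop ends when no unvisited pages remain.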
Options
- site_url
URL to start crawling. The URL will be treated as the base and any links outside this base will be ignored
- ignore
Regular expressions for urls not to follow
- patterns
Regular expressions to substitute before html is parsed. Newline separated
- subs
Replacement text for the patterns above. Newline separated
- checkext
checkext
- verbose
verbose
- maxsize
Don't crawl anything larger than this
- nonames
nonames
- cache
cache
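As a rough illustration of how site_url and ignore could interact, the sketch below decides whether a link should be followed. The function name `should_follow`, the example urls and the use of `re.search` are assumptions made for illustration; the blueprint itself may match urls differently.

```python
import re


def should_follow(url, site_url, ignore=()):
    """Return True if url is under site_url and matches no ignore pattern."""
    # Links outside the base url are never followed.
    if not url.startswith(site_url):
        return False
    # Assumption: a pattern matching anywhere in the url excludes it.
    return not any(re.search(pattern, url) for pattern in ignore)


SITE_URL = "http://example.com/docs/"      # hypothetical base
IGNORE = [r"\.pdf$", r"/private/"]          # hypothetical ignore patterns

results = {
    url: should_follow(url, SITE_URL, IGNORE)
    for url in [
        "http://example.com/docs/intro.html",
        "http://example.com/blog/",
        "http://example.com/docs/manual.pdf",
        "http://example.com/docs/private/x.html",
    ]
}
for url, follow in results.items():
    print(follow, url)
```

Only the first url would be followed here: the second lies outside the base and the last two match an ignore pattern.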
Keys inserted
The following options set the keys of the items added to the pipeline
- pathkey
default: _path. The path of the url not including the base
- siteurlkey
default: _site_url. The base of the url
- originkey
default: _origin. The original path in case retrieving the url caused a redirection
- contentkey
default: _content. The main content of the url
- contentinfokey
default: _content_info. Headers returned by urlopen
- sortorderkey
default: _sortorder. A count of when a link to this item was first encountered while crawling
- backlinkskey
default: _backlinks. A list of (url, path) tuples, one for each page that links to this item
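The effect of the key options can be sketched as follows. The helpers `resolve_keys` and `build_item` are hypothetical and only show the idea of overriding default key names; the sort order default is written as `_sortorder`, matching the key shown in the test output.

```python
# Defaults as documented in the option list above.
DEFAULT_KEYS = {
    "pathkey": "_path",
    "siteurlkey": "_site_url",
    "originkey": "_origin",
    "contentkey": "_content",
    "contentinfokey": "_content_info",
    "sortorderkey": "_sortorder",
    "backlinkskey": "_backlinks",
}


def resolve_keys(options):
    """Merge user-supplied key options over the documented defaults."""
    keys = dict(DEFAULT_KEYS)
    for name in DEFAULT_KEYS:
        if name in options:
            keys[name] = options[name]
    return keys


def build_item(keys, site_url, path, origin, content, headers, sortorder, backlinks):
    """Build a pipeline item using the configured key names (hypothetical helper)."""
    return {
        keys["siteurlkey"]: site_url,
        keys["pathkey"]: path,
        keys["originkey"]: origin,
        keys["contentkey"]: content,
        keys["contentinfokey"]: headers,
        keys["sortorderkey"]: sortorder,
        keys["backlinkskey"]: backlinks,
    }


# Only pathkey is overridden; unrelated options such as verbose are ignored.
keys = resolve_keys({"pathkey": "_url_path", "verbose": "1"})
item = build_item(
    keys,
    "http://example.com/",
    "a.html",
    "a.html",
    "<html></html>",
    {"content-type": "text/html"},
    0,
    [],
)
print(sorted(item))
```

Renaming a key this way lets later sections in a pipeline read the crawler's output under whatever names they expect.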
Tests
>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{'_backlinks': [],
 '_content_info': {'content-type': 'text/html'},
 '_mimetype': 'text/html',
 '_origin': 'file://.../test_staticsite',
 '_path': '',
 '_site_url': 'file://.../test_staticsite/',
 '_sortorder': 0,
 '_type': 'Document'}
...

>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{... '_path': '', ...}
{... '_path': 'cia-plone-view-source.jpg', ...}
{... '_path': 'subfolder', ...}
{... '_path': 'subfolder2', ...}
{... '_path': 'file3.html', ...}
{... '_path': 'subfolder/subfile1.htm', ...}
{... '_path': 'file.doc', ...}
{... '_path': 'file2.htm', ...}
{... '_path': 'file4.HTML', ...}
{... '_path': 'egenius-plone.gif', ...}
{... '_path': 'plone_schema.png', ...}
{... '_path': 'file1.htm', ...}
{... '_path': 'subfolder2/subfile1.htm', ...}
...
>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... patterns =
...     (?s)<SCRIPT.*Abbreviation"\)
...     (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
...     (?s)State=.*<body[^>]*>
... subs =
...     </head><body>
...     <a href="\g<u>">\g<a></a>
...     <br>
... """)
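To show what the patterns and subs options in the last test might do to a page, here is a small sketch using Python's `re.sub`. It assumes each pattern is paired with the sub on the same line position, which the option descriptions suggest but do not state; the `apply_substitutions` helper and the sample input are hypothetical.

```python
import re


def apply_substitutions(html, patterns, subs):
    """Apply each (pattern, replacement) pair in order before parsing."""
    # Assumption: the n-th pattern is paired with the n-th sub.
    for pattern, replacement in zip(patterns, subs):
        html = re.sub(pattern, replacement, html)
    return html


# The second pattern/sub pair from the test configuration above.
patterns = [r"(?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)"]
subs = [r'<a href="\g<u>">\g<a></a>']

raw = "MakeLink('page2.htm','Next page')"  # hypothetical page fragment
cleaned = apply_substitutions(raw, patterns, subs)
print(cleaned)  # <a href="page2.htm">Next page</a>
```

Rewriting script-generated links into plain anchors like this gives the crawler ordinary hrefs that it can follow.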
External scripts used
http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py
http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py
TypeRecognitor
TypeRecognitor is a transmogrifier blueprint which determines the Plone type of an item from its mime-type. It reads the mime-type from the headers in _content_info set by transmogrify.webcrawler.
>>> from os.path import dirname
>>> from os.path import abspath
>>> config = """
...
... [transmogrifier]
... pipeline =
...     webcrawler
...     typerecognitor
...     clean
...     printer
...
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
...
... [typerecognitor]
... blueprint = transmogrify.webcrawler.typerecognitor
...
... [clean]
... blueprint = collective.transmogrifier.sections.manipulator
... delete =
...     file
...     text
...     image
...
... [printer]
... blueprint = collective.transmogrifier.sections.tests.pprinter
...
... """ % abspath(dirname(__file__)).replace('\\','/')

>>> from collective.transmogrifier.tests import registerConfig
>>> registerConfig(u'transmogrify.webcrawler.typerecognitor.test', config)

>>> from collective.transmogrifier.transmogrifier import Transmogrifier
>>> transmogrifier = Transmogrifier(plone)
>>> transmogrifier(u'transmogrify.webcrawler.typerecognitor.test')
{...
 '_mimetype': 'image/jpeg',
 ...
 '_path': 'cia-plone-view-source.jpg',
 ...
 '_type': 'Image',
 ...}
...
{'_mimetype': 'image/gif',
 '_path': '/egenius-plone.gif',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Image'}
{'_mimetype': 'application/msword',
 '_path': '/file.doc',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': 'doc_to_html',
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file1.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file2.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file3.html',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/file4.HTML',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'image/png',
 '_path': '/plone_schema.png',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Image'}
{'_mimetype': 'text/html',
 '_path': '/subfolder',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
{'_mimetype': 'text/html',
 '_path': '/subfolder/subfile1.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
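The mapping visible in the output above can be sketched as a simple lookup table. The table below contains only the mime-types that appear in this document; the `recognise` function, the handling of a charset suffix and the choice to leave unknown types untouched are assumptions, not the blueprint's documented behaviour.

```python
# mime-type -> (Plone type, transform), taken from the sample output above.
MIME_TO_TYPE = {
    "text/html": ("Document", None),
    "application/msword": ("Document", "doc_to_html"),
    "image/jpeg": ("Image", None),
    "image/gif": ("Image", None),
    "image/png": ("Image", None),
}


def recognise(item):
    """Set _mimetype, _type and _transform from the item's content-type header."""
    headers = item.get("_content_info", {})
    # Drop any parameters such as "; charset=utf-8" (assumed handling).
    mimetype = headers.get("content-type", "").split(";")[0].strip().lower()
    if mimetype in MIME_TO_TYPE:
        item["_mimetype"] = mimetype
        item["_type"], item["_transform"] = MIME_TO_TYPE[mimetype]
    # Assumption: items with an unknown mime-type pass through unchanged.
    return item


html_item = recognise({"_path": "file1.htm",
                       "_content_info": {"content-type": "text/html; charset=utf-8"}})
doc_item = recognise({"_path": "file.doc",
                      "_content_info": {"content-type": "application/msword"}})
other_item = recognise({"_path": "data.bin",
                        "_content_info": {"content-type": "application/octet-stream"}})
print(html_item["_type"], doc_item["_transform"], "_type" in other_item)
```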
Changelog
1.0 - Unreleased
Initial release
transmogrify.webcrawler 0.1 - October 25, 2008
renamed package from pretaweb.blueprints to transmogrify.webcrawler. [djay]
enhanced import view (djay)
0.2
16-7-09 djay Added caching of crawled sites
10-7-09 djay Added UI using z3cform