Crawling and feeding html content into a transmogrifier pipeline
Project description
Crawling - html to import
webcrawler imports HTML either from a live website, from a folder on disk, or from a folder on disk containing html which originally came from a live website and may still have absolute links referring to that website.
To crawl a live website, supply the crawler with a base http url to start crawling from. This url must be the prefix shared by all the other urls you want to import from the site.
For example
    [crawler]
    blueprint = transmogrify.webcrawler
    url = http://www.whitehouse.gov
    max = 50
will restrict the crawler to the first 50 pages.
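The base-url restriction amounts to a prefix test on each discovered link, combined with the page limit. A minimal sketch of that decision, assuming simple string-prefix matching (the function name `should_crawl` is illustrative, not part of the package):

```python
def should_crawl(link, base_url, crawled=0, max_pages=None):
    """Decide whether a discovered link should be fetched.

    A link is followed only if it starts with the base url, and only
    while the number of pages crawled is below the optional maximum.
    """
    if not link.startswith(base_url):
        return False  # outside the site being crawled
    if max_pages is not None and crawled >= max_pages:
        return False  # the 'max' option has been reached
    return True

# Links outside the base, or past the limit, are skipped.
print(should_crawl("http://www.whitehouse.gov/about", "http://www.whitehouse.gov"))
print(should_crawl("http://othersite.com/page", "http://www.whitehouse.gov"))
```

The real blueprint also resolves relative links against the page they appear on before applying this test.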
You can also crawl a local directory of html with relative links by just using a file: style url
    [crawler]
    blueprint = transmogrify.webcrawler
    url = file:///mydirectory
or, if the local directory contains html saved from a website and might have absolute urls in it, you can set this directory as the cache. The crawler will always look in the cache first
    [crawler]
    blueprint = transmogrify.webcrawler
    url = http://therealsite.com
    cache = mydirectory
The following will not crawl anything larger than 400,000 bytes (roughly 400kB)
    [crawler]
    blueprint = transmogrify.webcrawler
    url = http://www.whitehouse.gov
    maxsize = 400000
To skip crawling links by regular expression
    [crawler]
    blueprint = transmogrify.webcrawler
    url = http://www.whitehouse.gov
    ignore =
        \.mp3
        \.mp4
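The ignore option treats each line as a regular expression to test discovered urls against. A sketch of that behaviour, assuming the expressions are searched anywhere within the url (the helper name `is_ignored` is ours, not an API of the package):

```python
import re

def is_ignored(url, ignore_patterns):
    """Return True if the url matches any of the ignore regexes.

    Each entry corresponds to one line of the 'ignore' option.
    """
    return any(re.search(pattern, url) for pattern in ignore_patterns)

patterns = [r"\.mp3", r"\.mp4"]
print(is_ignored("http://www.whitehouse.gov/speech.mp3", patterns))   # True
print(is_ignored("http://www.whitehouse.gov/index.html", patterns))   # False
```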
If funnelweb is having trouble parsing the html of some pages, you can preprocess the html before it is parsed. e.g.
    [crawler]
    blueprint = transmogrify.webcrawler
    patterns = (<script>)[^<]*(</script>)
    subs = \1\2
If you'd like to skip processing links with certain mimetypes you can use a condition section. Its TALES expression determines which items will be processed further; see http://pypi.python.org/pypi/collective.transmogrifier/#condition-section
    [drop]
    blueprint = collective.transmogrifier.sections.condition
    condition = python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']
- transmogrify.webcrawler
A source blueprint for crawling content from a site or local html files.
    # Crawls site or cache for content
    # see http://pypi.python.org/pypi/transmogrify.webcrawler
    #
    # site_url - the top url to crawl
    # ignore   - list of regex for urls to not crawl
    # cache    - local directory to read crawled items from instead of accessing the site directly
    # patterns - regular expressions to substitute before html is parsed. Newline separated
    # subs     - text to replace each item in patterns. Must be the same number of lines as patterns
    # maxsize  - don't crawl anything larger than this
    # max      - limit crawling to this number of pages
    # WebCrawler will emit items like
    # item = dict(_site_url     = "Original site_url used",
    #             _path         = "The url crawled without _site_url",
    #             _content      = "The raw content returned by the url",
    #             _content_info = "Headers returned with content",
    #             _backlinks    = names,
    #             _sortorder    = "An integer representing the order the url was found within the page/site",
    #             )
- transmogrify.webcrawler.typerecognitor
A blueprint for assigning a content type based on the mime-type as given by the webcrawler
- transmogrify.webcrawler.cache
A blueprint that saves crawled content into a directory structure
transmogrify.webcrawler
A transmogrifier source blueprint which will crawl a url, reading in pages until all have been crawled.
Options
- site_url
URL to start crawling. The URL will be treated as the base and any links outside this base will be ignored
- ignore
Regular expressions for urls not to follow
- patterns
Regular expressions to substitute before html is parsed. Newline separated
- subs
Text to replace each item in patterns with. Must be the same number of lines as patterns
- checkext
If set, check that external links (links outside the site) exist, without crawling them
- verbose
Output more detail about the crawl while it is running
- maxsize
Don't crawl anything larger than this size, in bytes
- nonames
nonames
- cache
Local directory to read crawled items from instead of accessing the site directly
Keys inserted
The following options set the names of the keys inserted into items added to the pipeline
- pathkey
default: _path. The path of the url not including the base
- siteurlkey
default: _site_url. The base of the url
- originkey
default: _origin. The original path, in case retrieving the url caused a redirection
- contentkey
default: _content. The main content of the url
- contentinfokey
default: _content_info. Headers returned by urlopen
- sortorderkey
default: _sortorder. A count of when a link to this item was first encountered while crawling
- backlinkskey
default: _backlinks. A list of tuples of which pages linked to this item. (url, path)
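Putting the default key names together, an item emitted for a crawled page might look like the following. The values are illustrative only; the key names are the defaults documented above:

```python
# An illustrative pipeline item using the default key names.
item = {
    '_site_url':     'http://www.whitehouse.gov/',
    '_path':         'about/index.html',
    '_origin':       'about',                       # path before any redirect
    '_content':      '<html>...</html>',            # raw content returned
    '_content_info': {'content-type': 'text/html'}, # headers from urlopen
    '_sortorder':    3,                             # order first encountered
    '_backlinks':    [('http://www.whitehouse.gov/', '')],  # (url, path) pairs
}

# The full url can be reconstructed from the base and the path.
print(item['_site_url'] + item['_path'])
```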
Tests
    >>> testtransmogrifier("""
    ... [webcrawler]
    ... blueprint = transmogrify.webcrawler
    ... site_url = file://%s/test_staticsite
    ... alias_bases = http://somerandomsite file:///
    ... """)
    {'_backlinks': [],
     '_content_info': {'content-type': 'text/html'},
     '_mimetype': 'text/html',
     '_origin': 'file://.../test_staticsite',
     '_path': '',
     '_site_url': 'file://.../test_staticsite/',
     '_sortorder': 0,
     '_type': 'Document'}
    ...
    >>> testtransmogrifier("""
    ... [webcrawler]
    ... blueprint = transmogrify.webcrawler
    ... site_url = file://%s/test_staticsite
    ... alias_bases = http://somerandomsite file:///
    ... """)
    {... '_path': '', ...}
    {... '_path': 'cia-plone-view-source.jpg', ...}
    {... '_path': 'subfolder', ...}
    {... '_path': 'subfolder2', ...}
    {... '_path': 'file3.html', ...}
    {... '_path': 'subfolder/subfile1.htm', ...}
    {... '_path': 'file.doc', ...}
    {... '_path': 'file2.htm', ...}
    {... '_path': 'file4.HTML', ...}
    {... '_path': 'egenius-plone.gif', ...}
    {... '_path': 'plone_schema.png', ...}
    {... '_path': 'file1.htm', ...}
    {... '_path': 'subfolder2/subfile1.htm', ...}
    ...
    >>> testtransmogrifier("""
    ... [webcrawler]
    ... blueprint = transmogrify.webcrawler
    ... site_url = file://%s/test_staticsite
    ... alias_bases = http://somerandomsite file:///
    ... patterns =
    ...     (?s)<SCRIPT.*Abbreviation"\)
    ...     (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
    ...     (?s)State=.*<body[^>]*>
    ... subs =
    ...     </head><body>
    ...     <a href="\g<u>">\g<a></a>
    ...     <br>
    ... """)
External scripts used
http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py
http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py
TypeRecognitor
TypeRecognitor is a transmogrifier blueprint which determines the Plone type of the item from the mime_type in the header. It reads the mimetype from the headers in _content_info set by transmogrify.webcrawler.
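Conceptually this is a lookup from mimetype to Plone type. A minimal sketch of that logic, with a mapping table assumed from the test output below (the exact table lives inside the blueprint; `recognize` and `MIMETYPE_TO_TYPE` are our illustrative names):

```python
# Assumed mapping, inferred from the doctest output in this section.
MIMETYPE_TO_TYPE = {
    'text/html':          'Document',
    'image/jpeg':         'Image',
    'image/gif':          'Image',
    'image/png':          'Image',
    'application/msword': 'Document',  # imported via a doc_to_html transform
}

def recognize(item):
    """Set _mimetype and _type on a pipeline item from its headers."""
    mimetype = item.get('_content_info', {}).get('content-type')
    item['_mimetype'] = mimetype
    item['_type'] = MIMETYPE_TO_TYPE.get(mimetype)
    return item

item = recognize({'_path': 'plone_schema.png',
                  '_content_info': {'content-type': 'image/png'}})
print(item['_type'])  # Image
```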
    >>> from os.path import dirname
    >>> from os.path import abspath
    >>> config = """
    ...
    ... [transmogrifier]
    ... pipeline =
    ...     webcrawler
    ...     typerecognitor
    ...     clean
    ...     printer
    ...
    ... [webcrawler]
    ... blueprint = transmogrify.webcrawler
    ... site_url = file://%s/test_staticsite
    ...
    ... [typerecognitor]
    ... blueprint = transmogrify.webcrawler.typerecognitor
    ...
    ... [clean]
    ... blueprint = collective.transmogrifier.sections.manipulator
    ... delete =
    ...     file
    ...     text
    ...     image
    ...
    ... [printer]
    ... blueprint = collective.transmogrifier.sections.tests.pprinter
    ...
    ... """ % abspath(dirname(__file__)).replace('\\','/')
    >>> from collective.transmogrifier.tests import registerConfig
    >>> registerConfig(u'transmogrify.webcrawler.typerecognitor.test', config)
    >>> from collective.transmogrifier.transmogrifier import Transmogrifier
    >>> transmogrifier = Transmogrifier(plone)
    >>> transmogrifier(u'transmogrify.webcrawler.typerecognitor.test')
    {... '_mimetype': 'image/jpeg',
     ... '_path': 'cia-plone-view-source.jpg',
     ... '_type': 'Image', ...}
    ...
    - {'_mimetype': 'image/gif',
       '_path': '/egenius-plone.gif',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Image'}
    - {'_mimetype': 'application/msword',
       '_path': '/file.doc',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': 'doc_to_html',
       '_type': 'Document'}
    - {'_mimetype': 'text/html',
       '_path': '/file1.htm',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Document'}
    - {'_mimetype': 'text/html',
       '_path': '/file2.htm',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Document'}
    - {'_mimetype': 'text/html',
       '_path': '/file3.html',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Document'}
    - {'_mimetype': 'text/html',
       '_path': '/file4.HTML',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Document'}
    - {'_mimetype': 'image/png',
       '_path': '/plone_schema.png',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Image'}
    - {'_mimetype': 'text/html',
       '_path': '/subfolder',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Document'}
    - {'_mimetype': 'text/html',
       '_path': '/subfolder/subfile1.htm',
       '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
       '_transform': None,
       '_type': 'Document'}
Changelog
1.0b4 (2010-12-13)
improve logging
fix encoding bug caused by cache
1.0b3 (2010-11-10)
Fixed bug in cache that caused many links to be ignored in some cases
Fixed up documentation
1.0b2 (2010-11-09)
Stopped localhost output when no output set
1.0b1 (2010-11-08)
change site_url to just url.
rename maxpage to maxsize
fix file: style urls
Added cache option to replace base_alias
fix _origin key set by webcrawler: instead of the url it is now the path, as expected by further blueprints [Vitaliy Podoba]
add _orig_path to the pipeline item to keep the original path for any further purposes [Vitaliy Podoba]
make all urls absolute, taking into account base tags, inside the webcrawler blueprint [Vitaliy Podoba]
0.1 (2008-09-25)
renamed package from pretaweb.blueprints to transmogrify.webcrawler [djay]
enhanced import view [djay]
Source Distribution

Hashes for transmogrify.webcrawler-1.0b4.zip

    Algorithm   | Hash digest
    ------------|------------
    SHA256      | d217b1bdfef028cfc5fb03a8dfb659a625dd0a008b33d1ce23ab6342490fc007
    MD5         | 34aef3928f048688ab006d93e4973905
    BLAKE2b-256 | 8d16f3b1710e95d29633db27f0f9e2da429cb294f353a4dcd0613216f300797b