Easily import static HTML websites into Plone.
Project description
Introduction
============
``Parse2Plone`` is an HTML parser (in the form of a Buildout recipe that
creates a script for you) to easily get content from static HTML websites
into Plone.
Warning
-------
This is a **Buildout recipe** for use with Plone; by itself it does nothing. If you
do not know what Plone is, please see: http://plone.org. If you do not know
what Buildout is, please see: http://www.buildout.org/.
Getting started
---------------
Because it always drives me nuts when you have to dig for a recipe's options,
here they are::
[import]
recipe = parse2plone
path = /Plone
illegal_chars = _ .
html_extensions = html
image_extensions = gif jpg jpeg png
file_extensiosn = mp3
target_tags = a div h1 h2 p
force = false
publish = false
slugify = false
rename = false
The parameters listed above are configured with their default values. Edit these
values if you would like to change the default behavior; they are (mostly)
self-explanatory. Now you can just cut and paste to get started or keep reading if
you would like to know more.
Explanation
-----------
Why did you create ``Parse2Plone`` when the following packages already exist:
* http://pypi.python.org/pypi/collective.transmogrifier
* http://pypi.python.org/pypi/transmogrify.filesystem
* http://pypi.python.org/pypi/transmogrify.htmlcontentextractor
* http://pypi.python.org/pypi/transmogrify.webcrawler
Here are some reasons:
* Because ``Parse2Plone`` is aimed at lowering the bar for folks who don't already
know (or want to know) what a "transmogrifier blueprint" is but are able to update
their ``buildout.cfg`` file; run ``Buildout``; then run a single command all
without having to think too much.
* collective.transmogrify provides a framework for creating reusable pipes
(whose definitions are called blueprints). ``Parse2Plone`` provides
a single, non-reusable "pipe/blueprint".
* The author had an itch to scratch; it will be nice for him to be able to say
"just go write a script" and then point to an example.
* Transmogrifier and friends appear to be "developer's tools", while the author wants
``Parse2Plone`` to be an "end user's tool".
If you are a developer looking to create repeatable migrations, you probably want to be
using ``collective.transmogrifier``. If you are an end user that just wants to see your
static website in Plone, then you might want to give ``Parse2Plone`` a try.
There is also this user comment, which captures the author's sentiment::
Parse2Plone's release was very timely as I need either this or something very
similar - and while I've no doubt I could make transmogrify do the job, it's a
lot of work for a one-shot loading of legacy pages.
-Derek Broughton
Installation
------------
You can install ``Parse2Plone`` by editing your ``buildout.cfg`` file like
so. First add an ``import`` section::
[import]
recipe = parse2plone
Then add the ``import`` section to the list of parts::
[buildout]
...
parts =
...
import
Now run ``bin/buildout`` as usual.
Execution
---------
You can run ``Parse2Plone`` like this::
$ bin/plone run bin/import /path/to/files
.. Note::
In the example above and examples below, ``bin/plone`` refers to a *Zope 2
instance* script created by `plone.recipe.zope2instance`_.
Your ``bin/plone`` script may be called ``bin/instance`` or
``bin/client``, etc. instead.
.. _`plone.recipe.zope2instance`: http://pypi.python.org/pypi/plone.recipe.zope2instance
Demonstration
-------------
If you have a site in /var/www/html that contains the following::
/var/www/html/index.html
/var/www/html/about/index.html
You should run::
$ bin/plone run bin/import /var/www/html
And the following will be created:
* http://localhost:8080/Plone/index.html
* http://localhost:8080/Plone/about/index.html
Modification
------------
Modifying the default behavior of ``parse2plone`` is easy; just use the command
line options or add parameters to your ``buildout.cfg`` file. Both approaches
allow customization of the same set of options, but the command line arguments
will trump any settings found in your ``buildout.cfg`` file.
Buildout options
~~~~~~~~~~~~~~~~
You can configure the following parameters in your ``buildout.cfg`` file in
the ``parse2plone`` recipe section.
Options
'''''''
+---------------------+-------------+----------------------------------------+
| **Parameter** | **Default** | **Description** |
| | **value** | |
+---------------------+-------------+----------------------------------------+
| ``path`` | /Plone | Specify an alternate location in the |
| | | database for the import to occur. |
+---------------------+-------------+----------------------------------------+
| ``illegal_chars`` | _ . | Specify illegal characters. |
| | | ``parse2plone`` will ignore files that |
| | | contain these characters. |
+---------------------+-------------+----------------------------------------+
| ``html_extensions`` | html | Specify HTML file extensions. |
| | | ``parse2plone`` will import HTML files |
| | | with these extensions |
+---------------------+-------------+----------------------------------------+
| ``image_extensions``| png, gif, | Specify image file extensions. |
| | jpg, jpeg, | ``parse2plone`` will import image files|
| | | with these extensions. |
+---------------------+-------------+----------------------------------------+
| ``file_extensions`` | mp3 | Specify image file extensions. |
| | | ``parse2plone`` will import files with |
| | | with these extensions. |
+---------------------+-------------+----------------------------------------+
| ``target_tags`` | a h1 h2 p | Specify target tags. ``parse2plone`` |
| | | will parse the contents of HTML tags |
| | | listed. |
+---------------------+-------------+----------------------------------------+
| ``force`` | false | Force create folders that do not exist.|
+---------------------+-------------+----------------------------------------+
| ``publish`` | false | Publish newly created content. |
+---------------------+-------------+----------------------------------------+
| ``slugify`` | false | "Slugify" content. (see slugify.py) |
+---------------------+-------------+----------------------------------------+
| ``rename`` | false | Rename content. (see rename.py) |
+---------------------+-------------+----------------------------------------+
Example
'''''''
Instead of accepting the default ``parse2plone`` behaviour, in your
``buildout.cfg`` file you may specify the following::
[import]
recipe = parse2plone
path = /Plone/foo
html_extensions = htm
image_extensions = png
target_tags = p
This will configure ``parse2plone`` to (only) import content *from*:
* Images ending in ``.png``
* HTML files ending in ``.htm``
* Text within ``p`` tags
*to*:
* A folder named ``/Plone/foo``.
Command line options
~~~~~~~~~~~~~~~~~~~~
The following ``parse2plone`` command line options are supported.
Options
'''''''
``'--path'``, ``'-p'``
**********************
You can specify an alternate import path ('/Plone' by default)
with ``--path`` or ``-p``::
$ bin/plone run bin/import /path/to/files --path=/Plone/foo
``'--html-extensions'``
***********************
You can specify HTML file extensions with the ``--html-extensions`` option::
$ bin/plone run bin/import /path/to/files --html-extensions=htm
``'--image-extensions'``
************************
You can specify image file extensions with the ``--image-extensions`` option::
$ bin/plone run bin/import /path/to/files --image-extensions=png
``'--file-extensions'``
***********************
You can specify generic file extensions with the ``--file-extensions`` option::
$ bin/plone run bin/import /path/to/files --file-extensions=pdf
``'--target-tags'``
*******************
You can specify the target tags to parse with the ``--target-tags`` option::
$ bin/plone run bin/import /path/to/files --target-tags=p
``'--force'``
*************
Force create folders that do not exist.
``'--publish'``
***************
Publish newly created content.
``'--slugify'``
***************
"Slugify" content (see slugify.py).
``'--rename'``
***************
Rename content (see rename.py).
``'--help'``
************
You can ask ``parse2plone`` to tell you about its available options with the ``--help``
option::
$ bin/plone run bin/import -h
Usage: import [options]
Options:
-h, --help show this help message and exit
-p PATH, --path=PATH Path to Plone site object or sub-folder
--html-extensions=HTML_EXTENSIONS
Specify HTML file extensions
--illegal-chars=ILLEGAL_CHARS
Specify characters to ignore
--image-extensions=IMAGE_EXTENSIONS
Specify image file extensions
--file-extensions=FILE_EXTENSIONS
Specify generic file extensions
--target-tags=TARGET_TAGS
Specify HTML tags to parse
--force Force creation of folders
--publish Optionally publish newly created content
--slugify Optionally "slugify" content (see slugify.py)
--rename=RENAME Optionally rename content (see rename.py)
Example
'''''''
Instead of accepting the default ``parse2plone`` behaviour, on the command line you
may specify the following::
$ bin/plone run bin/import /path/to/files -p /Plone/foo --html-extensions=html \
--image-extensions=png --target-tags=p
This will configure ``parse2plone`` to (only) import content *from*:
* Images ending in ``.png``
* HTML files ending in ``.htm``
* Text within ``p`` tags
*to*:
* A Plone site folder named ``/Plone/foo``.
Consternation
-------------
Here are some trouble-shooting comments/tips.
Compiling lxml
~~~~~~~~~~~~~~
``Parse2Plone`` requires ``lxml`` which in turn requires ``libxml2`` and
``libxslt``. If you do not have ``lxml`` installed "globally" (i.e. in your
system Python's site-packages directory) then Buildout will try to install it
for you. At this point ``lxml`` will look for the libxml2/libxslt2 development
libraries to build against, and if you don't have them installed on your system
already *your mileage may vary* (i.e. Buildout will fail).
Database access
~~~~~~~~~~~~~~~
Before running ``parse2plone``, you must either stop your Plone site or
use ZEO. Otherwise ``parse2plone`` will not be able to access the
database.
Communication
-------------
Questions, comments, or concerns? Please e-mail: aclark@aclark.net.
History
-------
0.9.0 (11/03/2010)
~~~~~~~~~~~~~~~~~~
- Fix regressions introduced (or unresolved as of) 0.8.2. Thanks Derek
Broughton for the bug report(s).
- Many fixes to convert_parameter_values() method which converts
recipe parameters to arguments passed to main()
- Fix slugify feature
0.8.2 (11/02/2010)
~~~~~~~~~~~~~~~~~~
- Add rename feature
- Fix regressions introduced in 0.8.1
0.8.1 (10/29/2010)
~~~~~~~~~~~~~~~~~~
- Refactor options/parameters functionality to universally support _SETTINGS dict
- Add "slugify" feature
- Doc fixes
- Add support to optionally publish content after creation
- Add support for generic file import
0.8 (10/27/2010)
~~~~~~~~~~~~~~~~
- Support the importing of content to folders within the Plone site object
0.7 (10/25/2010)
~~~~~~~~~~~~~~~~
- Documentation fixes
0.6 (10/25/2010)
~~~~~~~~~~~~~~~~
- Support customization via recipe parameters and command line arguments
0.5 (10/22/2010)
~~~~~~~~~~~~~~~~
- Revert 'Add Plone to install_requires'
0.4 (10/22/2010)
~~~~~~~~~~~~~~~~
- Add 'Plone' to install_requires
0.3 (10/22/2010)
~~~~~~~~~~~~~~~~
- Another setuptools fix
0.2 (10/22/2010)
~~~~~~~~~~~~~~~~
- Setuptools fix
0.1 (10/21/2010)
~~~~~~~~~~~~~~~~
- Initial release
============
``Parse2Plone`` is an HTML parser (in the form of a Buildout recipe that
creates a script for you) to easily get content from static HTML websites
into Plone.
Warning
-------
This is a **Buildout recipe** for use with Plone; by itself it does nothing. If you
do not know what Plone is, please see: http://plone.org. If you do not know
what Buildout is, please see: http://www.buildout.org/.
Getting started
---------------
Because it always drives me nuts when you have to dig for a recipe's options,
here they are::
[import]
recipe = parse2plone
path = /Plone
illegal_chars = _ .
html_extensions = html
image_extensions = gif jpg jpeg png
file_extensiosn = mp3
target_tags = a div h1 h2 p
force = false
publish = false
slugify = false
rename = false
The parameters listed above are configured with their default values. Edit these
values if you would like to change the default behavior; they are (mostly)
self-explanatory. Now you can just cut and paste to get started or keep reading if
you would like to know more.
Explanation
-----------
Why did you create ``Parse2Plone`` when the following packages already exist:
* http://pypi.python.org/pypi/collective.transmogrifier
* http://pypi.python.org/pypi/transmogrify.filesystem
* http://pypi.python.org/pypi/transmogrify.htmlcontentextractor
* http://pypi.python.org/pypi/transmogrify.webcrawler
Here are some reasons:
* Because ``Parse2Plone`` is aimed at lowering the bar for folks who don't already
know (or want to know) what a "transmogrifier blueprint" is but are able to update
their ``buildout.cfg`` file; run ``Buildout``; then run a single command all
without having to think too much.
* collective.transmogrify provides a framework for creating reusable pipes
(whose definitions are called blueprints). ``Parse2Plone`` provides
a single, non-reusable "pipe/blueprint".
* The author had an itch to scratch; it will be nice for him to be able to say
"just go write a script" and then point to an example.
* Transmogrifier and friends appear to be "developer's tools", while the author wants
``Parse2Plone`` to be an "end user's tool".
If you are a developer looking to create repeatable migrations, you probably want to be
using ``collective.transmogrifier``. If you are an end user that just wants to see your
static website in Plone, then you might want to give ``Parse2Plone`` a try.
There is also this user comment, which captures the author's sentiment::
Parse2Plone's release was very timely as I need either this or something very
similar - and while I've no doubt I could make transmogrify do the job, it's a
lot of work for a one-shot loading of legacy pages.
-Derek Broughton
Installation
------------
You can install ``Parse2Plone`` by editing your ``buildout.cfg`` file like
so. First add an ``import`` section::
[import]
recipe = parse2plone
Then add the ``import`` section to the list of parts::
[buildout]
...
parts =
...
import
Now run ``bin/buildout`` as usual.
Execution
---------
You can run ``Parse2Plone`` like this::
$ bin/plone run bin/import /path/to/files
.. Note::
In the example above and examples below, ``bin/plone`` refers to a *Zope 2
instance* script created by `plone.recipe.zope2instance`_.
Your ``bin/plone`` script may be called ``bin/instance`` or
``bin/client``, etc. instead.
.. _`plone.recipe.zope2instance`: http://pypi.python.org/pypi/plone.recipe.zope2instance
Demonstration
-------------
If you have a site in /var/www/html that contains the following::
/var/www/html/index.html
/var/www/html/about/index.html
You should run::
$ bin/plone run bin/import /var/www/html
And the following will be created:
* http://localhost:8080/Plone/index.html
* http://localhost:8080/Plone/about/index.html
Modification
------------
Modifying the default behavior of ``parse2plone`` is easy; just use the command
line options or add parameters to your ``buildout.cfg`` file. Both approaches
allow customization of the same set of options, but the command line arguments
will trump any settings found in your ``buildout.cfg`` file.
Buildout options
~~~~~~~~~~~~~~~~
You can configure the following parameters in your ``buildout.cfg`` file in
the ``parse2plone`` recipe section.
Options
'''''''
+---------------------+-------------+----------------------------------------+
| **Parameter** | **Default** | **Description** |
| | **value** | |
+---------------------+-------------+----------------------------------------+
| ``path`` | /Plone | Specify an alternate location in the |
| | | database for the import to occur. |
+---------------------+-------------+----------------------------------------+
| ``illegal_chars`` | _ . | Specify illegal characters. |
| | | ``parse2plone`` will ignore files that |
| | | contain these characters. |
+---------------------+-------------+----------------------------------------+
| ``html_extensions`` | html | Specify HTML file extensions. |
| | | ``parse2plone`` will import HTML files |
| | | with these extensions |
+---------------------+-------------+----------------------------------------+
| ``image_extensions``| png, gif, | Specify image file extensions. |
| | jpg, jpeg, | ``parse2plone`` will import image files|
| | | with these extensions. |
+---------------------+-------------+----------------------------------------+
| ``file_extensions`` | mp3 | Specify image file extensions. |
| | | ``parse2plone`` will import files with |
| | | with these extensions. |
+---------------------+-------------+----------------------------------------+
| ``target_tags`` | a h1 h2 p | Specify target tags. ``parse2plone`` |
| | | will parse the contents of HTML tags |
| | | listed. |
+---------------------+-------------+----------------------------------------+
| ``force`` | false | Force create folders that do not exist.|
+---------------------+-------------+----------------------------------------+
| ``publish`` | false | Publish newly created content. |
+---------------------+-------------+----------------------------------------+
| ``slugify`` | false | "Slugify" content. (see slugify.py) |
+---------------------+-------------+----------------------------------------+
| ``rename`` | false | Rename content. (see rename.py) |
+---------------------+-------------+----------------------------------------+
Example
'''''''
Instead of accepting the default ``parse2plone`` behaviour, in your
``buildout.cfg`` file you may specify the following::
[import]
recipe = parse2plone
path = /Plone/foo
html_extensions = htm
image_extensions = png
target_tags = p
This will configure ``parse2plone`` to (only) import content *from*:
* Images ending in ``.png``
* HTML files ending in ``.htm``
* Text within ``p`` tags
*to*:
* A folder named ``/Plone/foo``.
Command line options
~~~~~~~~~~~~~~~~~~~~
The following ``parse2plone`` command line options are supported.
Options
'''''''
``'--path'``, ``'-p'``
**********************
You can specify an alternate import path ('/Plone' by default)
with ``--path`` or ``-p``::
$ bin/plone run bin/import /path/to/files --path=/Plone/foo
``'--html-extensions'``
***********************
You can specify HTML file extensions with the ``--html-extensions`` option::
$ bin/plone run bin/import /path/to/files --html-extensions=htm
``'--image-extensions'``
************************
You can specify image file extensions with the ``--image-extensions`` option::
$ bin/plone run bin/import /path/to/files --image-extensions=png
``'--file-extensions'``
***********************
You can specify generic file extensions with the ``--file-extensions`` option::
$ bin/plone run bin/import /path/to/files --file-extensions=pdf
``'--target-tags'``
*******************
You can specify the target tags to parse with the ``--target-tags`` option::
$ bin/plone run bin/import /path/to/files --target-tags=p
``'--force'``
*************
Force create folders that do not exist.
``'--publish'``
***************
Publish newly created content.
``'--slugify'``
***************
"Slugify" content (see slugify.py).
``'--rename'``
***************
Rename content (see rename.py).
``'--help'``
************
You can ask ``parse2plone`` to tell you about its available options with the ``--help``
option::
$ bin/plone run bin/import -h
Usage: import [options]
Options:
-h, --help show this help message and exit
-p PATH, --path=PATH Path to Plone site object or sub-folder
--html-extensions=HTML_EXTENSIONS
Specify HTML file extensions
--illegal-chars=ILLEGAL_CHARS
Specify characters to ignore
--image-extensions=IMAGE_EXTENSIONS
Specify image file extensions
--file-extensions=FILE_EXTENSIONS
Specify generic file extensions
--target-tags=TARGET_TAGS
Specify HTML tags to parse
--force Force creation of folders
--publish Optionally publish newly created content
--slugify Optionally "slugify" content (see slugify.py)
--rename=RENAME Optionally rename content (see rename.py)
Example
'''''''
Instead of accepting the default ``parse2plone`` behaviour, on the command line you
may specify the following::
$ bin/plone run bin/import /path/to/files -p /Plone/foo --html-extensions=html \
--image-extensions=png --target-tags=p
This will configure ``parse2plone`` to (only) import content *from*:
* Images ending in ``.png``
* HTML files ending in ``.htm``
* Text within ``p`` tags
*to*:
* A Plone site folder named ``/Plone/foo``.
Consternation
-------------
Here are some trouble-shooting comments/tips.
Compiling lxml
~~~~~~~~~~~~~~
``Parse2Plone`` requires ``lxml`` which in turn requires ``libxml2`` and
``libxslt``. If you do not have ``lxml`` installed "globally" (i.e. in your
system Python's site-packages directory) then Buildout will try to install it
for you. At this point ``lxml`` will look for the libxml2/libxslt2 development
libraries to build against, and if you don't have them installed on your system
already *your mileage may vary* (i.e. Buildout will fail).
Database access
~~~~~~~~~~~~~~~
Before running ``parse2plone``, you must either stop your Plone site or
use ZEO. Otherwise ``parse2plone`` will not be able to access the
database.
Communication
-------------
Questions, comments, or concerns? Please e-mail: aclark@aclark.net.
History
-------
0.9.0 (11/03/2010)
~~~~~~~~~~~~~~~~~~
- Fix regressions introduced (or unresolved as of) 0.8.2. Thanks Derek
Broughton for the bug report(s).
- Many fixes to convert_parameter_values() method which converts
recipe parameters to arguments passed to main()
- Fix slugify feature
0.8.2 (11/02/2010)
~~~~~~~~~~~~~~~~~~
- Add rename feature
- Fix regressions introduced in 0.8.1
0.8.1 (10/29/2010)
~~~~~~~~~~~~~~~~~~
- Refactor options/parameters functionality to universally support _SETTINGS dict
- Add "slugify" feature
- Doc fixes
- Add support to optionally publish content after creation
- Add support for generic file import
0.8 (10/27/2010)
~~~~~~~~~~~~~~~~
- Support the importing of content to folders within the Plone site object
0.7 (10/25/2010)
~~~~~~~~~~~~~~~~
- Documentation fixes
0.6 (10/25/2010)
~~~~~~~~~~~~~~~~
- Support customization via recipe parameters and command line arguments
0.5 (10/22/2010)
~~~~~~~~~~~~~~~~
- Revert 'Add Plone to install_requires'
0.4 (10/22/2010)
~~~~~~~~~~~~~~~~
- Add 'Plone' to install_requires
0.3 (10/22/2010)
~~~~~~~~~~~~~~~~
- Another setuptools fix
0.2 (10/22/2010)
~~~~~~~~~~~~~~~~
- Setuptools fix
0.1 (10/21/2010)
~~~~~~~~~~~~~~~~
- Initial release
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
parse2plone-0.9.0.zip
(35.6 kB
view hashes)