Download data from Drupal using Python
Project description
This python 3 module will help you to extract data from a Drupal 7 or 8 web site. This data can be later used for archiving or for migration to another CMS, such as Wagtail.
Motivation
Drupal is an open source CMS, written in PHP. It has a long history and many followers. A large number of web sites run on Drupal and there is a substantial development community. The product is extendable, there are plenty of useful modules available to enhance functionality of Drupal web sites.
Over the years there has been a number of major releases of Drupal. At the time of writing the most recent is Drupal 8. In the past migration from one major version to another was done in-place. However, in order to move from an earlier version to 8, the user must follow a process of migration, involving creating a basic deployment of Drupal 8 first and then using additional tools to copy the data from the old web site to the new. While this should work in theory, the reality seems a bit different.
I own a few very straightforward Drupal 7 web sites. They use a number of modules from the standard repository, but no custom code. The vast majority of the content are simple stories. Nonetheless, the migration was rather ugly - a lot of content didn’t make it over. What did, looked rather puny and definitely didn’t match the look and feel of my old web sites. Blocks were missing. In short, the result needed a lot of extra work to make it right.
I decided to revert to the old Drupal 7 deployment, which was, admittedly, easy, thanks to the fact that the migration did involve a fresh instance of Drupal. This was the short term fix. Given the effort of migration, I decided to move all my web sites to Django + Wagtail, which would give me much more freedom in managing my data and, hopefully, will prove a good long term solution. To be fair, Drupal has lasted me more than 10 years, so it wasn’t a bad journey overall. But I do hate to code in PHP, so getting rid of it for good fills my heart with joy.
To achieve this move from Drupal to Django I need to extract the data from the former in some readable format. Luckily, there are simple export facilities built into the CMS. This Python package uses them in order to download the data. There is a generic API to handle each downloaded object, such as node, comment, vocabulary term etc, or to simply dump everything as JSON into files. The task of shaping this extracted data into something suitable for Django or any other target you might have in mind is entirely yours.
Usage
Python support
Only Python 3 is supported at the moment. Let Python 2 die in quiet dignity!
Drupal support
Currently Drupal 7 and 8 are supported. In theory this module could work with previous versions of Drupal and it might work with some future ones as well. Your mileage can vary.
Installation
This package is available on PyPi. Simply use pip to install (assuming you are running in a Python 3 virtual environment):
pip install drupal_download
On a Drupal web site
First, you must enable a JSON-based REST API in Drupal. Naturally, both 7 and 8 will have different ways to achieve this (evil laughter!).
Make the relevant configuration changes and take node of the APIs endpoints. Make sure you configure the desired authentication controls. At the moment, the module supports these methods:
Anonymous - no authentication is needed. This also means that anyone can access you endpoint and download the data
HTTP Basic - the standard basic authentication
Session - cookie-driven authentication, the same in fact, that you use when accessing a Drupal site interactively
NB: I very strongly recommend using HTTPS for all these communications. If you don’t have an SSL certificate yet and can’t afford one, get one for free from Letsencrypt.
Drupal 7
For Drupal 7 you should use the Services module.
Drupal 8
While the same services module has a version for Drupal 8, it doesn’t work. It looks like a port job abandoned in the middle. However, Drupal 8 includes a support for RESTful APIs in the core. Naturally, it works a bit differently from the services module. In particular, services exposes an index view of, say, nodes. This view contains only the bare bones information about each node. It also has a link to the view with the full details, which the module follows. Drupal 8 RESTful API on the other hand can export all the fields in the main view. Read more about it here. Note, that the half cooked Drupal 8 services port causes the built in RESTful API to break somehow, if enabled. Also, for faster downloads you are advised to use larger page sizes.
In a script
You can use the API from this module directly, in this manner:
def data_callback(obj): # process the data dl = Drupal7DadaDownloader("https://www.example.com/export/node.json", "john", "123", AuthType.CookieSession, data_callback) dl.load_data()
From the command line
Alternatively, there is a simple command line tool shipped with this module. You can invoke it like this:
python3 -m download_drupal --help
This will display some help information. Calling it like this:
python3 -m download_drupal -b https://www.example.com/export/$i.json --username jane --password secret --auth-type CookieSession -o example_node.json --drupal-version 7
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.