
A Django app for working with Wikipedia XML dumps.


Django Wikipedia Connector

The Django Wikipedia Connector is a Django app that imports a Wikipedia XML database dump into a database using Django models. It was written for the Greek Wikipedia XML dump, so it has the Greek word for "category" hardcoded. If you want to use it with a different language, please open an issue and we can parametrise that name.

Installation

Install with pip:

pip install django-wikipedia-connector

Import XML Dump

  1. Find the dump you want to work with at the Wikimedia dumps backup index. For example, for the Greek Wikipedia dump dated 2025-05-01, the file was named elwiki-20250501-pages-articles-multistream.xml.bz2 and it was 560.8 MB compressed.

  2. Extract the archive. For the same example, the extracted file was named elwiki-20250501-pages-articles-multistream.xml and it was 3 GB uncompressed.

  3. With this app installed and its migrations applied, you can import the dump with ./manage.py import_dump, followed by the file path to the XML dump, e.g.:

    ./manage.py import_dump /home/alice/elwiki-20250501-pages-articles-multistream.xml
    
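As an alternative to extracting the archive with a separate tool (step 2 above), Python's standard-library bz2 module can decompress the dump as a stream, so the 3 GB XML file is written out in chunks rather than held in memory. This is a sketch independent of the package; the file paths are illustrative:

```python
# Sketch: stream-decompress a .bz2 Wikipedia dump to disk in 1 MiB chunks.
# Not part of django-wikipedia-connector; paths are illustrative.
import bz2
import shutil


def extract_bz2(src: str, dst: str) -> None:
    """Decompress src (.bz2) to dst without loading the whole file in memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)


# Example (illustrative paths):
# extract_bz2(
#     "/home/alice/elwiki-20250501-pages-articles-multistream.xml.bz2",
#     "/home/alice/elwiki-20250501-pages-articles-multistream.xml",
# )
```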

Import Options

By default, the code will first import all Categories, and then it will import all Articles and link each Article to its Categories. Importing the data takes a lot of time, so you can skip some of these steps, provided you understand the consequences of skipping:

  • Option --skip-categories will skip importing the categories. If you already have the categories in your database from a previous import, or you don't care about categories, you can save some time.
  • Option --skip-articles will skip importing the articles. If you already have the articles in your database from a previous import, or you only care about categories, you can save some time.
  • Option --skip-categorisation will skip linking Articles to their Categories. This is the most time-consuming part of the import. If you are not interested in linking Articles to Categories, you can save a lot of time.

Caveats

The app does not yet delete pages from the database when they have been removed from a newer dump. The easiest way to work around this restriction is to manually truncate the Article, Category and ArticleCategory tables in your database prior to the import.


