
Extract Information from Wikimedia Dumps

Supported Systems

  • Windows 10
  • Linux (tested on ArchLinux, Manjaro, Ubuntu)


About

Archean is a tool for processing Wikipedia dumps and extracting the required information from them. The tool contains several files:

  1. wiki_downloader.py
  2. wiki_parser.py
  3. db_writer.py
  4. cleanup.py

wiki_downloader is used to download the dumps from a Wikipedia dump directory. wiki_parser is the brain of the project: it houses the logic for processing the Wikipedia dumps and extracting information. cleanup.py cleans the extracted content into a structured form. Lastly, db_writer is an additional tool for cases where the JSON files created from the dumps need to be written into a database. The supported database is MongoDB.
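For readers curious about what the db_writer step amounts to, the sketch below shows one way the extracted JSON could be pushed into MongoDB with pymongo. It is an illustration only: the connection values, the 'extracts' folder, and the one-document-or-list file layout are assumptions, not the module's actual implementation.

    # Minimal sketch (not the actual db_writer module): load the JSON files
    # produced by the parser and insert them into a MongoDB collection.
    import json
    from pathlib import Path

    from pymongo import MongoClient

    # Connection details mirror the CLI arguments; the values here are placeholders.
    client = MongoClient(host="localhost", port=27017, username="USER", password="PASSWORD")
    collection = client["DB"]["COLLECTION"]

    # 'extracts' is the default output folder mentioned in the Usage Notes.
    for json_file in Path("extracts").glob("*.json"):
        with json_file.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # A file may hold a single document or a list of documents.
        if isinstance(data, list):
            collection.insert_many(data)
        else:
            collection.insert_one(data)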



Installation

pip install archean
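After installation, a quick sanity check is to query the version using the --version flag described under Usage Notes:

    archean --version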


Usage Notes

The parser accepts a few parameters that can be specified during script invocation. These are listed below; example invocations follow the list.

  • archean: Checks whether bz2 files are present in the current directory. If they are, unpacking and processing start, and the output is placed in a folder named 'extracts' in the present working directory. If not, the user is asked to specify a remote directory to download from. No DB operations are performed, since no DB details are provided.

  • archean --version: Displays the version of the application.

  • archean --remote-dir='REMOTE': Downloads the bz2 files from REMOTE into the present working directory (the folder the command was executed in). The files are also processed, and the JSON files are placed inside a sub-folder named 'extracts'.

  • archean --extraction-dir='EXTRACTION_DIR': Checks whether bz2 files are present in the current directory. If they are, the files are processed and the JSON files are placed in a subfolder EXTRACTION_DIR created in the present working directory. If not, the user is asked to specify a remote directory. No DB operations are performed, since no DB details are provided.

  • archean --remote-dir='REMOTE' --extraction-dir='EXTRACTION_DIR': Downloads the bz2 files from REMOTE into the present working directory, processes them, and places the JSON files in a subfolder EXTRACTION_DIR created in the present working directory. No DB operations are performed, since no DB details are provided.

  • archean --remote-dir='REMOTE' --download-only: Downloads the bz2 files into the present working directory (the folder the command was executed in). No processing of the files and no DB operations are performed, since the download-only argument is provided.

  • archean --download-only: Throws an error, since no remote directory (remote-dir) is provided.

  • archean --host='HOST' --port='PORT' --db='DB' --collection='COLLECTION' --user='USER' --password='PASSWORD': Assumes the bz2 files to be processed are in the present working directory. Checks whether bz2 files are present there; if they are, unpacking and processing start and the output is placed in the 'extracts' folder. After the JSON files are obtained, the tool connects to the database DB located at HOST:PORT using USER and PASSWORD and inserts the JSON data into COLLECTION.

  • archean --host='HOST' --port='PORT' --db='DB' --collection='COLLECTION' --user='USER' --password='PASSWORD' --extraction-dir='EXTRACTION_DIR': Performs all the operations above, but the JSON files are created in the folder specified by the extraction-dir argument.
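For illustration, a couple of example invocations are shown below. The remote URL, folder names, database names, and credentials are placeholders; substitute the actual Wikimedia dump directory and your own MongoDB details.

    # Download dumps from a remote dump directory and write the JSON into a custom
    # folder (no DB operations, since no DB details are given):
    archean --remote-dir='https://dumps.wikimedia.org/enwiki/latest/' --extraction-dir='film_json'

    # Process bz2 files already in the current directory and insert the resulting
    # JSON documents into MongoDB:
    archean --host='localhost' --port='27017' --db='wikipedia' --collection='film_infoboxes' --user='USER' --password='PASSWORD'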



FAQs

1. Why is only MongoDB supported as a Database?

Wikipedia is not a structured collection of information. A field extracted from one article might be missing, while the same field is present in another article. In such cases, NoSQL databases become an obvious choice of data storage; hence MongoDB was chosen.


2. The parser only extracts information from the latest version of the pages. Why?

Wikipedia has a lot of information. It keeps the edit history of pages in its archive, but since most projects are unlikely to involve processing old revisions, the downloader is kept minimal and downloads only the latest version.


3. The parser only extracts the Film Infobox. Why is that? Can the support be extended to other parts of Wikipedia articles?

Infoboxes are great summary sections on Wikipedia pages. They can provide answers to the most common queries in a jiffy, so wiki_parser was built to parse Infoboxes first. The Film infobox was chosen simply because it is easy to judge the validity of the parsed information during the development phase. Also because we all love movies ;)


4. Is there any plan to extend this parser to other Infobox types as well?

Definitely! There is so much more to be done in the project. Infoboxes for books, countries, music, magazines, and many more are still to be covered.

