Extract Information from Wikimedia Dumps
Table of Contents
- Supported Systems
- About
- Installation
- Usage Notes
- FAQs
- 1. Why is only MongoDB supported as a Database?
- 2. The parser only extracts information from the latest version of the pages. Why?
- 3. The parser only extracts the Film Infobox. Why is that so? Can you extend the support to other parts of Wikipedia articles?
- 4. Is there any plan to extend this parser to other Infobox types as well?
Supported Systems
- Windows 10
- Linux (tested on ArchLinux, Manjaro, Ubuntu)
About
Archean is a tool to process Wikipedia dumps and extract required information from them. The tool contains several files:
- wiki_downloader.py
- wiki_parser.py
- db_writer.py
- cleanup.py
wiki_downloader downloads the dumps from a Wikipedia dump directory. wiki_parser is the brain of the project; it houses the logic for processing the Wikipedia dumps and extracting information. cleanup.py cleans the extracted content into a structured form. Lastly, db_writer is an additional tool for cases where the JSON files created from the dumps need to be written into a database. The supported database is MongoDB.
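For orientation, here is a minimal sketch of the kind of write db_writer performs, using pymongo. The function name, the 'extracts' folder, and the connection parameters are illustrative assumptions, not archean's actual internal API:

```python
# Hypothetical sketch: load extracted JSON files and insert them into MongoDB.
# The names below are placeholders; archean's own internals may differ.
import json
from pathlib import Path

from pymongo import MongoClient

def write_extracts_to_mongo(extracts_dir, host, port, db, collection, user, password):
    client = MongoClient(host=host, port=int(port), username=user, password=password)
    target = client[db][collection]
    for json_file in Path(extracts_dir).glob("*.json"):
        with open(json_file, encoding="utf-8") as fh:
            data = json.load(fh)
        # Each file may hold a single document or a list of documents.
        if isinstance(data, list):
            target.insert_many(data)
        else:
            target.insert_one(data)

# Example call (placeholder connection details):
# write_extracts_to_mongo("extracts", "localhost", 27017, "wiki", "films", "USER", "PASSWORD")
```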
Installation
pip install archean
Usage Notes
The parser accepts a few parameters that can be specified when invoking the script. These are listed below:

- `archean`: Checks whether bz2 files are present in the current directory. If yes, it starts unpacking and processing them into a folder named 'extracts' in the present working directory. If not, it asks the user to specify a remote directory to download from. No DB operations are performed, since no DB details are provided.
- `archean --version`: Displays the version of the application.
- `archean --remote-dir='REMOTE'`: Downloads the bz2 files from REMOTE into the present working directory (the folder the command was executed in). The files are processed as well, and the JSON files are placed inside a sub-folder named 'extracts'.
- `archean --extraction-dir='EXTRACTION_DIR'`: Checks whether bz2 files are present in the current directory. If yes, the files are processed and the JSON files are placed in a newly created subfolder EXTRACTION_DIR in the present working directory. If not, it asks the user to specify a remote directory. No DB operations are performed, since no DB details are provided.
- `archean --remote-dir='REMOTE' --extraction-dir='EXTRACTION_DIR'`: Downloads the bz2 files from REMOTE into the present working directory (the folder the command was executed in). The files are processed as well, and the JSON files are placed in a newly created subfolder EXTRACTION_DIR in the present working directory. No DB operations are performed, since no DB details are provided.
- `archean --remote-dir='' --download-only`: Downloads the bz2 files into the present working directory (the folder the command was executed in). No processing of the files and no DB operations are performed, since the download-only argument is provided.
- `archean --download-only`: Throws an error, since no remote directory (remote-dir) is provided.
- `archean --host='HOST' --port='PORT' --db='DB' --collection='COLLECTION' --user='USER' --password='PASSWORD'`: Assumes the bz2 files to process are in the present working directory. Checks whether bz2 files are present there; if yes, it starts unpacking and processing them into an 'extracts' folder. After the JSON files are obtained, it connects to the database DB located at HOST using USER and PASSWORD, and inserts the JSON data into COLLECTION.
- `archean --host='HOST' --port='PORT' --db='DB' --collection='COLLECTION' --user='USER' --password='PASSWORD' --extraction-dir='EXTRACTION_DIR'`: Performs all of the operations above, but the JSON files are created in the folder specified in the extraction-dir argument.
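After a run, the JSON files in the extraction folder can be inspected with standard Python tooling. A small sketch, assuming the default 'extracts' folder and that each file holds either a single JSON object or a list of records (the per-file layout is an assumption):

```python
# Peek at the JSON files produced in the extraction folder.
# Assumes the default 'extracts' sub-folder; both a single object per file
# and a list of records per file are handled, since the layout is assumed.
import json
from pathlib import Path

for json_file in sorted(Path("extracts").glob("*.json")):
    with open(json_file, encoding="utf-8") as fh:
        data = json.load(fh)
    records = data if isinstance(data, list) else [data]
    print(f"{json_file.name}: {len(records)} record(s)")
    if records and isinstance(records[0], dict):
        print("  sample keys:", sorted(records[0].keys())[:5])
```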
FAQs
1. Why is only MongoDB supported as a Database?
Wikipedia is not a structured collection of information. A field extracted from one article might be missing from another. In such cases, NoSQL databases become an obvious choice for data storage, hence MongoDB was chosen.
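As a toy illustration (the field names are hypothetical), two films parsed from different articles can yield documents with different fields, and a schema-less collection stores both side by side:

```python
# Two extracted film records with different fields; a document store such as
# MongoDB keeps both in the same collection without a predefined schema.
film_a = {"name": "Film A", "director": "Jane Doe", "runtime": "120 minutes"}
film_b = {"name": "Film B", "language": "French"}  # no director or runtime in the article

# With pymongo (connection details omitted):
# collection.insert_many([film_a, film_b])
```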
2. The parser only extracts information from the latest version of the pages. Why?
Wikipedia has a lot of information. It keeps the edit history of pages in its archive, but since most projects are unlikely to need old revisions, the downloader has been kept minimal and fetches only the latest version of each page.
3. The parser only extracts the Film Infobox. Why is that so? Can you extend the support to other parts of Wikipedia articles?
Infoboxes are great summary sections of Wikipedia pages. They can answer the most common queries in a jiffy, hence wiki-parser was built to parse Infoboxes first.
The Film Infobox was chosen simply because it is easy to judge the validity of the parsed information during the development phase. Also because we all love movies ;)
4. Is there any plan to extend this parser to other Infobox types as well?
Definitely! There is so much to be done in the project: Infoboxes for books, countries, music, magazines, and much more are still waiting to be covered.
File details
Details for the file archean-3.1.0-py3-none-any.whl.
File metadata
- Download URL: archean-3.1.0-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.12 CPython/3.10.2 Linux/5.4.0-88-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 79f81ec60b3005322b14756b0b3f8f7f92738acbd40c9f80676fa9aeede10430
MD5 | 7414c8d9e035b23c82fb3a86b5aa4af1
BLAKE2b-256 | a63645778b169b0a950a73e287d2bb45e63eca7e3b80f2e5fc754ec05fee2677