A simple tool that extracts plain-text Wikipedia pages into an SQLite database.
Project description
Extractipedia
Extractipedia converts Wikipedia pages to plain text so they can be indexed and used while training machine learning models.
Input file: a Wikimedia XML dump file, which can be found on Wikimedia Dumps.
Output file: SQLite database file.
Installation:
pip install extractipedia
Basic Usage:
python -m extractipedia.Extraction -f file_name.xml
[-f, --file_name] ==> File Name (str): name of the Wikipedia dump file (.xml)
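For example, assuming you have downloaded and decompressed a dump such as enwiki-latest-pages-articles.xml from Wikimedia Dumps (the filename here is illustrative):
python -m extractipedia.Extraction -f enwiki-latest-pages-articles.xml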
Tuning the Script (Advanced):
python -m extractipedia.Extraction -f file_name.xml -b batch_size -d database_file.db -t table_name -n num_workers -s [first_sentence]
(A complete example follows the flag descriptions below.)
[-b, --batch_size] ==> Batch Size (int): RAM usage increases as the batch size grows. (default = 2500)
[-d, --database_file] ==> Database File (str): name of the SQLite database. The script will create it for you if the file does not exist. (default = 'new_database.db')
[-t, --table_name] ==> Table Name (str): name of the table in the database above. It will be created if it does not exist. (default = 'new_table')
[-n, --num_workers] ==> Number of Workers (int): each worker process runs on a different core, so the maximum number of workers equals the number of cores your machine has. It is advisable to leave at least one core free to give your machine breathing room. You can specify the number of cores directly. (default = max - 2)
[-s, --first_sentence] ==> First Sentence (bool): if you need only the first sentence of each page, use the -s flag. It is memory-friendly. (default = False)
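Putting it all together, a hypothetical run that writes only the first sentence of each page into a custom database using four worker processes (all values here are illustrative):
python -m extractipedia.Extraction -f enwiki-latest-pages-articles.xml -b 5000 -d wiki.db -t articles -n 4 -s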
Check out your database once it is created:
You can inspect your database with the command below. The CheckDatabase module is for inspection only.
python -m extractipedia.CheckDatabase -f YOUR_DATABASE.db -t YOUR_TABLE -c chunk_size -r [random]
(optional) [-c, --chunk_size] ==> Chunk Size (int): retrieves the first n items from your database. Avoid very large numbers, or you may run into memory problems. (default = 10)
(optional) [-r, --random] ==> Random (bool): if you want to retrieve n random items, use the -r flag. (default = False)
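You can also inspect the result directly from Python with the standard-library sqlite3 module. A minimal sketch, assuming the default database and table names from above; SELECT * is used because the exact column layout is whatever the extraction script created:

import sqlite3

# Open the database produced by the extraction step (default name assumed).
conn = sqlite3.connect('new_database.db')

# SELECT * avoids assuming column names; adjust the LIMIT as needed.
for row in conn.execute('SELECT * FROM new_table LIMIT 5'):
    print(row)

conn.close()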
Getting Help:
[-h, --help] ==> Prints usage information for all of the flags above.
Potential Improvements:
- Increase speed, using multiprocessing. (Done!)
- Progress bar. (Coming soon!)
- Simplify the regex to make it more human-readable.
- Get the templates and split the entire .xml into separate, desired files.
- Get the tables (if there are any) and process them.
Let there be extraction!
Download files
Download the file for your platform.
Source Distribution: extractipedia-0.0.2.tar.gz (21.8 kB)
Built Distribution: extractipedia-0.0.2-py3-none-any.whl (22.7 kB)
File details
Details for the file extractipedia-0.0.2.tar.gz.
File metadata
- Download URL: extractipedia-0.0.2.tar.gz
- Upload date:
- Size: 21.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0a6034f5ca526a6ffd2c2001d5ca76b6ab41bbf88822fe9c118739e39f0bcd8b
MD5 | 9d4c7f53a5821edbae2c049e76d6bcda
BLAKE2b-256 | 440635408c3ddf4d09fc4ae402515290b87a2101da07c2480d576417851de208
File details
Details for the file extractipedia-0.0.2-py3-none-any.whl.
File metadata
- Download URL: extractipedia-0.0.2-py3-none-any.whl
- Upload date:
- Size: 22.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 42464592cd41fee4719a07243d2ab2d6a0a7a4e212429c43730d9219e9ac2407
MD5 | 00775e67f607c3269cbcb777970eeada
BLAKE2b-256 | 7187716f456cbda77b48e38da529dc36b615f679d3ecc4ce86e83f96520a0d7e