Extractipedia

Extractipedia converts Wikipedia pages into plain text so that they can be indexed and used when training machine learning models.

Input file: Wikimedia XML dump file, which can be found on Wikimedia Dumps.

Output file: SQLite database file.
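
To make the input and output concrete, here is a minimal sketch of the general idea. It is not Extractipedia's actual implementation: it streams `<page>` elements from a dump with the standard library's `xml.etree.ElementTree.iterparse` and stores the title and raw text in SQLite. The function names `iter_pages` and `store_pages`, the two-column `title`/`text` layout, and the default table name are assumptions made for illustration, and the wikitext is stored as-is, without the clean-up to plain text that Extractipedia performs.

```python
import sqlite3
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that MediaWiki exports put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a dump without loading it all into RAM.

    `source` is a file path or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # free the memory held by the finished page


def store_pages(pages, database_file, table_name="new_table"):
    """Write (title, text) pairs into a SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(database_file)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table_name}" (title TEXT, text TEXT)'
        )
        conn.executemany(f'INSERT INTO "{table_name}" VALUES (?, ?)', pages)
    conn.close()
```

Calling `store_pages(iter_pages("file_name.xml"), "new_database.db")` would then process the dump one page at a time.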

Installation:

pip install extractipedia

Basic Usage:

python -m extractipedia.Extraction -f file_name.xml

[-f, --file_name] ==> File Name(str): Name of the Wikipedia Dump File (.xml)

Tuning the Script (Advanced):

python -m extractipedia.Extraction -f file_name.xml -b batch_size -d database_file.db
-t table_name -n num_workers -s [first_sentence]

[-b, --batch_size] ==> Batch Size(int): RAM usage increases with the batch size. (default = 2500)
[-d, --database_file] ==> Database File(str): Name of the SQLite database file. The script creates it for you if it does not exist. (default = 'new_database.db')
[-t, --table_name] ==> Table Name(str): Name of the table in the database above. It is created if it does not exist. (default = 'new_table')
[-n, --num_workers] ==> Number of Workers(int): Each process runs on a separate core, so the maximum number of processes equals the number of cores on your machine. It is advisable to leave at least 1 core free to give your machine breathing room. You can also pass the number of workers directly. (default = max - 2)
[-s, --first_sentence] ==> First Sentence(bool): Use the -s flag if you only need the first sentence of each page. It is memory-friendly. (default = False)
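
The trade-offs behind `-b` and `-n` can be sketched in a few lines. This is an illustration under assumptions, not the package's own code: `default_num_workers` and `batched` are hypothetical helper names, and the sketch assumes that "max" in the default means the core count reported by `os.cpu_count()`.

```python
import os
from itertools import islice


def default_num_workers(reserve=2):
    """Leave `reserve` cores free, but always use at least one worker."""
    total = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, total - reserve)


def batched(items, batch_size=2500):
    """Yield lists of up to `batch_size` items, one batch at a time."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, `list(batched(range(5), 2))` gives `[[0, 1], [2, 3], [4]]`. Only one batch is held in memory at a time, which is why a larger batch size means higher RAM usage.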

Check out your database once it is created:

Use the command below to inspect your database. The CheckDatabase module is intended only for inspection.

python -m extractipedia.CheckDatabase -f YOUR_DATABASE.db -t YOUR_TABLE -c chunk_size -r [random]

(optional) [-c, --chunk_size] ==> Chunk Size(int): Retrieves the first n items from your database. Avoid large values, which can cause memory problems. (default = 10)
(optional) [-r, --random] ==> Random(bool): Use the -r flag to retrieve n random items instead. (default = False)
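
If you prefer to query the database from your own code, a roughly equivalent lookup can be written with the standard `sqlite3` module. This is a sketch, not the CheckDatabase implementation: the function name `peek` is hypothetical, and because the table's columns are not documented here, it selects all columns and reports their names.

```python
import sqlite3


def peek(database_file, table_name, chunk_size=10, random=False):
    """Return (column names, rows) for up to `chunk_size` rows of a table."""
    # ORDER BY RANDOM() shuffles the rows before LIMIT is applied.
    order = "ORDER BY RANDOM() " if random else ""
    conn = sqlite3.connect(database_file)
    try:
        cursor = conn.execute(
            f'SELECT * FROM "{table_name}" {order}LIMIT ?', (chunk_size,)
        )
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()
```

Keeping `chunk_size` small matters here for the same reason as with the `-c` flag: `fetchall()` loads every selected row into memory.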

Getting Help:

[-h, --help] ==> It will print out the necessary information.

Potential Improvements:

Increase speed using multiprocessing. (Done!)
Progress bar. (Coming soon!)
Simplify the regex to make it more human-readable.
Get the templates and split the entire .xml into separate, desired files.
Get the tables (if there are any) and process them.

Let there be extraction!

Release history

0.0.2 (Oct 4, 2023), this version

0.0.1 (Oct 3, 2023)

Download files

Source Distribution

extractipedia-0.0.2.tar.gz (21.8 kB), uploaded Oct 4, 2023

Built Distribution

extractipedia-0.0.2-py3-none-any.whl (22.7 kB), uploaded Oct 4, 2023, Python 3

Hashes for extractipedia-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`0a6034f5ca526a6ffd2c2001d5ca76b6ab41bbf88822fe9c118739e39f0bcd8b`
MD5	`9d4c7f53a5821edbae2c049e76d6bcda`
BLAKE2b-256	`440635408c3ddf4d09fc4ae402515290b87a2101da07c2480d576417851de208`

Hashes for extractipedia-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`42464592cd41fee4719a07243d2ab2d6a0a7a4e212429c43730d9219e9ac2407`
MD5	`00775e67f607c3269cbcb777970eeada`
BLAKE2b-256	`7187716f456cbda77b48e38da529dc36b615f679d3ecc4ce86e83f96520a0d7e`