A simple tool that extracts plain-text Wikipedia pages into a SQLite database.

Extractipedia

Extractipedia extracts Wikipedia pages as plain text so they can be indexed and used for training machine learning models.

Input file: a Wikimedia XML dump file, which can be found on Wikimedia Dumps.

Output file: a SQLite database file.

Installation:

pip install extractipedia

Basic Usage:

python -m extractipedia.Extraction -f file_name.xml
[-f, --file_name] ==> File Name (str): Name of the Wikipedia dump file (.xml)
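
If you would rather run the extraction from Python than from the shell, here is a minimal sketch using the standard subprocess module; "dump.xml" is a placeholder for your actual dump file:

import subprocess
import sys

# Invoke the extractor module exactly as the command line above does.
# "dump.xml" is a placeholder; substitute your own Wikimedia dump file.
subprocess.run(
    [sys.executable, "-m", "extractipedia.Extraction", "-f", "dump.xml"],
    check=True,
)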

Tuning the Script (Advanced):

python -m extractipedia.Extraction -f file_name.xml -b batch_size -d database_file.db -t table_name -n num_workers -s [first_sentence]
[-b, --batch_size] ==> Batch Size (int): RAM usage increases as the batch size grows. (default = 2500)
[-d, --database_file] ==> Database File (str): Name of the SQLite database. The script will create it for you if the file does not exist. (default = 'new_database.db')
[-t, --table_name] ==> Table Name (str): Name of the table in the database above. It will be created if it does not exist. (default = 'new_table')
[-n, --num_workers] ==> Number of Workers (int): Each worker process runs on a different core, so the maximum number of workers equals the number of cores your machine has. It is advisable to leave at least one core free to give your machine breathing room. You can also pass the number of workers directly. (default = max - 2; see the sketch after this list)
[-s, --first_sentence] ==> First Sentence (bool): If you only need the first sentence of each page, use the -s flag. It's memory-friendly. (default = False)
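
For reference, the "max - 2" default can be reproduced with Python's standard os module. A minimal sketch of the same rule (not the package's own code):

import os

# Default worker count: all cores minus two, but never fewer than one.
# os.cpu_count() can return None, so fall back to a single worker.
num_workers = max(1, (os.cpu_count() or 1) - 2)
print(num_workers)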

Check out your database once it is created:

You can inspect your database with the command below. The CheckDatabase module is read-only and is only for inspection.

python -m extractipedia.CheckDatabase -f YOUR_DATABASE.db -t YOUR_TABLE -c chunk_size -r [random]
(optional) [-c, --chunk_size] ==> Chunk Size (int): Retrieves the first n rows from your database; don't pass a large number, or you might run into memory problems. (default = 10)
(optional) [-r, --random] ==> Random (bool): If you want to retrieve n random rows instead, use the -r flag. (default = False)
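
If you prefer plain Python over the CheckDatabase module, here is a minimal sketch using the standard sqlite3 library. The database and table names below are the extractor's defaults; the column layout is not documented here, so the queries simply select everything:

import sqlite3

# Defaults from the extractor: 'new_database.db' and 'new_table'.
con = sqlite3.connect("new_database.db")
cur = con.cursor()

# First 10 rows, mirroring the default chunk size (-c 10).
for row in cur.execute("SELECT * FROM new_table LIMIT 10"):
    print(row)

# 10 random rows, mirroring the -r flag.
for row in cur.execute("SELECT * FROM new_table ORDER BY RANDOM() LIMIT 10"):
    print(row)

con.close()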

Getting Help:

[-h, --help] ==> Prints usage information for all command-line flags.

Potential Improvements:

  • Increase speed using multiprocessing. (Done!)
  • Progress bar. (Coming soon!)
  • Simplify the regex to make it more human-readable.
  • Extract the templates and split the entire .xml into separate, user-specified files.
  • Extract the tables (if there are any) and process them.

Let there be extraction!
